Street address parsing in Python, again.
John Nagle
nagle at animats.com
Fri Jun 4 14:28:44 EDT 2010
More information about the Python-list mailing list
I'm still struggling with street address parsing in Python.
(Previous discussion:
http://www.velocityreviews.com/forums/t720759-usable-street-address-parser-in-python.html)

I need something good enough to reliably extract street name and number.
That gives me something I can match against databases. There are several
parsers available in Perl, and various online services that have a street
name database. The online parsers are good, but I need to encode some big
databases, and the online ones are either rate-limited or expensive.

The parser at PyParsing:

http://pyparsing.wikispaces.com/file/view/streetAddressParser.py

seems to work on about 80% of addresses. Addresses with "pre-directionals"
and street types before the name seem to give the most trouble:

    487 E. Middlefield Rd.   -> streetnumber = 487, streetname = E. MIDDLEFIELD
    487 East Middlefield Road -> streetnumber = 487, streetname = EAST MIDDLEFIELD
    226 West Wayne Street    -> streetnumber = 226, streetname = WEST WAYNE

(Those are all Verisign offices)

    New Orchard Road         -> streetnumber = , streetname = NEW
    1 New Orchard Road       -> streetnumber = 1 , streetname = NEW

(IBM corporate HQ)

    390 Park Avenue          -> streetnumber =, streetname = 390

(Alcoa corporate HQ)

None of those addresses are exotic or corner cases, but they're all
mis-parsed.

There's a USPS standard on this which might be helpful.

http://pe.usps.com/text/pub28/28c2_003.html

That says "When parsing the Delivery Address Line into the individual
components, start from the right-most element of the address and work
toward the left. Place each element in the appropriate field until all
address components are isolated."

PyParsing works left to right, and trying to do look-ahead to achieve
the effect of right-to-left isn't working. It may be necessary to split
the input, reverse the tokens, and write a parser that works in reverse.

John Nagle
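
For illustration, here is a minimal sketch of the "split, reverse the tokens, parse in reverse" idea, written as plain Python rather than as a PyParsing grammar. It is an untested-at-scale sketch, not a finished parser. The names parse_street, STREET_TYPES and DIRECTIONALS are hypothetical, and the two tables are tiny samples, not the full USPS Publication 28 lists. It assumes the input is only the delivery address line (no city, state, ZIP, or unit number). The street type is peeled off the right end first, then the number and pre-directional are taken from what remains, and whatever is left over is the street name.

```python
import re

# Tiny sample tables (assumption): a real parser would load the full
# suffix and directional lists from USPS Publication 28.
STREET_TYPES = {"RD", "ROAD", "ST", "STREET", "AVE", "AVENUE",
                "BLVD", "BOULEVARD", "DR", "DRIVE", "LN", "LANE"}
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
                "NORTH", "SOUTH", "EAST", "WEST"}


def parse_street(line):
    """Parse a delivery address line, starting from the right-most token."""
    # Normalize: uppercase, drop periods and commas, split on whitespace.
    tokens = re.sub(r"[.,]", "", line.upper()).split()
    rev = tokens[::-1]  # rev[0] is the right-most token of the address
    parsed = {"number": "", "predir": "", "name": "", "type": ""}

    # Right-most token: street type, but only if something is left for a name.
    if len(rev) > 1 and rev[0] in STREET_TYPES:
        parsed["type"] = rev.pop(0)

    # Left-most token (end of the reversed list): house number such as 487 or 12B.
    if rev and re.fullmatch(r"\d+[A-Z]?", rev[-1]):
        parsed["number"] = rev.pop()

    # Next token in from the left: pre-directional, unless it is the only
    # word left, in which case it has to be the street name itself.
    if len(rev) > 1 and rev[-1] in DIRECTIONALS:
        parsed["predir"] = rev.pop()

    # Everything still unclaimed is the street name, restored to reading order.
    parsed["name"] = " ".join(reversed(rev))
    return parsed


if __name__ == "__main__":
    for addr in ("487 E. Middlefield Rd.", "226 West Wayne Street",
                 "New Orchard Road", "1 New Orchard Road", "390 Park Avenue"):
        print(addr, "->", parse_street(addr))
```

Because the type is claimed before anything else, "New" in "New Orchard Road" and "Park" in "390 Park Avenue" can no longer be mistaken for the end of the name or for a number. Names that are themselves directionals or street types (for example a hypothetical "100 West Street") are only partly covered by the length guards and would need more rules.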