soundex (revisited)
Greg Jorgensen
gregj at pobox.com
Mon Dec 25 04:55:04 EST 2000
More information about the Python-list mailing list
"Daniel Klein" <DanielK at aracnet.com> wrote in message
news:Var16.263$LU6.109277 at typhoon.aracnet.com...
> After seeing the post from several days ago on soundex, I gave it whirl to
> see if I could come up with something different (and possibly better),
> following the rules laid down by Knuth:
>
> def get_soundex(name, digits = 3):
>     soundexcodes = "01230120022455012623010202"
>     #               ABCDEFGHIJKLMNOPQRSTUVWXYZ
>     instring = name.upper()
>     soundex = instring[0]
>     last = soundex
>     instring = instring[1:]
>     for char in instring:
>         if 65 <= ord(char) <= 90:
>             sx = soundexcodes[ord(char) - 65]
>             if int(sx) and char != last:
>                 soundex += sx
>             last = char
>     if len(soundex) < (digits + 1): soundex = (soundex + ("0" * digits))
>     return soundex[:digits + 1]

I see a few problems, mainly in the handling of consecutive consonants. You
are checking for consecutive characters, but the Soundex algorithm specifies
that consecutive character codes be treated as a single code. Both 'mm' and
'mn' are considered consecutive codes because both 'm' and 'n' are coded as 5.

You can (and probably should) use the isalpha() string method to check for
alpha characters, rather than the 'magic numbers' 65 through 90. Likewise
ord(char) - ord('A') is a bit more clear.

Here's a version I wrote. I'm open to any criticisms, suggestions, etc. I
compared my version to the module announced here a while back (I think mine
is a lot more readable; it is certainly shorter). I also compared it to a
Perl version I found and I think my implementation is more robust and
smaller.

def soundex(name, len=4):
    """ soundex module conforming to Knuth's algorithm
        implementation 2000-12-24 by Gregory Jorgensen
        public domain
    """

    # digits holds the soundex values for the alphabet
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''

    # translate alpha chars in name to soundex digits
    for c in name.upper():
        if c.isalpha():
            if not fc: fc = c   # remember first letter
            d = digits[ord(c)-ord('A')]
            # duplicate consecutive soundex digits are skipped
            if not sndx or (d != sndx[-1]):
                sndx += d

    # replace first digit with first alpha character
    sndx = fc + sndx[1:]

    # remove all 0s from the soundex code
    sndx = sndx.replace('0','')

    # return soundex code padded to len characters
    return (sndx + (len * '0'))[:len]

--
Greg Jorgensen
Deschooling Society
Portland, Oregon, USA
gregj at pobox.com
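
To illustrate the point about consecutive codes versus consecutive
characters, here is a minimal sketch. The helper names (code,
collapse_by_char, collapse_by_code) are made up for this example and are not
part of either version above; it assumes plain ASCII letters only and does
none of the other soundex steps (first letter, dropping 0s, padding).

```python
# Soundex digit for each letter A..Z (same table as in the post).
DIGITS = "01230120022455012623010202"


def code(ch):
    """Return the soundex digit for a single ASCII letter (assumed A-Z/a-z)."""
    return DIGITS[ord(ch.upper()) - ord("A")]


def collapse_by_char(word):
    """Hypothetical helper: skip a letter only if it repeats the previous LETTER."""
    out = []
    last = ""
    for ch in word.upper():
        if ch != last:
            out.append(code(ch))
        last = ch
    return "".join(out)


def collapse_by_code(word):
    """Hypothetical helper: skip a letter if it repeats the previous CODE."""
    out = []
    for ch in word.upper():
        d = code(ch)
        if not out or d != out[-1]:
            out.append(d)
    return "".join(out)


if __name__ == "__main__":
    for w in ("mm", "mn"):
        print(w, collapse_by_char(w), collapse_by_code(w))
```

Comparing letters collapses 'mm' but leaves 'mn' as two codes; comparing
codes collapses both, which is what the algorithm asks for.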