Message 252925 - Python tracker

Author Lewis Haley
Recipients Lewis Haley
Date 2015-10-13.11:07:40
Message-id <1444734460.53.0.30170524569.issue25391@psf.upfronthosting.co.za>
Content
Consider the following snippet:

import difflib

first = u'location,location,location'
for second in (
    u'location.location.location',  # two periods (no commas)
    u'location.location,location',  # period after first
    u'location,location.location',  # period after second
    u'location,location,location',  # perfect match
):
    edit_dist = difflib.SequenceMatcher(None, first, second).ratio()
    print("comparing %r vs. %r gives edit dist: %g" % (first, second, edit_dist))

I would expect the second and third tests to give the same result, but in reality:

comparing u'location,location,location' vs. u'location.location.location' gives edit dist: 0.923077
comparing u'location,location,location' vs. u'location.location,location' gives edit dist: 0.653846
comparing u'location,location,location' vs. u'location,location.location' gives edit dist: 0.961538
comparing u'location,location,location' vs. u'location,location,location' gives edit dist: 1

Python 3.4 gives the same results.

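From experimenting, it seems that when the period comes after the first "location", the longest match found pairs the first two "location"s of the first string (a[0:17]) with the final two "location"s of the second string (b[9:26]). Nothing remains to be matched on either side of that block, so the third "location" of the first string goes unmatched.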

In [31]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').ratio()
Out[31]: 0.6538461538461539

In [32]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').get_matching_blocks()
Out[32]: [Match(a=0, b=9, size=17), Match(a=26, b=26, size=0)]

In [33]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').ratio()
Out[33]: 0.9615384615384616

In [34]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').get_matching_blocks()
Out[34]: 
[Match(a=0, b=0, size=17),
 Match(a=18, b=18, size=8),
 Match(a=26, b=26, size=0)]
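
The low score can be reproduced by hand, as a minimal sketch. It assumes the documented behaviour that `ratio()` is 2.0 * M / T, where M is the number of matched characters and T is the combined length of both strings. The names `a`, `b`, `longest`, `matched` and `manual_ratio` are illustrative only.

```python
import difflib

a = 'location,location,location'
b = 'location.location,location'

sm = difflib.SequenceMatcher(None, a, b)

# The single longest common block is located first; the rest of the
# alignment is then searched only to the left and right of that block.
longest = sm.find_longest_match(0, len(a), 0, len(b))

# Total matched characters across all blocks (the final block is a
# zero-size sentinel and contributes nothing).
matched = sum(block.size for block in sm.get_matching_blocks())

# Documented formula for ratio(): 2.0 * M / T
manual_ratio = 2.0 * matched / (len(a) + len(b))

print(longest)
print(matched, manual_ratio, sm.ratio())
```

Here the longest block covers 17 characters, and since it ends at the end of `b` and starts at the beginning of `a`, no further characters can be paired, which gives 2 * 17 / 52.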

Using `quick_ratio` instead of `ratio` gives (what I consider to be) the correct result.
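
For context, `quick_ratio` is documented as an upper bound on `ratio`, and it appears to ignore character order entirely. The sketch below compares it with a hand-computed multiset overlap; the `Counter` based calculation is my own reconstruction for illustration, not the library's code, and `results` is a hypothetical name.

```python
import difflib
from collections import Counter

first = 'location,location,location'
candidates = (
    'location.location.location',  # two periods (no commas)
    'location.location,location',  # period after first
    'location,location.location',  # period after second
    'location,location,location',  # perfect match
)

results = {}
for second in candidates:
    sm = difflib.SequenceMatcher(None, first, second)
    # Characters shared by both strings, counted with multiplicity and
    # without regard to position (an assumption about what quick_ratio
    # measures, checked against its output below).
    common = sum((Counter(first) & Counter(second)).values())
    multiset_ratio = 2.0 * common / (len(first) + len(second))
    results[second] = (sm.ratio(), sm.quick_ratio(), multiset_ratio)
    print("%r: ratio=%g quick_ratio=%g multiset=%g"
          % ((second,) + results[second]))
```

Because this measure is order-insensitive, the second and third candidates score the same under `quick_ratio`, but it would also score any anagram of the string equally highly.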