Message 252925 - Python tracker

Author	Lewis Haley
Recipients	Lewis Haley
Date	2015-10-13.11:07:40
Message-id	<1444734460.53.0.30170524569.issue25391@psf.upfronthosting.co.za>
In-reply-to

Content

Consider the following snippet:

import difflib

first = u'location,location,location'
for second in (
    u'location.location.location',  # two periods (no commas)
    u'location.location,location',  # period after first
    u'location,location.location',  # period after second
    u'location,location,location',  # perfect match
):
    edit_dist = difflib.SequenceMatcher(None, first, second).ratio()
    print("comparing %r vs. %r gives edit dist: %g" % (first, second, edit_dist))

I would expect the second and third comparisons to give the same result, since each differs from `first` by exactly one character, but in reality:

comparing u'location,location,location' vs. u'location.location.location' gives edit dist: 0.923077
comparing u'location,location,location' vs. u'location.location,location' gives edit dist: 0.653846
comparing u'location,location,location' vs. u'location,location.location' gives edit dist: 0.961538
comparing u'location,location,location' vs. u'location,location,location' gives edit dist: 1

Python 3.4 gives the same results.

From experimenting, it seems that when the period comes after the first "location", the longest match found is the first two "location"s of the first string (a=0) against the final two "location"s of the second string (b=9), which leaves nothing on either side of that block to match.

In [31]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').ratio()
Out[31]: 0.6538461538461539

In [32]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').get_matching_blocks()
Out[32]: [Match(a=0, b=9, size=17), Match(a=26, b=26, size=0)]

In [33]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').ratio()
Out[33]: 0.9615384615384616

In [34]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').get_matching_blocks()
Out[34]: 
[Match(a=0, b=0, size=17),
 Match(a=18, b=18, size=8),
 Match(a=26, b=26, size=0)]
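
To make the link between the matching blocks and the reported ratios explicit, here is a minimal sketch that recomputes `ratio()` by hand. It relies on the documented definition of `ratio()` as 2.0*M/T, where M is the number of matched elements and T is the combined length of both sequences. The helper name `ratio_from_blocks` is made up for illustration and is not part of difflib.

```python
import difflib


def ratio_from_blocks(a, b):
    """Recompute SequenceMatcher.ratio() as 2.0 * M / T from the matching blocks.

    Hypothetical helper, for illustration only.
    """
    matcher = difflib.SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the trailing sentinel has size 0).
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both sequences.
    return 2.0 * matched / (len(a) + len(b))


first = 'location,location,location'
period_after_first = 'location.location,location'
period_after_second = 'location,location.location'

for second in (period_after_first, period_after_second):
    matcher = difflib.SequenceMatcher(None, first, second)
    print(second, ratio_from_blocks(first, second), matcher.ratio())
```

With only the 17-character block matched, the first case works out to 2*17/52, which is the 0.6538 shown above. The second case matches 17 + 8 = 25 characters, giving 2*25/52, or 0.9615.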

Using `quick_ratio` instead of `ratio` gives (what I consider to be) the correct result.
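
For reference, a small sketch comparing the two methods on the problematic inputs. Note that `quick_ratio()` is documented only as an upper bound on `ratio()`. As far as I can tell it counts the characters the two strings have in common regardless of their order, so it would hide this asymmetry rather than fix the underlying matching. The `results` dict is just a name chosen for this example.

```python
import difflib

first = 'location,location,location'
candidates = (
    'location.location,location',  # period after first
    'location,location.location',  # period after second
)

results = {}
for second in candidates:
    matcher = difflib.SequenceMatcher(None, first, second)
    # Store (ratio, quick_ratio) for each candidate.
    results[second] = (matcher.ratio(), matcher.quick_ratio())
    print("%r: ratio=%g quick_ratio=%g" % ((second,) + results[second]))
```

Both candidates share 25 of their 26 characters with `first`, so `quick_ratio()` should come out to 2*25/52 for each, while `ratio()` differs between them as shown earlier.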

History
Date	User	Action	Args
2015-10-13 11:07:40	Lewis Haley	set	recipients: + Lewis Haley
2015-10-13 11:07:40	Lewis Haley	set	messageid: <1444734460.53.0.30170524569.issue25391@psf.upfronthosting.co.za>
2015-10-13 11:07:40	Lewis Haley	link	issue25391 messages
2015-10-13 11:07:40	Lewis Haley	create