Consider the following snippet:

    import difflib

    first = u'location,location,location'
    for second in (
        u'location.location.location',  # two periods (no commas)
        u'location.location,location',  # period after first
        u'location,location.location',  # period after second
        u'location,location,location',  # perfect match
    ):
        # ratio() returns a similarity score in [0, 1]; 1 means identical
        edit_dist = difflib.SequenceMatcher(None, first, second).ratio()
        print("comparing %r vs. %r gives edit dist: %g" % (first, second, edit_dist))

I would expect the second and third tests to give the same result, but in reality:

    comparing u'location,location,location' vs. u'location.location.location' gives edit dist: 0.923077
    comparing u'location,location,location' vs. u'location.location,location' gives edit dist: 0.653846
    comparing u'location,location,location' vs. u'location,location.location' gives edit dist: 0.961538
    comparing u'location,location,location' vs. u'location,location,location' gives edit dist: 1

Python 3.4 gives the same results.
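From experimenting, it seems that when the period comes after the first "location", the longest match found pairs the first two "location"s of the first string with the final two "location"s of the second string. Once that block is taken, nothing is left on either side of it to match, so the remaining "location" in each string is never counted.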

    In [31]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').ratio()
    Out[31]: 0.6538461538461539

    In [32]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').get_matching_blocks()
    Out[32]: [Match(a=0, b=9, size=17), Match(a=26, b=26, size=0)]

    In [33]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').ratio()
    Out[33]: 0.9615384615384616

    In [34]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').get_matching_blocks()
    Out[34]:
    [Match(a=0, b=0, size=17),
     Match(a=18, b=18, size=8),
     Match(a=26, b=26, size=0)]

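To double-check that the ratio follows directly from those blocks, here is a minimal sketch that recomputes it by hand. It assumes the formula given in the difflib documentation, 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. `ratio_from_blocks` is a helper name I made up for illustration; it is not part of difflib.

```python
import difflib

def ratio_from_blocks(a, b):
    # Hypothetical helper: sum the sizes of the matching blocks (M) and
    # apply the documented formula 2.0 * M / T, with T = len(a) + len(b).
    blocks = difflib.SequenceMatcher(None, a, b).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0

first = u'location,location,location'
# A single 17-character block is matched: 2 * 17 / 52
print(ratio_from_blocks(first, u'location.location,location'))
# Blocks of 17 and 8 characters are matched: 2 * 25 / 52
print(ratio_from_blocks(first, u'location,location.location'))
```

So the low score is not a rounding quirk: only 17 of the 26 characters are ever matched in the "period after first" case.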
Using `quick_ratio` instead of `ratio` gives (what I consider to be) the correct result.
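For reference, a small sketch of that comparison. As far as I can tell, `quick_ratio` is documented only as an upper bound on `ratio`, and in CPython it appears to compare character counts without regard to their order, which would explain why the position of the period makes no difference to it. The `results` list is just a name I use here to collect the scores.

```python
import difflib

first = u'location,location,location'
results = []
for second in (
    u'location.location,location',  # period after first
    u'location,location.location',  # period after second
):
    matcher = difflib.SequenceMatcher(None, first, second)
    # quick_ratio() is documented as an upper bound on ratio()
    scores = (matcher.ratio(), matcher.quick_ratio())
    results.append(scores)
    print("%r: ratio=%g quick_ratio=%g" % ((second,) + scores))
```

If the assumption about character counts holds, both strings should get the same `quick_ratio` even though their `ratio` values differ.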