Issue 31889: difflib SequenceMatcher ratio() still has unpredictable behavior


Created on 2017-10-28 10:03 by Siltaar, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (3)
msg305153 - (view) Author: Simon Descarpentries (Siltaar) Date: 2017-10-28 10:03
Hi, this is my first post here. I'm a French computer-science engineer with 10 years of experience and a manager at the Acoeuro.com SSLL company. I suggested better regexp integration on python-ideas a few months ago, but failed to convince anyone to get it done.

Despite issue https://bugs.python.org/issue25391 being closed in 2010, nothing is visible on https://docs.python.org/3/library/difflib.html to help users anticipate that a string which matches very well at 199 characters may not match at all at the 200th. This inconsistent behavior looks like a bug.

#!/usr/bin/env python3

from difflib import SequenceMatcher

a = 'ab'*400                                                                                    
b = 'ba'*400

for i in range(1, 400):
    diff_ratio = SequenceMatcher(None, a=a[:i], b=b[:i]).ratio()
    print('%3i %.2f' % (i, diff_ratio), end=' ')
    not i % 10 and print('')


At 199 characters I get a 0.99 ratio, and 0.00 at 200 characters. The results are nearly the same with strings made of email content, as in https://github.com/Siltaar/drop_alternatives, especially when comparing strings like:

"performantsetdvelopperducontenusimilairepourvosprochainsTweets.Suiveznous@TwitterBusinesspourdautresinfosetactus.TesterlesPublicitsTwitterbusiness.twitter.com|@TwitterBusiness|Aide|SedsinscrireLemailsadresse@gggTwitter,Inc.MarketStreet,SuiteSanFrancisco,CA"

"rducontenusimilairepourvosprochainsTweets.Suiveznous@TwitterBusinesspourprofiterdautresinfosetactus.TesterlesPublicitésTwitterbusiness.twitter.com@TwitterBusinessAideSedésinscrireTwitterInternationalCompanyOneCumberlandPlace,FenianStreetDublin,DAXIRELAND"

Fortunately, I didn't experience the problem using quick_ratio().

The documentation is not clear about ratio() / quick_ratio() / real_quick_ratio(), and the behavior looks unstable. Combined with this inconsistency, it seems worth a fix.
msg305530 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-11-04 01:42
Pass "autojunk=False" to your SequenceMatcher constructor and the ratio you get back will continue to increase as `i` increases.

The docs:

"""
Automatic junk heuristic: SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. This heuristic can be turned off by setting the autojunk argument to False when creating the SequenceMatcher.
"""

Note in particular the "at least 200 items long" there.  That's why you see a change in behavior at i == 200.  Before then, the autojunk heuristic is not triggered.
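To make the threshold concrete, here is a minimal sketch (the strings are illustrative, not from the original report). With 400-character inputs the heuristic fires, both 'a' and 'b' are marked "popular", and nothing is left to match; passing autojunk=False restores the expected near-1.0 ratio:

```python
from difflib import SequenceMatcher

a = 'ab' * 200  # 400 characters, so the >= 200-item condition holds
b = 'ba' * 200

# Default autojunk=True: 'a' and 'b' each account for far more than 1%
# of b, so both are treated as junk and nothing matches.
print(SequenceMatcher(None, a, b).ratio())

# Heuristic disabled: the sequences are almost identical.
print(SequenceMatcher(None, a, b, autojunk=False).ratio())
```

The same two calls with 199-character slices return nearly identical ratios, which is exactly the discontinuity the report describes.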

If we had it all to do over again, I'd make autojunk=False the default.  But - for backward compatibility - it's too late to change that now.

As to quick_ratio() and real_quick_ratio(), NOTHING about them is defined - or intended to be defined - beyond what the docs already say:

"""
quick_ratio()
Return an upper bound on ratio() relatively quickly.

real_quick_ratio()
Return an upper bound on ratio() very quickly.
"""

Nothing in the code you posted violates those promises (e.g., 0.99 is certainly an upper bound with respect to 0.0).  If they always returned 1.0, regardless of inputs, that would meet the docs' promises too.  Only the result of ratio() is defined.

As to why they exist, the docs explain that:

"""
ratio()
Return a measure of the sequences’ similarity as a float in the range [0, 1].

...

This is expensive to compute if get_matching_blocks() or get_opcodes() hasn’t already been called, in which case you may want to try quick_ratio() or real_quick_ratio() first to get an upper bound.
"""

So if the expense of calling ratio() isn't a concern in your context, there's no reason to call quick_ratio() or real_quick_ratio().  They exist so it's possible to code a cheap "early out" path in contexts where calling ratio() all the time may be prohibitively expensive.

For example, consider an app that has no use at all for a ratio < 0.5.  If quick_ratio() (or real_quick_ratio()) return something less than 0.5, then it's guaranteed that ratio() would too, since the quicker versions return upper bounds on what ratio() would return.
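That early-out pattern can be sketched as follows; the `similar_enough` helper name and the 0.5 threshold are illustrative assumptions, not part of difflib:

```python
from difflib import SequenceMatcher

def similar_enough(a, b, threshold=0.5):
    """Return True if ratio() >= threshold, skipping the expensive
    ratio() computation whenever a cheap upper bound rules it out."""
    sm = SequenceMatcher(None, a, b)
    # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
    # so a result below the threshold guarantees ratio() is below it too.
    if sm.real_quick_ratio() < threshold:
        return False
    if sm.quick_ratio() < threshold:
        return False
    return sm.ratio() >= threshold
```

The bounds are checked cheapest-first, so the full quadratic ratio() runs only on candidate pairs that survive both filters.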

This is valuable where it matters, and completely useless where it doesn't matter ;-)

In any case, since `autojunk=False` repairs the behavior you pointed out, I'm closing this as Won't-Fix.
msg305740 - (view) Author: Simon Descarpentries (Siltaar) Date: 2017-11-07 12:05
Hi,

I missed the part of the doc you pointed out, being focused on ratio()
function family.

Thanks for your gentle reply.
History
Date                 User         Action  Args
2022-04-11 14:58:53  admin        set     github: 76070
2020-01-11 01:04:50  terry.reedy  link    issue39249 superseder
2017-11-07 12:05:04  Siltaar      set     messages: + msg305740
2017-11-04 01:42:35  tim.peters   set     status: open -> closed
                                          resolution: wont fix
                                          messages: + msg305530
                                          stage: resolved
2017-11-03 20:08:26  terry.reedy  set     nosy: + tim.peters
                                          versions: + Python 3.7, - Python 3.5
2017-10-28 10:03:33  Siltaar      create