Most Effective Way to Build Up a Histogram of Words?
Alex Martelli
aleaxit at yahoo.com
Thu Oct 12 16:31:43 EDT 2000
"Stephen Kloder" <stephenk at cc.gatech.edu> wrote in message news:39E605AE.495A8406 at cc.gatech.edu... > June Kim wrote: > > > > > and then how could I sort the dictionary according to the > > frequency order? > > > > Rather than creating a list-of-tuples data structure, you might want to use > Python's built-in sorting features: > words=histogram.keys() > words.sort(lambda x,y:cmp(histogram[y],histogram[x])) > for w in words: > print "%s : %d"%(w,histogram[w]) If you have a *small* number of words, this is OK. But, if the words are many, then its speed is a big problem: the lambda is called O(N log N) times, a large extra effort compared to using Python's built-in comparison. Consider a simple case, far smaller than what the original poster mentioned as the typical size for the files to be histogrammed. Preparing the histogram and checking the sizes of things...: >>> filedata=open('c:/fp.htm').read() >>> len(filedata) 502953 >>> words=filedata.split() >>> len(words) 17809 >>> hist={} >>> for word in words: hist[word] = 1 + hist.get(word, 0) >>> len(hist) 6358 Now, define the sorting functions based on the two different approaches: >>> def sort1(hist): sorted = [ (count, word) for word, count in hist.items() ] sorted.sort() return sorted >>> def sort2(hist): words = hist.keys() words.sort(lambda x,y, h=hist: cmp(h[x], h[y])) return words Finally, measure performance. profile is somewhat uncalibrated in favour of the version with the least number of calls, so it exaggerates things, but...: >>> import profile >>> profile.run("hh=sort1(hist)") 3 function calls in 0.664 CPU seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 0.360 0.360 0.360 0.360 <pyshell#18>:1(sort1) 1 0.000 0.000 0.360 0.360 <string>:1(?) 1 0.304 0.304 0.664 0.664 profile:0(hh=sort1(hist)) 0 0.000 0.000 profile:0(profiler) >>> profile.run("hh=sort2(hist)") 60538 function calls in 7.448 CPU seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 3.666 3.666 7.414 7.414 <pyshell#23>:1(sort2) 60535 3.748 0.000 3.748 0.000 <pyshell#23>:3(<lambda>) 1 0.029 0.029 7.443 7.443 <string>:1(?) 1 0.005 0.005 7.448 7.448 profile:0(hh=sort2(hist)) 0 0.000 0.000 profile:0(profiler) >>> Although not actually as bad as one order of magnitude when sorting a few thousand entries, the difference is already quite perceptible -- and when the entries grow in numbers, the ratio favours the "sort *without* an explicit comparison function by more and more. I think this is called a "schwartzian transform" in the Perl world, but the approach is just as valid in Python: for performance, when sorting a largish number of entries, do not use the feature of sort that lets you pass an explicit comparison function (which will get called O(N log N) times); rather, prepare a list of tuples (or other linear structure) suitable for natural (lexicographical) sorting, sort it "bare", and unravel it again if need be. Preparation and unraveling are O(N) tasks, and the part that remains O(N log N), the actual sort, runs as fast as possible by this approach. Alex