space-efficient top-N algorithm
Terry Reedy
tjreedy at udel.edu
Mon Feb 10 10:20:01 EST 2003
> David Garamond <lists at zara.6.isreserved.com> wrote in message
> news:<mailman.1044791875.2092.python-list at python.org>...
>
> > I am processing referrer log files with Python. The log files are in
> > ASCII text and line-based, similar to a webserver log but produced by
> > our banner management (web bug, more precisely) software. Each line
> > contains, among other things, a URL, a timestamp, and an IP address.
> > One of the desired reports is the list of the top 50 recurring URLs
> > within a certain period (e.g. monthly or weekly).
> >
> > What I currently do is just read the log files line by line, parse
> > each line into its elements, and store the URL string as the key of a
> > dictionary and the number of occurrences as the value. In the end I
> > just go back to the dictionary and print the keys in descending order
> > of the values.

The distribution of counts should tend to be fairly consistent from
period to period: a few URLs with high counts, more with medium counts,
and lots with low counts, down to one. With a bit of experience, you
should have a pretty good idea of how many counts are required to make
the top 50.

So I would consider scanning the dict with iteritems to extract
candidates, aiming for, say, the top 100; that is, keep items with
counts above a threshold. If you get too few, or start to get too many,
restart with an adjusted threshold. Then sort that reduced list.

Terry J. Reedy
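[Editor's note: a minimal sketch of both steps discussed above, written in modern Python 3 (`dict.items()` rather than `iteritems`). The log-line format and all names (`log_lines`, `top_n_by_threshold`, the 4x slack on the candidate pool) are illustrative assumptions, not taken from the thread.]

```python
# Step 1 (the original approach): count URL occurrences in a dict.
# The log-line format here is made up; real logs have more fields.
log_lines = [
    "http://example.com/a 2003-02-10T10:00 1.2.3.4",
    "http://example.com/b 2003-02-10T10:01 1.2.3.5",
    "http://example.com/a 2003-02-10T10:02 1.2.3.6",
]
counts = {}
for line in log_lines:
    url = line.split()[0]            # assume the URL is the first field
    counts[url] = counts.get(url, 0) + 1

# Step 2 (the suggested refinement): rather than sorting the whole
# dict, scan it and keep only URLs whose count clears a threshold.
# If too few survive, lower the threshold and rescan; only the small
# survivor list is ever sorted.
def top_n_by_threshold(counts, n=50, threshold=10):
    while True:
        candidates = [(url, c) for url, c in counts.items() if c >= threshold]
        if len(candidates) >= n or threshold <= 1:
            break
        threshold -= 1               # too few candidates: loosen and rescan
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:n]

print(top_n_by_threshold(counts, n=2))
# → [('http://example.com/a', 2), ('http://example.com/b', 1)]
```

(In later Pythons, `heapq.nlargest(50, counts.items(), key=...)` gets the same result in one pass over the dict.)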
More information about the Python-list mailing list