Python Web Stats

See also Sisynala: http://mithrandr.moria.org/code/sisynala/ , written in Python.

I've been trying to find a good OpenSource web log analysis tool. Here are the tools I've found so far:

IMO, all of them are very inflexible and completely tied to their web interfaces (which suck for the most part). None of them provides a programmer's interface, which is what I want.

Here are the details of what I'm looking for:

  • must be OpenSource, preferably with a Python / BSD or LGPL style license rather than GPL <BR> (well, not everyone necessarily agrees with that: I consider the GPL a good thing -- Main.IanBicking <BR> I agree for applications, but I'm thinking about something that can be used as a library without dictating the license terms of the applications that use it. -- TavisRudd <BR> Of course, the GPL only affects proprietary programs that are distributed, so I don't see a big problem...? I wouldn't be happy with people making my work proprietary when I gave it to them in good faith, and that's exactly what the GPL prevents. -- IanBicking)

  • must provide a programmer's interface, not just a web interface

  • must be completely portable

  • must be easy to install and configure

  • must work with the Apache combined log format and the IIS log format

  • should be able to work with the Common Log Format

  • should be possible to write new parser modules for new log formats

  • fast parsing of log files

  • preferably written in Python, or providing a Python interface

  • provides reverse DNS lookup

  • process log files split by load balancing mechanisms or log rotation

  • reports:

    • page views

    • unique visits

    • unique human visits

    • referrers

    • authenticated users

    • robots (+ can filter them out)

    • file mime types

    • browsers

    • operating systems

    • HTTP errors

    • 404 errors

    • keywords/phrases from search engines

    • entry pages

    • exit pages

    • domains/countries

  • allows filtering by anything: IP address, mime type, domain, whatever

  • can provide the following summaries:

    • hourly, daily, weekly, monthly, yearly summary of all logged variables

    • ranked (highest-to-lowest + vice versa) listings of all logged variables

    • event-based summary periods -- e.g., you might want to reset the logs after a round of search engine submissions, which may not align with a weekly or monthly summary, but which you'd still like grouped together in a larger unit than a day.

  • can use cookie vars to trace and summarize traffic flows (i.e. userid cookies) [see mod_session]
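As a sketch of what a pluggable parser module for the combined log format might look like (the function name, regex, and field names here are hypothetical illustrations, not part of any existing package):

```python
import re

# Sketch of a parser for the Apache combined log format. The trailing
# referrer/agent group is optional, so the same regex also accepts the
# Common Log Format (which lacks those two fields).
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    m = COMBINED_RE.match(line)
    return m.groupdict() if m else None
```

A new log format (IIS, say) would then just mean a new regex and the same `parse_line`-style interface, which is what the "new parser modules" requirement above is after.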

Flexibility and extensibility matter more to me than raw performance.

Unless a package that meets these requirements exists, I'm proposing that we start a project to build one.

-- TavisRudd - 03 Nov 2001


Implementation thoughts

  • the parsing and log reading classes should be completely separate from the rest of the classes

  • some (configurable) manager to track (a) where parsing left off last time and (b) where all possible logfiles live. This should handle both a logfile growing since the last parse and logfile rotation.

  • the raw log data should be parsed and then stored in a simple DBM-style database (there may be faster formats, depending on how the data is queried -- usually I imagine it would be read sequentially based on a start/stop time)

  • all derived data should be calculated from the DBM store and in turn stored in a DBM format

  • the dates should be handled using mx.DateTime

  • all config settings should be managed using the SettingsManager API used in Cheetah
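One way the DBM-store idea above could be sketched using only the standard library (the function names and the choice of `shelve` are illustrative assumptions, not a design decision):

```python
import shelve

# Sketch: persist parsed entries keyed by epoch timestamp, so derived-data
# code can later scan a start/stop range, as suggested above.
def store_entries(path, entries):
    """Append each parsed entry to the bucket for its timestamp."""
    with shelve.open(path) as db:
        for entry in entries:
            key = str(entry['epoch'])
            bucket = db.get(key, [])
            bucket.append(entry)
            db[key] = bucket

def query_range(path, start, stop):
    """Return all stored entries whose timestamp falls in [start, stop)."""
    out = []
    with shelve.open(path) as db:
        for key in db:
            if start <= float(key) < stop:
                out.extend(db[key])
    return out
```

The derived-data classes would read through `query_range` and write their own results back to a second store, giving the caching layer mentioned below for free.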

Rough class layout:

  • SettingsManager

  • Parser & associated classes for getting raw data out of the logs

  • Storage classes for getting data in and out of the internal data store

  • Classes for calculating all the basic derived data (reverse dns lookup, etc.)

  • Classes for calculating the advanced derived data and summaries -- with some sort of caching mechanism for storing the results.

  • Interface classes for putting it all together
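A very rough skeleton of how these classes might fit together (all names here are hypothetical, and an in-memory list stands in for the DBM store):

```python
class Parser:
    """Gets raw data out of the logs; subclass per log format."""
    def parse(self, line):
        raise NotImplementedError

class Storage:
    """Gets parsed entries in and out of the internal data store."""
    def __init__(self):
        self._entries = []
    def add(self, entry):
        self._entries.append(entry)
    def all(self):
        return list(self._entries)

class StatusReport:
    """Example of a derived-data class: hit counts per HTTP status code."""
    def __init__(self, storage):
        self.storage = storage
    def by_status(self):
        counts = {}
        for entry in self.storage.all():
            counts[entry['status']] = counts.get(entry['status'], 0) + 1
        return counts
```

Keeping `Parser` and `Storage` behind small interfaces like this is what makes the "completely separate" point above cheap to enforce: a report class only ever sees the storage API, never the log format.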

-- TavisRudd - 03 Nov 2001