Python Web Stats

See also Sisynala: http://mithrandr.moria.org/code/sisynala/ , written in Python.

I've been trying to find a good OpenSource web log analysis tool. Here are the tools I've found so far:

IMO, all of them are very inflexible and completely tied to their web interfaces (which suck for the most part). None of them provides a programmer's interface, which is what I want.

Here are the details of what I'm looking for:

  • must be OpenSource, preferably with a Python / BSD or LGPL style license rather than GPL <BR> (well, not everyone necessarily agrees with that: I consider the GPL a good thing -- Main.IanBicking <BR> I agree for applications, but I'm thinking about something that can be used as a library without dictating the license terms of the applications that use it. -- TavisRudd <BR> Of course, the GPL only affects proprietary programs that are distributed, so I don't see a big problem...? I wouldn't be happy with people making my work proprietary when I gave it to them in good faith, and that's exactly what the GPL prevents. -- IanBicking)

  • must provide a programmer's interface, not just a web interface

  • must be completely portable

  • must be easy to install and configure

  • must work with the Apache combined log format and the IIS log format

  • should be able to work with the Common Log Format

  • should be possible to write new parser modules for new log formats

  • fast parsing of log files

  • preferably written in Python, or providing a Python interface

  • provides reverse DNS lookup

  • process log files split by load balancing mechanisms or log rotation

  • reports:

    • page views

    • unique visits

    • unique human visits

    • referrers

    • authenticated users

    • robots (+ can filter them out)

    • file mime types

    • browsers

    • operating systems

    • HTTP errors

    • 404 errors

    • keywords/phrases from search engines

    • entry pages

    • exit pages

    • domains/countries

  • allows filtering by anything: IP address, mime type, domain, whatever

  • can provide the following summaries:

    • hourly, daily, weekly, monthly, yearly summary of all logged variables

    • ranked (highest-to-lowest + vice versa) listings of all logged variables

    • event-based summary periods -- e.g., you might want to reset the logs after a round of search engine submissions, which may not align with a weekly or monthly summary, but which you'd still like grouped together in a larger unit than a day.

  • can use cookie vars to trace and summarize traffic flows (i.e. userid cookies) [see mod_session]
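As a sketch of what a pluggable parser module for the combined log format might look like (the function name, regex, and field names here are hypothetical illustrations, not part of any existing package):

```python
import re

# Sketch of a parser for the Apache combined log format. The trailing
# referrer/agent group is optional, so the same regex also accepts the
# Common Log Format (which lacks those two fields).
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    m = COMBINED_RE.match(line)
    return m.groupdict() if m else None
```

A new log format (IIS, say) would then just mean a new regex and the same `parse_line`-style interface, which is what the "new parser modules" requirement above is after.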

Flexibility and extensibility matter more to me than raw performance.

Unless a package that meets these requirements exists, I'm proposing that we start a project to build one.

-- TavisRudd - 03 Nov 2001


Implementation thoughts

  • the parsing and log reading classes should be completely separate from the rest of the classes

  • some (configurable) manager to track (a) where parsing left off last time and (b) where all possible logfiles live. This should handle both a logfile growing since the last parse and logfile rotation.

  • the raw log data should be parsed and then stored in a simple DBM-style database (there may be faster formats, depending on how the data is queried -- usually I imagine it would be read sequentially based on a start/stop time)

  • all derived data should be calculated from the DBM store and in turn stored in a DBM format

  • the dates should be handled using mx.DateTime

  • all config settings should be managed using the SettingsManager API used in Cheetah
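One way the DBM-store idea above could be sketched using only the standard library (the function names and the choice of `shelve` are illustrative assumptions, not a design decision):

```python
import shelve

# Sketch: persist parsed entries keyed by epoch timestamp, so derived-data
# code can later scan a start/stop range, as suggested above.
def store_entries(path, entries):
    """Append each parsed entry to the bucket for its timestamp."""
    with shelve.open(path) as db:
        for entry in entries:
            key = str(entry['epoch'])
            bucket = db.get(key, [])
            bucket.append(entry)
            db[key] = bucket

def query_range(path, start, stop):
    """Return all stored entries whose timestamp falls in [start, stop)."""
    out = []
    with shelve.open(path) as db:
        for key in db:
            if start <= float(key) < stop:
                out.extend(db[key])
    return out
```

The derived-data classes would read through `query_range` and write their own results back to a second store, giving the caching layer mentioned below for free.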

Rough class layout:

  • SettingsManager

  • Parser & associated classes for getting raw data out of the logs

  • Storage classes for getting data in and out of the internal data store

  • Classes for calculating all the basic derived data (reverse dns lookup, etc.)

  • Classes for calculating the advanced derived data and summaries -- with some sort of caching mechanism for storing the results.

  • Interface classes for putting it all together
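A very rough skeleton of how these classes might fit together (all names here are hypothetical, and an in-memory list stands in for the DBM store):

```python
class Parser:
    """Gets raw data out of the logs; subclass per log format."""
    def parse(self, line):
        raise NotImplementedError

class Storage:
    """Gets parsed entries in and out of the internal data store."""
    def __init__(self):
        self._entries = []
    def add(self, entry):
        self._entries.append(entry)
    def all(self):
        return list(self._entries)

class StatusReport:
    """Example of a derived-data class: hit counts per HTTP status code."""
    def __init__(self, storage):
        self.storage = storage
    def by_status(self):
        counts = {}
        for entry in self.storage.all():
            counts[entry['status']] = counts.get(entry['status'], 0) + 1
        return counts
```

Keeping `Parser` and `Storage` behind small interfaces like this is what makes the "completely separate" point above cheap to enforce: a report class only ever sees the storage API, never the log format.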

-- TavisRudd - 03 Nov 2001