Spambayes + HTTP proxy server
Paul Paterson
hamonlypaulpaterson at houston.rr.com
Sun Feb 2 16:23:36 EST 2003
More information about the Python-list mailing list
"Skip Montanaro" <skip at pobox.com> wrote in message
news:mailman.1044210485.12265.python-list at python.org...
>
> Sorry for the too quick post. In rearranging things I lost the spam return.
> Just to be sure it was actually filtering something, I searched for "sex" at
> Google. It let that page in, allowed the safersex and SEX.ETC pages
> through, but blocked HBO's Sex and the City and janesguide. Note that this
> is using my current hammie.db file, which has only been trained on my ham
> and spam email collections. I don't expect it to necessarily do a very good
> job with web pages given no training.
>
> Skip
>
> import os
>
> from proxy3_filter import *
> import proxy3_options
>
> from spambayes import hammie, Options, mboxutils
> dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)
>
> class SpambayesFilter(BufferAllFilter):
>     hammie = hammie.open(dbf, 1, 'r')
>
>     def filter(self, s):
>         if self.reply.split()[1] == '200':
>             prob = self.hammie.score("%s\r\n%s" % (self.serverheaders, s))
>             print "| prob: %.5f" % prob
>             if prob >= Options.options.spam_cutoff:
>                 print self.serverheaders
>                 print "text:", s[0:40], "...", s[-40:]
>                 return "not authorized"
>         return s
>
> from proxy3_util import *
>
> register_filter('*/*', 'text/html', SpambayesFilter)
>

This looks great - I'm giving this a go now.

I think that, as you say, the key now is to train on a corpus of web pages
rather than on spam/ham email. I notice that Spambayes has a proxy server
which can be used for easy training. I'll take a look at this and see if it
can be used to train on web pages too.
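For anyone who wants to try the block/allow decision without proxy3 or a trained hammie database, here is a minimal standalone sketch of the same logic. The names filter_page and toy_score, and the 0.9 cutoff, are placeholders invented for illustration; toy_score is a crude word-list stand-in, not the Spambayes classifier, and the function only mirrors the shape of the filter method quoted above.

```python
def filter_page(reply_line, headers, body, score, cutoff=0.9):
    """Return body unchanged, or a block message if it scores at or above cutoff.

    reply_line is the HTTP status line, e.g. "HTTP/1.0 200 OK".
    score is any callable mapping text to a probability in [0, 1];
    in the quoted filter that role is played by the hammie score method.
    """
    parts = reply_line.split()
    # Only successful replies are scored; everything else passes through.
    if len(parts) > 1 and parts[1] == "200":
        prob = score("%s\r\n%s" % (headers, body))
        if prob >= cutoff:
            return "not authorized"
    return body


def toy_score(text):
    """Hypothetical stand-in scorer: fraction of words found in a tiny blocklist."""
    blocked = {"casino", "viagra"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in blocked for w in words) / len(words)


if __name__ == "__main__":
    page = filter_page("HTTP/1.0 200 OK", "", "hello world", toy_score, cutoff=0.5)
    print(page)
```

Swapping toy_score for a scorer trained on web pages rather than email is the part that still needs doing; the pass-through and cutoff logic should not need to change.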