Help beautify ugly heuristic code
Carlos Ribeiro
carribeiro at gmail.com
Wed Dec 8 19:13:57 EST 2004
More information about the Python-list mailing list
On 8 Dec 2004 15:39:15 -0800, Lonnie Princehouse <finite.automaton at gmail.com> wrote:
> Regular expressions.
>
> It takes a while to craft the expressions, but this will be more
> elegant, more extensible, and considerably faster to compute (matching
> compiled re's is fast).

I think that this problem is probably a little bit harder. As the OP
noted, each ISP uses a different notation. I think that a better
solution is to use a statistical approach, possibly using a custom
Bayesian filter that could "learn" a little bit about some patterns.
The basic idea is as follows:

-- break the URL into pieces, splitting not only on the dots, but also
on hyphens and underscores in the name.

-- classify each piece, using REs to identify common patterns: frequent
strings (com, gov, net, org); normal words (sequences of letters);
normal numbers; combinations of numbers & letters. Common substrings
can also be identified (such as isp, in the middle of one of the
strings).

-- check these pieces against the Bayesian filter, pretty much as it's
done for spam.

I think that this approach is promising. It relies on the fact that
real servers usually do not have numbers in their names; however, exact
identification of such names, either by a literal match or by a regular
expression, is very difficult. I'm willing to try it, but first, more
data is needed.

-- 
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: carribeiro at gmail.com
mail: carribeiro at yahoo.com
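
The first two steps described in the message above (splitting the name and classifying each piece) could be sketched roughly as follows. This is only an illustration, not code from the thread: the function names, the category labels, the "sub:" prefix and the sample host names are all made up for the example, and the resulting labels are merely the kind of tokens one might later feed to a Bayesian filter.

```python
import re

# Assumed category tables; only the four frequent strings and "isp"
# come from the message, everything else here is illustrative.
COMMON = {"com", "gov", "net", "org"}
SUBSTRINGS = ("isp",)
PATTERNS = [
    ("number", re.compile(r"[0-9]+")),    # normal numbers
    ("word", re.compile(r"[a-z]+")),      # sequences of letters
    ("mixed", re.compile(r"[a-z0-9]+")),  # numbers & letters combined
]

def split_name(hostname):
    """Break a name into pieces on dots, hyphens and underscores."""
    return [p for p in re.split(r"[._-]", hostname.lower()) if p]

def classify(piece):
    """Return a label for one piece; the first matching rule wins."""
    if piece in COMMON:
        return "common:" + piece
    for label, pattern in PATTERNS:
        if pattern.fullmatch(piece):
            return label
    return "other"

def features(hostname):
    """Return the list of labels for all pieces of a name."""
    feats = []
    for piece in split_name(hostname):
        feats.append(classify(piece))
        # Flag known substrings found inside a longer piece.
        feats.extend("sub:" + s for s in SUBSTRINGS
                     if s in piece and s != piece)
    return feats

# Hypothetical names, chosen only to exercise each rule.
print(features("adsl-10-20-30.myisp.example.net"))
print(features("mail.example.org"))
```

The labels, rather than the raw pieces, would be what the filter learns from, so that a name full of "number" tokens can score differently from one made only of words, whatever notation a particular ISP happens to use.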