Indexing HTML!
Martin Christensen
knightsofspamalot-factotum at gvdnet.dk
Sat Dec 28 17:15:47 EST 2002
More information about the Python-list mailing list
>>>>> "John" == John <johng2001 at rediffmail.com> writes:

John> I have been struggling for the past few days to get this done. I
John> have a few small document (HTML) collections, each of which will
John> be exposed on an independent simplistic intranet site (Apache on
John> Linux). I need some indexing solutions.

As I wrote yesterday in a different thread, I'm working on a keyword-based search engine to work on top of relational databases. For this purpose I have some code that you might be able to continue working from. I've built the foundation for an index such as the one you're looking for, and with a bit of hacking you should be able to make use of it, I'd guess.

The full-text indexer is prepared to handle different 'text munchers' that apply different filters to texts to prepare them for processing. Such filters might remove HTML code. It should be fairly easy to code using regular expressions, I should think.

In its entirety it probably won't be the best choice for your use because of its focus on relational databases, but what code I have I'll gladly share (after I clean it up I'm releasing it under the GPL), and in a couple of weeks I can give you a technical report to help you better grok that code. While you're at it, a paper you might want to look at is 'Inverted Files Versus Signature Files for Text Indexing' by Zobel, Moffat and Ramamohanarao (http://www.cs.arizona.edu/people/tods/accepted/1998/ZobelInverted.ps).

In its current state, my index does not handle updating of existing data at all, nor removal of data, both of which are absolutely necessary unless you want to recreate the index every time something changes (which might be realistic in your case, as you describe it).

Let me know if it sounds interesting to you. I'm sure many people would benefit from your work if you develop it in the direction that you yourself need.

Martin

-- 
Homepage: http://www.cs.auc.dk/~factotum/
GPG public key: http://www.cs.auc.dk/~factotum/gpgkey.txt
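
To make the 'text muncher' idea a bit more concrete, here is a minimal sketch of how a chain of filters could feed an inverted index. This is not Martin's code, which isn't shown in the message; the names `strip_html`, `lowercase` and `InvertedIndex` are made up for illustration, and the regular expressions are a rough approximation that will not cope with every kind of HTML (the standard library's `html.parser` would be a sturdier choice).

```python
import re
from collections import defaultdict
from html import unescape

# Rough patterns; an assumption for illustration, not a full HTML parser.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>",
                             re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z0-9]+")


def strip_html(text):
    """Muncher: drop script/style blocks and tags, decode entities."""
    text = SCRIPT_STYLE_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return unescape(text)


def lowercase(text):
    """Muncher: fold case so 'HTML' and 'html' index the same."""
    return text.lower()


class InvertedIndex:
    """Hypothetical in-memory index: word -> set of document ids."""

    def __init__(self, munchers):
        self.munchers = list(munchers)
        self.postings = defaultdict(set)

    def _munch(self, text):
        # Apply every filter in order, then split into words.
        for munch in self.munchers:
            text = munch(text)
        return WORD_RE.findall(text)

    def add(self, doc_id, text):
        for word in self._munch(text):
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return the ids of documents containing every query word."""
        words = self._munch(query)
        if not words:
            return set()
        result = set(self.postings.get(words[0], ()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


index = InvertedIndex([strip_html, lowercase])
index.add("a.html", "<h1>Indexing HTML</h1><p>Inverted files</p>")
hits = index.search("inverted files")
```

Like the index described above, this sketch has no update or removal; for a small collection the simplest approach is to build a fresh `InvertedIndex` and re-add every document whenever something changes.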