Efficient posting-list
Terry Reedy
tjreedy at udel.edu
Mon Jun 3 09:37:03 EDT 2002
More information about the Python-list mailing list
Mon Jun 3 09:37:03 EDT 2002
- Previous message (by thread): Efficient posting-list
- Next message (by thread): Efficient posting-list
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
"Shagshag13" <shagshag13 at yahoo.fr> wrote in message news:adffdk$10mep9$1 at ID-146704.news.dfncis.de... > A posting list is something like that (hope i'm not too wrong) : > > [key_0] -> ... > ... > [key_i] -> [id_j, informations] -> [id_k, informations] -> ... > ... > [key_n] -> ... > > where in information retrieval : > - key are often words, > - id are document id (so we have key_i in document id_j and in document id_k and ...) > - informations are often in document frequency, but might be more... > > I'm looking for an efficient way of implementing this in full python, by now i use a dict > for keys and a python list containing node objects for [id_j, informations]. > But i have two troubles this is very slow to populate, and too memory consuming do you > have an idea for optimizing this ? > > (keep in mind that : there are more than 500,000 text keys ; id are numbered from 1 to > 150,000 ; length of a posting list might be from 1 to 500,000...) For English text, one would almost certainly reduce the number of text keys by deleting some suffixes, so that 'statistic', 'statistics', 'statistical', 'statistically', and maybe 'statistician' would be one key. But I have no idea of word structure of non-Indo-European languages and what one should do to reduce key number. By 'length of a posting list' do you mean number of text keys? Ie, size of key dict? The length of list attached to each key is at most number of documents. You do not specify 'node object'. This is crucial since these take up most of memory. (This a good reason not to index the 2000 or so most common keys.) For pure Python, tuple is most space efficient. If each node really is (integer id, integer count), one could devise a system using a pair of array (module) objects or 2-dimensional numerical python arrays; which store actual integers rather than pointers to int PyObjects. When filled, they would have to be copied to larger arrays in much the same manner as done automatically by Python lists. Terry J. Reedy
- Previous message (by thread): Efficient posting-list
- Next message (by thread): Efficient posting-list
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Python-list mailing list