how to store great array?
Quinn Dunkan
quinn at bolivar.ugcs.caltech.edu
Mon Jan 29 15:13:58 EST 2001
More information about the Python-list mailing list
On Mon, 29 Jan 2001 13:56:15 GMT, paulus132 at my-deja.com <paulus132 at my-deja.com> wrote:
>For 2 million documents, I have a "Year" field (4 characters) associated
>with each of them. I am trying to store this in Python, to report on it
>against other records.
>
>Is it possible? Which is the fastest method?
>
>The easiest way is a dictionary { doc_num : year }, but how do I estimate
>the necessary memory? With old Basics, the FREE(0) builtin returns the
>amount of free memory. Is there a comparable module/class/function to
>track memory use in Python?
>
>Another way is to construct a list doc_years = ["1934", "1942", "1897",
>...], and access it with doc_years[doc_num]. Is doc_years[2000000] as
>fast as doc_years[456]?

Yes.

>Another way is to construct a single string with all the years
>concatenated, and access it with all_years[4*doc_num:4*doc_num+4]. But I
>think that a 7.6 MB length is too big! How long can a string be?

Array access is constant-time, and at the C level, arrays are accessed just like your complicated subscript. So just use normal indexing; it'll turn into that at the C / assembly level. A string can be as long as you have memory for. Even if the years are all in a similar range and Python interns the strings (effectively compressing the data), the normal reference overhead will take your memory back again.

Note that in Python terminology, a list is what everyone else calls an array. So you might have been confused into thinking Python lists are linked lists (aka cons cells), which have linear access (so doc_years[200000] would take longer than doc_years[2]). But Python's lists are arrays. Don't ask me why this confusing terminology was chosen; probably because some Dutch people doing usability tests thought 'list' sounded nicer for ABC a million years ago.

Dict access is also constant-time, as is usual for a hash table (internally, it gets turned into simple array access). But if your indices are integers, then a plain array aka list will have less overhead.
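To see the constant-time indexing for yourself, here's a minimal sketch (the names doc_years and n_docs are just illustrative, matching the original post's example):

```python
import timeit

# Build a list like the poster's doc_years; placeholder data for illustration.
n_docs = 2_000_000
doc_years = ["1934"] * n_docs

# Indexing near the front and near the back are both single C-level
# array lookups, so they cost about the same.
t_front = timeit.timeit(lambda: doc_years[456], number=100_000)
t_back = timeit.timeit(lambda: doc_years[n_docs - 1], number=100_000)
print(t_front, t_back)  # roughly the same magnitude
```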
Python experts could tell you the exact memory usage. It will be something like bytes_in_string + null + pointer + type_field + maybe_a_cached_hash + interned_flag. So with ~4 bytes of overhead for each 5-byte string, you're looking at roughly 9 * 2,000,000 / 1024 / 1024 ≈ 17 MB for 2 million elements.

Since every platform has a different way of reporting free memory (and most modern ones lie through their teeth about it), I'm not aware of any standard way to see available memory. For unix-likes, you could use the resource module to see *consumed* memory. Windows may have special functions in the win32 modules. Your best bet, unless you're working on an embedded platform, is to just try it and see what happens. If you run out of memory, try using the array module to eliminate the per-element overhead. If you're *still* out of memory, come up with a simple compression scheme so you don't lose your O(1) access and write a C module :) But we're not on Commodore 64s anymore, you know :)
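The array-module suggestion would look something like this; a sketch assuming the years all fit in an unsigned 16-bit integer (the variable names are mine, not from the original post):

```python
from array import array

n_docs = 2_000_000
# Store each year as an unsigned short ('H', typically 2 bytes) instead of
# keeping a separate Python string object per document.
doc_years = array('H', (1934 for _ in range(n_docs)))

print(doc_years.itemsize)           # bytes per element, typically 2
print(doc_years.itemsize * n_docs)  # raw data size: ~4 MB instead of ~17 MB
print(doc_years[456])               # lookup is still plain O(1) indexing
```

The tradeoff is that elements come back as Python ints rather than 4-character strings, so you'd convert with str() at display time.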