[Python-Dev] More compact dictionaries with faster iteration
Christian Tismer
tismer at stackless.com
Wed May 15 13:32:39 CEST 2013
More information about the Python-Dev mailing list
Wed May 15 13:32:39 CEST 2013
- Previous message: [Python-Dev] Pickling failure on Enums
- Next message: [Python-Dev] More compact dictionaries with faster iteration
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Raymond, On 08.01.13 15:49, Maciej Fijalkowski wrote: > On Mon, Dec 10, 2012 at 3:44 AM, Raymond Hettinger > <raymond.hettinger at gmail.com> wrote: >> The current memory layout for dictionaries is >> unnecessarily inefficient. It has a sparse table of >> 24-byte entries containing the hash value, key pointer, >> and value pointer. >> >> Instead, the 24-byte entries should be stored in a >> dense table referenced by a sparse table of indices. >> >> For example, the dictionary: >> >> d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'} >> >> is currently stored as: >> >> entries = [['--', '--', '--'], >> [-8522787127447073495, 'barry', 'green'], >> ['--', '--', '--'], >> ['--', '--', '--'], >> ['--', '--', '--'], >> [-9092791511155847987, 'timmy', 'red'], >> ['--', '--', '--'], >> [-6480567542315338377, 'guido', 'blue']] >> >> Instead, the data should be organized as follows: >> >> indices = [None, 1, None, None, None, 0, None, 2] >> entries = [[-9092791511155847987, 'timmy', 'red'], >> [-8522787127447073495, 'barry', 'green'], >> [-6480567542315338377, 'guido', 'blue']] >> >> Only the data layout needs to change. The hash table >> algorithms would stay the same. All of the current >> optimizations would be kept, including key-sharing >> dicts and custom lookup functions for string-only >> dicts. There is no change to the hash functions, the >> table search order, or collision statistics. >> >> The memory savings are significant (from 30% to 95% >> compression depending on the how full the table is). >> Small dicts (size 0, 1, or 2) get the most benefit. >> >> For a sparse table of size t with n entries, the sizes are: >> >> curr_size = 24 * t >> new_size = 24 * n + sizeof(index) * t >> >> In the above timmy/barry/guido example, the current >> size is 192 bytes (eight 24-byte entries) and the new >> size is 80 bytes (three 24-byte entries plus eight >> 1-byte indices). That gives 58% compression. >> >> Note, the sizeof(index) can be as small as a single >> byte for small dicts, two bytes for bigger dicts and >> up to sizeof(Py_ssize_t) for huge dict. >> >> In addition to space savings, the new memory layout >> makes iteration faster. Currently, keys(), values, and >> items() loop over the sparse table, skipping-over free >> slots in the hash table. Now, keys/values/items can >> loop directly over the dense table, using fewer memory >> accesses. >> >> Another benefit is that resizing is faster and >> touches fewer pieces of memory. Currently, every >> hash/key/value entry is moved or copied during a >> resize. In the new layout, only the indices are >> updated. For the most part, the hash/key/value entries >> never move (except for an occasional swap to fill a >> hole left by a deletion). >> >> With the reduced memory footprint, we can also expect >> better cache utilization. >> >> For those wanting to experiment with the design, >> there is a pure Python proof-of-concept here: >> >> http://code.activestate.com/recipes/578375 >> >> YMMV: Keep in mind that the above size statics assume a >> build with 64-bit Py_ssize_t and 64-bit pointers. The >> space savings percentages are a bit different on other >> builds. Also, note that in many applications, the size >> of the data dominates the size of the container (i.e. >> the weight of a bucket of water is mostly the water, >> not the bucket). >> >> >> Raymond >> _______________________________________________ >> Python-Dev mailing list >> Python-Dev at python.org >> http://mail.python.org/mailman/listinfo/python-dev >> Unsubscribe: http://mail.python.org/mailman/options/python-dev/fijall%40gmail.com > One question Raymond. > > The compression ratios stay true provided you don't overallocate entry > list. If you do overallocate you don't really gain that much (it all > depends vastly on details), or even loose in some cases. What do you > think should the strategy be? > What is the current status of this discussion? I'd like to know whether it is a considered alternative implementation. There is also a discussion in python-ideas right now where this alternative is mentioned, and I think especially for small dicts as **kwargs, it could be a cheap way to introduce order. Is this going on, somewhere? I'm quite interested on that. cheers - chris -- Christian Tismer :^) <mailto:tismer at stackless.com> Software Consulting : Have a break! Take a ride on Python's Karl-Liebknecht-Str. 121 : *Starship* http://starship.python.net/ 14482 Potsdam : PGP key -> http://pgp.uni-mainz.de phone +49 173 24 18 776 fax +49 (30) 700143-0023 PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04 whom do you want to sponsor today? http://www.stackless.com/
- Previous message: [Python-Dev] Pickling failure on Enums
- Next message: [Python-Dev] More compact dictionaries with faster iteration
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Python-Dev mailing list