Issue21507
Created on 2014-05-14 12:32 by lebedov, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| set_length_hint.patch | vstinner, 2014-05-14 22:18 | review | ||
| Messages (8) | |||
|---|---|---|---|
| msg218524 - (view) | Author: Lev Givon (lebedov) | Date: 2014-05-14 12:32 | |
Not sure if this is indicative of a bug, but I noticed that a frozenset created from a set seems to occupy a different amount of memory than a frozenset created from some other iterable. I observed this behavior with Python 2.7.5 and with Python 3.4.0 on Ubuntu 14.04 x86_64: >>> from sys import getsizeof >>> x = range(100) >>> s = set(x) >>> f0 = frozenset(x) >>> f1 = frozenset(s) >>> getsizeof(s) 8424 >>> getsizeof(f0) 8424 >>> getsizeof(f1) 4328 >>> f0==f1 True Original question on StackOverflow available at https://stackoverflow.com/questions/23618259/memory-occupied-by-set-vs-frozenset-in-python-2-7 |
|||
| msg218582 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2014-05-14 22:18 | |
frozenset constructor has different implementations depending on the input type: set (or frozenset), dict or iterator. The constructor preallocates the frozenset for set and dict, but not for generic iterator and so the set may have a suboptimal size. Attached patch set_length_hint.patch optimizes also the 3rd case using operator.length_hint (PyObject_LengthHint in C). Since it is an optimization, I prefer to only apply it to Python 3.5 to limit the risk of regression. |
|||
| msg218589 - (view) | Author: Josh Rosenberg (josh.r) * ![]() |
Date: 2014-05-15 02:08 | |
I think the argument against using PyObject_LengthHint for the general iterable case is that for inputs other than sets or dicts, the assumption is that significant deduplication will occur. With your patch, if I run: myset = frozenset([0] * 1000000) it will allocate space for, if I'm reading it correctly, two million entries, and use exactly one of them. Obviously not the common case, but preallocating when you have no idea how much duplication will occur seems like a bad idea. With a set or dict, you know it's already deduplicated, so preallocation is always a good thing, but for the general case, you'll be using more memory than necessary much of the time. |
|||
| msg218593 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2014-05-15 07:03 | |
I agree with Josh's arguments. Similar idea was already proposed and rejected (issue17338). |
|||
| msg218595 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2014-05-15 07:12 | |
"I think the argument against using PyObject_LengthHint for the general iterable case is that for inputs other than sets or dicts, the assumption is that significant deduplication will occur." Oh... I'm dumb :) Sorry. Another option for frozenset only: we may adjust the internal size when the frozenset is created from an iterator. |
|||
| msg218638 - (view) | Author: Josh Rosenberg (josh.r) * ![]() |
Date: 2014-05-16 01:01 | |
Not sure how much that really helps. If I understand you correctly, it would be a memory optimization that would require a round of rehashing to use? If you wanted to make a change that got "guaranteed" better performance, you might add support for dict's dict_keys and dict_items objects (since those have the same uniqueness guarantees as a set or dict, and I could easily see someone creating sets from them to remove the link to the original dict). If you wanted to get crazy, you might add preallocation support for collections.abc abstract classes equivalent to the built-ins, specifically, Set, Mapping, KeysView and ItemsView. That's probably not a good idea though, since it ends up calling out to Python code more (slowing you down for the general case) and it's probably not as safe, since a user-defined implementation of those abstract classes might not strictly adhere to the same uniqueness or hashability contracts the built-ins provide. |
|||
| msg218654 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2014-05-16 11:25 | |
AFAIU, Lev merely posted this because he feared it might be indicative of a bug. Since it isn't a bug but the by-product of a feature, I propose to close this issue as "not a bug". Regardless, thank you for posting this report! We appreciate the concern. |
|||
| msg218754 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2014-05-18 19:47 | |
I restore the original title, my title was a mistake. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:03 | admin | set | github: 65706 |
| 2014-05-19 19:52:06 | rhettinger | set | status: open -> closed |
| 2014-05-19 19:48:52 | rhettinger | set | assignee: rhettinger |
| 2014-05-18 19:47:55 | vstinner | set | messages:
+ msg218754 title: set and frozenset constructor should use operator.length_hint to guess the size of the iterator -> memory used by frozenset created from set differs from that of frozenset created from other iterable |
| 2014-05-16 11:25:09 | pitrou | set | resolution: not a bug messages: + msg218654 |
| 2014-05-16 01:01:30 | josh.r | set | messages: + msg218638 |
| 2014-05-15 07:12:01 | vstinner | set | messages: + msg218595 |
| 2014-05-15 07:03:28 | serhiy.storchaka | set | messages: + msg218593 |
| 2014-05-15 02:08:21 | josh.r | set | nosy:
+ josh.r messages: + msg218589 |
| 2014-05-14 22:19:08 | vstinner | set | nosy:
+ pitrou, serhiy.storchaka |
| 2014-05-14 22:18:50 | vstinner | set | files:
+ set_length_hint.patch title: memory used by frozenset created from set differs from that of frozenset created from other iterable -> set and frozenset constructor should use operator.length_hint to guess the size of the iterator |
| 2014-05-14 13:14:45 | pitrou | set | nosy:
+ rhettinger |
| 2014-05-14 12:32:52 | lebedov | create | |

