Issue26314
Created on 2016-02-08 20:42 by gregory.p.smith, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| interned_set_27.diff | gregory.p.smith, 2016-02-08 23:52 | review | ||
| interned_set_27-gps02.diff | gregory.p.smith, 2016-03-11 23:19 | review | ||
| Messages (12) | |||
|---|---|---|---|
| msg259885 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2016-02-08 20:42 | |
The implementation of string interning uses a dict [1]. It would consume less memory and be a bit simpler if it used a set. Identifier strings in a program are interned. If you have a large program with a lot of code, this makes for a large dictionary. Experimenting with changing this to use a set on 2.7 found ~22k savings on an interactive interpreter startup. Measuring it on a huge application showed a few hundred k saved. [1]: https://hg.python.org/cpython/file/3.5/Objects/unicodeobject.c#l1579 |
|||
| msg259899 - (view) | Author: Josh Rosenberg (josh.r) * ![]() |
Date: 2016-02-08 23:39 | |
Presumably it would involve using private set APIs to make this work, right? Since normally you can't look up the actual value in a set, just check for existence? |
|||
| msg259900 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2016-02-08 23:52 | |
Here's an example patch against 2.7 by nnorwitz that we're currently testing. |
|||
| msg260052 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2016-02-10 22:38 | |
Why did you remove _Py_ReleaseInternedStrings()? It looks like a big memory leak when Python is embedded. |
|||
| msg260063 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2016-02-11 00:52 | |
Because it was only called from within an "#ifdef __INSURE__" which we weren't using. I called it an "example" patch for a reason. Updating that function to deal with the set instead of dict seems wise. Ironically... a few days after we did this we may have just found reason to put _Py_ReleaseInternedStrings() back and use it when compiled using clang sanitizers. (untested) |
|||
| msg261614 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2016-03-11 23:19 | |
updated patch: Don't Py_CLEAR the dummy sentinel value in PySet_Fini. The interned set uses it. Otherwise it causes problems when re-initializing after a Fini as happens in processes the setup and tear down an embedded interpreter multiple times. |
|||
| msg261634 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2016-03-12 05:26 | |
Dicts are hard optimized for string keys due to using them for globals and attributes lookup. Sets are optimized too, but rather for accident. We can be sure that dicts will be always optimized, but set optimization can be eliminated for the sake of simplification or if it will be proven that special casing strings has enough large negative effect on non-string items. |
|||
| msg261674 - (view) | Author: Raymond Hettinger (rhettinger) * ![]() |
Date: 2016-03-13 06:40 | |
I agree with Serhiy that this should not be done. IIRC, there were some discussions on python-dev or python-ideas about using sets for interning and the judgment was the dicts are conceptually the right type and that using sets would be hack. The other thought was that while a set is generally two-thirds the size of dict, the bulk of the space is for the interned strings themselves. |
|||
| msg261676 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2016-03-13 07:02 | |
The space for the strings is a fixed cost, the structure used to store them for efficient lookup is the only overhead that can be trimmed and is all in one contiguous allocation. regardless, i agree, this isn't a large savings. priority low, feel free to drop if it you want. |
|||
| msg261680 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2016-03-13 07:21 | |
Since interned strings table can only grow and contains exact strings, other data structure may be more appropriate (for example Modules/hashtable.c). But for now status quo is good to me. |
|||
| msg261691 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2016-03-13 10:03 | |
Serhiy Storchaka added the comment: > > Since interned strings table can only grow and contains exact strings, > other data structure may be more appropriate (for example > Modules/hashtable.c). > FYI this module is not well optimized. I took code and then adapted it for my needs in tracemalloc. If you want to use it outside, you should check again that parameters like used buckets/total buckets ratio are well chosen. |
|||
| msg261699 - (view) | Author: Raymond Hettinger (rhettinger) * ![]() |
Date: 2016-03-13 15:48 | |
GPS: priority low, feel free to drop if it you want. SS: But for now status quo is good to me. Marking as closed. Feel free to open a separate tracker item if you want to pursue the use of Modules/hashtable.c. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:27 | admin | set | github: 70502 |
| 2016-03-13 15:48:39 | rhettinger | set | status: open -> closed resolution: rejected messages: + msg261699 |
| 2016-03-13 10:03:21 | vstinner | set | messages: + msg261691 |
| 2016-03-13 07:21:42 | serhiy.storchaka | set | messages: + msg261680 |
| 2016-03-13 07:02:52 | gregory.p.smith | set | messages: + msg261676 |
| 2016-03-13 06:40:20 | rhettinger | set | assignee: rhettinger messages: + msg261674 |
| 2016-03-12 05:26:28 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg261634 |
| 2016-03-11 23:19:19 | gregory.p.smith | set | files:
+ interned_set_27-gps02.diff messages: + msg261614 |
| 2016-02-11 00:52:13 | gregory.p.smith | set | messages: + msg260063 |
| 2016-02-10 22:38:46 | vstinner | set | nosy:
+ vstinner messages: + msg260052 |
| 2016-02-08 23:52:01 | gregory.p.smith | set | keywords:
+ needs review, patch files: + interned_set_27.diff messages: + msg259900 |
| 2016-02-08 23:39:31 | josh.r | set | nosy:
+ josh.r messages: + msg259899 |
| 2016-02-08 21:23:04 | maciej.szulik | set | nosy:
+ maciej.szulik |
| 2016-02-08 20:49:52 | serhiy.storchaka | set | nosy:
+ rhettinger |
| 2016-02-08 20:42:46 | gregory.p.smith | create | |

