Issue36694
Created on 2019-04-21 17:11 by Ellenbogen, last changed 2022-04-11 14:59 by admin.
Files

| File name | Uploaded | Description | Edit |
|---|---|---|---|
| load.py | Ellenbogen, 2019-04-21 17:11 | | |
| common.py | Ellenbogen, 2019-04-21 17:11 | | |
| dump.py | Ellenbogen, 2019-04-21 18:58 | | |
Pull Requests

| URL | Status | Linked | Edit |
|---|---|---|---|
| PR 13036 | open | serhiy.storchaka, 2019-05-01 12:28 | |
Messages (8)

**msg340615** - Author: Paul Ellenbogen (Ellenbogen) - Date: 2019-04-21 17:11

Python encounters significant memory fragmentation when unpickling many small objects. I have attached two scripts that I believe demonstrate the issue. When you run "dump.py", it generates a large list of namedtuples, then writes that list to a file using pickle. Before exiting, it pauses for user input so you can view the memory usage in htop or whatever your preferred method is. The "load.py" script loads the file written by dump.py; after loading completes, it waits for user input. At that point, memory usage is (more than) twice as high in the "load" case as in the "dump" case.

The small objects in the list have 3 values, and I have tested three alternative representations: tuple, namedtuple, and a custom class. The namedtuple and the custom class both show the memory use/fragmentation issue; the built-in tuple type does not. Using optimize() from pickletools doesn't seem to make a difference.

Matthew Cowles from the python-help list had some good suggestions, and found that the object sizes themselves, as observed by sys.getsizeof, were different before and after pickling. Perhaps this is something other than memory fragmentation, or something in addition to it. Although the high-water mark is similar for both scripts, the pickling script settles down to a noticeably smaller memory footprint. I would still consider the long-run memory waste of unpickling a bug: in my use case, I run one instance of the equivalent of the pickling script, then many, many instances of the script that unpickles.

These scripts were run with Python 3.6.7 (GCC 8.2.0) on Ubuntu 18.10.
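The attached scripts are not reproduced inline. A minimal sketch of the reproduction described above (the class name, field names, and list size are assumptions, not the actual attachment contents) might look like:

```python
# Hypothetical sketch of the dump.py / load.py reproduction described
# above. The real attachments are not shown here, so names and sizes
# are assumptions.
import collections
import pickle
import random

Point = collections.namedtuple("Point", ["x", "y", "z"])


def dump(path, n=1_000_000):
    # Build a large list of small 3-field objects, then pickle it.
    data = [Point(random.random(), random.random(), random.random())
            for _ in range(n)]
    with open(path, "wb") as f:
        pickle.dump(data, f)


def load(path):
    # Unpickling this list is where the extra resident memory shows up.
    with open(path, "rb") as f:
        return pickle.load(f)
```

Running `dump()` and `load()` in separate processes and comparing resident memory in htop reproduces the reported difference.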
**msg340616** - Author: Serhiy Storchaka (serhiy.storchaka) - Date: 2019-04-21 17:32

The difference is because in the first case all floats are the same float object 0.0, but in the second case they are different objects. For a more realistic comparison, use different floats (for example, random()).
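The sharing Serhiy describes can be observed directly; a small illustration (not taken from the attachments):

```python
import random

# A repeated literal like 0.0 is a single shared float constant within
# one code object, so a list full of 0.0 costs almost no extra float
# storage: every element references the same object.
zeros = [0.0 for _ in range(1000)]
assert all(x is zeros[0] for x in zeros)

# random() returns a distinct float object per call, which is closer to
# real workloads (and to what unpickling produces, since pickle does not
# memoize floats).
rands = [random.random() for _ in range(1000)]
assert len({id(x) for x in rands}) == len(rands)
```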
**msg340617** - Author: Paul Ellenbogen (Ellenbogen) - Date: 2019-04-21 18:58

Good point. I have created a new version of dump.py that uses random() instead. Float reuse explains the getsizeof difference, but there is still a significant memory usage difference. This makes sense to me, because the original code in which I saw this issue is more analogous to random().
**msg341179** - Author: Inada Naoki (methane) - Date: 2019-05-01 04:41

The memory allocation pattern is:

```
alloc 24   # float
alloc 24
alloc 24
alloc 64   # temporary tuple
alloc 72
<repeat>
free 64    # free temporary tuples
free 64
free 64
<repeat>
```

This causes some sort of fragmentation. Some pools in arenas are unused, which prevents pymalloc from returning arenas to the OS. (Note that pymalloc manages memory as arena (256 KiB) > pool (4 KiB) > blocks (requested sizes <= 512 bytes). pymalloc can return memory to the OS only when an arena is clean.)

But this is not too bad, because many pools are free. Any allocation with size < 512 can reuse the free pools. If you run some code after unpickling, the pools will be reused efficiently. (In a case of very bad fragmentation, many pools are dirty: some blocks in a pool are used while many blocks in the pool are free, so only allocation requests of the same size class can use the pool.)

There are two approaches to fix this problem:

1. Investigate why the temporary tuples are not freed until the last stage of unpickling.
2. When there are too many free pools, return some by MADV_FREE or MADV_DONTNEED.

I think (1) should be considered first, but (2) is the more general solution.
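The arena and pool statistics described here can be inspected from Python itself; a sketch (CPython-only, the report is printed to stderr):

```python
import sys

# CPython-only: print pymalloc arena/pool/block statistics to stderr.
# Compare the "arenas allocated current" and "unused pools" lines
# before and after freeing many small objects.
objs = [(float(i), float(i), float(i)) for i in range(100_000)]
sys._debugmallocstats()

del objs
sys._debugmallocstats()
```

This is the same report Inada-san quotes in msg341251 below for the dump.py/load.py scripts.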
**msg341181** - Author: Inada Naoki (methane) - Date: 2019-05-01 07:35

I confirmed this fragmentation is caused by the memo in Unpickler. The Pickler memoizes "reduce"-ed tuples even though they are just temporary objects. I am not sure that this behavior is good.
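The memoization Inada-san refers to is visible in the pickle opcode stream; a small check (the `Point` namedtuple is a hypothetical stand-in, and this assumes pickle protocol 2 or higher, where memoization uses MEMOIZE/PUT opcodes):

```python
import collections
import pickle
import pickletools

# Hypothetical small object, standing in for the attached scripts' data.
Point = collections.namedtuple("Point", ["x", "y", "z"])

# The stream for a namedtuple contains memo opcodes for the argument
# tuple that is only a temporary object on the unpickling side.
data = pickle.dumps([Point(1.0, 2.0, 3.0)])
ops = [op.name for op, arg, pos in pickletools.genops(data)]
assert any(name in ("MEMOIZE", "PUT", "BINPUT") for name in ops)
```

`pickletools.dis(data)` prints the full annotated stream if you want to see where each memo slot is assigned.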
**msg341193** - Author: Serhiy Storchaka (serhiy.storchaka) - Date: 2019-05-01 12:30

PR 13036 makes the C implementation no longer memoize temporary objects. This decreases memory fragmentation and peak memory consumption during pickling and unpickling.
**msg341251** - Author: Inada Naoki (methane) - Date: 2019-05-02 06:26

I'm using a 1/10-scale version of dump.py. I removed total_size() because it creates some garbage.

sys._debugmallocstats() after load.py:

master:

```
# arenas allocated total = 1,223
# arenas reclaimed = 5
# arenas highwater mark = 1,218
# arenas allocated current = 1,218
1218 arenas * 262144 bytes/arena = 319,291,392
# bytes in allocated blocks = 218,026,128
# bytes in available blocks = 149,024
23835 unused pools * 4096 bytes = 97,628,160
```

PR 13036:

```
# arenas allocated total = 849
# arenas reclaimed = 3
# arenas highwater mark = 846
# arenas allocated current = 846
846 arenas * 262144 bytes/arena = 221,773,824
# bytes in allocated blocks = 217,897,968
# bytes in available blocks = 140,096
61 unused pools * 4096 bytes = 249,856
```

Now "arenas allocated current" is the same as after dump.py:

```
# arenas allocated total = 847
# arenas reclaimed = 1
# arenas highwater mark = 846
# arenas allocated current = 846
846 arenas * 262144 bytes/arena = 221,773,824
# bytes in allocated blocks = 217,998,792
# bytes in available blocks = 131,112
38 unused pools * 4096 bytes = 155,648
```

It looks nice. Additionally, both "time python dump.py" and "time python load.py" became slightly faster.

master dump (note that this time includes not only the dump, but also constructing the data):

```
real 0m3.539s
user 0m3.266s
sys  0m0.196s
```

master load:

```
real 0m1.408s
user 0m1.292s
sys  0m0.116s
```

PR-13036 dump:

```
real 0m2.758s
user 0m2.598s
sys  0m0.088s
```

PR-13036 load:

```
real 0m1.239s
user 0m1.183s
sys  0m0.056s
```

Would pickle experts review the PR?
**msg341284** - Author: Serhiy Storchaka (serhiy.storchaka) - Date: 2019-05-02 16:43

Great work, Inada-san! Thank you for your investigation!

PR 13036 increases the chance of using borrowed references during pickling. Since this bug exists in the current code too (it just can be exposed in a smaller number of cases), it should be fixed in any case. So I am going to fix this bug before merging PR 13036, and fix it in a way that does not prevent the optimization.
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:59:14 | admin | set | github: 80875 |
| 2019-05-02 16:43:18 | serhiy.storchaka | set | messages: + msg341284 |
| 2019-05-02 06:27:04 | methane | set | nosy: + pitrou |
| 2019-05-02 06:26:50 | methane | set | messages: + msg341251 |
| 2019-05-01 12:30:38 | serhiy.storchaka | set | messages: + msg341193 |
| 2019-05-01 12:28:15 | serhiy.storchaka | set | keywords: + patch; stage: patch review; pull_requests: + pull_request12956 |
| 2019-05-01 07:35:28 | methane | set | messages: + msg341181 |
| 2019-05-01 04:41:39 | methane | set | versions: + Python 3.8, - Python 3.6 |
| 2019-05-01 04:41:31 | methane | set | messages: + msg341179 |
| 2019-04-27 08:41:19 | methane | set | nosy: + methane |
| 2019-04-21 18:58:54 | Ellenbogen | set | files: - dump.py |
| 2019-04-21 18:58:47 | Ellenbogen | set | files: + dump.py |
| 2019-04-21 18:58:29 | Ellenbogen | set | files: - dump.py |
| 2019-04-21 18:58:07 | Ellenbogen | set | files: + dump.py; messages: + msg340617 |
| 2019-04-21 17:32:32 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg340616 |
| 2019-04-21 17:11:43 | Ellenbogen | set | files: + common.py |
| 2019-04-21 17:11:38 | Ellenbogen | set | files: + load.py |
| 2019-04-21 17:11:20 | Ellenbogen | create | |
