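Buffered reads of large files from a compressed tarfile stream perform poorly.
The buffered read in tarfile's _Stream builds its result by repeatedly extending a bytes object; since bytes are immutable, each extension copies the data accumulated so far.
It is much more efficient to collect the chunks in a list and join them once at the end.
For large archive members, using a list can mean seconds instead of minutes.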
This performance regression was introduced in b506dc32c1a.
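To illustrate the difference, here is a minimal sketch of the two accumulation patterns. It is not the actual tarfile code: the function names read_concat and read_join and the 16 KiB default for bufsize are made up for this example, and io.BytesIO stands in for the decompressed stream.

```python
import io


def read_concat(stream, size, bufsize=16 * 1024):
    """Hypothetical slow variant: extend a bytes object chunk by chunk."""
    buf = b""
    while len(buf) < size:
        chunk = stream.read(min(bufsize, size - len(buf)))
        if not chunk:
            break  # stream exhausted before `size` bytes were read
        buf += chunk  # builds a new bytes object, copying everything read so far
    return buf


def read_join(stream, size, bufsize=16 * 1024):
    """Hypothetical fast variant: collect chunks in a list, join once."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(min(bufsize, remaining))
        if not chunk:
            break  # stream exhausted before `size` bytes were read
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)  # single copy of the whole payload
```

Both functions return the same bytes; only the amount of copying differs, which is why the gap should widen as the member size grows.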
How to test:
# create a 50 MB file of random data and pack it into a gzipped tarball
dd if=/dev/urandom of=test.bin count=50 bs=1M
tar czvf test.tgz test.bin
# read with tarfile as stream (note pipe symbol in 'r|gz')
import tarfile
tfile = tarfile.open("test.tgz", 'r|gz')
for t in tfile:
    file = tfile.extractfile(t)
    # extractfile() returns None for members that are not regular files
    if file:
        print(len(file.read()))