Message 320763 - Python tracker

Author	hajoscher
Recipients	hajoscher
Date	2018-06-30.09:27:00
Message-id	<1530350820.63.0.56676864532.issue34010@psf.upfronthosting.co.za>
In-reply-to

Content

Buffered reads of large files from a compressed tarfile stream perform poorly.

The buffered read in tarfile's _Stream class builds its result by repeatedly extending a bytes object.
It is much more efficient to collect the chunks in a list and join them once at the end.
For large files, using a list can mean seconds instead of minutes.
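
To illustrate the difference, here is a minimal sketch of the two accumulation strategies. It is not the actual tarfile code; the function names, chunk size and chunk count are made up for the comparison.

```python
import time


def read_concat(chunks):
    """Accumulate by extending a bytes object (builds a new object each time)."""
    buf = b""
    for chunk in chunks:
        buf += chunk
    return buf


def read_join(chunks):
    """Accumulate the chunks in a list and join them once at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return b"".join(parts)


def time_it(func, chunks):
    """Return the wall-clock time in seconds for one call of func(chunks)."""
    start = time.perf_counter()
    func(chunks)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical workload: 500 chunks of 16 KiB each (8 MiB in total).
    chunks = [b"x" * 16384] * 500
    print("concat:", time_it(read_concat, chunks))
    print("join:  ", time_it(read_join, chunks))
```

Both functions return the same bytes; only the cost of getting there differs, and the gap should widen as the number of chunks grows.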

This performance regression was introduced in commit b506dc32c1a.

How to test:
# create a 50 MB file of random data and pack it into a gzip-compressed tar file
dd if=/dev/urandom of=test.bin count=50 bs=1M
tar czvf test.tgz test.bin

# read with tarfile as stream (note pipe symbol in 'r|gz')
import tarfile
tfile = tarfile.open("test.tgz", 'r|gz')
for t in tfile:
    file = tfile.extractfile(t)
    if file:
        print(len(file.read()))
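
For anyone without dd and tar at hand, the same test can be sketched in pure Python. This is an assumed equivalent of the steps above, not something taken from the tarfile test suite: it builds the archive in memory instead of on disk, and the helper names are made up.

```python
import io
import os
import tarfile
import time


def make_tgz(payload_size):
    """Build a gzip-compressed tar archive in memory holding one random file."""
    payload = os.urandom(payload_size)
    archive = io.BytesIO()
    with tarfile.open(fileobj=archive, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="test.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    archive.seek(0)
    return archive


def read_stream(archive):
    """Read every regular file through the stream interface ('r|gz')."""
    sizes = []
    with tarfile.open(fileobj=archive, mode="r|gz") as tar:
        for member in tar:
            extracted = tar.extractfile(member)
            if extracted:
                sizes.append(len(extracted.read()))
    return sizes


if __name__ == "__main__":
    archive = make_tgz(50 * 1024 * 1024)  # 50 MiB, as in the shell version
    start = time.perf_counter()
    sizes = read_stream(archive)
    print(sizes, "read in %.2f s" % (time.perf_counter() - start))
```

Only the read is timed, so the cost of generating and compressing the random data does not distort the measurement.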

History
Date	User	Action	Args
2018-06-30 09:27:00	hajoscher	set	recipients: + hajoscher
2018-06-30 09:27:00	hajoscher	set	messageid: <1530350820.63.0.56676864532.issue34010@psf.upfronthosting.co.za>
2018-06-30 09:27:00	hajoscher	link	issue34010 messages
2018-06-30 09:27:00	hajoscher	create