Issue 33173: GzipFile's .seekable() returns True even if underlying buffer is not seekable
Created on 2018-03-28 16:28 by Walt Askew, last changed 2022-04-11 14:58 by admin.
Messages (5)
msg314613 - (view)
Author: Walt Askew (Walt Askew)
Date: 2018-03-28 16:28
Date: 2018-10-06 01:04
Date: 2018-10-06 06:03
Date: 2018-11-09 08:45
Date: 2018-11-09 08:55
The seekable method on gzip.GzipFile always returns True, even if the underlying buffer is not seekable. However, if seek is called on the GzipFile, the seek will fail unless the underlying buffer is seekable. This can cause consumers of the GzipFile object to mistakenly believe calling seek on the object is safe, when in fact it will lead to an exception.
For example, this led to a bug when I was trying to use requests & boto3 to stream & decompress an S3 upload like so:
resp = requests.get(uri, stream=True)
decompressed = gzip.GzipFile(fileobj=resp.raw)
boto3.client('s3').upload_fileobj(decompressed, Bucket=bucket, Key=key)
boto3 checks the seekable method on the the GzipFile, chooses a code path based on the file being seekable but later raises an exception when the seek call fails because the underlying HTTP stream is not seekable.
msg327206 - (view)
Author: Cheryl Sabella (cheryl.sabella) *
Date: 2018-10-06 01:04
Allowing for non seekable files was added in issue1675951. And under that issue in msg117131, the author of the change wrote: "The patch creates another problem with is not yet fixed: The implementation of .seekable() is becoming wrong. As one can now use non seekable files the implementation should check if the file object used for reading is really seekable." issue23529 made significant changes to the code and seekable() is again mentioned in msg239245 and subsequent comments. Nosying the devs who worked on those issues.msg327226 - (view) Author: Martin Panter (martin.panter) *
Date: 2018-10-06 06:03
If a change is made, it would be nice to bring the “gzip”, “bzip” and LZMA modules closer together. The current “bzip” and LZMA modules rely on the underlying “seekable” method without a fallback implementation, but also have a check for read mode. I think the seeking functionality in these modules is a misfeature. But since it is already here, it is probably best to leave it alone, and just document it. My comment about making “seekable” stricter is at <https://bugs.python.org/review/23529/diff/14296/Lib/gzip.py#oldcode550>. Even if the underlying stream is not seekable, GzipFile can still fast-forward. Here is a demonstration: >>> z = BytesIO(bytes.fromhex( ... "1F8B08000000000002FFF348CD29D051F05448CC55282E294DCE56C8CC53485448AFCA" ... "2C5048CBCC490500F44BF0A01F000000" ... )) >>> def seek(*args): raise UnsupportedOperation() ... >>> z.seek = seek # Make the underlying stream not seekable >>> f = GzipFile(fileobj=z) >>> f.read(10) b'Help, I am' >>> f.seek(20) # Fast forward 20 >>> f.read() b'a gzip file' >>> f.seek(0) # Rewind Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/proj/python/cpython/Lib/gzip.py", line 368, in seek return self._buffer.seek(offset, whence) File "/home/proj/python/cpython/Lib/_compression.py", line 137, in seek self._rewind() File "/home/proj/python/cpython/Lib/gzip.py", line 515, in _rewind super()._rewind() File "/home/proj/python/cpython/Lib/_compression.py", line 115, in _rewind self._fp.seek(0) File "/home/proj/python/cpython/Lib/gzip.py", line 105, in seek return self.file.seek(off) File "<stdin>", line 1, in seek io.UnsupportedOperationmsg329505 - (view) Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2018-11-09 08:45
I share Martin's opinion that this is a misfeature. User code can check seekable() and use seek() if it returns True or cache necessary data in memory if it returns False, because it is expected that seek() is more efficient. But in case of GzipFile it is not efficient, and can lead to decompression the whole content of the file and to much worse performance.msg329506 - (view) Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2018-11-09 08:55
And I share Martin's concern about fast-forward with an unseekable underlying file. If this works in current code, we can't simply return break it. This may mean that we can't change the implementation of GzipFile.seekable() at all, even if it lies in some cases.
History
Date
User
Action
Args
2022-04-11 14:58:59adminsetgithub: 77354
2018-11-09 08:55:32serhiy.storchakasetmessages:
+ msg329506
2018-11-09 08:45:28serhiy.storchakasetnosy:
+ serhiy.storchaka
messages: + msg329505
2018-10-06 06:03:22martin.pantersetmessages: + msg327226 2018-10-06 01:04:28cheryl.sabellasetnosy: + cheryl.sabella, pitrou, martin.panter
2018-03-28 23:46:24python-devsetkeywords: + patch
stage: patch review
pull_requests: + pull_request6021 2018-03-28 16:28:47Walt Askewcreate
messages: + msg329505
2018-10-06 06:03:22martin.pantersetmessages: + msg327226 2018-10-06 01:04:28cheryl.sabellasetnosy: + cheryl.sabella, pitrou, martin.panter
messages:
+ msg327206
versions:
+ Python 3.8, - Python 3.6
2018-03-28 23:46:24python-devsetkeywords: + patch
stage: patch review
pull_requests: + pull_request6021 2018-03-28 16:28:47Walt Askewcreate