Issue36304
Created on 2019-03-15 13:19 by janluke, last changed 2022-04-11 14:59 by admin.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| demonstrate_BOM_issue.py | janluke, 2019-03-15 13:19 | Demonstrate the issue |
| Messages (4) | |||
|---|---|---|---|
| msg337987 - (view) | Author: Gianluca (janluke) | Date: 2019-03-15 13:19 | |
When bz2 and lzma files are opened for writing in text mode (and are therefore wrapped in a TextIOWrapper), the BOM of encodings such as utf-16 and utf-32 is not written. The gzip module works as expected (it writes the BOM). Code that demonstrates this behavior (tested with Python 3.7) is attached here and can also be found on Stack Overflow: https://stackoverflow.com/questions/55171439/python-bz2-and-lzma-in-mode-wt-dont-write-the-bom-while-gzip-does-why?noredirect=1#comment97103212_55171439 |
|||
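The report above can be checked with a short script along the following lines. This is a minimal sketch and not the attached demonstrate_BOM_issue.py: the helper name `writes_bom` is made up for illustration, and the script writes to an in-memory `io.BytesIO` instead of a real file. It assumes only the documented `open()` and `decompress()` functions of the gzip, bz2 and lzma modules. The result printed for bz2 and lzma may depend on the Python version in use.

```python
import bz2
import codecs
import gzip
import io
import lzma

# Either byte order counts as "a BOM was written".
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def writes_bom(module):
    """Return True if module.open(..., 'wt', encoding='utf-16') emits a BOM.

    `module` is expected to be gzip, bz2 or lzma (hypothetical helper,
    written only for this illustration).
    """
    raw = io.BytesIO()
    # In text mode, each module wraps its compressed binary stream
    # in an io.TextIOWrapper.
    with module.open(raw, "wt", encoding="utf-16") as f:
        f.write("HI")
    # The caller-supplied BytesIO stays open after the wrapper is closed,
    # so the compressed bytes can still be read back.
    payload = module.decompress(raw.getvalue())
    return payload.startswith(UTF16_BOMS)


if __name__ == "__main__":
    for module in (gzip, bz2, lzma):
        print(module.__name__, "writes BOM:", writes_bom(module))
```

Decompressing the result and inspecting the first two bytes avoids any dependence on how the compressed container itself is laid out.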
| msg338001 - (view) | Author: Gianluca (janluke) | Date: 2019-03-15 16:41 | |
As the Stack Overflow answer shows, using _pyio.TextIOWrapper works as expected. So it looks like this is a bug in _io.TextIOWrapper. |
|||
| msg338045 - (view) | Author: Martin Panter (martin.panter) * | Date: 2019-03-16 00:24 | |
I suspect this is caused by TextIOWrapper guessing if it is writing the start of a file versus in the middle, and being confused by “seekable” returning False. GzipFile implements some “seek” calls in write mode, but LZMAFile and BZ2File do not.
Using this test class:
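import io, _pyio
from io import BufferedIOBase

class Writer(BufferedIOBase):
    # offset is the value reported by tell(); None means "not seekable"
    def writable(self):
        return True
    def __init__(self, offset):
        self.offset = offset
    def seekable(self):
        result = self.offset is not None
        print('seekable ->', result)
        return result
    def tell(self):
        print('tell ->', self.offset)
        return self.offset
    def write(self, data):
        # Log the encoded bytes instead of storing them
        print('write', repr(data))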
a BOM is inserted when “tell” returns zero:
>>> t = io.TextIOWrapper(Writer(0), 'utf-16')
seekable -> True
tell -> 0
>>> t.write('HI'); t.flush() # Writes BOM
2
write b'\xff\xfeH\x00I\x00'
and not when “tell” returns a positive number:
>>> t = io.TextIOWrapper(Writer(1), 'utf-16')
seekable -> True
tell -> 1
>>> t.write('HI'); t.flush() # Omits BOM
2
write b'H\x00I\x00'
However the “io” and “_pyio” behaviours differ when “seekable” returns False:
>>> t = io.TextIOWrapper(Writer(None), 'utf-16')
seekable -> False
>>> t.write('HI'); t.flush() # io omits BOM
2
write b'H\x00I\x00'
>>> t = _pyio.TextIOWrapper(Writer(None), 'utf-16')
seekable -> False
>>> t.write('HI'); t.flush() # _pyio writes BOM
write b'\xff\xfeH\x00I\x00'
2
IMO the “_pyio” behaviour is more sensible: write a BOM because that’s what the UTF-16 codec produces.
|
|||
| msg338241 - (view) | Author: Gianluca (janluke) | Date: 2019-03-18 15:27 | |
If the file is not seekable, we could decide based on the file mode:
- if mode='w', write the BOM
- if mode='a', don't write the BOM

Of course, mode 'a' doesn't guarantee that we are in the middle of the file, but not writing the BOM when we are "appending" to the file seems consistent. |
|||
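The mode-based rule proposed in the last message can be written down as a small decision function. This is a sketch of the proposal only, not of what TextIOWrapper actually does: `should_write_bom` and its parameters are hypothetical, and treating every mode without an 'a' (for example 'x') like 'w' is an assumption that the proposal itself does not cover.

```python
def should_write_bom(seekable, position, mode):
    """Decide whether a BOM should precede the first write (hypothetical policy).

    seekable: result of the underlying stream's seekable()
    position: result of tell() if seekable, otherwise ignored
    mode:     file mode string such as 'w', 'wt', 'a' or 'at'
    """
    if seekable:
        # Seekable case: only the very start of the stream gets a BOM.
        return position == 0
    # Non-seekable case, per the proposal: write the BOM unless appending.
    # Assumption: any mode without 'a' is treated like 'w'.
    return "a" not in mode


if __name__ == "__main__":
    for args in [(True, 0, "w"), (True, 1, "w"), (False, None, "w"), (False, None, "a")]:
        print(args, "->", should_write_bom(*args))
```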
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:59:12 | admin | set | github: 80485 |
| 2019-03-18 16:16:40 | vstinner | set | nosy: - vstinner |
| 2019-03-18 15:27:57 | janluke | set | messages: + msg338241 |
| 2019-03-16 00:24:42 | martin.panter | set | nosy: + martin.panter; messages: + msg338045 |
| 2019-03-15 21:13:39 | terry.reedy | set | nosy: + lemburg, vstinner, benjamin.peterson, ezio.melotti |
| 2019-03-15 16:41:33 | janluke | set | messages: + msg338001 |
| 2019-03-15 13:19:03 | janluke | create | |
