Issue 36975: csv: undocumented UnicodeDecodeError on malformed file
Created on 2019-05-20 18:13 by alter-bug-tracer, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Messages (5)
msg342939 - (view)
Author: alter-bug-tracer (alter-bug-tracer) *
Date: 2019-05-20 18:13
UnicodeDecodeError is thrown instead of csv.Error when parsing malformed inputs.
Examples:
1. file0
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte
Traceback (most recent call last):
File "csv_parser.py", line 6, in <module>
for row in reader:
File "/usr/local/lib/python3.8/csv.py", line 111, in __next__
row = next(self.reader)
File "/usr/local/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
2. file1
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 51: invalid start byte
Traceback (most recent call last):
File "csv_parser.py", line 6, in <module>
for row in reader:
File "/usr/local/lib/python3.8/csv.py", line 110, in __next__
self.fieldnames
File "/usr/local/lib/python3.8/csv.py", line 97, in fieldnames
self._fieldnames = next(self.reader)
File "/usr/local/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
(file0, file1 and csv_parser.py attached)
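The attached files are not reproduced in this thread, so the following is only a minimal reproduction sketch. It assumes csv_parser.py iterates a csv.DictReader over a file opened in text mode as UTF-8 (the DictReader frames in the tracebacks suggest this). The helper name read_rows and the sample bytes are hypothetical stand-ins for the script and for file0.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Read every row with csv.DictReader from a text-mode file (hypothetical helper)."""
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.DictReader(f))


# Hypothetical stand-in for file0: 0xd5 starts a two-byte UTF-8 sequence,
# but the next byte ("n") is not a continuation byte.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"\xd5name,value\n1,2\n")
    bad_path = tmp.name

try:
    read_rows(bad_path)
    outcome = "no error"
except UnicodeDecodeError as exc:
    # Raised by the text layer while decoding, before csv parses anything.
    outcome = f"UnicodeDecodeError: {exc.reason}"
except csv.Error as exc:
    outcome = f"csv.Error: {exc}"
finally:
    os.unlink(bad_path)

print(outcome)
```

In this sketch the exception comes from the file object's decoder while csv pulls the next line from it, which is why the last traceback frame is in codecs.py rather than in the csv module.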
msg342999 - (view)
Author: Rémi Lapeyre (remi.lapeyre) *
Date: 2019-05-21 10:18
I don't understand the issue here; csv can raise many errors when an issue happens:
>>> import csv
>>> csv.reader(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument 1 must be an iterator
Why would UnicodeDecodeError not be appropriate here?
msg343023 - (view)
Author: alter-bug-tracer (alter-bug-tracer) *
Date: 2019-05-21 12:05
Shouldn't all of them be documented? Either that, or converted to csv.Error? Take, for example, the C++ std.
msg343040 - (view)
Author: Rémi Lapeyre (remi.lapeyre) *
Date: 2019-05-21 12:40
I don't think all errors can be documented; csv iterates over the object but has no idea what it is. When writing, for example, anything could happen: a socket timing out, permission errors, the underlying media being removed improperly, the media having no more space, etc. ISTM that catching all those exceptions and hiding them behind csv.Error is bad practice and not recommended. In C++, uncaught exceptions are part of the function signature, so it is easier to do this, but in Python we have no idea what the object you gave can raise when iterating over it.
msg343324 - (view)
Author: Brett Cannon (brett.cannon) *
Date: 2019-05-23 20:41
This isn't a bug because the CSV format isn't malformed (which would be appropriate for csv.Error); rather, the file itself isn't appropriately encoded (or the proper encoding wasn't specified), hence UnicodeDecodeError. So the exception is appropriate. And we do not document indirect exceptions that get raised by code, only those that are explicitly raised. So everything is as expected.
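The distinction drawn in this thread can be sketched as follows. This is an illustration under assumptions, not code from the issue: the sample bytes, the parse helper and the flaky_lines generator are all hypothetical. It contrasts a decoding failure (wrong encoding), a CSV-level failure (csv.Error in strict mode), and an arbitrary exception raised by the object csv iterates over.

```python
import csv
import io

# Hypothetical sample: "Rémi" encoded as Latin-1, which is not valid UTF-8.
raw = b"name,city\nR\xe9mi,Paris\n"


def parse(data, encoding, errors="strict"):
    """Decode bytes through a text wrapper and parse the result with csv.reader."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding,
                            errors=errors, newline="")
    return list(csv.reader(text))


# 1. Wrong encoding: the text layer fails, so UnicodeDecodeError propagates.
try:
    parse(raw, "utf-8")
    wrong = None
except UnicodeDecodeError as exc:
    wrong = exc

# 2. Correct encoding specified: the same bytes parse cleanly.
rows = parse(raw, "latin-1")

# 3. Lossy fallback: undecodable bytes become U+FFFD instead of raising.
lossy = parse(raw, "utf-8", errors="replace")

# 4. Malformed CSV: text after a closing quote is rejected in strict mode.
try:
    list(csv.reader(io.StringIO('a,"b"c\n'), strict=True))
    malformed = None
except csv.Error as exc:
    malformed = exc


# 5. Any exception raised by the underlying iterable passes through csv.
def flaky_lines():
    yield "a,b\n"
    raise TimeoutError("simulated I/O failure")


try:
    list(csv.reader(flaky_lines()))
    passed_through = None
except TimeoutError as exc:
    passed_through = exc

print(type(wrong).__name__, rows[1], lossy[1],
      type(malformed).__name__, type(passed_through).__name__)
```

A caller that wants a single failure path can catch UnicodeDecodeError and csv.Error together at the call site, which keeps the choice of what to hide with the application rather than inside the csv module.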
History
Date                 User              Action  Args
2022-04-11 14:59:15  admin             set     github: 81156
2019-05-23 20:41:11  brett.cannon      set     status: open -> closed; nosy: + brett.cannon; messages: + msg343324; resolution: not a bug; stage: resolved
                                               messages: + msg342999
2019-05-21 10:16:01  remi.lapeyre      set     files: + file1.txt
2019-05-21 10:15:50  remi.lapeyre      set     files: + file0.txt
2019-05-21 10:15:40  remi.lapeyre      set     files: + csv_parser.py
2019-05-20 18:13:43  alter-bug-tracer  create