Issue 36975: csv: undocumented UnicodeDecodeError on malformed file

Created on 2019-05-20 18:13 by alter-bug-tracer, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)

msg342939 - (view) Author: alter-bug-tracer (alter-bug-tracer) * Date: 2019-05-20 18:13

UnicodeDecodeError is thrown instead of csv.Error when parsing malformed inputs.
Examples:
1. file0
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte
Traceback (most recent call last):
  File "csv_parser.py", line 6, in <module>
    for row in reader:
  File "/usr/local/lib/python3.8/csv.py", line 111, in __next__
    row = next(self.reader)
  File "/usr/local/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
2. file1
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 51: invalid start byte
Traceback (most recent call last):
  File "csv_parser.py", line 6, in <module>
    for row in reader:
  File "/usr/local/lib/python3.8/csv.py", line 110, in __next__
    self.fieldnames
  File "/usr/local/lib/python3.8/csv.py", line 97, in fieldnames
    self._fieldnames = next(self.reader)
  File "/usr/local/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)

(file0, file1 and csv_parser.py attached)

msg342999 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2019-05-21 10:18

I don't understand the issue here. csv can raise many different errors when something goes wrong:

>>> import csv
>>> csv.reader(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument 1 must be an iterator

Why would UnicodeDecodeError not be appropriate here?

msg343023 - (view) Author: alter-bug-tracer (alter-bug-tracer) * Date: 2019-05-21 12:05

Shouldn't all of them be documented? Either that, or converted to csv.Error? Take, for example, the C++ std.

msg343040 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2019-05-21 12:40

I don't think all errors can be documented. csv iterates over the object but has no idea what it is. When writing, for example, anything could happen: a socket timing out, permission errors, the underlying media being removed improperly, the media running out of space, etc.

ISTM that catching all those exceptions and hiding them behind csv.Error is bad practice and not recommended. In C++, uncaught exceptions are part of the function signature, so it is easier to do this, but in Python we have no idea what the object you gave can raise when iterating over it.
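To illustrate the point about exceptions coming from the iterated object rather than from csv itself, here is a minimal sketch. The `flaky_lines` generator and the `read_all` helper are hypothetical names invented for this example, and the TimeoutError is simulated; the attached csv_parser.py is not reproduced here.

```python
import csv


def flaky_lines():
    """Hypothetical line source that fails partway through, like a dropped socket."""
    yield "a,b\n"
    yield "1,2\n"
    raise TimeoutError("simulated: source went away")


def read_all(lines):
    """Collect rows until the source is exhausted or raises."""
    rows = []
    try:
        for row in csv.reader(lines):
            rows.append(row)
    except csv.Error as exc:
        # Only reached when the CSV data itself is malformed.
        return rows, f"csv.Error: {exc}"
    except Exception as exc:
        # Anything raised by the underlying iterable passes through csv unchanged.
        return rows, f"{type(exc).__name__}: {exc}"
    return rows, None


rows, error = read_all(flaky_lines())
print(rows)
print(error)
```

The two rows produced before the failure are parsed normally, and the exception reported is the one raised by the source, not csv.Error.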

msg343324 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2019-05-23 20:41

This isn't a bug because the CSV format isn't malformed (which would be appropriate for csv.Error); rather, the file itself isn't appropriately encoded, or the proper encoding wasn't specified (hence UnicodeDecodeError). So the exception is appropriate.

And we do not document indirect exceptions that get raised by code, only those that are explicitly raised. So everything is as expected.
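The distinction between an encoding problem and a malformed CSV can be sketched as follows. The byte string `raw` is made up for illustration (it is not the attached file0 or file1), and `read_rows` is a hypothetical helper. Choosing latin-1 or errors="replace" is only one possible way to handle such input and assumes the caller knows what the data should be.

```python
import csv
import io

# Made-up input: a valid header line, then 0x80, which is never a valid UTF-8 start byte.
raw = b"name,value\nfoo,\x80\n"


def read_rows(data, encoding, errors="strict"):
    """Decode bytes as text with the given encoding, then parse the text as CSV."""
    text = io.TextIOWrapper(
        io.BytesIO(data), encoding=encoding, errors=errors, newline=""
    )
    return list(csv.DictReader(text))


# Wrong encoding: the failure happens while decoding, before csv sees any text.
try:
    read_rows(raw, "utf-8")
    outcome = "decoded"
except UnicodeDecodeError as exc:
    outcome = f"UnicodeDecodeError: {exc.reason}"

# Two ways to get past the decoding step.
latin1_rows = read_rows(raw, "latin-1")
replaced_rows = read_rows(raw, "utf-8", errors="replace")

# Malformed CSV: correctly decoded text that breaks the quoting rules in strict mode.
try:
    list(csv.reader(['a,"b"c\n'], strict=True))
    malformed = "parsed"
except csv.Error:
    malformed = "csv.Error"

print(outcome)
print(latin1_rows)
print(replaced_rows)
print(malformed)
```

Only the last case is a problem with the CSV format itself, which is why it is the only one reported as csv.Error.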

History

Date                 User              Action  Args
2022-04-11 14:59:15  admin             set     github: 81156
2019-05-23 20:41:11  brett.cannon      set     status: open -> closed; nosy: + brett.cannon; messages: + msg343324; resolution: not a bug; stage: resolved
2019-05-21 12:40:28  remi.lapeyre      set     messages: + msg343040
2019-05-21 12:05:22  alter-bug-tracer  set     messages: + msg343023
2019-05-21 10:18:23  remi.lapeyre      set     nosy: + remi.lapeyre; messages: + msg342999
2019-05-21 10:16:01  remi.lapeyre      set     files: + file1.txt
2019-05-21 10:15:50  remi.lapeyre      set     files: + file0.txt
2019-05-21 10:15:40  remi.lapeyre      set     files: + csv_parser.py
2019-05-20 18:13:43  alter-bug-tracer  create