Issue 27797: ASCII file with UNIX line conventions and enough lines throws SyntaxError when ASCII-compatible codec is declared
Created on 2016-08-19 07:38 by mjpieters, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Messages (2) | |||
|---|---|---|---|
| msg273087 - (view) | Author: Martijn Pieters (mjpieters) * | Date: 2016-08-19 07:37 | |
To reproduce, create an ASCII file with > io.DEFAULT_BUFFER_SIZE bytes (can be blank lines) and *UNIX line endings*, with the first two lines reading:
    #!/usr/bin/env python
    # -*- coding: cp1252 -*-
Try to run this as a script on Windows:
    C:\Python35\python.exe encoding-problem-cp1252.py
      File "encoding-problem-cp1252.py", line 2
    SyntaxError: encoding problem: cp1252
Converting the file to use CRLF (Windows) line endings makes the problem go away.
This appears to be fallout from issue #20731.
Demo file that reproduces this issue at 710 bytes: https://github.com/techtonik/testbin/raw/fbb8aec3650b45f690c4febfd621fe5d6892b14a/python/encoding-problem-cp1252.py
First reported by anatoly techtonik at https://stackoverflow.com/questions/39032416/python-3-5-syntaxerror-encoding-prob-em-cp1252
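
For anyone who would rather regenerate a test file than download the demo, here is a minimal sketch based only on the reproduction steps above. The helper names, the `print("ok")` body and the default size are assumptions for illustration; they are not taken from the demo file.

```python
import io

# The two header lines quoted in the report, with UNIX (LF) line endings.
HEADER = b"#!/usr/bin/env python\n# -*- coding: cp1252 -*-\n"


def build_repro_source(min_size=io.DEFAULT_BUFFER_SIZE + 1):
    """Return ASCII source bytes with LF-only line endings.

    Blank lines pad the file to at least `min_size` bytes, following the
    "can be blank lines" note in the report.
    """
    body = b'print("ok")\n'  # assumed body; any valid ASCII code would do
    padding = b"\n" * max(0, min_size - len(HEADER) - len(body))
    return HEADER + padding + body


def write_repro(path, data):
    """Write the bytes in binary mode so that no newline translation occurs."""
    with open(path, "wb") as f:
        f.write(data)
```

Calling `write_repro("encoding-problem-cp1252.py", build_repro_source())` and running the result with an affected Python build on Windows should exercise the same code path. I have not checked which sizes trigger the error on which builds, so `min_size` may need adjusting.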
|
|||
| msg273110 - (view) | Author: Eryk Sun (eryksun) * | Date: 2016-08-19 12:28 | |
In issue 20844 I suggested opening the file in binary mode, i.e. change the call to _Py_wfopen(filename, L"rb") in Modules/main.c. That would also entail documenting that PyRun_SimpleFileExFlags requires a FILE pointer that's opened in binary mode. After making this change, there's no problem parsing "encoding-problem-cp1252.py":

    >python --version
    Python 3.6.0a4+
    >python encoding-problem-cp1252.py
    ok

When fp_setreadl is called while parsing "encoding-problem-cp1252.py", 47 bytes in the FILE buffer have been read -- up to the end of the coding spec. Let's verify this in the debugger:

    0:000> bp python35_d!fp_setreadl
    0:000> g
    Breakpoint 0 hit
    python35_d!fp_setreadl:
    00000000`662bee00 4889542410 mov qword ptr [rsp+10h],rdx ss:000000d7`6cfeead8=000000d76cfeeaf8
    0:000> ;as /x fp @@(((python35_d!tok_state *)@rcx)->fp)
    0:000> ;as /x ptr @@(((ucrtbased!__crt_stdio_stream_data *)${fp})->_ptr)
    0:000> ;as /x base @@(((ucrtbased!__crt_stdio_stream_data *)${fp})->_base)
    0:000> ?? ${ptr} - ${base}
    int64 0n47

ftell() should return 47, but instead it returns -1. You can see this by opening the file in Python 2 on Windows, which uses FILE streams:

    >>> f = open('encoding-problem-cp1252.py')
    >>> f.read(47)
    '#!/usr/bin/env python\n# -*- coding: cp1252 -*-\n'
    >>> f.tell()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 0] Error

ftell starts by getting the file position from the OS and then subtracts the unread bytes in the buffer. The buffer has already undergone CRLF => LF translation, so ftell makes an assumption that the file uses CRLF line endings and thus subtracts 2 bytes for each unread LF. In this case the buffer happens to have 48 unread LFs, so ftell returns -1, with the only actual error being a fundamentally flawed design in the CRT's text mode. |
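
The ftell arithmetic described above can be sketched as a toy model. This is not the CRT's code: it only mirrors the description in this message, it assumes the whole file was read into the buffer in a single fill, and `modeled_text_ftell` and the sample data are hypothetical.

```python
HEADER = b"#!/usr/bin/env python\n# -*- coding: cp1252 -*-\n"  # 47 bytes


def modeled_text_ftell(raw, consumed):
    """Toy model of a text-mode ftell, per the description above.

    raw      -- the file's bytes as stored on disk
    consumed -- number of (translated) bytes the caller has already read

    Assumption: the whole file was buffered at once, so the OS-level
    file position equals len(raw).
    """
    translated = raw.replace(b"\r\n", b"\n")  # text mode: CRLF => LF
    unread = translated[consumed:]
    os_position = len(raw)
    # Each unread LF is assumed to have been CRLF on disk, so it is
    # counted as 2 bytes: 1 from len(unread) plus 1 from the LF count.
    return os_position - (len(unread) + unread.count(b"\n"))


# Sample data chosen so that 48 LFs remain unread after the header.
lf_file = HEADER + b"\n" * 47 + b'print("ok")\n'
crlf_file = lf_file.replace(b"\n", b"\r\n")

print(modeled_text_ftell(lf_file, len(HEADER)))    # -1
print(modeled_text_ftell(crlf_file, len(HEADER)))  # 49
```

With LF-only endings the model comes out at 47 - 48 = -1, which matches the failure above. With CRLF endings it gives 49, the real on-disk offset just past the two header lines, which is consistent with the CRLF conversion making the problem go away.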
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:35 | admin | set | github: 71984 |
| 2019-03-29 11:15:53 | methane | set | status: open -> closed; superseder: SyntaxError: encoding problem: iso-8859-1 on Windows; resolution: duplicate; stage: needs patch -> resolved |
| 2016-08-19 12:31:14 | vstinner | set | nosy: + vstinner |
| 2016-08-19 12:28:01 | eryksun | set | nosy: + eryksun; messages: + msg273110 |
| 2016-08-19 07:55:50 | SilentGhost | set | stage: needs patch; type: behavior; versions: - Python 3.4 |
| 2016-08-19 07:38:02 | mjpieters | create | |
