Issue 3594
Created on 2008-08-19 01:19 by brett.cannon, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files

| File name | Uploaded | Description | Edit |
|---|---|---|---|
| fix_findencoding.diff | brett.cannon, 2008-08-19 05:25 | | |
Messages (8)

**msg71397** - Author: Brett Cannon (brett.cannon), Date: 2008-08-19 01:19

Turns out that PyTokenizer_FindEncoding() never properly succeeds, because the tok_state it uses does not have tok->filename set, which is an error condition in the tokenizer. The error has been masked in the one place the function is used, imp.find_module(): a NULL return is never checked as a possible error; instead the default source encoding is simply assumed to suffice.
**msg71398** - Author: Brett Cannon (brett.cannon), Date: 2008-08-19 01:20

I have not bothered to check whether this exists in 2.6, but I don't see why it would be any different.
**msg71399** - Author: Brett Cannon (brett.cannon), Date: 2008-08-19 01:44

Turns out that the NULL return value can signal an error that manifests itself as SyntaxError("encoding problem: with BOM"). Because tok->filename is not set in Parser/tokenizer.c:fp_setreadl(), which is called by check_coding_spec(), fp_setreadl() returns an error value; since tok->encoding was consequently never set, check_coding_spec() assumes the failure had something to do with the BOM.

The only reason this was found is that my bootstrapping of importlib into Py3K at some point triggers a PyErr_Occurred() that finally notices the error.
**msg71407** - Author: Brett Cannon (brett.cannon), Date: 2008-08-19 05:25

Attached is a patch that fixes where the error occurs: by opening the file by either file name or file descriptor, the problem goes away. Once this patch is accepted, a PyErr_Occurred() check should be added to all uses of PyTokenizer_FindEncoding().
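The caller-side check Brett asks for can be sketched roughly as follows. This is an editor's illustration against the CPython C API of the era, not the committed patch; the surrounding error label and the "utf-8" fallback are assumptions for the sketch.

```c
/* Sketch only: distinguish "no encoding declaration found" from a
 * real tokenizer error when PyTokenizer_FindEncoding() returns NULL. */
char *enc = PyTokenizer_FindEncoding(fd);
if (enc == NULL) {
    if (PyErr_Occurred())
        goto error;          /* real failure: propagate it */
    enc = "utf-8";           /* no declaration: fall back to the default
                                Py3K source encoding (assumed here) */
}
```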
**msg72392** - Author: Antoine Pitrou (pitrou), Date: 2008-09-03 16:41

I don't understand the whole decoding machinery in the tokenizer, but the patch looks OK to me (tested in debug mode under Linux and Windows).
**msg72420** - Author: Benjamin Peterson (benjamin.peterson), Date: 2008-09-03 21:26

The patch also looks pretty harmless to me. :)
**msg72477** - Author: Hyeshik Chang (hyeshik.chang), Date: 2008-09-04 03:35

pitrou, that's because Python source code can't be correctly tokenized when it's encoded in a few odd encodings, such as ISO-2022 or Shift JIS, that use \, (, ) and " as the second byte of a two-byte character sequence. For example, '\x81\\' is HORIZONTAL BAR in Shift JIS, so

exec('print "\x81\\"')

fails: the closing " is swallowed because the byte before it (0x5c, an ASCII backslash) is really the second byte of the character '\x81\\', yet the tokenizer treats it as an escape character.
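Hyeshik's point can be verified directly in current Python (an editor's demonstration, not part of the original thread):

```python
# The Shift JIS sequence b'\x81\x5c' is a single character (HORIZONTAL
# BAR), but its second byte, 0x5c, is the ASCII backslash. A tokenizer
# that scans the raw bytes without decoding them first mistakes that
# byte for an escape character and swallows a following quote.
raw = b"\x81\x5c"

decoded = raw.decode("shift_jis")
print(decoded, len(decoded))   # one character: U+2015 HORIZONTAL BAR
print(raw[1] == ord("\\"))     # True: second byte looks like a backslash
```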
**msg72480** - Author: Brett Cannon (brett.cannon), Date: 2008-09-04 05:04

Committed in r66209.
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:56:37 | admin | set | github: 47844 |
| 2008-09-04 05:04:57 | brett.cannon | set | status: open -> closed; resolution: accepted; messages: + msg72480 |
| 2008-09-04 03:35:43 | hyeshik.chang | set | nosy: + hyeshik.chang; messages: + msg72477 |
| 2008-09-03 21:26:19 | benjamin.peterson | set | nosy: + benjamin.peterson; messages: + msg72420 |
| 2008-09-03 16:41:40 | pitrou | set | nosy: + pitrou; messages: + msg72392 |
| 2008-08-21 20:33:50 | brett.cannon | set | keywords: + needs review |
| 2008-08-21 18:35:14 | brett.cannon | set | priority: critical -> release blocker |
| 2008-08-19 05:25:15 | brett.cannon | set | files: + fix_findencoding.diff; keywords: + patch; messages: + msg71407 |
| 2008-08-19 02:37:05 | brett.cannon | link | issue3574 dependencies |
| 2008-08-19 01:44:30 | brett.cannon | set | messages: + msg71399 |
| 2008-08-19 01:20:05 | brett.cannon | set | type: behavior; messages: + msg71398 |
| 2008-08-19 01:19:38 | brett.cannon | create | |
