Issue 35107
Created on 2018-10-29 22:26 by gregory.p.smith, last changed 2018-10-30 14:59 by terry.reedy.
**Messages (7)**
**msg328876** - Author: Gregory P. Smith (gregory.p.smith) - Date: 2018-10-29 22:26

The behavior change introduced in 3.6.7 and 3.7.1 via https://bugs.python.org/issue33899 has further consequences:

```python
>>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 332, in untokenize
    out = ut.untokenize(iterable)
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 266, in untokenize
    self.add_whitespace(start)
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 227, in add_whitespace
    raise ValueError("start ({},{}) precedes previous end ({},{})"
ValueError: start (1,1) precedes previous end (2,0)
```

The same goes for using the documented tokenize API (`generate_tokens` is not documented):

```python
>>> tokenize.untokenize(tokenize.tokenize(io.BytesIO(b'#').readline))
...
ValueError: start (1,1) precedes previous end (2,0)
```

`untokenize()` is no longer able to work on the output of `generate_tokens()` if the input to `generate_tokens()` did not end in a newline.

Today's workaround: always append a newline, if one is missing, to the lines returned by the readline callable passed to tokenize or generate_tokens. This is very annoying to implement; a sketch follows.

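A minimal sketch of that workaround for the str-based `generate_tokens()` path (the wrapper name and structure are illustrative, not from any actual code):

```python
import io
import tokenize

def ensure_trailing_newline(readline):
    # Hypothetical helper: wrap a readline callable so any line that
    # lacks a trailing '\n' gets one before the tokenizer sees it.
    def wrapped():
        line = readline()
        if line and not line.endswith('\n'):
            line += '\n'
        return line
    return wrapped

tokens = tokenize.generate_tokens(
    ensure_trailing_newline(io.StringIO('#').readline))
print(repr(tokenize.untokenize(tokens)))  # '#\n' -- no ValueError, but the
                                          # output gains a newline the input lacked
```
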
**msg328878** - Author: Ammar Askar (ammar2) - Date: 2018-10-29 22:49

Looks like this is caused by these lines: https://github.com/python/cpython/blob/b83d917fafd87e4130f9c7d5209ad2debc7219cd/Lib/tokenize.py#L551-L558 which implicitly add a newline token after comments. Since the input didn't terminate with a '\n', the code to add a newline at the end of input also kicks in.

**msg328879** - Author: Ammar Askar (ammar2) - Date: 2018-10-29 23:21

fwiw I think there's more at play here than the newline change. This is the behavior I get on 3.6.5 (before the newline change is applied). '#' works as expected:

```python
>>> t.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
'#'
```

but check out this input:

```python
>>> t.untokenize(tokenize.generate_tokens(io.StringIO('x=1').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python365\lib\tokenize.py", line 272, in untokenize
    self.add_whitespace(start)
  File "D:\Python365\lib\tokenize.py", line 234, in add_whitespace
    .format(row, col, self.prev_row, self.prev_col))
ValueError: start (1,0) precedes previous end (2,0)
```

**msg328880** - Author: Gregory P. Smith (gregory.p.smith) - Date: 2018-10-30 00:39

Interesting! I have a 3.6.2 sitting around and cannot reproduce that "x=1" behavior.

I don't know what the behavior _should_ be; it just feels natural that untokenize should be able to round-trip anything tokenize or generate_tokens emits without raising an exception. I'm filing this because the "#" case came up within some existing code we had that happened to effectively test that particular round trip.

**msg328882** - Author: Ammar Askar (ammar2) - Date: 2018-10-30 00:56

Actually, never mind, disregard that; I was just testing it wrong.

I think the simplest fix here is to add '#' to the list of characters here, so we don't double-insert newlines for comments: https://github.com/python/cpython/blob/b83d917fafd87e4130f9c7d5209ad2debc7219cd/Lib/tokenize.py#L659

A test for round-tripping a file ending with a comment but no newline will allow that particular branch to be tested; a sketch of such a test is below. I'll make a PR this week if no one else gets to it.

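A sketch of the kind of round-trip test described above (the test class and method names are illustrative, not the actual patch); until a fix lands it fails with the ValueError from msg328876:

```python
import io
import tokenize
import unittest

class CommentRoundTripTest(unittest.TestCase):
    def test_comment_without_trailing_newline(self):
        # A source ending in a comment with no trailing newline should
        # tokenize and untokenize back to the original source.
        source = '#'
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        self.assertEqual(tokenize.untokenize(tokens), source)

if __name__ == '__main__':
    unittest.main()
```
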
**msg328884** - Author: Serhiy Storchaka (serhiy.storchaka) - Date: 2018-10-30 06:45

I am surprised that removing the newline character adds a token:

```python
>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#\n').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#\n'),
 TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n'),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#'),
 TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```

**msg328927** - Author: Terry J. Reedy (terry.reedy) - Date: 2018-10-30 14:59

It seems to me a bug that if '\n' is not present, tokenize adds both NL and NEWLINE tokens, instead of just one of them. Moreover, both tuples of the doubled correction look wrong.

If '\n' is present, `TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n')` looks correct. If NL represents a real character, the length-0 string='' in the generated `TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#')` seems wrong. I suspect that the idea was to mis-represent NL to avoid '\n' being added by untokenize.

In `TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='')`, string='' is mismatched with the start/end span, which has length 2-1 = 1. I am inclined to think that the following would be the correct added token, which should untokenize correctly (a hand-built check follows):

`TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='')`

`ast.dump(ast.parse(s))` returns 'Module(body=[])' for both versions of 's', so no help there.

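A hand-built check of this idea (the token tuples are constructed manually, combining both suggestions above by dropping the zero-width NL and using the corrected NEWLINE; this is not what the tokenizer currently emits):

```python
import tokenize

# Hypothetical corrected token stream for source '#' with no trailing
# newline: the comment, one zero-width synthesized NEWLINE as proposed
# above, and the end marker -- no doubled NL/NEWLINE pair.
tokens = [
    tokenize.TokenInfo(tokenize.COMMENT, '#', (1, 0), (1, 1), '#'),
    tokenize.TokenInfo(tokenize.NEWLINE, '', (1, 1), (1, 1), ''),
    tokenize.TokenInfo(tokenize.ENDMARKER, '', (2, 0), (2, 0), ''),
]
print(repr(tokenize.untokenize(tokens)))  # '#' -- round-trips without ValueError
```
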
**History**

| Date | User | Action | Args |
|---|---|---|---|
| 2018-10-30 14:59:51 | terry.reedy | set | messages: + msg328927 |
| 2018-10-30 06:45:57 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg328884 |
| 2018-10-30 00:56:47 | ammar2 | set | messages: + msg328882 |
| 2018-10-30 00:39:32 | gregory.p.smith | set | messages: + msg328880 |
| 2018-10-29 23:21:18 | ammar2 | set | messages: + msg328879 |
| 2018-10-29 23:16:58 | pablogsal | set | nosy: + pablogsal |
| 2018-10-29 22:49:32 | ammar2 | set | messages: + msg328878 |
| 2018-10-29 22:26:38 | gregory.p.smith | create | |