bpo-33416: Add end positions to Python AST by ilevkivskyi · Pull Request #11605

bpo-33416: Add end positions to Python AST by ilevkivskyi · Pull Request #11605 · python/cpython

The majority of this PR is tediously passing end_lineno and end_col_offset everywhere. Here are non-trivial points:

It is not possible to reconstruct end positions in AST "on the fly", some information is lost after an AST node is constructed, so we need two more attributes for every AST node end_lineno and end_col_offset.
I add end position information to both CST and AST. Although it may be technically possible to avoid adding end positions to CST, the code becomes more cumbersome and less efficient.
Since the end position is not known for non-leaf CST nodes while the next token is added, this requires a bit of extra care (see _PyNode_FinalizeEndPos). Unless I made some mistake, the algorithm should be linear.
For statements, I "trim" the end position of suites to not include the terminal newlines and dedent (this seems to be what people would expect), for example in the end line and end column for the class definition is (2, 8).
For end_col_offset I use the common Python convention for indexing, for example for pass the end_col_offset is 4 (not 3), so that [0:4] gives one the source code that corresponds to the node.
I added a helper function ast.get_source_segment(), to get source text segment corresponding to a given AST node. It is also useful for testing.

An (inevitable) downside of this PR is that AST now takes almost 25% more memory. I think however it is probably justified by the benefits.

https://bugs.python.org/issue33416

Conflicts:
	Parser/parsetok.c

Never mind, that was SyntaxError. The AST uses 0-based column offsets.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very excited for this change

		The UTF-8 offset is recorded because the parser uses UTF-8 internally.

		Note that the end positions are not required by the compiler and are
		therefore optional. The end offset if after the last symbol, for example

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

	therefore optional. The end offset if after the last symbol, for example
	therefore optional. The end offset is after the last symbol, for example

		This is useful to "move code" to a different location in a file.
		Increment the line number and end line number of each node in the tree
		starting at node by n. This is useful to "move code" to a different
		location in a file..

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

	location in a file..
	location in a file.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew! I don't have much to add, this all looks great. I don't think in these days 20% extra space to represent CST+AST for a single module is a big deal (though it would be nice to see what the difference is in absolute number of bytes for Lib/tkinter/init.py, which AFAIK is still the largest stdlib module).

		to new_node if possible, and return new_node.
		Copy source location (:attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`,
		and :attr:`end_col_offset`) from old_node to new_node if possible,
		and return new_node.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only affect the node, or the whole tree? I've never used this so I don't know what to expect from context.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one (unlike some others) is non-recursive, one can even copy from a node of a different kind.

		@@ -438,152 +446,188 @@ mod_ty _Py_Interactive(asdl_seq * body, PyArena *arena);
		mod_ty _Py_Expression(expr_ty body, PyArena *arena);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few lines up, I wonder if it would be a good idea to add a comment explaining that these macros also change the definition of the functions (not just the calls).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment.

		_fix(child, lineno, col_offset)
		_fix(node, 1, 0)
		_fix(child, lineno, col_offset, end_lineno, end_col_offset)
		_fix(node, 1, 0, 1, 0)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole function looks a bit suspicious.Shouldn't it at least ensure that (end_lineno, end_col_offset) > (lineno, col_offset)? (Again, I guess I don't know the use case.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is used to compile manually generated AST to bytecode. Although technically end positions are not needed for the compiler (only start positions are recorded in the byte code), it is probably good to have them fixed. The algorithm for fixing is quite naive (take either a position of nearest root-side node if known, otherwise use start of the file), but it never intended to be robust, it exist just to allow compilation in those case where a user doesn't care about line numbers.


		if end_lineno == lineno:
		source_line = source.splitlines()[lineno]
		return source_line.encode()[col_offset:end_col_offset].decode(coding)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I read this well, source is bytes, and the col_offsets are byte offsets? And a tab counts as one space? Is that explicit somewhere in the ast module docs? (This is easy to miss when writing code since most source code is ASCII only so the difference doesn't matter, but it would break whenever a token appears after a non-ASCII character in an identifier or string literal.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All you say is right, and yes it is already mentioned in the docs that they are UTF-8 byte offsets.

		last = lines[end_lineno].encode()[:end_col_offset].decode(coding)
		middle_lines = lines[lineno+1:end_lineno]

		return '\n'.join([first] + middle_lines + [last])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would avoid two list copies if you used .insert() and .append() to merge first and last into the middle_lines list.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

		self.assertEqual(ast.get_source_segment(s_orig, cdef.body[0], padded=True),
		s_method)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I trust your tests. If they pass, they're good. :-)

		@@ -0,0 +1 @@
		Add end line and end column position information to the Python AST nodes. No newline at end of file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add that this is a backwards incompatible API change for anything that creates AST nodes? (It's fine if you just call ast.parse(), but not if you ever create a new node.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually most of the things still work:

All existing manual AST creation code will work, ast.Constant(value=42, lineno=1, col_offset=0) is still valid.
Compilation of existing manually generated ASTs will still work, the end position are int? in Python.asdl, so they are allowed to be absent and/or be None.

However, some C-level compatibility is broken. I am not sure what is the source of truth for "official" C API. I checked docs.python.org and none of the functions I updated is listed there.

I will try to re-formulate this.

		}

		return Call(func, args, keywords, func->lineno, func->col_offset, c->c_arena);
		// Ideally call ends where closing parenthesis is placed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(When does a call not have a close par?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It (close par) was NULL for "fake" calls used in analyzing class definitions, but I just fixed this and removed the comment.

		get_last_end_pos(asdl_seq s, int end_lineno, int *end_col_offset)
		{
		int tot = asdl_seq_LEN(s);
		stmt_ty last = asdl_seq_GET(s, tot - 1);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need a check that tot > 0?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this just in case (to be future proof against any AST optimizers).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a part of comments.

		:class:`AsyncFunctionDef` is now supported.


		.. function:: get_code_segment(source, node, *, padded=False, coding='utf-8')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think encoding is more common term in this context.

		The UTF-8 offset is recorded because the parser uses UTF-8 internally.

		Note that the end positions are not required by the compiler and are
		therefore optional. The end offset if after the last symbol, for example

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "optional" mean? Is it equal to None or an absent attribute?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually both are allowed. An optional attribute in Python.asdl can either be absent or None.

		If padded is `True`, the first line of a multi-line statement will
		be padded with spaces to match its original position.
		"""
		if not hasattr(node, 'lineno'):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified:

try:
    # get loneno, end_lineno, etc.
except AttributeError:
    return None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


		if end_lineno == lineno:
		source_line = source.splitlines()[lineno]
		return source_line.encode()[col_offset:end_col_offset].decode(coding)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think coding should not be used here. Always use UTF-8.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are totally right.

		lines = source.splitlines()

		if padded:
		padding = ' ' * len(lines[lineno].encode()[:col_offset].decode(coding))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to keep tabs and spaces (and maybe '\f'?) and replace all other characters with spaces. There may be similar code in C, look how the position of the caret is calculated.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes. It looks like there are bunch of corner cases around \r\n\f\t. I tried to update the function (including your suggestions) so that it mimics closely what happens in C code.

		self.assertEqual(len(stmt.attributes), 2)
		self.assertEqual(len(stmt.attributes), 4)
		self.assertEqual(str(stmt.attributes[0]), 'Field(int, lineno)')
		self.assertEqual(str(stmt.attributes[1]), 'Field(int, col_offset)')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests for stmt.attributes[2] and stmt.attributes[3]?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.


		int
		PyNode_AddChild(node n1, int type, char str, int lineno, int col_offset)
		PyNode_AddChild(node n1, int type, char str, int lineno, int col_offset,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is not this a part of the C API? Should not we add a new function?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "official" source of truth for C API? I didn't find this function on docs.python.org.

		@@ -225,7 +229,9 @@ future_hack(parser_state *ps)

		int
		PyParser_AddToken(parser_state ps, int type, char str,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is not this a part of the C API?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

		return text


		def _splitlines_no_ff(source):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to use .splitlines(keepends=True) (which considers \f as an endline) and then re-join those that aren't ending in \r or \n, I am not sure which approach is better.

@gvanrossum I just tried to compare the sizes of Lib/tkinter/__init__.py AST before and after this PR. Here are the numbers: before 5.8 Mbytes, after 7.2Mbytes (24% increase).

@gvanrossum I just tried to compare the sizes of Lib/tkinter/__init__.py AST before and after this PR. Here are the numbers: before 5.8 Mbytes, after 7.2Mbytes (24% increase).

How did you measure this? IIUC there's also a size increase for the CST, and during translation from CST to AST both are in memory. Is it possible to measure the peak memory usage during this translation?

I just looked at the difference in allocated memory reported in sys._debugmallocstats(). I just tried another way using tracemalloc. It reports that the line in question:

compile(source, filename, mode, PyCF_ONLY_AST)

allocates 3.7 Mbytes before and 4.6 Mbytes after this PR (still the same 24% relative increase). Also this number (24%) is quite reasonable taking into account that all nodes in CST have exactly the same size: 40 bytes before, 48 bytes after (at least on my 64-bit Linux).

I do think that's a very reasonable increase (we're not talking MicroPython here :-).

Also this number (24%) is quite reasonable taking into account that all nodes in CST have exactly the same size: 40 bytes before, 48 bytes after (at least on my 64-bit Linux).

That's sizeof what exactly? While the node types indeed are big unions, there are several different ones: struct _stmt, struct _expr and some minor ones (all defined in Include/Python-ast.h).

(I tried this myself on my Mac, and it looks like sizeof(struct _stmt) is 72 bytes, and sizeof(struct _expr) is 48 bytes. An int is 4 bytes.)

That's sizeof what exactly?

This is a size of _node. There is only one "main" struct in CST struct _node from node.h.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this! Let's land this. I combed through the code (both the tedious parts and the interesting parts) and it all looks good to me. I am not worried about the 25% increase in CST and AST tree sizes.

@gvanrossum
Great, thanks! I will merge this now, so that we can move with the typed_ast PR, but there is one question that Serhiy raised and I would like to double check: does the fact that PyNode_AddChild and PyParser_AddToken are not documented on docs.python.org mean they are not part of C-API? Also I just checked the files where they are declared (node.h and parser.h) are not included in Python.h.

does the fact that PyNode_AddChild and PyParser_AddToken are not documented on docs.python.org mean they are not part of C-API?

I'm not sure. Their names don't have _ prefixes so I think they live in some nebulous area. It would be good to bring this up in another forum where more core devs can think about the issue (some of whome have probably thought about it more).

In the worst case scenario, if it's deemed an unacceptable backwards incompatibility, we can always add new functions that add end_lineno and end_col_offset arguments, and make the old ones call the new ones with some kind of default values computed. But I would only resort to this if during the alpha/beta release we get actual complaints.

OK, I will post to Python-Dev about this.

		to new_node if possible, and return new_node.
		Copy source location (:attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`,
		and :attr:`end_col_offset`) from old_node to new_node if possible,
		and return new_node.