Issue10382
Created on 2010-11-10 19:34 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | |
|---|---|
| File name | Uploaded |
| issue10382.diff | belopolsky, 2010-11-11 00:04 |
| issue10382a.diff | belopolsky, 2010-11-11 23:06 |
| Messages (5) | |||
|---|---|---|---|
| msg120930 | Author: Alexander Belopolsky (belopolsky) |
Date: 2010-11-10 19:34 | |
>>> ¡™£¢∞§¶•ªº
File "<stdin>", line 1
¡™£¢∞§¶•ªº
^
SyntaxError: invalid character in identifier
It looks like the caret position is computed with strlen(), which counts bytes, instead of the number of characters in the decoded string.
|
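To illustrate the mismatch described in the report above, here is a minimal sketch comparing the two counts for the same line. It assumes the source buffer is UTF-8 encoded, so the encoded length stands in for what a C strlen() call would return; the variable names are illustrative only.

```python
# The ten-character line from the report.
line = "¡™£¢∞§¶•ªº"

# Number of characters (code points) in the decoded string.
char_count = len(line)

# Number of bytes in the UTF-8 encoding; this stands in for the value
# a C strlen() call would see on the encoded buffer (an assumption).
byte_count = len(line.encode("utf-8"))

print(char_count, byte_count)
```

Every character in this line takes two or three bytes in UTF-8, so a caret placed by byte count lands well to the right of the character it is meant to mark.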
|||
| msg120933 | Author: Alexander Belopolsky (belopolsky) |
Date: 2010-11-11 00:04 | |
I am attaching a patch that seems to fix the issue. Note that I considered fixing the problem in parsetok.c where offset is originally computed, but this is part of pgen which has to be compiled without unicode support.
The test case suitable to be included in unittests is:
    try:
        eval(b'\xc2\xa1'.decode('utf-8'))
    except SyntaxError as err:
        assert(err.offset == 1)
|
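The test case above could be written as a unittest roughly as follows. This is only a sketch: the class and method names are hypothetical and are not taken from either attached patch. Unlike the bare try/except form, it also fails if no SyntaxError is raised at all.

```python
import unittest


class SyntaxErrorOffsetTest(unittest.TestCase):  # hypothetical name
    def test_offset_counts_characters(self):
        # U+00A1 (inverted exclamation mark) is two bytes in UTF-8
        # but a single character in the decoded string.
        source = b'\xc2\xa1'.decode('utf-8')
        with self.assertRaises(SyntaxError) as cm:
            eval(source)
        # The offset should be counted in characters, not bytes.
        self.assertEqual(cm.exception.offset, 1)
```

It can be run with `python -m unittest` from a module containing the class.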
|||
| msg120941 | Author: STINNER Victor (vstinner) |
Date: 2010-11-11 08:53 | |
See also #2382: I wrote patches two years ago for this issue. |
|||
| msg120982 | Author: Alexander Belopolsky (belopolsky) |
Date: 2010-11-11 23:05 | |
haypo> See also #2382: I wrote patches two years ago for this issue.
Yes, this is the same issue. I don't want to close this as a duplicate because #2382 contains a much more ambitious set of patches. What I am trying to achieve here is similar to the adjust_offset.patch there.
I am attaching a patch that takes an alternative approach and computes the number of characters in the parser. I strongly believe that the buffer in the tokenizer always contains UTF-8 encoded text. If it is not so already, I would consider making it so by replacing a call to _PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String() (if that matters).
The patch still needs unittests and possibly has some off-by-one issues, but I would first like to reach agreement that this is the right level at which to fix the problem. |
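The idea of counting characters in a UTF-8 buffer can be sketched in Python as below. This is not the attached patch, which works in C inside the parser; the function name is hypothetical, and the sketch assumes the buffer holds valid UTF-8 and that the byte offset falls on a character boundary.

```python
def byte_offset_to_char_offset(line: bytes, byte_offset: int) -> int:
    """Count the characters in the first byte_offset bytes of a UTF-8 buffer.

    Hypothetical helper for illustration; assumes valid UTF-8 input and an
    offset that does not split a multi-byte sequence.
    """
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. Every other
    # byte starts a new character, so counting those gives the number of
    # characters without decoding the buffer.
    return sum(1 for b in line[:byte_offset] if (b & 0xC0) != 0x80)


# The two-byte sequence from the test case is one character.
print(byte_offset_to_char_offset(b'\xc2\xa1', 2))
```

Counting lead bytes avoids any dependency on the unicode machinery, which may matter for code that must build without it, though whether this matches what either patch does is not confirmed here.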
|||
| msg190931 | Author: Alexander Belopolsky (belopolsky) |
Date: 2013-06-10 20:37 | |
The latest patch at #2382 is simpler than mine, so I am closing this as a duplicate. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:08 | admin | set | github: 54591 |
| 2013-06-10 20:37:57 | belopolsky | set | status: open -> closed; superseder: [Py3k] SyntaxError cursor shifted if multibyte character is in line.; resolution: duplicate; messages: + msg190931 |
| 2010-11-11 23:06:14 | belopolsky | set | files: + issue10382a.diff |
| 2010-11-11 23:05:52 | belopolsky | set | messages: + msg120982 |
| 2010-11-11 08:53:41 | vstinner | set | messages: + msg120941 |
| 2010-11-11 01:37:09 | belopolsky | link | issue10384 dependencies |
| 2010-11-11 00:17:27 | belopolsky | set | nosy: + loewis |
| 2010-11-11 00:04:06 | belopolsky | set | files: + issue10382.diff; messages: + msg120933; assignee: belopolsky |
| 2010-11-10 20:57:44 | belopolsky | set | nosy: + lemburg, vstinner, ezio.melotti |
| 2010-11-10 19:34:23 | belopolsky | create | |
