[Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?
Eric V. Smith
eric at trueblade.com
Thu May 17 18:38:59 EDT 2018
On 5/17/2018 3:01 PM, Larry Hastings wrote:
>
>
> I fed this into tokenize.tokenize():
>
> b''' x = "\u1234" '''
>
> I was a bit surprised to see \Uxxxx in the output. Particularly because
> the output (t.string) was a *string* and not *bytes*.
For those (like me) who have no idea how to use tokenize.tokenize's
wacky interface, the test code is:
list(tokenize.tokenize(io.BytesIO(b''' x = "\u1234" ''').readline))
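Spelled out with its imports, that one-liner might look like the sketch below. The helper name string_tokens is made up for illustration, and the input is written as a raw bytes literal (an assumption on my part) so the backslash reaches the tokenizer without triggering an invalid-escape warning.

```python
import io
import tokenize


def string_tokens(source: bytes):
    """Return the STRING tokens that tokenize.tokenize() yields for source.

    string_tokens is a hypothetical helper, not part of the stdlib.
    """
    # tokenize.tokenize() wants a readline callable that returns bytes.
    readline = io.BytesIO(source).readline
    return [tok for tok in tokenize.tokenize(readline)
            if tok.type == tokenize.STRING]


# Raw bytes literal: the source really contains backslash, 'u', '1234'.
tokens = string_tokens(rb'x = "\u1234"' + b"\n")
for tok in tokens:
    # tok.string is a str holding the literal's source text, quotes and
    # backslash included; the escape has not been interpreted.
    print(type(tok.string).__name__, repr(tok.string))
```

The token text is decoded from bytes to str using the detected source encoding, which is why it comes back as a string, but the escape sequence inside the literal is left exactly as written.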
> Maybe I'm making a parade of my ignorance, but I assumed that string
> literals were parsed by the parser--just like everything else is parsed
> by the parser, hey it seems like a good place for it--and in particular
> that the escape sequence substitutions would be done in the tokenizer.
> Having stared at it a little, I now detect a whiff of "this design
> solved a real problem". So... what was the problem, and how does this
> design solve it?
I assume the intent is to not throw away any information in the lexer,
and give the parser full access to the original string. But that's just
a guess.
> BTW, my use case is that I hoped to use CPython's tokenizer to parse
> some Python-ish-looking text and handle double-quoted strings for me.
> *Especially* all the escape sequences--leveraging all CPython's support
> for funny things like \U{penguin}. The current behavior of the
> tokenizer makes me think it'd be easier to roll my own!
Can you feed the token text to the ast?
>>> ast.literal_eval(r'"\u1234"')
'ሴ'
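Putting the two pieces together, a rough sketch of that idea is below: tokenize the text, then hand each STRING token's source text to ast.literal_eval. The helper name decode_string_tokens is hypothetical. It assumes plain string literals only; f-strings are not literals that literal_eval accepts, and bytes literals would come back as bytes.

```python
import ast
import io
import tokenize


def decode_string_tokens(source: bytes):
    """Tokenize source and return the evaluated value of each STRING token.

    decode_string_tokens is a hypothetical helper, not a stdlib function.
    """
    readline = io.BytesIO(source).readline
    return [
        # literal_eval parses the token text as a Python literal, so all
        # the usual escape sequences (\n, \uXXXX, \N{...}) are applied.
        ast.literal_eval(tok.string)
        for tok in tokenize.tokenize(readline)
        if tok.type == tokenize.STRING
    ]


# Two adjacent literals produce two separate STRING tokens.
values = decode_string_tokens(rb'x = "\u1234" "\N{PENGUIN}"' + b"\n")
print(values)
```

At the interactive prompt the argument to literal_eval needs to be a raw string (or have its backslash doubled) for literal_eval to see the escape unprocessed; the token text coming out of tokenize already has that form.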
Eric