[Python-Dev] PEP 460 reboot
Terry Reedy
tjreedy at udel.edu
Tue Jan 14 22:55:44 CET 2014
Let me answer you both since the issues are related.
On 1/14/2014 7:46 AM, Nick Coghlan wrote:
>> Guido van Rossum writes:
>> > And that is precisely my point. When you're using a format string,
Bytes interpolation uses a bytes format, or a byte string if you will,
but it should not be thought of as a character or text string. Certain
bytes (123 and 125) delimit a replacement field. The bytes in between
define, in my version, a format-spec; they are ascii-decoded to text
for input to 3.x format(). The decoding and subsequent re-encoding would
not be needed if 2.7-style format(ob, byte_spec) were available.
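To make the decode/format/re-encode round trip concrete, here is a
minimal sketch. (apply_spec is a hypothetical helper name of my own, and
the ascii-only restriction on the formatted result is my assumption, not
part of the proposal.)

```python
def apply_spec(value, spec_bytes):
    # Hypothetical helper: decode a replacement field's inner bytes to
    # text, hand them to 3.x format() as the format spec, then encode
    # the result back to bytes (assuming an ascii-only result).
    spec = spec_bytes.decode('ascii')            # e.g. b'>6' -> '>6'
    return format(value, spec).encode('ascii')

apply_spec(42, b'>6')     # b'    42'
apply_spec(3.5, b'.1f')   # b'3.5'
```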
>> > all of the format string (not just the part between { and }) had
>> > better use ASCII or an ASCII superset.
I am not even sure what you mean here. The bytes outside of the 123 ...
125 delimiter pairs are simply copied to the output bytes. There is no
encoding or interpretation involved.
It is true that the uninterpreted bytes had best not contain a byte
pattern mistakenly recognized as a replacement field. I plan to refine
the regular expression byte pattern used in byteformat to sharply reduce
the possibility of such errors. When such errors happen anyway, an
exception should be raised, and I plan to expand the error message to
give more diagnostic information.
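The kind of diagnostic involved might look like the following sketch.
(The function name and the exact balancing rule are my own illustration,
not the proposed byteformat code.)

```python
def check_fields(fmt):
    # Illustrative check: scan a bytes format for the metabytes 123
    # (b'{') and 125 (b'}') and raise with a byte offset when a
    # delimiter is unbalanced, to help diagnose stray patterns.
    depth = 0
    for i, b in enumerate(fmt):
        if b == 123:
            depth += 1
        elif b == 125:
            depth -= 1
        if depth not in (0, 1):
            raise ValueError(
                'unbalanced replacement-field delimiter at byte %d' % i)
    if depth:
        raise ValueError('unterminated replacement field')

check_fields(bytes([1, 123, 125, 200]))   # balanced: no error
```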
>> And this (rightly) constrains the output to an ASCII superset as well.
What does this mean? I suspect I disagree. The bytes interpolated into
the output bytes can be any bytes.
>> Except that if you interpolate something like Shift JIS,
Bytes interpolation interpolates bytes, not encodings. A
self-identifying byte stream starts with bytes in a known encoding that
specifies the encoding of the rest of the stream. Neither part need be
encoded text. (Would that something like this were standard for encoded
text streams, as well as for serialized images.)
>> [snip]
> Right, that's the danger I was worried about, but the problem is that
> there's at least *some* minimum level of ASCII compatibility that
> needs to be assumed in order to define an interpolation format at all
> (this is the point I originally missed).
I would put this slightly differently. To process bytes, we may define
certain bytes as metabytes with a special meaning. We may choose the
bytes that happen to be the ascii encoding of certain characters. But
once the special numbers are chosen, they are numbers, not characters.
The problem of metabytes having both a normal and special meaning is
similar to the problem of metacharacters having both a normal and
special meaning.
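The text side of that analogy is str.format's brace doubling, which
restores a metacharacter's normal meaning; a bytes interpolation would
presumably need an analogous escape for the metabytes 123 and 125.

```python
# In text format strings, doubling a brace escapes it, giving the
# metacharacter back its normal meaning:
'{{literal}} {}'.format('x')   # '{literal} x'
```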
> For printf-style formatting,
> it's % along with the various formatting characters and other syntax
> (like digits, parentheses, variable names and "."), with the format
> method it's braces, brackets, colons, variable names, etc.
It is the bytes corresponding to these characters. This is true also of
the metabytes in an re module bytes pattern.
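For example, in an re module bytes pattern the metacharacters are really
metabytes: matching works on byte values, and the surrounding bytes need
not be encoded text at all.

```python
import re

# The pattern's metabytes select byte values 0x30-0x39 ('0'-'9');
# the non-ascii bytes around the matches are just uninterpreted numbers.
re.findall(rb'[0-9]+', b'\x01abc123\xff45')   # [b'123', b'45']
```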
> The mini-language parser has to assume an encoding
> in order to interpret the format string,
This is where I disagree with you and Guido. Bytes processing is done
with numbers 0 <= n <= 255, not characters. The fact that ascii
characters can, for convenience, be used in bytes literals to indicate
the corresponding ascii codes does not change this. A bytes parser looks
for certain special numbers. Other numbers need not be given any
interpretation and need not represent encoded characters.
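Python 3's bytes type makes this explicit: indexing and iteration yield
small integers, so a bytes parser compares numbers, not characters.

```python
data = bytes([200, 123, 125, 10])
# Indexing a 3.x bytes object yields ints in range(256), not
# 1-character strings:
data[1]                                        # 123
# A parser scans for the special numbers directly:
[i for i, b in enumerate(data) if b == 123]    # [1]
```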
> and that's *all* done assuming an ASCII compatible format string
Since any bytes can be regarded as an ascii-compatible latin-1
encoded string, that seems like a vacuous assumption. In any case, I do
not see any particular assumption in the following, other than the
choice of replacement field delimiters.
>>> list(byteformat(bytes([1, 2, 10, 123, 125, 200]),
...                 (bytes([50, 100, 150]),)))
[1, 2, 10, 50, 100, 150, 200]
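A minimal sketch consistent with that example follows. (This is my own
illustration of the simplest no-format-spec case, not the actual
byteformat implementation.)

```python
def byteformat(fmt, args):
    # Sketch: replace each 123-125 (b'{}') field in fmt with the next
    # bytes object from args; all other bytes are copied unchanged.
    out = bytearray()
    it = iter(args)
    i = 0
    while i < len(fmt):
        if fmt[i] == 123 and i + 1 < len(fmt) and fmt[i + 1] == 125:
            out += next(it)        # interpolate the argument's bytes
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return bytes(out)

list(byteformat(bytes([1, 2, 10, 123, 125, 200]),
                (bytes([50, 100, 150]),)))
# [1, 2, 10, 50, 100, 150, 200]
```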
> (which must make life interesting if you try to use an
> ASCII incompatible coding cookie for your source code - I'm actually
> not sure what the full implications of that *are* for bytes literals
> in Python 3).
An interesting and important question. The Python 2 manual says that the
coding cookie applies only to comments and strings. To me, this
suggests that any encoding can be used. I am not sure how and when the
encoding is applied. It suggests that the sequence of bytes resulting
from a string literal is not determined solely by the sequence of
characters comprising the string literal, but also depends on the coding
cookie.
The Python 3 manual says that the coding cookie applies to the whole
source file. To me, this says that the subset of unicode chars included
in the encoding *must* include the ascii characters. It also suggests to
me that the encoding must be ascii-compatible, in order to read the
encoding name in the ascii-text coding cookie (unless there is a
fallback to the system encoding).
In any case, a 3.x source file is decoded to unicode. When the sequence
of unicode chars comprising a bytes literal is interpreted, the
resulting sequence of bytes depends only on the literal and not the file
encoding. So list(b'{}'), for instance, should always be [123, 125] in
3.x. My comments above about byte processing assume that this is so.
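That invariant is easy to check:

```python
# The byte values produced by a bytes literal are fixed by the literal
# itself, independent of any source-file coding cookie:
list(b'{}')                     # [123, 125]
b'{}' == bytes([0x7B, 0x7D])    # True
```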
--
Terry Jan Reedy
More information about the Python-Dev mailing list