[Python-Dev] Format strings, Unicode, and Py2.7: need clarification

Steven D'Aprano steve at pearwood.info
Wed May 17 20:41:12 EDT 2017

On Wed, May 17, 2017 at 02:41:29PM -0700, Craig Rodrigues wrote:

> e = "{}".format(u"hi")
[...]
> type(e) == str

> The confusion for me is why is type(e) of type str, and not unicode?

I think that's one of the reasons why the Python 2.7 string model is (1) 
convenient to those using purely ASCII, but (2) ultimately broken.

You can see why it's broken if you do this:

py> "{}".format(u"hiµ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 2: ordinal not in range(128)

So it tries to encode the Unicode string to ASCII, and if that succeeds, 
format returns a byte str. I'm not sure if that was a deliberate design 
choice for format, or just a side-effect of it calling str() on its 
arguments by default.
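
To make that implicit step visible, here is a minimal sketch that does the 
ASCII encoding by hand. The helper name encode_like_py27_format is made up 
for illustration, and the explicit .encode("ascii") is only my stand-in for 
whatever format does internally in 2.7; I'm not claiming it is the actual 
code path.

```python
def encode_like_py27_format(value):
    # Assumption: an explicit ASCII encode stands in for the implicit
    # conversion that 2.7 appears to apply; this is not format's real code.
    return value.encode("ascii")


# Pure-ASCII text encodes cleanly and comes back as a byte string.
ok = encode_like_py27_format(u"hi")

# Text containing the micro sign (u"\xb5") cannot be encoded to ASCII.
try:
    encode_like_py27_format(u"hi\xb5")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start  # index of the first character that failed
```

The failure index lines up with the "position 2" in the traceback above.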

I'm not sure if I've answered your question or not. Are you looking for 
justification of this misfeature, or an explanation of the historical 
reasons why it exists, or something else?

(If you're looking for the same behaviour in Python 3 and 2.7, probably 
the best thing you can do is religiously use unicode strings u'' in 
both; Python 3.3 and later accept the u'' prefix. You might also try:

from __future__ import unicode_literals

in 2.7, but I'm not sure that's enough.)
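
As a rough sketch of the "use u'' everywhere" approach: when the format 
string itself is a unicode literal, the result stays a text string and no 
ASCII encoding is attempted. The name text_type below is just a local 
convenience I've introduced, not anything from the standard library.

```python
# unicode on Python 2.7, str on Python 3.3+
text_type = type(u"")

# Both the template and the argument are text, so nothing needs to be
# squeezed through the ASCII codec.
result = u"{}".format(u"hi\xb5")
```

Note that unicode_literals only changes string literals in the module that 
imports it. Byte strings arriving from other code are unaffected, which may 
be one reason the import alone is not enough.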

-- 
Steve
