[Python-ideas] PEP 540: Add a new UTF-8 mode
Petr Viktorin
encukou at gmail.com
Wed Jan 11 06:22:57 EST 2017
On 01/11/2017 11:46 AM, Stephan Houben wrote:
> Hi INADA Naoki,
>
> (Sorry, I am unsure if INADA or Naoki is your first name...)
>
> While I am very much in favour of everything working "out of the box",
> an issue is that we don't have control over external code
> (be it Python extensions or external processes invoked from Python).
>
> And that code will only look at LANG/LC_TYPE and ignore any cleverness
> we build into Python.
>
> For example, this may mean that a built-in Python string sort will give you
> a different ordering than invoking the external "sort" command.
> I have been bitten by this kind of issues, leading to spurious "diffs" if
> you try to use sorting to put strings into a canonical order.

AFAIK, this would not be a problem under PEP 538, which effectively
treats the "C" locale as "C.UTF-8". Strings of Unicode codepoints and
the corresponding UTF-8-encoded bytes sort the same way.

Is that wrong, or do you have a better example of trouble with using
"C.UTF-8" instead of "C"?

> So my feeling is that people are ultimately not being helped by
> Python trying to be "nice", since they will be bitten by locale issues
> anyway. IMHO ultimately better to educate them to configure the locale.
> (I realise that people may reasonably disagree with this assessment ;-) )
>
> I would then recommend to set to en_US.UTF-8, which is slower and
> less elegant but at least more widely supported.

What about the spurious diffs you'd get when switching from "C" to
"en_US.UTF-8"?

$ LC_ALL=en_US.UTF-8 sort file.txt
a
a
A
A
$ LC_ALL=C sort file.txt
A
A
a
a

> By the way, I know a bit how Node.js deals with locales, and it doesn't try
> to compensate for "C" locales either. But what it *does* do is that
> Node never uses the locale settings to determine the encoding of a file:
> you either have to specify it explicitly OR it defaults to UTF-8 (the
> latter on output only).
> So in this respect it is by specification immune against
> misconfiguration of the encoding.
> However, other stuff (e.g. date formatting) will still be influenced by
> the "C" locale
> as usual.

I believe the main problem is that the "C" locale really means two very
different things:

a) Text is encoded as 7-bit ASCII; higher codepoints are an error
b) No encoding was specified

In both cases, treating "C" as "C.UTF-8" is not bad:

a) For 7-bit "text", there's no real difference between these locales
b) UTF-8 is a much better default than ASCII
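
As a rough illustration of two claims above (that code point order and
UTF-8 byte order agree, and that 7-bit text is byte-identical under ASCII
and UTF-8), here is a minimal sketch. The sample strings and variable
names are made up for the example; it only exercises Python's own str and
bytes comparison, not the C library's locale-aware collation.

```python
# Hypothetical sample data: ASCII, Latin-1, BMP and non-BMP code points.
samples = ["b", "a", "A", "\u00e9", "z", "\u20ac", "\U0001F600", "Z", "\u00e4"]

# Python compares str objects code point by code point ...
by_codepoint = sorted(samples)

# ... and bytes objects byte by byte. UTF-8 is designed so that both
# orderings agree.
by_utf8_bytes = [
    raw.decode("utf-8")
    for raw in sorted(s.encode("utf-8") for s in samples)
]
same_order = by_codepoint == by_utf8_bytes

# Case (a): pure 7-bit text encodes to the same bytes either way.
ascii_text = "plain 7-bit text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Under a strict ASCII interpretation, higher code points are an error.
try:
    "\u00e9".encode("ascii")
    ascii_rejects_non_ascii = False
except UnicodeEncodeError:
    ascii_rejects_non_ascii = True

print(same_order, same_bytes, ascii_rejects_non_ascii)
```

All three flags should come out true. Note that this says nothing about
"en_US.UTF-8", whose collation rules are what produce the different
ordering in the sort example.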