[Python-Dev] Python3 "complexity" - 2 use cases
Jim J. Jewett
jimjjewett at gmail.com
Fri Jan 10 22:23:18 CET 2014
More information about the Python-Dev mailing list
Fri Jan 10 22:23:18 CET 2014
- Previous message: [Python-Dev] Python3 "complexity"
- Next message: [Python-Dev] Python3 "complexity" - 2 use cases
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> Steven D'Aprano wrote: >> I think that heuristics to guess the encoding have their role to play, >> if the caller understands the risks. Ben Finney wrote: > In my opinion, content-type guessing heuristics certainly don't belong > in the standard library. It would be great if there were never any need to guess. But in the real world, there is -- and often the user won't know any more than python does. So when it is time to guess, a source of good guesses is an important battery to include. The HTML5 specifications go through some fairly extreme contortions to document what browsers actually do, as opposed to what previous standards have mandated. They don't currently specify how to guess (though I think a draft once tried, since the major browsers all do it, and at the time did it similarly), but the specs do explicitly support such a step, and do provide an implementation note encouraging user-agents to do at least minimal auto-detection. http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding My own opinion is therefore that Python SHOULD provide better support for both of the following use cases: (1) Treat this file like it came from the web -- including autodetection and even overriding explicit charset declarations for certain charsets. We should explicitly treat autodetection like time zone data -- there is no promise that the "right answer" (or at least the "best guess") won't change, even within a release. I offer no opinion on whether chardet in particular is still too volatile, but the docs should warn that the API is driven by possibly changing external data. (2) Treat this file as "ASCII+", where anything non-ASCII will (at most) be written back out unchanged; it doesn't even need to be converted to text. At this time, I don't know whether the right answer is making it easy to default to surrogate-escape for all error-handling, adding more bytes methods, encouraging use of python's latin-1 variant, offering a dedicated (new?) codec, or some new suggestion. I do know that this use case is important, and that python 3 currently looks clumsy compared to python 2. -jJ -- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ
- Previous message: [Python-Dev] Python3 "complexity"
- Next message: [Python-Dev] Python3 "complexity" - 2 use cases
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Python-Dev mailing list