[Python-Dev] Bytes path support
Isaac Morland
ijmorlan at uwaterloo.ca
Sat Aug 23 11:27:54 CEST 2014
On Sat, 23 Aug 2014, Marko Rauhamaa wrote:

> "Stephen J. Turnbull" <stephen at xemacs.org>:
>
>> Just read as bytes and decode piecewise in one way or another. For
>> Oleg's HTML case, there's a well-understood structure that can be used
>> to determine retry points
>
> HTML and XML are interesting examples since their encoding is initially
> unknown:
>
>    <?xml version="1.0"?>
>                        ^
>                        +--- Now I know it is UTF-8
>
>    <?xml version="1.0" encoding="UTF-16"?>
>                                          ^
>                                          +--- Now I know it was UTF-16
>                                               all along!
>
> Then we have:
>
>    HTTP/1.1 200 OK
>    Content-Type: text/html; charset=ISO-8859-1
>
>    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
>    <html>
>    <head>
>    <meta http-equiv="Content-Type" content="text/html; charset=utf-16">
>
> See how deep you have to parse the TCP stream before you realize the
> content encoding is UTF-16.

For HTML it's not quite so bad. According to the HTML 4 standard:

http://www.w3.org/TR/html4/charset.html

The Content-Type header takes precedence over a <meta> element. I thought
I read once that the reason was to allow proxy servers to transcode
documents but I don't have a cite for that. Also, the <meta> element "must
only be used when the character encoding is organized such that
ASCII-valued bytes stand for ASCII characters" so the initial UTF-16
example wouldn't be conformant in HTML.

In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF (byte
order mark) is used:

http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration

Not sure about XML.

Of course this whole area is a bit of an "arms race" between programmers
competing to get away with being as sloppy as possible and other
programmers who have to deal with their mess.

Isaac Morland			CSCF Web Guru
DC 2554C, x36650		WWW Software Specialist
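
To make the HTML 4 precedence rule concrete, here is a rough sketch of how a client might pick an encoding: the charset in the HTTP Content-Type header wins, and a <meta> declaration is only consulted when the header gives none. The function name `sniff_html_encoding`, the byte order mark check placed between the two, the 1024-byte scan window, and the ISO-8859-1 fallback are all assumptions made for illustration; this is not what any particular browser or library does, and a regex is not a real HTML parser.

```python
import re

# Hypothetical helpers for illustration only; not part of any library.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)',
                             re.IGNORECASE)
_META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)',
                           re.IGNORECASE)


def sniff_html_encoding(content_type, body, default="iso-8859-1"):
    """Guess the encoding of an HTML response body (bytes).

    content_type is the raw Content-Type header value, or None.
    """
    # 1. The HTTP header takes precedence over anything in the document.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()

    # 2. Assumption: honour a byte order mark before looking at <meta>.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"

    # 3. Scan the start of the document as bytes. This only works for
    #    ASCII-compatible encodings, which is why a <meta> declaring
    #    UTF-16 cannot be found this way in a real UTF-16 document.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()

    # 4. Assumed fallback when nothing is declared.
    return default


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-16"></head></html>')
print(sniff_html_encoding("text/html; charset=ISO-8859-1", page))
print(sniff_html_encoding("text/html", page))
```

With Marko's example response, the first call follows the header and ignores the conflicting <meta>; the second call has no charset in the header, so it falls through to the <meta> value.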