[Python-Dev] Bytes path related questions for Guido
MRAB
python at mrabarnett.plus.com
Thu Aug 28 09:30:39 CEST 2014
On 2014-08-28 05:56, Glenn Linderman wrote:
> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>> Glenn Linderman writes:
>>  > On 8/26/2014 4:31 AM, MRAB wrote:
>>  > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>  > >> Nick Coghlan writes:
>>
>>  > > How about:
>>  > >
>>  > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>>  > >
>>  > > If you want them removed, just pass an empty string as the
>>  > > replacement.
>>
>> That seems better to me (I had too much C for breakfast, I think).
>>
>>  > And further, replacement could be a vector of 128 characters, to do
>>  > immediate transcoding,
>>
>> Using what encoding?
>
> The vector would contain the transcoding. Each lone surrogate would map
> to a character in the vector.
>
>> If you knew that much, why didn't you use
>> (write, if necessary) an appropriate codec? I can't envision this
>> being useful.
>
> If the data format describes its encoding, possibly containing data from
> several encodings in various spots, then perhaps it is best read as
> binary, and processed as binary until those definitions are found.
>
> But an alternative would be to read with surrogate escapes, and then
> when the encoding is determined, to transcode the data. Previously, a
> proposal was made to reverse the surrogate escapes to the original
> bytes, and then apply the (now known) appropriate codec. There are not
> appropriate codecs that can convert directly from surrogate escapes to
> the desired end result. This technique could be used instead, for
> single-byte, non-escaped encodings. On the other hand, writing specialty
> codecs for the purpose would be more general.
>
There'll be a surrogate escape if a byte couldn't be decoded, but just
because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong
too.
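
To make the point concrete, here is a rough sketch in Python. The function replace_surrogate_escapes() is only the name proposed in this thread, not an existing stdlib function, and the body below is one guess at how it might behave. The sample bytes and the cp1252/UTF-8 pairing are assumptions chosen for illustration.

```python
def replace_surrogate_escapes(s, replacement='\ufffd'):
    """Hypothetical helper (name taken from the thread, not from the stdlib).

    The 'surrogateescape' error handler maps each undecodable byte
    0x80..0xFF to a lone surrogate U+DC80..U+DCFF; swap those out.
    """
    return ''.join(replacement if '\udc80' <= ch <= '\udcff' else ch
                   for ch in s)


# The data is really UTF-8, but we guess cp1252 (assumed scenario).
raw = 'Á'.encode('utf-8')                        # b'\xc3\x81'
guessed = raw.decode('cp1252', 'surrogateescape')

# Byte 0x81 is undefined in cp1252, so it becomes a surrogate escape.
# Byte 0xC3 decodes "successfully" to 'Ã', which is simply wrong.
cleaned = replace_surrogate_escapes(guessed)     # 'Ã' is still there
removed = replace_surrogate_escapes(guessed, '')

# Reversing the escapes back to the original bytes and decoding with
# the right codec repairs every code point, not only the escaped ones.
recovered = guessed.encode('cp1252', 'surrogateescape').decode('utf-8')
```

Replacing only the escapes leaves the mis-decoded 'Ã' in place, whereas going back through the original bytes recovers the intended text. The round trip works here only because cp1252 is a single-byte encoding and the string was not modified in between.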