[Python-ideas] Py3k invalid unicode idea
Dillon Collins
dillonco at comcast.net
Tue Oct 7 14:07:29 CEST 2008
More information about the Python-ideas mailing list
I don't know an awful lot about unicode, so rather than clog up the already lengthy threads on the 3k list, I figured I'd just toss this idea out over here.

As I understand it, there is a fairly major issue regarding malformed unicode data being passed to python, particularly on startup and for filenames. This has led to much discussion and ultimately the decision (?) to mirror a variety of OS functions to work with both bytes and unicode. Obviously this puts us on a slope of questionable friction to reverting back to 2.x where unicode wasn't "core".

My thought is this: When passed invalid unicode, keep it invalid. This is largely similar to the UTF-8b ideas that were being tossed around, but a tad different. The idea would be to maintain invalid byte sequences by use of the private use area in the unicode spec, but be explicit about this conversion to the program.

In particular, I'm suggesting the addition of the following (I'll use "surrogate" to refer to the invalid bytes in a unicode string):

1) Encoding 'raw'. Decoding with it forces all bytes to be converted to surrogate values. Encoding to raw converts the surrogates back to bytes, and gives an error on valid unicode characters(!). This would enable applications to effectively interface with the system using bytes (by setting default encoding or the like), but not require any API changes to actually support the bytes type.

2) Error handler 'force' (or whatever). For decoding, when an invalid byte is encountered, replace it with a surrogate. For encoding, write the invalid byte back out.

2a) Decoding invalid unicode or encoding a string with surrogates raises a UnicodeError (unless handler 'force' is specified or encoding is 'raw').

3) String method 'valid'. 'valid()' would return False if the string contains at least one surrogate and True otherwise. This would allow programs to check if the string is correct, and handle it if not.
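To make the three pieces above concrete, here is a minimal sketch rather than a proposed implementation. It assumes UTF-8 as the only real encoding and assumes one private-use code point per byte value starting at U+F700. That range, and every function name below, is hypothetical and not part of the proposal. The decode side uses the existing codecs.register_error hook. The encode side has to be a plain function, because private-use characters are themselves encodable and so never trigger an encode error handler.

```python
import codecs

# Hypothetical choice: 256 private-use code points, one per byte value.
PUA_BASE = 0xF700


def _force_decode(exc):
    """Decode-side 'force' handler: turn each undecodable byte into a surrogate."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


# 'force' is the handler name used in the post; registering it is an assumption.
codecs.register_error("force", _force_decode)


def is_surrogate(ch):
    """True if ch stands in for an invalid byte (in the post's sense)."""
    return PUA_BASE <= ord(ch) < PUA_BASE + 256


def valid(text):
    """Point 3: False if the string contains at least one surrogate."""
    return not any(is_surrogate(ch) for ch in text)


def decode_force(data):
    """Point 2, decoding: invalid bytes become surrogates instead of errors."""
    return data.decode("utf-8", "force")


def encode_force(text):
    """Point 2, encoding: surrogates are written back as their original bytes."""
    out = bytearray()
    for ch in text:
        if is_surrogate(ch):
            out.append(ord(ch) - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def encode_strict(text):
    """Point 2a: without 'force', a string carrying surrogates is an error."""
    if not valid(text):
        raise UnicodeError("string contains escaped invalid bytes")
    return text.encode("utf-8")


def decode_raw(data):
    """Point 1: every byte becomes a surrogate, valid or not."""
    return "".join(chr(PUA_BASE + b) for b in data)


def encode_raw(text):
    """Point 1: surrogates become bytes; ordinary characters are an error."""
    if not all(is_surrogate(ch) for ch in text):
        raise UnicodeError("'raw' can only encode surrogates")
    return bytes(ord(ch) - PUA_BASE for ch in text)


name = decode_force(b"abc\xff.txt")  # "abc" + U+F7FF + ".txt"
```

With these definitions, valid(name) is False, encode_force(name) gives back the original bytes, and encode_strict(name) raises. One gap the sketch leaves open: input that legitimately contains U+F700 through U+F7FF is indistinguishable from escaped bytes, which is exactly the storage question any real design would have to settle.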
This would be of particular value when reading boot information like sys.argv as that would use the 'force' error handler in order to prevent boot failure.

How the invalid bytes would be stored internally is certainly a matter of hot debate on the 3k list. As I mentioned before, I am not intimately familiar with unicode, so I don't have much to suggest. If I had to implement it myself now, I'd probably use a piece of the private use area as an escape (much like '\\' does).

Finally, there seems to be much concern about internal invalid unicode wreaking havoc when tossed to external programs/libraries. I have to say that I don't really see what the problem is, because whenever python writes unicode, oughtn't it be buffered by "encode"? In that case you'd either get an error or would be explicitly allowing invalid strings (via 'raw' or 'force'). And besides, if python has to deal with bad unicode, these libraries should have to too ;).

Even more finally, let me apologize in advance if I missed something on another list or this is otherwise too redundant.
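The escape idea mentioned above (one private-use character acting like '\\') could look roughly like the following. This is only a sketch under stated assumptions: UTF-8 input, U+F8FF as the escape character, and the function names decode_escaped and encode_escaped are all hypothetical. An invalid byte is stored as the escape followed by the character whose code point equals the byte, and a genuine escape character in valid input is doubled, so the mapping stays reversible.

```python
ESC = "\uf8ff"  # hypothetical escape character taken from the private use area


def decode_escaped(data):
    """Decode UTF-8, storing each invalid byte as ESC + chr(byte)."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            # The rest is valid: double any literal ESC and finish.
            out.append(data[pos:].decode("utf-8").replace(ESC, ESC + ESC))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into data[pos:].
            good = data[pos:pos + exc.start].decode("utf-8")
            out.append(good.replace(ESC, ESC + ESC))
            bad = data[pos + exc.start:pos + exc.end]
            out.extend(ESC + chr(b) for b in bad)
            pos += exc.end
    return "".join(out)


def encode_escaped(text):
    """Inverse of decode_escaped: write escaped bytes back out unchanged."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESC:
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise UnicodeError("dangling escape character")
        nxt = text[i + 1]
        if nxt == ESC:
            out += ESC.encode("utf-8")  # doubled escape: a literal U+F8FF
        elif ord(nxt) < 256:
            out.append(ord(nxt))  # escaped invalid byte
        else:
            raise UnicodeError("malformed escape sequence")
        i += 2
    return bytes(out)


# An invalid byte (0xE9) followed later by a genuine U+F8FF in the input.
data = b"caf\xe9 " + ESC.encode("utf-8")
text = decode_escaped(data)
```

Here encode_escaped(text) returns the original bytes, including the genuine U+F8FF. The cost is that escaped strings no longer have one character per original byte or character, so slicing in the middle of an escape pair can produce a string that fails to encode.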