Flexible string representation, unicode, typography, ...
wxjmfauth at gmail.com
Thu Aug 23 08:47:29 EDT 2012
More information about the Python-list mailing list
This is neither a complaint nor a question, just a comment.

In the previous discussion related to the flexible string representation, Roy Smith added this comment:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42

Not only do I agree with his sentence:

"Clearly, the world has moved to a 32-bit character set."

but he also used a very interesting word in his comment: "punctuation".

There is a point which is, in my mind, not very well understood or "digested", and which is underestimated or neglected by many developers: the relation between the coding of the characters and typography.

Unicode (the consortium) does not only deal with the coding of the characters; it has also worked on the *classification* of the characters.

A deliberately simplistic representation: "letters" at the bottom of the table, with low code points/integers; "typographic characters" like punctuation, common symbols, ... high in the table, with high code points/integers.

The conclusion is inescapable: if one wishes to work in a "unicode mode", one is forced to use the whole palette of the unicode code points. This is the *nature* of Unicode.

Technically, believing that it is possible to optimize only a subrange of the unicode code point range is simply an illusion. It is a lot of work, probably quite complicated, which in the end solves nothing.

Python, in my mind, fell into this trap.

"Simple is better than complex."
    -> hard to maintain
"Flat is better than nested."
    -> code points range
"Special cases aren't special enough to break the rules."
    -> special unicode code points?
"Although practicality beats purity."
    -> or the opposite?
"In the face of ambiguity, refuse the temptation to guess."
    -> guessing a user will only work with the "optimized" char subrange.
...

Small illustration. Take an A4 page containing 50 lines of 80 ascii characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000), and you will see all the optimization efforts destroyed.

>>> import sys
>>> sys.getsizeof('a' * 80 * 50)
4025
>>> sys.getsizeof('a' * 80 * 50 + '•')
8040

Just my 2 € (code point 0x20ac) cents.

jmf
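
The size jump in the illustration above can be examined a little more closely with a rough sketch. It assumes CPython 3.3 or later, where the flexible string representation (PEP 393) stores every character of a string in 1, 2 or 4 bytes, depending on the highest code point present. The helper names expected_width and measured_width are made up for this example and are not part of any library, and the absolute sizes printed will differ between builds and platforms.

```python
import sys
import unicodedata


def expected_width(text):
    """Bytes per character implied by the highest code point in text."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:      # ASCII and Latin-1
        return 1
    if highest < 0x10000:    # rest of the Basic Multilingual Plane
        return 2
    return 4                 # supplementary planes


def measured_width(text, extra=1000):
    """Estimate bytes per character from sys.getsizeof (CPython-specific)."""
    # Padding with 'a' cannot raise the highest code point above the
    # 1-byte range, so the storage width of the string stays the same
    # and the size difference is purely per-character cost.
    grown = text + "a" * extra
    return (sys.getsizeof(grown) - sys.getsizeof(text)) / extra


page = "a" * 80 * 50  # 50 lines of 80 ascii characters

for extra_char in ("", "\u2014", "\u2022", "\u20ac"):
    text = page + extra_char
    if extra_char:
        label = unicodedata.name(extra_char)
        category = unicodedata.category(extra_char)
    else:
        label, category = "(ascii only)", "-"
    print(
        f"{label:12} category={category:2} "
        f"expected={expected_width(text)} "
        f"measured={measured_width(text)} "
        f"size={sys.getsizeof(text)}"
    )
```

On such a build, the ascii-only page should report a width of 1 byte per character, while the same page with a single EM DASH, BULLET or EURO SIGN appended should report 2, which is the doubling shown in the session above.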