[Python-Dev] When should pathlib stop being provisional?
Chris Angelico
rosuav at gmail.com
Wed Apr 6 02:25:05 EDT 2016
On Wed, Apr 6, 2016 at 3:37 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Chris Angelico writes:
>
>  > Outside of deliberate tests, we don't create files on our disks
>  > whose names are strings of random bytes;
>
> Wishful thinking.  First, names made of control characters have often
> been deliberately used by miscreants to conceal their warez.  Second,
> in some systems it's all too easy to create paths with components in
> different locales (the place I've seen it most frequently is in NFS
> mounts).  I think that's much less true today, but perhaps that's only
> because my employer figured out that it was much less pain if system
> paths were pure ASCII so that it mostly didn't matter what encoding
> users chose for their subtrees.

Control characters are still characters, though. You can take a
bytestring consisting of byte values less than 32, decode it as UTF-8,
and have a series of codepoints to work with.

If your employer has "solved" the problem by restricting system paths
to ASCII, that's a fine solution for a single system with a single
ASCII-compatible encoding; a better solution is to mandate UTF-8 as
the file system encoding, as that's what most people are expecting
anyway.

> It remains important to be able to handle nearly arbitrary bytestrings
> in file names as far as I can see.  Please note that 100 million
> Japanese and 1 billion Chinese by and large still prefer their
> homegrown encodings (plural!!) to Unicode, while many systems are now
> defaulting filenames to UTF-8.  There's plenty of room remaining for
> copying bytestrings to arguments of open and friends.

Why exactly do they prefer these other encodings? Are they
representing characters that Unicode doesn't contain? If so, we have a
fundamental problem (no Python program is going to be able to cope
with these, without a third party library or some stupid mess of local
code); if not, you can always represent it as Unicode and encode it as
UTF-8 when it reaches the file system. Re-encoding is something that's
easy when you treat something as text, and impossible when you treat
it as bytes.

So far, you're still actually agreeing with me: paths are *text*, but
sometimes we don't know the encoding (and that's a problem to be
solved).

ChrisA
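
To make the points above concrete, here is a minimal Python 3 sketch. It is an illustration, not part of the original message: the sample file names and the choice of Shift_JIS are assumptions made for the example, and the last step uses the "surrogateescape" error handler (PEP 383), which is one existing way to carry undecodable bytes through text, not something the message itself proposes.

```python
# 1. Control characters are still characters: byte values below 32
#    are valid UTF-8 and decode to the same-numbered codepoints.
raw = bytes([1, 2, 27, 31])
text = raw.decode("utf-8")
codepoints = [ord(c) for c in text]

# 2. Re-encoding is easy once the name is text. The name and the
#    legacy encoding here are hypothetical sample data.
legacy_bytes = "日本語.txt".encode("shift_jis")
name = legacy_bytes.decode("shift_jis")   # bytes -> text (encoding known)
utf8_bytes = name.encode("utf-8")         # text -> bytes for the file system

# 3. When the encoding is not known, surrogateescape lets the bytes
#    round-trip through str without loss, even if they are not UTF-8.
unknown = b"caf\xe9.txt"                  # Latin-1 bytes, invalid as UTF-8
as_text = unknown.decode("utf-8", "surrogateescape")
round_trip = as_text.encode("utf-8", "surrogateescape")
```

Step 2 only works because the source encoding is known; step 3 preserves the bytes but does not tell you what characters they were meant to be, which is the "problem to be solved" part.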