Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?
Terry Reedy
tjreedy at udel.edu
Mon Oct 31 19:10:02 EDT 2011
On 10/31/2011 3:54 PM, python at bdurham.com wrote:
> Wondering if there's a fast/efficient built-in way to determine if a
> string has non-ASCII chars outside the range ASCII 32-127, CR, LF, or Tab?

I presume you also want to disallow the other ASCII control chars?

> I know I can look at the chars of a string individually and compare them
> against a set of legal chars using standard Python code (and this works

If, by 'string', you mean a string of bytes 0-255, then I would, in
Python 3, where bytes contain ints in [0,255], make a byte mask of 256
0s and 1s (not '0's and '1's).

Example:

mask = b'\1\0'*128
for c in b'\0\1help': print(mask[c])

1
0
1
0
1
1

In your case, use \1 for forbidden and replace the print with
"if mask[c]: <found illegal>; break"

In 2.x, where iterating byte strings gives length 1 byte strings, you
would need ord(c) as the index, which is much slower.

> fine), but I will be working with some very large files in the 100's Gb
> to several Tb size range so I'd thought I'd check to see if there was a
> built-in in C that might handle this type of check more efficiently.
> Does this sound like a use case for cython or pypy?

Cython should get close to C speed, especially with hints. Make sure you
compile something like the above as Py 3 code.

-- 
Terry Jan Reedy
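
The mask idea above can be sketched for the questioner's actual character set roughly as follows. This is only an illustration under stated assumptions: the allowed set is taken literally from the question (bytes 32-127 plus Tab, LF, CR), and the names ALLOWED, MASK, first_illegal_index, has_illegal, and file_has_illegal are made up for the example. The second function swaps in a different technique, bytes.translate with a delete argument, which does the scan in C instead of a Python-level loop; the file helper reads in chunks on the assumption that the files are too large to load whole.

```python
# Allowed bytes, taken literally from the question: 32-127 plus Tab, LF, CR.
# (Assumption: adjust the range if DEL, byte 127, should also be rejected.)
ALLOWED = bytes(range(32, 128)) + b"\t\n\r"

# 256-entry mask of ints: 1 marks a forbidden byte, 0 an allowed one.
MASK = bytes(0 if b in ALLOWED else 1 for b in range(256))


def first_illegal_index(data):
    """Return the index of the first forbidden byte in data, or -1 if none."""
    for i, c in enumerate(data):  # in Python 3, c is an int in [0, 255]
        if MASK[c]:
            return i
    return -1


def has_illegal(data):
    """Yes/no check: delete every allowed byte in C and see if any remain."""
    return bool(data.translate(None, ALLOWED))


def file_has_illegal(path, chunk_size=1 << 20):
    """Scan a file in fixed-size chunks so memory use stays bounded."""
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            if has_illegal(chunk):
                return True
    return False
```

The mask loop is useful when the position of the offending byte matters; the translate variant only answers whether one exists, but avoids per-byte interpreter overhead.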