How to use Unicode regexes?
Martin von Loewis
loewis at informatik.hu-berlin.de
Sat Jul 28 03:44:10 EDT 2001
More information about the Python-list mailing list
rhys tucker <rhystucker at rhystucker.fsnet.co.uk> writes:

> Could somebody show me how to do Unicode regexes? I'm trying to
> write a strings-like utility for windows - so I want to match ascii
> and unicode characters in a binary file. Do I need one regex pattern
> since ascii and Unicode are similar for ascii text characters or are
> 2 regex patterns needed since they are different byte sizes?

You can use exactly the same regular expression for both byte and
Unicode strings, but that does not seem to be your question.

It is not clear to me what exactly you are trying to achieve. What do
you mean by "unicode characters in a binary file"? In a binary file,
there are no characters, only bytes. You need to know which encoding
was used for the Unicode strings (UTF-8, UCS-2, ...) before you can
determine whether a certain Unicode string appears in a certain file.

> The documentation suggest that I need to use \w pattern to match
> Unicode and set UNICODE. I'm not sure what and how to set Unicode.

Where does it say that? \w is about "alphanumeric characters": the
documentation says that \w matches all characters that are marked as
alphanumeric in the Unicode character database if the UNICODE flag is
set. To match Unicode strings, you don't need \w at all:

>>> import re
>>> re.search(u"al", u"Hallo")
<SRE_Match object at 0x81db868>

This finds one Unicode string in another; there is no need for \w or
the UNICODE flag.

To specify the UNICODE flag, either pass re.UNICODE as the second
argument to re.compile, or start the pattern with the inline flag
(?u).

> This is what I've done so far - it matches (some ?) ascii characters
> but misses those unicode strings.

It seems that you really are looking for UCS-2 strings in the file.
The Unicode facilities in Python are then of no use to you: you need
to understand how the encoding works, and formulate a pattern based on
that.

Regards,
Martin
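
As a rough illustration of the closing advice (formulate the pattern from the encoding itself), here is a minimal sketch of a strings-like scan over raw bytes. It is not code from the original thread. It assumes modern Python 3, assumes the file stores its wide strings as little-endian UCS-2 (as Windows binaries commonly do), and only looks for characters in the printable ASCII range, where each character is one byte followed by a zero byte. The names MIN_LEN, ASCII_RUN, UCS2_LE_RUN and find_strings are hypothetical.

```python
import re

# Hypothetical threshold: shortest run of characters worth reporting.
MIN_LEN = 4

# A run of printable ASCII bytes (0x20 to 0x7e).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)

# The same characters as little-endian UCS-2: each printable byte is
# followed by a zero byte. This is an assumption about the file's
# encoding; big-endian data would need the zero byte first.
UCS2_LE_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_LEN)


def find_strings(data):
    """Return ASCII runs, then UCS-2 LE runs, found in a bytes object."""
    found = [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in UCS2_LE_RUN.finditer(data)]
    return found


# Small in-memory sample standing in for the contents of a binary file.
sample = b"\x01\x02hello\x00\xff" + "world!".encode("utf-16-le") + b"\x00\x03"
print(find_strings(sample))
```

Both patterns are byte patterns applied to a bytes object, so two patterns are needed here only because the two encodings lay the same characters out differently in the file. Matching characters outside the ASCII range would require a pattern built from their encoded byte values.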