regexp search question
Francis Avila
francisgavila at yahoo.com
Wed Oct 22 20:14:36 EDT 2003
More information about the Python-list mailing list
"Paul Rubin" <http://phr.cx@NOSPAM.invalid> wrote in message
news:7xsmlk97pt.fsf_-_ at ruckus.brouhaha.com...
> I have a string s, possibly megabytes in size, and two regexps, p and q.
>
> I want to find the first occurrence of q that occurs after the first
> occurrence of p.
>
> Is there a reasonable way to do it?
>
>    g1 = re.search(p, s)
>    g2 = re.search(q, s[g1.end():])
>    q_offset = g1.end() + g2.start()
>
> is not a reasonable way, since it copies a ton of data around
> (slicing an arbitrary sized chunk off s into a new temporary string).
>
> Most regexps libs I know of have a way to start the search at a
> specified offset.  Python's string.find and string.index methods
> have a similar optional arg.  But I don't see it described in the
> re module docs.
>
> Am I missing something?

Yes: you can specify an offset, but only in the search METHOD (of
compiled re objects), not in the module-level search function (with
that one, all you can do is slice the string yourself).

Alternative 1:
Instead of slicing the string, make a buffer object that references a
slice of the string (using the buffer() builtin).
NOTE: Don't do this!

Alternative 2:
Compile regular expression objects for p and q instead of calling
re.search. Since I don't know the implementation details of re, I don't
know if the start/end args to REOBJECT.search will copy the string or
use a buffer, so that may not be different from what you're doing.
However, compiling the re should be a bit faster if you do this search
more than once.

(NOTE: untested code!)

    import re

    p = re.compile(ppattern)
    q = re.compile(qpattern)
    matchp = p.search(somestring)
    pend = matchp.end()                   # where the match for p stops
    matchq = q.search(somestring, pend)   # start looking for q at pend
    qstart = matchq.start()

matchq.start() counts from the beginning of the whole string, not from
pend, so there is nothing to add to it:

    offset = matchq.start()

(If it counted from the starting position instead, you would need
matchq.pos + matchq.start(), i.e. matchp.end() + matchq.start().)

Alternative 3:
You could probably combine p and q into a single regexp specifying that
you match p, then q, with anything in between. Using groups (p is grp 1,
q is grp 2), get your offset with matchpq.start(2). Group positions are
also counted from the beginning of the whole string.

There are probably many other ways.

> Thanks.

No problem.
--
Francis Avila
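
A minimal sketch of Alternatives 2 and 3, written for Python 3 and only
lightly tested. The function names (find_q_after_p,
find_q_after_p_combined) and the group name q_part are made up for
illustration and are not part of the re module. The combined version
assumes that neither pattern starts with inline global flags such as
(?i) and that q does not already define a group called q_part. The two
versions can also disagree when p can match in more than one way,
because the combined regexp is allowed to backtrack into p in order to
find a q.

```python
import re


def find_q_after_p(p_pattern, q_pattern, text):
    """Offset of the first q match after the first p match, or None."""
    p = re.compile(p_pattern)
    q = re.compile(q_pattern)

    match_p = p.search(text)
    if match_p is None:
        return None

    # Pass the starting offset to search() rather than building
    # text[match_p.end():] ourselves.
    match_q = q.search(text, match_p.end())
    if match_q is None:
        return None

    # start() is measured from the beginning of text, not from the
    # offset we passed in, so no adjustment is needed.
    return match_q.start()


def find_q_after_p_combined(p_pattern, q_pattern, text):
    """Same idea with one regexp: p, then anything (lazily), then q."""
    # [\s\S]*? matches any characters, newlines included, as few as possible.
    # q_part is a hypothetical group name chosen for this sketch.
    combined = re.compile(
        "(?:" + p_pattern + r")[\s\S]*?(?P<q_part>" + q_pattern + ")"
    )
    match_pq = combined.search(text)
    if match_pq is None:
        return None
    return match_pq.start("q_part")


if __name__ == "__main__":
    sample = "bar foo bar"
    print(find_q_after_p("foo", "bar", sample))
    print(find_q_after_p_combined("foo", "bar", sample))
```

The first function is the one to reach for with a very large string; the
lazy "anything in between" part of the combined regexp may be slow when
p and q are far apart.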