Plex 0.1 - A Lexical Analysis Module
Tim Peters
tim_one at email.msn.com
Wed Feb 9 23:22:03 EST 2000
More information about the Python-list mailing list
Wed Feb 9 23:22:03 EST 2000
- Previous message (by thread): Plex 0.1 - A Lexical Analysis Module
- Next message (by thread): small python implementation or python subset ?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
[posted & mailed] [on Greg Ewing's nascent "traditional DFA" lexer, at http://www.cosc.canterbury.ac.nz/~greg/python/Plex ] [Tim, suggests "direct computation of regular language intersection, complement, difference, and reversal, as well as regexpy unions and catenations" ] [Greg] > Tell me more about how to implement these, and I'll > see what I can do. If it's just a matter of gluing > NFAs together, it should be quite easy. Easy and (honest!) fun (btw, while it would be awfully nice for the DFA match function to get coded in C, the manipulations going into building a DFA are probably better left in Python -- "DFA compile time" isn't at all crucial for the most pressing uses). Here are implementation sketches. A core notion is the concept of a "complete" DFA. A DFA is complete iff for every state S and every symbol s, there's an explicit transition from S on s. If a DFA is not complete, it can be made complete by introducing a "sink state" K, which is just a non-accepting state that transitions to itself on every symbol. Then direct all the "missing" symbol+state transitions to K. A complete DFA is wholly explicit about what happens in every case, which is crucial in getting complementation to work correctly. I'm going to assume that all inputs are complete DFAs below (it greatly simplifies reasoning -- in practice, you can cheat; in particular, it's *usually* fine if an input is an NFA, so I'll mix terminology a little). Complement: For each state, if it's non-accepting make it accepting, and vice versa. Done! The transitions and set of start states don't change. Intersection of DFAs A and B: The usual construction is to simulate running A & B in parallel, building a new NFA whose states are the Cartesian product of A's states and B's states, and where a new state is accepting iff its constituent states were both accepting. The whole bit is obvious after a little thought. A much slicker way is to compute complement(union(complement(A), complement(B))) ! Not only pretty, but much harder to screw up <wink>. Union of DFAs A and B: Introduce a unique new start state S. Introduce epsilon transitions from S to all the start states in A and in B. Done. Difference of DFAs A and B (recognize everything recognized by A but not recognized by B): intersection(A, complement(B)). Reversal (accept strings that are the reversals of the strings accepted by the input DFA): (1) Reverse the arrows (that is, if S transitioned to T on symbol s, afterwards T transitions to S on s). (2) Set of start states becomes the old set of accepting states. (3) Set of accepting states becomes the old set of start states. Note that the result is very likely to be an NFA. Reversal is especially interesting because of a beautiful and unexpected result due to Brzozowski: modulo some fiddling to get rid of sink states and unreachable states, a *minimal* DFA can be constructed from an NFA A via make_dfa(reverse(make_dfa(reverse(A)))) where make_dfa is the usual NFA->DFA "subset" construction. Very pretty, impossible to get wrong <wink>, but generally (in my experience) slower than the usual minimization algorithms. A note on completeness: making an FA explicitly complete-- even if it's only for intermediate constructions --becomes extremely unattractive in a Unicode world. I didn't find any literature on this, so (probably re-) invented a notion of "NOT-set" transitions: in my "overly general" FSM classes, each NFA state has a symbol set N associated with it, and a target NOT-set T of states, such that any symbol *not* in N transitions to the states in T. The details proved delicate in some cases, but it all worked out quite prettily. Unfortunately, it complicates the runtime matcher too -- it's really an extension of the DFA concept. In partial compensation, the alphabet doesn't need to be specified at DFA construction time (indeed, can be specified at runtime -- or even left implicit forever!). Overall, it probably adds more complications than it's worth. computers-were-built-for-english-speakers-anyway<wink>-ly y'rs - tim
- Previous message (by thread): Plex 0.1 - A Lexical Analysis Module
- Next message (by thread): small python implementation or python subset ?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Python-list mailing list