ANN: Martel-0.2
Andrew Dalke
dalke at acm.org
Thu Aug 24 20:40:18 EDT 2000
More information about the Python-list mailing list
[Fredrik Lundh, Greg Ewing and Marc-Andre Lemburg may be interested in this post because of what I've done with their ideas and code.]

Hello,

This is my first announcement on this newsgroup for a project I've been working on called Martel. It's a parser generator for (nearly) regular languages. Detailed information can be found in my recent conference poster under http://www.biopython.org/~dalke/Martel/

It is designed to handle many of the formats I need to parse -- database records and program output -- which are stateful but not very complex. They are stateful, meaning that if the parsing were split between a lexer and a parser, there would have to be a lot of communication between the two so that the lexer could tokenize correctly. They aren't complex, meaning they don't have balanced parens or other types of data structures with indefinite depth.

Briefly, it uses a modified subset of Python's re (regular expression) syntax as the format description, and uses it to build the parser. The parser takes an input string and makes an expression tree for it. The tree is passed back to the caller using the SAX events from XML parsing, where the (?P<named>groups) are used to define the startElement() and endElement() names, and the leaves of the tree become the characters().

I need a briefer description than this, but haven't figured out one which is still understandable. I'm not even sure that's understandable :)

Technically, the regular expression parsing is done with a modified version of Fredrik Lundh's sre_parse from 1.6a3 (or 2?). The modifications allow access to all groups with the same name (instead of just the last one) and allow group name identifiers to have the same syntax as XML tag names. I also added support for a new syntax which I call "named group repeats", where the '{}'s allow a group name inside; the text matched by that group is used as the repeat count.
For example:

    r"Num atoms = (?P<num_atoms>\d+)\n((?P<atom_name>\w+)\n){num_atoms}"

Building up regular expressions as strings is error-prone, so I convert the parsed regular expression into an Expression tree. Expressions can be combined and otherwise manipulated using many of the same functions as Greg Ewing's Plex.

The parser is built by making a tag table for Marc-Andre Lemburg's mxTextTools, which does the actual parsing. I did have to add some hacks for things like lookahead assertions and named group references. (The last, for example, doesn't support multiple threads.)

The resulting system is very cool, if I say so myself :)

Hmm, looks like I need an example. If you know the SWISS-PROT format, then the examples on my aforementioned poster should be helpful. I guess I should come up with one which is a little less domain specific. Umm, how about the following untested code:

    from Martel import *

    def word(name):
        # One colon-delimited field, reported under the given element name
        return Group(name, Re("[^:\n]*"))

    format = Rep(word("name") + Str(":") +
                 # "no_passwd" is a single character not followed by a
                 # 13-character crypt string; otherwise it is a "passwd"
                 Alt( Group("no_passwd", Re(r"(?!\w{13})[^:]")),
                      word("passwd") ) + Str(":") +
                 word("uid") + Str(":") +
                 word("gid") + Str(":") +
                 # full name / comment field; needed because the sample
                 # lines below have seven fields
                 word("gecos") + Str(":") +
                 word("homedir") + Str(":") +
                 word("shell") + Str("\n")
                )

Making the parser and tying in the handler(s), then parsing the string

    root:gEMPivloT9av8:0:0:System Account:/root:/bin/sh
    idle:x:10:66:Eric Idle (disabled):/home/idle:/bin/noshell

gives calls like:

    startElement("name")
    characters("root")
    endElement("name")
    characters(":")
    startElement("passwd")
    characters("gEMPivloT9av8")
    endElement("passwd")
    ...
    characters("\n")
    startElement("name")
    characters("idle")
    endElement("name")
    characters(":")
    startElement("no_passwd")
    characters("x")
    endElement("no_passwd")
    ...

But that's not very interesting, since it doesn't show that I can parse optional fields, and even do format and version detection. Oh well, you should learn more about bioinformatics anyway!

Andrew
dalke at acm.org
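
For readers without Martel installed, two ideas from the post can be approximated with the standard library alone: SAX-style callbacks driven by named groups, and the "named group repeat". The sketch below is not Martel's API or its implementation. It uses plain `re` and `xml.sax.handler.ContentHandler`, handles only flat (non-nested) groups, and emulates the repeat count with a second, generated pattern. The names `PASSWD_LINE`, `Recorder`, `emit_events`, `parse_atoms` and the `gecos` field are assumptions made for illustration.

```python
import re
from xml.sax.handler import ContentHandler

# Plain `re` stand-in for the passwd format; this is NOT Martel's API.
# "gecos" is an assumed name for the full-name/comment column.
PASSWD_LINE = re.compile(
    r"(?P<name>[^:\n]*):"
    r"(?:(?P<no_passwd>(?!\w{13})[^:\n])|(?P<passwd>[^:\n]*)):"
    r"(?P<uid>[^:\n]*):"
    r"(?P<gid>[^:\n]*):"
    r"(?P<gecos>[^:\n]*):"
    r"(?P<homedir>[^:\n]*):"
    r"(?P<shell>[^:\n]*)\n"
)


class Recorder(ContentHandler):
    """Collects the SAX callbacks as (kind, value) tuples."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("chars", content))

    def endElement(self, name):
        self.events.append(("end", name))


def emit_events(pattern, text, handler):
    """Match `pattern` repeatedly over `text`, reporting each named group
    as an element and the text between groups as characters()."""
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            raise ValueError("cannot parse input at offset %d" % pos)
        # Only groups that took part in the match, in document order.
        spans = sorted(
            (m.start(name), m.end(name), name)
            for name in pattern.groupindex
            if m.group(name) is not None
        )
        cursor = m.start()
        for start, end, name in spans:
            if start > cursor:
                handler.characters(text[cursor:start])
            handler.startElement(name, {})
            if end > start:
                handler.characters(text[start:end])
            handler.endElement(name)
            cursor = end
        if m.end() > cursor:
            handler.characters(text[cursor:m.end()])
        pos = m.end()


def parse_atoms(text):
    """Emulate the named group repeat {num_atoms} in two steps: read the
    count, then build a second pattern with that count filled in."""
    header = re.match(r"Num atoms = (?P<num_atoms>\d+)\n", text)
    if header is None:
        raise ValueError("missing 'Num atoms' header")
    count = int(header.group("num_atoms"))
    body = re.compile(r"(?:\w+\n){%d}" % count).match(text, header.end())
    if body is None:
        raise ValueError("expected %d atom names" % count)
    return re.findall(r"\w+", body.group(0))


if __name__ == "__main__":
    recorder = Recorder()
    emit_events(PASSWD_LINE, "idle:x:10:66:Eric Idle:/home/idle:/bin/sh\n",
                recorder)
    for event in recorder.events:
        print(event)
    print(parse_atoms("Num atoms = 2\nC\nH\n"))
```

Because the handler is an ordinary `ContentHandler`, any existing SAX consumer could be passed to `emit_events` in place of `Recorder`. The two-step approach in `parse_atoms` is a workaround: standard `re` has no way to use the text of an earlier group as a repeat count within a single pattern.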