lisp is winner in DOM parsing contest! 8-]
Paul Rubin
http
Mon Jul 12 14:44:03 EDT 2004
More information about the Python-list mailing list
Mon Jul 12 14:44:03 EDT 2004
- Previous message (by thread): lisp is winner in DOM parsing contest! 8-]
- Next message (by thread): lisp is winner in DOM parsing contest! 8-]
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
tfb+google at tfeb.org (Tim Bradshaw) writes: > I had a theory of doing this all lazily, so you wouldn't have to do > the (slow) parsing step up-front but would just lie and say `OK, I > parsed it', then actually doing the work only on demand. 1) I think you do need to do the parsing up front, since you have to provide basically random access to the contents of the tree. 2) The parsing needn't be slow. It should just make a very fast linear scan over the document, emitting an in-memory record every time it sees a tag that makes a new tree node. It would remember on a stack where it is in the tree structure as it scans, and whenever it found a tag that closes off a subtree (e.g. </table>), it could emit pointers to where that subtree's parent and previously seen sibling are, and update the parent to know where the last-seen child is. Finally, it could make another pass over the in-memory structure to vectorize the list of children at each node. This could be done with even less memory by making two passes over the document (one to count the children for each node and one to build the tree with a vector of children at each node, allocated to the now-known size), at the expense of some speed. The point is, not much processing needs to be done during the scan. It should certainly be possible to parse a 3 MB document in under a second (I assume we're talking about a C extension). It just baffles me why the libraries that are out there are so much slower.
- Previous message (by thread): lisp is winner in DOM parsing contest! 8-]
- Next message (by thread): lisp is winner in DOM parsing contest! 8-]
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Python-list mailing list