Q: how to extract only text from a html ?
D-Man
dsh8290 at rit.edu
Thu Nov 2 20:39:24 EST 2000
More information about the Python-list mailing list
On Thu, 02 Nov 2000 06:18:21 Alex Martelli wrote:
|
| Like for the Turing-completeness of C++ templates, I
| think much of the dark fascination of RE's comes from
| the fact that it's hard to find something you _cannot_
| do with them...:-).
|

It's not that hard: try to match parentheses with unlimited nesting.
OK, maybe that's a little too difficult for you, so how about parens
with only 1 level of nesting? Ex:

    ((3 + 2) - (1 + 0))

For the HTML stripping, the following RE (adapted from matching C
comments) may do the job. No guarantees though, and I haven't tested
it ;-)

    <[^>]+>

The following sed/vi/perl command will delete all tags (read: any text
matched by the regex):

    s/<[^>]+>//g

(That is the Perl spelling. Classic sed and vi treat a bare + as a
literal character, so there you would write \+ or use <[^>][^>]*>
instead.)

The Python code is (the argument order for re.sub is pattern,
replacement, string):

    import re
    text = f.read()    # f is an already-open file object
    text = re.sub("<[^>]+>", "", text)

On second thought, suppose you have some text in your HTML file like
this:

    <html><body>
    Some examples of tautologies are 3 < 5 ; 5 > 3
    </body></html>

The text "< 5 ; 5 >" will be matched as a tag and removed along with
the real tags.

As a side comment, is <> a legal tag? It will not be matched by my RE,
because [^>]+ needs at least one character between the brackets.

-D
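
To make the above concrete, here is a minimal sketch of the regex approach in Python. The names TAG_RE and strip_tags are made up for illustration and are not from the original post. It shows both the case where the regex works and the two edge cases mentioned above (a stray "<" ... ">" pair in the text, and the empty tag "<>").

```python
import re

# The regex from the post: "<", then one or more non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(html):
    """Delete every substring that the tag regex matches (hypothetical helper)."""
    return TAG_RE.sub("", html)

# Well-behaved markup: only the tags are removed.
print(strip_tags("<p>Hello, <b>world</b>!</p>"))

# The pitfall: "< 5 ; 5 >" looks like a tag to the regex and is removed too.
tricky = "<html><body>Some examples of tautologies are 3 < 5 ; 5 > 3</body></html>"
print(strip_tags(tricky))

# "<>" has nothing between the brackets, so the regex leaves it alone.
print(strip_tags("a<>b"))
```

So the regex is fine for quick jobs on markup you control, but any unescaped "<" in the body text can swallow real content. A real HTML parser avoids that problem.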