Unicode and rdf
Richard West
rwest004 at opti.cgi.net
Wed Mar 10 00:45:30 EST 2004
Almost forgot. I'm running Python 2.3.3.

On Tue, 09 Mar 2004 23:41:30 -0600, Richard West
<rwest004 at opti.cgi.net> wrote:

>I'm trying to parse the rdf dumps from dmoz.org (Open Directory
>Project) and am having great difficulty just getting Python to read
>the files. The files are RDF in UTF-8 encoding according to the
>dmoz.org web site, but I get the following error:
>
>UnicodeDecodeError: 'utf8' codec can't decode bytes in position
>52376-52378: invalid data
>
>Here's a sample of code that will reproduce the problem:
>
>
>import sys
>import codecs
>from xml.sax import make_parser, handler
>
>def main():
>    f = codecs.open(sys.argv[1], 'r', 'utf-8')
>    parser = make_parser()
>    parser.setContentHandler(dmoz())
>    parser.parse(f)
>
>class dmoz(handler.ContentHandler):
>    def startElement(self, name, attrs):
>        print('%s' % name)
>
>if(__name__=='__main__'):
>    main()
>
>
>I'm working with the dump from February 23rd, 2004. On the dmoz.org
>web site news pertaining to the rdf dumps, there is an entry from
>March 3rd, 2003 which states that they are filtering the data to
>"prevent UTF-8 and XML character encoding problems". So I am assuming
>that the UTF-8 files I have are valid. I run into the problem with
>both the structure.rdf.u8 file and the content.rdf.u8 file.
>
>What am I doing wrong?
>
>
>-Richard
>
>
>dmoz.org rdf dumps: http://rdf.dmoz.org/
>
>dmoz.org rdf news: http://rdf.dmoz.org/rdf/Changes.html
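
One way to test the assumption that the files really are valid UTF-8,
independent of the SAX parser, is to decode the raw bytes directly and
look at what sits at the reported position. The following is only a
rough sketch, not a fix: the helper name find_invalid_utf8 and its
context parameter are made up for illustration, it is written for a
current Python 3 rather than 2.3.3, and it assumes the data fits in
memory (for the full dumps you would probably want to read a slice
around the reported offset instead).

```python
def find_invalid_utf8(data, context=16):
    """Locate the first invalid UTF-8 sequence in a bytes object.

    Returns None if the data decodes cleanly, otherwise a tuple of
    (byte offset, decoder's reason, bytes surrounding the bad spot).
    Hypothetical helper for illustration only.
    """
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        # err.start / err.end delimit the offending bytes in `data`.
        lo = max(err.start - context, 0)
        return err.start, err.reason, data[lo:err.end + context]
    return None


# Example with a stray Latin-1 byte (0xE9) embedded in ASCII text.
sample = b'abc\xe9def'
print(find_invalid_utf8(sample))
```

If this reports an offset near the one in the traceback, the bytes it
prints should show whether the dump itself contains a non-UTF-8
sequence at that point.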