shuffle the lines of a large file
Nick Craig-Wood
nick at craig-wood.com
Tue Mar 8 06:30:02 EST 2005
Raymond Hettinger <vze4rx4y at verizon.net> wrote:
>  >>> from random import random
>  >>> out = open('corpus.decorated', 'w')
>  >>> for line in open('corpus.uniq'):
>  ...     print >> out, '%.14f %s' % (random(), line),
>  >>> out.close()
>
>  sort corpus.decorated | cut -c 18- > corpus.randomized

Very good solution!  sort is truly excellent on very large datasets.

If you give it a file bigger than memory, it divides the file up into
temporary files of memory size, sorts each one, then merges all the
temporary files back together.

You can tune the memory sort uses for in-memory sorts with
--buffer-size.  It's pretty good at auto-tuning though.  You may want
to set --temporary-directory as well, to save filling up your /tmp.

In a previous job I did a lot of stuff with usenet news and was
forever blowing up the server with scripts which used too much memory.
sort was always the solution!

-- 
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick
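[For readers following along in Python 3: the quoted decorate/sort/cut
recipe can be sketched like this.  The in-memory corpus below is a
stand-in for the 'corpus.uniq' file in the original post, and the
in-process sort() stands in for the external GNU sort step.]

```python
import random

# Self-contained sketch of the decorate / sort / cut shuffle quoted
# above, updated to Python 3.  'corpus' stands in for corpus.uniq.
corpus = ['alpha\n', 'beta\n', 'gamma\n', 'delta\n']

# Decorate: prefix each line with a random key.  '%.14f' of random()
# always yields 16 characters ("0." plus 14 digits), so key + space
# occupies the first 17 columns of each line.
decorated = ['%.14f %s' % (random.random(), line) for line in corpus]

# Sort on the random key.  For a file bigger than memory you would use
# GNU sort here instead, which spills to temporary files and merges.
decorated.sort()

# Cut: drop the 17-column prefix, the same as `cut -c 18-`.
shuffled = [line[17:] for line in decorated]

# The result is a permutation of the original lines.
assert sorted(shuffled) == sorted(corpus)
```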