comparing multiple copies of terrabytes of data?
Josiah Carlson
jcarlson at uci.edu
Mon Oct 25 16:01:20 EDT 2004
Istvan Albert <ialbert at mailblocks.com> wrote:
> Dan Stromberg wrote:
> > Rather than cmp'ing twice, to verify data integrity, I was thinking we
> > could speed up the comparison a bit, by using a python script that does 3
>
> Use the cmp. So what if you must run it twice ... by the way I
> really doubt that you could speed up the process in python
> ... you'll probably end up with a much slower version

In this case you would be wrong. Comparing data on a processor is trivial (and would be done using Python's C internals anyway, if a strict string equality is all that matters), but IO is expensive. Reading terabytes of data is going to be the bottleneck, so reducing IO is /the/ optimization that can and should be done. The code to do so is simple:

def compare_3(fn1, fn2, fn3):
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20  # tune this as necessary
    p = -1
    good = 1
    while f1.tell() > p:  # loop until a read no longer advances the file
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
        print "files differ"
        good = 0
        break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close()  # I prefer to explicitly close my file handles
    f2.close()
    f3.close()

Note that it /may/ be faster to first convert the data into arrays (module array) to get 2, 4 or 8 byte block comparisons.

- Josiah
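[Editor's note: for readers following along today, here is a minimal Python 3 sketch of the same single-pass idea. The function name and return-value convention are illustrative, not from the original post; the technique — reading all three files block-by-block in one pass so the data is only pulled off disk once — is exactly what the code above does.]

```python
def compare_3(fn1, fn2, fn3, block=2 ** 20):
    """Return True if all three files are byte-identical, else False.

    Reads each file exactly once, one block at a time, so total IO is
    one pass over each file rather than two pairwise cmp runs.
    """
    with open(fn1, 'rb') as f1, open(fn2, 'rb') as f2, open(fn3, 'rb') as f3:
        while True:
            b1, b2, b3 = f1.read(block), f2.read(block), f3.read(block)
            if not (b1 == b2 == b3):
                return False  # a block differs (or the lengths differ)
            if not b1:
                return True   # all three hit EOF on the same block
```

For regular files, read(block) returns a full block until EOF, so a shorter file produces a short (unequal) block and the comparison fails without needing a separate length check.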