comparing multiple copies of terabytes of data?
Eddie Corns
eddie at holyrood.ed.ac.uk
Tue Oct 26 07:26:04 EDT 2004
Dan Stromberg <strombrg at dcs.nac.uci.edu> writes:

>We will soon have 3 copies, for testing purposes, of what should be about
>4.5 terabytes of data.
>
>Rather than cmp'ing twice, to verify data integrity, I was thinking we
>could speed up the comparison a bit, by using a python script that does 3
>reads, instead of 4 reads, per disk block - with a sufficiently large
>blocksize, of course.
>
>My question then is, does python have a high-level API that would
>facilitate this sort of thing, or should I just code something up based on
>open and read?
>
>Thanks!

Taking a checksum of each file and comparing the checksums would probably
be much faster. A quick test of md5 versus cmp gave me roughly a 10x
speedup. Oddly, though, running two md5 processes in parallel was slower
than running them in sequence - that could be the cheap'n'nasty hard disk
in my desktop; normally I'd expect a gain there, since one process can use
the CPU while the other waits on I/O.

Eddie
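Below is a minimal sketch of the single-pass comparison Dan describes - one
read per copy per disk block, so three reads instead of the four that two
cmp runs would cost. It is written for modern Python (the original thread
predates hashlib), and the function name and the 8 MiB blocksize are
illustrative choices, not anything from the thread:

import sys

BLOCKSIZE = 8 * 1024 * 1024   # large blocks keep the disks streaming

def all_copies_identical(paths, blocksize=BLOCKSIZE):
    """Compare any number of files block by block, one read each per block.

    Assumes at least two paths are given.
    """
    files = [open(p, 'rb') for p in paths]
    try:
        while True:
            blocks = [f.read(blocksize) for f in files]
            # Any mismatch in content or length shows up as unequal blocks,
            # since a shorter file returns a short (or empty) read.
            if any(b != blocks[0] for b in blocks[1:]):
                return False
            if not blocks[0]:
                return True    # every file reached EOF together
    finally:
        for f in files:
            f.close()

if __name__ == '__main__':
    print(all_copies_identical(sys.argv[1:]))

Eddie's checksum suggestion amounts to replacing the direct block comparison
with a per-file hash (e.g. hashlib.md5(), calling .update() on each block)
and comparing the digests at the end; that costs the same number of reads
but lets each copy be read independently, which matters when the copies sit
on different machines or can't be streamed in lockstep.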
More information about the Python-list mailing list