By adam.forster
via peterbe.com
Published: Apr 15 2007 / 09:17
To some Python users this is old-school, old-news stuff, but since I had never used it before, I found it worth mentioning.
I have a script that scans a rather large tree of folders filled with files. Sometimes two different folders contain exactly the same file names. Sometimes the file sizes are equal too. But in some of those cases, even though the file sizes and names are the same, they are different files. But if they are the same files, just in different locations, I want to find them. How to do that?
Comments
kupolov replied ago:
Size/name and MD5 do NOT guarantee equality. Moreover, CRC32 and Adler-32 are faster to calculate. CRC, SHA, and MD5 are just hash functions and are used only to select candidates in a minimal amount of time. A byte-to-byte comparison is still required to check for equality.
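The workflow described in this comment (use cheap checks and a hash to select candidates, then confirm with a byte-to-byte comparison) could be sketched roughly as follows. This is only an illustration, not the script from the linked post: the names file_digest and find_duplicates are hypothetical, MD5 is used simply because it is the hash under discussion, and the sketch assumes every entry under the root is a regular, readable file.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups (lists of paths) whose contents are byte-identical."""
    # Step 1: cheapest filter first. Files of different size cannot be equal.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Step 2: the hash only selects candidates; it proves nothing yet.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        for candidates in by_hash.values():
            # Step 3: confirm each candidate with a real content comparison.
            confirmed = []
            for path in candidates:
                for group in confirmed:
                    if filecmp.cmp(group[0], path, shallow=False):
                        group.append(path)
                        break
                else:
                    confirmed.append([path])
            groups.extend(g for g in confirmed if len(g) > 1)
    return groups
```

Passing shallow=False makes filecmp.cmp compare the file contents rather than only the os.stat() signatures, which is the byte-to-byte step the comment asks for. A faster checksum could be swapped into file_digest without changing the final result, since step 3 decides equality.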
Adam Forster replied ago:
kupolov, you are correct: a byte-to-byte comparison is required. If you had read the post correctly you would have realised that the md5 is done on the file's actual contents, so there is a byte-to-byte comparison, although I do agree it is not the fastest method.
kupolov replied ago:
Maybe I've missed something, but the article states:
md5.new(f1.read()).digest() == md5.new(f2.read()).digest()
This expression does not perform a byte-to-byte comparison, even if the whole files are read. md5(f1) = md5(f2) does not mean f1 = f2.
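The point can be illustrated with a deliberately weakened hash. The sketch below keeps only the first byte of the MD5 digest, so there are just 256 possible values and two different inputs with the same value turn up almost immediately. The names tiny_hash and find_collision are hypothetical, and the truncation is an assumption made purely to keep the search short. The full 128-bit digest has the same property in principle, because there are far more possible files than digests, although real MD5 collisions need specially crafted inputs.

```python
import hashlib
from itertools import count


def tiny_hash(data):
    """First byte of the MD5 digest: a deliberately weak 8-bit hash."""
    return hashlib.md5(data).digest()[:1]


def find_collision():
    """Return two different byte strings that share the same tiny_hash."""
    seen = {}
    for i in count():
        data = str(i).encode("ascii")
        h = tiny_hash(data)
        if h in seen:
            # Same hash value, but the inputs themselves differ.
            return seen[h], data
        seen[h] = data


a, b = find_collision()
print(a != b, tiny_hash(a) == tiny_hash(b))  # prints "True True"
```

With only 256 possible hash values, the loop is guaranteed to stop within 257 inputs.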