Link Details


By adam.forster
via peterbe.com
Published: Apr 15 2007 / 09:17

To some Python users this is old-school, old-news stuff, but since I've never used it before I found it worth mentioning. I have a script that scans a rather large tree of folders filled with files. Sometimes two different folders contain exactly the same file names. Sometimes the file sizes are equal too. But in some of those cases, even though the file sizes and names are the same, they are different files. But! If they are the same files just in different locations, I want to find them. How to do that?
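The narrowing step described above (same name, same size, different folders) could be sketched roughly as follows. This is not the script from the linked post; the function name `group_by_name_and_size` and its `root` parameter are made up for illustration.

```python
import os
from collections import defaultdict


def group_by_name_and_size(root):
    """Map (file name, size in bytes) to the paths under root that share both.

    Hypothetical helper for illustration; not taken from the linked post.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            groups[(name, size)].append(path)
    # Only keys seen in more than one location are duplicate candidates.
    return {key: paths for key, paths in groups.items() if len(paths) > 1}
```

Every group this returns is only a list of candidates: a matching name and size says nothing yet about the contents.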

Comments


kupolov replied:


Matching size, name, and MD5 do NOT guarantee equality. Moreover, CRC32 and Adler-32 are faster to calculate. CRC, SHA, and MD5 are just hash functions, useful only for selecting candidates in a minimal amount of time. A byte-to-byte comparison is still required to check for equality.


Adam Forster replied:


kupolov, you are correct that a byte-to-byte comparison is required. If you had read the post correctly, you would have realised that the MD5 is computed on the file's actual contents, so there is a byte-to-byte comparison, although I do agree it is not the fastest method.


kupolov replied:


Maybe I've missed something, but the article states:
md5.new(f1.read()).digest() == md5.new(f2.read()).digest()
This expression does not perform a byte-to-byte comparison, even if the whole files are read. md5(f1) = md5(f2) does not mean f1 = f2.
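One way to combine the two positions in this thread is to use the hash only as a cheap filter and then confirm with a real content comparison. The sketch below assumes Python 3, where `hashlib` replaces the old `md5` module quoted above; the names `file_md5` and `same_contents` are made up for illustration.

```python
import filecmp
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """MD5 digest of a file, read in chunks so large files are not held in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def same_contents(path1, path2):
    """True only if both files hold identical bytes.

    Hypothetical helper: different digests prove the files differ,
    but equal digests only make them candidates.
    """
    if file_md5(path1) != file_md5(path2):
        return False
    # shallow=False makes filecmp compare the actual file contents.
    return filecmp.cmp(path1, path2, shallow=False)
```

For a single pair, calling `filecmp.cmp(..., shallow=False)` directly is simpler and reads each file only once. The digest pays off when many files are compared against each other, because each file is hashed once and only files with equal digests need the full comparison.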
