Regular expression – Search for duplicate files across multiple directories

I downloaded some files related to a specific topic from the Internet. Now I want to check whether there are any duplicates among them. The problem is that the file names will be different, but the contents may match.

Is there a way to implement some code that will traverse multiple folders and notify which files are duplicates?

If you are using a Linux/*nix system, you can use SHA tools such as sha512sum, because MD5 can be broken (it is vulnerable to collisions).

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen) {print "duplicate: " $2 " and " seen[$1]} (!($1 in seen)) {seen[$1]=$2}'

The awk script remembers the first file seen for each digest and reports any later file that produces the same digest. (Note that file names containing spaces will be cut off at the first space, since awk splits fields on whitespace.)

If you want to use Python, here is a simple implementation:

import hashlib
import os

def sha(filename):
    """Return the SHA-512 hex digest of a file, or None on error."""
    d = hashlib.sha512()
    try:
        with open(filename, 'rb') as f:  # binary mode, so hashing is byte-exact
            d.update(f.read())
    except OSError as e:
        print(e)
        return None
    return d.hexdigest()

seen = {}
path = os.path.join("/home", "path1")
for root, dirs, files in os.walk(path):
    for name in files:
        filename = os.path.join(root, name)
        digest = sha(filename)
        if digest is None:
            continue
        if digest not in seen:
            seen[digest] = filename
        else:
            print("Duplicates: %s <==> %s" % (filename, seen[digest]))

If you think sha512sum is not enough, you can confirm candidate pairs byte for byte with Unix tools like diff, or with Python's filecmp module.
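
For example, a minimal filecmp check (the two paths here are placeholders for a candidate pair reported by the hash pass above):

import filecmp

# shallow=False forces a byte-for-byte comparison instead of only comparing stat metadata
if filecmp.cmp("/home/path1/a.dat", "/home/path1/b.dat", shallow=False):
    print("byte-for-byte identical")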
