Regular expression - Search repeated files across multiple directories

I downloaded some files related to a specific topic from the Internet. Now I want to check whether any of them are duplicates. The problem is that the file names will be different, but the content may match.

Is there a way to implement some code that will traverse multiple folders and notify which files are duplicates?

If you are on a Linux/*nix system, you can use a SHA tool such as sha512sum, because MD5 can be broken (collisions can be constructed deliberately).

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: " $2 " and " seen[$1]} (!($1 in seen)){seen[$1]=$2}'

If you want to use Python, here is a simple implementation:

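import hashlib
import os

def sha(filename):
    """Return the SHA-512 hex digest of a file, or None if it cannot be read."""
    d = hashlib.sha512()
    try:
        # Binary mode so the bytes are hashed exactly as stored on disk.
        with open(filename, "rb") as fh:
            d.update(fh.read())
    except Exception as e:
        print(e)
    else:
        return d.hexdigest()

s = {}  # digest -> first file seen with that digest
path = os.path.join("/home", "path1")
for r, d, f in os.walk(path):
    for files in f:
        filename = os.path.join(r, files)
        digest = sha(filename)
        if digest is None:
            continue  # unreadable file, already reported above
        if digest not in s:
            s[digest] = filename
        else:
            print("Duplicates: %s <==> %s" % (filename, s[digest]))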
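
The script above reads each file into memory in one go and only reports a duplicate against the first file seen. As a rough sketch of one possible variant (not part of the original answer), the code below hashes files in chunks and collects every path that shares a digest. The helper names `sha512_of` and `find_duplicates`, and the 1 MiB chunk size, are assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict


def sha512_of(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots):
    """Map each digest to the paths sharing it, keeping only groups of 2+."""
    by_digest = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_digest[sha512_of(path)].append(path)
                except OSError as err:
                    # Unreadable files are skipped rather than grouped.
                    print("Skipping %s: %s" % (path, err))
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Calling `find_duplicates(["/home/path1", "/home/path2"])` would return a dictionary whose values are lists of files with identical content; the two directory names here are placeholders.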

If you think sha512sum is not enough, you can confirm a match by comparing the files byte by byte, using Unix tools like diff, or filecmp in Python.
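
A minimal sketch of that confirmation step with Python's standard filecmp module; the wrapper name `same_content` is hypothetical and only there for illustration.

```python
import filecmp


def same_content(path_a, path_b):
    """Return True if both files hold identical bytes.

    shallow=False makes filecmp.cmp compare the contents instead of
    trusting matching os.stat() signatures (type, size, mtime).
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This could be called on each pair the hash pass reports, so that a file is only treated as a duplicate once the contents have actually been compared.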
