Checking Duplicate Files Against A Dictionary Of Filesizes And Names
This is pretty simple code. I've just completed Charles Severance's Python for Informatics course, so if possible please help me keep it simple. I'm trying to find duplicate documents.
Solution 1:
Your biggest mistake is that you're hashing the filename instead of the file's contents, so two identical files with different names would never be detected as duplicates.
I have corrected that and also cleaned up the rest of the code:
import os
import hashlib

location = '/Users/jeff/desktop/typflashdrive'
doc_count = 0
dup_doc_count = 0
hash_vs_file = {}  # maps content hash -> list of paths whose contents produce it

for (dirname, dirs, files) in os.walk(location):
    for filename in files:
        if filename.endswith('.doc'):
            doc_count = doc_count + 1
            file_path = os.path.join(dirname, filename)
            # Hash the file's contents (opened in binary mode), not its name.
            file_hash = hashlib.md5(open(file_path, 'rb').read()).hexdigest()
            if file_hash not in hash_vs_file:
                # First time we've seen this content.
                hash_vs_file[file_hash] = [file_path]
            else:
                # Same content seen before: count it as a duplicate.
                dup_doc_count += 1
                hash_vs_file[file_hash].append(file_path)

print 'doc_count = ', doc_count
print 'dup_doc_count = ', dup_doc_count

for file_hash in hash_vs_file:
    print file_hash
    for file_path in hash_vs_file[file_hash]:
        print file_path
    print "\n\n\n"
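One caveat with the one-liner above: open(file_path, 'rb').read() pulls each entire file into memory, which can be slow or fail outright on very large files. Here is a minimal sketch of a chunked alternative (the helper name hash_file and the 64 KB chunk size are my own illustrative choices, not part of the original answer):

import hashlib

def hash_file(path, chunk_size=65536):
    # Feed the file to MD5 in fixed-size chunks so memory use stays
    # constant no matter how large the file is.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        chunk = f.read(chunk_size)
        while chunk:
            md5.update(chunk)
            chunk = f.read(chunk_size)
    return md5.hexdigest()

With this in place, the hashing line in the walk loop becomes file_hash = hash_file(file_path). If the drive holds many large files, a further refinement (hinted at by the question's title) is to group files by os.path.getsize() first and only hash files whose sizes collide, since files of different sizes can never have identical contents.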