
Checking Duplicate Files Against A Dictionary Of Filesizes And Names

This is pretty simple code. I've just completed Charles Severance's Python for Informatics course, so if possible please help me keep it simple. I'm trying to find duplicate documents by checking them against a dictionary of file sizes and names.

Solution 1:

Your biggest mistake is that you're taking the hash of the filenames instead of the file content.
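To see the difference concretely (a small standalone illustration, not part of your script), compare hashing a name against hashing some bytes of content:

import hashlib

same_bytes = b'identical document contents'

# Hashing the names: two copies of one document get different digests.
print(hashlib.md5(b'report.doc').hexdigest())
print(hashlib.md5(b'report_copy.doc').hexdigest())   # differs from the line above

# Hashing the contents: both copies get the same digest,
# which is what actually identifies duplicates.
print(hashlib.md5(same_bytes).hexdigest())
print(hashlib.md5(same_bytes).hexdigest())           # identical to the line above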

I have corrected that, opened the files in binary mode (so the hash is taken over the raw bytes of each .doc), moved the extension check ahead of the hash so non-.doc files are never read, and cleaned up the rest of the code:

import os
import hashlib


location = '/Users/jeff/desktop/typflashdrive'
doc_count = 0
dup_doc_count = 0

hash_vs_file = {}

for dirname, dirs, files in os.walk(location):
    for filename in files:
        if not filename.endswith('.doc'):
            continue
        doc_count += 1
        file_path = os.path.join(dirname, filename)
        # Hash the file's contents, read in binary mode, not its name.
        with open(file_path, 'rb') as f:
            file_hash = hashlib.md5(f.read()).hexdigest()
        if file_hash not in hash_vs_file:
            hash_vs_file[file_hash] = [file_path]
        else:
            dup_doc_count += 1
            hash_vs_file[file_hash].append(file_path)

print('doc_count =', doc_count)
print('dup_doc_count =', dup_doc_count)

for file_hash, file_paths in hash_vs_file.items():
    print(file_hash)
    for file_path in file_paths:
        print(file_path)
    print('\n\n\n')
