One, the background

Two folders may contain files with the same name or with different names, and files with the same name may still differ in content. The task is to compare the two folders and write the results to a JSON file.

Two, the train of thought

First get the two directories, then walk through the files in each one. Add each file name to a file list, join the directory and the file name into a full path, and add that path to a path list.

Next, open the files and compare them by computing each file's hash value, storing the results in a dictionary. The two directories may contain different numbers of files, so the surplus files in the larger directory have no counterpart. They are treated as different and recorded in the dictionary with the value False.

Finally, write the contents of the dictionary to a JSON file.
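
The "surplus files" step and the final JSON step can be sketched on their own, without touching the disk. This is a minimal illustration, assuming files are matched by name with set operations rather than by list position. The helper name `split_names` and the sample file lists are hypothetical and not part of the program in the next section.

```python
import json


def split_names(names1, names2):
    """Return (names present in both lists, names present in only one), each sorted."""
    set1, set2 = set(names1), set(names2)
    return sorted(set1 & set2), sorted(set1 ^ set2)


# Hypothetical listings of two directories
common, unmatched = split_names(["0.py", "1.py", "2.txt"], ["0.py", "1.py"])

# Files that exist in only one directory are recorded as different right away;
# the files in `common` would still need their hashes compared.
result = {name: False for name in unmatched}

print(common)
print(json.dumps(result))
```

Matching by name avoids relying on both directory listings coming back in the same order.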

Three, the code

import hashlib
import os
import json

dic = {}

def getHash(f):
    """
    Compute the MD5 hash of a file.
    f: file object opened in binary mode
    """
    md5 = hashlib.md5()
    line = f.readline()
    while line:
        md5.update(line)
        line = f.readline()
    return md5.hexdigest()


def IsHashEqual(f1, f2):
    """
    Compare the hash values of two files.
    f1: first file
    f2: second file
    """
    str1 = getHash(f1)
    str2 = getHash(f2)
    return str1 == str2


def CountFiles(path1, path2):
    """Compare the files in two directories and write the results to a JSON file."""
    path_1, path_2 = [], []
    file_dir1, file_dir2 = [], []
    # Obtain all files in path1
    for file in sorted(os.listdir(path1)):
        file_dir1.append(file)
        # Build the full path portably instead of concatenating strings
        path_1.append(os.path.join(path1, file))
    
    # Obtain all files in path2
    for file in sorted(os.listdir(path2)):
        file_dir2.append(file)
        # Build the full path portably instead of concatenating strings
        path_2.append(os.path.join(path2, file))
    
    len1, len2 = len(path_1), len(path_2)
    
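    # Compare files pairwise by position; this assumes both listings line up by name
    for i in range(min(len1, len2)):
        # Close both files as soon as their hashes have been compared
        with open(path_1[i], "rb") as file1, open(path_2[i], "rb") as file2:
            res = IsHashEqual(file1, file2)
        dic[file_dir1[i]] = res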
        # print(dic)
    
    # The number of files in the two paths is different
    if len1 < len2:
        for i in range(len1, len2):
            dic[file_dir2[i]] = False
    elif len1 > len2:
        for i in range(len2, len1):
            dic[file_dir1[i]] = False
            
    # Write the results to a JSON file
    js = json.dumps(dic)
    with open('test_data.json', 'w') as json_file:
        json_file.write(js)


if __name__ == '__main__':
    # f1 = open("D:/Code/Python/test1/0.py","rb")
    # f2 = open("D:/Code/Python/test2/0.py","rb")
    # print(IsHashEqual(f1,f2))
    path1 = "D:/Code/Python/test1/"
    path2 = "D:/Code/Python/test2/"
    CountFiles(path1, path2)
    print(dic)

Four, the results

The contents of the test1 and test2 folders are as follows:

The 0.py file has the same contents in both folders, the 1.py file has different contents, and the 2.txt file exists in only one of the folders.

Run the program, then open test_data.json in the directory the program was run from. The file does not need to exist beforehand: opening it in 'w' mode creates it, or overwrites it if it is already there. It contains the following:

{"0.py": true, "1.py": false, "2.txt": false}

Five, areas for improvement

  • The program only looks at the top level of each directory. If a directory contains subdirectories, how can their contents be compared as well?
  • Can the two directories be updated simultaneously to ensure consistency?
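
For the subdirectory question, one possible approach is to walk both trees with `os.walk` and match files by their path relative to each root. The following is only a sketch under that assumption, not the method used in the program above. The function names `file_md5`, `relative_files`, and `compare_trees` are hypothetical, and the chunk size is an arbitrary choice.

```python
import hashlib
import os


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def relative_files(root):
    """Return the set of file paths under root, relative to root, using '/' separators."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root).replace(os.sep, "/"))
    return found


def compare_trees(root1, root2):
    """Map each relative path to True only if it exists in both trees with equal content."""
    files1, files2 = relative_files(root1), relative_files(root2)
    result = {}
    for rel in sorted(files1 | files2):
        if rel in files1 and rel in files2:
            # Present on both sides: compare the hashes of the two files
            result[rel] = (file_md5(os.path.join(root1, rel))
                           == file_md5(os.path.join(root2, rel)))
        else:
            # Present on one side only: recorded as different
            result[rel] = False
    return result
```

Calling `compare_trees(path1, path2)` returns a dictionary that can be passed to `json.dumps` in the same way as `dic`. Because keys are relative paths, a file such as `sub/0.py` is only compared with the file at the same location in the other tree.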
