Preface

I hadn't worked with incremental file synchronization before, so this post is a record of what I learned.

Scenario

Two endpoints, A and B, each hold a very similar copy of a relatively large file. If every update is sent as a full transfer over HTTP, the traffic is heavy and wasteful. We want a way to transfer only the modified content and reuse the rest.

Duplicate block detection techniques

  • Fixed block detection: the file is split into fixed-size blocks. If any region changes, the blocks after it shift and no longer match, so the hit rate is low and the approach is very limited (illustrated just below).
  • Variable block detection: variable-size blocks (content-defined chunking, e.g. the CDC algorithm) solve the shifting problem well, but there doesn't seem to be a universally adopted solution; I couldn't find one.
  • Sliding block detection: fixed-size blocks are matched by sliding a window across the file and testing for hits. The hit rate is high, but the less similar the files, the more time it costs. rsync's incremental transfer algorithm works this way, and it is the focus of this article.
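A toy sketch of the fixed-block weakness (the `blockHashes` helper is hypothetical, and the raw block content stands in for a real digest): inserting a single byte at the front shifts every fixed block, so none of the old block digests match anymore.

```ts
// Toy fixed-block "digest": split into fixed-size blocks and record each one.
function blockHashes(data: string, blockSize: number): string[] {
  const hashes: string[] = [];
  for (let i = 0; i < data.length; i += blockSize) {
    hashes.push(data.slice(i, i + blockSize)); // stand-in for a real digest
  }
  return hashes;
}

const oldBlocks = blockHashes("AAAABBBBCCCCDDDD", 4);       // AAAA BBBB CCCC DDDD
const newBlocks = blockHashes("X" + "AAAABBBBCCCCDDDD", 4); // XAAA ABBB BCCC CDDD D

// Every fixed block has shifted, so not a single one matches:
console.log(newBlocks.filter((b) => oldBlocks.includes(b)).length); // 0
```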

The rsync incremental transfer algorithm

Suppose file A and file B need to be synchronized. File A is first split into blocks, a digest is computed for each block (a weak digest and a strong digest), and the digest data is sent to endpoint B. B then starts at the first byte and slides a window of the same block size through its file, comparing block by block.

For each window position, the weak digest is computed first. If the weak digest hits, the strong digest is computed and compared. If both match, that region is an identical block; the data between two identical blocks is a difference block.

  • The weak digest uses Adler-32: fast to generate, but collisions are possible (see the rolling sketch after this list).
  • The strong digest uses MD5: slower to generate, but reliable.
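A minimal sketch of the sliding detection with a rolling weak digest, following the article's choice of Adler-32 (rsync itself uses a slightly different rolling checksum). The MD5 confirmation step is only indicated with a comment, since it depends on your digest library.

```ts
const MOD = 65521; // largest prime below 2^16, per Adler-32

interface Rolling { a: number; b: number }

// Adler-32 state of one window, computed from scratch.
function adler32(bytes: Uint8Array, start: number, len: number): Rolling {
  let a = 1, b = 0;
  for (let i = start; i < start + len; i++) {
    a = (a + bytes[i]) % MOD;
    b = (b + a) % MOD;
  }
  return { a, b };
}

// Roll the window one byte to the right in O(1):
// drop `outByte` on the left, add `inByte` on the right.
function roll(s: Rolling, outByte: number, inByte: number, len: number): Rolling {
  const a = (((s.a - outByte + inByte) % MOD) + MOD) % MOD;
  const b = (((s.b - len * outByte + a - 1) % MOD) + MOD) % MOD;
  return { a, b };
}

const weakKey = (s: Rolling) => s.b * 0x10000 + s.a; // pack into one number

// Slide through `data` and record window offsets whose weak digest hits.
function findMatches(data: Uint8Array, blockSize: number, known: Set<number>): number[] {
  const hits: number[] = [];
  if (data.length < blockSize) return hits;
  let s = adler32(data, 0, blockSize);
  for (let i = 0; ; i++) {
    // On a weak hit, the strong (MD5) digest would be compared here.
    if (known.has(weakKey(s))) hits.push(i);
    if (i + blockSize >= data.length) break;
    s = roll(s, data[i], data[i + blockSize], blockSize);
  }
  return hits;
}
```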

(Image from the Internet)

Note that in the scheme above, both A and B have to do computation, which is not ideal: in a one-A-to-many-B or one-B-to-many-A setup, the load on the single side is too heavy.

At the same time, the window size greatly affects the result: a smaller window improves the hit rate but increases the amount of computation, while a larger window reduces the computation but may lower the hit rate.

Optimization

Real-world usage can be roughly divided into two cases:

  • The client has a new file that needs to be incrementally synchronized to the server.
  • The server has a new file that needs to be incrementally synchronized to the other ends.

In either case, we want to put the computation on the client and have the server only do the merge or the transfer.

So when an endpoint first uploads a new file to the server, it can compute the digest locally and send it along with the file. On each subsequent update, the new digest is sent to the server as well. The server can then cache the digest, saving itself the calculation.

In some cases, you can also cache the difference between the old and new file, and use a version number to check whether the server already holds that diff; if it does, the update can be transferred directly.
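A minimal sketch of those server-side caches, assuming digests and diffs are keyed by a file id plus version (all names here are hypothetical):

```ts
// Hypothetical in-memory caches; a real server would use Redis, disk, etc.
const digestCache = new Map<string, string>();   // "fileId@version" -> digest JSON
const diffCache = new Map<string, Uint8Array>(); // "fileId@from->to" -> diff blob

function getCachedDigest(fileId: string, version: number): string | undefined {
  return digestCache.get(`${fileId}@${version}`);
}

function getCachedDiff(fileId: string, from: number, to: number): Uint8Array | undefined {
  return diffCache.get(`${fileId}@${from}->${to}`);
}
```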

So for the first case: the client requests the digest from the server; once the server returns it, the client runs the sliding-window detection locally, then sends the detection result and the difference blocks to the server, and the server performs the merge.
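A sketch of that upload flow, assuming hypothetical `/api/digest` and `/api/merge` endpoints and a `detectInWorker` helper (sketched in the Web Worker section below); the demo's actual routes may differ.

```ts
// Assumed helper: runs sliding-window detection off the main thread.
declare function detectInWorker(
  file: Blob,
  digest: unknown
): Promise<{ result: unknown[]; diffBlocks: Blob[]; hitRatio: number }>;

async function syncUp(fileId: string, file: Blob): Promise<void> {
  // 1. Fetch the digest of the server's old version.
  const digest = await (await fetch(`/api/digest/${fileId}`)).json();

  // 2. Run the sliding-window detection locally.
  const { result, diffBlocks } = await detectInWorker(file, digest);

  // 3. Send the detection result plus the difference blocks; the server merges.
  const form = new FormData();
  form.append("result", JSON.stringify(result));
  diffBlocks.forEach((b, i) => form.append(`diff_${i}`, b));
  await fetch(`/api/merge/${fileId}`, { method: "POST", body: form });
}
```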

For the second case: the client again requests the digest from the server; once the server returns it, the client runs the sliding-window detection. Matching blocks are read from the local file, and the differing blocks are fetched via HTTP Range requests. (In fact, if the server has already saved a diff file, the client can request it directly.)
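A sketch of fetching one missing region with an HTTP Range request (the URL is a placeholder; the server must answer with 206 Partial Content):

```ts
// Fetch one differing region of the new file via an HTTP Range request.
async function fetchMissingRange(
  fileUrl: string,
  start: number,
  end: number // inclusive byte offset
): Promise<ArrayBuffer> {
  const res = await fetch(fileUrl, { headers: { Range: `bytes=${start}-${end}` } });
  if (res.status !== 206) throw new Error("server did not honor the Range request");
  return res.arrayBuffer();
}
```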

A threshold is also needed: if the hit rate during sliding-window detection is too low, the detection can be abandoned and a plain full upload used instead. For files with low similarity, sliding detection costs time and the result is not worth it.
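The fallback might look like this (the 0.3 threshold is an arbitrary assumption, and `detectInWorker` is the helper assumed earlier):

```ts
const HIT_RATIO_THRESHOLD = 0.3; // hypothetical cut-off; tune per workload

async function syncOrFullUpload(fileId: string, file: Blob): Promise<void> {
  const digest = await (await fetch(`/api/digest/${fileId}`)).json();
  const { hitRatio } = await detectInWorker(file, digest);
  if (hitRatio < HIT_RATIO_THRESHOLD) {
    // Too few hits: skip patching and upload the whole file.
    await fetch(`/api/upload/${fileId}`, { method: "POST", body: file });
    return;
  }
  // ...otherwise continue with the incremental merge as in syncUp above.
}
```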

Practical use

I also wrote a demo on GitHub; the link is at the end of the article.

The following uses incremental synchronization of new files from the browser to the server as an example.

We can create a Web Worker and run both the digest calculation and the sliding-window detection inside it.
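A minimal sketch of that wiring, which also gives the `detectInWorker` helper assumed in the earlier snippets (`worker.ts` and the message shape are assumptions, not the demo's exact API):

```ts
// Main thread: offload digesting and sliding detection to a module worker.
const worker = new Worker(new URL("./worker.ts", import.meta.url), { type: "module" });

interface DetectResult {
  result: unknown[];   // compressed detection result (see below)
  diffBlocks: Blob[];  // contents of the difference blocks
  hitRatio: number;    // share of blocks that matched
}

function detectInWorker(file: Blob, digest: unknown): Promise<DetectResult> {
  return new Promise((resolve) => {
    worker.onmessage = (e: MessageEvent<DetectResult>) => resolve(e.data);
    worker.postMessage({ file, digest }); // Blobs are structured-cloneable
  });
}
```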

The digest result needs three pieces of data: the weak digest, the strong digest, and the corresponding block number. My first idea was an array of objects, each holding the two digests, with the array index as the block number. But lookups in that structure are far too slow: every query has to scan the whole array.

Later, I changed my mind and used the following data structure:

The weak digest is the key and the strong digest is the value, with the block's number in the original file appended to the value as @index. Since weak digests can collide, during calculation we check whether the weak digest already exists as a key; if it does, append @1 to it, and if that also exists, try @2, and so on, until an unused key is found and the entry can be put into the object.
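A sketch of building and querying that map (digest values here are plain strings; the real demo may shape them differently):

```ts
// Build the lookup map: weak digest -> "strong@index". On a weak-key
// collision, append "@1", "@2", ... to the key until a free slot is found.
function buildDigestMap(blocks: { weak: string; strong: string }[]): Record<string, string> {
  const map: Record<string, string> = {};
  blocks.forEach((b, index) => {
    let key = b.weak;
    for (let n = 1; key in map; n++) key = `${b.weak}@${n}`;
    map[key] = `${b.strong}@${index}`; // original block number rides along
  });
  return map;
}

// Lookup during sliding detection: walk weak, weak@1, weak@2, ... and
// compare the strong digest each time. Returns the block number, or -1.
function findBlock(map: Record<string, string>, weak: string, strong: string): number {
  let key = weak;
  for (let n = 1; key in map; n++) {
    const [s, index] = map[key].split("@");
    if (s === strong) return Number(index);
    key = `${weak}@${n}`; // try the next collision slot
  }
  return -1;
}
```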

The goal is to reduce the search complexity from O(n) to O(1).

At the same time, for files with high similarity, the sliding detection inevitably produces runs of consecutive identical blocks. If a matched region maps to consecutive block numbers in the original file, it can be merged to reduce the size of the transmitted result (a merging sketch follows the list below):

  • null: a difference block; its index in the result list serves as the identifier of the corresponding diff-block file.
  • num: the num-th block of the original file's partition result.
  • [start, end]: the consecutive run from block start to block end of the original file's partition result.
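A sketch of the merging step under that format (assuming the raw per-position result is a block number for a hit and null for a difference block):

```ts
type Entry = null | number | [number, number];

// Collapse runs of consecutive block numbers into [start, end] ranges.
function compress(raw: (number | null)[]): Entry[] {
  const out: Entry[] = [];
  for (const item of raw) {
    const last = out[out.length - 1];
    if (typeof item === "number" && typeof last === "number" && item === last + 1) {
      out[out.length - 1] = [last, item]; // two consecutive hits start a range
    } else if (typeof item === "number" && Array.isArray(last) && item === last[1] + 1) {
      last[1] = item; // extend an existing range
    } else {
      out.push(item);
    }
  }
  return out;
}

console.log(compress([null, 3, 4, 5, null, 9])); // [null, [3, 5], null, 9]
```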

Finally, the difference-block files are uploaded as ordinary uploads. With HTTP/1.1 still the mainstream, uploading about 4~5 of them concurrently works best.
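A sketch of a concurrency-limited uploader (the endpoint is hypothetical):

```ts
// Upload the difference blocks, at most `limit` requests in flight at once.
async function uploadDiffs(fileId: string, diffs: Blob[], limit = 4): Promise<void> {
  let next = 0;
  async function drain(): Promise<void> {
    while (next < diffs.length) {
      const i = next++; // claimed synchronously, so no two loops share an index
      await fetch(`/api/diff/${fileId}/${i}`, { method: "PUT", body: diffs[i] });
    }
  }
  // Start `limit` loops that drain the queue concurrently.
  await Promise.all(Array.from({ length: Math.min(limit, diffs.length) }, drain));
}
```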

The complete code