Update

The implementation of breakpoint download has been updated, with some corrections and improvements based on comments.

Preface

Recently I made it through the "golden March" hiring season and received a summer internship offer. The leader of the department I will intern with assigned me some pre-employment study tasks to strengthen my knowledge of multithreading and databases, and suggested that I implement a downloader similar to the ones in their products.

Implementation approach

This article focuses on the implementation of the download module. At the moment I am still developing and optimizing single-task downloading; if good ideas come out of the follow-up work, I will share them in a later update. The project address is github.com/SirLYC/Yuch… (under development).

Breakpoint downloading

First of all, the downloader supports resumable (breakpoint) downloads. Resuming is built on a basic piece of HTTP knowledge: the Range request header. For example, if a file is 500 bytes and I want to start downloading from the 200th byte, I add a Range entry to the request headers with the value bytes=200-. So at implementation time we need to record how much has been downloaded, and when resuming we can continue from that position, saving the user's traffic.
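For instance, with Java's HttpURLConnection this is just a matter of setting the Range header before connecting. This is only a sketch of the idea, not the project's actual code; fileUrl and downloadedBytes are hypothetical names.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: resume a download from a previously recorded offset.
HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
conn.setRequestProperty("Range", "bytes=" + downloadedBytes + "-"); // request from downloadedBytes to the end
conn.connect();
if (conn.getResponseCode() == HttpURLConnection.HTTP_PARTIAL) { // 206 Partial Content: the server honored the range
    InputStream in = conn.getInputStream(); // the stream starts at the requested offset
    // ... append the stream to the local file starting at downloadedBytes
}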

But not all servers support resuming. Therefore, you can make a probe request before the real download, add the Range field to it, and obtain the file length (Content-Length header) at the same time.

In the comments I was asked how the file length can be obtained from a Range request. The Range value I send is bytes=0-, which requests the file starting from byte 0. So if the request returns normally and carries a Content-Length header, that value must be the total length of the file. bytes=0- requests the entire file, but servers that support resuming return 206 Partial Content for it (I have tested several links and this is indeed the case).

I can't say for sure, but according to the protocol, when a request carries a Range field the server is supposed to check whether the range is valid, and if it is valid and ranges are supported, return 206.

But what if the server returns 416 when resuming is not supported? In that case we cannot get the file length for servers without resume support, so my earlier code might have problems. There is actually a header called If-Range: with it, a server that supports resuming returns 206, while one that does not returns 200, which solves the problem.

Since with bytes=0- a server that supports resuming might conceivably still return 200, I also thought of another approach: request just one byte. Send the request with Range: bytes=0-0; a server that supports resuming returns a Content-Range header, from which the total length of the file can be read. This way, when resuming is supported only one byte is downloaded, and it does not really matter whether the response is 206 or 200 (the entire file), since the status code alone tells us which case we are in.
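Here is what that probe could look like in Java. This is my own sketch, not necessarily how the project does it: it requests a single byte with Range: bytes=0-0 and, on a 206 response, reads the total length from the part of Content-Range after the slash (e.g. "bytes 0-0/1048576").

import java.net.HttpURLConnection;
import java.net.URL;

// Probe sketch: returns the total file length if the server supports ranges, -1 otherwise.
static long probeTotalLength(String fileUrl) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
    conn.setRequestProperty("Range", "bytes=0-0"); // ask for the first byte only
    conn.connect();
    try {
        if (conn.getResponseCode() == HttpURLConnection.HTTP_PARTIAL) {
            // Content-Range looks like "bytes 0-0/123456"; the number after '/' is the total size.
            String contentRange = conn.getHeaderField("Content-Range");
            if (contentRange != null && contentRange.contains("/")) {
                return Long.parseLong(contentRange.substring(contentRange.indexOf('/') + 1).trim());
            }
        }
        return -1; // 200 or anything else: treat as "resume not supported" in this sketch
    } finally {
        conn.disconnect();
    }
}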

What if the server does not behave as expected? My view is that a product should first handle the majority of cases, and for downloading the majority case is defined by the network protocol: we assume the server implements it as the protocol requires, which is why I simply use the method above to check whether resuming is supported. In a real production environment, if you run into non-compliant servers you have to add special handling, but whether that special handling is really necessary is a matter of opinion.

Here is a brief description of how file downloading works. For a GET request, the server first returns the complete response headers; when downloading a file, the response body is typically delivered as a stream, and the headers indicate that the body will arrive as one. HTTP runs on top of TCP, so this stream is really a TCP byte stream, which in Java corresponds to an InputStream. A stream can be thought of as a forward-only pointer to the next byte to be read: each byte must be read before the byte after it. Therefore, if pause and resume do not use range requests, you have to read through all previously downloaded bytes again, which obviously wastes time and traffic.

Multithreaded download

The first thing to know is that multithreaded downloading is built on top of resumable (range) downloading. A file is essentially binary data that can be split into segments, with each thread downloading one segment. Each thread therefore needs to control the start and end offsets of its request, and each thread is assigned its own segment. It follows that a file on a server that does not support resuming cannot be downloaded with multiple threads.
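As an illustration of the splitting (my own sketch, not the project's actual code), a file of totalLength bytes can be divided into one inclusive byte range per thread, with the last segment absorbing the remainder:

// Sketch: divide [0, totalLength) into threadCount ranges, one per download thread.
static long[][] splitRanges(long totalLength, int threadCount) {
    long[][] ranges = new long[threadCount][2];
    long segmentSize = totalLength / threadCount;
    for (int i = 0; i < threadCount; i++) {
        ranges[i][0] = i * segmentSize;              // inclusive start offset
        ranges[i][1] = (i == threadCount - 1)
                ? totalLength - 1                    // last segment takes the remainder
                : (i + 1) * segmentSize - 1;         // inclusive end offset
    }
    return ranges;
}
// Each thread then requests "Range: bytes=" + start + "-" + end for its own segment.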

So why does multithreaded downloading speed things up? The first obvious point is that multithreading can take advantage of multi-core CPUs to do more work in the same amount of time. In practice, though, this alone does not increase the speed much, because the total bandwidth at the receiving end is fixed. Imagine this scenario:

Is more threads always better? Obviously not. A thread is a fairly heavy object: creating threads and managing their scheduling takes CPU time and reduces the share of time spent on the actual work. Another problem is that each thread consumes memory. So there has to be a limit on the number of download threads that are started.
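One common way to cap the number of download threads is a fixed-size thread pool, so segments queue up instead of each spawning its own thread. Again, this is only a sketch under my own assumptions, using splitRanges from the sketch above and a hypothetical downloadSegment worker (sketched further below); it is not necessarily how the project does it.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: at most maxDownloadThreads segments download concurrently.
static void startDownload(String fileUrl, String savePath, long totalLength) {
    int maxDownloadThreads = 4; // hypothetical limit; a real product would tune or configure this
    ExecutorService downloadPool = Executors.newFixedThreadPool(maxDownloadThreads);
    for (long[] range : splitRanges(totalLength, maxDownloadThreads)) {
        downloadPool.execute(() -> {
            try {
                downloadSegment(fileUrl, savePath, range[0], range[1]); // hypothetical per-segment worker
            } catch (Exception e) {
                // real code would record the failure and update the task state
            }
        });
    }
    downloadPool.shutdown(); // accept no new tasks; queued segments still run
}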

Separating the download threads from the writer thread

In the past, when writing a downloader, the common download pattern looked like this:

// pseudocode: read a chunk from the network, then write it to disk, in the same loop
while (data remains to be read) {
    n = inputStream.read(buffer)        // read up to buffer.length bytes from the network stream
    outputStream.write(buffer, 0, n)    // write those bytes straight to the file
}

With multiple threads it looks something like this:
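As a sketch of that straightforward version (my own illustration with hypothetical names; each thread both reads the network and writes the file itself), every thread opens its own connection for its range and writes into its own region of a shared, pre-sized file:

import java.io.InputStream;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: one download thread handles the inclusive byte range [start, end].
static void downloadSegment(String fileUrl, String savePath, long start, long end) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
    conn.setRequestProperty("Range", "bytes=" + start + "-" + end);
    conn.connect();
    try (InputStream in = conn.getInputStream();
         RandomAccessFile out = new RandomAccessFile(savePath, "rw")) {
        out.seek(start);                 // write into this thread's own region of the file
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);     // read and write happen in the same thread here
        }
    } finally {
        conn.disconnect();
    }
}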

During the on-site interview I also said the download could be implemented this way, and the interviewer followed up with a question: should reading and writing be placed in the same thread? Nowadays, writing to disk is generally much faster than network access. Suppose we move the disk writes into a single dedicated thread, and three threads read network byte streams of the same size into buffers at the same speed; each thread hands its buffer to the writer thread and immediately goes back to reading network data. Because writing is faster than the network download speed, the writer can finish before the next three buffers arrive, which in the ideal case means the disk writes cost no extra time.

But there are a lot of caveats in the actual implementation. First, the download threads cannot download without limit: if the writer thread is blocked while the download threads keep going, the buffered data will grow and grow and eventually cause an OOM. Another issue is the hand-off of buffers: the writer thread needs to take them and the download threads need to hand them over, which is a typical producer-consumer pattern. There are plenty of articles on how to implement this; I ultimately chose a BlockingQueue. The general flow is as follows:
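A minimal sketch of that hand-off in Java (my own illustration with hypothetical names; the actual project surely has more states and error handling): download threads put filled buffers into a bounded BlockingQueue, and a single writer thread takes them and writes to disk. The bound is what keeps the queue from growing without limit and causing an OOM.

import java.io.RandomAccessFile;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: a downloaded chunk and the file offset it should be written at.
class Chunk {
    final long offset;
    final byte[] data;
    Chunk(long offset, byte[] data) { this.offset = offset; this.data = data; }
}

class WriterThread extends Thread {
    // Bounded queue: download threads block on put() when it is full, which caps memory usage.
    private final BlockingQueue<Chunk> queue = new ArrayBlockingQueue<>(64);
    private final RandomAccessFile file;

    WriterThread(RandomAccessFile file) { this.file = file; }

    // Called by download threads after they fill a buffer; blocks if the writer has fallen behind.
    void submit(Chunk chunk) throws InterruptedException {
        queue.put(chunk);
    }

    @Override public void run() {
        try {
            while (!isInterrupted()) {
                Chunk c = queue.take();   // blocks until a download thread offers a chunk
                file.seek(c.offset);      // jump to the chunk's position in the file
                file.write(c.data);
            }
        } catch (InterruptedException ignored) {
            // interrupted when the download finishes or is cancelled; real code would flush remaining chunks
        } catch (Exception e) {
            // real code would report the error and move the task into a failed state
        }
    }
}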

The flow described above does not cover everything; there are many other aspects, such as error handling and state transitions. It is genuinely not easy to write a downloader with a good user experience and good performance.

Follow-up

At present my project only supports single-task multithreaded downloading; multi-task support and local persistence of download information have not been implemented yet.

In addition to this, I am also considering a multi-process architecture so that downloads can continue in the background after the UI exits. You are welcome to clone and run the sample, or leave some comments!

Once more, the project address: github.com/SirLYC/Yuch…