Background

Recently I tried to download a fairly large image file at home, but my network connection was so poor that the download stalled partway through every time and then failed. I used Chrome's built-in downloader and tried four or five times, failing each time. The frustrating part is that a download can get 80% or 90% of the way through and still has to start over from scratch when it fails.

Then I thought of the wget command-line tool. A quick search showed that wget supports resumable downloads: if the connection drops midway, it can continue from where it left off. So I finished the job with wget.

If you already have part of the file on disk, you can pass wget's -c option to continue the download. See nixCraft's article "Wget: Resume Broken Download".

Thinking it over

I should have been satisfied at this point. The file was downloaded, so what else was there to worry about? But one question kept nagging at me: isn't HTTP stateless? How does a repeated request tell the server that I only want a certain part of the data? And how does the client stitch the separate pieces together?

So I kept searching and found the secret behind resumable HTTP downloads: the Range header field.

Explanation

Put simply, a resumable download means the client tells the server: I have already downloaded the first N bytes, so please send me the data starting from byte N + 1 instead of sending everything again from the beginning.

HTTP/1.1 has a request header for exactly this: Range. It specifies which byte range of the entity is being requested and, like most things in programming, it counts from zero. For example, if you have already downloaded 500 bytes and want everything after them, add Range: bytes=500- to the request headers.

There are several ways to write a range:

  • 500-900: Specifies both a start and an end position, both inclusive. Because Range counts from 0, this asks the server for the 501st byte through the 901st byte. This form is typically used to fetch a segment of a very large file, such as a video.
  • 500-: Starts at the 501st byte and continues to the end. This is the form best suited to resuming a download: the client already has 500 bytes and asks for the rest.
  • -500: When no start position is given, the server sends the last 500 bytes of the file, not 500 bytes starting from 0.
  • 500-900, 1000-2000: Several ranges can be requested at once, although this is not very common.
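To make the zero-based counting concrete, here is a minimal Python sketch that turns a single range spec into the inclusive byte offsets a server would send back. The function name `resolve_range` and the 439715-byte file size are illustrative assumptions, and the sketch handles only one range at a time.

```python
def resolve_range(spec, total):
    """Map one byte-range spec ('500-900', '500-' or '-500') to inclusive
    (first, last) offsets within a file of `total` bytes.

    Returns None when the spec is invalid or cannot be satisfied.
    """
    first, _, last = spec.strip().partition('-')
    if first == '':
        # Suffix form '-N': the last N bytes of the file.
        n = int(last)
        if n == 0 or total == 0:
            return None
        return max(total - n, 0), total - 1
    start = int(first)
    if start >= total:
        return None
    # Open-ended form 'N-' runs to the end; an end past the file is clamped.
    end = total - 1 if last == '' else min(int(last), total - 1)
    if end < start:
        return None
    return start, end


for spec in ('500-900', '500-', '-500'):
    print(spec, resolve_range(spec, 439715))
```

Note that 500-900 covers 401 bytes, not 400, because both ends are included.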

Sending such a request is not enough: the server also has to support it. As mentioned earlier, this header was introduced in HTTP/1.1, so servers that only implement older versions of HTTP may not support it.

If the server supports range requests, the response contains a Content-Range field. For example, if the client sends Range: bytes=1024-, the server's response headers include something like Content-Range: bytes 1024-439714/439715. This means: understood, I am sending only bytes 1024 through 439714, and the whole file is 439715 bytes long. In addition, the status code must be 206 rather than 200; the W3 website explains this status code. That is the happy path, but there are two other outcomes to watch for:
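As an illustration, a client could pick the Content-Range value apart along these lines. This is a sketch only: the name `parse_content_range` is made up for this example, and it accepts just the `bytes first-last/total` form described above.

```python
import re

# Matches 'bytes <first>-<last>/<total>', where total may be '*' if unknown.
_CONTENT_RANGE = re.compile(r'^bytes (\d+)-(\d+)/(\d+|\*)$')


def parse_content_range(value):
    """Return (first, last, total) from a Content-Range value.

    total is None when the server reports it as '*'. Returns None for any
    other shape, including the 'bytes */total' form sent with a 416.
    """
    match = _CONTENT_RANGE.match(value.strip())
    if not match:
        return None
    first, last, total = match.groups()
    return int(first), int(last), None if total == '*' else int(total)


first, last, total = parse_content_range('bytes 1024-439714/439715')
print("server will send", last - first + 1, "of", total, "bytes")
```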

Not supported

The response returned by the server has no Content-Range field, which means the server ignored the Range field in the request. Most likely the server does not support the feature, and you have no choice but to download from scratch. Note: most HTTP servers support range requests today, so this case is rare.

Supported, but it starts from the beginning anyway

The response returned by the server does contain Content-Range, but the data starts again from 0. This happens when the server is implemented more strictly and accounts for the possibility that the file has been modified.

What does that mean? Consider this situation: yesterday I downloaded half of a file, and today I resume with wget -c. But at some point in between, the file was updated on the server. The URL has not changed, but the contents have. If I carry on and append the new content to what I already have, the result will probably be a corrupt file that I cannot open. That is even worse than downloading from scratch: all that effort for a file that does not work.

If the files or other resources a server offers for download can change, there has to be a way to identify a specific version of a resource, so that the server can tell whether the version I downloaded earlier is the same as the current one. HTTP has two mechanisms for this, naturally in the form of two header fields: ETag and Last-Modified, both defined in RFC 2616.

  • ETag: You can think of an ETag as something like an MD5 hash of the file: a string that identifies one particular version of it.
  • Last-Modified: As the name implies, this is the time at which the file was last modified.

If the server is strict and checks whether the file has changed since the last download, it must provide at least one of ETag or Last-Modified in the response to the first download. The resume request then sends that value back in the If-Range field (If-Range only has an effect together with Range; on its own it is ignored by the server).

If the file is the same as before, the server returns 206 and the download continues from where it stopped. Otherwise it returns 200, indicating that the file has changed and the download must start over.

If the range sent by the client is invalid, the server returns 416. The response carries a Content-Range field such as bytes */439715, which says that the requested range cannot be satisfied and that the total file size is 439715 bytes.
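A minimal sketch of how a client might build the headers for such a conditional resume request is shown below. The function name `resume_headers` and the sample values are hypothetical. One detail it accounts for: the range-request specification does not allow weak ETags (those prefixed with W/) in If-Range, so the sketch falls back to the Last-Modified date in that case.

```python
def resume_headers(offset, etag=None, last_modified=None):
    """Build request headers to resume a download at byte `offset`.

    `etag` and `last_modified` are the validator values saved from the
    first response, if the server provided them.
    """
    headers = {}
    if offset <= 0:
        return headers  # Nothing on disk yet: make a plain request.
    headers['Range'] = 'bytes={}-'.format(offset)
    if etag and not etag.startswith('W/'):
        headers['If-Range'] = etag
    elif last_modified:
        headers['If-Range'] = last_modified
    return headers


print(resume_headers(500, etag='"abc123"'))
print(resume_headers(500, etag='W/"abc123"',
                     last_modified='Wed, 21 Oct 2015 07:28:00 GMT'))
```

With If-Range set, a 206 response can be appended to the partial file, while a 200 response carries the whole file and should replace it.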
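Putting the status codes together, the client's decision can be sketched as a small function. The name `next_action` and the four action labels are invented for this illustration. It also assumes that a 416 whose reported total equals the local file size means the download is already complete.

```python
def next_action(status_code, content_range, offset):
    """Decide what to do with the response to 'Range: bytes=<offset>-'.

    Returns one of:
      'append'    - body continues the partial file; append it
      'overwrite' - body is the whole file; write it from scratch
      'complete'  - nothing left to download
      'refetch'   - response is unusable; request the whole file again
    """
    if status_code == 206:
        # Expect 'bytes <offset>-<last>/<total>'.
        unit, _, spec = (content_range or '').partition(' ')
        first = spec.split('-')[0]
        if unit == 'bytes' and first.isdigit() and int(first) == offset:
            return 'append'
        return 'refetch'
    if status_code == 416:
        # 'bytes */439715' carries the real file size after the slash.
        total = (content_range or '').rpartition('/')[2]
        if total.isdigit() and int(total) == offset:
            return 'complete'
        return 'refetch'
    if status_code == 200:
        return 'overwrite'
    return 'refetch'


print(next_action(206, 'bytes 500-439714/439715', 500))
print(next_action(416, 'bytes */439715', 439715))
print(next_action(200, None, 500))
```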

Implementation

Once you understand the principle above, writing your own resumable download program is quite simple.

The code below implements the behavior of wget -c, with the following effect:

  • The first time a file is downloaded, a new file is created and the download begins.
  • On a later run, if the file already exists, a Range request is sent to continue from the bytes already downloaded.
  • If the server responds with a matching Content-Range, the download continues.
  • If the server does not support ranges, or sends the file again from the beginning (we do not send an If-Range header), the existing file is overwritten and the download starts over.

To use If-Range, you would need to store the ETag or Last-Modified value from the first response. For simplicity, we will not implement that here.
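For readers who want to add that step, one possible approach is to keep the validators in a small sidecar file next to the download. The sketch below is an assumption, not part of the original program: the `.meta` suffix, the JSON format, and both function names are made up for illustration. The header lookup relies on exact key names, which works with the case-insensitive headers object from Requests or with a plain dict that uses these spellings.

```python
import json
import os


def save_validators(filename, response_headers):
    """Store ETag / Last-Modified from the first response beside the file."""
    meta = {key: response_headers[key]
            for key in ('ETag', 'Last-Modified')
            if key in response_headers}
    with open(filename + '.meta', 'w') as fd:
        json.dump(meta, fd)


def load_validators(filename):
    """Return the saved validators, or an empty dict if none were stored."""
    path = filename + '.meta'
    if not os.path.isfile(path):
        return {}
    with open(path) as fd:
        return json.load(fd)
```

On a later run, the loaded value would go into the If-Range header of the resume request.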

    import os
    import sys

    import requests


    def file_size(filename):
        return os.stat(filename).st_size


    def download(url, chunk_size=65535):
        downloaded = 0  # Number of bytes already on disk.
        filename = url.split('/')[-1]  # Use the last part of the URL as the filename.
        if os.path.isfile(filename):
            downloaded = file_size(filename)
            print("File already exists. Send resume request after {} bytes".format(
                downloaded))

        # Add a `Range` header when there is something to resume.
        headers = {}
        if downloaded:
            headers['Range'] = 'bytes={}-'.format(downloaded)

        res = requests.get(url, headers=headers, stream=True, timeout=15)

        # 416 means the requested range starts at or past the end of the file,
        # which we take to mean the file is already fully downloaded.
        if res.status_code == 416:
            print("File download already complete.")
            return
        res.raise_for_status()

        mode = 'wb'  # Binary mode: iter_content() yields bytes.
        # Check if the server supports ranges and honored our request.
        if res.status_code == 206:
            # Content-Range looks like `bytes 327675-43968289/43968290`; check
            # that it starts from where we asked.
            content_range = res.headers.get('content-range')
            if content_range and \
                    int(content_range.split(' ')[-1].split('-')[0]) == downloaded:
                mode = 'ab'
        if mode == 'wb':
            downloaded = 0  # Starting over, so reset the byte counter.

        # Content-Length may be absent (for example with chunked encoding).
        content_len = res.headers.get('content-length')
        if content_len is not None:
            print("{} bytes to download.".format(int(content_len)))

        with open(filename, mode) as fd:
            for chunk in res.iter_content(chunk_size):
                fd.write(chunk)
                downloaded += len(chunk)
                print("{} bytes downloaded.".format(downloaded))

        print("Download complete.")


    if __name__ == '__main__':
        url = 'http://dldir1.qq.com/qqfile/QQforMac/QQ_V4.0.4.dmg'
        if len(sys.argv) == 2:
            url = sys.argv[1]
        download(url)

We used the Requests library for the HTTP requests. The code is fairly straightforward, with comments added where needed.

With a few tweaks, it can be packaged into a decent download tool.