A Close Look at the Technology Behind Alibaba's Double 11: Dragonfly, a PB-Scale File Distribution System

Introduction: On Tmall Double 11 in 2017, transactions peaked at 325,000 per second, payments at 256,000 per second, and database processing at 42 million per second, setting new records once again. During Double 11, Dragonfly, part of Alibaba Group's infrastructure, delivered 5GB data files to tens of thousands of servers at the same time, handling large-scale file distribution smoothly.

By solving the problems that arise in scenarios such as large-scale file download and cross-network isolation, Dragonfly greatly improves service capabilities such as data preheating and large-scale container image distribution. It averages more than 2 billion distributions per month and moves 3.4PB of data. Container image distribution can be up to 57 times faster than in native mode, and outbound Registry network traffic can be reduced by more than 99.5%. Today, Ali Sister has invited Rubo, a senior technical expert in Alibaba's infrastructure business group, to describe in detail Dragonfly's technical path from file distribution to image transmission.

The birth of Dragonfly

As Alibaba's business grew explosively, the release system was handling more than 20,000 releases per day on average by 2015. Many applications had grown very large, and the release failure rate began to climb. The root cause was that a release has to pull a large number of files, and the file servers could not withstand that many requests. Scaling out the file servers was the obvious answer, but after scaling out we found that the back-end storage had become the bottleneck instead. In addition, a large number of client requests from different IDCs consumed a great deal of network bandwidth and caused network congestion.

At the same time, many businesses were going international, and a large number of applications were deployed overseas. Overseas servers had to download from the domestic origin, which wasted a lot of international bandwidth and was very slow. When a large file was transferred over a poor network, any failure meant starting again from scratch, so efficiency was very low.

It was therefore natural to think of P2P, which is hardly a new technology. We surveyed many systems in China and abroad at the time, but concluded that neither their scale nor their stability met our expectations. That is how Dragonfly came about.

Design goals

To address these pain points, Dragonfly set several goals at the beginning of its design:

1. Keep file sources from being overwhelmed: build P2P networks between hosts to relieve the pressure on file servers and save network bandwidth across IDCs.

2. Accelerate file distribution, and ensure that when tens of thousands of servers download at the same time, each server's download speed does not fluctuate much.

3. Accelerate cross-border downloads and save cross-border bandwidth.

4. Support large file downloads, including resuming interrupted downloads.

5. Keep disk I/O and network I/O on hosts controllable, to avoid affecting the services running there.

System architecture

Overall architecture of Dragonfly

The overall architecture of Dragonfly has three layers. The first layer is the Config Service, which manages all Cluster Managers; each Cluster Manager in turn manages its Hosts. A Host is a terminal, and dfget is a client program similar to wget.

The Config Service is mainly responsible for Cluster Manager management, client node routing, system configuration management, and preheating services. Put simply, it tells each Host the address list of the Cluster Managers nearest to it, and it maintains and updates that list regularly so that the Host can always find its nearest Cluster Manager.

The Cluster Manager has two main responsibilities:

1. Download files from file sources in passive CDN mode and generate a group of seed block data;

2. Construct a P2P network and schedule each peer to transfer the specified block data to each other.

The syntax of dfget is very similar to that of wget. Its main functions are file download and P2P sharing.

Inside Alibaba, we can use StarAgent to issue dfget commands so that a group of machines downloads a file at the same time. In some scenarios that group may be every server at Alibaba, so this is very efficient. Besides the client, Dragonfly also has a Java SDK that lets you “push” files to a set of servers.

The following figure shows how the components interact when two terminals call dfget at the same time to download the same file:

Dragonfly P2P networking logic diagram

First, the CM (Cluster Manager) checks whether the file is in its local cache. If not, it downloads the file from the source. The file is split into fragments: the CM downloads these fragments with multiple threads and, at the same time, makes each downloaded fragment available to Hosts. Once a Host has downloaded a fragment, it in turn offers that fragment to its peers. This continues until every Host has the complete file.
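The fragment flow described above can be sketched in a few lines of Python. This is an illustrative toy, not Dragonfly's implementation: the piece size, the names `split_into_pieces` and `distribute`, and the rule that a host always takes a piece from the first peer that holds it are all assumptions made for the example.

```python
from collections import defaultdict

PIECE_SIZE = 4  # bytes; hypothetical and tiny so the example is easy to follow


def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Cut the file into fixed-size fragments (the last one may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def distribute(data, hosts):
    """Give every host every piece, preferring a peer over the CM as the source."""
    pieces = split_into_pieces(data)
    holders = defaultdict(list)            # piece index -> hosts that hold it
    received = {h: {} for h in hosts}      # host -> {piece index: bytes}
    served_by = {"cm": 0, "peer": 0}       # who served how many piece requests

    for index, piece in enumerate(pieces):
        for host in hosts:
            if holders[index]:
                # A peer already has this piece, so fetch it from that peer.
                source = holders[index][0]
                received[host][index] = received[source][index]
                served_by["peer"] += 1
            else:
                # Nobody has it yet: the CM serves the seed copy.
                received[host][index] = piece
                served_by["cm"] += 1
            holders[index].append(host)

    files = {h: b"".join(received[h][i] for i in range(len(pieces)))
             for h in hosts}
    return files, served_by
```

In this toy model the CM serves each piece only once, and every other request is answered by a peer, which is the effect the P2P design is after.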

During a download, each fragment that has been downloaded is recorded in local metadata. If the download is interrupted and the dfget command is run again, it resumes from where it stopped.
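A minimal sketch of this kind of resumable download, assuming the metadata is a JSON file listing the indices of completed fragments. The file format, the function name, and the `fetch` callable are hypothetical; the article does not describe how dfget actually stores its metadata.

```python
import json
import os


def download_with_resume(piece_count, fetch, meta_path):
    """Fetch only the pieces not yet recorded in the metadata file.

    `fetch(index)` is a hypothetical callable that downloads one piece and
    may raise. Returns the piece indices fetched during this call.
    """
    done = set()
    if os.path.exists(meta_path):
        with open(meta_path) as f:
            done = set(json.load(f))

    fetched_now = []
    for index in range(piece_count):
        if index in done:
            continue  # already downloaded in an earlier run
        fetch(index)
        done.add(index)
        fetched_now.append(index)
        # Persist progress after every piece so an interruption loses
        # at most the piece that was in flight.
        with open(meta_path, "w") as f:
            json.dump(sorted(done), f)
    return fetched_now
```

Running the function again after a failure skips everything already recorded, so only the missing fragments are transferred.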

After the download completes, the MD5 checksum is compared to ensure that the downloaded file is identical to the source file. Dragonfly uses the HTTP cache protocol to control how long files stay cached on the CM side. The CM also cleans its disks periodically to ensure there is enough space for long-running service.
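The checksum comparison can be illustrated with Python's standard `hashlib`. The sketch checks each block and then the assembled file, as the benchmark section of this article describes; the function names and the way the expected checksums are passed in are assumptions for the example.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 digest of `data` as a hex string."""
    return hashlib.md5(data).hexdigest()


def verify_download(pieces, piece_md5s, file_md5):
    """Check every piece against its expected MD5, then the whole file."""
    for index, (piece, expected) in enumerate(zip(pieces, piece_md5s)):
        if md5_hex(piece) != expected:
            raise ValueError("piece {} is corrupted".format(index))
    if md5_hex(b"".join(pieces)) != file_md5:
        raise ValueError("assembled file does not match the source")
    return True
```

Checking per piece means a corrupted fragment can be detected and fetched again without discarding the rest of the download.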

Alibaba also has many file preheating scenarios, in which files must be pushed to the CM in advance. These include container images, index files, cache files for business optimization, and so on.

After the first release went live, we ran a test round, and the results were as follows:

Comparison of traditional download and Dragonfly P2P download test results

On the X-axis is the number of clients; on the Y-axis is the download duration.

File source: test target file of 200MB (NIC: 1 Gbit/s)

Host: 100 Mbit/s NIC

CM: 2 servers (24 cores, 64GB; NIC: 1 Gbit/s)

Two things can be seen from this graph:

1. In traditional mode, download time grows as the number of clients increases, whereas dfget can support up to 7,000 clients.

2. In traditional mode, there is no data after 1200 clients, because the data source is overwhelmed.

From release systems to infrastructure

After November 11, 2015, Dragonfly was handling 120,000 downloads per month and distributing 4TB of data. At that time other download tools were also in use at Alibaba, such as wget, curl, scp, and ftp, as well as small-scale file distribution systems that teams had built for themselves. Besides fully covering our own release system, we also promoted Dragonfly on a small scale. By November 11, 2016, it was handling 140 million downloads per month and distributing 708TB, roughly a thousandfold increase in volume.

After November 11, 2016, we set a higher goal: Dragonfly should carry 90% of Alibaba's large-scale file distribution and large-file distribution business.

With this goal we hoped to hone the best possible P2P file distribution system, and also to unify all the file distribution systems within the group. Unification benefits more users, but unification is never the end in itself. Its purpose is: 1. to reduce redundant construction; 2. to enable global optimization.

Once Dragonfly optimizes one thing, the whole group benefits. For example, we found that system files are distributed across the entire network every day, and compressing this one kind of file saves the company 9TB of network traffic per day. Cross-border bandwidth is especially valuable. This kind of global optimization is impossible if everyone uses their own distribution system.

So unification is imperative!

Based on extensive data analysis, we estimated that the group as a whole performed about 350 million distributions per week, and that our share at the time was less than 10%.

After half a year of effort, we reached the goal in April 2017, with a business share above 90%. Volume grew to 300 million distributions per week (broadly consistent with our earlier analysis), and 977TB was distributed per week, more than in a whole month half a year earlier.

Of course, this is inseparable from containerization at Alibaba: image distribution accounts for about half of the traffic. Below we describe how Dragonfly supports image distribution. Before that, let's talk about Alibaba's container technology.

Ali’s container technology

The advantages of container technology need no further introduction. Globally, container technology is dominated by Docker, which holds most of the market. There are other solutions as well, such as rkt, Mesos Uni Container, and LXC; Alibaba's container technology is named Pouch. As early as 2011, Alibaba independently developed T4, a container technology based on LXC. At that time we had not yet created the concept of an image, so T4 was still used like a virtual machine, though a much lighter one.

In 2016, Alibaba made a major upgrade to T4, which evolved into today's Pouch and has since been open sourced. Pouch now covers almost all business divisions of Alibaba Group: online businesses are 100% containerized, at a scale of hundreds of thousands. Image technology extends the application boundary of container technology, and in an application scenario as large as Alibaba's, how to achieve efficient image distribution has become a major question.

Back to images themselves. At the macro level, Alibaba runs containers at very large scale; at the micro level, the quality of application images varies from one application to the next.

In theory, application size should not differ much between the image approach and the traditional “baseline” approach. In practice it depends entirely on how well the Dockerfile is written and how sensibly the image is layered. Alibaba does have internal best practices, but teams differ in how well they understand and adopt them, so results vary. Especially early on, it was not uncommon for people to build images of 3 to 4GB.

So as a P2P file distribution system, Dragonfly has a real role to play: no matter how large the image, how many machines it must reach, or how poorly the image is built, we provide very efficient distribution that does not become a bottleneck. This bought us time to promote container technology quickly, letting everyone adopt the container operations model and digest it fully.

Container images

Before discussing image distribution, let's briefly look at container images. We can run the docker history ubuntu:14.04 command to inspect the ubuntu:14.04 image. The result is as follows:

Note that the image layer d2a0ecffe6fa contains nothing; it is what is called an empty image.

An image is layered, and each layer has its own ID and size. Here there are 4 layers, and together they make up the image.

A Docker image is built from a Dockerfile. Consider a simple Dockerfile:

The image construction process is shown below:

As you can see, the new image is generated from the base image layer by layer. With each software installation, a layer is added to the existing image.

When the container is started, a writable layer is loaded onto the top layer of the image. This writable layer is also called the “container layer”. Below the container layer are the “image layers”, which are read-only.

If the image layer content is empty, the corresponding information will be described in the image JSON file. If the image layer content is not empty, it will be stored as a file in OSS.

Image distribution

Docker image download flowchart

Taking Alibaba Cloud Container Service as an example, traditional image transfer works as shown in the figure above. This is of course the most simplified architecture; real deployments are much more complex and must also take authentication, security, high availability, and so on into account.

As the figure shows, image transfer has problems similar to those of file distribution. When 10,000 Hosts request the Registry at the same time, the Registry becomes a bottleneck. In addition, when overseas Hosts access a domestic Registry, they suffer wasted bandwidth, higher latency, and lower success rates.

The docker pull execution process is described below:

Docker image hierarchical download diagram

The Docker Daemon calls the Registry API to obtain the Manifest of the image, from which the URL of each layer can be calculated. The Daemon then downloads all the image layers from Registry to the Host local repository in parallel.
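As an illustration of this step, the sketch below derives one blob URL per layer from a manifest and then fetches the layers in parallel. It assumes the Docker Registry HTTP API V2 layout, in which a layer blob is served at `/v2/<name>/blobs/<digest>` and a schema 2 manifest lists its layers under `layers`. The registry host, the digests, and the `fetch` callable are made up for the example; this is not how the Docker Daemon is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor


def layer_urls(registry, repository, manifest):
    """Return one blob URL per layer listed in an image manifest (schema 2)."""
    return [
        "{}/v2/{}/blobs/{}".format(registry, repository, layer["digest"])
        for layer in manifest["layers"]
    ]


def pull_layers(urls, fetch, max_workers=4):
    """Download every layer in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# Hypothetical manifest with made-up digests, trimmed to the fields used here.
manifest = {
    "layers": [
        {"digest": "sha256:aaa", "size": 10},
        {"digest": "sha256:bbb", "size": 20},
    ]
}
urls = layer_urls("https://registry.example.com", "library/ubuntu", manifest)
```

Because each layer ends up as an independent URL, any file-oriented downloader can take over the transfer of a single layer.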

So in the end, the image transfer problem becomes a problem of downloading the file of each image layer in parallel. What Dragonfly is good at is precisely transferring each layer's file from the Registry to the local repository in P2P mode.

So how does it work?

A dfget proxy is started on each Host, and all command requests from the Docker/Pouch Engine pass through this proxy, as shown below:

Dragonfly P2P container image distribution diagram

First, the docker pull command is intercepted by the dfget proxy, which then sends a scheduling request to the CM. On receiving the request, the CM checks whether the corresponding file is already cached locally. If not, it downloads the file from the Registry and generates seed block data (seed blocks can be used as soon as they are generated). If the file is cached, the CM generates block tasks directly. The requester parses its block tasks and downloads the block data from other peers or from supernodes. When all the blocks of a layer have been downloaded, that layer is complete; likewise, when all the layers have been downloaded, the entire image is complete.

In supporting container image distribution, Dragonfly also set several design goals:

1. Large-scale concurrency: the system must support 100,000 simultaneous image pulls.

2. No intrusion into the container service (Docker Daemon, Registry): no code of the container service may be changed.

3. Support for Docker, Pouch, Rocket, Hyper, and all other container and virtual machine technologies.

4. Support for image preheating: images are pushed to the Dragonfly cluster CM at build time.

5. Support for large image files: at least 30GB.

6. Security.

Native Docker vs. Dragonfly

We conducted two groups of experiments:

Experiment 1: one client

1. Test image sizes: 50MB, 200MB, 500MB, 1GB, 5GB

2. Image registry bandwidth: 15Gbps

3. Client bandwidth: dual 100 Mbit/s network environment

4. Test scale: a single download

Comparison of different modes on a single client

The average times for Native and Dragonfly (with intelligent compression disabled) are broadly similar, with Dragonfly slightly higher. This is because Dragonfly verifies the MD5 of each block during the download and the MD5 of the whole file afterwards, to ensure that the downloaded file matches the source. With intelligent compression enabled, Dragonfly takes less time than Native.

Experiment 2: multiple concurrent clients

1. Test image sizes: 50MB, 200MB, 500MB, 1GB, 5GB

2. Image registry bandwidth: 15Gbps

3. Client bandwidth: dual 100 Mbit/s network environment

4. Concurrency levels: 10, 200, 1000

Comparison of different image sizes and concurrency levels

As the figure above shows, the time gap between Dragonfly and Native mode widens significantly as the download scale grows, with speedups of up to 20 times. Source bandwidth is also critical in the test environment: if the source bandwidth is 2Gbps, the speedup can reach 57 times.

The following compares the total traffic of the downloaded files (concurrency x file size) with the back-to-source traffic (traffic downloaded from the Registry):

Comparison of Dragonfly image outbound traffic

When a 500MB image is distributed to 200 nodes, far less network traffic is used than in native Docker mode. The experimental data show that with Dragonfly the outbound traffic of the Registry is reduced by more than 99.5%; at 1000 concurrent clients, it is reduced by about 99.9%.
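A quick back-of-the-envelope check makes these percentages plausible. The sketch assumes, as a simplification not stated in the test report, that the Registry sends out exactly one copy of the image (the CM's back-to-source download) while everything else travels between peers; the function names are made up for the example.

```python
def total_demand_mb(clients, image_mb):
    """Traffic the Registry would send if every client pulled from it directly."""
    return clients * image_mb


def registry_reduction(clients, image_mb, registry_out_mb):
    """Fraction of the total demand that no longer leaves the Registry."""
    return 1 - registry_out_mb / total_demand_mb(clients, image_mb)


# One 500MB copy leaves the Registry; 200 or 1000 clients need the image.
reduction_200 = registry_reduction(200, 500, 500)    # 1 - 500/100000
reduction_1000 = registry_reduction(1000, 500, 500)  # 1 - 500/500000
```

Under that assumption the reduction is 1 - 1/200 = 99.5% for 200 clients and 1 - 1/1000 = 99.9% for 1000 clients, which matches the reported figures.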

Results in practice at Alibaba

Dragonfly has been in use at Alibaba for about two years, and over that time its business has grown rapidly. Measured by number of distributions, it now handles nearly 2 billion per month and distributes 3.4PB of data, of which container image distribution accounts for nearly half.

Trend of file vs. image distribution traffic through Dragonfly at Alibaba

The largest single distribution at Alibaba was probably the simultaneous delivery of 5GB data files to tens of thousands of servers during this year's Double 11.

Towards intelligence

Alibaba was not the first to start on AIOps, but we have invested heavily in recent years and applied it in many products. In Dragonfly it is applied as follows:

Intelligent flow control

Flow control is very common in road traffic. In China, for example, roads without a center line are limited to 40 km/h, roads with only one lane in each direction to 70 km/h, expressways to 80 km/h, and highways to 120 km/h. The limit is the same for every car, which is clearly not flexible enough: when the road is empty, road resources are wasted and overall efficiency is very low.

Traffic lights are another form of flow control. Today's traffic lights run on fixed timings and cannot make intelligent decisions based on real traffic. At the cloud conference held last October, Dr. Jian Wang lamented that the furthest distance in the world is not from the North Pole to the South Pole but from the traffic light to the camera: they sit on the same pole but have never been connected by data, so what the camera sees never turns into action by the traffic light. This wastes the city's data resources and also increases the cost of running and developing the city.

One of Dragonfly's parameters controls disk and network bandwidth utilization, letting users set how much network I/O and disk I/O may be used. As described above, this approach is very rigid. So one of the main ideas behind our current work on intelligence is that such parameters should not be set by hand; instead, they should be decided intelligently from the state of the business and of the system. The result may not be optimal at first, but after running and training for a while it automatically converges toward the optimum, keeping services stable while using as much network and disk bandwidth as possible to avoid waste.

Intelligent scheduling

Block task scheduling is the key factor that determines overall file distribution efficiency. With only simple scheduling policies, such as random scheduling or other fixed-priority scheduling, download speed tends to jitter frequently, download-time spikes become common, and overall download efficiency is poor. We went through countless attempts and explorations to optimize task scheduling. In the end, by analyzing data along many dimensions (such as machine hardware configuration, geographical location, network environment, and historical download results and speeds), mainly with a gradient descent algorithm for now and other algorithms to be tried later, the system intelligently and dynamically determines the optimal list of subsequent block tasks for the current requester.

Intelligent compression

Intelligent compression applies a compression policy to the files most worth compressing, saving a large amount of network bandwidth.

Based on current averages for container images, the compression ratio is 40%, meaning a 100MB image can be compressed to 40MB. In the 1000-concurrency scenario, intelligent compression therefore reduces traffic by 60%.
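One simple way to decide whether a file is "worth compressing" is to compress a sample and look at the resulting ratio. The sketch below does this with Python's standard `zlib`; the threshold, the sample size, and the function names are assumptions for illustration, since the article does not describe Dragonfly's actual policy. The last function reproduces the traffic arithmetic from the paragraph above.

```python
import zlib


def compression_ratio(sample):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(sample)) / len(sample)


def worth_compressing(data, threshold=0.8, sample_size=64 * 1024):
    """Compress only a leading sample and apply a hypothetical threshold."""
    sample = data[:sample_size]
    if not sample:
        return False
    return compression_ratio(sample) <= threshold


def traffic_mb(clients, image_mb, ratio=1.0):
    """Total traffic when each client receives the image at the given ratio."""
    return clients * image_mb * ratio


# 1000 clients, 100MB image, compressed to 40% of its original size.
saved = 1 - traffic_mb(1000, 100, 0.4) / traffic_mb(1000, 100)
```

Highly repetitive data passes the check, while data with no redundancy does not, so CPU time is not spent compressing files that would barely shrink.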

Security

When sensitive files are downloaded (such as secret key files or account data files), the security of the transfer must be effectively guaranteed. Dragonfly does two main things here:

1. It supports carrying HTTP header data, to work with file sources that verify permissions through headers.

2. It uses a symmetric encryption algorithm to encrypt file contents during transmission.

Open source

With the spread of container technology, distributing large files such as container images has become an important problem. To better support the development of container technology and large-scale file distribution in data centers, Alibaba decided to open source Dragonfly. Alibaba will continue to support the open source community and contribute its battle-tested technology to it. Stay tuned.

Conclusion

Dragonfly combines P2P technology with innovations such as intelligent compression and intelligent flow control to solve file distribution problems in large-scale file download and cross-network isolation scenarios, greatly improving capabilities such as data preheating and large-scale container image distribution.

Dragonfly supports a variety of container technologies and requires no modification to the container engine itself. Image distribution can be up to 57 times faster than in native mode, and outbound Registry network traffic can be reduced by more than 99.5%. Carrying PB-level traffic, Dragonfly has become an important piece of infrastructure at Alibaba, underpinning rapid business expansion and the Double 11 promotion.


