Not long ago, Alibaba Cloud's technology dream team dropped in on the CSDN online summit to explain the company's core technical competitiveness. In that series, Zeng Fuhua, a senior technical expert at Alibaba Cloud, gave a talk titled “Double 11: How CDN keeps e-commerce promotions as smooth as silk”. As the saying goes, an army is trained for a thousand days to be used in a single moment. Every campaign is backed by countless teams, countless contingency plans and countless drills. The stability of Double 11 (Singles’ Day) rests not only on innovation and advanced technology but also on a great deal of systems engineering. From the physical layer to the application layer, and from resource onboarding to online drills, every step has its own know-how. Facing an e-commerce promotion with traffic at the 100 Tbps level, how does Alibaba Cloud CDN keep everything as smooth as silk?

Speaker: Zeng Fuhua, senior technical expert and head of the Alibaba Cloud edge cloud intelligent scheduling system. His main research direction is integrated scheduling of the edge cloud and the network.

Contents:

Brief introduction to CDN development history and architecture
How CDN capacity is prepared and guaranteed in the big-promotion scenario
Key technologies such as elasticity, computing power and simulation

Development history of Alibaba Cloud CDN

2008: Alibaba Cloud CDN originated from Taobao CDN, whose main customer at that time was Taobao e-commerce.
2009: Double 11 was launched; the CDN R&D team was established; Alibaba's CDN and e-commerce have been inextricably linked ever since.
2011: As traffic infrastructure, CDN gradually expanded from serving Taobao e-commerce to serving the whole Alibaba Group.
2014: Alibaba Cloud CDN was officially commercialized; the Tengine+Swift node architecture was launched, replacing the ATS node architecture.
2015: Alibaba Cloud CDN and e-commerce moved together into the full-site HTTPS stage; the self-developed AIM1.0 intelligent scheduling system was launched.
2017: Alibaba Cloud CDN started its globalization strategy; merged Youku CDN; officially released SCDN, DCDN and other products; in the same year it was rated a global-level supplier by Gartner.
2018: Alibaba Cloud CDN carried 70% of the World Cup live-streaming traffic across the whole network; the AIM2.0 intelligent scheduling system was released.
2019: Focusing on “intelligence”, Alibaba Cloud CDN dug deep into technology and kept polishing programmable CDN, multi-dimensional resource load balancing, refined operations and various edge-scenario services.
2020: During the nationwide fight against COVID-19, Alibaba Cloud CDN supported traffic growth in live streaming, online education and other scenarios; on the technical side, CDN began its transformation toward edge cloud native and cloud-network integrated scheduling.

As the figure above shows, with each year's Double 11 and the spread of mobile Internet and video, CDN traffic has grown exponentially. To date, Alibaba Cloud CDN has 2800+ edge nodes worldwide, covering more than 3000 regions and carriers. It serves hundreds of thousands of customers around the world and accelerates more than one million domain names. Alibaba Cloud CDN has built an edge ecosystem network connecting the whole world, which handles hundreds of millions of user requests per second during the evening peak and sends billions of configuration management commands to all nodes of the network every day.

Introduction and technical architecture of CDN

As we all know, site loading speed has a huge impact on the Internet experience. According to statistics, if a site does not open within 3 seconds, close to 50% of users will choose to leave. This is especially true for large online systems, where a one-second increase in load time can reduce revenue by hundreds of millions a year. CDN is a PaaS cloud service designed to accelerate access to customer sites (although the concept of cloud computing had not yet been proposed when CDN was born). CDN is now ubiquitous and carries more than 90% of Internet traffic. It works by distributing content around the world through edge nodes with wide-area coverage. The scheduling system directs each user request to a suitable edge node, which greatly reduces access latency, and it controls back-to-origin fetches and the flow of cached content, thereby accelerating the customer site.

A more precise definition: CDN is a low-cost, highly reliable and widely covering computing infrastructure, content link capability and video-carrying platform built on carrier resources. Internally, a CDN comprises intelligent scheduling, network/protocol and supply chain management modules, together with data and security-protection ecosystem capabilities. On this foundation it provides acceleration for different scenarios such as web pages, images, video on demand, live streaming, dynamic content, government and enterprise, and security.

The technical characteristics and challenges of e-commerce promotion

Today’s big promotions are very large in scale: typically business bandwidth at the 100 Tbps level, billion-level concurrent requests and millions of CPU cores consumed. Being tempered by scenarios of this scale has driven the rapid growth of cloud products, CDN in particular.

Big-promotion activities have two characteristics.

The first is dense scheduling: activities are packed closely together and belong to different business parties; they take many forms, such as the Tmall gala live broadcast, top live-streaming hosts, red envelopes, flash sales and the midnight opening; and estimating simultaneous activities is very complex, requiring consideration of concurrent connections, bandwidth, computing power, hit ratio and other indicators.

The second is the need to redeploy flexibly under high load, which involves the following: the large resource pool runs at a high utilization level; different activity scenarios place different demands on different resource dimensions; and the scheduling coverage policy for each activity scenario must be adaptable and flexible.

Against such a complex background, how can the needs of rapidly growing business be met?

Facing the traffic flood peak head-on: how does CDN carry the big-promotion surge?

Taking the Double 11 promotion as an example, the CDN guarantee work can generally be divided into three stages: preparation, pre-battle and escort. Because the Double 11 guarantee is a systems engineering effort that combines organization and systems, it can be further broken down into steps such as demand assessment, plan preparation, demand delivery, stress testing, the network change freeze and the big-promotion escort.

I. Preparation covers demand assessment, plan preparation and demand delivery. In the demand collection and evaluation step, CDN collects business requirements from each business party, clarifies the time points, the business feature profile and the business activity report, and confirms whether any new customized functions are involved. The business demand is then converted into resource demand, covering inventory, off-peak reuse, and resource gaps and build-out. In the plan preparation step, the plans from previous years and the new plans are reviewed; plans are classified by scenario and purpose as advance plans or emergency plans; most plans can be carried out by the CDN platform alone, while some require cooperation with the business side. Delivery of customized requirements mainly involves custom development, joint debugging and testing, and online verification. During resource build-out and delivery, service simulation is run against the resources to be delivered and the service resource pools are adjusted.

II. Pre-battle preparation starts with the stress-testing step, which includes drills for disaster recovery, security, performance and functionality, and further confirms that the software and hardware systems, the staffing and organization, and the other aspects of Double 11 are ready, including whether the plans cover everything. Next comes the network change freeze: before a major event, all releases and changes are generally banned, system inspections run continuously, and checks confirm that all defects have been fixed. Pre-battle mobilization also takes place at this stage, both to raise morale and to emphasize the guarantee rules once more.

III. Once the escort stage officially begins, the escort staff provide on-site support according to the pre-arranged division of labor; if the earlier work has been done thoroughly, this stage is mostly routine. The main work is to watch the dashboards, quickly find and locate problems when anomalies appear, and launch the corresponding emergency plan as prepared. Finally, after the guarantee is over, the whole escort is reviewed and summarized so that it can serve as a reference for future promotions.

Key technologies CDN applies to guarantee the big promotion

I. How is elasticity guaranteed in the big-promotion scenario?

Because many services are deployed during the big promotion, it is critical to keep scheduling flexible while resources run at a high utilization level. How does Alibaba Cloud CDN guarantee this? As shown in the following figure (upper part), each service has its own resource preferences. To ensure elasticity, a service-resource matching mechanism is used when matching services to resources. In summary, the Alibaba Cloud CDN scheduling system has the following advantages: resource pool convergence is the most critical factor in guaranteeing service elasticity; there is no hardware isolation of nodes at the service level, so traffic flows to all nodes on the network in real time and on demand; elasticity and quality are two objectives that can be traded off against each other and tuned independently according to the actual situation (the degree of service-resource matching); and resource planning, resource build-out and resource scheduling are carried out according to overall demand across the whole pool.

As shown in the figure above (lower part), resource scheduling of the service resource pools should be optimized so that all 2800+ CDN nodes across the network rise and fall together, giving promotion services the maximum resource elasticity. Scheduling is therefore not only global load balancing but also elastic scaling. To this end, the Alibaba Cloud CDN scheduling system does the following:

Node roles are decided by the scheduling system, which removes hard constraints; traffic across service resource pools is decided by the scheduling system in real time, which allows flexible redeployment; co-location and interleaving of services on nodes is decided by the scheduling system, which allows full reuse; resource pool planning is combined with global load balancing, which allows elastic scaling.
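The idea of all nodes "rising and falling together" can be illustrated with a minimal sketch. It assumes a single resource dimension (bandwidth) and a hypothetical function name, spread_evenly; a real scheduler would also weigh coverage, quality and service-resource matching, which this sketch ignores.

```python
def spread_evenly(total_gbps, capacity_gbps):
    """Split traffic across nodes in proportion to capacity.

    Every node ends up at the same utilization, so the whole pool
    rises and falls together. Names and units are illustrative only.
    """
    total_capacity = sum(capacity_gbps.values())
    if total_capacity <= 0:
        raise ValueError("no capacity available")
    utilization = total_gbps / total_capacity
    plan = {node: cap * utilization for node, cap in capacity_gbps.items()}
    return plan, utilization


# Two hypothetical nodes with different capacities share 200 Gbps.
plan, utilization = spread_evenly(200, {"node-a": 100, "node-b": 300})
print(plan, utilization)
```

Because no node is reserved for one service, a burst in any service is absorbed by the whole pool instead of by a small isolated group of nodes.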

II. How is computing power scheduled in the big-promotion scenario?

Many people think of CDN as simply a traffic distribution system, but this view deserves a second look: in some scenarios, a massive burst of requests consumes enormous computing resources. Take Double 11 as an example. The e-commerce sites have been fully migrated to HTTPS, and at midnight on November 11, the moment the promotion opens, all the requests arrive together and form a very large spike that consumes a great deal of computing power. Without a good mechanism for computing-resource scheduling and global load balancing, service exceptions would appear over a wide area. Over the whole day, QPS on Double 11 was about 30% higher than usual. Another core challenge for CDN in the big-promotion scenario is therefore how to perform accurate global load balancing for the computing power consumed by massive service traffic.

Zeng Fuhua explained: in bandwidth or traffic scheduling, the traffic consumed by each request can be derived accurately from the logs, but it is hard to calculate accurately how much computing power each request consumes, and this is a very difficult problem. The following figure shows Alibaba Cloud CDN's solution. Given the known CPU consumption of each node and the concurrent QPS of each service on that node, the CPU consumption per unit of QPS for each service can be calculated by formula. As the data changes over time, continuous machine learning training refines the estimate of the average CPU consumption of a single request for each service. With both bandwidth and computing power consumption available, the earlier single-dimension bandwidth scheduling is extended into a multi-dimensional resource scheduling model, which generates a new global load balancing policy across resource dimensions such as bandwidth and computing power.
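One way to read the per-service CPU estimate is as a linear model: the CPU consumed on a node is roughly the sum, over services, of that service's QPS times its per-request cost. The sketch below fits that model with ordinary least squares in plain Python. It is only an illustration under that assumption; the talk mentions machine learning training without naming a method, and the function name and sample figures here are hypothetical.

```python
def estimate_cpu_per_request(qps_rows, cpu_totals):
    """Least-squares fit of cpu_totals[i] ~ sum_j qps_rows[i][j] * cost[j].

    qps_rows[i][j] is the QPS of service j in sample i (one node at one
    point in time); cpu_totals[i] is the CPU consumed in that sample.
    Returns the estimated CPU cost of one unit of QPS for each service.
    """
    n = len(qps_rows[0])
    # Build the normal equations (A^T A) x = A^T b as an augmented matrix.
    aug = []
    for j in range(n):
        row = [sum(r[j] * r[k] for r in qps_rows) for k in range(n)]
        row.append(sum(r[j] * b for r, b in zip(qps_rows, cpu_totals)))
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("samples do not separate the services")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[j][n] / aug[j][j] for j in range(n)]


# Hypothetical samples: columns are (web QPS, live-streaming QPS).
samples = [[1000, 2000], [3000, 1000], [2000, 4000]]
cpu_cores = [3.0, 6.5, 6.0]
print(estimate_cpu_per_request(samples, cpu_cores))
```

With per-service costs in hand, each service's expected CPU load on a node can be computed from its QPS and used alongside bandwidth as a second scheduling dimension.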

As mentioned earlier, during the big promotion the whole pool runs at a high resource utilization level, so the business side needs to submit an accurate business report, and the CDN platform assesses resources based on that report. However, sudden and unexpected events must always be taken into account. If a sudden surge in business exceeds the previously evaluated report, how does the CDN platform control the risk?

On the one hand, the business side needs to estimate its reports accurately; on the other hand, the CDN platform needs a rate limiting strategy to keep the overall service running smoothly. Alibaba Cloud CDN has accumulated a great deal of practical experience and capability in rate limiting. Its multi-level, all-round rate limiting covers:

Service types: live streaming, video on demand, download, dynamic acceleration and other scenarios;
Rate limiting types: bandwidth, QPS, number of connections, etc.;
Rate limiting modes: single-threshold interval limiting and multi-threshold interval limiting;
Rate limiting scopes: whole network, region, node, etc.;
Rate limiting levels: L1 limiting, L2 limiting and back-to-origin limiting.
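The modes and scopes listed above can be sketched as follows. The talk does not define "multi-threshold interval" limiting, so this sketch assumes one plausible reading: each load interval maps to a share of new requests that is admitted, and when several scopes apply the strictest one wins. All names, thresholds and units are hypothetical.

```python
def admitted_share(current_load, tiers):
    """Return the share of new requests to admit at the given load.

    tiers is an ascending list of (upper_limit, share) pairs. A single
    pair models single-threshold limiting; several pairs model
    multi-threshold limiting. Load above the last limit admits nothing.
    """
    for upper_limit, share in tiers:
        if current_load <= upper_limit:
            return share
    return 0.0


def decide(loads, policy):
    """Combine scopes (e.g. whole network, node): the strictest share wins."""
    return min(admitted_share(loads[scope], tiers) for scope, tiers in policy.items())


# Hypothetical policy: network bandwidth in Tbps, node load as utilization.
policy = {
    "network": [(90.0, 1.0), (100.0, 0.5)],  # multi-threshold
    "node": [(0.8, 1.0)],                    # single threshold
}
print(decide({"network": 95.0, "node": 0.5}, policy))  # 0.5
```

The same shape could be applied per limiting type (bandwidth, QPS, connections) and per level (L1, L2, back-to-origin) by keeping one policy table for each.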

As mentioned earlier, the big-promotion scenario requires flexible redeployment under high load, and the scheduling simulation platform is a very useful tool for this. The estimated volumes reported by each business for the promotion are fed, together with the global scheduling strategy, into the simulation, which can predict in advance where resource bottlenecks and risk points will occur. Resources are then supplemented and strategies adjusted for the affected businesses, and the business resource pools are adjusted iteratively until the risk points are cleared on the simulation platform. Beyond the big-promotion scenario, the simulation platform also helps accelerate the functional evolution of the core CDN scheduling system, by showing from a global perspective whether the impact of a new component is positive or negative.
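A very reduced version of this kind of simulation is to add up the reported demand of all businesses per time slot and compare it with pool capacity in each resource dimension. The sketch below does only that; the record fields, service names and capacity figures are hypothetical, and a real simulation would replay the actual scheduling strategy node by node.

```python
from collections import defaultdict


def find_resource_gaps(reports, capacity):
    """Return {slot: {dimension: shortfall}} for slots that exceed capacity.

    reports is a list of dicts with a "slot" key plus one numeric value
    per resource dimension named in capacity.
    """
    demand = defaultdict(lambda: defaultdict(float))
    for report in reports:
        for dim in capacity:
            demand[report["slot"]][dim] += report[dim]
    gaps = {}
    for slot, totals in demand.items():
        short = {d: totals[d] - capacity[d] for d in capacity if totals[d] > capacity[d]}
        if short:
            gaps[slot] = short
    return gaps


# Hypothetical business reports for two time slots.
reports = [
    {"service": "gala-live", "slot": "20:00", "bandwidth_tbps": 70, "cpu_cores": 200000},
    {"service": "gala-live", "slot": "00:00", "bandwidth_tbps": 60, "cpu_cores": 150000},
    {"service": "flash-sale", "slot": "00:00", "bandwidth_tbps": 50, "cpu_cores": 700000},
]
capacity = {"bandwidth_tbps": 100, "cpu_cores": 1000000}
print(find_resource_gaps(reports, capacity))
```

Each reported gap marks a risk point; the loop described above would then add resources or adjust strategy and rerun the check until no gaps remain.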

Summary of the simulation platform:

First, iterating on sand-table simulation gives a zero-cost trial-and-error path and speeds up the evolution of components such as the core scheduling system.
• The test platform and the simulation platform jointly safeguard stability;
• Business strategy and resource adjustments are assessed in real time.

Second, big-promotion simulation makes it possible to predict risks in advance and to connect business reporting, resource build-out and the whole management and control process.
• The resource gap and the load increment are derived accurately;
• The promotion activity matrix is simulated and the business resource pools are iterated.

The above is a summary of the practical experience Alibaba Cloud CDN has built up over years of guaranteeing e-commerce promotions. Alibaba Cloud CDN has also made considerable technical progress in edge cloud native and cloud-network integrated scheduling, which we will continue to share later. Thank you for watching.