Alibaba Sigma scheduling and cluster management system architecture in detail

Introduction

Alibaba has seen a 280-fold increase in transaction volume and an 800-fold increase in peak transaction rate, along with explosive growth in the number of systems. In supporting Double 11, the complexity of the systems and the difficulty of supporting them have grown exponentially. The essence of handling the Double 11 peak is to maximize user experience and cluster throughput at a limited, reasonable cost.

This article explains in detail how Alibaba supports such a huge system from three angles: Alibaba’s unified scheduling system, its mixed deployment architecture, and its cloud architecture.

Unified scheduling system

Sigma, started in 2011, is the scheduling system for Alibaba’s online business. Around Sigma there is a complete cluster management system centered on scheduling.

Sigma consists of three coordinating layers of "brains": Alikenel, SigmaSlave, and SigmaMaster. Alikenel is deployed on every physical machine and enhances the kernel, flexibly adjusting resource allocation and time slices according to priorities and policies. Task latency, the preemption of task time slices, and the eviction of unreasonable preemptions can all be decided by rules configured at the upper layers. SigmaSlave handles container CPU allocation and emergency scenarios on the local machine. Because the local Slave can decide quickly, it can respond to interference with latency-sensitive tasks and avoid the service loss that a long global decision would cause. SigmaMaster is the central brain: it has a global view and makes resource allocation and algorithm optimization decisions for container deployment across a large number of physical machines.

The whole architecture is designed around the desired (final) state. When a request is received, its data is written to the persistent storage layer. The scheduler recognizes the scheduling requirement and decides where resources are allocated, and the Slave detects the state change and drives allocation and deployment locally. The system’s overall coordination and eventual consistency are very good. We started building the scheduling system in 2011, rewrote it in Go in 2016, and made it compatible with the Kubernetes API in 2017. We hope to build and grow it together with the ecosystem.
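The desired-state flow described above (persist the request, let the scheduler pick a location, let the node-local agent converge on it) can be illustrated with a toy sketch. This is not Sigma's code: the names `Store`, `schedule`, and `reconcile`, and the "most free slots" placement rule, are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Stand-in for the persistent storage layer; it records desired state only."""
    desired: dict = field(default_factory=dict)  # container name -> node name or None

    def submit(self, container: str) -> None:
        # A request is persisted first; nothing is deployed at this point.
        self.desired.setdefault(container, None)


def schedule(store: Store, free_slots: dict) -> None:
    """Central scheduler: pick a node for every container that has none yet."""
    for container, node in store.desired.items():
        if node is not None:
            continue
        # Toy rule (an assumption): choose the node with the most free slots.
        target = max(free_slots, key=free_slots.get)
        if free_slots[target] > 0:
            store.desired[container] = target
            free_slots[target] -= 1


def reconcile(store: Store, node: str, running: set):
    """Node-local agent: compare what runs here with what the store assigns here.

    Returns (running_after, started, stopped).
    """
    wanted = {c for c, n in store.desired.items() if n == node}
    to_start = wanted - running
    to_stop = running - wanted
    return (running | to_start) - to_stop, sorted(to_start), sorted(to_stop)


if __name__ == "__main__":
    store = Store()
    store.submit("web-1")
    store.submit("web-2")
    schedule(store, {"node-a": 1, "node-b": 2})
    print(reconcile(store, "node-a", {"old-job"}))
```

Because the agent only compares desired and actual state, running `reconcile` a second time with the converged set produces no further actions, which is what makes the design eventually consistent.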

Mixed deployment architecture

Alibaba began promoting a mixed deployment (co-location) architecture in 2014, and it is now deployed at large scale inside Alibaba. Online services are long-lived tasks with complex rules and policies, and they are latency-sensitive. Computing tasks, in contrast, are short-lived, highly concurrent, and high-throughput; they have varying priorities and are not latency-sensitive. Because the two kinds of workload make fundamentally different demands on scheduling, the mixed deployment architecture runs two schedulers in parallel: one physical machine is scheduled by both Sigma and Fuxi, which unifies the base environment. Sigma starts PouchContainer containers through SigmaAgent. Fuxi also claims resources on the same physical machine to start its own computing tasks. PouchContainer allocates server resources to the online tasks and runs them. Offline tasks fill the space that the online containers leave empty, so that the physical machine’s resources are fully used. This completes the mixed deployment of the two kinds of tasks.
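As a rough illustration of "offline tasks fill the empty space", the sketch below admits queued offline tasks onto one machine until CPU usage would exceed a target. The function name `fill_with_offline`, the greedy rule, and the 40% target are assumptions for illustration; the article does not describe how Fuxi actually claims resources.

```python
def fill_with_offline(capacity_mcores, online_usage_mcores, offline_queue,
                      target_percent=40):
    """Greedily admit offline tasks until machine CPU usage would pass the target.

    All CPU amounts are integer millicores.
    Returns (admitted_tasks, final_usage_mcores).
    """
    budget = capacity_mcores * target_percent // 100
    usage = online_usage_mcores
    admitted = []
    for task_mcores in offline_queue:
        # Skip any task that would push usage over the budget; try the next one.
        if usage + task_mcores <= budget:
            admitted.append(task_mcores)
            usage += task_mcores
    return admitted, usage


if __name__ == "__main__":
    # A 32-core machine whose online containers use about 10% of the CPU.
    print(fill_with_offline(32000, 3200, [4000, 4000, 4000, 1000]))
```

If online usage already exceeds the budget, nothing is admitted, which reflects the priority of online work over offline work.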

Key technologies of mixed deployment

Key techniques of kernel resource isolation

For CPU hyper-threading (HT) isolation, the Noise Clean kernel feature resolves contention between online and offline tasks for hyper-thread resources.
For CPU scheduling isolation, Task Preempt is added to CFS to raise the scheduling priority of online tasks.
For CPU cache isolation, CAT is used to isolate last-level cache (LLC) ways between online and offline tasks (Broadwell and later).
For memory isolation, CGroup isolation and OOM priority are used; Bandwidth Control reduces the offline quota to achieve bandwidth isolation.
For memory elasticity, the mixed deployment effect is improved without adding memory: when online memory is idle, offline tasks may exceed their memcg limit; when online tasks need the memory, offline tasks release it promptly.
For network QoS isolation, management and control traffic is marked gold, online traffic silver, and offline traffic bronze, and bandwidth is guaranteed by grade.
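The memory elasticity item in the list above can be sketched as a simple calculation: the offline limit grows by whatever the online side is not using, and shrinks back when online usage rises. The function names, the safety margin, and the gigabyte figures are all hypothetical; the real mechanism lives in the kernel and is not described in detail here.

```python
def offline_memory_limit(total_gb, online_reserved_gb, online_used_gb, margin_gb=0):
    """Memory limit for offline tasks, extended by idle online memory.

    margin_gb is an assumed safety buffer kept free for online bursts.
    """
    base = total_gb - online_reserved_gb  # offline's normal share
    idle_online = max(0, online_reserved_gb - online_used_gb - margin_gb)
    return base + idle_online


def memory_to_release(offline_used_gb, new_limit_gb):
    """How much offline tasks must give back after the limit shrinks."""
    return max(0, offline_used_gb - new_limit_gb)


if __name__ == "__main__":
    # Online is mostly idle: offline may grow beyond its base share of 32 GB.
    print(offline_memory_limit(128, 96, 40, margin_gb=8))
    # Online becomes busy: the limit falls back and offline must release memory.
    limit = offline_memory_limit(128, 96, 90, margin_gb=8)
    print(limit, memory_to_release(70, limit))
```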

Key techniques of online cluster management

Profile each application’s memory, CPU, network, disk, and network I/O capacity: understand its characteristics, its resource specification requirements, and its real resource usage at different times, then analyze how the overall specification correlates with time in order to optimize scheduling as a whole.
Assign affinity, mutual exclusion, and task priorities: some applications, when placed together, consume less total computing capacity and deliver higher throughput, which means they have a certain affinity.
Apply different policies to different scenarios. For Double 11, the policy is stability first: a spread (tiling) policy uses all available resources so that every resource layer sits at its lowest watermark. In daily scenarios, the policy is utilization first: the resources in use are driven to their highest watermark, which frees up a large number of whole resources for large-scale computing.
Let applications scale in automatically, scale vertically, and share resources through time-sharing multiplexing.
Support rapid scale-out and scale-in of the entire site, elastic memory technology, and so on.
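The two policies in the list above, stability first (spread) and utilization first (pack), differ only in which candidate node is preferred. The sketch below shows that difference on a toy cluster; the `place` function and the node data are hypothetical and ignore everything else a real scheduler considers, such as affinity and priority.

```python
def place(nodes, demand, policy):
    """Choose a node for a container that needs `demand` cores.

    nodes maps a node name to (used_cores, capacity_cores).
    Returns the chosen node name, or None if nothing fits.
    """
    candidates = [(name, used, cap) for name, (used, cap) in nodes.items()
                  if cap - used >= demand]
    if not candidates:
        return None
    if policy == "spread":
        # Stability first: prefer the least utilized node (lowest watermark).
        best = min(candidates, key=lambda c: c[1] / c[2])
    elif policy == "pack":
        # Utilization first: prefer the fullest node that still fits,
        # leaving lightly used machines free for large computing jobs.
        best = max(candidates, key=lambda c: c[1] / c[2])
    else:
        raise ValueError(f"unknown policy: {policy}")
    return best[0]


if __name__ == "__main__":
    cluster = {"a": (2, 16), "b": (10, 16), "c": (15, 16)}
    print(place(cluster, 4, "spread"), place(cluster, 4, "pack"))
```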

Mixed deployment means adding computing tasks to online service clusters to improve everyday resource efficiency. After offline tasks were introduced, average CPU utilization rose from 10% to more than 40%, while the latency impact on latency-sensitive services stayed below 5%, which is entirely acceptable. Our mixed deployment cluster has now reached thousands of machines and has been validated on the core transaction path during the Double 11 promotion. This optimization saves more than 30% of servers in daily operation. This year we will expand the deployment tenfold to achieve economies of scale.

Time-sharing multiplexing improves resource efficiency further. The graph above shows the traffic curve of one of our applications. It is very regular, with the nighttime trough on the left and the daytime peak on the right. Ordinary mixed deployment uses the resources in the blue shaded part of the figure and raises utilization to 40%. Elastic time-sharing multiplexing goes further: based on the application’s profile, it finds the application’s traffic trough, scales the application down during that period to release large amounts of memory and CPU, and schedules more computing tasks onto the freed resources. With this technique, average CPU utilization rises to over 60%.
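To make the idea of time-sharing multiplexing concrete, the sketch below sizes an application per hour from its traffic curve and reports how many replicas could be released to computing tasks compared with peak sizing. The traffic numbers, the per-replica capacity, and the function names are invented for illustration and are not taken from Alibaba's system.

```python
def replicas_needed(qps, qps_per_replica, min_replicas=1):
    """Replicas required to serve `qps`, never going below a safety floor."""
    needed = (qps + qps_per_replica - 1) // qps_per_replica  # integer ceiling
    return max(min_replicas, needed)


def releasable_replicas(hourly_qps, qps_per_replica, min_replicas=1):
    """Replicas that can be released in each hour, relative to peak sizing."""
    peak = replicas_needed(max(hourly_qps), qps_per_replica, min_replicas)
    return [peak - replicas_needed(q, qps_per_replica, min_replicas)
            for q in hourly_qps]


if __name__ == "__main__":
    # Night trough first, daytime peak last, as in the curve described above.
    traffic = [200, 100, 100, 400, 1000, 900]
    print(releasable_replicas(traffic, qps_per_replica=100))
```

The hours with the largest releasable counts are the trough in which the application can be scaled down and the freed CPU and memory handed to computing tasks.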

PouchContainer and the progress of containerization

Comprehensive containerization is the key technology for improving operation and maintenance capability and unifying scheduling. First, let’s introduce PouchContainer, Alibaba’s internal container product. It was built and launched in 2011 on top of LXC, and in early 2015 it began to absorb Docker’s image functionality and to comply with container standards. Alibaba’s container is distinctive: it is combined with the Alibaba kernel, which greatly improves security and isolation, and it is currently deployed inside Alibaba Group at a scale of millions of containers.

Let’s look at how PouchContainer developed. Previously we used virtual machine virtualization, and moving from virtualization to containers posed many challenges for the operation and maintenance system, because migrating that system carries a high technical cost. We made the migration seamless from the point of view of Alibaba’s internal operations and applications: each container has its own IP address, can be logged in to over SSH, and has an independent file system, resource isolation, and visibility into its resource usage. After 2015, Alibaba adopted container standards, formed the new container, PouchContainer, and integrated it into the entire operation and maintenance system.

PouchContainer positioning diagram

PouchContainer is a rich container with strong isolation. You can log in to a container and see how many resources its processes occupy and how many processes it is running. If a process hangs, the container itself does not hang. Compatibility is good: older kernel versions are supported as well, which is very helpful. After large-scale validation with millions of deployed containers, we developed a P2P image distribution mechanism that greatly improves distribution efficiency. PouchContainer is also compatible with more industry standards and helps to advance them, supporting RunC, RunV, RunLXC, and so on. Having been tested at the scale of millions of containers, it is stable and efficient, and is the best choice for enterprises that want to containerize comprehensively.

PouchContainer architecture diagram

PouchContainer’s architecture is fairly clear; the diagram shows how Pouchd interacts with Kubelet, Swarm, and Sigma. For storage, we established the CSI standard together with the industry, and distributed storage such as Ceph and Pangu is supported. On the network side, multiple standards are supported, and LXCFS is used to strengthen isolation.

PouchContainer-based containerization now covers most of Alibaba’s business units. In 2017 the number of deployed PouchContainer containers reached the millions, and online business became 100% containerized. Computing tasks have also begun to be containerized, which evens out the operation and maintenance cost of heterogeneous platforms. The deployment covers multiple operating modes, multiple programming languages, and the DevOps system. PouchContainer covers almost all of Alibaba’s business segments, such as Ant Financial, transactions, and middleware.

The open-sourcing of PouchContainer was announced on October 10, 2017, and it was officially open-sourced on November 19. The first major release is planned for March 2018. By open-sourcing PouchContainer, we hope to promote the development of the container field and the maturation of its standards, and to offer the industry a differentiated and competitive technology option. It makes it easy for traditional IT enterprises to reuse what they already have, so that older infrastructure can also enjoy the benefits of cloud technology, and it lets new IT enterprises enjoy the advantages of stability proven at scale and multi-standard compatibility.

PouchContainer open source repository: github.com/alibaba/pou…

Cloud architecture

Double 11 Cloud architecture O&M system

Cloud architecture O&M system

The clusters are divided into online task clusters, computing task clusters, and ECS clusters. Basic operation and maintenance systems have been built for resource management, single-machine operation and maintenance, state management, command channels, and monitoring and alerting. For the Double 11 scenario, we carve out a separate zone on the cloud that is interconnected with the other scenarios. Within this interconnected zone, Sigma can request resources from the computing cluster’s servers to produce Pouch containers, or request ECS instances through the cloud Open API to produce resources for containers. In daily scenarios, Fuxi can request resources from Sigma and create the containers it needs.

In the Double 11 scenario, the large-scale operation and maintenance system builds a large number of online services on containers, including mixed deployment at the business layer. Each cluster hosts online services, stateful services, and big data analysis. Dedicated clusters on Alibaba Cloud can also host online services and stateful data services, achieving "datacenter as a computer": multiple data centers are managed like a single computer, and the resources a business needs can be scheduled across platforms. Building a hybrid cloud lets us obtain servers at very low cost, which solves the problem of whether the servers are available at all.

Once the server scale is in place, resource utilization is greatly improved through time-sharing multiplexing and mixed deployment. This truly achieves elastic resources, smooth reuse, and flexible mixed deployment of tasks, and meets the business capacity target with the fewest servers, the shortest time, and the best efficiency. With this cloud architecture, we cut the cost of new IT for Double 11 by 50% and daily IT cost by 30%. This has unlocked a burst of technical value in cluster management and scheduling, and it also shows that the spread of container and scheduling technology is inevitable.

The Alibaba scheduling system team is committed to creating the most efficient scheduling and cluster management system in the world, and to building the best cloud solutions through enterprise-class containers and container platforms. We look forward to working with colleagues across the industry to reduce IT costs for the whole industry and accelerate the innovation and development of enterprises.

Shu Tong (Ding Yu) is a senior technical expert at Alibaba who has taken part in the Double 11 campaign eight times, is responsible for Alibaba’s high-availability architecture and Double 11 stability, and leads Alibaba’s container, scheduling, cluster management, and operation and maintenance technology.

The Alibaba scheduling system team provides scheduling, container, and cluster management infrastructure for the Alibaba economy, drives the optimization of Alibaba’s overall cloud efficiency and cost, and gives the Alibaba economy and its cloud business strong technical competitiveness. It is committed to building the world’s leading and most efficient scheduling and cluster management system and an efficient, stable, enterprise-class rich container engine.

The PouchContainer team provides container technology in the infrastructure area for the Alibaba economy, helping Alibaba fully containerize its business and laying a solid foundation for the Group’s “cloud” strategy. The team is committed to creating the world’s leading, efficient, and stable enterprise-class rich container engine.

An excellent team is always looking for talents to join. If you want to enter the core of technology and challenge the limits of computers, please join us. If you want to work with dedicated, great people, please join us. If you have a passion for elegant code, please join us!

The following positions are permanently open: Golang Engineer, Java Engineer, Scheduling Architect, Container Architect, Mixed Deployment Architect, Cluster Resource Management R&D Specialist, Container PaaS Platform Technical Specialist, Enterprise Container Platform Solution Architect, and more.

Referral channel: [email protected]

Open positions: alibaba.tupu360.com/social/inde…
