Abstract:

Takeaway: Remember the dark days when we got up in the middle of the night to restart servers? How does Alibaba keep the management of millions of hosts safe, stable, efficient, and smooth as silk during Double 11? Song Yi, a technical expert on Alibaba's operations middle platform, unveils StarAgent, the infrastructure of Alibaba's IT operations, live for the first time, explains in detail how StarAgent supports the management and control of millions of servers, and discusses how to build an operations infrastructure platform that is as fundamental as the water, electricity, and coal of daily life.


About the speaker
Song Jian (Song Yi): technical expert on Alibaba's operations middle platform. He has worked in operations for 10 years and has deep understanding of, and hands-on experience with, large-scale operations systems and automated operations. He joined Alibaba in 2010 and is now responsible for the basic operations platform. Since joining, he has built Alipay's basic monitoring system from scratch, driven the integration and unification of monitoring systems across the group, and been responsible for operations tooling and the testing PE team.


StarAgent



From the perspective of the Cloud Efficiency 2.0 intelligent operations platform (StarOps for short), operations can be divided into two platforms: the basic operations platform and the application operations platform. The unified basic operations platform is called StarAgent, and it can fairly be called the infrastructure of Alibaba's IT operations.
As we grew from 10,000 servers to 100,000 and gradually to millions, the importance of this infrastructure was not obvious at first; we discovered it gradually, as the stability, performance, and capacity of the operations system could no longer keep up with the rapid growth of servers and services. In 2015 we upgraded the architecture: the success rate of the StarAgent system rose from 90% to 99.995%, and the daily command volume grew from 10 million to more than 100 million.
Enterprises with servers at the million scale are a handful in the whole world, and many of them split management by business line, with each business managing its own servers, so scenarios where one system manages a million machines are rarer still. There was not much we could learn from others; we mostly felt our way forward, and that is how the system evolved into what it is today.


Product introduction





As shown in the figure above, StarAgent is divided into three layers: the host layer, the operations management layer, and the business layer, with the teams collaborating in this layered fashion. The figure also roughly shows StarAgent's position within the Group: it is the Group's only official default Agent.


  • Host layer: all servers; every machine has our Agent installed by default.
  • Operations management layer: the operations management systems, including the application operations system, database operations system, middleware operations system, and security system. Products in each professional domain have their own portals, and their operations on servers are carried out through StarAgent.
  • Business layer: the businesses of each BU. Most BUs directly use the management and control systems of the operations management layer, though some BUs have requirements of their own.


Application scenarios






StarAgent runs through the server lifecycle:



  • Asset check: after a server is racked, it is set to network boot and loads a mini OS that runs in memory and already contains our Agent. Once this OS is up, commands can be issued to collect the server's hardware information for asset verification, such as CPU, memory, and disk vendor and capacity.
  • OS installation: before a server is delivered to a business, an OS is installed on it; the installation is carried out by issuing commands to the Agent running in the in-memory mini OS.
  • Environment configuration: after the OS is installed, the basic environment on the machine is initialized: accounts, common O&M scripts, scheduled tasks, and so on.
  • Application publishing: application configurations and software packages are released to the server.
  • Runtime monitoring: monitoring agents are installed and application and business monitoring scripts are run.
  • Routine O&M: day-to-day operations such as logging in to servers and running single-machine or batch operations, including cleanup before a server is taken offline.


Product data





Here is some data on the product within Alibaba: hundreds of millions of server operations every day, the ability to operate on 500,000 servers within one minute, more than 150 plug-ins, and a managed fleet at the scale of millions of servers.


Product features






StarAgent's core functions can be summarized in two parts: the command channel and system configuration.

This is similar to open source configuration management products such as SaltStack, Puppet, and Ansible, but we do it in a more fine-grained way.



  • Command channel: all O&M operations are ultimately converted into commands executed on servers. This command channel is the only one across the entire network, and it comes with user permission control, operation auditing, and interception of high-risk commands (a sketch of such a gate follows this list).
  • System configuration: common O&M scripts, scheduled tasks, system accounts, monitoring agents, and so on. These configurations are initialized automatically after the Agent starts. Because the Agent is bundled into the OS image by default, the basic O&M environment of a server is set up automatically as soon as it boots.
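To make the command-channel idea concrete, here is a minimal Python sketch of the gate a command might pass through before dispatch: permission check, high-risk interception, and audit. The rule patterns, user names, and function names are assumptions for illustration; the real StarAgent checks are not described in this article.

```python
import re
import time

# Assumed example patterns; the real high-risk list is not public.
HIGH_RISK_PATTERNS = [r"\brm\s+-rf\s+/", r"\bmkfs\b", r"\bshutdown\b"]

def allowed(user: str, host: str) -> bool:
    # Placeholder permission check; a real system would query an auth service.
    return user in {"ops_admin", "app_owner"}

def is_high_risk(command: str) -> bool:
    return any(re.search(p, command) for p in HIGH_RISK_PATTERNS)

def audit(user: str, host: str, command: str, verdict: str) -> None:
    # In practice this would go to a persistent audit store, not stdout.
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} user={user} host={host} "
          f"verdict={verdict} cmd={command!r}")

def issue_command(user: str, host: str, command: str) -> bool:
    if not allowed(user, host):
        audit(user, host, command, "denied:no-permission")
        return False
    if is_high_risk(command):
        audit(user, host, command, "blocked:high-risk")
        return False
    audit(user, host, command, "accepted")
    # ...hand the command to the proxy/channel for execution...
    return True

if __name__ == "__main__":
    issue_command("app_owner", "host-001", "uptime")                       # accepted
    issue_command("app_owner", "host-001", "rm -rf / --no-preserve-root")  # blocked
```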





Breaking the function list down by Portal, API, and Agent: the Portal is mainly used by front-line development and operations engineers, the API is mostly called by upper-layer operations systems, and the Agent represents capabilities that can be used directly on each machine.




Portal


  • O&M market: also called the plug-in platform, similar to a mobile app store. If an application owner finds a useful tool in the market, they can click "Install" to deploy it to the application's machines; when the application scales out to new servers, the tools are installed on them automatically. The tools themselves also come from front-line engineers: everyone can upload their own tools to the O&M market and share them with others.
  • Web terminal: clicking a machine on the Portal pops up a terminal whose behavior is exactly the same as logging in to the server over SSH. Authentication is automatic, based on the current user, and the terminal can also be embedded into any other web page via JS.
  • File distribution: fairly self-explanatory, so I will not expand on it.
  • Scheduled task: similar to crontab, but with second-level granularity and staggered execution. For example, if a scheduled task that runs once per minute is added to a batch of machines, the executions can be spread across the minute, whereas with crontab every machine would fire at the first second of each minute (see the sketch after this list).
  • Host account: provides three capabilities: personal accounts for logging in to servers, public accounts such as admin on a host, and SSH trust from one host to other hosts.
  • API account: closely tied to the API capabilities on the right; to use those capabilities you must first apply for an API account.
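As a concrete illustration of staggered scheduling, here is a small Python sketch in which each host derives a stable offset within the minute from its hostname, so a once-per-minute task does not fire on every machine at second zero. The hashing scheme and helper names are assumptions, not StarAgent's actual implementation.

```python
import hashlib
import socket
import time

def stagger_offset(hostname: str, period: int = 60) -> int:
    # Stable per host, roughly uniform across the fleet.
    digest = hashlib.md5(hostname.encode()).hexdigest()
    return int(digest, 16) % period

def run_every_minute(task) -> None:
    offset = stagger_offset(socket.gethostname())
    while True:
        now = time.time()
        # Sleep until the next minute boundary plus this host's offset.
        next_run = (int(now) // 60 + 1) * 60 + offset
        time.sleep(max(0, next_run - now))
        task()

if __name__ == "__main__":
    run_every_minute(lambda: print("collect metrics at", time.strftime("%H:%M:%S")))
```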


API


  • CMD: pass the target machines and the command in the call, and the command is executed on the specified machines. Any command you could run after logging in to a machine can be invoked through the CMD interface.
  • Plugin: corresponds to the O&M market above; scripts installed on a machine through the O&M market can be executed directly via the Plugin interface.
  • File/Store: both are used for file distribution. The difference is that File depends on a download source, while Store lets you POST the script content directly in the HTTP API call. File is implemented on P2P, using a product dedicated to file download called Dragonfly; the advantage is that when hundreds or thousands of machines download the same file at the same time, only one request goes back to the source, so the pressure on the source is minimal and the machines share the download among themselves.
  • These interfaces are often combined: for example, use File to distribute a script, and only after the download succeeds use CMD to execute it (a sketch of this combination follows the list).
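Here is a hedged Python sketch of the File-then-CMD combination. The base URL, endpoint paths, header names, and response fields are all hypothetical placeholders; only the flow (distribute, wait for success, then execute) reflects the text above.

```python
import time
import requests

API = "https://staragent.example.com/api"                            # placeholder base URL
HEADERS = {"X-Api-Account": "my-api-account", "X-Api-Sign": "..."}   # placeholder auth

def wait_done(task_id: str, timeout: int = 300) -> dict:
    deadline = time.time() + timeout
    while time.time() < deadline:
        r = requests.get(f"{API}/task/{task_id}", headers=HEADERS).json()
        if r["status"] in ("executed", "failed"):
            return r
        time.sleep(2)
    raise TimeoutError(task_id)

def distribute_and_run(hosts: list, url: str, path: str, cmd: str) -> dict:
    # 1. File: distribute the script from a download source (P2P/Dragonfly underneath).
    file_task = requests.post(f"{API}/file", headers=HEADERS,
                              json={"hosts": hosts, "url": url, "target": path}).json()
    result = wait_done(file_task["taskId"])
    if result["status"] != "executed":
        return result
    # 2. CMD: run the script only after the download succeeded.
    cmd_task = requests.post(f"{API}/cmd", headers=HEADERS,
                             json={"hosts": hosts, "command": cmd}).json()
    return wait_done(cmd_task["taskId"])

if __name__ == "__main__":
    print(distribute_and_run(["host-001", "host-002"],
                             "http://repo.example.com/cleanup.sh",
                             "/tmp/cleanup.sh", "bash /tmp/cleanup.sh"))
```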


Agent


  • Hostinfo: collects information about the server such as host name, IP address, and SN.
  • Data channel: the output of commands or scripts executed on each machine can be dropped here; the data is automatically uploaded to the center, where it is then consumed by other systems.
  • Incremental log and P2P file: both are developed by third parties and installed on each machine as plug-ins through the O&M market.





Figure: On the left is the Web terminal, which is automatically authenticated and can be embedded into any Web page with JS.

On the right is the batch command execution function. Select a batch of machines first, and the commands entered on this page will be sent to this batch of machines.



System architecture



Logical architecture





Our system is a three-tier architecture. An Agent is installed on each machine and keeps a long-lived connection to a channel; the channel periodically reports its Agent connections to the center, and the center maintains the complete mapping between Agents and channels. Two processes are worth walking through:
1. Agent registration


The Agent ships with a default configuration file. After it starts, it first connects to the ConfigService and reports the host's IP address and SN. The ConfigService computes which channel cluster the Agent should connect to and returns the channel list. After receiving the result, the Agent disconnects from the ConfigService and establishes a long-lived connection to a channel; a sketch of this handshake follows.
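A minimal Python sketch of this handshake, assuming a JSON-over-TCP wire format, a placeholder ConfigService address, and invented field names; only the sequence (report SN/IP, receive a channel list, drop the ConfigService connection, keep a long-lived channel connection) comes from the text.

```python
import json
import socket

CONFIG_SERVICE = ("configservice.example.com", 7001)  # from the agent's default config file

def register(sn: str, ip: str) -> socket.socket:
    # 1. Report SN/IP to the ConfigService and receive the channel list for this agent.
    with socket.create_connection(CONFIG_SERVICE, timeout=5) as cs:
        cs.sendall(json.dumps({"sn": sn, "ip": ip}).encode() + b"\n")
        channels = json.loads(cs.makefile().readline())["channels"]
    # 2. Drop the ConfigService connection and keep a long-lived connection
    #    to the first reachable channel in the returned list.
    for host, port in channels:
        try:
            return socket.create_connection((host, port))
        except OSError:
            continue
    raise RuntimeError("no channel reachable")
```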
2. Command execution


External systems issue commands by calling the proxy. After receiving the request, the proxy looks up the channel corresponding to each target machine and sends the task to that channel, which then forwards the command to the Agent for execution; a sketch of this routing follows.
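A small Python sketch of the proxy-side routing, assuming the center's Agent-to-channel mapping is available to the proxy as a simple lookup; the host and channel names and the send_to_channel helper are illustrative only.

```python
from collections import defaultdict

# host -> channel, as reported by channels to the center (illustrative data).
AGENT_CHANNEL = {"host-001": "channel-a", "host-002": "channel-a", "host-003": "channel-b"}

def send_to_channel(channel: str, hosts: list, command: str) -> str:
    # Stand-in for the real proxy-to-channel call.
    print(f"send to {channel}: {hosts} -> {command!r}")
    return f"task-{channel}"

def dispatch(hosts: list, command: str) -> dict:
    # Group target hosts by the channel their agent is connected to,
    # then send one task per channel; the channel forwards it to its agents.
    per_channel = defaultdict(list)
    for h in hosts:
        per_channel[AGENT_CHANNEL[h]].append(h)
    return {ch: send_to_channel(ch, members, command) for ch, members in per_channel.items()}

if __name__ == "__main__":
    dispatch(["host-001", "host-002", "host-003"], "uptime")
```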


Deployment architecture





At the bottom are the IDCs. A channel cluster is deployed in each IDC, and each Agent establishes a long-lived connection to one channel in that cluster, chosen at random. Above that is the center, which is deployed across two data centers for disaster recovery, with both serving traffic online; the failure of either data center has no impact on the service.


Problems & Challenges






As shown in the figure above, these are the problems we ran into before refactoring the system the year before last:


The first three problems are similar and are mainly caused by tasks having state. The manager in 1.0 corresponds to the proxy in 2.0, and the server corresponds to the channel. A large number of systems are issuing commands at any moment, so if any role on the link restarts, every task in flight on that link fails. For example, when a server restarts, the agents connected to it are disconnected, and because the link is broken, commands issued through that server never get their results. Restarting a server also triggers the sixth problem, load imbalance: suppose an IDC has 10,000 machines, with 5,000 connected to each of two servers; after one server restarts, all 10,000 machines end up connected to the other one.
When a command issued via the API failed, users would come to us to investigate the cause. Sometimes it really was a system problem, but very often it was an environment problem: the machine itself was down, SSH was unreachable, the load was too high, the disk was full, and so on. At a scale of millions of servers, one percent of the machines means more than ten thousand machines with problems every day, which translated into an endless stream of questions. It was painful at the time: half of the team spent every day answering questions, and we still had to get up in the middle of the night to restart services and recover from broken connections.


How did we solve these problems? We divided them into system problems and environment problems.









System problems


We rebuilt the system thoroughly on a distributed messaging architecture. Take command issuance as an example: every issuance is a task, and in 2.0 each task carries a state. When the proxy receives a request to issue a command, it records the task and sets its state to received, then distributes it to the agent. The agent responds immediately upon receiving the task, and the proxy changes the state to executing. When the agent finishes execution, it actively reports the result, and on receiving the result the proxy changes the state to executed.
Throughout the process, messages between the proxy and agent are protected by an acknowledgement mechanism: if a message is not acknowledged, it is retried. In this way, even if a role is restarted while a task is running, the task itself is not affected (see the sketch below).
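A minimal Python sketch of the task states and the acknowledge-and-retry idea described above; the state names follow the text, while the retry counts, intervals, and callback signatures are assumptions.

```python
import enum
import time

class State(enum.Enum):
    RECEIVED = "received"    # proxy recorded the task
    EXECUTING = "executing"  # agent acknowledged the task
    EXECUTED = "executed"    # agent reported the result

def deliver_with_ack(send, max_retries: int = 3, interval: float = 2.0) -> bool:
    # Resend until the peer acknowledges; an unacknowledged message is retried,
    # so a restart of proxy/channel/agent mid-task does not lose the task.
    for _ in range(max_retries):
        if send():
            return True
        time.sleep(interval)
    return False

def run_task(command: str, send_to_agent, receive_result) -> State:
    state = State.RECEIVED
    if not deliver_with_ack(lambda: send_to_agent(command)):
        return state                     # still recorded, visible, and retryable
    state = State.EXECUTING              # agent acknowledged receipt
    if receive_result():                 # agent actively reports the result
        state = State.EXECUTED
    return state

if __name__ == "__main__":
    print(run_task("uptime",
                   send_to_agent=lambda cmd: True,  # stand-in for the channel hop
                   receive_result=lambda: True))
```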
In 2.0, the machines in a channel cluster communicate with each other and periodically exchange information such as the number of agents connected to each of them. Combining the received information with its own, a channel that finds itself carrying too many agents will actively disconnect agents that have no running tasks, which solves the load-imbalance problem. The center keeps long-lived connections with all channels and stores the number of agents on each one; when it detects abnormal channels or excessive load in a data center, it automatically triggers expansion or temporarily borrows channels from other data centers, and after capacity recovers the extra channels are automatically retired.


Environmental problems


In 2.0, each layer (proxy/channel/agent) returns detailed error codes, through which the cause of a task failure can be identified directly.
For problems with the machine itself, we hook into the monitoring system's data: a failed task triggers an environment check that covers machine downtime, disk space, load, and so on. If a problem is found, the API response states directly that the machine has a problem and also returns the machine's owner, so users reading the result know why it failed and whom to contact. These diagnostic capabilities are also exposed through a DingTalk bot, so people can @ the bot in a group chat to check and confirm directly (a sketch follows).
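A hedged Python sketch of post-failure diagnosis: map the error code to a layer, run environment checks against monitoring data, and return the machine's owner. The error-code scheme, check thresholds, and lookup helpers are invented for illustration.

```python
# Stand-ins for monitoring/CMDB queries so the sketch is self-contained.
def host_down(host: str) -> bool: return False
def disk_usage(host: str) -> float: return 0.40
def load_avg(host: str) -> float: return 1.2
def lookup_owner(host: str) -> str: return "app_owner@example.com"

def diagnose(host: str, error_code: int) -> dict:
    # Assumed scheme: the thousands digit of the error code identifies the layer.
    layer = {1: "proxy", 2: "channel", 3: "agent"}.get(error_code // 1000, "unknown")
    checks = {
        "host_down": host_down(host),        # from the monitoring system's data
        "disk_full": disk_usage(host) > 0.95,
        "high_load": load_avg(host) > 50,
    }
    return {
        "host": host,
        "failed_layer": layer,
        "environment_problems": [name for name, bad in checks.items() if bad],
        "owner": lookup_owner(host),         # so the caller knows whom to contact
    }

if __name__ == "__main__":
    print(diagnose("host-001", 3021))
```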





Stability


As the previous sections show, we really are the infrastructure of operations, like the water, electricity, and coal of daily life. Every operation on a server depends strongly on us, so if we fail while an online service is having a serious failure, all anyone can do is wait, because the servers can no longer be operated, released, or changed. This places very high demands on system stability. We have achieved same-city dual data center and multi-region disaster recovery deployment. We rely on storage such as MySQL, Redis, and HBase, which come with their own high-availability guarantees, and on top of that we add redundancy across the storages so that the failure of any single storage does not affect the business. I believe few systems in the industry go this far.


Security


Being able to operate 500,000 servers in one minute means that a typed command and a press of Enter can reach tens of thousands of machines in an instant; if the operation were maliciously destructive, the impact is easy to imagine. We therefore implemented high-risk command blocking, which automatically identifies and intercepts certain dangerous operations. The entire invocation link is encrypted and signed so that no third party can crack or tamper with it. To address possible leaks of API accounts, we developed command mapping: operating-system commands are remapped, so that, for example, reboot might have to be passed in as A1B2, and each API account has a different mapping (a sketch follows).
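A tiny Python sketch of per-account command mapping; the mapped tokens (such as A1B2 for reboot) and account names are invented, and a real mapping table would of course not be hard-coded.

```python
# Per-account mapping tables; the tokens and accounts are invented.
COMMAND_MAP = {
    "api-account-1": {"A1B2": "reboot", "C3D4": "shutdown -h now"},
    "api-account-2": {"Z9Y8": "reboot"},
}

def resolve(account: str, mapped_command: str) -> str:
    table = COMMAND_MAP.get(account, {})
    if mapped_command not in table:
        # An attacker who leaks an account but not its mapping table cannot
        # issue the sensitive command in plain form.
        raise PermissionError(f"unknown mapped command for {account}")
    return table[mapped_command]

if __name__ == "__main__":
    print(resolve("api-account-1", "A1B2"))   # -> "reboot"
```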


Environment


Environment problems such as machine downtime are solved by connecting to monitoring data, as mentioned above, and I will not dwell on network isolation again. What I do want to highlight is the inconsistency between the data entered in the CMDB and the data collected by the Agent, mainly the basic SN and IP information. When people use a machine, they usually first pull its information from the CMDB and then call our system with it; if the two disagree, the call fails outright. Why do SN/IP inconsistencies arise?
Data in the CMDB is generally entered manually or pushed in by other systems, while the Agent collects it from the machine itself. Some motherboards do not record an SN, some machines have many network cards, and so on; the environments are complicated and all kinds of situations exist.
Our solution was to establish standards: we defined how SN and IP should be collected, allowed machines to declare custom SN/IP values, and, alongside the standards, provided a collection tool. Not only our Agent but any other scenario that needs to collect machine information can use this tool, and when the standards are updated we update the tool, making the changes transparent to upper-layer businesses. A sketch of such a collector is below.
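A hedged Python sketch of such a standardized collector: it prefers a machine-declared override, then falls back to the DMI serial and a default-route probe for the IP. The override file path and precedence rules are assumptions illustrating the idea, not Alibaba's actual tool.

```python
import os
import socket

OVERRIDE = "/etc/staragent/identity"   # hypothetical "custom SN/IP" file, key=value lines

def read_override() -> dict:
    data = {}
    if os.path.exists(OVERRIDE):
        with open(OVERRIDE) as f:
            for line in f:
                if "=" in line:
                    key, value = line.strip().split("=", 1)
                    data[key] = value
    return data

def collect_sn() -> str:
    custom = read_override().get("sn")
    if custom:
        return custom
    try:
        with open("/sys/class/dmi/id/product_serial") as f:  # Linux DMI serial
            return f.read().strip()
    except OSError:
        return "unknown"

def collect_ip() -> str:
    custom = read_override().get("ip")
    if custom:
        return custom
    # Pick the IP of the interface used for the default route (no packet is actually sent).
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]

if __name__ == "__main__":
    print(collect_sn(), collect_ip())
```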