1. Current situation and development trend of enterprise operation and maintenance

As enterprise informatization advances, operation and maintenance (O&M) personnel face increasingly complex business systems and increasingly diverse user needs. As applications keep expanding, a sounder operating model is needed to keep O&M services flexible, convenient, secure, and stable.

An enterprise may grow from a few servers in its early days to a huge data center. Manual work alone cannot meet the technical, business, and management requirements, so standardization, automation, architecture optimization, and process optimization are increasingly valued as ways to reduce the cost of operation and maintenance services.

Among these, automation has begun to replace manual operation in enterprise operation and maintenance and is gradually showing a strong advantage.

As enterprise business develops, automation has become one of the key attributes of operation and maintenance. It does more than replace manual operation. More importantly, it enables in-depth exploration and global analysis, focusing on how to optimize performance and service under current conditions while maximizing return on investment.

With automated operation and maintenance, O&M goals can be achieved with less maintenance time, and the quality of operation and maintenance service improves.

Therefore, for increasingly complex operation and maintenance, it is an important development trend to gradually change manual operation into automatic management.

2. Problems and requirements of enterprise operation and maintenance

At first, the enterprise ran only a handful of servers for services such as file sharing and email, and operation and maintenance was entirely manual. As the enterprise grew, new business systems came online and a central machine room was built. Operation and maintenance was still mainly manual, but at this stage a network management system and an environmental monitoring system were added. These two systems reduced the O&M workload to some extent and made operation and maintenance basically semi-automated.

As the enterprise continues to develop, the operation and maintenance workload keeps increasing. Enterprise operation and maintenance faces the following problems, which need to be solved:

2.1 The work efficiency and initiative of operation and maintenance personnel need to be improved

In enterprise operation and maintenance, faults are often discovered and dealt with only after they have occurred and affected the business. Such passive “fire fighting” not only keeps operation and maintenance personnel busy all day long, but also makes it difficult to improve the quality of operation and maintenance, so both the IT department and the business departments are dissatisfied with operation and maintenance services.

Operation and maintenance staff spend much of their time and energy on simple, repetitive problems. Because the fault early-warning mechanism is imperfect, handling usually begins only after a failure or alarm, which leaves the staff in a passive position. How can faults be discovered and eliminated before they cause failures, turning passive operation and maintenance into active work?

2.2 An efficient operation and maintenance mechanism needs to be established

In operation and maintenance management, enterprises lack an automated management mode, and the roles and responsibilities of operation and maintenance personnel are not clearly defined. As a result, it is difficult to find the root cause quickly and accurately after a problem occurs, and the right people cannot be found in time to repair and deal with it.

In other cases there is no streamlined fault-handling mechanism once a problem is found, and no standardized solutions or comprehensive tracking records while it is being handled. Enterprises need to establish an efficient operation and maintenance management system to provide direction and a basis for operation and maintenance work.

2.3 Lack of efficient operation and maintenance tools

As informatization deepens, enterprise business systems become increasingly complex. Operation and maintenance staff struggle to cope with the many kinds of network devices, servers, storage devices, and business systems, even when they work overtime on maintenance, deployment, and management. Equipment faults often interrupt business and seriously affect the normal operation of the enterprise.

These problems are partly due to the lack of operational and maintenance tools such as event monitoring and diagnostic tools, because without the support of efficient technical tools, fault events can hardly be actively and quickly handled.

3. Standardized business processes and improved operation and maintenance management system

3.1 Standardize business processes and lay a good foundation for automated operation and maintenance

Standardization is the basis of automated O&M. To achieve it, first identify each O&M object, and then define all daily O&M work in terms of those objects.

An O&M operation that is detached from its object is meaningless. Likewise, without clearly defined objects, O&M has no rules to follow. For example, before expanding capacity, determine whether the object being expanded is a server, an application, or something else.

You will notice that the actions performed in the expansion scenario are completely different depending on the object.

If the capacity expansion of the server is applied to the capacity expansion of the application, the process will inevitably lead to confusion. At the same time, inconsistent understanding of objects will increase unnecessary communication costs, resulting in low operation and maintenance efficiency.

In this case, automated operation and maintenance not only fails to improve efficiency, but also becomes more chaotic the more it is automated.
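One way to keep an operation tied to its object is to register each expansion procedure under the object type it applies to and to refuse to run when the type is unknown. The following is a minimal Python sketch under that assumption. The object types, the step lists, and the name `plan_expansion` are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

# Hypothetical mapping from O&M object type to its own expansion procedure.
EXPANSION_STEPS = {
    "server": ["rack and cable hardware", "install OS", "register in inventory"],
    "application": ["provision instance", "deploy release", "add to load balancer"],
}


def plan_expansion(object_type: str) -> list[str]:
    """Return the expansion steps for one object type, or fail loudly."""
    try:
        return list(EXPANSION_STEPS[object_type])
    except KeyError:
        raise ValueError(
            f"no expansion procedure defined for {object_type!r}"
        ) from None


if __name__ == "__main__":
    print(plan_expansion("application"))
```

Failing on an unknown object type is a deliberate choice here: it prevents a server procedure from being silently applied to an application.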

The first step toward standardization is to standardize the physical infrastructure. Identify the physical objects, such as servers, switches, cabinets, and other hardware. Identify the attributes of these objects, such as server serial number, IP address, and vendor.

Then identify the relationships between the objects, such as the cabinet in which a server resides and the switch port to which it is connected.
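The objects, attributes, and relationships listed above can be sketched as a small data model. This is only an illustration: the class and field names are assumptions based on the examples in the text (serial number, IP address, vendor, cabinet, switch port), not a schema from the original article.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Cabinet:
    name: str


@dataclass(frozen=True)
class SwitchPort:
    switch: str
    port: int


@dataclass
class Server:
    # Attributes of the physical object.
    serial_number: str
    ip_address: str
    vendor: str
    # Relationships to other objects.
    cabinet: Cabinet      # the cabinet where the server resides
    uplink: SwitchPort    # the switch port the server is connected to


def servers_in_cabinet(servers: list[Server], cabinet: Cabinet) -> list[Server]:
    """Answer a relationship query: which servers sit in this cabinet?"""
    return [s for s in servers if s.cabinet == cabinet]
```

With objects identified this way, a question such as "which servers are affected if cabinet A3 loses power" becomes a simple query instead of a manual lookup.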

The standardization of server physical infrastructure is shown below (other devices are standardized in the same way):

The second step is to standardize applications, including application services, middleware, and databases. For a database, for example, this covers tables, views, and stored procedures; the field names, values, and indexes of tables; and the associations between tables and views.

The third step is to standardize processes, such as backup, software upgrade, antivirus, and the launch of new business systems. The following figure is the current operation and maintenance process:

Automated operation and maintenance is a process-based framework that associates events with IT processes. Once the monitoring system finds that performance exceeds a pre-configured threshold or that an outage has occurred, the relevant events and pre-defined processes are triggered, and the fault response and recovery mechanism starts automatically.
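The link between a threshold breach and a pre-defined process can be sketched as a small event router. This is a minimal illustration rather than the design of any particular monitoring product: the class `EventRouter`, the metric names, and the handler signature are all assumptions.

```python
from __future__ import annotations

from typing import Callable

# A pre-defined process: receives the metric name and value, returns a summary.
Handler = Callable[[str, float], str]


class EventRouter:
    """Associate monitoring events with pre-defined response processes."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}

    def register(self, metric: str, handler: Handler) -> None:
        self._handlers[metric] = handler

    def check(self, metric: str, value: float, threshold: float) -> str | None:
        """Run the registered process when value exceeds threshold."""
        if value <= threshold:
            return None  # within limits, nothing to trigger
        handler = self._handlers.get(metric)
        if handler is None:
            return f"unhandled: {metric}={value}"
        return handler(metric, value)


if __name__ == "__main__":
    router = EventRouter()
    router.register("cpu_load", lambda m, v: f"scale out because {m}={v}")
    print(router.check("cpu_load", 0.95, threshold=0.80))
```

Returning an explicit "unhandled" result for metrics without a registered process keeps breaches visible even when no automation exists for them yet.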

The automated work platform can also help operation and maintenance personnel to complete daily repetitive work and improve operation and maintenance efficiency. The following is the flow chart of automatic operation and maintenance:

Automated operation and maintenance can predict faults and raise alarms before they occur, allowing operation and maintenance personnel to eliminate faults in advance and minimize losses. Moving from manual execution to automatic operation reduces, and can even eliminate, delays in operation and maintenance, with the goal of “zero delay” operation and maintenance.

3.2 Establish a complete and comprehensive operation and maintenance management system to safeguard automated operation and maintenance

The establishment of operation and maintenance system includes environmental management, asset management, media management, equipment management, monitoring management, network security management, system security management, malicious code prevention management, password management, change management, backup and recovery management, security incident disposal, emergency plan management and other systems.

  1. The operation and maintenance management system is a ruler to measure the operation and maintenance work. A perfect management system can effectively improve the efficiency of operation and maintenance work. Daily work is based on the management system, and the operation is fast and accurate according to the specified requirements and procedures.

  2. A comprehensive operation and maintenance management system allows problems and faults to be found in time, before they cause losses, so that they can be dealt with effectively and business continuity is guaranteed.

  3. The operation and maintenance management system provides a standardized solution for operation and maintenance work, enabling operation and maintenance personnel to quickly find the root cause of problems according to rules when dealing with problems and minimize the loss caused by problems to the business.

  4. The operation and maintenance management system serves the business, and the business is constantly developing. Therefore, the operation and maintenance management system should keep up with the continuous development of the business to achieve the innovation of the management system.

4. Automatic operation and maintenance technology route selection

4.1 Overview of Automated O&M

Automated O&M includes installation automation, deployment automation, monitoring automation, release automation, upgrade automation, security control automation, optimization automation, and data backup automation.

Automatic operation and maintenance system includes commercial automatic operation and maintenance system, open source automatic operation and maintenance system and self-built (developed) automatic operation and maintenance system.

Commercial operation and maintenance systems are more comprehensive in function, better in service support, guaranteed in updating and upgrading, with higher procurement costs and relatively low technical requirements for operation and maintenance personnel.

The open source operation and maintenance system is more flexible, and the service support requires more time and energy of operation and maintenance personnel. The update and upgrade is more personalized, and the relative cost is lower.

The self-built automatic operation and maintenance system has the highest technical requirements for personnel and the cost is not low, but when the enterprise develops to a certain scale, the self-built operation and maintenance system can be more suitable for the requirements of the enterprise for automatic operation and maintenance.

4.2 Application scenarios and advantages of open source O&M tools

1) Puppet is an open source software configuration and deployment tool. It is simple to use and powerful. Many large IT companies use Puppet to manage and deploy software in clusters.

Advantages and disadvantages: The advantages are that the web interface generates processing reports and resource lists, nodes can be managed in real time, and a push command can trigger changes immediately.

The disadvantages are that it is more complex than other tools and requires learning Puppet’s DSL or Ruby, and that the installation process lacks error checking and error reporting.

2) SaltStack is a new approach to infrastructure management. It is easy to deploy and can be up and running in a few minutes. It scales well enough to manage tens of thousands of servers, and it is fast enough for servers to communicate within seconds.

Advantages and disadvantages: The advantages are that simple configuration modules or complex scripts can be used, and the working status and event logs of operation and monitoring can be viewed on the Web interface.

The downside is the lack of ability to generate in-depth reports.

3) Ansible is a new operation and maintenance tool developed based on Python. It integrates the advantages of many old operation and maintenance tools to implement batch operating system configuration, batch program deployment, batch command running and other functions.

In large-scale deployments, it is not practical to manually configure the server environment, so you must resort to automated deployment tools.

Advantages and disadvantages: The advantages are that modules can be developed in any language, managed nodes do not need agent software installed, a web management interface is available, and installation and operation are simple.

The disadvantages are that support for managed Windows nodes needs strengthening and that execution efficiency is relatively low.

The following figure shows a comparison of the processing capabilities and efficiency of Puppet, SaltStack, and Ansible.

O&M tools exist only to assist O&M personnel, and each tool has its own strengths. Puppet suits automatic software configuration and deployment.

SaltStack suits infrastructure management: it can be up and running in minutes and is fast enough to manage tens of thousands of servers with ease.

Ansible suits batch operating system configuration, batch program deployment, and batch command execution.

Here are two common open source monitoring systems:

1) Nagios is a free and open source IT infrastructure monitoring system that is powerful and flexible. It can effectively monitor the status of Windows, Linux, VMware, and Unix hosts, as well as the network settings of switches, routers, and other network devices.

Once the host or service status is abnormal, the system sends an email or SMS alarm to inform IT operation and maintenance personnel immediately. After the status recovers, the system sends a normal email or SMS notification.

Analysis of advantages and disadvantages: The advantages are flexible configuration, many monitoring items, automatic log rolling, host monitoring in redundant mode, and diversified alarm settings.

The disadvantages are the weak event console, inability to view historical data, and poor plug-in usability.

2) Zabbix is an enterprise-level open source solution that provides distributed system monitoring and network monitoring through a web interface.

As a network management system, it monitors the status of servers, services, and other network equipment on the network. The backend is written in C and the frontend in PHP. It works with a variety of databases and provides various real-time alarm mechanisms.

Analysis of advantages and disadvantages: The advantages are that it is enterprise-level open source software, powerful, and easy to get started with; data can be presented graphically; and it provides a variety of APIs that allow custom development.

The disadvantages are that deeper requirements are difficult to develop, alarm settings are complex, there is no data summary function, and data reports require secondary development.

Nagios suits IT infrastructure monitoring: it is powerful and flexible, and it can effectively monitor hosts running various operating systems as well as switches and routers.

Zabbix provides distributed system monitoring and network monitoring, and serves as a network management system for monitoring the status of servers, services, and other network equipment.

All five tools above are open source. O&M personnel can combine them according to enterprise scale, business needs, and the O&M functions required, to make full use of the strengths of the O&M and monitoring tools.

Tools still require human intervention and decision making, and they cannot replace all O&M work. They also need to be integrated with actual business logic and business scenarios. For example, secondary development of a tool based on business requirements can make fuller use of its strengths and improve the work efficiency of O&M personnel.

4.3 Automating server deployment with SaltStack

In the enterprise, SaltStack is used to automate server deployment and its operation and maintenance. SaltStack is a client/server (C/S) configuration management tool developed in Python. At the bottom layer it communicates over ZeroMQ message queues using the PUB/SUB pattern, and it manages authentication by issuing keys in the manner of SSL certificates.

For Salt, we chose version 0.16.0, which added the multi-master feature, in which all minions connect to all configured masters.

When one master fails, the remaining masters continue to provide services, so normal use is not affected. The SaltStack architecture is shown as follows:

The steps for deploying SaltStack in an enterprise are as follows:

1. Determine whether the software dependencies of SaltStack are met. SaltStack requires a Python version of at least 2.6 and below 3.0. Also check for the following libraries: msgpack-python, YAML, Jinja2, MarkupSafe, apache-libcloud, Requests, and more.
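The dependency check in step 1 can be scripted. The sketch below is an illustration only: it is written for a current Python 3 interpreter, it encodes the version range stated above as a plain comparison, and the import names in `wanted` are assumptions, since import names can differ from package names.

```python
from __future__ import annotations

import importlib.util
import sys


def python_version_ok(version: tuple[int, int]) -> bool:
    """True when version is at least 2.6 and below 3.0, the range given above."""
    return (2, 6) <= version < (3, 0)


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level import names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Assumed import names for the libraries listed in the text.
    wanted = ["msgpack", "yaml", "jinja2", "markupsafe", "libcloud", "requests"]
    print("version in range:", python_version_ok(sys.version_info[:2]))
    print("missing modules:", missing_modules(wanted))
```

The version range applies to the Salt release discussed in this article. Later releases have different requirements, so the bounds would need adjusting.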

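2. Install the master and minions. The server operating system here is CentOS, and the installation commands are as follows:

    wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
    # Assumed step, not in the original: install the EPEL release package so that yum can find the Salt packages.
    rpm -Uvh epel-release-6-8.noarch.rpm
    yum install salt-master
    yum install salt-minion

Note: When the installation succeeds, Complete is displayed.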


3. Create a backup node for the master service and copy the master key to the standby node. List both masters in the minion configuration:

    master:
      - saltmaster1.cccxht.com
      - saltmaster2.cccxht.com


By default, the private key of the master is in the following directory:

/etc/salt/pki/master

Copy master.pem from this directory to the same location on the standby master node, and do the same for the master public key file, master.pub. Then start the standby master node and accept the minion keys on it.
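After the copy, it can be worth confirming that the standby node holds exactly the same key files as the primary. The sketch below compares SHA-256 fingerprints of master.pem and master.pub in two directories. It assumes both directories are readable as local paths (for example, a copy fetched back from the standby node), and the function names are hypothetical.

```python
from __future__ import annotations

import hashlib
from pathlib import Path

# Key file names taken from the text above.
KEY_FILES = ("master.pem", "master.pub")


def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def keys_match(primary_dir: Path, standby_dir: Path) -> bool:
    """True only when both key files exist on both sides and are identical."""
    for name in KEY_FILES:
        primary, standby = primary_dir / name, standby_dir / name
        if not (primary.is_file() and standby.is_file()):
            return False
        if fingerprint(primary) != fingerprint(standby):
            return False
    return True
```

A call such as `keys_match(Path("/etc/salt/pki/master"), Path("/tmp/standby-pki"))` would then report whether the copy is complete, where the second path is only a placeholder.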

4. Restart the minions. After the configuration is complete, each minion checks in with both the primary and the standby master, and both masters can operate on the minion.

Note: A minion can automatically detect a failed master and try to reconnect to another, faster master. To enable this function, set master_alive_interval in the minion configuration to the number of seconds between checks.

5. Write the SaltStack state files. Once SaltStack is in use, operation and maintenance work shifts from complex, repetitive server deployment and configuration to writing and maintaining SaltStack state files. State files should be written with modularity and generality in mind. They must be tested before mass deployment and deployed only after they pass. Here are some commonly used test commands:

(1) Check the network connection: whether each client can be reached

    [root@centos salt]# salt '*' test.ping
    localhost:
        True
    server.cccxht.com:
        True


(2) Query the NIC IP address

    [root@centos /]# salt 'localhost' network.interfaces
    localhost:
        eth0:
            hwaddr:
                08:00:27:59:a9:8d
            inet:
                - address:
                    192.168.151.202
                - broadcast:
                    192.168.151.255
                - label:
                    eth0
                - netmask:
                    255.255.255.0


(3) Query the disk space

    [root@centos tmp]# salt 'localhost' disk.usage
    localhost:
        /:
            1K-blocks: 28423128
            available: 21572236
            capacity: 25%
            filesystem: /dev/mapper/vg_centos-lv_root
            used: 5406132
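Output like the disk usage result above can also be checked by a script instead of by eye. The sketch below assumes the result has already been loaded into a Python dictionary with the same shape as the output shown (minion, then mount point, then fields), for example from the CLI's JSON output. The function name `over_capacity` and the threshold values are illustrative assumptions.

```python
from __future__ import annotations


def over_capacity(result: dict, limit_percent: int) -> list[tuple[str, str, int]]:
    """List (minion, mount point, percent used) at or above the limit."""
    flagged = []
    for minion, mounts in result.items():
        for mount, info in mounts.items():
            # The capacity field looks like "25%", so strip the sign first.
            used = int(str(info["capacity"]).rstrip("%"))
            if used >= limit_percent:
                flagged.append((minion, mount, used))
    return flagged


# Sample data copied from the command output above.
sample = {
    "localhost": {
        "/": {
            "1K-blocks": 28423128,
            "available": 21572236,
            "capacity": "25%",
            "filesystem": "/dev/mapper/vg_centos-lv_root",
            "used": 5406132,
        }
    }
}

if __name__ == "__main__":
    print(over_capacity(sample, limit_percent=80))
```

A check of this kind is one way to turn a manual test command into an early warning before a disk fills up.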


SaltStack can orchestrate cloud computing and data center architectures, and it can be invoked by Zabbix monitoring events.

SaltStack’s salt-cloud supports platforms such as Docker and OpenStack. Combined with the real-time discovery provided by the Salt mine, it can automatically expand business capacity across cloud platforms.

SaltStack can be combined with a CMDB to make operation and maintenance platform-based, automated, and intelligent.

5. Automatic operation and maintenance scheme design

5.1 Automatic O&M Planning diagram

When it comes to automated operation and maintenance, ITIL has to be mentioned. ITIL, the Information Technology Infrastructure Library, mainly applies to IT service management (ITSM).

ITIL provides an objective, rigorous and quantifiable standard and norm for enterprise IT service management practice.

ITIL has become a de facto international standard for IT service management, and the configuration management database (CMDB) is the most important element in implementing ITIL.

With the development of enterprises, they have higher and higher requirements for operation and maintenance. The existing open source tools can no longer meet the requirements of enterprises for operation and maintenance. It is urgent for enterprises to build a unified operation and maintenance management platform according to their business development and requirements for operation and maintenance.

The following is the overall plan of enterprise automatic operation and maintenance:

The automated operation and maintenance platform is built on ITIL standards and from the bottom up: the operation and maintenance subsystems in the service tool area are built first, and each subsystem provides services to the upper layer through an API.

Finally, different business platforms can call these service interfaces, and the construction of each level of the operation and maintenance platform should fully meet the requirements of the management system.

5.2 Automatic operation and maintenance platform module design

The automated operation and maintenance platform is developed on the basis of the ITIL standard. The first stage achieved the standardization of business processes. At this stage, the subsystems are gradually improved, starting with the incident management subsystem, and the various configurations are treated as services.

The CMDB can also be understood as a unified metadata store that holds, for example, machine room information, server information, personnel information, and service information, together with their physical and business topology relationships.

All upper-layer systems should be associated with the CMDB and treat it as their center. Changed data must be fed back to the CMDB in real time so that every operation and maintenance subsystem sees the latest data and other systems can synchronize the change, which keeps all systems consistent.
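The idea of the CMDB as the central store that pushes every change to the other subsystems can be sketched with a simple subscribe-and-notify pattern. This is an in-memory illustration only: the class `Cmdb`, its method names, and the item identifiers are assumptions, and a real CMDB would add persistence, access control, and history.

```python
from __future__ import annotations

from typing import Callable

# A subsystem callback: receives the item id and a copy of its latest record.
Listener = Callable[[str, dict], None]


class Cmdb:
    """Central configuration store that notifies subscribers of each change."""

    def __init__(self) -> None:
        self._items: dict[str, dict] = {}
        self._listeners: list[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def update(self, item_id: str, **attrs: object) -> None:
        """Merge attributes into the record, then push it to every subscriber."""
        record = self._items.setdefault(item_id, {})
        record.update(attrs)
        for listener in self._listeners:
            listener(item_id, dict(record))

    def get(self, item_id: str) -> dict:
        return dict(self._items[item_id])


if __name__ == "__main__":
    cmdb = Cmdb()
    cmdb.subscribe(lambda item, rec: print("monitoring sees", item, rec))
    cmdb.update("srv-01", ip="192.168.151.202", cabinet="A3")
```

Subscribers receive copies of the record, so a subsystem cannot change the central data except through `update`.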

Therefore, treating the CMDB as the core operation and maintenance system helps the systems built later to communicate with one another. The following are the design requirements of some modules:

Incident management: Responsible for recording and categorizing incidents, assigning experts to handle them, and monitoring the whole process until the incidents are resolved and closed.

The purpose of incident management is to restore IT systems to the service level defined in the service-level agreement (SLA) with the least possible impact on the business of customers and users.
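The incident lifecycle described above (record, categorize, assign, resolve, close) can be sketched as a small state machine that rejects out-of-order steps and keeps a history for tracking. The state names and the `Incident` class are illustrative assumptions, not a model prescribed by ITIL or by the original article.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Allowed transitions between incident states (illustrative names).
TRANSITIONS = {
    "recorded": {"categorized"},
    "categorized": {"assigned"},
    "assigned": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}


@dataclass
class Incident:
    summary: str
    state: str = "recorded"
    category: str | None = None
    assignee: str | None = None
    history: list[str] = field(default_factory=lambda: ["recorded"])

    def _move(self, new_state: str) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.history.append(new_state)

    def categorize(self, category: str) -> None:
        self._move("categorized")
        self.category = category

    def assign(self, assignee: str) -> None:
        self._move("assigned")
        self.assignee = assignee

    def resolve(self) -> None:
        self._move("resolved")

    def close(self) -> None:
        self._move("closed")
```

Keeping the history on the incident itself gives the "monitoring the whole process" part a concrete record to query.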

Problem and log management: A service management process that minimizes the negative impact of problems and accidents on the business by investigating and analyzing weaknesses in the IT infrastructure, identifying the causes of accidents, and devising solutions and measures to prevent accidents from happening again.

In the problem management part, log the problem handling process and provide the query function to track the problem to prevent similar problems from happening again.

Change management: The service management process of controlling changes to an infrastructure or service in the shortest possible time window.

The goal of change management is to ensure that standard methods and procedures are used to implement changes as quickly as possible so as to minimize the impact of business interruption caused by changes on the business.

Availability management: The management process of optimizing and designing the availability of the IT infrastructure by analyzing the availability needs of users and business systems, to ensure that growing availability needs are met at reasonable cost.

Availability management is a forward-looking management process. By identifying the availability requirements of the business and its users, it bases the design of IT services on real demand, which avoids excessive availability levels in the operation of IT services and saves operating cost.

Emergencies: Analyze the running status of the business system and the logs of problems that have occurred, identify the root causes of routine problems in the system, and standardize the handling process for emergencies.

Emergencies are avoided and resolved promptly by combining timely discovery and resolution, stronger monitoring and supervision, technical measures, spare parts, emergency measures, contingency plans, and strategies.

The automated operation and maintenance platform is a business-oriented scheduling platform that coordinates the subsystems and directs each underlying subsystem to serve the business.

The construction of an automated O&M platform is a gradual process. Only through continuous testing and improvement according to the needs of the business and of O&M can the status quo of O&M be fundamentally changed, the work efficiency of O&M be improved, and automated O&M finally be realized.

6. Summary of enterprise automatic operation and maintenance scheme

The operation and maintenance work of the enterprise has experienced a process from all manual operation at the beginning, to most manual operation and a little automation later, to the automatic operation and maintenance now.

Before the operation and maintenance platform is built, many operations need to be done when a new service is launched, such as DNS change, LVS change, OS initialization, automated testing, continuous deployment, continuous feedback, monitoring, service invocation relationship configuration, and so on.

At present, the launching of new services only requires simple configuration, and the remaining work is automatically completed by the platform coordination.

After using the automated operation and maintenance platform, user satisfaction increased from 33% to 95%, and the proportion of IT expenses in revenue decreased from 4% to 2.4% during the same period.

Building the automated operation and maintenance platform sorted out business processes effectively and gave a clear view of existing IT resources and their running status, reliability, and availability. The enterprise now has detailed, overall information about its IT resources and assets, which provides strong support for decision-making.

The construction of an automated operation and maintenance platform has improved operation and maintenance work efficiency. In the past, many faults and events had to be handled manually, but now most of them are handled automatically by the platform according to predetermined rules, and the operation and maintenance response time has been greatly shortened.

Building the platform also surfaces potential problems and reduces the failure rate. Operation and maintenance personnel are no longer the “fire fighting” team they used to be: some potential problems are found and dealt with early, which avoids business interruptions caused by failures.

Building an automated O&M platform also facilitates rapid fault recovery. Configurations can be saved at earlier points in time as baseline snapshots. By comparing the configuration baselines from before and after a fault, the clues and root cause of the fault can be found quickly, along with a solution to recover the system in time.
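The baseline comparison can be sketched as a plain dictionary diff between a saved snapshot and the configuration observed after a fault. This is a minimal illustration that assumes flat key/value configurations with string keys. The function name `diff_config`, the `ABSENT` marker, and the sample keys are hypothetical.

```python
from __future__ import annotations

# Marker used when a key exists on only one side of the comparison.
ABSENT = "<absent>"


def diff_config(baseline: dict, current: dict) -> dict[str, tuple]:
    """Map each changed key to (baseline value, current value)."""
    changes = {}
    for key in sorted(set(baseline) | set(current)):
        old = baseline.get(key, ABSENT)
        new = current.get(key, ABSENT)
        if old != new:
            changes[key] = (old, new)
    return changes


if __name__ == "__main__":
    before = {"max_connections": 500, "log_level": "info"}
    after = {"max_connections": 50, "log_level": "info", "debug": True}
    for key, (old, new) in diff_config(before, after).items():
        print(f"{key}: {old} -> {new}")
```

Only the keys that differ are returned, so the output points directly at the candidates for the root cause.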

Description: This article is reprinted from TalkWithTrend, by Nie Kuijia.
