Dear friends, I feel I have posted rather a lot of technical articles recently, and I wonder whether they have helped you solve practical problems in your work. Why do I ask? Because the small knowledge points covered in those articles add up, and together they have freed me, to a certain extent, from fragmented and busy operation and maintenance work. Those of you who have read them carefully will probably agree that the ability to solve pain points does not come only from impressive, high-end technology. On the contrary, the details we usually ignore are often the key to the problem. Only by hitting the nail on the head can we do the right thing.

So over the next few days, I will share my thoughts on some of the problems that come up in operation and maintenance, hoping to give you some inspiration.

This post shares my thinking on O&M management and O&M automation.

I. What is the work of operation and maintenance?

1. Infrastructure: network, servers, operating systems, etc.
2. Environment management: development, test, and production environments, etc.
3. Deployment: deploying applications or systems to the different environments.
4. Monitoring: monitoring infrastructure, applications, and systems.
5. Alarm response: responding to and handling alarm notifications.
6. Performance optimization: optimizing the performance of the system and related components.
7. System high availability (HA): eliminating single points of failure in the application system.
8. SLA guarantee: ensuring the availability of the service system and scaling capacity out and in automatically according to the SLA.

The items above are drawn from an operation and maintenance management framework; the work includes but is not limited to them.

II. Operation and maintenance status

From the perspective of the 80/20 rule, 80% of the operation and maintenance work listed above is tedious, repetitive manual processing, and the remaining 20% has to be handled case by case according to different factors.

That 80% can be handled by automation. The remaining 20% can be handled by collecting and analyzing multi-dimensional monitoring data and then making a further judgment.

III. Operation and maintenance management

Given the current situation, our priority is to solve the problem of automation. The premise of automation is standardization, and good automation also needs to be combined with visualization or a web interface. Together these can optimize 80% or more of our work.

Therefore, the main objectives of our current operation and maintenance management are standardization, automation, and visualization/web.

Standards can be formulated according to the actual operation and maintenance situation; visualization/web can be achieved with open source tools or in-house web development.

IV. Operation and maintenance automation

O&M automation can be realized in several main areas:

1. Server provisioning automation

After a new server or VM is created and delivered to one of the environments, a series of initialization steps is required, covering CPU, memory, disk, IP address, kernel parameter tuning, time synchronization, SSH hardening, firewall rules, and client installation. This alone is not enough. If the operation and maintenance platform integrates a CMDB, a jump server, Zabbix, and so on, the new server also needs to be registered with each of those management tools, and any other tools in use need to be integrated in the same way.

In summary, the ultimate goal of server provisioning automation is an optimized environment, a secure and available server, and registration with all management tools.
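The provisioning flow described above can be sketched roughly as follows. This is a minimal illustration, not a real integration: the `Server` class, the step names, and the in-memory `REGISTRIES` dictionaries are all hypothetical stand-ins for the CMDB, jump server, and Zabbix, each of which would be reached through its own API in practice.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    hostname: str
    ip: str
    environment: str
    completed_steps: list = field(default_factory=list)


# Hypothetical in-memory stand-ins for the management tools named in the text.
REGISTRIES = {"cmdb": {}, "jump_server": {}, "zabbix": {}}


def baseline_steps():
    """Ordered initialization steps taken from the list in the text."""
    return [
        "kernel_parameter_tuning",
        "time_synchronization",
        "ssh_hardening",
        "firewall_rules",
        "client_installation",
    ]


def provision(server):
    """Run every baseline step, then register the server everywhere."""
    for step in baseline_steps():
        # A real implementation would execute the step remotely;
        # this sketch only records that it ran.
        server.completed_steps.append(step)
    for registry in REGISTRIES.values():
        registry[server.hostname] = {"ip": server.ip, "env": server.environment}
    return server


def is_fully_registered(hostname):
    """True only if every management tool knows about the host."""
    return all(hostname in registry for registry in REGISTRIES.values())
```

Checking `is_fully_registered` at the end of the run is one way to enforce the goal stated above, namely that a server is not considered delivered until every management tool has it on record.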

2. Automation of environment definitions

Environment definition falls into two situations: (1) in small and medium-sized companies, one test environment contains all systems, that is, the systems are not isolated, and a single database instance holds the databases of the various systems; (2) in large companies, each system needs its own isolated test environment, and the systems cannot access one another.

Automation of the environment definition is more applicable in the second case, where resources need to be created quickly for the department that requests them.

Overall, the main principle of environment definition automation is to keep some degree of isolation in either case, so as to reduce the problems caused by connecting to the wrong environment. Tracking down environment problems is one of the most unpleasant tasks in operation and maintenance.
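One way to make the isolation principle concrete is to check every configured connection against the environment it belongs to before anything is deployed. The sketch below assumes a hypothetical `Endpoint` record and an `isolate_systems` flag that distinguishes the two situations described above; none of these names come from a real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    environment: str  # for example "dev", "test", "production"
    system: str       # the business system that owns the endpoint


def can_connect(source, target, isolate_systems=False):
    """Return True only if the connection stays inside the isolation boundary."""
    # Connections must never cross environments in either situation.
    if source.environment != target.environment:
        return False
    # Situation (2): each system has its own isolated environment.
    if isolate_systems and source.system != target.system:
        return False
    return True


def find_misconnections(links, isolate_systems=False):
    """List the (source, target) name pairs that violate the isolation rules."""
    return [
        (src.name, dst.name)
        for src, dst in links
        if not can_connect(src, dst, isolate_systems)
    ]
```

Running such a check over the full list of configured connections before a deployment would surface a test service pointed at a production database early, rather than after the fact.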

3. Deployment automation

Deployment automation keeps evolving. Roughly, it has moved from scripts, to batch SSH, to automation tools, to containers, and at each step the focus has shifted: from batch operations, to availability, to ease of use, to efficiency. Deployment automation now addresses not only the deployment itself, but also how to mask the underlying differences more quickly and easily.

Note: This recalls the point in the DevOps mind map about improving the speed of automation: once automation is initially in place, speed is the next thing to optimize.

After deployment automation is complete, it needs to be integrated with the monitoring system, that is, availability monitoring and performance monitoring for the deployed system are added to the monitoring system automatically.
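A rough sketch of this hand-off between deployment and monitoring is given here. The service dictionary layout, the `monitoring_registry` list, and the item fields are assumptions made for illustration; a real monitoring system would be updated through its own API.

```python
def monitoring_items_for(service):
    """Build availability and performance checks for one deployed service."""
    host, port = service["host"], service["port"]
    items = [
        {"kind": "availability", "check": "port", "target": f"{host}:{port}"},
    ]
    health_path = service.get("health_path")
    if health_path:
        items.append({
            "kind": "availability",
            "check": "url",
            "target": f"http://{host}:{port}{health_path}",
        })
    items.append(
        {"kind": "performance", "check": "response_time", "target": f"{host}:{port}"}
    )
    return items


def deploy(service, monitoring_registry):
    """Deploy a service and register its monitoring items in one step."""
    # The deployment itself (copying artifacts, starting processes) is
    # out of scope here; only the monitoring hand-off is shown.
    monitoring_registry.extend(monitoring_items_for(service))
    return monitoring_registry
```

Deriving the checks from the same service definition that drives the deployment means a newly deployed service cannot be forgotten by monitoring.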

4. Monitoring automation

From "System Monitoring System", we know that monitoring objects are divided into multiple dimensions, and a different tool may be used in each dimension. In other words, monitoring automation may also require different tools. For example: (1) automatically add availability monitoring, such as port and URL monitoring; (2) automatically add log status monitoring, such as STATUS and ERROR.

Of course, monitoring automation is not only about monitoring itself; it also covers automated fault recovery, that is, fault self-healing.
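The two ideas in this part, log status monitoring and fault self-healing, can be illustrated with a small sketch. The keyword list, the `check` and `restart` callables, and the retry limit are all assumptions; in practice they would come from the monitoring tool and the service manager in use.

```python
from collections import Counter


def count_log_keywords(lines, keywords=("ERROR", "WARN")):
    """Count how many log lines contain each keyword."""
    counts = Counter()
    for line in lines:
        for keyword in keywords:
            if keyword in line:
                counts[keyword] += 1
    return counts


def self_heal(check, restart, max_attempts=3):
    """Restart a service until its health check passes or attempts run out.

    `check` returns True when the service is healthy; `restart` tries to
    recover it. Returns True if the service ends up healthy.
    """
    for _ in range(max_attempts):
        if check():
            return True
        restart()
    return check()
```

Capping the number of restart attempts matters: when self-healing fails, the fault should still reach a person as an alarm instead of looping silently.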

5. Release automation

When the number of servers is small, a version release must take node removal and alarm masking into account, and the release process must be integrated with Nginx and the monitoring system. For example: (1) use Nginx to remove nodes gracefully; (2) call an API to disable and re-enable monitoring items.
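The release sequence can be sketched as below, assuming a node-by-node rollout. The `render_upstream` helper relies on the `down` parameter of the Nginx upstream `server` directive, which marks a node as unavailable so that no new requests are routed to it. The upstream name `app` and the `release_node` and `set_monitoring` callables are hypothetical placeholders for the real release step and the monitoring API call.

```python
def render_upstream(name, nodes, draining=()):
    """Render an Nginx upstream block, marking draining nodes as `down`."""
    lines = [f"upstream {name} {{"]
    for node in nodes:
        suffix = " down" if node in draining else ""
        lines.append(f"    server {node}{suffix};")
    lines.append("}")
    return "\n".join(lines)


def rolling_release(nodes, release_node, set_monitoring):
    """Release one node at a time and return the ordered list of actions."""
    actions = []
    for node in nodes:
        # Mask alarms first so the planned restart does not page anyone.
        set_monitoring(node, False)
        actions.append(("mask_alarms", node))
        # In a real setup this text would be written to the Nginx config
        # and Nginx reloaded (for example with `nginx -s reload`).
        render_upstream("app", nodes, draining=[node])
        actions.append(("drain", node))
        release_node(node)
        actions.append(("release", node))
        render_upstream("app", nodes)
        actions.append(("restore", node))
        set_monitoring(node, True)
        actions.append(("unmask_alarms", node))
    return actions
```

Keeping the order fixed (mask alarms, drain, release, restore, unmask) is the point of automating this: skipping or reordering a step by hand is exactly what produces false alarms or dropped requests during a release.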

V. Several stages of operation and maintenance automation

He who stands tall sees far. Whatever aspect of automation we are working on, it helps to understand the stages of O&M automation from a higher level:

1. Operation automation

At this level, a series of operations that used to be performed manually are linked together by scripts or tools, which to a certain extent solves the problem of manual execution. However, different scenarios require constant tweaking of the scripts or tools, which increases the probability of errors.

2. Scenario automation

At this level, the tools decide how to operate based on external conditions defined by operation and maintenance staff. An O&M system at this level needs all kinds of environment data as decision conditions and can change its behavior accordingly. In addition, it needs to be connected with many third-party systems (CMDB, network management system).

3. Intelligence

An O&M system at this level has a data core (big data storage in which all operational data is stored centrally according to its relationships) and can analyze and judge based on that data, make decisions, and execute them by itself. At this level, the main work of O&M staff is to add analysis strategies to the system, to operate and maintain the intelligent O&M system itself, and to step in with human judgment at the key nodes of system execution.
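A minimal sketch of this kind of condition-driven decision is shown next. The context fields, the rule set, and the action names are invented for illustration; in a real system the context would be assembled from the CMDB and other third-party systems.

```python
# Each rule pairs a condition on the environment data with an action name.
# Rules are defined by operation and maintenance staff and checked in order.
RULES = [
    (lambda ctx: ctx["environment"] == "production" and ctx["in_business_hours"],
     "queue_for_maintenance_window"),
    (lambda ctx: ctx["disk_usage_percent"] >= 90, "clean_old_logs"),
    (lambda ctx: ctx["cpu_percent"] >= 85, "add_node"),
]


def choose_action(context, rules=RULES, default="notify_operator"):
    """Return the action of the first rule whose condition matches."""
    for condition, action in rules:
        if condition(context):
            return action
    return default
```

The difference from plain scripting is that the same tool behaves differently as the environment data changes, without anyone editing the script for each case.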
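As a very small illustration of deciding from stored data while keeping a person in the loop, the sketch below flags a metric value that deviates strongly from its history and asks for approval before acting. The three-standard-deviation threshold, the `scale_out` action, and the `approve` callable are assumptions chosen for the example; they are not a description of any particular intelligent O&M product.

```python
import statistics


def is_anomalous(history, value, threshold=3.0):
    """Flag a value that lies more than `threshold` deviations from the mean."""
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    if spread == 0:
        # With a perfectly flat history, any change counts as an anomaly.
        return value != mean
    return abs(value - mean) / spread > threshold


def decide(history, value, approve):
    """Propose an action from the data; a human approves it at the key node."""
    if not is_anomalous(history, value):
        return "no_action"
    # `approve` represents the human judgment step described in the text.
    return "scale_out" if approve("scale_out") else "await_operator"
```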

VI. How to automate operation and maintenance

Before we think about how to automate operations, we need to realize that “the architecture of an enterprise is not designed, it evolves.” We can use that as a guide.

1. Deal with pain points first

In daily work, common problems should be classified and sorted out: turn into tools whatever can be turned into tools, and script whatever can be scripted, so that human intervention is avoided. Whether this is built on a CMDB is less important, especially if the business system is not that large and the servers change infrequently.

2. Choose the right stage

Operation and maintenance automation generally goes through the following stages: manual support => online standardization => O&M tools => self-service/automation platform. Choose the automation mode that suits the current stage of your business, and do not try to do everything in one step.

In addition, for large and medium-sized O&M automation platforms, a CMDB and a configuration system are still indispensable. The CMDB is a configuration management database used to centrally manage IT data and server asset data. The accuracy and authority of the CMDB data determine whether operation and maintenance automation is on the right track.
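To show why automation depends on the accuracy of CMDB data, here is a small sketch in which the CMDB is the only source from which target hosts are derived. The record fields and hostnames are made up for the example; a real CMDB would be queried through its own interface.

```python
from collections import defaultdict

# Hypothetical CMDB records; the field names are assumptions for this sketch.
CMDB_RECORDS = [
    {"hostname": "web01", "ip": "10.0.0.11", "environment": "production", "role": "web"},
    {"hostname": "web02", "ip": "10.0.0.12", "environment": "production", "role": "web"},
    {"hostname": "db01", "ip": "10.0.0.21", "environment": "production", "role": "db"},
    {"hostname": "web-t1", "ip": "10.1.0.11", "environment": "test", "role": "web"},
]


def hosts_for(records, environment, role):
    """Return the hostnames that automation should target, sorted by name."""
    return sorted(
        r["hostname"]
        for r in records
        if r["environment"] == environment and r["role"] == role
    )


def inventory_by_environment(records):
    """Group all hostnames by environment."""
    groups = defaultdict(list)
    for record in records:
        groups[record["environment"]].append(record["hostname"])
    return {env: sorted(hosts) for env, hosts in groups.items()}
```

If a record here carries the wrong environment or role, every tool that reads it will act on the wrong machines, which is why stale CMDB data undermines the automation built on top of it.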

VII. Summary

1. Operation and maintenance automation

In the automation process described above, different third-party systems need to be interconnected at different stages, so it helps to have a unified ESB (enterprise service bus) to connect them. That said, lacking an ESB is not a serious problem. Different stages address different pain points, and the best operation and maintenance automation is simply the one that suits the current stage of business development.

2. Operation and maintenance management

Earlier in this article, I said that the main objectives of operation and maintenance management are standardization, automation, and visualization/web. From my personal experience, the objectives of operation and maintenance management also change as operation and maintenance automation moves through its stages.

For example, my company has now made initial progress on scenario automation and intelligence. Although it does not go very deep yet, about 80% of my operation and maintenance work has been taken off my hands, which has freed up most of my time. I am also wondering whether operation and maintenance management should step into the next stage: operation and maintenance as a service?

Reasons:

  • The value of O&M automation is that it frees O&M staff from tedious, routine, and error-prone work so that they can do more valuable work.

    So, from this perspective, O&M automation is neither the beginning nor the end. O&M automation is not a panacea, and we need to see clearly where it fits.

  • The essence of O&M is to serve the business, because O&M uses technology to solve business problems, and the value of O&M can only be realized through the business. An operations team is not impressive because it is technically sophisticated, because it manages tens of thousands of servers, or because it can work with many open source tools. For operation and maintenance, service comes first and technology second. The value of operation and maintenance technology is limited if it does not serve the business and help the business succeed.