About the author

Liang Mingtu, chief architect of New Actions Network, has more than 10 years of experience in database operation and maintenance, data analysis, database design, and system planning and construction, and has studied data architecture management and data asset management in depth.

In my previous article, “IT Operation and Maintenance Development Trend and Transformation and Upgrading of Operation and Maintenance Personnel”, I discussed several trends in enterprise IT operation and maintenance. Many friends who work in IT system operation and maintenance discussed the topic with me afterwards, and the subject that drew the most attention was the IT operation and maintenance tool platform.

This is probably tied to our positions and roles. We all work in operation and maintenance, and we share a need for suitable tools or platforms to help us. However, enterprises differ in their IT environments and in how far their IT has developed, so their IT operation and maintenance platforms are at different stages too.

This article briefly summarizes four common stages in building an IT operation and maintenance platform. These stages closely resemble the stages of production in human society, and roughly correspond to a “farming age”, an “industrial age”, an “information age” and an “intelligent age” of IT operation and maintenance.

I. The Farming Age of IT Operation and Maintenance: Manual Operation and Maintenance

Enterprises with a low level of IT informatization build their IT support and management systems around their core business. For example, when I had just graduated, I maintained the 97 system of the telecom industry, which for the first time brought the industry’s core business and information, such as business acceptance, pipeline information management and service provisioning, under unified management.

At that time, both the system architecture and the infrastructure were extremely simple. A few dozen servers and some basic software, such as middleware and databases, made up the entire inventory of enterprise informatization. Given the operation and maintenance practices of the day and the degree to which the business depended on IT systems, operation and maintenance personnel had relatively little need for IT tools and platforms.

In the manual stage, the enterprise’s IT scale is small and its technology stack is uniform, so the level of operation and maintenance is usually determined by the experience of the core people on the team, and a few technical experts are often the heart of the operation and maintenance team. Under this people-centered mechanism, operation and maintenance personnel tend to build their own library of maintenance scripts, storing frequently used solutions and procedures as scripts. That library becomes their personal “sunflower treasure book” of operation and maintenance.

 

(The book we used in those years)

As enterprise IT grows, however, manual operation and maintenance runs into the following problems:

1. The dilemma of insufficient operation and maintenance resources

The gap between the size and complexity of an enterprise’s IT systems and the human resources of its operations teams tends to widen. In addition, new technologies such as cloud and open source software are gradually introduced into the enterprise IT environment. The introduction of these new technologies aggravates the dilemma of insufficient operation and maintenance resources.

2. Slow transfer of operation and maintenance knowledge

A large amount of operation and maintenance experience and knowledge is scattered across individual “sunflower treasure books”, which makes it hard to spread that knowledge effectively within the team.

Even when more people are hired, new operation and maintenance personnel need a lot of time to get familiar with the environment, and they have to build up experience in the real environment while being taught and mentored by teammates. Getting up to speed is often a very long process. At the same time, the departure of key people from the team causes the quality of IT operation and maintenance to fluctuate to varying degrees.

3. Low standardization of operation and maintenance

Different people performing the same operation based on their own experience may get different results, and may even cause a large-scale failure. I once saw a routine operation, “adding data space to the database”, bring a system down for half a day. The cause was that the operator took the step for granted and did not follow the standard procedure.

4. Enterprises depend heavily on IT systems

Enterprises are increasingly dependent on IT systems. Once an IT system fails, it has a huge impact on the enterprise’s business. An operation and maintenance mode that relies on manpower alone falls far short of what this dependence requires.

Therefore, toward the end of the manual stage, many operation and maintenance teams spontaneously write simple tools to make their work easier. At the same time, to solve these urgent problems, more enterprises begin to introduce professional operation and maintenance tools and gradually move toward automated operation and maintenance.

II. The Industrial Age of IT Operation and Maintenance: Automated Operation and Maintenance

Even in the era of manual operation and maintenance, many far-sighted technical experts felt the shortcomings of this approach and began building operation and maintenance tools to address low efficiency and non-standard operations.

For example, more than a decade ago Deyu Zou, an expert at our company and in the DBAPlus community, developed OraZ, a minimalist set of tools written as simple shell scripts, for operations teams.

It was the Oracle database operation wizard of its time, integrating most Oracle database operations, performance analysis, problem analysis and query scripts into one package. The database operations team loved this tool. Whenever we started work with a new customer, the first thing the operations engineers did was upload the tool from their laptops to the customer’s environment and deploy it. The reason was simple: without it, you had to write a lot of statements from memory, which greatly reduced efficiency.

As a result, specialized automated operation and maintenance tools and platforms for different scenarios emerged as the times required, for example:

  • Automatic monitoring: provides automatic monitoring and alarm services such as application performance monitoring, basic software service monitoring, host storage device monitoring, and network device monitoring.

  • Management: software that supports IT operation and maintenance services and configuration management, such as ITSM systems and CMDB software.

  • Automatic O&M: tools and software that provide automated O&M methods.

  • Other special tools: such as application performance management (APM) and database operation and maintenance management (DPM).

What automation brings

1. It represents the spirit of industrialization

In essence, automated operation and maintenance represents the spirit of industrialization in human society. Machines take over most mechanical, repetitive manual labor, which eases the tension between the operation and maintenance workload and limited human resources. Work that used to be time-consuming and accident-prone, such as monitoring, inspection, and software installation and deployment, is handled by machines through tools and platforms.

2. It centralizes enterprise IT operation and maintenance

Automated operation and maintenance (O&M) centralizes enterprise IT O&M. All software and hardware in an enterprise can be monitored and managed through a single O&M UI. Centralization simplifies operation and management in a complex environment and further reduces the workload of O&M personnel.

3. It standardizes operation and maintenance operations

Operations become standardized. Better script management and standard procedures reviewed by experts further constrain how operations are carried out. Operations for different scenarios are fixed in tools and platforms, which avoids tragedies such as a routine “adding data space to the database” operation bringing a system down.
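
One way to picture an operation that has been fixed in a tool is a procedure whose pre-checks must all pass before the change is allowed to run. The following is a minimal sketch in Python, not a description of any particular platform: the `StandardOperation` class, the two check functions, the field names in `env` and the 00:00 to 06:00 window are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A pre-check inspects the environment and returns (passed, message).
Check = Callable[[Dict[str, float]], Tuple[bool, str]]


@dataclass
class StandardOperation:
    """A reviewed operation: the action runs only if every pre-check passes."""
    name: str
    action: Callable[[Dict[str, float]], str]
    prechecks: List[Check] = field(default_factory=list)

    def run(self, env: Dict[str, float]) -> str:
        failures = []
        for check in self.prechecks:
            passed, message = check(env)
            if not passed:
                failures.append(message)
        if failures:
            # Refuse to run instead of relying on the operator's judgment.
            return "REFUSED: " + "; ".join(failures)
        return self.action(env)


def disk_has_room(env):
    # Hypothetical rule: the requested space must fit in the free disk space.
    ok = env["disk_free_gb"] >= env["requested_gb"]
    return ok, "not enough free disk space for the new data file"


def outside_peak_hours(env):
    # Hypothetical rule: only run between 00:00 and 06:00.
    ok = 0 <= env["hour"] < 6
    return ok, "operation is only allowed between 00:00 and 06:00"


def add_data_space(env):
    # Placeholder for the real, expert-reviewed script.
    return f"added {env['requested_gb']} GB of data space"


add_space = StandardOperation(
    name="add data space to the database",
    action=add_data_space,
    prechecks=[disk_has_room, outside_peak_hours],
)
```

Calling `add_space.run(...)` with an environment that fails a check returns a refusal message listing every failed check, so the same rules apply no matter who performs the operation.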

4. It makes operation and maintenance more specialized

Professional tool platforms provide professional-grade O&M services for different scenarios, which to some extent supplements the professional skills of many O&M teams. For example, a professional DPM database operation and maintenance management platform can provide expert-level capabilities such as data collection, problem analysis and problem handling for common relational databases.

5. It passes on operation and maintenance knowledge effectively

Finally, automated operation and maintenance allows knowledge to be passed on effectively. Operation and maintenance experts are freed from heavy routine work and can build their knowledge into the platform scenario by scenario, continually enriching and improving it.

(O&M automation based on O&M scenarios)

Building an O&M automation platform is therefore essentially a process of turning the O&M team’s capabilities into services, organized around O&M scenarios. It frees us from a large number of repetitive and non-standard manual operations and lets us focus on improving O&M service quality. To improve that quality further, we all began to study how to combine operation and maintenance work with data.

III. The Information Age of IT Operation and Maintenance: Data-Driven Operation and Maintenance

Automated O&M tools and platforms greatly improve O&M efficiency and free O&M teams from mechanical, repetitive work. At this point, operation and maintenance personnel can re-examine the whole process. They find that some problems remain, and that the automated operation and maintenance system still has no good answer to them.

Problems still exist in the automated operation and maintenance system

1. Decisions rest on the operator’s experience and gut feeling

Operations and problem analysis are still judged on the basis of the operator’s experience, largely without the support of data and quantification. Analysis and handling still rely on experience or even intuition, and operation and maintenance decisions are made by feel.

2. The deep relationship between operation and maintenance operations and events is unknown

The deeper correlations between many O&M operations and events cannot be effectively evaluated or analyzed. For example, suppose we are about to launch a new version of an application that involves multiple changes. To what extent will these changes affect other applications and systems? The people involved need to know this before going live.
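
When dependency data between components is available, the question "what else will this change affect?" can be approximated by walking a dependency graph. The sketch below is only an illustration under that assumption: the `DEPENDENTS` map and its component names are made up, and in practice the data might be exported from a CMDB.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency data: each key is a component, and each value
# lists the components that depend on it directly.
DEPENDENTS: Dict[str, List[str]] = {
    "orders-db": ["order-service"],
    "order-service": ["web-portal", "billing"],
    "billing": ["reporting"],
    "web-portal": [],
    "reporting": [],
}


def impacted_by(changed: List[str], dependents: Dict[str, List[str]]) -> Set[str]:
    """Return every component reachable downstream of the changed ones."""
    seen: Set[str] = set()
    queue = deque(changed)
    while queue:
        component = queue.popleft()
        for downstream in dependents.get(component, []):
            # Skip components already found and the changed ones themselves.
            if downstream not in seen and downstream not in changed:
                seen.add(downstream)
                queue.append(downstream)
    return seen
```

A breadth-first walk is used so that each component is visited once even when several paths lead to it. The result is only as good as the dependency data behind it.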

The trend toward data in O&M tools

For example, a failure caused by insufficient data space in a database rarely happens overnight. In most cases it is the result of a relatively long accumulation.

By tracking and analyzing space capacity data along a time axis, we can make more accurate forecasts over a long period, handle the problem in advance, and avoid faults.
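
As a rough illustration of this kind of forecast, the sketch below fits a straight line to daily usage samples and extrapolates when the space would run out. It assumes one sample per day and roughly linear growth. The function name `days_until_full` and the sample figures are hypothetical, and real capacity data may call for a more careful model.

```python
from typing import List, Optional


def days_until_full(used_gb: List[float], capacity_gb: float) -> Optional[float]:
    """Fit a straight line to daily usage samples and estimate the days left.

    used_gb[i] is the space used on day i. Returns None when usage is flat
    or shrinking, because no fill date can be extrapolated.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two samples")
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(used_gb) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used_gb))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    # Days remaining, counted from the most recent sample.
    return max(0.0, day_full - (n - 1))
```

With samples of 100, 102, 104, 106 and 108 GB on consecutive days and a 128 GB limit, the fitted growth is 2 GB per day, so the function reports about 10 days remaining.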

More than 10 years ago, fields such as data warehousing and BI were already using business data for specialized analysis and mining. Nowadays big data has become a main driving force of enterprise business, which offers an important lesson for IT operation and maintenance: the data generated during operation and maintenance can be analyzed and applied to further improve quality. Moving O&M tools toward data is therefore an inevitable trend after automation.

Characteristics of operation and maintenance data

The first characteristic of data-driven operation and maintenance is that all operation and maintenance events are digitized. Data from monitoring systems, automated operations, the CMDB, log files and various professional tools is collected, cleaned, integrated and structured, then gathered on a single operation and maintenance data platform, so that data that used to sit in isolated silos across the enterprise IT environment can be connected and cross-referenced.

Second, a more open and transparent operation and maintenance data system should be built so that more operation and maintenance personnel can take part in data analysis, play to their own strengths, and analyze and apply the data from different perspectives. As with business data, the value of operation and maintenance data is realized through use: the more it is applied, the more valuable it becomes.
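
The collect, clean, integrate and structure steps can be pictured as mapping every source into one shared record type. The sketch below assumes two made-up input formats, a monitoring record with epoch seconds and a log line that starts with an ISO-8601 timestamp. The `OpsEvent` type and all field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class OpsEvent:
    """Common structure that every source is mapped into."""
    timestamp: datetime
    source: str
    host: str
    message: str


def from_monitoring(record: Dict) -> OpsEvent:
    # Hypothetical monitoring format: epoch seconds plus hostname and text.
    return OpsEvent(
        timestamp=datetime.fromtimestamp(record["ts"], tz=timezone.utc),
        source="monitoring",
        host=record["hostname"].strip().lower(),
        message=record["alert"].strip(),
    )


def from_log_line(line: str) -> OpsEvent:
    # Hypothetical log format: "<ISO-8601 time> <host> <free text>".
    time_text, host, message = line.strip().split(" ", 2)
    return OpsEvent(
        timestamp=datetime.fromisoformat(time_text),
        source="log",
        host=host.lower(),
        message=message,
    )


def integrate(monitoring: List[Dict], log_lines: List[str]) -> List[OpsEvent]:
    """Clean, merge and order events so they can be analyzed together."""
    events = [from_monitoring(r) for r in monitoring]
    # Blank log lines are dropped as part of cleaning.
    events += [from_log_line(line) for line in log_lines if line.strip()]
    return sorted(events, key=lambda e: e.timestamp)
```

Once host names and timestamps are normalized, an alert from the monitoring system and a log entry from the same machine line up on one timeline, which is what makes cross-source analysis possible.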

In addition, a set of simple and effective visual operation and maintenance data analysis means should be established so that operation and maintenance personnel can intuitively understand the potential relationship and trend in operation and maintenance data through visual data analysis charts and statements.

Finally, visual operation and maintenance dashboards built on data analysis make operation and maintenance work more visible. On the one hand, visualization makes the work more open and transparent, improves other departments’ awareness of it, and improves the experience of doing it. On the other hand, the level of visualization reflects, to a certain extent, how well we understand our own operation and maintenance work. The higher the level of visualization, the easier and more efficient operation and maintenance becomes.

By collecting operation and maintenance data from the current environment, integrating existing platforms and tools, and applying big data and data analysis techniques, teams can quickly locate, troubleshoot and predict problems in each link of the IT system. They can also analyze the data from every distributed system along a business link as a whole and optimize IT services accordingly.

IV. The Intelligent Age of IT Operation and Maintenance: Intelligent Operation and Maintenance

In the last year or two, AI has become a hot research topic in the industry. Operations people are also beginning to study how to combine AI with operations to allow machines to truly manage themselves.

There are currently two interpretations of AIOps: Algorithmic IT Operations, that is, IT operations based on algorithms, and Artificial Intelligence for IT Operations. In my view there is no essential difference between the two. Both build on massive operation and maintenance data and use big data, modern machine learning and more advanced data analysis techniques to provide proactive, user-friendly and dynamic management capabilities, so that operation and maintenance work depends less on human experience and knowledge.

Common application scenarios include:

  • Abnormal alarm: based on historical monitoring indicator data, time-series algorithms analyze monitoring indicators for anomalies and raise accurate alarms on the abnormal ones.

  • Alarm convergence: based on historical event and alarm data, the relationships between events and alarms are discovered, frequently co-occurring events and alarms are merged and recognized as the same kind of fault, and multiple alarms and indicators are combined into one before being pushed to operations staff. This makes alarms more precise and avoids the alarm storms and noise that traditional monitoring tools produce when a single fault occurs.

  • Fault analysis: based on O&M data, events and alarms, combined with the knowledge base and models built from previous problems, a fault tree analysis is established. Together with decision trees and related algorithms, path derivation lets O&M personnel locate problems more quickly and intuitively, which makes them easier to solve.

  • Trend prediction: algorithms such as historical data fitting are used to predict resource trends and capacity. For example, exhausted host CPU, swap pages, memory or storage will gradually lead to system or application failure. The system builds an association model to warn users of a possible upcoming failure, so that O&M personnel can fix the problem before it affects services.

  • Fault portrait: multi-dimensional operation and maintenance data is collected to build a multi-structured underlying data model that fits various operation and maintenance scenarios, and faults are profiled within those scenarios. Standard forms of fault portraits then support enterprise IT operation and maintenance decisions and handling processes.
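
To make the first two scenarios more concrete, here is a deliberately simple sketch. It stands in for the time-series and correlation algorithms mentioned above and does not implement them. The anomaly check is a plain z-score against recent history, and the convergence step merely groups alarms from the same host that arrive close together in time. The function names, the 3-sigma threshold and the 60-second window are assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def is_anomalous(history: List[float], value: float, threshold: float = 3.0) -> bool:
    """Flag a value that deviates from the historical mean by more than
    `threshold` standard deviations (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough history to judge
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > threshold


def converge_alarms(
    alarms: List[Tuple[float, str, str]], window_seconds: float = 60.0
) -> List[Dict]:
    """Merge alarms from the same host that arrive within `window_seconds`
    of the previous alarm in the group. Each alarm is (time, host, text)."""
    groups: List[Dict] = []
    open_group: Dict[str, Dict] = {}  # most recent group per host
    for ts, host, text in sorted(alarms):
        group = open_group.get(host)
        if group is not None and ts - group["last"] <= window_seconds:
            group["texts"].append(text)
            group["last"] = ts
        else:
            group = {"host": host, "first": ts, "last": ts, "texts": [text]}
            groups.append(group)
            open_group[host] = group
    return groups
```

Even this crude grouping turns a burst of related alarms into a single notification. A production system would also need to learn which alarms tend to occur together across hosts, which is where the historical data comes in.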

On the whole, I think AIOps is a further extension of automated operation and maintenance and of operation and maintenance based on data analysis. It draws on the large amount of data accumulated in the automation stage, on the scenarios and applications built on operation and maintenance data analysis, and on the foundation laid during automation, and combines them with artificial intelligence techniques to make operation and analysis more convenient.

In recent years, O&M has gradually entered the intelligent era. Current application scenarios concentrate on basics such as abnormal alarms, alarm convergence, fault analysis, trend prediction and fault portraits, but they reflect the main direction of O&M tool platforms in the future. It is reasonable to believe that with the continued development of AI technology and the continuing efforts of operation and maintenance personnel, AIOps will gradually mature and cover more scenarios, and an unattended IT operation and maintenance system may one day be more than a fantasy.

V. Planning of enterprise IT operation and maintenance platform

In a recent discussion, a friend asked a question: “Our enterprise has encountered many problems in IT operation and maintenance. Can we skip the automation stage and directly implement AIOps intelligent operation and maintenance?”

My answer to this question is no, for two main reasons.

On the one hand, as in the example from my article “IT Operation and Maintenance Development Trend and Transformation and Upgrading of Operation and Maintenance Personnel”, the economic base determines the superstructure.

I believe that the IT operation and maintenance platform or tool must match the stage of the enterprise’s IT technical architecture and operation and maintenance system. If the platform lags behind them, many problems follow, such as a shortage of operation and maintenance staff and recurring problems that never get fixed for good. If the platform is too far ahead, the result is unaffordable or unnecessary investment, and sometimes further adverse effects.

If an enterprise’s level of IT informatization is not high and its IT environment is not large, it can start its IT operation and maintenance platform with automatic monitoring and gradually build up centralized operation. Later, as informatization improves, other functions and modules can be introduced step by step and according to plan, so that limited budget and resources go to the key areas.

On the other hand, like the process of economic construction, the construction of IT operation and maintenance platform is a multi-stage, gradual and continuous construction process with planning, rather than an overnight move.

This is because:

First of all, the construction and landing of the operation and maintenance platform requires a process, and the operation and maintenance platform also involves all aspects of operation and maintenance.

For example, an automated operation and maintenance platform involves building monitoring, automated operations, a configuration management database (CMDB), log collection and other professional tools. “Rome was not built in a day”, and neither is an operation and maintenance platform that fits an enterprise’s characteristics.

Moreover, building an enterprise IT operation and maintenance platform puts strong pressure on the existing operation and maintenance system to change, and that system has to be adjusted as the platform is built.

For example, once automation is in place, machines take over much of the daily work that used to be done by hand, and operations staff are freed from heavy workloads. This inevitably changes the current operation and maintenance system. The freed-up people need to move into more important positions and roles, such as enriching the automated scenarios and standardizing more complex operations, to further improve the quality and efficiency of enterprise IT operation and maintenance. This adjustment does not happen overnight. It has to be made, absorbed and digested gradually.

Finally, the four stages of operation and maintenance platform construction have a strong correlation and sequence. The latter stage usually needs to be based on the accumulation and experience of the previous stage.

For example, the experience and scripts that sit in the minds of operation and maintenance personnel, or in their own treasure books, during the manual stage are standardized, centralized and automated as operation and maintenance scenarios in the automation stage. The large amount of data accumulated during automation provides the necessary foundation for data-driven analysis and intelligent operation and maintenance. Intelligent operation and maintenance combines operation and maintenance data, algorithms and scenarios.

Therefore, the construction of each stage of the operation and maintenance platform often has a significant impact on the subsequent stage.

 

(Overall platform vision and phased construction plan based on the current situation and pain points of the enterprise)

Therefore, for the construction of enterprise operation and maintenance platform in full swing, my views are as follows:

  • Building the enterprise operation and maintenance platform is very important. It directly and substantially improves IT operation and maintenance efficiency, reduces the resources the enterprise has to invest in operation and maintenance, and is the best tool for improving operation and maintenance quality.

  • Building the platform clearly drives the whole operation and maintenance system, including its management system, processes and personnel, all of which need to be adjusted accordingly.

  • It is necessary to draw up a reasonable blueprint and construction plan for the future platform based on the existing IT environment and the enterprise’s future development. Planning matters: only with a plan can the platform be built in an orderly and purposeful way.

  • Enterprise operation and maintenance platforms need to be forward-looking, but the implementation requires time and resources, so they should not be overly ambitious.

  • Build operation and maintenance scenarios, and your own platform, around the characteristics and needs of the enterprise. Do not copy others mechanically. The platform that suits you is the best one.

  • Operation and maintenance data visualization is an important way to show the value of operation and maintenance. It makes operation and maintenance data more open and transparent and gives enterprise management a clearer picture of operation and maintenance work.

  • Enterprise IT operation and maintenance personnel are the main force in the construction of the operation and maintenance platform. Their ideas, experience and knowledge accumulated in enterprise operation and maintenance for a long time will be summarized and passed on to the operation and maintenance platform.