Jingwei Hu is a senior operations and maintenance (O&M) engineer responsible for operating and maintaining the payment system of China Minsheng Bank (CMBC).

I. Foreword

Over many years of IT operations practice, China Minsheng Bank has built and steadily refined a CMDB, an IT operations management (process) platform, a centralized monitoring system, a transaction performance monitoring system, an automated operations system, and a log management platform. In recent years it has also built an operations data platform to support IT O&M management.

In daily work, the monitoring systems (all kinds of monitoring), the management system (process), and the control system (automation) have each established mapping relationships with the CMDB, which connects the data consumption scenarios of the individual systems.

In practice, however, the tools remain scattered. Fault location, impact analysis, and similar tasks still depend on the experience of O&M personnel and on frequent switching between specialized analysis tools, so there is room to improve the efficiency with which O&M data is consumed.

II. Construction approach and results

Against this background, CMBC set out to use an architecture management visualization tool to integrate configuration data (CMDB), monitoring data (centralized alarm monitoring and transaction performance monitoring), automated O&M tools, and change data from the IT O&M management system into IT O&M architecture diagrams. The goal is a unified O&M data consumption scenario: a visual platform for IT O&M architecture management, known in the industry as a cloud map system.

At the beginning of system construction, we defined four types of operation and maintenance data consumption scenarios, as shown in the figure below:

The four typical scenarios are described in turn below.

1. Routine monitoring

O&M personnel need to know the running state of their own systems at all times. System performance indicators can be monitored actively and in real time through database, middleware, operating system, and network traffic monitoring, while transaction performance has to be diagnosed and reported in real time by the transaction performance monitoring system.

Front-line personnel need to open monitoring windows in several different tools to watch the system's alarms and abnormal indicators in real time, and these windows occupy a lot of terminal resources.

After receiving an abnormal alarm, second-line O&M personnel likewise have to open each monitoring platform to diagnose and locate the fault. Time and energy are often lost just logging in to each platform, which makes it hard to meet the “Double Ten” target of “10 minutes for fault location and 10 minutes for recovery”.

By integrating the data above, the cloud map system brings the specialized monitoring tools together around the application. The various running-status data are presented on a single page, alarm data and performance data are displayed synchronously in real time, and specific scenarios are visualized, so that the state of the system is intuitive and clear at a glance.

For example, Figure 1 shows the transaction volume, response time, response rate, and success rate for transactions sent by our bank's e-banking interconnection system to 14 counterparty banks, including ICBC, Agricultural Bank of China, Bank of China, CCB, Bank of Communications, and CMB. When an abnormal-transaction alarm occurs, it is attached to the icon of the application system in real time.

Figure 1: Monitoring of transactions between e-banking and counterparty institutions

2. Fault location

Suppose a large number of systems raise high-severity alarms at the same time. These systems depend on various networks, and each of them is backed by a complex system architecture.

In this case, how to locate faults and quickly recover services within a limited time is a low-frequency but high-risk problem facing O&M personnel.

With traditional troubleshooting methods, O&M personnel have to analyze all of these alarms together to determine possible causes.

The usual approach is for the owner of each application system to ask the database, operating system, middleware, network, and other teams to confirm whether the fault originates within that system.

If it does not, the owner uses an upstream and downstream system diagram drawn in advance to sort out the possible root-cause nodes, and then checks the architecture of each suspected root-cause system for faults before handling them further.

Because this work involves cross-departmental communication and calls for strong visual and logical reasoning, it places high demands on O&M personnel.

With the cloud map system, we can first view the overall application wall (as shown in Figure 2) to see how alarms are distributed across systems, then make a preliminary judgment about the key transaction nodes based on experience, and drill down into the application relationship panorama.

Figure 2: Application wall display

In the panorama, time-series alarms, performance indicator curves, and recent change records help narrow down the suspected fault area. From the suspected root-cause node, we can then drill down into the system architecture diagram and the network topology diagram, and analyze the alarm, change, and performance data of the objects in those diagrams to pinpoint the fault source (as shown in Figure 3).

Figure 3: Application interaction
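The narrowing step described above, from a set of alarming applications to a suspected root-cause node, can be pictured as a small graph query. The following is only a minimal sketch under stated assumptions: the system names, the `DEPENDS_ON` map, and the ranking rule (count how many alarming systems depend on each alarming node) are all hypothetical illustrations, not the cloud map system's actual data model or logic.

```python
from collections import Counter

# Hypothetical dependency map: each system -> the systems it calls or relies on.
DEPENDS_ON = {
    "mobile_banking": ["payment_gateway", "auth_service"],
    "online_banking": ["payment_gateway", "auth_service"],
    "payment_gateway": ["core_db"],
    "auth_service": ["core_db"],
    "core_db": [],
    "reporting": ["data_warehouse"],
    "data_warehouse": [],
}


def transitive_dependencies(system, depends_on):
    """Return every system reachable from `system` by following dependency edges."""
    seen = set()
    stack = list(depends_on.get(system, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def suspect_root_causes(alarming, depends_on):
    """Rank alarming systems by how many other alarming systems depend on them."""
    alarming = set(alarming)
    counts = Counter()
    for system in alarming:
        for dep in transitive_dependencies(system, depends_on):
            if dep in alarming:
                counts[dep] += 1
    # Most widely shared dependency first; ties are broken by name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


alarms = ["mobile_banking", "online_banking", "payment_gateway", "core_db"]
suspects = suspect_root_causes(alarms, DEPENDS_ON)
```

In this made-up example the shared database ranks first because every other alarming system ultimately depends on it. A real ranking would also have to weigh alarm severity, timing, and recent changes, which this sketch ignores.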

Finally, automated operations such as one-click inspection are integrated into the corresponding architecture diagrams, which saves the valuable time otherwise spent reasoning things out manually and logging in to each system one by one. After handling is complete, the real-time monitoring data in the corresponding architecture diagrams is checked again to confirm that the fix took effect.

After the fault is cleared, the application portrait function (as shown in Figure 4 below) can be used to review the causes and solutions, formulate plans, and provide preventive measures and emergency handling guidance for possible secondary risks.

Figure 4: Application portrait display

3. Change impact analysis

In daily change management, change impact analysis and change process review are the main concerns.

For change impact analysis, if the relationship data in the CMDB is incomplete, confirming the scope of impact becomes extremely difficult and requires more judgment from experience, multi-party communication, and a great deal of deliberation.

With the cloud map system, change impact analysis has improved systematically. For example, when a storage system needs maintenance, searching for any configuration item of the storage device is enough to find out which systems are associated with that storage, and to follow links to the corresponding system architecture diagrams for a closer look at the impact range (as shown in Figure 5 below).

Figure 5: Relationship between storage and application
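The storage example amounts to walking CMDB relationships "upward" from one configuration item to every application that ultimately relies on it. The sketch below illustrates that idea only. The relation pairs, item names, and the `impacted_applications` helper are hypothetical, and a real CMDB would carry typed relationships rather than bare pairs.

```python
from collections import defaultdict, deque

# Hypothetical CMDB relations as (consumer, provider) pairs:
# the consumer runs on, or stores its data in, the provider.
RELATIONS = [
    ("payment_app", "app_server_01"),
    ("payment_app", "payment_db"),
    ("payment_db", "db_host_01"),
    ("db_host_01", "storage_array_A"),
    ("crm_app", "crm_db"),
    ("crm_db", "db_host_02"),
    ("db_host_02", "storage_array_A"),
    ("hr_app", "hr_db"),
    ("hr_db", "db_host_03"),
    ("db_host_03", "storage_array_B"),
]
APPLICATIONS = {"payment_app", "crm_app", "hr_app"}


def impacted_applications(config_item, relations, applications):
    """Return the applications that directly or indirectly rely on `config_item`."""
    used_by = defaultdict(set)
    for consumer, provider in relations:
        used_by[provider].add(consumer)

    # Breadth-first walk from the item toward everything that consumes it.
    seen = {config_item}
    queue = deque([config_item])
    while queue:
        item = queue.popleft()
        for consumer in used_by[item]:
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen & applications)


affected = impacted_applications("storage_array_A", RELATIONS, APPLICATIONS)
```

Here maintenance on the assumed `storage_array_A` touches the payment and CRM applications but not the HR application. The result is only as good as the relationship data, which is the point the section makes about incomplete CMDB relations.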

4. Knowledge sharing

Knowledge sharing strengthens collaboration and brings out the initiative and creativity of team members. For example, second-line specialists can assemble an architecture diagram based on configuration data, combine it with related monitoring information and change records, and share it with the first-line duty manager in the ECC.

On the one hand, the duty manager can become familiar with the systems under management through architecture diagrams that are easier to understand. On the other hand, it becomes easier to narrow the root-cause range of the fault domain during fault location and pass that information to the second-line specialists, improving the overall efficiency of troubleshooting.

Presentations and reports in daily O&M are another knowledge sharing scenario. As an expression of consensus in the field of IT management, the architecture diagram is inherently suited to presentation and reporting.

Whether the occasion is training new employees, routine communication with backup O&M staff, introducing daily IT O&M work to business units, or describing important system construction achievements, the system's demonstration mode can effectively improve the efficiency of communication, so that the whole organization develops a mechanism for accumulating knowledge, building shared understanding, sharing quickly, and updating in real time.

Figure 6: Demo report large-screen mode

III. Future prospects

1. Visualized AIOps

In recent years the concept of AIOps has gradually gained popularity, and Gartner has added AIOps as a core node on top of the monitoring, management, and control O&M architecture. AIOps gathers various data sources into a big data store, performs computation and analysis on it, incorporates algorithms and machine learning capabilities, and finally delivers the results for consumption through visualization.

CMBC's O&M big data platform has been completed. CMBC is also working with the Intelligent Operation and Maintenance Laboratory of Tsinghua University to apply the laboratory's machine learning and algorithm research results in the production environment, where they can accumulate data and learn.

As a next step, the cloud map system will be connected to the anomaly monitoring and analysis data of the intelligent O&M system, so that AIOps findings can be displayed on the IT O&M architecture diagrams for visual fault location.

For example, beyond being filtered, compressed, correlated, and enriched, the event information presented in the architecture diagram will also be supplemented with system anomalies that the single-metric anomaly detection system mines from the performance data.

Suppose a business transaction system was configured to generate an alarm event when its response time reaches 100 ms. Once the anomaly detection system is online, machine learning based on the characteristics of the data can flag the system as abnormal during a slack period even if its response time is only 50 ms. These additional event reminders, combined with the cloud map system, make fault early warning visual and further improve the quality of O&M.
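The contrast between a fixed 100 ms threshold and a learned baseline can be shown with a deliberately simple stand-in. The sketch below uses a plain z-score against recent slack-period history. This is an assumed, simplified technique chosen for illustration, not the machine learning method used by the bank or the Tsinghua laboratory, and the history values and `z_limit` are invented.

```python
from statistics import mean, stdev

STATIC_THRESHOLD_MS = 100.0  # the fixed alarm threshold mentioned in the text


def static_alarm(response_ms):
    """Classic rule: alarm only when the fixed threshold is reached."""
    return response_ms >= STATIC_THRESHOLD_MS


def baseline_alarm(response_ms, history_ms, z_limit=3.0):
    """Alarm when the value is far slower than its own recent history (z-score)."""
    mu = mean(history_ms)
    sigma = stdev(history_ms)
    if sigma == 0:
        # Perfectly flat history: any slower value counts as a deviation.
        return response_ms > mu
    return (response_ms - mu) / sigma > z_limit


# Hypothetical slack-period response times in milliseconds.
slack_history = [20.0, 22.0, 19.0, 21.0, 20.0, 18.0, 21.0, 19.0]

fixed_rule_fires = static_alarm(50.0)
baseline_rule_fires = baseline_alarm(50.0, slack_history)
```

With this invented history, a 50 ms response stays silent under the fixed rule but stands out clearly against a baseline of about 20 ms. A production detector would also need to handle seasonality, trends, and noisy data, none of which a single z-score addresses.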

Figure 7: Gartner monitoring, management, and control O&M architecture

2. Visualization of automation scenarios

Next, the system will provide visualization for application releases and automatic disaster recovery (DR) switchover:

Application releases and DR switchover require managing complex resource relationships and strong dependencies between application systems. The workflow management of an automated O&M system can define these relationships clearly, which effectively safeguards the service quality of the DR system and improves the ability to respond to emergencies.
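One way to picture "defining these relationships clearly" is as a dependency graph of switchover steps from which an execution order is derived. The sketch below uses Python's standard `graphlib` module for this. The step names and their dependencies are hypothetical and do not describe the bank's actual DR procedure.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical DR switchover steps: each step -> the steps that must finish first.
SWITCHOVER_STEPS = {
    "switch_storage_replication": set(),
    "start_standby_database": {"switch_storage_replication"},
    "start_middleware": {"start_standby_database"},
    "start_application": {"start_middleware"},
    "redirect_network_traffic": {"start_application"},
}


def execution_order(steps):
    """Return the steps in an order that respects every declared dependency."""
    try:
        return list(TopologicalSorter(steps).static_order())
    except CycleError as exc:
        # args[1] holds the cycle that graphlib detected.
        raise ValueError(f"circular dependency among steps: {exc.args[1]}") from exc


order = execution_order(SWITCHOVER_STEPS)
```

Because the assumed steps form a single chain, the resulting order is unambiguous. With independent branches, several valid orders exist, and a workflow engine could run those branches in parallel.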

At the same time, colleagues and managers from every department can follow the process execution clearly on the big screen, turning the ECC into a unified “battle command center”.

3. In-depth scenario-based construction

With architecture diagrams integrated with all kinds of data, the architecture management visualization tool has become the comprehensive situation analysis tool that sits closest to O&M personnel.

On this basis, the system can be developed further, with deeper functions and data packaging tailored to the different working scenarios of O&M personnel.

For example, faults are in many cases the result of changes, so a change needs to be reviewed before it is made. For this pre-change review scenario, everything that needs attention before and after the change is packaged into a single page: the system architecture, the application's transaction performance indicators, system-level and network-level load indicators, and the application's newly generated counts and logs.

On the morning of the change day, application O&M staff automatically receive an email notification summarizing the information above. By clicking through to the scenario page that packages the data and diagrams, they can see the post-change state at a glance. If a problem appears, they can also view its symptoms and quickly locate the affected upstream and downstream systems.

IV. Conclusion

“The mind can never think without images.” Applied to IT O&M management, this famous saying of Aristotle suggests that the architecture diagram is a visual presentation of that mental image.

On the one hand, the standardization of IT architecture diagram ensures the sustainable optimization of operation and maintenance management at the level of IT governance;

On the other hand, with the deepening of visual management of architecture, the habit of using IT architecture diagram to run through the thinking flow of operation and maintenance work is gradually forming.

In the future, configuration data, monitoring data, log data, automation tools, and process tools will be organically integrated based on the architecture diagram, which will stimulate the new demand of operation and maintenance personnel for tools required by operation and maintenance, thus forming a more efficient data consumption scenario.

With the further use and continuous optimization of the tool, the corresponding requirements are still emerging, and we will share with you as we progress.
