Overview

The production system goes down, a service misbehaves, responses time out, or the system produces unexpected results... Users are affected, the client is unhappy, and the consequences are serious. In a sense, “if you solve a problem before users run into it, it is not a problem.”

Article summary: pain points in the production environment, our solution, and room for imagination.

1. Pain points

The pain points below come from our company’s own practice. They may not be universal, but the ideas can serve as a reference.

1.1. During a service release, we often face the following questions

  • The service has just been released: was it deployed successfully?
  • The service has just been released: is the version correct (is the instance running the version I meant to deploy)?
  • The service has just been released: are any files missing?

If there are only a few service instances, checking the servers one by one is feasible. However, any system with a reasonable number of users is deployed as multiple instances, and one machine room often holds dozens or hundreds of them, so checking each server by hand takes too much time and effort.

In fact, with a strong CI/CD pipeline, none of the above would be a problem.

1.2 After the service is released, more questions follow

  • Is the service available?
  • Each service is available on its own, but is the system call link unblocked end to end?
  • Is the code consistent across different instances in the same machine room (inconsistency can come from human negligence, problems in the O&M system, and so on)?

In addition, our company does on-premises deployments (packaging our system and deploying it at other companies). These need a new version every so often, but we do not know when the system was last deployed. Deploy the latest version of everything directly? The system is too large and is tied to too many other systems; deploying and verifying the latest version of all of them costs too much time and money.

1.3 Management requirements

  • Release faults: if a release goes wrong, an alarm is raised as soon as the release completes.
  • Runtime faults: an alarm is raised immediately, with the aim of finding anomalies before users do.
  • Automated monitoring: nobody can watch the console instance by instance.
  • Convenient, efficient alarm delivery: the people responsible should receive messages easily (email is not considered).
  • A stable monitoring system: does no alarm mean no fault? Or is the monitoring system itself faulty?

2. Solutions

Many monitoring systems on the market, such as Zabbix and Prometheus, are stable, efficient, and proven at many large companies. Yet I ended up building my own monitoring system instead of using one of them. Why? Is mine better? Of course not. The most powerful tool is not necessarily the best one; the most suitable one is. My reasons were: ① those systems are too heavy, and I only need to monitor a few services in one business module; ② those systems are company-level monitoring and have to be rolled out at the company level, while I needed a monitoring system online within a few days; ③ those systems target a general audience and their monitoring schemes are generic, while I need to customize some monitoring indicators. Generic tools are hard to fine-tune; the more customized the solution, the more likely it is to fit well.

2.1. How to monitor

2.1.1. Monitor the normal operation of services

Is monitoring ports enough? Definitely not. A port that accepts connections does not mean the service is working, so we chose to add a dedicated monitoring interface to each service.

2.1.2. Alarm notification

Email? Ruled out immediately; nobody checks email all day. The company uses DingTalk, so we chose DingTalk group robots.

2.1.3 Alarm content

The less, the better. Alarm only on what is truly core; otherwise the notifications flood the group and nobody pays attention. How many of the company’s many system monitoring emails do you actually read? If you do have a lot to notify (notifications, not alarms), use a separate group for them.

2.2 Monitoring implementation – code changes

Using a business interface for monitoring is not recommended. A monitoring interface should respond quickly and consume few resources. Business-level monitoring is the exception, but its resource consumption should be assessed carefully so that normal business traffic is not affected.

2.2.1 Service operation monitoring interface

Add an independent monitoring interface to the code; if a request to it returns data, the service is running normally.

```java
@ApiOperation(value = "test")
@RequestMapping(value = "/test", method = RequestMethod.GET)
public int test() {
    return 1;
}
```

The interface above is simple and crude: returning 1 means the service is normal. It was the first version of the monitoring implementation. The latest version of the monitoring interface is shown below; for the reason behind the change, see section 2.2.3 on consistency monitoring of different instance versions in the same machine room.

```java
@ApiOperation(value = "Query the current service version")
@RequestMapping(value = "/getVersion", method = RequestMethod.GET)
@ResponseBody
public String getVersion() {
    // Join both version numbers with an underscore, e.g. "8102_9".
    String version = new StringBuilder()
            .append(Version.developVersion)
            .append("_")
            .append(Version.modifyVersion)
            .toString();
    return version;
}
```
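On the monitoring side, a probe for this interface could look like the sketch below. It is only an illustration: the class name `VersionProbe`, the timeouts, and the rule "HTTP 200 with a non-blank body means healthy" are assumptions, and the `/test/getVersion` path follows the address convention in the note at the end of section 2.2.3. It uses the JDK's `java.net.http.HttpClient` (Java 11+).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper: probes the getVersion monitoring interface of one instance.
final class VersionProbe {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2)) // assumed timeout
            .build();

    /** Builds the probe URL for one instance, following the IP:PORT/test/getVersion convention. */
    static URI probeUri(String host, int port) {
        return URI.create("http://" + host + ":" + port + "/test/getVersion");
    }

    /** Assumed health rule: HTTP 200 with a non-blank body. */
    static boolean isHealthy(int statusCode, String body) {
        return statusCode == 200 && body != null && !body.isBlank();
    }

    /** Returns the reported version, or empty if the instance is unreachable or unhealthy. */
    static Optional<String> fetchVersion(String host, int port) {
        HttpRequest request = HttpRequest.newBuilder(probeUri(host, port))
                .timeout(Duration.ofSeconds(3)) // assumed timeout
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            return isHealthy(response.statusCode(), response.body())
                    ? Optional.of(response.body().trim())
                    : Optional.empty();
        } catch (IOException e) {
            return Optional.empty(); // connection refused, timeout, and similar
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }
}
```

An empty result from `fetchVersion` is the signal to raise an alarm; a present value can be kept for the version comparison.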

2.2.2 System link monitoring

For a fixed business flow, the overall system link is fixed too: system A calls system B, and system B calls system C. You can therefore add a call chain interface to systems A, B, and C and monitor only the top-level service A. If the call chain is abnormal, an alarm is raised.

```java
@ApiOperation(value = "Test the service call chain; returns the log AppName and service IP")
@RequestMapping(value = "/testCallChain", method = RequestMethod.POST)
@ResponseBody
public String testCallChain(@RequestBody CloudRequestVo<String> cloudRequestVo) throws Exception {
    // Trace ID received from the caller ==> trace ID of this instance.
    // Note: this value is computed here but is not part of the response below.
    String localVersion = new StringBuilder()
            .append(cloudRequestVo.getCloudTraceID())
            .append(" ==> ")
            .append(LogUtil.getCloudTraceID())
            .toString();
    // Call the next service in the chain and collect its response.
    String nextService = bServiceFeignClient.testCallChain(cloudRequestVo);
    // Append this instance's trace ID, start time, and version to the downstream response.
    return new StringBuilder(nextService)
            .append(" <== [").append(LogUtil.getCloudTraceID()).append("]")
            .append("[ServiceStartTime]").append(serviceStartTime)
            .append("[Version]").append(getVersion())
            .toString();
}
```

If you look closely at the code, you will see that the response contains information from every service in the call chain: CloudTraceID (current instance IP address, current instance name, and current request time), ServiceStartTime (system startup time), and Version (system version information).
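A monitoring script could go one step beyond "the request succeeded" and check that every hop actually reported itself. The sketch below pulls the `[Version]` values out of a chain response. The class `CallChainParser` and the hop-count check are hypothetical, and it assumes versions have the `developVersion_modifyVersion` shape produced by `getVersion()`. What the last service in the chain returns is not shown in the article, so the sample string in the usage note is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts per-hop version information from a testCallChain response.
final class CallChainParser {
    // Matches "[Version]" followed by a version such as 8102_9.
    private static final Pattern VERSION = Pattern.compile("\\[Version\\](\\d+_\\d+)");

    /** Returns the versions found in the response, in the order they appear. */
    static List<String> extractVersions(String chainResponse) {
        List<String> versions = new ArrayList<>();
        Matcher matcher = VERSION.matcher(chainResponse);
        while (matcher.find()) {
            versions.add(matcher.group(1));
        }
        return versions;
    }

    /** True if the response contains version information for exactly the expected number of hops. */
    static boolean hasExpectedHops(String chainResponse, int expectedHops) {
        return extractVersions(chainResponse).size() == expectedHops;
    }
}
```

For a made-up response such as `leaf <== [b][ServiceStartTime]t1[Version]8102_9 <== [a][ServiceStartTime]t2[Version]8102_10`, `extractVersions` yields the two versions in order, and a hop count other than the expected one can be treated as an abnormal chain.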

2.2.3 Consistency monitoring of different instance versions in the same machine room

Attentive readers may have noticed that the response in system link monitoring includes the system version. It is used to check whether the versions of different instances in the same machine room are consistent. The monitoring principle:

  • (1) Whenever a developer finishes work that needs to be released to production, they must modify the version information in the code and increase the version number by 1;
  • (2) The monitoring script requests every instance of the same service in the same machine room; if the returned versions are inconsistent, an alarm is raised.
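Step (2) comes down to a grouping check. Here is a minimal sketch, assuming the script has already collected the version string of each instance and grouped the results by machine room. The class `VersionConsistency`, the room names, and the `ip:port` keys are illustrative only.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: finds machine rooms whose instances report different versions.
final class VersionConsistency {
    /**
     * Input: machine room -> (instance address -> reported version).
     * Output: only the rooms with more than one distinct version, with the versions seen.
     */
    static Map<String, Set<String>> inconsistentRooms(Map<String, Map<String, String>> versionsByRoom) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> room : versionsByRoom.entrySet()) {
            Set<String> distinct = new TreeSet<>(room.getValue().values());
            if (distinct.size() > 1) {
                result.put(room.getKey(), distinct);
            }
        }
        return result;
    }
}
```

Each entry in the result is one alarm candidate: the machine room plus the set of versions found there.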

Some readers may ask: this relies on modifying the code, so what if someone forgets? Our company builds with Jenkins, so I modified the relevant code: when Jenkins compiles, it verifies that the version number has increased, and if it has not, the build is refused.

```java
/**
 * developVersion: the development version number; it must be the same as the version number
 * in the SVN address.
 * modifyVersion: the modification version number; each time the code needs to be compiled and
 * released, it must be increased by 1.
 * For both fields, do not change the format, only the number; otherwise Jenkins may not allow
 * compilation.
 */
public class Version {
    // start
    public static int developVersion = 8102;
    public static int modifyVersion = 9;
    // end
    // Note: do not add anything between start and end; only the version numbers may be changed.
}
```
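The article does not show how the Jenkins check is implemented. One possible shape is a small gate that reads the two numbers out of `Version.java` and compares them with the previously released pair. Everything below is a hypothetical sketch: the class `VersionGate`, the regular expression, and the rule "a higher developVersion, or the same developVersion with a higher modifyVersion" are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical build gate: refuses a build whose version number has not increased.
final class VersionGate {
    private static final Pattern FIELD = Pattern.compile(
            "public\\s+static\\s+int\\s+(developVersion|modifyVersion)\\s*=\\s*(\\d+)\\s*;");

    /** Extracts {developVersion, modifyVersion} from the text of Version.java. */
    static int[] parse(String source) {
        int[] versions = {-1, -1};
        Matcher matcher = FIELD.matcher(source);
        while (matcher.find()) {
            int value = Integer.parseInt(matcher.group(2));
            if (matcher.group(1).equals("developVersion")) {
                versions[0] = value;
            } else {
                versions[1] = value;
            }
        }
        if (versions[0] < 0 || versions[1] < 0) {
            // The format was changed, which the class comment forbids.
            throw new IllegalArgumentException("version fields not found in source");
        }
        return versions;
    }

    /** Assumed rule: developVersion grew, or it is unchanged and modifyVersion grew. */
    static boolean isIncreased(int[] previous, int[] current) {
        return current[0] > previous[0]
                || (current[0] == previous[0] && current[1] > previous[1]);
    }
}
```

A build step could call `parse` on the checked-out file and fail the job when `isIncreased` returns false. That is also why the format between `// start` and `// end` must stay fixed.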

Note:

  • To avoid conflicts between monitoring interface addresses and business interface addresses, use the same prefix for all monitoring addresses, for example:
  • Service operation monitoring interface address: IP:PORT/test/getVersion;
  • System link monitoring interface address: IP:PORT/test/testCallChain;
  • Interface address for monitoring version consistency across instances in the same machine room: use the data returned by the getVersion interface directly.

PS: for version consistency monitoring, I first considered connecting to the remote servers over SSH and running a shell script to verify the versions, but that scheme was too cumbersome and was dropped.

2.3 Monitoring implementation – Monitoring and alarm

2.3.1. Enable the DingTalk group robot

The DingTalk group robot is used to receive alarm messages. How to enable the robot is not described here. Once the robot is enabled and you have its Webhook address, you can send group messages to DingTalk with an HTTP request. For details, see the DingTalk Open Platform documentation: ding-doc.dingtalk.com/doc#/server… .
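Sending a message through the Webhook is a single HTTP POST with a JSON body. The sketch below builds a DingTalk text message (`msgtype` set to `text`, with the message under `text.content`) and posts it with the JDK's `java.net.http.HttpClient`. The class name `DingTalkNotifier` and the hand-written JSON escaping are assumptions made to keep the example dependency-free. The Webhook URL is whatever the robot setup page gave you, and depending on the robot's security settings, extra requirements (such as a keyword in the content) may apply.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: sends a plain-text alarm to a DingTalk group robot webhook.
final class DingTalkNotifier {
    /** Escapes the characters that are not allowed unescaped inside a JSON string. */
    static String jsonEscape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Other control characters become a four-digit hex escape.
                        sb.append('\\').append('u').append(String.format("%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body of a text message. */
    static String textPayload(String content) {
        return "{\"msgtype\":\"text\",\"text\":{\"content\":\"" + jsonEscape(content) + "\"}}";
    }

    /** Posts the message to the webhook and returns the HTTP status code. */
    static int send(String webhookUrl, String content) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(webhookUrl))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(textPayload(content), StandardCharsets.UTF_8))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .statusCode();
    }
}
```

In a real script a JSON library would normally replace the manual escaping; it is written out here only so the block stands alone.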

2.3.2 Monitoring implementation

  • Preparations: collect the IP addresses and ports of the instances to be monitored, grouped by instance and machine room, plus the IP address and port of the top-level instance of each business flow to be monitored.
  • Monitoring logic: request the monitoring interfaces on a schedule; if anything is abnormal, call the DingTalk robot interface to raise an alarm.
  • Monitoring frequency: the frequency should not be so high that it affects services (PS: the monitoring interfaces are all lightweight). Call chain monitoring can run at a longer interval; after all, when each service is reachable, an abnormal call chain is less likely.
  • Normal-operation notification: does no alarm mean no fault? Or is the monitoring system itself faulty? To tell the two apart, send an “XXX monitoring is normal” notification to the monitoring group at fixed times every day.
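The logic above fits in a few lines. The sketch below uses a plain JDK `ScheduledExecutorService` purely so that it is self-contained; it is a stand-in, not the scheduler actually used in this article (see 2.3.3). The class `MonitorLoop`, the 60-second check interval, and the 12-hour heartbeat interval are assumptions. The health check and the message sender are passed in as functions, so any probe and any alarm channel can be plugged in.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical monitoring loop: periodic checks plus a periodic "still alive" notification.
final class MonitorLoop {
    /** Checks every target once and sends one alarm per failing target; returns the failure count. */
    static int runOnce(List<String> targets, Predicate<String> isHealthy, Consumer<String> send) {
        int failures = 0;
        for (String target : targets) {
            boolean ok;
            try {
                ok = isHealthy.test(target);
            } catch (RuntimeException e) {
                ok = false; // a check that throws counts as a failure
            }
            if (!ok) {
                failures++;
                send.accept("[ALARM] " + target + " failed its monitoring check");
            }
        }
        return failures;
    }

    /** Starts the periodic check and the heartbeat; intervals are assumed values. */
    static ScheduledExecutorService start(List<String> targets, Predicate<String> isHealthy,
                                          Consumer<String> send) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(
                () -> safely(() -> runOnce(targets, isHealthy, send)), 0, 60, TimeUnit.SECONDS);
        scheduler.scheduleAtFixedRate(
                () -> safely(() -> send.accept("XXX monitoring is normal")), 12, 12, TimeUnit.HOURS);
        return scheduler;
    }

    /** An uncaught exception would cancel a scheduled task for good, so log it and carry on. */
    private static void safely(Runnable task) {
        try {
            task.run();
        } catch (RuntimeException e) {
            System.err.println("monitor task failed: " + e);
        }
    }
}
```

The heartbeat only proves that the monitor itself is still running. If the expected message does not arrive, check the monitoring system first.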

2.3.3 Monitoring platform

Any platform works as long as it can reach the production services and make HTTP requests. I use XXL-JOB as the monitoring platform and write the monitoring code directly in GLUE mode, so I can refine the code and check the monitoring run logs at any time. The official XXL-JOB tutorial is at www.xuxueli.com/xxl-job/#/.

Colleagues in the company can contact me directly for the monitoring script; just replace the IP addresses and ports with those of the instances to be monitored (an existing interface can be used temporarily until the dedicated test interface is added).

3. Imagination space

3.1 Diversified alarm channels

DingTalk, WeChat, phone calls, internal enterprise communication tools: any channel works as long as it supports HTTP requests.
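If several channels are used at once, a small abstraction keeps the monitoring logic independent of any one of them. The sketch below is purely illustrative: the `AlarmChannel` interface and the `AlarmFanOut` class are hypothetical names, and each real channel would wrap its own HTTP call inside `send`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction over one notification channel (DingTalk, WeChat, phone, ...).
interface AlarmChannel {
    String name();

    void send(String message) throws Exception;
}

// Hypothetical fan-out: tries every channel and reports which ones failed.
final class AlarmFanOut {
    /** Sends the message to every channel; returns the names of the channels that failed. */
    static List<String> broadcast(List<AlarmChannel> channels, String message) {
        List<String> failed = new ArrayList<>();
        for (AlarmChannel channel : channels) {
            try {
                channel.send(message);
            } catch (Exception e) {
                // One broken channel must not stop the alarm from reaching the others.
                failed.add(channel.name());
            }
        }
        return failed;
    }
}
```

Returning the failed channel names lets the caller decide what to do next, for example retrying or escalating to a phone call.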

3.2. Business monitoring

(1) Business-level monitoring and customized monitoring of data inside instances; (2) integration with existing monitoring systems: many enterprises use ELK as their log system and already have many monitoring panels, so you can collect the panel data and send alarm notifications based on it.

3.3. Automated testing

Once the business is stable, automated testing can be fully adopted. Automated tests run as soon as code is released, and QA steps in to test complex scenarios only after the basic anomalies have been found and fixed. Once the request and response data structures of the complex business scenarios are stable and well defined, those scenarios can be covered by automated tests too.

3.4 Other notification scenarios

Does it have to be a system monitoring alarm? With group robots, any kind of notification is possible, such as periodic reports on important online business, or even periodic reminders.

4. Afterword

The monitoring system went live in January and has been iterated on for more than half a year. Now the happiest part of each day is receiving the “monitoring is normal” notifications in the morning and evening, and no abnormal notifications at any other time. The monitoring system really has caught many problems early, both during service deployment and in production operation. But with monitoring in place, is the system safe? If an alarm fires and nobody fixes the problem, the monitoring is useless; system exceptions in sibling business teams can also affect each other. Truly high availability needs the company to step up and work on system architecture, business, management, and other aspects together, not just talk. Of course, each team must take responsibility for its own business modules. For a single business module, the idea behind this monitoring system may fit very well. For unified monitoring of large-scale microservices, a mature monitoring system from the market is the better choice.

Ideas matter more than tools, so I leave the rest to your imagination.

Finally, here’s a picture of monitoring alarms:


Good luck!

Life is all about choices!

In the future, you will be grateful for your hard work now!
