The road to monitoring from 0 to 1000+ servers

AdMaster is a leading independent third-party provider of marketing big data solutions in China, and also an independent third-party DMP (data management platform) provider. AdMaster provides data services to 80% of the world's top 100 brands and to many well-known domestic brands in consumer goods, IT, automotive, and other industries. Familiar brands such as Durex, P&G, Kraft, Estée Lauder, Coca-Cola, Yili, Unilever, McDonald's, Microsoft, and Dongfeng Nissan all use AdMaster's data services.

Cloudwise (Yunzhihui) is honored to invite Mr. Gu Kai, Director of Operations at AdMaster (Jingshuo Technology), to share his experience of running operations as the fleet grew from a handful of servers to thousands:

I joined AdMaster more than five years ago and have lived through the company's rapid growth from dozens of servers to thousands. AdMaster currently adds more than 5 TB of data per day, handles more than 10 billion requests per day, processes more than 100 billion records per day, and runs more than 100,000 computing tasks per day. It supports second-level queries over 100 billion records and QPS in the millions.

Over the years, with stable operation and uninterrupted business as the baseline, I have led the operations team in building our own operations platform. It covers asset management, work order management, monitoring, domain name management, public cloud management, private cloud management, and operations data analysis, making operations work transparent and visible.

This talk focuses on how our monitoring system changed as we grew from dozens of servers to thousands. It is often said that a thousand readers see a thousand Hamlets; likewise, a thousand operations engineers have a thousand ways of doing operations. No method is universal or fits every scenario, and each specific problem needs its own analysis.

Stage 1: fewer than 200 machines

Stage 2: 200 to 1,000 machines

Stage 3: more than 1,000 machines (there is no real difference between 1,000 and 2,000)

The boundary between stages is not precise; each is only a rough period, and the change is gradual.

I. Fewer than 200 machines

At this stage the requirements are simple: be notified of problems, then locate and fix them quickly. The main requirements are:

1. Simple and easy to use;

2. Stable operation;

3. Able to send alerts by email and SMS.

Given these requirements, popular open source monitoring software such as Nagios, Cacti, Zabbix, or Ganglia will do. Popular open source products are well documented and quick to pick up, and there is plenty of prior experience to draw on, which helps you avoid many pitfalls and find solutions easily. Email alerts are generally supported out of the box, while SMS requires integration with an SMS platform.

We chose Nagios and Cacti in the early days, mostly for personal reasons: I was most familiar with Nagios, and Cacti makes monitoring switches almost foolproof. In fact, at this stage practically any monitoring product meets the need, so the deciding factor is personal preference, and operations engineers can afford to be a little willful.

II. 200 to 1,000 machines

In this period the requirements became more complex, but the main goals were still notification and alerting, plus preventing the same problem from recurring. I mainly did the following:

1. Unified monitoring content: basic monitoring is standardized. By default, basic metrics such as CPU, memory, and disk space are monitored on every machine.

2. Full coverage: every machine is monitored. Beyond basic monitoring, business monitoring matters most and should cover business processes as far as possible.

3. Timely notification with nothing missed: all monitoring items are classified by importance and urgency and notified at different levels through email, WeChat, SMS, phone calls, and so on, so that every alert is handled by someone. For important services, we additionally keep calling in rotation until someone responds.
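The level-based notification in item 3 can be sketched as a small routing table. This is only an illustration under stated assumptions, not AdMaster's implementation: the level names, the channels assigned to each level, and the `channels_for` and `call_until_answered` helpers are all invented for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical severity levels; the talk does not name them."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    EMERGENCY = 4


# Assumed mapping: each level adds a more intrusive channel.
CHANNELS = {
    Level.INFO: ["email"],
    Level.WARNING: ["email", "wechat"],
    Level.CRITICAL: ["email", "wechat", "sms"],
    Level.EMERGENCY: ["email", "wechat", "sms", "phone"],
}


def channels_for(level):
    """Return the notification channels used for an alert of this level."""
    return CHANNELS[level]


def call_until_answered(place_call, contacts, max_rounds=3):
    """Cycle through the on-call contacts until someone picks up.

    place_call(contact) is a caller-supplied function that returns True
    when the call was answered. Returns the contact who answered, or
    None if nobody answered within max_rounds passes over the list.
    """
    for _ in range(max_rounds):
        for contact in contacts:
            if place_call(contact):
                return contact
    return None
```

Keeping the level-to-channel mapping in one table means a change in urgency policy is a data change rather than a code change.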

During this period I studied Nagios in depth, wrote custom scripts, added a large number of monitoring items, and used most Nagios add-ons and features, such as NRPE and NSCA.
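As an illustration of what such a custom script can look like, here is a minimal disk usage check that follows the Nagios plugin convention of returning 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN) together with one line of status text. The thresholds and the function names `classify` and `check_disk` are assumptions for the sketch, not the scripts used at AdMaster.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a Nagios status code."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status_code, message) for the filesystem holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    used_pct = 100.0 * usage.used / usage.total
    status = classify(used_pct, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    return status, f"DISK {label} - {path} {used_pct:.1f}% used"
```

To run it as a plugin, a small wrapper would print the message and exit with the status code, for example through NRPE on the monitored host.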

As more machines and more services came under monitoring, alarm volume exploded, and I received thousands of alarm emails every day. One small incident: I was probably the first person to max out a Tencent Enterprise Mail mailbox. It was not the storage capacity that ran out; the number of emails exceeded the maximum their database could handle, so for a week I could not send, receive, or delete emails.

Toward the end of this stage, as Nagios approached 1,000 machines, its monitoring capabilities could no longer meet our needs, and its graphing had always been weak, so I began thinking about what to do beyond 1,000 machines. There were two paths:

1. Continue developing Nagios in depth according to our own needs;

2. Build our own monitoring system.

At this point some readers will think that switching to another open source monitoring tool would solve the problem. The biggest problem with open source software is that you can only use the functions it has; for anything missing, you either develop it yourself or give it up. The flood of alarms was only the trigger for the change. After a long period of use and accumulated experience, general-purpose open source monitoring products could no longer fully meet our large and complex needs.

After long and careful consideration, I decided to build our own monitoring system. Part of the reason was that I already understood Nagios's overall architecture and operating model thoroughly, so building a monitoring system myself did not seem impossible.

III. More than 1,000 machines

After the preliminary thinking and preparation, I began developing our own monitoring system at this stage to resolve the pain points and meet the requirements. The main tasks were:

1. Match every Nagios feature currently in use: do what Nagios does, cover the existing features, fix the problems we had with Nagios, and upgrade only after Nagios has been replaced. (This first step is the most important. If you cannot replace the functionality Nagios provided, the road to a self-built system ends there.)

2. Organize alarms to simplify them and reduce repeats: once alarm volume explodes, the things that really need handling get delayed unless the alarms are sorted out promptly, and certain causes, such as line problems, produce repeated alarms. Alarm information therefore has to be processed. Our alarm volume fell from more than 3,000 per day to under 300 per day.

3. Separate alarming from display: in most monitoring systems the alarm and display functions are bundled together, so information from different data centers has to be gathered on a central node before it is displayed and reported. But handling an important alarm is a race against time and has nothing to do with the interface. So in the design I separated display from alarming: alarms are sent from within the local data center, and display is centralized afterwards.

4. Distributed deployment to avoid a single point of failure: each data center has a sub-node, the alarm node mentioned above, and there is one central node. Alarms are first sent out from each data center and then aggregated and displayed at the center. If the central node goes down, DNS automatically switches to a sub-central node, which is promoted to central node.
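One common way to cut repeated alarms, as in item 2, is to suppress repeats of the same alarm within a quiet window. The sketch below assumes alarms are identified by a (host, check) pair and uses a 10-minute window; both choices, and the `AlarmDeduplicator` class itself, are assumptions for illustration rather than a description of the system in the talk.

```python
class AlarmDeduplicator:
    """Suppress repeats of the same alarm inside a quiet window."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        # (host, check) -> time the last notification went out
        self._last_sent = {}

    def should_notify(self, host, check, now):
        """Return True if this alarm should go out, False if it is a repeat."""
        key = (host, check)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False
        self._last_sent[key] = now
        return True

    def resolve(self, host, check):
        """Forget an alarm once it recovers so the next failure notifies at once."""
        self._last_sent.pop((host, check), None)
```

Passing `now` in explicitly, instead of reading the clock inside the class, keeps the logic easy to test and replay against historical alarm data.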

Schematic diagram of distributed node switchover
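The switchover logic can be sketched roughly as follows. The talk only says that DNS moves to a sub-central node when the central node is down; the promotion order, the health check, the node names in the test data, and the rule of not switching back automatically are all assumptions made for this example.

```python
def elect_central(nodes, is_alive):
    """Pick the node that should act as the central node.

    nodes is an ordered list: the configured central node first, then
    the other nodes in promotion order. is_alive(node) is a
    caller-supplied health check. Returns the first live node, or None
    if every node is down.
    """
    for node in nodes:
        if is_alive(node):
            return node
    return None


def dns_target(current, nodes, is_alive):
    """Return the node the monitoring domain name should point at.

    Keeps the current central node while it is healthy, so a recovered
    former central node does not cause an immediate switch back.
    """
    if current is not None and is_alive(current):
        return current
    return elect_central(nodes, is_alive)
```

Because each data center sends its own alarms, losing the central node in this design interrupts only the aggregated display until the DNS record is updated.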

Conclusion

The advantage of a self-built monitoring system is that you can make full use of the data: combine it, analyze it, and interpret it, turning obscure data into something ordinary people can understand, so that product staff, sales staff, and the boss all understand the current state of the business. Finally, here are two examples of analyzed data displayed in our self-built monitoring system:

This chart shows access to our Track system from each province of China: not only the access speed and which data center was reached, but also whether domain name hijacking occurred. Of course, our own monitoring nodes alone cannot collect data this extensive and complete. For that we rely on Cloudwise's Jiankongbao (Monitoring Treasure): more than 200 Jiankongbao nodes nationwide send probe data back through the API, and we then organize and analyze the data and present it on the chart.

We used to monitor switch traffic with Cacti, but once there are many switches, finding the traffic you are looking for becomes a huge task. To address this pain point, our monitoring system supports switch monitoring. Besides basic information such as CPU, we pay special attention to traffic.

The figure above shows the current rate between switches, where the traffic comes from, and how much there is.

This chart shows where traffic has reached the warning threshold and which switch has a problem, which makes it easy to locate and handle issues quickly.
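Traffic figures like these are typically derived from interface octet counters sampled at intervals. The sketch below assumes 64-bit SNMP counters such as `ifHCInOctets`, and a warning threshold of 80% of port capacity; the function names, the threshold, and the port names in the test data are illustrative assumptions, not details from the talk.

```python
COUNTER64_MAX = 2 ** 64  # size of a 64-bit SNMP counter such as ifHCInOctets


def rate_bps(prev_octets, curr_octets, interval_seconds,
             counter_max=COUNTER64_MAX):
    """Convert two octet-counter samples into bits per second.

    Handles a single counter wrap between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:  # the counter wrapped around
        delta += counter_max
    return delta * 8 / interval_seconds


def ports_over_threshold(rates_bps, capacity_bps, warn_ratio=0.8):
    """Return the ports whose rate is at or above warn_ratio of capacity."""
    return sorted(
        port for port, rate in rates_bps.items()
        if rate >= warn_ratio * capacity_bps[port]
    )
```

For 32-bit counters such as `ifInOctets`, pass `counter_max=2**32` and poll often enough that the counter cannot wrap more than once between samples.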

Finally, every company's needs are different, and so are the pain points of every operations team. But however much things change, the fundamentals stay the same: with all kinds of monitoring data collected from the machines, you can combine and analyze it to get the results you want. Thank you!

Q&A

Q: Is Nagios still there underneath?

A: No. It was written entirely from scratch. It borrows ideas from Nagios, but the collection and aggregation methods are different.

Q: Do you monitor databases? Or is there a dedicated DBA?

A: We do not monitor the database separately; we call monitoring scripts written by others and take the data from them.

Q: What do you do for business monitoring?

A: We do have some business monitoring. Let me send you a picture:

This is our business monitoring. All monitoring data is described in plain words, so that product staff, business colleagues, and the boss know what the current situation is.

Q: With so much data collected, is there any special optimization on the database side? Asynchronous processing?

A: It is asynchronous. The business system is shown on a large screen. When there is a problem, you can see directly where it is and who to ask about recovery.

Q: How resource-intensive is this monitoring?

A: It is acceptable. There have been some bottlenecks in centralized display and data processing, and we keep optimizing.

Q: Was the smart DNS system developed in-house?

A: We use a third party’s smart DNS, and we also have our own.

Q: Is your database a MySQL cluster?

A: MySQL master-slave. There is another reason for separating alarming from display, and that is performance. Display can lag by a few seconds or even a few minutes, but alarms cannot, so alarms are sent immediately, and we need not fear going blind if a monitoring machine dies. We currently have 6 nodes distributed across the country. The probability of all nodes failing is very small, and as long as one node is alive, we can still raise alarms.

Q: Is that accurate to the second?

A: Yes, second level. The slowest notification channel is the phone call, which takes more than ten seconds.

Q: Are you using only Jiankongbao now? Are you using Toushibao (Perspective Treasure)?

A: We are evaluating Toushibao.

Q: What metrics do you collect from switches?

A: CPU, memory, warning messages, traffic, ports.

Q: Is the performance of Alibaba Cloud servers worse than that of your own hosted servers?

A: The company currently runs self-built databases on Alibaba Cloud, and there are significant performance problems. I/O problems are widespread in cloud services, and in our experience Alibaba Cloud is the most serious.

Q: How is business monitoring done?

A: Business monitoring is similar to Toushibao, but not as fine-grained.

Q: Is it instrumented inside the program?

A: No, it is not instrumented in the code. It is built from monitoring data, so it works only at the symptom level, not at the code level.

Q: Do you monitor logs? Or CPU?

A: Not CPU. It is a combined judgment of whether the program is running normally. One business monitoring item may correspond to a dozen underlying monitoring items. It depends on the company's business: some APIs, some applications, different business lines, response speed, and so on.
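The idea of one business item standing on top of a dozen underlying checks, reported in plain words, can be sketched like this. The status codes, the notion of "critical items", the aggregation rule, and the check names in the test data are assumptions for illustration; the talk does not describe the actual rules.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def business_status(item_states, critical_items=()):
    """Fold many low-level check states into one business-level state.

    item_states maps a check name to OK, WARNING, or CRITICAL.
    A check listed in critical_items that is CRITICAL makes the whole
    business CRITICAL; any other non-OK check degrades it to WARNING.
    Returns (status, list of failing check names).
    """
    failing = [name for name, state in item_states.items() if state != OK]
    for name in failing:
        if name in critical_items and item_states[name] == CRITICAL:
            return CRITICAL, failing
    if failing:
        return WARNING, failing
    return OK, failing


def describe(business, status, failing):
    """Render the state as a plain sentence for non-technical readers."""
    if status == OK:
        return f"{business} is running normally."
    word = "is down" if status == CRITICAL else "is degraded"
    return f"{business} {word}; failing checks: {', '.join(sorted(failing))}."
```

Because the judgment is built only from monitoring data, it can say that a business function is unhealthy but not which line of code is responsible.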

Q: How many operations staff does the company have?

A: There are eight people including me, and this is a platform we developed ourselves.

Q: How is daily operations work divided? By product?

A: In the early days it was divided by product. Since the second stage of automation was completed, assignment is basically arbitrary and everything goes through the work order system. Routine work orders go live automatically once approved, with no operations involvement.

Q: Do you get a lot of requests for business statistics?

A: Yes. I allocate the requests. For statistics that are requested often, we build them into the system and let people pull the numbers themselves.

Q: What tools does your private cloud use?

A: It is developed in-house on top of KVM. Early on we used GopStack and OpenStack, but later found them too heavy. Put simply, our private cloud is KVM automation.

Q: What is the typical configuration of your physical machines?

A: At minimum, dual 6-core CPUs and 64 GB of memory.

Q: What does your visualization look like? Is it work orders?

A: One more reason for visualizing operations is that people do not understand operations or know what the operations team does; the work is often mistaken for installing systems and running scripts. Visualization shows people the things they care about most and uses operations data to educate them. The work order is the starting point of every operations task, and it is also a powerful tool for avoiding undeserved blame. The work order system is actually the one I put the most design effort into, including the work order process and especially the approval flow. Abuse of work orders can drive you mad.

Q: Have you ever encountered a situation where the servers are fine, the middleware and database are fine, and the online business suddenly fails?

A: You may need Toushibao.

Q: Can Toushibao monitor bandwidth congestion at the network egress?

A: Toushibao mainly does application performance monitoring. It works like a CT scanner for application systems: it collects performance data on the real user experience in mobile clients and browsers, on the application's runtime environment on the server, and on database access and application code execution, then uses big data techniques to rapidly diagnose and analyze the collected data. Monitoring of the network link is handled by Jiankongbao. Combining the two gives whole-link service monitoring and problem diagnosis from client to server.

Q: What does sudden failure mean? Did the front-end proxy report an error? Were requests dropped?

A: For example, a function works normally and then suddenly stops responding, while the code reports no error. After a while it is normal again; the logs show nothing abnormal and there was no warning sign, and I cannot find the cause. CPU and memory are normal, network traffic does not fluctuate, and the number of connections is normal.

Q: Have you ever encountered a service failure caused by an intranet problem?

A: Toushibao should be able to help; it works at a very fine level of detail. Toushibao handles internal problems and Jiankongbao handles external ones, so the two combined work well. You can also check the switches for network flapping caused by SFP modules; I have run into this.

Q: What is SFP network flapping? If the network has a problem, everything else is affected, right?

A: Network flapping means the switch keeps relearning MAC addresses, which causes a brief network outage.

Q: What causes network flapping?

A: The technical explanation is that changes in protocol packets or timer timeouts repeatedly trigger STP recalculation, so root bridge election, port role changes, and port state transitions keep occurring. Common causes are:

Link fault: link attributes of a port, such as port status, rate, and duplex mode, keep changing.

Node fault: a switch's CPU usage is so high that it cannot send or process STP packets on schedule.

Network fault: the network is congested, so STP packets on the root port are dropped during forwarding; L2PT transparently transmits STP packets from another network, which triggers STP convergence at the local end; or multicast suppression is misconfigured, so STP packets are occasionally dropped. Depending on the cause, you need to change the configuration or optimize the network design to resolve flapping.

Put simply, when a module or a network cable is faulty and the link goes up and down several times in quick succession, network flapping occurs.
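The "up and down several times in quick succession" description suggests a simple detector that counts link state changes inside a sliding time window. The sketch below is one possible approach, not the dedicated monitoring mentioned in the talk; the `FlapDetector` class, the threshold of 4 changes, and the 60-second window are assumptions.

```python
from collections import deque


class FlapDetector:
    """Flag a port that changes link state too often within a time window."""

    def __init__(self, max_changes=4, window_seconds=60):
        self.max_changes = max_changes
        self.window = window_seconds
        self._changes = {}     # port -> deque of state-change timestamps
        self._last_state = {}  # port -> last observed state ("up" or "down")

    def observe(self, port, state, now):
        """Record a link state sample; return True if the port is flapping."""
        changes = self._changes.setdefault(port, deque())
        previous = self._last_state.get(port)
        if previous is not None and previous != state:
            changes.append(now)
        self._last_state[port] = state
        # Drop changes that have fallen out of the window.
        while changes and now - changes[0] > self.window:
            changes.popleft()
        return len(changes) >= self.max_changes
```

Polling can miss flaps shorter than the polling interval, so feeding the detector from link up/down events in switch logs or traps is likely to be more reliable than periodic sampling.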

Q: Doesn't this kind of problem trigger an alarm? Is the symptom that the network drops for a short time? For roughly how long? How did Mr. Gu find it?

A: If you look only at the switch, the alarm looks like a false positive, but combined with business data it is not. It depends on how you set the threshold. I set up dedicated monitoring for this, but I could not identify the port, and nothing showed up in the switch's regular log. It is recorded in a special log, but I cannot remember which one (can anyone add to this?).

Q: What about port duplex and rate changes? Haven't you collected the switch logs?

A: They did not change. We collect switch logs with ELK.
