The six brothers here refer only to ELK (Elasticsearch + Logstash + Kibana) and TIG (Telegraf + InfluxDB + Grafana).



The image above shows two separate stacks, ELK and TIG (TIG is a name I coined myself; unlike ELK, it is not an established term you will find online):

Both stacks follow the same pattern of collector + storage + display layer, color-coded in the diagram (with the display layer in red).

All the components in both stacks are free to use and relatively easy to install and configure. (Of course, the companies behind them need to make money, so they push their hosted cloud versions; in practice we generally skip the cloud offering and deploy locally.)

The ELK stack is used mainly for collecting, storing, searching, viewing, and alerting on log data.

The TIG stack is used mainly for collecting, storing, viewing, and alerting on all kinds of metrics data.

For ELK, because log volume tends to be large and sudden surges are common, Elasticsearch cannot always index as fast as logs arrive, so a message queue such as Kafka is usually placed in front as a buffer.

Also for ELK, logs often need some filtering, parsing, and additional alerting before they enter ES, so Logstash can serve as an aggregation and processing layer, with its rich plug-ins handling the various transformations. Logstash's performance is not great, however, and it consumes a lot of resources, so keep an eye on it in use.
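As a rough illustration of this data path (queue as buffer, then bulk writes into ES), here is a minimal Python sketch; in practice Logstash or another shipper fills this role, and the topic name, index name, and addresses below are assumptions.

```python
# Minimal sketch of the Kafka-buffered log path; topic, index, and addresses are assumptions.
import json

from kafka import KafkaConsumer                    # pip install kafka-python
from elasticsearch import Elasticsearch, helpers   # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "app-logs",                                    # hypothetical log topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw),
)

batch = []
for message in consumer:
    batch.append({"_index": "app-logs", "_source": message.value})
    if len(batch) >= 500:                          # flush in bulk so ES sees steady write pressure
        helpers.bulk(es, batch)
        batch.clear()
```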

About ELK



The picture above shows the Kibana interface. You can see that we have collected the logs of the various microservice components into ES. On Kibana you can search all kinds of logs with query expressions; the most common use is searching the full-chain logs of a request by its RequestID, or a user's logs by UserID. Many companies are still used to logging on to servers one by one to grep logs (at best batching the search with Ansible), which is actually very inconvenient (a search sketch follows this list):

· Plain-text searches are far slower than queries against the ES index.

· Searching large log files consumes considerable memory and CPU on the server and can affect the business running there.

· File logs are generally archived and compressed, so searching older logs is inconvenient.

· Permissions are hard to control; opening up raw log files for ad-hoc queries carries security and information-disclosure risks.

· While funnelling the data into ES we can do a lot of extra work along the way, such as data masking, writing to other data stores, and sending email or IM notifications (integrating with Slack or DingTalk bots, for example).
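As a small illustration of searching the full-chain logs by RequestID outside of Kibana, here is a sketch using the Elasticsearch Python client (8.x style); the index pattern and field names are assumptions about how the logs are indexed.

```python
# Minimal sketch: pull all logs for one request, assuming a `requestId` field and
# an `app-logs-*` index pattern (both hypothetical).
from elasticsearch import Elasticsearch   # pip install "elasticsearch>=8"

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="app-logs-*",
    query={"term": {"requestId": "6a1b2c3d"}},        # the full-chain request ID
    sort=[{"@timestamp": {"order": "asc"}}],
    size=200,
)
for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("service"), src.get("message"))
```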

Exceptions

I have always held the view that exceptions cannot be emphasized enough, especially unhandled exceptions that bubble up to the user-facing layer and system exceptions inside services. Exceptions can be divided into business exceptions, which are raised deliberately by business logic and are known in advance, and system exceptions, which cannot be anticipated. A system exception usually means the underlying infrastructure (network, database, middleware) is jittering or has failed, or that there is a bug in the code (if not a bug, then incomplete logic). Every exception needs to be investigated one by one down to its root cause; if there is no time to investigate right away, it should be recorded and revisited later. For systems with very large business volume there may be hundreds of thousands of exception occurrences a day, boiling down to perhaps 100+ distinct cases. At the very least, do the following:

· Comb through the code thoroughly and never swallow exceptions. Often the reason a bug cannot be found is that you do not know which exception was silently swallowed where. With ELK we can easily search and filter logs, so logging more detail about exceptions and abnormal flows is very helpful for fixing bugs.

· Monitor and alert on the frequency of exceptions, for example "XXException occurred 200 times in the last minute". Over time you develop a feel for these exceptions: at one quantity you know it is just jitter, but at another you know this is no network jitter, this is the rhythm of a dependent service going down, and you need to kick off the emergency response process immediately (a counting-query sketch follows this list).

· Pay 100% attention to, and deal with, exceptions such as null pointers, array out-of-bounds, and concurrency errors. Each of these is essentially a bug and will cause some business flow to fail. Because their absolute numbers are small they sometimes get buried among the other exceptions, so you need to go through the exceptions every day and resolve them one by one. If such an exception breaks a normal user flow, that user may be lost; they may be only one of tens of millions of users, but the experience for them is terrible. I have always believed we should find and fix problems before users notice them; at best, by the time customer service relays the feedback (most Internet users, paying or not, will not call customer service over a broken flow; they will simply give up on the product), it should already be a known issue with a scheduled fix.
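A sketch of the kind of counting query that could sit behind such a frequency alert, again assuming the 8.x Python client; the `exception_class` field, index pattern, and threshold are hypothetical, and a real alert would run this on a schedule.

```python
# Minimal sketch: how many XXException occurrences in the last minute?
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.count(
    index="app-logs-*",
    query={
        "bool": {
            "filter": [
                {"term": {"exception_class": "XXException"}},
                {"range": {"@timestamp": {"gte": "now-1m"}}},
            ]
        }
    },
)
if resp["count"] > 100:                      # threshold is illustrative
    print(f"ALERT: XXException occurred {resp['count']} times in the last minute")
```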

Even better, we can assign an ID to every error. If the error does make it through to the user side, display this ID in an inconspicuous spot on the 500 page. Then if a user reports the problem with a screenshot, we can search the error ID in ELK, find the corresponding error, and locate the problem in one step.
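A minimal sketch of this idea, assuming a Flask application; the ID format and response shape are illustrative.

```python
# Minimal sketch: tag every unhandled error with an ID that appears both in the
# log line (searchable in Kibana) and in the 500 response shown to the user.
import uuid

from flask import Flask, jsonify

app = Flask(__name__)

@app.errorhandler(Exception)
def handle_unexpected_error(exc):
    error_id = uuid.uuid4().hex[:8]                   # short enough to read off a screenshot
    app.logger.exception("unhandled exception, error_id=%s", error_id)
    return jsonify({"message": "Something went wrong", "error_id": error_id}), 500
```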

About TIG



Grafana supports many data sources, including InfluxDB; Graphite is another good choice. Telegraf is InfluxDB's agent suite for collecting data, and it comes with quite a number of plug-ins. These plug-ins are not complicated; you could write them yourself in Python, it just takes a little time. In essence, they collect formatted data from the stats interfaces exposed by the various middleware and write it into InfluxDB. Let's look at the plug-ins Telegraf supports (image captured from https://github.com/influxdata/telegraf):



With these plug-ins, little maintenance or development effort is needed to monitor all of our base components. A hand-rolled example of the same idea is sketched below.
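Here is a hand-rolled Python sketch of what a Telegraf input plug-in does: poll a middleware stats endpoint and write the numbers into InfluxDB. It assumes InfluxDB 1.x with the `influxdb` Python package; the stats URL, database, and field names are made up.

```python
# Minimal sketch of a Telegraf-style collector; stats URL, database, and field
# names are hypothetical.
import time

import requests                              # pip install requests
from influxdb import InfluxDBClient          # pip install influxdb (InfluxDB 1.x client)

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

while True:
    stats = requests.get("http://localhost:8080/stats", timeout=5).json()
    client.write_points([{
        "measurement": "middleware_stats",
        "tags": {"host": "app-01", "component": "my-middleware"},
        "fields": {"qps": float(stats["qps"]), "errors": int(stats["errors"])},
    }])
    time.sleep(10)                           # poll on an interval, like a Telegraf agent
```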

Instrumentation

As shown in the architecture diagram at the beginning of this article, besides using Telegraf's various plug-ins to collect storage-, middleware-, and system-level metrics, we also built a MetricsClient library that lets applications write instrumentation data from anywhere in the code to InfluxDB. Each Measurement record written to InfluxDB is simply an event that carries the following information (a sketch of such a point follows the list):

· Timestamp

· Various tags for searching

· Values (time taken, execution count)
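A sketch of what one such Measurement point might look like when written through the same InfluxDB 1.x client as above; the measurement, tag, and field names are illustrative.

```python
# Minimal sketch of a single event point: timestamp + tags for searching + values.
from datetime import datetime, timezone

point = {
    "measurement": "bank_service_events",                       # hypothetical measurement name
    "time": datetime.now(timezone.utc).isoformat(),             # the timestamp
    "tags": {"operation": "transfer", "result": "system_exception"},  # tags used for filtering
    "fields": {"duration_ms": 42.0, "count": 1},                # values: time taken, count
}
# client.write_points([point])   # same InfluxDBClient as in the collector sketch above
```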

As the following figure shows, in the BankService we record events for the success of various synchronous and asynchronous operations, business exceptions, and system exceptions, and then a simple configuration in Grafana renders the graphs we want.



MetricsClient can be called manually in code or woven in through AOP; we can even attach this aspect to all methods and automatically collect every method's execution count, elapsed time, and result (normal, business exception, system exception), recording them in InfluxDB. You then configure your own dashboards in Grafana for monitoring. A decorator-based sketch of this follows.
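A Python sketch of that AOP idea using a decorator; MetricsClient and BusinessException here are hypothetical stand-ins for the in-house library and exception type described in the text.

```python
# Minimal sketch of AOP-style instrumentation: every call records its count,
# elapsed time, and result type.
import functools
import time

class BusinessException(Exception):
    """Exceptions raised deliberately by business logic."""

class MetricsClient:
    def write(self, measurement, tags, fields):
        # A real version would write the point to InfluxDB (see earlier sketches).
        print(measurement, tags, fields)

metrics_client = MetricsClient()

def instrumented(operation):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = "ok"
            try:
                return func(*args, **kwargs)
            except BusinessException:
                result = "business_exception"
                raise
            except Exception:
                result = "system_exception"
                raise
            finally:
                metrics_client.write(
                    measurement="method_calls",
                    tags={"operation": operation, "result": result},
                    fields={"duration_ms": (time.monotonic() - start) * 1000, "count": 1},
                )
        return wrapper
    return decorator

@instrumented("transfer")
def transfer(from_account, to_account, amount):
    ...   # business logic goes here
```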

For an RPC framework, it is also recommended to build this instrumentation into the framework itself, recording every execution of an RPC method, so charts can be configured down to method granularity and a suspect method can be located with one click when an incident occurs. Automatic instrumentation through AOP plus the RPC framework already covers most needs, and adding some business-level instrumentation in the code makes it even better.

If we configure two graphs for each business activity, one for call volume and one for call latency, as shown below:



So:

· When something goes wrong, we can tell which piece is wrong in a very short time.

· We can also make a preliminary judgement as to whether the problem was caused by errors or by a sudden surge in load.

The recommended configuration is to follow the data flow from front to back and chart the volume and latency of each processing stage:

· Incoming data

· Data sent to MQ

· Data received from MQ

· Data whose MQ processing has completed

· Requests sent to external systems

· Requests that received an external response

· Requests written to storage

· Cache-lookup requests

Being able to locate the faulty module, or at least the faulty line of business, is far better than flying blind (of course, none of this helps if we have not configured the dashboards we need in advance). Dashboards must be maintained continuously as the business iterates; do not let them fall behind after a few iterations, or when something does go wrong the dashboard you pull up will be of no use.

Other



Grafana works well with an InfluxDB data source, but it is not very convenient for connecting to MySQL and running ad-hoc queries. Here I recommend the open-source system Metabase, which makes it easy to save SQL queries for business or monitoring statistics. You might say these statistics are the concern of operations or the BI team, so why build these charts ourselves? My point is that even as engineers we are better off having a small business panel of our own; not because we must focus on the business, but so that we have a place to see at a glance how the business is running and judge the key indicators.

OK, having walked through the six brothers, you can see that what we have really built is a layered, multi-dimensional monitoring system. Let me share a troubleshooting sequence; after all, when a big problem hits, we often have only a few minutes:

· Watch the system-level alerts for exceptions and load, and watch the alert for business volume dropping to zero (here "dropping to zero" means a sudden fall of more than 30%).

· Use the business dashboards configured in Grafana to determine which module of the system has load or performance problems.

· Use the service-volume and business-volume panels configured in Grafana to check upstream and downstream and pinpoint the problematic module.

· Check whether the corresponding module has errors or exceptions through Kibana.

· Take the error ID from the customer's error screenshot and search the full-chain logs in Kibana to locate the problem.

· One more tip for the finer details: request logs. We can build a switch into the web side of the system that, under certain conditions, turns on detailed Request/Response HTTP logging. With the detailed data of every request we can "see" the whole process of a user's visit to the site based on their user information, which is very helpful for troubleshooting. Of course, the data volume can be huge, so this heavyweight trace feature must be enabled with care (a middleware sketch follows).
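A minimal sketch of such a switch, assuming a Flask web layer; the flag and what triggers it are illustrative (in a real system it might be toggled per user or per condition from an admin console).

```python
# Minimal sketch: a toggle that, when on, logs the full request and response.
import logging

from flask import Flask, g, request

app = Flask(__name__)
log = logging.getLogger("http_trace")

TRACE_ENABLED = False   # flip on only while troubleshooting; could be per-user in practice

@app.before_request
def log_request():
    if TRACE_ENABLED:
        g.trace = True
        log.info("REQ %s %s headers=%s body=%s",
                 request.method, request.path,
                 dict(request.headers), request.get_data(as_text=True))

@app.after_request
def log_response(response):
    if getattr(g, "trace", False):
        log.info("RESP %s body=%s", response.status, response.get_data(as_text=True))
    return response
```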

With instrumentation points, error logs, and detailed request logs, are we still afraid we cannot locate the problem?