1 The environment

1.1 Dev

Feel free to use any familiar tools. As long as the problem can be reproduced, the investigation is usually not too difficult; at worst it means debugging into the source code of the various frameworks. This is also why interviews ask about framework source code: you do not need to have read all of it, but you should know how to navigate it when solving a problem.

1.2 Test

Debugging is less convenient than in the development environment, but you can still attach to remote JVM processes with JVisualVM or Arthas.

In addition, the test environment allows you to create data to simulate the scenarios you need, so when a problem occurs, remember to work closely with the testers to construct data that makes the bug easier to reproduce.

1.3 Prd

Developers have the fewest permissions in this environment, which makes troubleshooting a much bigger hurdle:

  • You cannot attach to the process remotely with debugging tools
  • Quick recovery is the priority: production problems must be fixed right away, no matter what else you are doing. In addition, the production environment is prone to problems because of heavy traffic, strict network permissions, and complex call chains.

2 Monitoring

When problems occur in production, we cannot keep the full live site intact for troubleshooting and testing, because the application has to be restored as soon as possible. The key question is therefore whether there is enough information (logs, monitoring, and snapshots) to understand the history and reconstruct the bug scenario. The most commonly used source is the ELK log stack. Note:

  • Ensure that errors and exceptions are fully recorded in the file log
  • Ensure that the logging level of the production application is INFO or higher

Use proper log levels: DEBUG for development-time debugging information, INFO for important process information, WARN for problems that require attention, and ERROR for errors that block the process
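
As a minimal sketch of these conventions, assuming SLF4J as the logging facade (the class name and messages are purely illustrative):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class OrderService { // hypothetical class, for illustration only

        private static final Logger log = LoggerFactory.getLogger(OrderService.class);

        public void placeOrder(String orderId) {
            log.debug("Raw order payload: {}", orderId);                 // DEBUG: development-time detail
            log.info("Order {} accepted", orderId);                      // INFO: important process information
            if (orderId == null || orderId.isEmpty()) {
                log.warn("Empty order id, falling back to default");     // WARN: needs attention, flow continues
            }
            try {
                // ... business logic ...
            } catch (RuntimeException e) {
                log.error("Order {} failed and was aborted", orderId, e); // ERROR: the flow is blocked
            }
        }
    }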

In production, development and operations need to work together to build complete monitoring:

The host dimension

Monitor resources such as CPU, memory, disk, and network. If the application is deployed on VMs or a Kubernetes cluster, the VMs or Pods must be monitored in addition to the basic resource monitoring on the physical machines. How many layers need to be monitored depends on the application's deployment scheme.

The network dimension

Monitor dedicated line bandwidth, basic switch information, and network latency

All middleware and storage should be monitored

Monitor not only the basic indicators of CPU, memory, disk I/O, and network usage for the component processes, but also the important indicators inside each component. Prometheus, the most commonly used option, provides exporters for a large number of middleware and storage systems

Application level

You need to monitor common JVM process metrics such as class loading, memory, GC, and threads (for example, using Micrometer for application monitoring), and also ensure that application logs and GC logs are collected and saved
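
As a rough sketch of what this can look like with Micrometer's built-in JVM binders (a SimpleMeterRegistry is used only for illustration; in production you would typically bind a Prometheus or other backend registry):

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.binder.jvm.ClassLoaderMetrics;
    import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
    import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
    import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

    public class JvmMonitoring {
        public static MeterRegistry setup() {
            MeterRegistry registry = new SimpleMeterRegistry();
            new ClassLoaderMetrics().bindTo(registry); // class loading/unloading counts
            new JvmMemoryMetrics().bindTo(registry);   // heap and non-heap memory pools
            new JvmGcMetrics().bindTo(registry);       // GC pause times and allocation/promotion rates
            new JvmThreadMetrics().bindTo(registry);   // live, daemon, and peak thread counts and states
            return registry;
        }
    }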

Now let's look at snapshots. A snapshot is a capture of the application process at a certain point in time. Normally, for production Java applications we set the two JVM parameters -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=... so that a heap snapshot is preserved when an OOM occurs. We will also use the MAT tool several times in this course to analyze heap snapshots.
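
For reference, a minimal sketch of how these flags might appear on the startup command line (the dump directory /data/heapdump and the app.jar name are hypothetical examples):

    java -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/data/heapdump \
         -jar app.jar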

Best practices for analyzing and locating problems

The first step in locating a problem is to determine at what level it lies: in the Java application itself, or in external factors.

  • You can first check whether the program has thrown any exceptions. Exception information is usually specific enough to point you immediately in the general direction of the problem
  • Some resource-consuming problems produce no exceptions at all; these can be located through metric monitoring combined with the visible symptoms

The causes of common problems can be classified as follows:

Bug after program release

Roll back, and then slowly analyze the root cause through version differences.

External factors

For example, problems with the host, middleware, or database. These are further divided into host-level problems and problems with middleware or storage (collectively referred to as components):

The host layer

Use tools for troubleshooting:

CPU related

Use top, vmstat, pidstat, and ps

Memory related

Use free, top, ps, vmstat, cachestat, and sar

IO related

Use lsof, iostat, pidstat, sar, iotop, df, and du

Network related

Use ifconfig, ip, nslookup, dig, ping, tcpdump, and iptables

The component layer

Check from the following aspects:

  • Check whether the host where the component resides is faulty
  • Basic information about component processes and various monitoring indicators
  • Component log output, especially error logs
  • Go to the component console and use some commands to see how it works.

The system appears to hang because system resources are insufficient

Restarting and scaling out are usually used to restore service first and analyze afterwards; if at all possible, take a snapshot before restarting.

Insufficient system resources

High CPU usage

If the live site is still available, the specific analysis process is as follows:

  • Run top -Hp <pid> on the server to see which thread in the process uses the most CPU

  • Press capital P to sort the threads by CPU usage, then convert the IDs of the threads that obviously hog the CPU to hexadecimal (see the sketch after this list)
  • Search for those thread IDs in the thread stack output of the jstack command to locate the current call stacks of the offending threads
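
The thread IDs shown by top are decimal, while jstack labels each thread with a hexadecimal nid value; a tiny sketch of the conversion (the thread ID 12345 is a made-up example):

    public class TidToNid {
        public static void main(String[] args) {
            long tid = 12345L; // hypothetical thread ID copied from the output of top -Hp
            // jstack marks each thread with an "nid=0x..." field; search for this value there
            System.out.println("nid=0x" + Long.toHexString(tid)); // prints nid=0x3039
        }
    }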

If you cannot run top directly on the server, you can locate the problem by sampling: run jstack at a fixed interval, and after several samples compare them to find the threads that are always running; those are the problem threads.

If the live site is gone, you can analyze by elimination. High CPU usage is usually caused by:

  • A sudden burst of load

This can be confirmed from the traffic on the load balancer in front of the application, or from the log volume. Reverse proxies such as Nginx record the URL, so you can drill down further using the proxy's access log, or by monitoring the number of JVM threads. If the high CPU usage is caused by load and the application's resource usage shows no other obvious abnormality, you can go on to locate the hot methods with a load test plus a profiler (JVisualVM has this feature). If resource usage is abnormal, for example thousands of threads, then the application itself needs to be examined.

  • GC

GC metrics and GC logs can be obtained from JVM monitoring. If GC pressure is confirmed, memory usage is also likely to be abnormal and should be analyzed according to the memory problem analysis process.
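
As a point of reference, a sketch of the GC-logging flags this depends on (the log path is a hypothetical example, and the exact flags vary by JDK version):

    JDK 8:            -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/data/logs/gc.log
    JDK 9 and later:  -Xlog:gc*:file=/data/logs/gc.log:time,uptime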

  • An infinite loop or abnormal processing flow

This can be analyzed together with the application logs. An application generally produces some logs while it runs, so pay attention to abnormal log volume.

Memory leak or OOM

The simplest approach is a heap dump followed by MAT analysis. A heap dump contains a picture of the heap and the thread stack information at that moment; by looking at the dominator tree or the histogram you can usually see immediately which objects occupy a large amount of memory, and quickly locate the memory problem. Note that the memory used by a Java process is not just the heap area: it also includes the memory used by threads (number of threads × thread stack size) and metadata. For example, a 4 GB heap plus 500 threads with 1 MB stacks and 256 MB of metaspace legitimately adds up to well over 4 GB. Also note that badly chosen JVM parameters may limit the resources available to the JVM.

IO problem

Unless it’s a code problem that causes a resource not to be released, it’s usually not caused by something inside the Java process.

Network

Network problems are usually caused by external factors. Connectivity problems are easy to locate from the exception information. For performance issues or transient faults, start with a tool such as ping; if the problem still cannot be pinned down, use tcpdump or Wireshark.

Best practices when confused

Occasionally you may get lost while analyzing and locating a problem. If so, here are some lessons that can help

Cause or result?

For example, you observe that the service is executing slowly and the number of threads is increasing. Which is the cause and which is the result?

  • Bad code logic, or a slow dependency on an external service

This makes the business logic execute slowly, and with constant traffic more threads are needed to handle it. For example, at a concurrency of 10 TPS, requests that complete in 1 second can be handled by 10 threads; if each request now takes 10 seconds, 100 threads are needed

  • Request volume increases

The number of threads increases, the application itself runs short of CPU, and processing slows down further because of context switching

In this case, you need to combine the monitoring indicators with the inbound traffic of each service to analyze whether the slowness is the cause or the result.

Look for patterns

If you have no clue at all, try to summarize the pattern. For example:

  • With a group of servers behind load balancing, when problems occur you can analyze monitoring and logs to see whether requests are evenly distributed, or whether the problem is concentrated on a single node
  • Application logs generally record the thread name. When problems occur, you can check whether the problematic logs are concentrated in certain threads
  • If an application opens a large number of TCP connections, you can use netstat to find out which service most of them connect to

Once a pattern is found, it is much easier to make a breakthrough.

The call topology

For example, when Nginx returns 502, it is generally assumed that a downstream service has a problem and the gateway cannot forward the request. But we cannot take it for granted that the downstream service is our Java program. On the topology, Nginx may actually be proxying a Kubernetes Traefik Ingress, so the link is Nginx -> Traefik -> application; if we only check the health of the Java program, we will never find the root cause.

Similarly, when Feign is used for service invocation, a connection timeout is not necessarily a server-side problem. The client may be calling the server through a URL rather than through the client-side load balancing provided by Eureka service discovery; that is, the client connects to an Nginx proxy instead of directly to the application, so a connection timeout is actually caused by a failure of the Nginx proxy.

Resource constraints

Observe the various monitoring indicators. If you find that a curve rises slowly and then flattens into a horizontal line, it is generally a resource hitting a bottleneck.

When you observe the network bandwidth curve, if the bandwidth stops at around 120 MB/s and will not move, the gigabit network adapter or the transmission bandwidth has been used up (1 Gbps is roughly 125 MB/s). If the number of active database connections rises to 10 and stays there, the connection pool has been exhausted

When watching the monitoring, pay close attention as soon as you see such a curve.

A chain reaction

CPU, memory, I/O, and network support each other. A bottleneck in one resource may cause a chain reaction in other resources.

After a memory leak, objects that cannot be collected cause frequent Full GC; the CPU then spends a large share of its time on GC, so CPU usage rises

For asynchronous I/O, data is often cached in in-memory queues, so when network or disk problems occur, memory usage is likely to balloon.

So when something goes wrong, you have to take these chain reactions into consideration to avoid misjudging the root cause

Client, server, or transmission problem?

For example, MySQL access is slow.

  • A client-side problem: an undersized connection pool making it slow to obtain a connection, GC pauses, or high CPU usage
  • A problem during transmission

For example, a cut optical fiber, or misconfigured firewall or routing table settings

  • A genuine server-side problem

All of these need to be checked one by one.

Server-side slowness can be detected from MySQL's slow query logs, and slow transmission can be detected with ping. Once these two possibilities are excluded, if only some clients experience slow access, you should suspect a fault in those clients themselves. When access to a third-party system, service, or storage is slow, do not simply assume the problem is on the server side.

Seventh, snapshot tools and trend tools need to be used together. For example, jstat, top, and the various monitoring curves are trend tools; they let us observe how indicators change over time and locate the general area of the problem. jstack and MAT (for analyzing heap snapshots) are snapshot tools; they analyze the details of an application at a single point in time.

Typically, we use trend tools first to find the pattern, and then use snapshot tools to analyze the problem. Doing it the other way round may lead to misjudgment, because a snapshot tool reflects only a split second of the program; you cannot draw conclusions from a single snapshot. If you lack the help of trend tools, at least take multiple snapshots and compare them.

Eighth, do not be quick to suspect the monitoring. I once read an analysis of an air accident: the pilot found in flight that the instruments showed all of the plane's fuel tanks were nearly empty. His first reaction was to suspect that the fuel gauges were faulty; he was unwilling to believe the plane was really out of fuel. Similarly, when something goes wrong with an application we look at the various monitoring systems, but sometimes we trust our own experience rather than what the monitoring charts show. This can send us looking in entirely the wrong direction.

If you really suspect that the monitoring system is wrong, check whether it behaves correctly for an application that has no problems; if it does, trust the monitoring rather than your experience.

Ninth, if the root cause cannot be located because monitoring is missing or for other reasons, the same problem may well occur again, and three things need to be done:

  • Fill the gaps in logging, monitoring, and snapshot collection, so that the root cause can be located the next time the problem occurs
  • Set up real-time alerts based on the symptoms of the problem, so that it is detected the moment it happens
  • Consider a hot-standby solution: if the problem occurs again, switch to the standby system immediately to restore service quickly, while preserving the state of the old system for analysis

Conclusion

Problem analysis must be grounded in evidence

You cannot rely on guesswork; basic monitoring must be built in advance. Monitoring covers the infrastructure and operations layer, the application layer, and the service layer. When locating a problem, you also need to refer to the indicators from multiple monitoring layers for a comprehensive analysis.

Locating the problem starts with a rough classification of the cause

For example: is it an internal or an external problem, a CPU-related or a memory-related problem, a problem with just interface A or with the whole application? Then refine the investigation step by step, always thinking about the problem from the large scale down to the small. When the investigation hits a dead end, step back from the details first, look again at the bigger picture of the points involved, and then re-examine the problem.

Experience counts

When a major problem occurs, it is often intuition that points to the most likely spot first, and there is even an element of luck involved. I have shared my nine lessons with you; I also suggest that you think and summarize more when solving everyday problems, and distill your own routines and tools for analyzing them.

After locating the cause of a problem, do a good postmortem. The resolution of every fault is valuable experience; the review is not only about recording the problem but also about optimizing the architecture. A postmortem can focus on the following:

  • Record complete information such as the timeline, the handling measures, and the reporting process
  • Analyze the root cause of the problem
  • Provide short-, medium-, and long-term improvements, including but not limited to code changes, SOPs, and processes, and record and track each item until it is closed out
  • Organize a team to review past failures regularly