This article has been sorted out and sent to my Github address, welcome everyone to star support,

Hello, I’m Kun Ge

Today, I will share with you a successful case of investigation in these two days, which I believe will be of some help to you.

Most of us have heard the classic interview question: Please speak in detail from the browser address bar enter the url to the final presents the results of process, the more detailed, the better, why is the interviewer so like to ask this question, because this problem involves very wide range, knowledge is very much, if you can fully understand, very help to troubleshoot some incurable diseases, today I want to share this case is a typical, Without further ado, let’s get to the point (welcome to pay attention to the public number “code sea”, common progress, ^_^)

Problem description

The front-end students found that the RTT (request round trip time) of the newly developed project interface had a 1/3 probability of greater than 3 s. Taking the login interface as an example, the Chrome request time is as follows

The normal RTT is about tens of ms (the login interface is about 100 ms), so the delay of 3S must be abnormal, so we started to investigate. As each interface may exceed 3s, the logon interface analysis is taken as an example in the following, because the logon interface logic is relatively simple.

Thinking of the February

1. Troubleshoot the browser itself

The browser itself may cause slow requests. The browser itself may be due to the “maximum number of concurrent requests”, “higher priority requests jump the queue.” Low priority tasks are delayed “and so on, resulting in slow requests

So, in order to eliminate the slow network request caused by the browser itself, it is best to find a different browser such as Safari or terminal curl request and retry the request to see if the request is still slow. Here I tried both methods. Safari also reproduces the RTT greater than 3s. Curl curl curl curl curl curl curl curl curl curl curl curl Click Network and select Copy as cURL to obtain the cURL representation of the request

With curl, add -o,-s, -w to the request

curl -o /dev/null -s -w "time_connect: %{time_connect}\ntime_starttransfer:
 %{time_starttransfer}\ntime_nslookup:%{time_namelookup}\ntime_total: %{time_total}\n"A request from a browser copyCopy the code

-o /dev/null blocks the original output, -s silent mode does not output anything, and -w %{http_code} controls additional output

A result similar to the following appears after the request

Time_connect: 0.045897 time_startTransfer: 3.064762 time_nsLOOKUP :0.004328 time_total: 3.064777Copy the code

The last parameter, time_total, is the total time the request took.

In summary, after safari and terminal test, it is found that the login interface has more than 3s phenomenon, which can rule out the problem of the client side. Next is our highlight: server side investigation!

2. Check the server

Our server side is a Spring MVC service, and its logic for the login interface is as follows

@RestController
@RequestMapping(path = "/api/auth", method = RequestMethod.POST)
@Slf4j
public class AuthController {
    @RequestMapping(params = "member.login")
    public ResultTO<TokenDTO> login(@RequestBody UserLoginDTO userLogin) {
        /** write login logic */ 
        returnResultTO.responseSuccess(tokenDTO); }}Copy the code

Logic is actually very simple, it is according to input the user name password to the db request check user information is correct, the correct is generated JWT token is returned to the front end, seems to do not have what problem, of course, in order to ensure that this logic is no problem, we can use some tools to help us verify, real-time Arthas is an open source Java diagnostic tool from Alibaba. Arthas is a command line interactive tool that provides a wide range of features for troubleshooting JVMS.

  • Provides performance kanban, including thread, CPU, memory, and other information, and is periodically refreshed.
  • View thread snapshots based on various criteria. For example, find the n threads with the highest CPU usage
  • Outputs various JVM information, such as GC algorithm, JDK version, ClassPath, and so on
  • View/set sysProp and Sysenv
  • To view the static properties of a class, you can also execute statements using ogNL syntax
  • View details about the loaded class, such as which JAR package the class was loaded from. You can also view information about a class’s methods
  • Dump the bytecode of a class to the specified directory
  • Decompile the specified class directly
  • View some information about the class loader
  • You can make the JVM reload a class
  • Monitor method execution and get input and output arguments executed and exceptions thrown
  • Trace the call stack that the method executes, and the call time of each method

The last thing we need to do here is to see when each method is called in real time. The steps to using Arthas are as follows

1. First we need to download arthas, as shown below

curl -O https://arthas.aliyun.com/arthas-boot.jar
Copy the code

Arthas starts up and displays a list of Java processes. Select the Spring MVC process arthas will be debugging, click 2 below, and then enter arthas’s interactive interface. Filtered a lot of descriptive information and kept only the core information useful to everyone.)

[root@d-sts-sh-1-spring-mvc-service-0 buser]# /opt/java8/bin/java -jar arthas-boot.jar [INFO] arthas-boot version: 3.5.1 [INFO] Found Existing Java process, please choose one and input the serial number of the process, eg: 1. Then hit ENTER. * [1]: 99 org.apache.flume.node.Application [2]: 2698 /opt/apps/business/spring-mvc-service.jar 2[arthas@2698]$
Copy the code

We can then trace method calls using arthas’s trace command. The trace command is described as follows:

The trace command can actively search the method invocation path corresponding to class-pattern/method-pattern, render and count all the performance costs on the entire invocation link, and trace the invocation link.

Now that we want to track the real-time elapsed time of login, we can specify “trace + fully qualified name (package name + class name) + method name” as follows

[arthas@2698]$ trace com.xxx.AuthController login
Copy the code

Then log in to the browser. When the login time of the browser is 3s+, the tracing result is as follows

It can be seen that login only took 76 ms, far less than 3 s, so can we conclude that the service of this machine is ok? PS: The SpringMVC service on our line deployed two machines and arthas debugging was turned on on the other machine, also around 76 ms

The answer is no. Let’s take a look at the Spring MVC request flow diagram

The yellow part of the processor in the figure above is the Controller that processes the business logic. It can be seen that the request to the Controller goes through filter chain, DispatcherService and other classes. In the filter chain at the beginning, The purple JwtAuthorizationFilter is our self-defined filter. Is it possible that the filter is processed too slowly? So it is best to execute arthas trace at the start of the request in the doFilter method of OncePerRequestFilter (there is actually quite a bit of flow prior to this filter, but until then it is normal for the framework to process it and can be ignored).

When the browser login request is 3s+, observe the trace result again:

It’s only 81 ms! The results show that the execution time of both machines is about 81 ms when the request at the front end is 3s. ** The SpringMVC service is fine for both machines on the line! ** Since there is no problem with the online machine service, we can only start with the flow path of the traffic. What process does the client need to go through to reach the SpringMVC service when making a request?

As you can see, requests need to go through the reverse proxy layer and access layer before they reach our site layer (our Spring MVC service), which means that both “reverse proxy layer to access layer” and “access layer to site layer” can cause slow requests. I told o&M about my arthas Trace results (MVC service execution time around 80ms) and the probability of front-end requests being more than 3s and asked them to check if there was a problem between the reverse proxy layer and the site layer, which they did shortly after.

The conclusion is this: There were originally three machines for MVC service, but one machine was reduced to two later. However, kongfu at the access layer still holds the IP of this reduced machine and does not kick it out. Therefore, after the front-end traffic comes in, due to the load balancing strategy of the access layer, there is a 1/3 probability that the request will be sent to the IP of this offline machine. Since the machine corresponding to the IP cannot respond to the request, kongfu will retry the request to either of the other two normal machines after the timeout, which means that most of the 3s of the request are spent waiting for the abnormal IP machine to respond.

Some people may ask, the machine is kicked off, the access layer Kongfu should be able to detect its offline, why would they send a request to this offline machine?

Yes, Kongfu will check whether the machine is alive through port detection. However, the problem is that although the machine with reduced capacity is recovered, its IP can also be reassigned to other machines. In this case, Kongfu will consider the machine with the CORRESPONDING IP is alive through port detection. The machine assigned this IP just happens not to be the Spring MVC service, so it can’t handle normal MVC requests. It can only wait until the request times out and then kongfu retries and forwards it to a normal machine.

“Curl -i www.example” can output the initial information, and then add the -b option to bring the cookie. Our access layer will return a header field named “X-kF-via” if it finds some special cookie in the request, as follows

X - KF - Via: agw (BJP = 10.65 X. 1:8 001,10.65. X. "001; b=mvc_service)Copy the code

10.65.x.1 failed, and then tried 10.65.x.2: 8,001 again. If I had known there was such an option, I would have been able to troubleshoot the problem in one step

Knowing the problem, the solution was simple: simply remove the offending machine from Kongfu

conclusion

Screening ideas actually relatively clear, but must be the request of the whole circulation process has a more sober understanding, so as to judge the problem quickly, now again looked back at the beginning of the classic interview problem “, please say in detail from the browser address bar enter the url to the final presents the results of process, the more detailed, the better “, believe that you will have a good feeling, This interview question will greatly improve your ability to identify and solve problems if you understand every point in the flow of requests. For example, this question has been reported in the past:

When doing the Server stress test, I found that the client sent repeated requests to the Server and received the response from the Server. During the process of receiving the response from the server, the recv server will respond and block for 40ms. However, according to the log of the server, the server will complete the request processing within 2ms and give the response to the client

If you know TCP, it is due to TCP’s delay acknowledgement mechanism and Nagle algorithm and congestion control, and it is natural to work in this direction, such as turning on TCP_NODELAY.

Welcome to follow the public number “code sea”, common progress, ^_^