background

Recently bad luck, almost every day by online problems harassment. Having solved a HashSet concurrency problem a few days ago, we had a performance problem on Saturday.

The general phenomenon is as follows:

An OpenAPI that we’ve provided responds quickly and slowly, sometimes by tens of milliseconds, sometimes by seconds.

Try to solve

Since this is also not a business problem, it cannot be directly located. So try to reproduce in the test environment, but unfortunately the test environment is fast.

I have no choice but to bite the bullet.

In the middle of the trip, I asked oEMS to check the response time of OpenAPI in Nginx and tried to throw the blame on the network. The result was indeed hit in the face; The logs in Nginx also show that there is indeed a problem with response time.

To get a clear picture of the problem, I’ve combed through the call process briefly.

The whole process is a common layered architecture:

  • The client request is receivedNginx.
  • NginxLoad the back endwebService.
  • webServices throughRPCCall back endServiceService.

The logging solution

The first thing that comes to mind is logging, recording processing times at potentially slow methods or interfaces to determine where there is a problem.

But in terms of the invocation chain, the request flow is not short. Adding logs involves more changes and may not locate the problem in case of missing.

Another is to change the code will also involve the release of the online.

Tools to analyze

So the best way to solve this problem is to not change a single line of code.

This is where an Agent tool is needed. We used Tprofile, which was open source by Ali.

Only need to add – javaagent to launch parameters: / xx/tprofiler jar can monitor you want to monitor method is time consuming, and can give you the output report, is very convenient. There is no intrusion into the code and little performance impact.

Tool use

Here’s a quick demonstration of how to use the tool.

The first step, of course, is to clone the source code and package it so that I can clone the modified source code.

Because ali has not maintained this project for many years, there are still some bugs, I fixed a bug affecting its use on the original basis, and made some optimization.

Execute the following script.

git clone https://github.com/crossoverJie/TProfiler

mvn assembly:assembly
Copy the code

Here will be in the project after TProfiler/PKG/TProfiler/lib/TProfiler – 1.0.1. Jar to generate good we should use the jar package.

All you need to do is configure the JAR package into the startup parameters, along with a configuration file path.

This configuration file I copy the official explanation.

#log file name logFileName = tprofiler.log methodFileName = tmethod.log samplerFileName = tsampler.log #basic StartProfTime = 1:00:00 # endProfTime = 23:00:00 # Sampling duration eachProfUseTime = 10 # EachProfIntervalTime = 1 samplerIntervalTime = 20 # Port = 50000 debugMode = false needNanoTime = false # ignoreGetSetMethod = true #file paths  logFilePath = /data/work/logs/tprofile/${logFileName} methodFilePath =/data/work/logs/tprofile/${methodFileName} samplerFilePath =/data/work/logs/tprofile/${samplerFileName} #include & excludes items excludeClassLoader = Org. Eclipse. Osgi. Internal. Baseadaptor. DefaultClassLoader includePackageStartsWith = # need to monitor the bag Top. Crossoverjie. Cicada. Example. The action # don't need to monitor package excludePackageStartsWith = com. Taobao. The sketch; org.apache.velocity; com.alibaba; com.taobao.forest.domain.dataobjectCopy the code

The final startup parameters are as follows:

-javaagent:/TProfiler/lib/tprofiler-1.0.1.jar
-Dprofile.properties=/TProfiler/profile.properties
Copy the code

In order to simulate the troubleshooting of slow interface response, I implemented an HTTP interface with CICADA. Two time-consuming methods are called:

This way, when I launch the application, Tprofile records the method information it has collected in the directory I configured.

I access interface at http://127.0.0.1:5688/cicada-example/demoAction? Name = test&ID =10 after a few times it writes the detailed response for each method to tprofile.log.

Each column from left to right represents:

Thread ID, method stack depth, method number, time (ms).

But tmethod.log is still empty;

In this case, we just need to execute this command to brush the latest method sampling information to the tmethod.log file.

java -cp /TProfiler/tprofiler.jar com.taobao.profile.client.TProfilerClient 127.00.1.50000 flushmethod

flushmethod success
Copy the code

It accesses a service exposed by Tprofile that reads, parses tprofile.log and writes tmethod.log.

The port is the port in the configuration file.

Open tmethod.log again:

This is where the method information is logged.

  • The first line of numbers is the method number. You can go by this numbertprofile.logQuery each time in (Details).
  • The number at the end of the line is the last line number of the method in the source code.

In fact, most performance analysis is to count the average time taken by a method.

Log tprofile.log to generate the average elapsed time for each method.

java -cp /TProfiler/tprofiler.jar com.taobao.profile.analysis.ProfilerLogAnalysis tprofiler.log tmethod.log topmethod.log topobject.log
print result success
Copy the code

Opening topmethod.log is the average time it takes for all methods.

  • 4 is the number of requests.
  • 205 is the average time.
  • 818 is the total time.

It is consistent with the actual situation.

Method details take time

There may be other needs; For example, what if I want to query all the details of a method?

It’s not officially available, but it’s available, just with a little bit of trouble.

Let’s say I want to see the time breakdown of selectDB() :

The first thing you need to know is the number of the method, which can be found in tmethod.log.

2 top/crossoverjie/cicada/example/action/DemoAction:selectDB:84
Copy the code

Serial number is 2.

We already know that tprofile.log records details, so we can view them by using the following command.

grep 2 tprofiler.log
Copy the code

To view the details of each execution, use method number 2 in the third column.

But this way is obviously not friendly enough, need to filter interference, steps are also many; So I’m going to add that as well.

You only need to enter a method name to query the collected time details of all methods.

conclusion

Go back to the previous question; Online analysis with this tool yielded the following results.

  • Some methods do run faster and slower, but they are database-specific. Due to the large pressure of the database at present, the separation of hot and cold data, as well as database and table are prepared in the next step.
  • Make some write operations to the database asynchronous before the first operation to reduce the response time.
  • Considering the accesspinpointsuchAPM tools.

There are a lot of tools like Tprofile, just find one that suits you.

Before the use of a similar to the pinpoint distributed tracking tool should rely on a lot of this tool, so the follow-up may also do some customization, such as adding some visual interface, can improve the efficiency of the investigation.

Your likes and shares are the biggest support for me