Whoever is fastest takes the blame (technical analysis)


Original: Taste of Little Sister (WeChat official account ID: XjjDog). Sharing is welcome; please keep the source. Any reprint that does not retain this statement counts as plagiarism.

Warning: if you don’t have relevant industry experience, this article will be hard to follow.

Late at night, a message from your team lead: “The interface you wrote has a problem! Get up and take a look.”

Ding! When the messenger goes off, you know it’s time to work.

But thinking it over, it seems impossible. Your code is just a simple Redis query. Could Redis itself be down?

Your colleague has posted all the evidence to the group chat, and it is indeed your call path: a simple GET query is taking 2 seconds on average. jstack and the Prometheus monitoring both point straight at your interface!

You log in to the Redis server and everything looks fine. What now? Are you really going to take the blame for something this murky?

1. Being fast is the original sin

In this case, trust your instincts. Your interface is fast and well written; you are probably taking the hit precisely because you stand out.

In some “high-concurrency” environments, resources are not isolated, so logs and tools can be very misleading when a problem occurs.

The symptoms show up on the fastest, most frequently requested interfaces, even though in theory they cannot be the cause.

As shown above. This is very common.

Most requests are scheduled by the Tomcat thread pool and then handled by business code. The thread pool itself does not do the heavy lifting, of course; that work is handed off to resource pools such as:

A database connection pool, which runs both time-consuming statistical operations and quick queries
A Redis connection pool, which runs both blocking slow queries and simple GET/SET commands
An HTTP connection pool (HttpClient, OkHttp, etc.), used for remote calls to resources of widely varying speed

In everyday coding we usually share one such pool across everything, because it is the easiest thing to write and requires no thought.

But if your service itself is not split up and isolated, the problem can be fatal. For example, you put a reporting interface and a high-concurrency consumer-facing (C-side) interface on the same instance.

At that point, the reporting interface can drag you down with it.

2. An example

Let’s use database connection pooling as an example to illustrate this process, starting with the following basic information:

The Tomcat thread pool is configured with 200 threads
The MySQL connection pool is set to 50 connections, which is already fairly large
Interface A runs a time-consuming query that takes 5 seconds
Interface B is very fast; its database queries respond in under 200 ms

Interface B is fast, and its request volume is far higher than interface A’s. Under normal conditions the two coexist peacefully.

One day, interface A suddenly receives a large number of queries. Because each one takes a long time, the 50 connections in the database pool fill up quickly. (Interface B responds fast and holds a connection only briefly, so the connections it releases are gradually eaten up by the slow interface A.)

From then on, requests to both interface A and interface B have to wait around five seconds for the next database connection before they can proceed.

After a while, the state of the service looks like this:

The database pool’s 50 connections fill up quickly, almost entirely with slow queries
Tomcat’s 200 threads fill up quickly too, mostly with requests to the fast interface B, because it receives so many of them
Every interface is now blocked waiting for a Tomcat thread, so even a request that never touches the database takes about five seconds
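
The arithmetic behind this pile-up can be checked with a small deterministic model. The sketch below is not from the original article: `SharedPoolModel` is a hypothetical class that reuses the example’s numbers (50 connections, 5-second slow queries, 200 ms fast queries), ignores the Tomcat limit, and assumes connections are handed out in arrival order. Time is simulated, so nothing actually sleeps.

```java
import java.util.PriorityQueue;

// Hypothetical model, not production code: a pool of connections served
// first-come-first-served, with simulated time in milliseconds.
class SharedPoolModel {
    // For each connection, the simulated time at which it becomes free.
    private final PriorityQueue<Long> freeAt = new PriorityQueue<>();

    SharedPoolModel(int poolSize) {
        for (int i = 0; i < poolSize; i++) {
            freeAt.add(0L); // every connection is idle at t = 0
        }
    }

    // A request arrives at arrivalMs and holds a connection for holdMs.
    // Returns how long it waited. Call in non-decreasing arrival order.
    long submit(long arrivalMs, long holdMs) {
        long earliestFree = freeAt.poll();
        long start = Math.max(arrivalMs, earliestFree);
        freeAt.add(start + holdMs);
        return start - arrivalMs;
    }

    // A burst of slow queries arrives at t = 0; one fast query follows at t = 100 ms.
    static long fastWaitAfterSlowBurst(int poolSize, int slowRequests) {
        SharedPoolModel pool = new SharedPoolModel(poolSize);
        for (int i = 0; i < slowRequests; i++) {
            pool.submit(0, 5_000); // interface A: 5-second query
        }
        return pool.submit(100, 200); // interface B: 200 ms query
    }

    public static void main(String[] args) {
        System.out.println("10 slow queries: fast request waits "
                + fastWaitAfterSlowBurst(50, 10) + " ms");
        System.out.println("50 slow queries: fast request waits "
                + fastWaitAfterSlowBurst(50, 50) + " ms");
    }
}
```

With 10 slow queries the fast request waits 0 ms; with 50 it waits 4900 ms, even though its own query still takes only 200 ms. That waiting time is what the monitoring charges to interface B.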

When faced with this kind of problem, we tend to use jstack to print the thread stacks, or look at some internal monitoring graphs. Unfortunately, most of this information is deceptive: the slow requests you see are not the ones that are really slow.

From XjjDog’s analysis above, it should be easy to see where the problem lies: a bottleneck resource that is not isolated sets off a chain reaction in the resources upstream of it.

But at work, XjjDog has more than once watched colleagues run around in circles, gathering piles of evidence that point at fast, well-behaved interfaces which have nothing to do with the problem.

They gleefully post screenshots and @-mention the people involved, full of confidence.

When this happens, you can do a preliminary analysis using the following script:

# Count how many threads are waiting on each lock address, most contended first
$ grep "waiting to lock" 10271.tdump | awk '{print $5}' | sort | uniq -c | sort -rn

  26 <0x0000000782e1b590>
  18 <0x0000000787b00448>
  16 <0x0000000787b38128>
  10 <0x0000000787b14558>

In the example above, we look up the stacks waiting on lock 0x0000000782e1b590 and find that they are all stuck in an HttpClient read operation. In practice, look at the top few lock addresses and find what their stacks have in common.

These stacks, which take up very little of the output, are the real root cause of the problem.

3. How to solve it

Enlarging the Tomcat thread pool, or enlarging the database connection pool, does not solve the problem; it will most likely recur.

The best solution, of course, is to split the time-consuming services from the normal ones, as in the popular microservices approach. If your service’s queries are slow and your calls time out, that has nothing to do with my service.

However, the very fact that your service has this kind of problem suggests that your company is not in a position to make that transformation. You have to work within the single service.

The approach is isolation.

As shown in the figure above, we create two MySQL connection pools in the same project, both pointing to the same MySQL address. This way, operations on one pool are relatively independent of operations on the other.
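
A minimal sketch of the two-pool idea follows. It is an illustration under assumptions, not the article’s actual code: `IsolatedPools` is a hypothetical class, each `Semaphore` permit stands in for one real database connection, and the sizes 10 and 40 are made up for the demo.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for two real connection pools that point at the
// same MySQL address. Each Semaphore permit represents one connection.
class IsolatedPools {
    private final Semaphore slowPool; // reports and other long queries
    private final Semaphore fastPool; // short, high-volume queries

    IsolatedPools(int slowSize, int fastSize) {
        this.slowPool = new Semaphore(slowSize);
        this.fastPool = new Semaphore(fastSize);
    }

    // Non-blocking: returns false right away if that pool is exhausted.
    boolean tryBorrow(boolean slowQuery) {
        return (slowQuery ? slowPool : fastPool).tryAcquire();
    }

    // Returns a connection to the pool it was borrowed from.
    void giveBack(boolean slowQuery) {
        (slowQuery ? slowPool : fastPool).release();
    }

    public static void main(String[] args) {
        IsolatedPools pools = new IsolatedPools(10, 40); // sizes are assumptions
        for (int i = 0; i < 10; i++) {
            pools.tryBorrow(true); // a burst of slow queries takes every slow connection
        }
        System.out.println("slow pool has room: " + pools.tryBorrow(true));
        System.out.println("fast pool has room: " + pools.tryBorrow(false));
    }
}
```

Once the slow pool is exhausted, further slow queries are refused, while fast queries still find a connection in their own pool.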

But that alone is not enough, because your Tomcat thread pool is still shared.

The policy by which slow queries obtain a connection from their pool should change as well: do not wait for a long time, but fail fast (a short connection-acquisition timeout also works). Otherwise the symptoms will be the same.
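
The fail-fast policy can be sketched as a bounded wait on the pool. This is an assumption-laden illustration: `FailFastPool` is a hypothetical class, a `Semaphore` permit again stands in for a connection, and the 50 ms limit is arbitrary. Real pools usually expose the same idea as a configuration setting (HikariCP, for example, calls it `connectionTimeout`).

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a real connection pool: each permit is one connection.
class FailFastPool {
    private final Semaphore permits;
    private final long maxWaitMs;

    FailFastPool(int size, long maxWaitMs) {
        this.permits = new Semaphore(size, true); // fair: waiters are served in order
        this.maxWaitMs = maxWaitMs;
    }

    // Waits at most maxWaitMs for a connection and returns false instead of
    // queueing indefinitely, so the caller can release its thread quickly.
    boolean tryBorrow() {
        try {
            return permits.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    void giveBack() {
        permits.release();
    }

    public static void main(String[] args) {
        FailFastPool reportPool = new FailFastPool(1, 50); // values are assumptions
        System.out.println("first borrow:  " + reportPool.tryBorrow());
        System.out.println("second borrow: " + reportPool.tryBorrow()); // gives up after about 50 ms
        reportPool.giveBack();
        System.out.println("after return:  " + reportPool.tryBorrow());
    }
}
```

A caller that gets `false` back would return an error or a degraded response right away, so the slow interface holds a request thread for at most the configured wait plus its own handling time.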

The currently popular circuit-breaker concept also puts this kind of isolation into practice to some extent.

End

We can think of similar scenarios as well:

When a JVM stop-the-world (STW) pause occurs, the interfaces hit hardest are the fast, high-volume ones. The time-consuming interfaces always perform badly anyway, so nobody notices when they are abnormal.

A bunch of interfaces are connected to the same database. When the database jitters, the fast, high-volume interfaces are affected the most. The slow, time-consuming queries behave that way all the time, so nobody suspects them.

What people fail to realize is that once requests to these bad interfaces rise, they act like one rat dropping that spoils the whole pot of soup: every request gets dragged down.

It is a bit like everyday work: once there are more inefficient people, the whole project slows down. Leaders keep wondering why so many skilled people are so unproductive.

That is because they are being held back. When attention is fixed on individuals, the most fundamental problems stay hidden beneath the surface.

Developers within a company should never all be treated alike. Employees with different technical ambitions should be isolated in a similar way.

A good team has smooth communication, shared goals, and high efficiency; those who are good at slowing projects down should be placed on less productive teams and left to fend for themselves.

Having said that, the point is this: not everyone understands this pattern, and few people care about the root causes. It takes a lot of effort to explain to your boss that there is nothing wrong with your interface.

“Boss, I found the cause. A slow MySQL query filled up Tomcat’s thread pool, which made the HTTP requests backed by Redis respond slowly.” A relationship this convoluted really is a headache to explain.

“Very good,” says the leader. “You take the lead on fixing this problem.”

You see, most leaders don’t focus on the cause of a problem. They focus on who can fix it, even if it is not your problem. That is what you get for writing good code and delivering requirements quickly!

XjjDog is a WeChat official account that keeps programmers from going astray, with a focus on infrastructure and Linux. Drawing on ten years of architecture work and systems handling ten billion requests a day, it explores the world of high concurrency with you and offers a different flavor. My personal WeChat is xjjdog0; feel free to add me for further discussion.
