Hello everyone, I am Xiaolou ~

Today I'm sharing another optimization story. The subject is the "ancestral" (laugh) project I inherited when I joined the company. It has come up repeatedly in my articles, and I've done a lot of performance work on it, such as the earlier piece about an 18x performance improvement. Compared with those detail-level optimizations, this article is about performance optimization at a more macro level. A veteran performer on this stage, so to speak.

Background

To bring new readers up to speed, let me describe the background of this project once more. It is a self-developed Dubbo registry. Here is the architecture diagram from a previous article:

  • Consumers and Providers send their service discovery requests (register, unregister, subscribe) to the Agent, which acts as their proxy
  • The Registry and the Agent maintain a long-lived gRPC connection, whose purpose is to push Provider changes to the affected Consumers in a timely manner. To ensure data correctness, a combined push-and-pull mechanism is implemented: the Agent also pulls the subscribed service list from the Registry at fixed intervals
  • The Agent is deployed on the same machine as the business Service, similar to a Service Mesh, which minimizes intrusion into the Service and enables fast iteration

The Registry is the protagonist of today's article. If you are familiar with Dubbo, you can think of it as a ZooKeeper; if you are not, think of it as a web application that provides interfaces for registration, unregistration, and subscription. Although it is written in Go, this article has little to do with Go itself, and some of the code is pseudocode anyway, so feel free to read on.

Do we have to do performance tuning?

Before doing any performance optimization, we have to answer a few questions: what are the benefits? Why must we optimize? What happens if we don't?

Performance tuning serves two purposes:

  • Reduce resource consumption and cost
  • Improve system stability

If the goal is just to cut costs, it's best to estimate how much you can actually save before you start. If you spend the better part of a month only to save a tiny amount of resources, it isn't worth it.

Back to the registry: why did it need performance tuning?

When a Dubbo application starts, it registers with the registry. If registration fails, application startup is blocked.

At first this was not a big problem because there were few applications, but by the time I took over the project, there were more and more of them.

Meanwhile, the wider group was gradually replacing virtual machines and physical machines with containers, scaling out to absorb traffic peaks. Rapid scale-out requires services to start within a short time, which is undoubtedly a big test for the registry.

The direct trigger for this optimization was a drill within the group: the startup dependency on a configuration center failed to meet the standard, causing a scale-out to fail. After the post-mortem, every startup dependency was required to meet certain performance requirements, and the standard was set at 1000 QPS.

Hence this article.

Establish metrics

If you can't measure it, you can't optimize it.

First, add metrics for the core interfaces: latency percentiles (P99 / P95 / P90) and the number of failed requests.
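
The article doesn't say which metrics library the project uses, so to make this concrete, here is a minimal sketch of per-interface latency quantiles using Prometheus's Go client (the metric name and objectives are my own invention for illustration):

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// latency keeps client-side P90 / P95 / P99 estimates per interface.
var latency = promauto.NewSummaryVec(prometheus.SummaryOpts{
	Name:       "registry_request_duration_seconds", // hypothetical name
	Help:       "Latency of core registry interfaces",
	Objectives: map[float64]float64{0.9: 0.01, 0.95: 0.005, 0.99: 0.001},
}, []string{"interface"})

// observe records one request, e.g. observe("register", time.Since(start)).
func observe(iface string, d time.Duration) {
	latency.WithLabelValues(iface).Observe(d.Seconds())
}

A Summary with those objectives maintains the quantiles on the client side; a HistogramVec plus server-side quantile queries would work just as well.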

Secondly, run a load test against the project. Without knowing the current performance level, there would be no way to prove the effect of later optimizations.

Take the registration interface as an example: at the time it managed about 40 QPS. Keep that number in mind and watch how we got to 1000 QPS step by step.

The criteria for passing the load test were: P99 latency under one second, and no errors.

Where is the bottleneck?

The key to performance optimization is figuring out where the bottleneck is; otherwise you're just a headless chicken.

What does the registration interface actually do? Here is a schematic diagram:

  • The entire process is locked to prevent concurrent operations
  • Create App and Create Cluster create the application and the cluster; they only do work the first time an application registers
  • Insert Endpoint inserts the registration data, namely the IP and port
  • The underlying storage of the system is MySQL, and Lock / UnLock is a pessimistic lock implemented on top of MySQL

From this flow chart you can guess that the bottleneck is most likely the lock: it is pessimistic, its granularity is the App, and the whole process runs under it, so only one request from the same application can pass at a time. You can imagine how poor the performance is.

As for how to implement a pessimistic lock on MySQL, I'm sure you can do it, so I won't expand on it here.
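
For completeness anyway: the usual way is SELECT ... FOR UPDATE inside a transaction. A minimal Go sketch, with hypothetical table and column names:

import (
	"context"
	"database/sql"
)

// lockApp takes a MySQL row lock; any other transaction running the same
// SELECT ... FOR UPDATE on this row blocks until we Commit or Rollback.
func lockApp(ctx context.Context, db *sql.DB, appName string) (*sql.Tx, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return nil, err
	}
	var id int64 // the lock row is assumed to exist (seeded per app)
	if err := tx.QueryRowContext(ctx,
		"SELECT id FROM app_lock WHERE app_name = ? FOR UPDATE", appName).Scan(&id); err != nil {
		tx.Rollback()
		return nil, err
	}
	return tx, nil // caller does its work, then tx.Commit() releases the lock
}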

To prove my guess, I used a very clumsy but effective method: after each key step executed, I recorded the elapsed time and printed it all in a log line, so I could see at a glance where it was slow. The slowest step was indeed the locking.
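
The "clumsy method" is nothing more than stopwatch logging. A runnable Go sketch of the idea (the sleeps are stand-ins for the real Lock / Create App / Insert Endpoint calls):

package main

import (
	"log"
	"time"
)

// timed runs fn and logs its elapsed time, so the slow step stands out.
func timed(step string, fn func()) {
	start := time.Now()
	fn()
	log.Printf("%s took %v", step, time.Since(start))
}

func main() {
	timed("lock", func() { time.Sleep(800 * time.Millisecond) })
	timed("create app", func() { time.Sleep(10 * time.Millisecond) })
	timed("insert endpoint", func() { time.Sleep(5 * time.Millisecond) })
}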

Optimizing the lock

Before optimizing the lock, let's first figure out why it is there. After repeated testing, code reading and digging through documentation, I found the answer is actually very simple: the lock prevents App, Cluster and Endpoint records from being written twice.

But why use a lock to prevent duplicate writes rather than, say, a unique index in the database? There is no way to verify the original reasoning now; that's just how it is. So how do we crack it?

  • First, check whether these tables can take a unique index, and add one wherever possible
  • Second, can the database pessimistic lock be replaced with a Redis optimistic lock?

The second option is actually feasible, because the client has a retry mechanism: if there is a concurrency conflict, it simply retries, and the chance of anything getting stuck is very small.

But suppose App and Cluster cannot take a unique index for some special reason. The probability of conflict on them is very high: when the same cluster deploys, 100 machines may well register at the same time, and only one of them succeeds in creating the App or Cluster. The remaining 99 are blocked by the lock, retry, conflict again, retry again, and everyone falls into endless retries until timing out; our service might even be overwhelmed by the retry traffic.

What to do? I was reminded of the "double-checked locking" from the singleton pattern that I practiced writing when I first learned Java. Let's look at the code:

public class Singleton {
    private static volatile Singleton instance = null;
    private Singleton() {}

    public static Singleton getInstance() {
        if (instance == null) {                   // first check, no lock taken
            synchronized (Singleton.class) {
                if (instance == null) {           // second check, under the lock
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }
}

Mapping this onto our scenario: App and Cluster only need to guarantee uniqueness at creation time, so later requests can query first, and if the record exists there is no need to insert at all. In pseudocode:

app = DB.get("app_name")
if app == null {
    redis.lock()
    app = DB.get("app_name")
    if app == null {
        app = DB.insert("app_name")
    }
    redis.unlock()
}
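
The pseudocode leaves redis.lock() abstract. One possible concrete rendering, a sketch assuming the go-redis client and a SETNX-based lock with a TTL (the project's real lock implementation is not shown in this article; get and insert stand in for the DB calls):

import (
	"context"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

var errConflict = errors.New("lock conflict, please retry")

// getOrCreate is the double-checked pattern: cheap read first; only on a miss
// take a short-lived Redis lock, re-check, then insert.
func getOrCreate(ctx context.Context, rdb *redis.Client, key string,
	get func() (any, error), insert func() (any, error)) (any, error) {

	if v, err := get(); err == nil && v != nil {
		return v, nil // fast path: record already exists, no lock taken
	}
	// SETNX with a TTL as a crude lock; the TTL guards against a crashed holder.
	ok, err := rdb.SetNX(ctx, "lock:"+key, 1, 5*time.Second).Result()
	if err != nil {
		return nil, err
	}
	if !ok {
		return nil, errConflict // the client's retry mechanism takes over
	}
	defer rdb.Del(ctx, "lock:"+key)

	if v, err := get(); err == nil && v != nil {
		return v, nil // someone else created it while we raced for the lock
	}
	return insert()
}

On a lock conflict this returns an error instead of blocking, which relies on the client retry mechanism mentioned above.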

Isn't it exactly the shape of double-checked locking? Why does this perform better? Because App and Cluster are only ever inserted the first time, the probability of actually taking the lock is tiny. In the scale-out scenario, for example, the locking branch is normally never entered; only when an application registers for the very first time does it really lock.

The lesson: optimizing the high-frequency paths is where the payoff is; if something executes rarely, we can choose to leave it alone.

After this round of optimization, registration performance rose from 40 QPS to 430 QPS, roughly a tenfold improvement.

Read cache

The last round of optimization also led to a realization: the basic information of an application or cluster essentially never changes. So I wondered whether this information could be served straight from a Redis cache.

So the objects whose information essentially never changes were put into the cache, and another test showed QPS rising from 430 to 440. Not much of an improvement, but however small the fly, it's still meat.
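
The article doesn't show the caching code; a hedged sketch of what a read-through Redis cache for this kind of rarely-changing data might look like (key format, table name and TTL are invented for illustration):

import (
	"context"
	"database/sql"
	"time"

	"github.com/redis/go-redis/v9"
)

// getClusterInfo reads rarely-changing cluster info through a Redis cache.
func getClusterInfo(ctx context.Context, rdb *redis.Client, db *sql.DB, name string) (string, error) {
	key := "cluster:" + name // hypothetical key format
	if v, err := rdb.Get(ctx, key).Result(); err == nil {
		return v, nil // cache hit: MySQL is skipped entirely
	}
	var info string
	if err := db.QueryRowContext(ctx,
		"SELECT info FROM cluster WHERE name = ?", name).Scan(&info); err != nil {
		return "", err
	}
	// The data essentially never changes, so a long TTL is acceptable.
	rdb.Set(ctx, key, info, time.Hour)
	return info, nil
}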

CPU optimization

The last round's gains were not ideal, but during the load test I noticed something: the Registry's CPU usage would not come down, and I felt the bottleneck had shifted from the lock to the CPU. A CPU problem is easy to approach: bring out the flame graph, and Go's built-in pprof can produce it.
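
If you haven't used it: pprof only needs an HTTP endpoint, and the flame graph comes from go tool pprof. A minimal setup (the ports are arbitrary):

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func init() {
	go func() {
		// Expose pprof on a side port, then sample 30s of CPU and open the
		// flame graph with:
		//   go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}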

The flame graph showed ParseUrl taking up far too much CPU. A bit of quick background: Dubbo passes many of its parameters inside a URL, and when the registry receives a Dubbo URL it has to parse out parameters such as the IP and port.

At first this was a bit of a headache, because ParseUrl is just the standard library's URL parsing, and writing something more efficient than that is next to impossible.

But looking at where the method is called from, it turned out the URL in a single request gets parsed several times during its execution. Why was the code written this way? Probably because the logic is so complicated, nested layer upon layer, and parameter passing between methods is inconsistent, that it ended up with this bad pattern.

What can be done about it?

  • Refactor: unify URL parsing in one place and pass the parsed result around afterwards, so nothing needs to parse it again
  • Add a caching layer for URL parsing, at the granularity of a single request session, to ensure each URL is parsed only once per session

I chose the second option because it required the smallest code change. After all, I had just inherited this large and chaotic codebase; the best policy was to keep the status quo and move as little as possible.

I'm also very familiar with this trick, because the Dubbo source code does the same thing: during deserialization, if an object repeats, Dubbo takes it from a cache instead of constructing it again. See org.apache.dubbo.common.utils.PojoUtils#generalize.

Here is a small excerpt to get a feel for it:

private static Object generalize(Object pojo, Map<Object, Object> history) {
    ...
    Object o = history.get(pojo);
    if (o != null) {
        return o;
    }
    history.put(pojo, pojo);
    ...
}

With this idea in mind, change ParseUrl into a cached version:

func parseUrl(url, cache) {
    if cache.get(url) != null {
        return cache.get(url)
    }
    u = parseUrl0(url)
    cache.put(url, u)
    return u
}

Since the cache is session-level, a new cache is created for each session, which guarantees that the same URL is parsed only once within a session.
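
Fleshing the pseudocode out into real Go might look like the sketch below. Since the cache lives only inside one request session, I assume a single goroutine touches it and skip the mutex; add one if a session is ever shared across goroutines:

import "net/url"

// urlCache memoizes URL parsing within a single request session.
type urlCache struct {
	seen map[string]*url.URL
}

func newURLCache() *urlCache {
	return &urlCache{seen: make(map[string]*url.URL)}
}

// parse returns the cached result if this raw URL was parsed earlier in the session.
func (c *urlCache) parse(raw string) (*url.URL, error) {
	if u, ok := c.seen[raw]; ok {
		return u, nil
	}
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	c.seen[raw] = u
	return u, nil
}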

You can see the result of this optimization: QPS went straight to 1100, hitting the target ~

One last word

Some readers will want to flame me after reading this: how is this performance optimization? This is clearly just filling in pits! Yes, you're right. It's just that someone else dug them.

This article is about solving the performance problems of ancestral code at the minimum cost. Of course, I don't encourage taking shortcuts as a habit; I am also refactoring this project, but each stage calls for a different solution.

I hope this article leaves you with some basics of performance optimization: start from the question of why to do it at all, establish a measurement system, find the bottleneck, optimize step by step, and adjust direction promptly based on the data's feedback.

Let’s call it a day and see you next time.


Search for and follow the WeChat official account "bug catching master" for back-end technology sharing: architecture design, performance optimization, source code reading, troubleshooting, and hands-on practice.

Recommended past articles

  • “Oh, my God, the code for Go got busted.”
  • What is dachang’s preferred Agent technology?
  • “Just released: Java development manual, Huangshan edition. I've circled out the changes for you!”
  • This Dubbo Registry extension is kind of interesting!