Background

In a microservice architecture, each microservice has its own network address, while clients call through a single unified address. Something has to bridge the client and the services, and that is the role of the microservice gateway. The gateway connects clients to microservices and provides unified authentication, interface lifecycle management, better load balancing, circuit breaking and rate limiting, convenient monitoring, and a single entry point for risk control.

Today we introduce the Dubbo microservice gateway, an in-house gateway that converts the HTTP protocol to the Dubbo protocol. What matters for this article is its core mechanism: Dubbo generic invocation. The Dubbo website describes generic invocation as follows:

“Generic invocation is mainly used when the client has no API interface or model classes. All POJOs in the parameters and return values are represented as Maps. It is usually used for framework integration.”

The most common way to call a Dubbo service is to import the JAR published by the provider, which describes its interfaces. For a gateway, importing the JAR of every Dubbo provider is unrealistic, and every new interface would require redeploying the gateway, so generic invocation is used to solve this problem. The official website provides example code for this.
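A minimal sketch of a generic invocation with the Dubbo 2.6.x programmatic API. This is not the website's example. The application name, registry address, interface name, version, and method name are all assumptions made up for illustration, and running it requires Dubbo on the classpath plus a reachable registry.

```java
import com.alibaba.dubbo.config.ApplicationConfig;
import com.alibaba.dubbo.config.ReferenceConfig;
import com.alibaba.dubbo.config.RegistryConfig;
import com.alibaba.dubbo.rpc.service.GenericService;

class GenericCallSketch {
    public static void main(String[] args) {
        // Hypothetical application name and registry address.
        ApplicationConfig application = new ApplicationConfig("gateway-demo");
        RegistryConfig registry = new RegistryConfig("zookeeper://127.0.0.1:2181");

        // The interface is named by a string, so the provider's JAR is not needed.
        ReferenceConfig<GenericService> reference = new ReferenceConfig<>();
        reference.setApplication(application);
        reference.setRegistry(registry);
        reference.setInterface("com.example.DemoService"); // hypothetical interface
        reference.setVersion("1.0.0");                     // hypothetical version
        reference.setGeneric(true);                        // declare a generic reference

        // ReferenceConfig is heavyweight; a real gateway should cache it.
        GenericService genericService = reference.get();

        // Method name, parameter type names, and arguments are passed as plain data.
        Object result = genericService.$invoke(
                "sayHello",                        // hypothetical method
                new String[] {"java.lang.String"},
                new Object[] {"world"});
        System.out.println(result);
    }
}
```

POJO parameters and return values would be passed as Map instances in the same call, as the quoted description says.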

Problem description

This gateway had run stably since it went online, until one day it showed frequent full GCs, rising CPU usage, and a soaring error rate, even though the interface call volume had not increased. We restarted the machine immediately and kept a memory dump. A week of analysis produced no conclusion. Recently it happened again, with similar symptoms.

Frequent full GC

The old generation fills up

CPU usage and the error rate rise

Troubleshooting

  • Start with the memory dump

The monitoring shows that there is a memory problem, so we started by analyzing the memory dump taken during the first incident, using the Eclipse MAT (Memory Analyzer Tool) plug-in.

There were more than 7,000 RegistryDirectory objects, which pointed directly to a possible memory leak in RegistryDirectory. Next we traced what was holding these objects.

The holder turned out to be com.alibaba.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient$CuratorWatcherImpl, so we searched for instances of that class.

There were also a very large number of these objects. We then turned to the Dubbo source code (this article is based on 2.6.6), first looking for where com.alibaba.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient$CuratorWatcherImpl is created.

It is created in only one place, which we traced back to addChildListener.

addChildListener is called only when subscribing to a ZooKeeper node. Searching further for the places that subscribe to ZooKeeper turned up two:

  • createProxy in ReferenceConfig calls doRefer in RegistryProtocol, which subscribes to ZooKeeper.
  • A thread in FailbackRegistry continually retries the paths whose subscription failed, and recover re-subscribes when the ZooKeeper connection is re-established.

We suspected the second one first.

When a Dubbo application loses its ZooKeeper connection and then reconnects, recover re-subscribes to ZooKeeper. This is easy to simulate: set a breakpoint in an offline test environment and check whether a new CuratorWatcherImpl object is created when ZooKeeper disconnects and reconnects. No new object was created, because Dubbo caches the CuratorWatcherImpl object. For a subscription to the same URL, the same CuratorWatcherImpl object is returned rather than a new one.

  • Look to the Internet for answers

Changing approach, we searched online to see whether anyone had hit the same problem and found github.com/apache/dubb…, which looked almost identical to ours. Was it caused by a release? After several comparisons we ruled that out: releases had always gone fine, so why would one cause an incident just this time? Also, network card traffic was not high during either incident.

After ruling out this issue, we found another one, github.com/apache/dubb…

This was quickly ruled out too, because its root cause was that the reference was not cached, something Dubbo’s documentation warns about.

The faulty Dubbo gateway does cache the reference, using interface + version + group + timeout as the key (timeout is part of the key so that an interface’s timeout can be adjusted dynamically). In theory, a reference is never created twice.
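The caching scheme described above can be sketched with plain JDK classes. This is not the gateway’s code: the class name, the key separator, and the factory function are assumptions for illustration, and the generic type R stands in for the cached reference object.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Caches one reference object per (interface, version, group, timeout) combination. */
class ReferenceCacheSketch<R> {
    private final ConcurrentMap<String, R> cache = new ConcurrentHashMap<>();
    private final Function<String, R> factory;          // builds a reference for a key (hypothetical)
    final AtomicInteger created = new AtomicInteger();  // counts how many references were built

    ReferenceCacheSketch(Function<String, R> factory) {
        this.factory = factory;
    }

    /** Builds the cache key; the separator is an arbitrary choice for this sketch. */
    static String key(String iface, String version, String group, int timeoutMs) {
        return iface + "|" + version + "|" + group + "|" + timeoutMs;
    }

    /** Returns the cached reference, creating it only on the first request for a key. */
    R get(String iface, String version, String group, int timeoutMs) {
        return cache.computeIfAbsent(key(iface, version, group, timeoutMs), k -> {
            created.incrementAndGet();
            return factory.apply(k);
        });
    }
}
```

Because timeout is part of the key, changing an interface’s timeout produces a new reference while the old one stays cached. Note that a cache like this only prevents the reference object from being created twice; it says nothing about what happens inside that object when it is used repeatedly.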

  • “Waiting for a rabbit”

The investigation had truly hit a dead end. Fortunately, we came across the article “Netty heap memory leak investigation feast”, whose author, faced with a problem that was hard to diagnose, deployed monitoring code online to help locate it. Could we likewise add some code to see what all those subscriptions were? We looked through Dubbo’s source code for a place where the subscriptions could be read.

ZookeeperRegistry has a zkListeners field that records which URLs are subscribed. The plan was simple: periodically read this field by reflection and log its contents.
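The probe idea can be sketched as follows. This is not the code that was deployed: FakeRegistry is a stand-in so the example runs without Dubbo, and the class and method names are hypothetical. The only thing taken from the article is the technique, reading a private map field named zkListeners by reflection on a timer and logging its keys.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SubscriptionProbe {
    /** Stand-in for the real registry; only the private map matters here. */
    static class FakeRegistry {
        private final Map<String, Object> zkListeners = new ConcurrentHashMap<>();
        void subscribe(String url) { zkListeners.put(url, new Object()); }
    }

    /** Reads the named private map field and returns its keys as sorted strings. */
    static Set<String> subscribedKeys(Object registry, String fieldName) {
        try {
            Field field = findField(registry.getClass(), fieldName);
            field.setAccessible(true);
            Map<?, ?> listeners = (Map<?, ?>) field.get(registry);
            Set<String> keys = new TreeSet<>();
            for (Object key : listeners.keySet()) {
                keys.add(String.valueOf(key));
            }
            return keys;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot read field " + fieldName, e);
        }
    }

    /** Looks for the field on the class itself and then on its superclasses. */
    private static Field findField(Class<?> type, String name) throws NoSuchFieldException {
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            try {
                return c.getDeclaredField(name);
            } catch (NoSuchFieldException ignored) {
                // keep walking up the hierarchy
            }
        }
        throw new NoSuchFieldException(name);
    }

    /** Logs the subscribed keys at a fixed period on a daemon thread. */
    static ScheduledExecutorService start(Object registry, String fieldName, long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "subscription-probe");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                Set<String> keys = subscribedKeys(registry, fieldName);
                System.out.println("subscribed URLs: " + keys.size() + " " + keys);
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all further runs, so log and continue.
                System.err.println("probe failed: " + e);
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

In the real gateway the first argument would be the ZookeeperRegistry instance, and how that instance is obtained is left out here. Sorting the keys makes URLs that differ only in a query parameter appear next to each other in the log.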

We changed the code, tested it, deployed it to one production machine, and waited. After a whole afternoon, we finally caught it.

One service had been subscribed repeatedly, many times over, and the subscribed URLs differed only in their timestamp. Only this one service showed the behavior, so we suspected the service itself, and found that it had no provider.

Do services without a provider get subscribed repeatedly? We tried using the gateway to call a service with no provider (the reproduction steps are omitted here). A problem that can be reproduced is easy to troubleshoot: set a breakpoint and follow the call stack.

When the ReferenceConfig is obtained, its proxy is initialized; if the proxy has already been initialized, this step is skipped. The problem lies in createProxy.

If check=true and no provider exists, createProxy throws an exception, but by that point it has already subscribed to ZooKeeper and cached the RegistryDirectory object. We then checked whether there had been many such errors on the days of the incidents.

The number of errors did not seem large enough to account for so many RegistryDirectory objects. Debugging revealed the reason.

Each subscription attempt generates a new URL (only the timestamp differs) and adds it to the urls variable; the code then loops over urls and calls refer on each entry. So the first attempt subscribes one URL, the second attempt two, the third three, and so on. This is Gauss’s 1 + 2 + 3 + … + 100 problem: 100 requests produce 5,050 RegistryDirectory objects.
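The growth can be checked with a small simulation. This is a model of the behavior described above, not Dubbo code: the class, the URL format, and the counter are hypothetical stand-ins, with one counter increment representing one refer call that creates a RegistryDirectory. For n failed requests the total is n(n + 1)/2.

```java
import java.util.ArrayList;
import java.util.List;

class RepeatedReferSketch {
    /** Stand-in for the cached reference: this list survives across failed attempts. */
    private final List<String> urls = new ArrayList<>();
    private long directoriesCreated = 0;

    /** One gateway request against a service that has no provider. */
    void failedAttempt(long timestamp) {
        // A new URL that differs only in its timestamp is appended every time.
        urls.add("consumer://gateway/com.example.DemoService?timestamp=" + timestamp);
        // Every accumulated URL is then referred again, not just the new one.
        for (String url : urls) {
            directoriesCreated++; // models one RegistryDirectory per refer call
        }
    }

    /** Returns how many directories are created after the given number of requests. */
    static long simulate(int requests) {
        RepeatedReferSketch reference = new RepeatedReferSketch();
        for (int i = 1; i <= requests; i++) {
            reference.failedAttempt(i);
        }
        return reference.directoriesCreated;
    }

    public static void main(String[] args) {
        System.out.println(simulate(100)); // prints 5050
    }
}
```

The quadratic growth explains how a modest number of failed calls to a single service could account for thousands of leaked objects.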

The solution is simple: change check=true to check=false.
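For a reference built programmatically, as a gateway using generic invocation would do, the change might look like the sketch below. It uses the Dubbo 2.6.x ReferenceConfig API; the class and method names are hypothetical, the application and registry settings are omitted, and it needs Dubbo on the classpath to compile.

```java
import com.alibaba.dubbo.config.ReferenceConfig;
import com.alibaba.dubbo.rpc.service.GenericService;

class CheckFalseSketch {
    /** Builds a generic reference that does not fail when no provider is registered. */
    static ReferenceConfig<GenericService> newReference(String interfaceName) {
        ReferenceConfig<GenericService> reference = new ReferenceConfig<>();
        reference.setInterface(interfaceName);
        reference.setGeneric(true);
        reference.setCheck(false); // skip the provider availability check at creation time
        return reference;
    }
}
```

With the check disabled, a missing provider should surface as an error when the service is invoked rather than when the reference is created.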

Conclusion

  • This is a Dubbo bug. Version 2.7.5 removes the timestamp from the subscribed URL, so each URL is subscribed only once. The issue we had found earlier, https://github.com/apache/dubbo/issues/4587, describes it, but at the time that issue did not answer our question.
  • Set the reference’s check to false for generic invocation; otherwise a memory leak may occur. Ordinary calls (XML configuration) are unaffected, because with check=true the application fails to start if there is no provider.
  • Troubleshooting difficulty increases in this order: faults that can be located from monitoring, code, and logs < faults that can be reproduced reliably < faults that cannot be reproduced reliably (occasional) < faults that cannot be reproduced at all (they happen only once). The fault described in this article was of the kind that cannot be reproduced.