Takeaway

Netty is an asynchronous, event-driven network application framework for the rapid development of high-availability, high-performance servers and clients; it greatly simplifies TCP and UDP socket server programming.

Netty is built on top of JDK NIO. Why not use JDK NIO directly, or another NIO framework?

  1. Using the JDK’s NIO directly requires mastering too many concepts and adds a lot of programming complexity.
  2. Netty lets you switch the underlying I/O model freely, with only minor code changes (see the sketch after this list).
  3. Netty’s built-in framing/unpacking, exception detection, and other mechanisms free us from the gritty details of NIO so we can focus on business logic.
  4. Netty works around a number of JDK bugs, including the notorious epoll empty-polling (busy-spin) bug.
  5. Netty makes many small optimizations to threads and selectors, and its Reactor threads are carefully designed for highly efficient concurrent processing.
  6. It ships with a variety of protocol stacks, letting us handle almost any common protocol without implementing it ourselves.
  7. The Netty community is active; questions are welcome on the mailing list or as issues.
  8. Netty has been validated at scale in major RPC frameworks (Dubbo), messaging middleware (RocketMQ), and big data communication (Hadoop), and is extremely robust.
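To make point 2 concrete, here is a minimal sketch (not from the original article) of a Netty server bootstrap. Switching from the JDK NIO transport to the native epoll transport only touches the event-loop group and the channel class; the pipeline handlers stay unchanged.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
// For the native epoll transport (Linux only), the swap is:
//   io.netty.channel.epoll.EpollEventLoopGroup
//   io.netty.channel.epoll.EpollServerSocketChannel

public final class TransportSwitchSketch {
    public static void main(String[] args) throws Exception {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // epoll: new EpollEventLoopGroup(1)
        EventLoopGroup worker = new NioEventLoopGroup();  // epoll: new EpollEventLoopGroup()
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                    .group(boss, worker)
                    .channel(NioServerSocketChannel.class)   // epoll: EpollServerSocketChannel.class
                    .childHandler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            // pipeline handlers are added here and are identical for either transport
                        }
                    });
            bootstrap.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            worker.shutdownGracefully();
        }
    }
}
```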

Background

Recently, I have been building a WebSocket-based long-connection middleware. The server uses the netty-socketio framework, a Netty-based implementation of the Socket.IO protocol (built on WebSocket, with long-polling fallback). Since I am familiar with Netty, and netty-socketio has a better reputation than other frameworks that also implement Socket.IO, I chose it as the underlying core.

Of course, no open-source framework is immune to bugs, and while using this one we ran into an off-heap (direct) memory leak. One of Meituan’s values has always been “pursue excellence”, so we wanted to take on the challenge of finding the bug; this article records the problem and the troubleshooting process. If you only want the conclusion, feel free to skip to the summary at the end.

The problem

One morning, we suddenly received an alarm: a large number of 5xx responses on the Nginx servers.

We use Nginx as the layer-7 load balancer in front of the WebSocket servers, and a burst of 5xx usually means the backend servers are unavailable. Since the Nginx alarm does not say which machine is unavailable, we went to CAT (Meituan-Dianping’s unified monitoring platform, now open source) to check the metrics of the whole cluster, and found the following two anomalies:

One machine showed a burst of GC (garbage collection) at that moment, and its JVM threads were blocked at the same time.

Then began our long hunt for the off-heap memory leak.

The troubleshooting process

Stage 1: Log4j2 is suspected

Because many threads were blocked, the first step was to find out which threads were blocked, and it turned out that Log4j2 was blocking Netty’s NIO threads. Once the NIO threads block, the server cannot handle client requests, and Nginx sees 5xx.

Next, we looked at the Log4j2 configuration file.

We found that the appender that prints to the console had been left in (never commented out), so our first guess was that the project was printing too many logs and Log4j2’s synchronous printing to the console was blocking, which caused the problem. So we commented out that appender on all the online machines, thinking we were “done”, but only a few days later the 5xx alarm came knocking again. Clearly the problem was not as simple as we first thought.

Stage 2: Suspicious logs emerge

So we had to brute-force our way through the logs, especially around the time of the failure, and found another suspicious spot:

In a very short span, a flood of “failed to allocate 64 byte(s) of direct memory (…)” logs appeared (a dozen log files at a time, hundreds of megabytes each), each throwing the OutOfDirectMemoryError that Netty itself defines. In plain terms, the off-heap memory was exhausted, and Netty had been “crying foul” all along.

An off-heap memory leak is a depressing thing to hear about. The problem is about as hard to track down as a memory leak in C, so the first thought was to look for anomalies before the OOM burst. We checked every machine-related metric on CAT and all the logs before the OOM logs, and found nothing unusual. At that moment, “ten thousand horses” were galloping through our hearts…

Stage 3: Locating the source of the OOM

There was nothing to do but stare at these obnoxious OOM logs, hoping the answer would jump out, which was of course wishful thinking. Then, at the end of the day, came a flash of insight: the lines at the bottom of the OOM stack trace suddenly looked dazzling. (Why hadn’t we looked at them earlier? Probably because the phrase “off-heap memory leak” had scared us off.) The lines were …PlatformDependent.incrementMemory()….

So these lines told us that the off-heap memory was being counted by Netty itself. We went through the code to find this logic, and it lives in the PlatformDependent class.

Netty adds every allocation to DIRECT_MEMORY_COUNTER, and if the used memory exceeds the off-heap limit (which the user can specify), it throws its own custom OOM error. The text inside that exception is exactly what we saw in the log.

Next, we verified that this method is indeed called whenever off-heap memory is allocated.

Sure enough, Netty does its own accounting every time it allocates off-heap memory. With that, the train of thought slowly started to clear, and the mood shifted from “rustling autumn wind” to “spring sunshine”.
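To make the mechanism concrete, here is a simplified sketch of this kind of bookkeeping. It is not Netty’s actual code (the real logic lives in io.netty.util.internal.PlatformDependent), only the general shape: an atomic counter is bumped on every allocation, and an error is thrown when the configured limit would be exceeded.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified illustration of Netty-style direct-memory accounting; NOT Netty's actual code.
final class DirectMemoryAccounting {
    static final AtomicLong DIRECT_MEMORY_COUNTER = new AtomicLong(); // bytes currently allocated
    static final long DIRECT_MEMORY_LIMIT = 2L * 1024 * 1024 * 1024;  // e.g. a 2 GB cap

    static void incrementMemory(int capacity) {
        long newUsed = DIRECT_MEMORY_COUNTER.addAndGet(capacity);
        if (newUsed > DIRECT_MEMORY_LIMIT) {
            DIRECT_MEMORY_COUNTER.addAndGet(-capacity); // roll back before failing
            throw new OutOfMemoryError("failed to allocate " + capacity
                    + " byte(s) of direct memory (used: " + (newUsed - capacity)
                    + ", max: " + DIRECT_MEMORY_LIMIT + ")");
        }
    }

    static void decrementMemory(int capacity) {
        DIRECT_MEMORY_COUNTER.addAndGet(-capacity); // every release must be paired with its allocation
    }
}
```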

Stage 4: Monitoring off-heap memory via reflection

CAT showed nothing abnormal for off-heap memory (its statistics were not accurate here and stayed at 1 MB), yet we had confirmed that off-heap usage was close to the limit, and we now knew which field Netty uses internally for its accounting. So the first step was to read that field via reflection and gather our own statistics on Netty’s off-heap memory usage.

The off-heap accounting field is DIRECT_MEMORY_COUNTER; reading it via reflection and checking it periodically lets us monitor the growth of Netty’s off-heap memory.
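A minimal sketch of such a reflection-based monitor, assuming a Netty 4.1.x build where the internal field is still named DIRECT_MEMORY_COUNTER (it is not public API and may change between versions, and it can be null if Netty’s own counting is disabled):

```java
import io.netty.util.internal.PlatformDependent;

import java.lang.reflect.Field;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class DirectMemoryReporter {

    public void start() throws Exception {
        // DIRECT_MEMORY_COUNTER is a private static AtomicLong inside PlatformDependent.
        Field field = PlatformDependent.class.getDeclaredField("DIRECT_MEMORY_COUNTER");
        field.setAccessible(true);
        AtomicLong counter = (AtomicLong) field.get(null); // may be null if counting is disabled

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Print the current off-heap usage once per second.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("netty direct memory: " + counter.get() / 1024 + " KB"),
                0, 1, TimeUnit.SECONDS);
    }
}
```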

So we read this field by reflection and printed it once a second. Why do it this way?

Because, from our earlier analysis, there was nothing suspicious right up until the OOM erupted. That left only two possibilities: either a huge amount of off-heap memory was allocated in an instant, triggering the OOM, or the off-heap memory grew slowly until, at some point, the last straw broke the camel’s back. After adding this code, we packaged the build and shipped it.

Stage 5: Slow growth or sudden surge?

Once the code went live, the initial reading was 16384 KB (16 MB). This is because online we use pooled direct memory, which allocates 16 MB per chunk by default.
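As a side note, recent Netty versions (4.1+) also expose the pooled allocator’s chunk size and usage through a metrics API, which can be used to sanity-check that 16 MB figure; a small sketch:

```java
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

public class AllocatorMetricSketch {
    public static void main(String[] args) {
        PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        // chunkSize() is pageSize * 2^maxOrder; with the long-standing defaults (8192, 11)
        // that is 16 MB, matching the 16384 KB reading above.
        System.out.println("chunk size: " + metric.chunkSize() / 1024 + " KB");
        System.out.println("used direct memory: " + metric.usedDirectMemory() / 1024 + " KB");
    }
}
```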

But after a while, the reading began to creep upward, with no sign of release. Twenty minutes later, the memory usage looked like this:

At this point we guessed it might be the second case above: an OOM caused by slow memory growth. Since the growth was so slow, we doubled the machine’s load-balancing weight relative to the other machines, but it still grew only a few KB at a time. It happened to be a Friday, so we let it run over the weekend.

After the weekend, we connected to the jump server as soon as we got to the office, logged in to the online machine, and ran tail -f to keep watching the logs. After typing the command, we hit Enter in anticipation:

As expected, the memory had kept growing slowly; over the weekend the off-heap memory had climbed to nearly one gigabyte. An old saying came to mind: “grind an iron pestle long enough and it becomes a needle.” The off-heap memory was growing only a few kilobytes at a time, but as long as it kept going, it would eventually be exhausted (the online off-heap limit was set to 2 GB).

At this point we started asking ourselves: why is the memory growing slowly, and what is it growing with? Since our application is a client-facing WebSocket service, could it be that every time a user connects, interacts, and leaves, a little memory is allocated and never released? With this question in mind, we set out to reproduce the problem offline.

Stage 6: Offline simulation

After starting the server locally, we changed the unit of the off-heap memory monitor to bytes (local traffic is tiny; we planned to connect one client at a time). We also switched the local setup to unpooled memory (the numbers are small, which makes the problem easier to spot); the allocator switch is sketched below. Once the server was up, the console printed the readings.
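For reference, forcing the unpooled allocator locally looks roughly like this (a sketch assuming Netty 4.1 property and option names; where exactly the option is applied depends on how netty-socketio bootstraps its channels):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.UnpooledByteBufAllocator;
import io.netty.channel.ChannelOption;

public class UnpooledSetupSketch {
    public static void main(String[] args) {
        // Option 1: switch Netty's default allocator globally (must be set before Netty classes load),
        // e.g. as a JVM argument: -Dio.netty.allocator.type=unpooled
        System.setProperty("io.netty.allocator.type", "unpooled");

        // Option 2: set the allocator explicitly on the bootstrap that creates the channels.
        ServerBootstrap bootstrap = new ServerBootstrap()
                .option(ChannelOption.ALLOCATOR, UnpooledByteBufAllocator.DEFAULT)
                .childOption(ChannelOption.ALLOCATOR, UnpooledByteBufAllocator.DEFAULT);
        // ... group/channel/handler configuration omitted; this only illustrates the allocator switch.
    }
}
```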

When no clients were connected, the off-heap memory reading stayed at zero, as expected. Then, with great excitement, we opened a browser, typed in the URL, and started the simulation.

Our simulation flow was: open a new client connection -> close it -> open another client connection -> close it again.

As shown in the figure above, Connect and Disconnect mark the establishment and closing of a connection, and the logs in the green boxes cover the life cycles of the two connections. We can see that every time a connection is closed, the reading jumps by 256 bytes and never comes back down. This narrowed the problem further: some bug in the framework must be triggered when the connection closes, that code path allocates 256 bytes before hitting the bug, and because the bug fires, the memory is never released. With the problem narrowed down, it was time to dig into the source and catch the bug!

Stage 7: Offline troubleshooting

Next, we restarted the local service and began the full offline verification. Looking at the netty-socketio framework’s Disconnect event (called when a client’s WebSocket connection is closed), it was almost certain that memory requested around the Disconnect event was not being freed.

When debugging in IDEA, we chose to suspend only the current thread, so the console could still show the off-heap statistics thread printing its logs while we stepped through.

After a client connected and then disconnected, the breakpoint landed in the onDisconnect callback. We deliberately lingered there and found that the console reading did not jump (the extra 7 bytes we did not analyze for now; the point was that while the breakpoint was held after the disconnect, the memory did not start to grow). Then something magical happened: we released the breakpoint and let the program run:

The moment we resumed from the breakpoint, the memory jumped! At this point we knew the bug would not escape much longer: since suspending the current thread kept the memory flat, the current thread must be requesting off-heap memory somewhere and then failing to release it. We kept pushing through the source.

In fact, after every single step we checked the console for a memory jump. Soon we arrived here:

Up to this line, the console still read 263 bytes; as soon as the line executed, it jumped from 263B to 519B (an increase of 256B).

This narrowed the bug further. We set a breakpoint on the client.send() line, removed the other breakpoints, connected a client, and stepped straight into this method. The rest of the path is fairly long, since it runs through Netty’s event-propagation mechanism, so it is omitted here. Eventually we traced to the following code in handleWebsocket:

Here we saw something very suspicious: at the breakpoint in the figure above, a call to the encoder allocates a chunk of memory, and right after the call the console reads 256B more. So we suspected that the memory requested here was never released. The code then calls encoder.encodePacket(), which presumably writes the packet’s content into that 256-byte buffer in binary form. We followed the encode code, and after a few more steps landed on this line:

This code converts a field of the packet to a char. But when we pre-evaluated the expression in IDEA, it threw a glaring NPE! In other words, right after requesting a chunk of memory, the encoder trips over an NPE and digs itself into a hole, so the memory can never be freed (the off-heap release logic sits in an outer layer that is now never reached). The allocations keep piling up until the last straw breaks the camel’s back. Interested readers can analyze the surrounding source themselves; for reasons of space we will not expand on it here.
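The leak pattern itself is generic and worth spelling out. The sketch below is purely illustrative, not the framework’s actual code: a direct buffer is allocated, the encode step throws (here, the NPE on a null field), and the release that would normally happen further down the pipeline is never reached. Releasing on failure is the general cure.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

public class EncodeLeakSketch {

    // Buggy shape: if encodePacket() throws (e.g. an NPE on a null field),
    // the buffer is never written out and never released -> off-heap leak.
    ByteBuf encodeLeaky(ByteBufAllocator allocator, Object packet) {
        ByteBuf out = allocator.directBuffer(256);
        encodePacket(packet, out);   // throws NullPointerException when a required field is null
        return out;                  // normally released later by the outbound pipeline
    }

    // Safer shape: release the buffer on any failure before re-throwing.
    ByteBuf encodeSafely(ByteBufAllocator allocator, Object packet) {
        ByteBuf out = allocator.directBuffer(256);
        try {
            encodePacket(packet, out);
            return out;
        } catch (Throwable t) {
            out.release();           // give the 256 bytes back to the allocator
            throw t;
        }
    }

    private void encodePacket(Object packet, ByteBuf out) {
        // placeholder for the framework's encoder logic
    }
}
```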

Stage 8: Bug resolution

Now that the bug was found, it was time to fix it. All we had to do was resolve the NPE: make sure the subType field is not null. So we first located where the packet is defined, using IDEA’s thread call stack:

In IDEA’s Debugger panel we kept our eyes on the packet object and quickly spotted the culprit: subType was null. From there, fixing the bug was easy.

Since this is a connection-close event, we set the subType field to DISCONNECT (you could spend another day digging into Socket.IO to see why). This bug is triggered whenever a connection is closed.
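In spirit, the fix is a one-liner. The sketch below uses the Packet/PacketType names as we recall them from the framework, so treat the exact identifiers as an assumption rather than the actual commit:

```java
import com.corundumstudio.socketio.protocol.Packet;
import com.corundumstudio.socketio.protocol.PacketType;

final class DisconnectPacketFix {
    // Hypothetical sketch of the fix (identifiers assumed from the framework; versions may differ):
    // build the disconnect packet with its subType populated so the encoder no longer hits an NPE.
    static Packet buildDisconnectPacket() {
        Packet packet = new Packet(PacketType.MESSAGE);
        packet.setSubType(PacketType.DISCONNECT); // previously left null, which triggered the NPE
        return packet;
    }
}
```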

To ship the fix, we downloaded the framework’s source, added this one line, rebuilt it, changed the artifact name in the POM, and pushed it to our company’s repository so the project could use it directly.

After fixing the bug, it is customary to go to GitHub and find the commit that introduced it:


Stage 9: Offline verification

With everything in place, we verified locally: once the service was up, we frantically opened connections and frantically closed them, watching what happened to the off-heap memory:

Bingo! No matter how many times we disconnected, the off-heap memory no longer grew. The bug was essentially fixed; the last step, of course, was to push the code online for verification.

Stage 10: Online validation

For the online validation, we skipped the clumsy logging approach and instead reported the off-heap memory metric to CAT, then observed the off-heap memory over a period of time. A sketch of the reporting change follows.
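Roughly, the change replaced the console print in our reporter with a metric call; a sketch, where reportToCat() is a hypothetical stand-in for the actual monitoring-client API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class DirectMemoryCatReporter {

    private final AtomicLong nettyDirectMemoryCounter; // obtained via reflection as in the earlier sketch

    public DirectMemoryCatReporter(AtomicLong counter) {
        this.nettyDirectMemoryCounter = counter;
    }

    public void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
                () -> reportToCat("netty.direct.memory.kb", nettyDirectMemoryCounter.get() / 1024),
                0, 1, TimeUnit.SECONDS);
    }

    // Hypothetical stand-in for the monitoring client's metric API.
    private void reportToCat(String metricName, long value) {
        // forward to the company's monitoring client here
    }
}
```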

After a while, the off-heap memory stabilized. With that, our bug hunt came to an end. Finally, here is a short summary that we hope will be useful.

Conclusion

  1. Do not be afraid of off-heap memory leaks. Analyze them carefully and patiently; there is always a thread to pull.

  2. If you use Netty’s off-heap memory, you can monitor its usage yourself with a bit of reflection; no third-party tools are needed.

  3. Keep narrowing the scope until the bug is found. Once you confirm that the execution of a particular thread causes the bug, single-step it or bisect the execution; after locating a line of code, follow it in and keep single-stepping or bisecting until you reach the code that causes the bug. This method always works: you will find the bug in the end.

  4. Get comfortable with IDEA debugging, so that bug hunting is as fast as lightning (“Flash”, after all). The techniques used most here were evaluating expressions before they run and walking the thread call stack, keeping an eye on an object to see where it is defined and assigned.

Finally, may you all catch the bugs you have been hunting for!

About the author

Flash joined Meituan-Dianping in 2014 and is mainly responsible for the unified long-connection infrastructure for Meituan-Dianping’s mobile clients. Peers are welcome to reach out for technical exchanges.

Recruitment

Our team is currently responsible for infrastructure at Meituan-Dianping, supporting the rapid growth of almost all of the company’s businesses, including hotels, food delivery, in-store services, ride-hailing, and finance. Join us and you will experience tens of millions of concurrent online connections and tens of billions of requests per day, face the high-concurrency and high-availability challenges of the Internet, and get to apply Netty across all kinds of long-connection scenarios. We sincerely invite passionate, thoughtful, experienced, and capable colleagues to work with us! Interested candidates are welcome to send a resume to chao.yu#dianping.com.
