If you use Netty, this article may help you track down an off-heap (direct) memory leak.

Background

Recently I have been working on a WebSocket-based long-connection middleware. The server side uses netty-socketio, a framework that implements the Socket.IO protocol (built on top of WebSocket, with a long-polling fallback). Since I am fairly familiar with Netty, and this framework has a better reputation than the other Socket.IO implementations I compared, I chose it as the underlying core.

No open-source framework is free of bugs, and while using this one we ran into an off-heap memory leak. Being relatively familiar with Netty, I decided to take on the challenge and hunt the bug down. Below are the symptoms and the troubleshooting process; if you only want the conclusion, you can skip to the end.

The phenomenon

One morning, we received an alarm: the Nginx servers were returning a burst of 5xx responses.

We use Nginx as the layer-7 load balancer in front of the WebSocket servers, and a burst of 5xx usually means some backend servers have become unavailable. Since the Nginx alarm does not tell us which machine is affected, I went to CAT (Meituan-Dianping's unified monitoring platform) to check the metrics of the whole cluster and found two anomalies.

Some machines showed a spike at the same point in time, and at that same moment their JVM threads were blocked.

Thus began the long journey of troubleshooting an off-heap memory leak.

The troubleshooting process

Phase 1: Suspect log4j2

The first thing I wanted to do was locate the blocked threads. It turned out that log4j2 was blocking Netty's NIO threads (the scene was not preserved in time, so the screenshot is missing). Once the NIO threads are blocked, the server can no longer handle client requests, and Nginx starts returning 5xx.

Next, I looked at the log4j2 configuration file.

I found that the console appender had not been commented out, so we were printing far too many logs, and log4j2's console output is a synchronized, blocking write. My initial guess was that this was the culprit, so I commented out that appender on all online machines and thought the job was done. Unexpectedly, a few days later the 5xx alarms came knocking again; the problem, it seemed, was not so simple.

Phase 2: Suspicious logs emerge

Next, I went back to the logs, checked them around the point of failure, and found something suspicious.

Failed to allocate 64 (bytes) of direct memory (...) appeared in the logs (dozens of log files at once, several hundred megabytes each). The exception thrown is an OutOfDirectMemoryError defined by Netty itself; in other words, off-heap (direct) memory had run out and Netty was crying foul.

I was a little uneasy when I saw the words "off-heap memory leak", because it is about as hard to troubleshoot as a memory leak in C. My first thought was to look for anomalies before the OOM broke out, so I checked all the machine metrics on CAT and all the logs preceding the OOM log. Nothing unusual turned up, and I started cursing inwardly...

Phase 3: Locate OOM sources

There was nothing to do but stare at the awful OOM log and hope the answer would pop out. Eventually something caught my eye: a few lines below the OOM message were lines containing PlatformDependent.incrementMemoryCounter() (why hadn't I read further down earlier? Probably scared off by the phrase "off-heap memory leak" ==). That is Netty's own accounting of how much direct memory it has allocated. So the next step was to find that accounting code and understand Netty's off-heap memory bookkeeping. Scrolling through the source, I found the logic in the PlatformDependent class.

It operates on a counter, DIRECT_MEMORY_COUNTER: if the memory in use would exceed the maximum (a limit the user can configure), it throws Netty's custom OutOfDirectMemoryError, whose message is exactly what we saw in the log.
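To make the bookkeeping concrete, here is a minimal, self-contained sketch of the gist of that logic (a simplification, not the actual Netty source): a global counter is bumped on every direct allocation, and an error is thrown once the configured limit would be exceeded.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of Netty-style direct memory accounting (not the real source).
public class DirectMemoryAccounting {
    private final AtomicLong usedDirectMemory = new AtomicLong();
    private final long limit; // the user-configured maximum amount of direct memory

    public DirectMemoryAccounting(long limit) {
        this.limit = limit;
    }

    // Called before every direct allocation.
    public void reserve(int capacity) {
        long newUsed = usedDirectMemory.addAndGet(capacity);
        if (newUsed > limit) {
            usedDirectMemory.addAndGet(-capacity); // roll back before failing
            throw new OutOfMemoryError("failed to allocate " + capacity
                    + " byte(s) of direct memory (used: " + (newUsed - capacity)
                    + ", max: " + limit + ')');
        }
    }

    // Called when a direct buffer is freed; forgetting this is exactly what a leak looks like.
    public void release(int capacity) {
        usedDirectMemory.addAndGet(-capacity);
    }
}
```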

Next, I verified that this method is indeed called whenever direct memory is allocated.

Sure enough, Netty updates this counter on every direct memory allocation. My head was clearer and my mood much better.
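A quick way to see this counting in action in your own environment: recent Netty 4.1.x releases expose the counter publicly via PlatformDependent.usedDirectMemory() (an assumption worth checking against your version), so a tiny program can show it moving as buffers are allocated and released.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.UnpooledByteBufAllocator;
import io.netty.util.internal.PlatformDependent;

// Small demo: watch Netty's direct-memory counter move around an allocation.
// Assumes a Netty 4.1.x version that exposes PlatformDependent.usedDirectMemory().
public class DirectCounterDemo {
    public static void main(String[] args) {
        System.out.println("before allocate: " + PlatformDependent.usedDirectMemory());
        ByteBuf buf = UnpooledByteBufAllocator.DEFAULT.directBuffer(256);
        System.out.println("after allocate:  " + PlatformDependent.usedDirectMemory());
        buf.release();
        System.out.println("after release:   " + PlatformDependent.usedDirectMemory());
    }
}
```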

Phase 4: Monitoring off-heap memory via reflection

CAT's off-heap memory monitoring showed nothing unusual (it was apparently not measuring the right thing, staying flat at 1M), yet we had confirmed that off-heap memory was close to the limit, and we now knew which field Netty uses internally for its accounting. So the first thing to do was to grab that field by reflection and keep our own statistics on Netty's off-heap memory usage.

The off-heap memory statistics field is DIRECT_MEMORY_COUNTER, which can be accessed by reflection and checked periodically to monitor the growth of netty off-heap memory.

We read this field via reflection and print its value every second. Why every second?

Because, from the earlier analysis, there was nothing suspicious right before the OOM, there were only two possibilities: either a sudden burst of off-heap allocations triggered the OOM, or off-heap memory grew slowly until the last straw broke the camel's back. With this code added, we packaged and deployed it online.
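A minimal sketch of that reflection-based reporter, assuming Netty 4.1.x where the field is named DIRECT_MEMORY_COUNTER, that the counter is enabled (it can be null when limit tracking is off), and that reflective access to io.netty.util.internal is permitted (Java 9+ may require --add-opens):

```java
import java.lang.reflect.Field;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import io.netty.util.internal.PlatformDependent;

// Reflection-based off-heap memory reporter (sketch). See assumptions in the text above.
public class DirectMemoryReporter {

    public static void start() {
        final AtomicLong counter;
        try {
            Field field = PlatformDependent.class.getDeclaredField("DIRECT_MEMORY_COUNTER");
            field.setAccessible(true);
            counter = (AtomicLong) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot access Netty's DIRECT_MEMORY_COUNTER", e);
        }
        if (counter == null) {
            throw new IllegalStateException("direct memory tracking is disabled in this Netty setup");
        }
        // Print the current off-heap usage once per second.
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
                () -> System.out.println("netty direct memory used: " + counter.get() + " B"),
                0, 1, TimeUnit.SECONDS);
    }
}
```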

Phase 5: Slow growth or sudden surge?

After the code went live, the initial reading was 16384K (16M). This is because we use pooled direct memory online, and by default a pooled chunk is 16M, so the first chunk accounts for the initial 16M.
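The 16M figure matches the pooled allocator's default chunk size at the time (pageSize << maxOrder, i.e. 8192 << 11). If you want to check what your own version uses, something like this should do, assuming the PooledByteBufAllocator static accessors available in Netty 4.1.x:

```java
import io.netty.buffer.PooledByteBufAllocator;

// Print the pooled allocator's default chunk size: pageSize << maxOrder.
public class ChunkSizeCheck {
    public static void main(String[] args) {
        int pageSize = PooledByteBufAllocator.defaultPageSize(); // typically 8192
        int maxOrder = PooledByteBufAllocator.defaultMaxOrder(); // 11 in older 4.1.x releases
        int chunkSize = pageSize << maxOrder;
        System.out.println("default chunk size: " + chunkSize / (1024 * 1024) + " MB");
    }
}
```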

After a while, the reading began to climb slowly with no sign of being released; 20 minutes later it looked like this.

At this point I guessed we were in the second case mentioned earlier: an OOM caused by slow growth. Since the memory was growing so slowly, I adjusted the machine's load-balancing weight to twice that of the other machines, but it still grew only a few KB at a time. That day happened to be a Friday, so I left it running over the weekend to see what would happen.

After a pleasant weekend, the first thing I did back at the office was connect to the jump server, log in to the online machine, and tail -f the log again. I typed the command and hit Enter with great anticipation.

As expected, the memory had kept growing slowly; over one weekend the off-heap memory had crept up to nearly 1 GB. An idiom came to mind: constant dripping wears away the stone! Even though it only grows a few KB at a time, as long as it keeps going, the memory will eventually blow up (our online off-heap memory limit is set to 2 GB).

I started asking myself: why is memory growing slowly, and in step with what? Our application is a user-facing WebSocket service, so could it be that every time a user connects, interacts, and leaves, a little memory is allocated and never released? With this hypothesis, I set out to reproduce it offline.

Phase 6: Offline simulation

I started the server locally, with the monitoring thread printing off-heap memory usage in bytes (local traffic is tiny, one client at a time), and switched to non-pooled direct memory (smaller numbers make the problem easier to see). On startup the console printed the monitoring output as expected.
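For reference, one way to reproduce this setup locally is Netty's standard io.netty.allocator.type property; whether the framework actually honours it depends on how it obtains its allocator, so treat this as an assumption rather than the exact configuration we used.

```java
// Local debugging bootstrap (sketch). Setting the property before any Netty
// classes are loaded switches ByteBufAllocator.DEFAULT to unpooled buffers,
// so a single leaked 256B allocation shows up directly instead of hiding
// inside a pre-allocated 16M pooled chunk.
public class LocalDebugBootstrap {
    public static void main(String[] args) {
        System.setProperty("io.netty.allocator.type", "unpooled");
        DirectMemoryReporter.start(); // the reflection-based reporter from Phase 4
        // ... start the websocket server here ...
    }
}
```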

With no clients connected, off-heap memory stayed at 0, as expected. Then, quite excited, I opened a browser, typed in the URL, and began the simulation.

The simulation procedure was: open a new client connection -> disconnect -> open another client connection -> disconnect

As shown in the figure above, connect and disconnect mark the establishment and closing of a connection, and the logs in the green boxes cover the lifetimes of the two connections. We can see that memory grows by 256B each time a connection is closed and is never freed. The problem is now narrowed down: a bug in the framework must be triggered when a connection is closed. With the scope narrowed, the next step was to dig into the source code and catch the bug!

Phase 7: Offline investigation

I restarted the local service and began a full offline investigation. Setting a breakpoint on the framework's disconnect event (invoked when a client's WebSocket connection closes), I could basically confirm that memory requested around the disconnect event was not being freed.

A tip: when debugging in IDEA, set the breakpoint to suspend only the current thread, so the off-heap memory monitoring thread keeps printing to the console while you single-step.

After the client connected and then closed the connection, the breakpoint landed in the onDisconnect callback. I deliberately paused there for a while, and the console reading did not jump (I have not analyzed the extra 7B yet; the point is that while the breakpoint held the thread after the client disconnected, memory did not start to climb). Then the magic happened: I released the breakpoint and let the program run.

As soon as I released the breakpoint, memory surged!! At that moment I knew the bug could not get far: while I was stepping, the current thread was suspended, so it must be this thread that allocates direct memory somewhere further down and never releases it. Next: charge ahead, deep into the source.

Single-stepping onward, I kept an eye on the console reading at every step, and soon we arrived here.

Before this line executes, the console shows 263B; immediately after it executes, the reading jumps from 263B to 519B (an increase of 256B).

The scope of the bug narrowed further. I re-ran the program with a breakpoint on the client.send() line, connected a client again, then closed it, and stepped into the method. The subsequent call chain is quite long because of Netty's event propagation mechanism, so I will skip it here; eventually I followed the code into handleWebsocket.

Here I saw a very suspicious spot: on the line above the breakpoint, a call to the encoder allocates a chunk of memory, and right after that call the console reading increases by 256B, so I suspected that this buffer is never freed. The code then calls encoder.encodePacket(), which is supposed to write the packet's contents into that 256B buffer in binary form. I stepped into the encode code and, single-stepping, landed on this line.

This code converts the value of a field of the packet to a char. But when I evaluated the expression in IDEA, it threw a glaring NPE! In other words, after the framework allocates the buffer, the encoder blows up with a self-inflicted NPE, so the release logic in the outermost layer never runs. The leaked buffers accumulate, little by little, until the last straw blows up the off-heap memory. Readers interested in the source can analyze it themselves; for reasons of space I will not go further here.
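To illustrate why an exception at this point leaks the buffer (a generic illustration, not the framework's actual encoder): the buffer is allocated with reference count 1, and if the encode step throws before anyone takes ownership, nothing ever calls release().

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

// Generic allocate-encode-release pattern. If writePacket() throws and there is
// no catch/release, the 256B direct buffer is leaked; the guard below is the usual fix.
public class SafeEncode {
    public static ByteBuf encode(ByteBufAllocator allocator, Object packet) {
        ByteBuf out = allocator.directBuffer(256); // refCnt == 1, we own it
        try {
            writePacket(out, packet);              // may throw, e.g. an NPE
            return out;                            // ownership passes to the caller
        } catch (RuntimeException e) {
            out.release();                         // without this, the 256B is leaked
            throw e;
        }
    }

    private static void writePacket(ByteBuf out, Object packet) {
        out.writeBytes(String.valueOf(packet).getBytes());
    }
}
```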

Phase 8: Bug resolution

To fix the NPE, the goal is to ensure that the packet's subType field is not null. First, I located where the packet is constructed via IDEA's thread call stack.

In IDEA's debugger panel, keeping an eye on the packet object and walking up the call stack, I quickly located where the packet is created, which was in code we had already passed through earlier. Looking at its subType field: sure enough, it was null. From there, fixing the bug was easy.

Since this is a connection-closing event, I set the subType to DISCONNECT (I will dig deeper into the Socket.IO protocol later). It is a bit nasty that this bug is only triggered when a connection closes! ==
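A sketch of the shape of the fix, under the assumption that the framework's Packet/PacketType API (com.corundumstudio.socketio.protocol) matches your netty-socketio version; the essential point is simply that subType must be populated before the packet reaches the encoder.

```java
import com.corundumstudio.socketio.protocol.Packet;
import com.corundumstudio.socketio.protocol.PacketType;

// Illustrative fix: make sure the disconnect packet carries a non-null subType.
public class DisconnectPacketFix {
    static Packet buildDisconnectPacket() {
        Packet packet = new Packet(PacketType.MESSAGE);
        packet.setSubType(PacketType.DISCONNECT); // previously left null, which triggered the NPE
        return packet;
    }
}
```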

The process for shipping the fix: download the framework's source code, add this one line, rebuild, change the artifact name in the POM, and push it to our company's internal repository so that my project can depend on it directly.

After fixing a bug, it's customary to go to GitHub and find the commit that introduced it.


Phase 9: Offline validation

With everything in place, the first step is local validation. Once the service was up, I frantically opened connections, frantically closed them, and watched the off-heap memory reading.

Sure enough, no matter how many connections were closed, the off-heap memory no longer grew. The bug was basically fixed; the last step, of course, was to push the code online for verification.

Phase 10: Online validation

For the online validation, instead of the rather crude log-based approach, we reported the off-heap memory metric to CAT and then watched it over a period of time.
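A sketch of what that reporting can look like, reusing the counter obtained by reflection earlier; the metrics client is deliberately abstracted behind a LongConsumer because the exact CAT API call depends on your client version.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Periodically push the off-heap usage to a monitoring system instead of the log.
public class DirectMemoryMetricPusher {
    public static void start(AtomicLong nettyDirectCounter, LongConsumer reportKbUsed) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
                () -> reportKbUsed.accept(nettyDirectCounter.get() / 1024), // KB in use
                0, 30, TimeUnit.SECONDS);
    }
}
```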

After a while, we found that the off-heap memory had stabilized, and the bug hunt was over! Finally, let me wrap up this bug-hunting journey.

Conclusion

1. Do not be afraid of off-heap memory leaks. Analyze carefully and patiently, and you will always find a lead.

2. If you use Netty's off-heap memory, you can monitor its usage yourself without any third-party tools; as shown above, reading Netty's internal counter via reflection is one way to do it.

3. Keep narrowing the scope until you find the bug. Once you confirm that a particular thread's execution causes the bug, step over or step into, locate the offending line, follow it in, and repeat; this method always works, and eventually you will find the bug.

4. Be proficient with IDEA's debugger so you can catch bugs at lightning speed. The techniques I use most are evaluating expressions and walking the thread call stack; by keeping a close eye on one object you can trace where it is defined, assigned, and so on.

Finally, I hope everyone can find their own bugs!