The background story

Beep beep beep, production accident, memory leak!

Yesterday afternoon, I suddenly got a message from the operations team: a production system at one of our branch offices had a memory leak — could I please help look into it?

The troubleshooting process

Step 1: Check the logs

The abnormal log sent by the branch looks something like this (company rules forbid screenshots, photos, or sharing any internal information, so the error below is a similar one I found on the Internet):

LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.

Recent access records: 

#1: 
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:273)
	io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
	io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
	java.lang.Thread.run(Thread.java:748)

ByteBuf is not released, causing a memory leak.

Step 2: Look at the memory metrics

Since we know it's a memory leak, the next thing to do was ask the operations team to check memory usage — especially off-heap memory, since Netty is involved. Their feedback: in-heap memory usage was normal, while off-heap memory stayed high.

OK, at this point it is clear that there is an out-of-heap memory leak.

At this point there were two things to do in parallel: one was to switch the gateway to Zuul and observe it under load, the other was to track down the memory leak itself.

Step 3: Ask for the code

I asked the person in charge of the branch project to send me the code. When I opened it, I was amazed: it was a bare-bones Spring Cloud Gateway project containing only two custom classes — an AuthFilter for permission checks and an XssFilter for XSS protection.

Spring Cloud Gateway runs on Netty; Zuul 1.x runs on Tomcat — which is why switching to Zuul was worth trying as a comparison.

Step 4: Initial suspicion

A quick scan of the two classes showed that XssFilter contains code that works with ByteBuf, but there was no obvious spot where a ByteBuf was created and never released. The easy check would be to disable this class and see whether the leak disappears.

But how do you verify that? You can't just delete the class and run it in production.

Step 5: Parameter and monitoring changes

By default, Netty allocates buffers as PooledDirectByteBuf — pooled, direct memory — so to make the leak easy to observe I first needed to switch off pooling.

Direct memory, that is, out-of-heap memory.

Why turn off pooling?

Because pooling caches memory: it grabs 16MB chunks at a time and does not release them immediately, which makes it very hard to observe whether memory is actually leaking unless you debug extremely slowly.
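
To make that concrete, here is a minimal sketch of my own (not part of the original troubleshooting), assuming Netty's default settings so that PlatformDependent's direct memory counter is active:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.UnpooledByteBufAllocator;
import io.netty.util.internal.PlatformDependent;

public class PoolingDemo {
    public static void main(String[] args) {
        System.out.println("start:          " + PlatformDependent.usedDirectMemory());

        // Unpooled: the counter grows by roughly the 256 bytes that were requested
        // and drops again when the buffer is released.
        ByteBuf unpooled = UnpooledByteBufAllocator.DEFAULT.directBuffer(256);
        System.out.println("unpooled alloc: " + PlatformDependent.usedDirectMemory());
        unpooled.release();
        System.out.println("after release:  " + PlatformDependent.usedDirectMemory());

        // Pooled: even a tiny buffer reserves a whole chunk (16MB by default),
        // and releasing the buffer returns it to the pool instead of freeing the chunk,
        // so the counter stays high -- exactly what makes leaks hard to spot.
        ByteBuf pooled = PooledByteBufAllocator.DEFAULT.directBuffer(256);
        System.out.println("pooled alloc:   " + PlatformDependent.usedDirectMemory());
        pooled.release();
        System.out.println("after release:  " + PlatformDependent.usedDirectMemory());
    }
}

With the unpooled allocator, a leaked buffer therefore shows up as a small number that never goes back down — which is exactly the signal we want.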

So how do you turn off pooling?

In Netty, all ByteBufs are created through an allocator called ByteBufAllocator. The ByteBufAllocator interface holds a default allocator; find that default allocator and where it is created, and look at the relevant code.

public interface ByteBufAllocator {

    ByteBufAllocator DEFAULT = ByteBufUtil.DEFAULT_ALLOCATOR;
}

public final class ByteBufUtil {

    static final ByteBufAllocator DEFAULT_ALLOCATOR;

    static {
        String allocType = SystemPropertyUtil.get(
                "io.netty.allocator.type", PlatformDependent.isAndroid() ? "unpooled" : "pooled");
        allocType = allocType.toLowerCase(Locale.US).trim();

        ByteBufAllocator alloc;
        if ("unpooled".equals(allocType)) {
            alloc = UnpooledByteBufAllocator.DEFAULT;
            logger.debug("-Dio.netty.allocator.type: {}", allocType);
        } else if ("pooled".equals(allocType)) {
            alloc = PooledByteBufAllocator.DEFAULT;
            logger.debug("-Dio.netty.allocator.type: {}", allocType);
        } else {
            alloc = PooledByteBufAllocator.DEFAULT;
            logger.debug("-Dio.netty.allocator.type: pooled (unknown: {})", allocType);
        }

        DEFAULT_ALLOCATOR = alloc;
    }
}

As you can see, which allocator is used is controlled by the system property io.netty.allocator.type.

OK, so just add this parameter to the JVM startup arguments and set it to unpooled.
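
Concretely — assuming the gateway is launched with a plain java -jar (the jar name below is just a placeholder) — the startup command becomes:

java -Dio.netty.allocator.type=unpooled -jar gateway.jar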

Pooling is now off, but how can we see in real time whether memory is leaking?

Netty’s PlatformDependent class counts all direct memory usage.

I've been digging into Netty's source code recently, so I'm quite familiar with these details; I'm also preparing a Netty column, which I'll start publishing once the groundwork is done.

So all we need is a scheduled task that periodically prints this statistic. Here's the code:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.annotation.PostConstruct;
import org.springframework.stereotype.Component;
import io.netty.util.internal.PlatformDependent;

@Component
public class Metrics {

    @PostConstruct
    public void init() {
        ScheduledExecutorService scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();
        // print Netty's direct memory counter once per second
        scheduledExecutorService.scheduleAtFixedRate(() ->
                System.out.println("used direct memory: " + PlatformDependent.usedDirectMemory()),
                1, 1, TimeUnit.SECONDS);
    }
}

Just put it in the same package as the startup class (or a sub-package) so it gets picked up by component scanning.

With pooling turned off and monitoring in place, it's time to debug.

Step 6: Preliminary debugging

Run the startup class directly and observe the logs.

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0

At the beginning, the direct memory looks fine; it stays at zero.

Then I sent a request to the gateway — a bit embarrassing, it came back 404, because no route was configured yet, so nothing was actually forwarded.

Step 7: Modify the configuration

So I quickly spun up a SpringBoot project with a test endpoint, and modified the gateway configuration so that requests could be forwarded to it:

spring:
  cloud:
    gateway:
      routes:
      - id: test
        uri: http://localhost:8899/test
        predicates:
        - Path=/test
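
For completeness, the downstream test endpoint was just a trivial controller. A minimal sketch of what it might look like (the class and method names are my own, not from the original project), running in the separate SpringBoot service on port 8899:

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class TestController {

    // answers GET /test, assuming server.port=8899 in that project's config
    @GetMapping("/test")
    public String test() {
        return "ok";
    }
}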

Step 8: Debug again

After modifying the configuration, start both projects — the gateway and the SpringBoot service — send a request, and watch the direct memory usage:

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 1031
used direct memory: 1031
used direct memory: 1031

Sure enough, memory was not freed.

Step 9: Delete the XssFilter

To verify my initial suspicion about the XssFilter, I deleted it, started the project again, sent a request, and watched the direct memory usage.

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 1031
used direct memory: 1031
used direct memory: 1031

The problem is still there, and it’s the same size as the previous leak.

Netty "guesses" how big a buffer to allocate for each read; the initial guess is 1024 bytes.

@Override
public ByteBuf allocate(ByteBufAllocator alloc) {
    return alloc.ioBuffer(guess());
}

I didn't expect Netty to have such a cute side ^^. Ahem — off topic, back to business!

On top of that there are 7 bytes used to store things like carriage returns and line feeds, and those 7 bytes are never released; together that adds up to 1024 + 7 = 1031.

private static final byte[] ZERO_CRLF_CRLF = { '0', CR, LF, CR, LF };
// 2B
private static final ByteBuf CRLF_BUF = unreleasableBuffer(directBuffer(2).writeByte(CR).writeByte(LF));
// 5B
private static final ByteBuf ZERO_CRLF_CRLF_BUF = unreleasableBuffer(directBuffer(ZERO_CRLF_CRLF.length)
        .writeBytes(ZERO_CRLF_CRLF));

Well, that’s interesting. Since it’s not XssFilter, could it be AuthFilter?

Step 10: Kill the AuthFilter

Kill the AuthFilter, restart the project, send a request, and watch the direct memory:

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 1031
used direct memory: 1031
used direct memory: 1031

The problem remains — the same familiar memory size again.

At this point I was stuck, and what follows is the detour I took.

Step 11: Think

With XssFilter and AuthFilter both removed, only the startup class was left — plus, of course, the newly added monitoring class.

I started to think I might have discovered something big: if I'd really found a bug in Spring Cloud Gateway, that would be something to brag about.

Since the allocated memory is never freed, let's find where the memory is allocated and set a breakpoint there.

From the earlier analysis we already know the allocator in use is UnpooledByteBufAllocator, so I set a breakpoint in its newDirectBuffer() method — after all, it's direct memory that's leaking.

Step 12: Debug step by step

Starting from the breakpoint in UnpooledByteBufAllocator's newDirectBuffer() and stepping through, the allocation eventually traces back to ByteToMessageDecoder's channelRead():

// io.netty.handler.codec.ByteToMessageDecoder.channelRead
@Override
public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    if (msg instanceof ByteBuf) {
        CodecOutputList out = CodecOutputList.newInstance();
        try {
            first = cumulation == null;
            // 1. Cumulate: msg is a ByteBuf
            cumulation = cumulator.cumulate(ctx.alloc(),
                    first ? Unpooled.EMPTY_BUFFER : cumulation, (ByteBuf) msg);
            // 2. Decode
            callDecode(ctx, cumulation, out);
        } catch (DecoderException e) {
            throw e;
        } catch (Exception e) {
            throw new DecoderException(e);
        } finally {
            if (cumulation != null && !cumulation.isReadable()) {
                numReads = 0;
                // 3. Release memory
                cumulation.release();
                cumulation = null;
            } else if (++numReads >= discardAfterReads) {
                // We did enough reads already try to discard some bytes so we not risk to see a OOME.
                // See https://github.com/netty/netty/issues/4275
                numReads = 0;
                discardSomeReadBytes();
            }

            int size = out.size();
            firedChannelRead |= out.insertSinceRecycled();
            // 4. Fire the values remaining in out further down the pipeline
            fireChannelRead(ctx, out, size);
            // 5. Recycle out
            out.recycle();
        }
    } else {
        ctx.fireChannelRead(msg);
    }
}

This took several hours — inside the ChannelPipeline I accidentally stepped over the key call a couple of times and had to start all over again; there's nothing for it but going step by step.

This method mainly converts a ByteBuf into a Message — that is, a decoded message, which you can think of as a plain Java object. The main logic is marked with numbered comments in the code above.

Before cumulation.release() (step 3) is called, the reference count of msg (which is the same object as cumulation) is 4; after the release it is 2.

After steps 4 and 5, out is recycled, but msg is still not released — its reference count never reaches 0. The problem has to be somewhere in this area.
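
As a quick refresher on how Netty's reference counting works — a generic illustration of my own, not the gateway code:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class RefCntDemo {
    public static void main(String[] args) {
        ByteBuf buf = Unpooled.directBuffer(16);
        System.out.println(buf.refCnt()); // 1: a freshly allocated buffer starts at 1
        buf.retain(2);
        System.out.println(buf.refCnt()); // 3: every retain() adds to the count
        buf.release(2);
        System.out.println(buf.refCnt()); // 1: every release() subtracts from it
        buf.release();
        System.out.println(buf.refCnt()); // 0: only now is the underlying memory freed
    }
}

So a reference count that gets stuck above 0 — like the msg here — means some retain() was never matched by a release(), and the underlying direct memory is never returned.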

I went around in circles here, reading the code in decode over and over. The unreleased msg, once decoded, becomes a DefaultHttpContent object, which represents the body of an HTTP message — in this case it was the body of the HTTP response coming back from the downstream call.

That confused me too: when I tried it, the HTTP request body didn't seem to go through this logic at all, and I spent a long time hunting for where the HTTP request body was handled, without making any progress.

By 9 pm the office was empty and the lights were off (during the pandemic only a few people per department were allowed into the office each day), so I packed up and went home.

Step 13: Take a cab home

In the car I kept turning it over in my mind, replaying the whole process: could it be that I was heading in the wrong direction?

Spring Cloud Gateway has been out for quite a while now and I haven’t heard of any memory leaks, so I’m starting to doubt myself.

No — when I get home, I'll write a project of my own that uses Spring Cloud Gateway and run it.

Step 14: Write a project that uses Spring Cloud Gateway

When I got home, I quickly opened my laptop and wrote a project using Spring Cloud Gateway plus a SpringBoot project as the downstream service, added the monitoring, turned off pooling, started both, sent requests, and watched the direct memory:

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0

Nani?! At this point it was clear: the problem isn't Spring Cloud Gateway. So what is it?

The company project must be using it in some wrong way — but there was nothing left in it; I had deleted every class except the startup one.

Wait — the pom file.

I brought up the jump server, logged in to the company machine, and looked at pom.xml; most of it was just SpringBoot and SpringCloud's own dependencies.

But yes — there is a common package, written by the branch itself. Opening that common package, I found it pulls in three more JAR packages, and one of them is Tomcat.

Oh, come on. At that moment I really wanted to swear — what on earth is all this? ~~

In fact, I should have thought of the pom the moment deleting the AuthFilter changed nothing; instead I was fantasizing that Spring Cloud Gateway might have a bug and charged head-down in that direction.

We know that Spring Cloud Gateway uses Netty as its server to receive requests and then forwards them to the downstream system — and yet here was a Tomcat dependency sitting in the pom. A genuinely funny situation.

Step 15: Kill Tomcat

Remove the Tomcat JAR from the pom file, restart the project, send a request, and watch the direct memory:

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0

Aha, no leak anymore. So it was Tomcat.

So how did Tomcat cause this? What's more, with Tomcat in place requests were still answered normally — forwarded downstream and returned to the client — and, even more striking, internally it really was Netty doing the request reading and response writing. A bit magical.

Step 16: Discover a new world

To verify this, I logged off the company machine, went back to my own computer, added the Tomcat dependency to the pom, and started the project.

Could it be that Tomcat and Netty are listening on the same port at the same time?

Take a look at the project startup log:

Connected to the target VM, address: '127.0.0.1:52162', transport: 'socket'

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.2.6.RELEASE)

2020-05-19 08:50:04.448  INFO 7896 --- [main] com.alan.test.Application                : No active profile set, falling back to default profiles: default
2020-05-19 08:50:04.829  INFO 7896 --- [main] o.s.cloud.context.scope.GenericScope     : BeanFactory id=082e67ca-d4c7-3a8c-b051-e806722fd225
2020-05-19 08:50:04.998  INFO 7896 --- [main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat initialized with port(s): 8080 (http)
2020-05-19 08:50:05.006  INFO 7896 --- [main] o.apache.catalina.core.StandardService   : Starting service [Tomcat]
2020-05-19 08:50:05.006  INFO 7896 --- [main] org.apache.catalina.core.StandardEngine  : Starting Servlet engine: Apache Tomcat/9.0.33
2020-05-19 08:50:05.859  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [After]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Before]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Between]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Cookie]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Header]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Host]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Method]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Path]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Query]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [ReadBodyPredicateFactory]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [RemoteAddr]
2020-05-19 08:50:05.860  INFO 7896 --- [main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [CloudFoundryRouteService]
2020-05-19 08:50:06.731  INFO 7896 --- [main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
2020-05-19 08:50:07.304  INFO 7896 --- [main] com.alan.test.Application                : Started Application in 4.271 seconds (JVM running for 5.0)

It turns out that only Tomcat was started, so how did it hand over requests to Netty for processing?

Step 17: Tomcat -> Netty

In NIO, socket channels are divided into ServerSocketChannel and SocketChannel. A ServerSocketChannel is created when the server starts and listens for incoming client connections; a SocketChannel represents one connection between a client and the server.

A SocketChannel is created through the ServerSocketChannel when a client connection is accepted.

Netty further divides these channels into NioXxxChannel, EpollXxxChannel, and so on, according to the underlying I/O model, and within each family splits them again into server and client channels, such as NioServerSocketChannel and NioSocketChannel.

On Windows, the NioXxxChannel family is used by default, so a NioSocketChannel should be created through a NioServerSocketChannel — at least, that's what happens when Netty is used normally.
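
For reference, "using Netty normally" looks roughly like this: the ServerBootstrap is told to use NioServerSocketChannel, and Netty then creates a NioSocketChannel for every accepted connection. A minimal sketch of my own, not code from the gateway:

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class PlainNettyServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup worker = new NioEventLoopGroup();
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, worker)
             .channel(NioServerSocketChannel.class) // the listening channel
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // each accepted connection arrives here as a NioSocketChannel;
                     // business handlers would be added to ch.pipeline()
                 }
             });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            worker.shutdownGracefully();
        }
    }
}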

The following is a thread stack for NioSocketChannel creation in normal Netty use:

However, our scenario is Tomcat + Netty. What about that?

So I set a breakpoint in the NioSocketChannel constructor, sent a request, and when the breakpoint was hit, examined the thread stack (reading from the bottom up):

tomcat -> spring -> reactor -> reactor-netty -> netty

This is a little bit more complicated, and we’ll talk about it in more detail when we have time.

Step 18: The memory leak

As you can see from the above, Tomcat ultimately hands request processing over to Netty — so why is there a memory leak? That question remained.

After comparing the two cases, I found the problem lies in the code from step 12: when Netty is used normally, fireChannelRead() ends up adding a task to the NioEventLoop — MonoSendMany.SendManyInner.AsyncFlush:

final class AsyncFlush implements Runnable {
    @Override
    public void run() {
        if (pending != 0) {
            ctx.flush();
        }
    }
}

This task flushes the outgoing data — that is, it actually writes the response back to the client. When Tomcat + Netty is used, the client still receives the response, but the data is apparently sent back some other way (presumably via the Tomcat connection rather than the NioSocketChannel itself). That's a leftover question I'll look into later; right now my head is a bit of a mess.

Conclusion

Although the code involved in this production incident was quite simple, tracking it down still took quite a long time. A few takeaways:

  1. Don't be too quick to suspect open source frameworks, especially widely used ones like Spring.
  2. When you can't find the cause of a problem, consider taking a break, relaxing, and coming at it from a different angle.
  3. Why Tomcat and Netty can coexist in a Spring Cloud Gateway project is a real question — an issue should probably be raised with the maintainers. When both are present, it would be better for the application to fail fast and refuse to start.

I'm still working on my Netty column — stay tuned, folks.