Preface

Recently, our team took over a legacy project: an HTTP service written in Node.js. The gist of the project is that it implements a layer of proxy logic in Node.js, relaying traffic in and out, with some key instrumentation added to record requests (plus a few more advanced uses, such as tampering with request-related data, but those have nothing to do with this troubleshooting story, so I will not go into them). When we took it over, we were told that the project had to be restarted regularly because of a memory leak and other unidentified issues; otherwise users would see noticeable latency, or the service would become unavailable altogether. So while keeping it alive by hand, we started reading the source code, getting familiar with the project, analyzing the cause of the problem, and trying to fix it.

The troubleshooting process

The project is instrumented with key metrics such as memory and CPU usage, along with statistics on user traffic. Two indicators are particularly relevant here (a rough sketch of how they can be collected follows the list):

  1. Memory trends
  2. Number of Active handles
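As a rough idea of how these two indicators can be collected, here is a minimal sketch; the reporting interval and the use of console.log are assumptions for illustration, and process._getActiveHandles is a private, undocumented API, so treat it as a debugging aid rather than something to build on:

function reportMetrics() {
  const { rss, heapUsed } = process.memoryUsage()
  // _getActiveHandles is not public, so cast to any to read it in TypeScript
  const activeHandles = (process as any)._getActiveHandles().length
  console.log(JSON.stringify({ ts: Date.now(), rss, heapUsed, activeHandles }))
}

// report once a minute; a real project would push this into its own metrics system
setInterval(reportMetrics, 60 * 1e3)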



As we had been told, there was indeed a memory leak, and the chart drops off a cliff in less than an hour at the point where the service was restarted. Before going further, let me briefly introduce what an Active Handle is, which leads to a keyword that comes up constantly in the Node.js stack: the Event Loop.

A brief introduction to the Event Loop

To briefly explain how the Event Loop works: suppose we have a file containing nothing but this line:

console.log('hello')

If we run the file with node xxx, the process exits as soon as it finishes executing.

And if we write a file like this:

console.log('hello')

setInterval(() => {}, 1e3)

If we run this file with node xxx, the terminal stays attached to the process and it does not exit, because we created a timer, which can be understood as an Active Handle. Roughly speaking, the process first executes all of the synchronous code; if there are no active handles left at that point, it exits immediately, and if there are, it enters the Event Loop we are all familiar with. We can get the number of active handles through a non-public API:

process._getActiveHandles()

console.log(process._getActiveHandles().length) // 0

If we run this code as a file with node xxx, we get 0, but if we type it into the interactive REPL started with node, we get 2. That is because the REPL binds stdin and stdout, which is also why it stays open waiting for input instead of returning you to the terminal.

Asynchronous APIs such as process.nextTick, net.listen, and the fs.xxx family can likewise be thought of as producing active handles; as long as something of this kind is still pending, the process keeps running.
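For example, a listening TCP server is a typical active handle. A minimal sketch (the cast is needed only because _getActiveHandles is not a public API):

import net from 'net'

const server = net.createServer()

server.listen(0, () => {
  // The listening server is an active handle, so even though all of the
  // synchronous code has finished, the process keeps running.
  console.log((process as any)._getActiveHandles().length) // at least 1
})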

That, in short, is what an Active Handle is, and why this indicator helps us troubleshoot the problem.

Analyzing the problem

While reading the project code, we found that its core logic is traffic forwarding, implemented by binding two sockets to each other. The relevant code:

export function pipeSocket(from: Socket, to: Socket) {
  from.on('data', chunk => {
    to.write(chunk)
  })
  from.on('close', () => {
    to.destroy()
  })
}
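For context, here is roughly how such a function gets used: the proxy accepts a client connection, opens a second socket to the upstream service, and pipes the two into each other in both directions. This is a simplified sketch; the ports and the createServer wiring are assumptions for illustration, not the project's actual code:

import net, { Socket } from 'net'

const server = net.createServer((client: Socket) => {
  // hypothetical upstream: the local service that provides the real proxy logic
  const upstream = net.connect(8080, '127.0.0.1')
  pipeSocket(client, upstream)
  pipeSocket(upstream, client)
})

server.listen(3000)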

At first glance there seems to be nothing wrong with it. However, combined with the abnormal fluctuation in the number of active handles shown above, we had reason to suspect this code was involved. At the same time, while investigating on the server during a period of abnormal indicators, we found some clues. Running netstat -an | awk '/^tcp/ {++y[$NF]} END {for (w in y) print w, y[w]}' to count the current connections by state gave results along these lines:

As you can see, there are a lot of connections in the ESTABLISHED state, which is different from my initial guess. I expected the usual CLOSE_WAIT or TIME_WAIT problem, not an abnormal number of ESTABLISHED connections. In the definition of TCP connection states, ESTABLISHED means the connection has been set up successfully and can be understood as the normal data-transfer state:

But having this many connections supposedly in active data transfer is clearly not normal, so we ran netstat -nat | grep ESTABLISHED | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c to see which IP actually held the most connections. It turned out to be 127.0.0.1, because the project forwards traffic to a local service on a specific port, and that service provides the actual proxy logic.

(Because the IPs are sensitive, no screenshot is included here.)


Roughly speaking, a few dozen IPs showed up, and the 127.0.0.1 entry alone accounted for more than 20,000 connections.

Let's take a closer look at 127.0.0.1 to find out why there are so many connections to it and why they are not being closed.

So we ran ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | tail -n 10 to sample ten of these sockets, and that is where we found a clue: in the details shown in the last column, many of these sockets had no timer set at all.

This was a bit puzzling, because the system does have TCP keepalive configured: if the other side stops transmitting data, a connection should be kept for at most two hours (our system's setting is 7200 seconds), and two hours of traffic should not produce this many connections. The ss output made us more certain that some sockets had never been released properly and were stuck in the ESTABLISHED state. To get precise numbers, we ran the following two commands, splitting the statistics by whether a timer is attached:

ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | grep -v 'timer:' | wc -l # 28636
ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | grep 'timer:' | wc -l    # 115

The gap is frightening. We then picked a few of the ports from the ss sample and used lsof -p <process ID> | grep <port> to inspect the corresponding file descriptor information (the same details can be found under /proc/<process ID>/fd/). Based on all of this, we had a pretty clear idea of what the problem was: keepalive was never being enabled on these sockets.

Fixing the problem

Once the problem was identified, we set about fixing it. The target was clear: find the places where keepalive was not being set and add it. That brings us back to the key function above:

export function pipeSocket(from: Socket, to: Socket) {
  from.on('data', chunk => {
    to.write(chunk)
  })
  from.on('close', () => {
    to.destroy()
  })
}

As mentioned in the Active Handle discussion, a Socket is one of those handles, and since this function is where connections are wired together, it was the first place we focused on after the exploration above. Looking at the net module documentation, we found that a Socket does not enable keepalive by default: https://nodejs.org/api/net.ht… So the simple fix is to call setKeepAlive on the sockets inside this function. While we were in the net module documentation anyway, we took a closer look at Socket and found that not only is keepalive off by default, there is also no default timeout.

So we added timeout handling as well. Because the project acts as a proxy, all requests pass through it, and it is inevitable that users will occasionally, whether deliberately or by accident, access some invalid resources, leaving the request stuck in a pending state. Adding a timeout prevents those requests from hanging forever.

A Socket actually has quite a few events: error, end, timeout, and so on. Note that setting a timeout does not close the socket automatically; it only makes the socket emit a timeout event, and the actual shutdown has to be done by hand. Also, the close handler currently calls destroy; the gentler option is end, which sends a FIN to the other side, whereas destroy tears the connection down immediately. In terms of events, end eventually triggers end and close, while destroy only fires close.
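A small sketch of that behavior, assuming a throwaway local server (the idle delay is arbitrary):

import net from 'net'

const server = net.createServer(socket => {
  // setTimeout only schedules a 'timeout' event after 5s of inactivity;
  // Node does not close the socket for us.
  socket.setTimeout(5 * 1e3)
  socket.on('timeout', () => {
    // We have to shut the connection down ourselves:
    // end() sends a FIN, so the peer sees 'end' and then 'close';
    // destroy() would tear it down at once and only emit 'close' locally.
    socket.end()
  })
})

server.listen(0) // listen on a random free port for this sketch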

So after the changes we made above, the function looks like this:

export function pipeSocket(from: Socket, to: Socket) {
  // both the idle timeout and TCP keepalive are off by default
  from.setTimeout(60 * 1e3)
  from.setKeepAlive(true, 60 * 1e3)
  from.on('data', chunk => {
    to.write(chunk)
  })
  // when this side goes away, end the other side gracefully instead of destroying it
  from.on('close', () => {
    to.end()
  })
  from.on('error', () => {
    to.end()
  })
  from.on('end', () => {
    to.end()
  })
  // a timeout only emits an event, so we close the idle socket ourselves
  from.on('timeout', () => {
    from.destroy()
  })
}

Verifying the fix

After committing the changes above, we let the service run for a while and checked the number of TCP connections again:

ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | grep -v 'timer:' | wc -l # 191
ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | grep 'timer:' | wc -l    # 261

We also re-ran netstat -an | awk '/^tcp/ {++y[$NF]} END {for (w in y) print w, y[w]}' to check the number of connections in each TCP state:

The TIME_WAIT count may look high, but unlike the ESTABLISHED connections earlier, these do not stick around. Connections are constantly being closed and recreated as requests come and go, so you will always see a fair number of them whenever you run the command; this is a normal part of how TCP connections are closed.

Both the memory and Active Handle indicators we were watching are back in a normal state:



Conclusion

To summarize the problems encountered in this troubleshooting session: the main issue was how the Sockets were being used, with resources not released in time.

  1. No timeout was set on the Socket
  2. Keepalive was not enabled on the Socket
  3. Socket events were not fully handled
  4. Closing the Socket with destroy is not the gentlest way to shut it down

With the fixes above in place, the earlier problems of users being unable to access the service, slow responses, and forced regular restarts have all been resolved.