Business background

  1. The system follows a producer-consumer model
  2. The producer pulls data from the peripheral systems via asynchronous HTTP and, in the callback, pushes the data into a queue
  3. The consumer takes elements from the queue; each element may be parsed into one or more metric meta objects
  4. The metric meta objects are pushed into the Metric Queue via the SDK provided by the monitoring system
  5. The monitoring system SDK reports the data to the monitoring platform (a rough sketch of this flow follows the list)
  6. Because we are currently only testing a few peripheral systems, the data volume is small, about 2,000 records per minute
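To keep the rest of the write-up concrete, here is a minimal, hypothetical sketch of that pipeline. All class names are placeholders and the queue type is only illustrative (the project's real queue implementation is discussed later); only the shape of the flow matches the description above.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Placeholder types: raw data pulled from a peripheral system, and the
// "metric meta" object that the monitoring SDK understands.
class RawData {}
class MetricMeta {}

interface MonitoringSdk {
    void push(MetricMeta metric); // SDK enqueues into its Metric Queue and reports to the platform
}

public class PipelineSketch {
    private final BlockingQueue<RawData> queue = new ArrayBlockingQueue<>(1024);

    // Producer side: invoked from the asynchronous HTTP callback.
    public void onHttpResponse(RawData data) throws InterruptedException {
        queue.put(data);
    }

    // Consumer side: each element may be parsed into one or more metric meta objects.
    public void consumeLoop(MonitoringSdk sdk) throws InterruptedException {
        while (true) {
            RawData data = queue.take();
            for (MetricMeta metric : parse(data)) {
                sdk.push(metric);
            }
        }
    }

    private List<MetricMeta> parse(RawData data) {
        return List.of(new MetricMeta()); // parsing logic omitted
    }
}
```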

Problem symptoms

  1. Received alarms for high application memory and CPU usage; on inspection, both were close to 100%
  2. Restarted the service to recover quickly, then kept observing its resource usage
  3. The longer the service runs, the higher its memory usage
  4. As memory usage increases, so does CPU usage. Right after startup, CPU utilization was low, under 10%. The amount of data pulled, however, does not grow over time; it stays roughly constant

Troubleshooting

Based on the symptoms above, several points are worth paying attention to:

  1. Memory keeps growing slowly. What kinds of problems could cause this?
    • Memory leak: objects that should be reclaimed are not, so garbage accumulates over time until the application OOMs. This is a bug and needs to be fixed
    • Too many requests: suppose the heap holds 1,000 units, the system can process (and then reclaim) 90 units per second, but requests arrive at 100 units per second. If the application keeps running like this, the heap will inevitably fill up and OOM at some point (the arithmetic is spelled out after this list). In this case the fix is to improve the system's throughput
  2. CPU usage rises together with memory usage. The first cause that comes to mind is GC; whether it really is GC just needs to be verified
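For the "too many requests" case, the example numbers give a rough time to OOM: requests arrive 10 units per second faster than they are reclaimed, so the 1,000-unit heap fills in about 100 seconds:

$$t_{\mathrm{OOM}} \approx \frac{\text{heap capacity}}{\text{arrival rate} - \text{processing rate}} = \frac{1000}{100 - 90} = 100\ \text{s}$$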

These two observations already provide a lot of useful information. Next, we verify our conjectures against the business background and then fix the problem.

Conjectures

Set aside the high CPU usage for now: it is caused by GC, and that is not the real question. The real question is why GC is happening so often. Given the growing memory usage and the business background, we make the following conjectures.

The rate of production exceeds the rate of consumption

That is, production is too fast and consumption is too slow. If this were the case, what symptoms would we expect?

  1. As time goes on, more and more objects pile up in the queue
  2. As the number of unreclaimable objects grows, free heap space shrinks
  3. The production rate and consumption rate stay constant, so the amount of heap space needed per unit of time also stays constant
  4. The JVM has to GC whenever necessary to keep enough free space available for new allocations
  5. Putting the four points together: free heap space keeps shrinking while the space needed per unit of time stays the same, so the interval between GCs keeps getting shorter, i.e. GC becomes more and more frequent
  6. As GC becomes more frequent, CPU utilization rises
  7. The end result: the system grinds to a halt, with the CPU busy doing futile GCs, because each GC fails to reclaim any meaningful amount of memory

These symptoms look very similar to what we observed. The business context, however, suggests this conjecture is unlikely:

  1. The data volume is very small, the system can easily keep up, and there are no time-consuming operations
  2. This is easy to test: just watch consumption directly. If the conjecture were true, there would be an obvious consumption lag. Alternatively, print the production and consumption rates and compare them, e.g. with counters as in the sketch after this list
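A minimal sketch of such a rate check, assuming we can hook counters into the producer and consumer (the names here are made up, not the project's real code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class RateMonitor {
    // Incremented by the producer and the consumer respectively.
    public static final AtomicLong PRODUCED = new AtomicLong();
    public static final AtomicLong CONSUMED = new AtomicLong();

    public static void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            long produced = PRODUCED.getAndSet(0);
            long consumed = CONSUMED.getAndSet(0);
            // If produced is consistently larger than consumed, production really is
            // outpacing consumption and the queue is backing up.
            System.out.printf("last minute: produced=%d, consumed=%d, delta=%d%n",
                    produced, consumed, produced - consumed);
        }, 1, 1, TimeUnit.MINUTES);
    }
}
```

The producer would call RateMonitor.PRODUCED.incrementAndGet() when it enqueues an element, and the consumer CONSUMED.incrementAndGet() when it finishes one.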

A memory leak

If it is a memory leak, some tooling is needed to track it down. The general approach is as follows:

  1. Use jmap to look at the large objects in the heap. "Large" means two things here: objects that occupy a lot of space, and objects that exist in very large numbers
  2. Based on Step 1, think about which part of the code could be at fault
  3. If Step 2 still yields no ideas, use tools such as MAT or jvisualvm to analyze a heap dump

Verifying the conjectures

The CPU usage is too high

We have already assumed that frequent GC is causing excessive CPU usage, so let’s verify:

  1. jstat -gcutil pid: watch GC frequency and heap occupancy
  2. top -Hp pid: find the threads in the process with the highest CPU usage
  3. jstack pid | grep 0xxx: convert the busy thread IDs to hex (see the one-liner after this list) and search for them in the thread dump
  4. Several GC threads turned out to have high CPU usage, which confirms the conjecture
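As a small aside (not part of the original commands): top -Hp prints thread IDs in decimal while jstack prints the nid in hex, so a quick conversion is needed, for example:

```java
// Convert a decimal thread id from `top -Hp` into the hex nid used in jstack output.
long tid = 12345;                                  // replace with the actual thread id
System.out.println("0x" + Long.toHexString(tid));  // prints 0x3039
```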

Memory leak conjecture verification

  1. jmap -histo pid | more: view a histogram of the objects in the heap
  2. Locate the key object xxEvent and find that its instance count exceeds 100W (1,000,000)
  3. Examine the relevant business code, focusing on three aspects: where the objects are created, who holds references to them, and where they are released. This can be hard to pin down and takes some patience

Following these three steps, the places related to xxEvent are finally found: the producer, the consumer, and the Queue.

One thing stands out: there are more than 1,000,000 xxEvent objects. Given that we process about 2,000 records per minute, having over a million objects sitting in the Queue is clearly unreasonable: that rate produces 120,000 (12W) records per hour, so a million objects would require more than eight hours during which the consumer consumed nothing at all, which in practice is impossible.
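Spelled out:

$$\frac{1{,}000{,}000\ \text{objects}}{2{,}000\ \text{objects/min} \times 60\ \text{min/h}} = \frac{1{,}000{,}000}{120{,}000\ \text{objects/h}} \approx 8.3\ \text{hours}$$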

So, this must be a memory leak

Cause of the problem

  1. The project uses the Disruptor framework
  2. When creating the Disruptor object we must pass in a ringBufferSize, which is the maximum length of the queue
  3. During initialization, the Disruptor object creates an array of size ringBufferSize and, using the eventFactory we pass in, pre-allocates ringBufferSize event objects; in our case these are the xxEvent objects
  4. In our code ringBufferSize is 1024 * 1024, roughly 1,000,000 (100W). In other words, about a million xxEvent objects are created the moment the Disruptor object is initialized (a sketch of this setup follows the list)
  5. Disruptor is array-based under the hood and internally maintains a producer sequence and a consumer sequence. When a consumer finishes with an element, the element is not removed from the array; only the consumer sequence is advanced. When the producer sequence reaches the end of the array, it wraps around and overwrites earlier slots
  6. Putting the five points together: when the Disruptor object is created, about a million xxEvent objects are allocated, but at that point they are empty and hold no data, so they take up little memory. As the producer keeps publishing elements, the xxEvent objects get filled with data one by one, and more and more of them hold payloads. Because the ring buffer keeps referencing them, they are never reclaimed even after being consumed. Available memory therefore keeps shrinking until these unreclaimable objects fill both the young and the old generation, and the JVM ends up doing futile GC over and over
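A minimal sketch of what that setup looks like (class and field names are assumptions based on the description above, not the project's actual code; the constructor shown is the LMAX Disruptor 3.x DSL):

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorSketch {
    // Stand-in for the project's xxEvent: a mutable slot that gets filled with data.
    static class XxEvent {
        Object payload;
    }

    public static void main(String[] args) {
        int ringBufferSize = 1024 * 1024; // ~1,000,000 slots, all pre-allocated at init

        Disruptor<XxEvent> disruptor = new Disruptor<>(
                XxEvent::new,             // eventFactory: invoked ringBufferSize times up front
                ringBufferSize,
                DaemonThreadFactory.INSTANCE);

        disruptor.handleEventsWith((EventHandler<XxEvent>) (event, sequence, endOfBatch) -> {
            // consume event.payload here; the slot stays referenced by the ring buffer
            // after consumption and is only overwritten when the producer wraps around
        });

        RingBuffer<XxEvent> ringBuffer = disruptor.start();

        // Producer side: claim a slot and fill the pre-allocated event with real data.
        long seq = ringBuffer.next();
        try {
            ringBuffer.get(seq).payload = "some pulled data";
        } finally {
            ringBuffer.publish(seq);
        }
    }
}
```

A common mitigation (not the one chosen here) is to null out the event's payload at the end of the handler, so that consumed data becomes unreachable even before the slot is overwritten.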

Problem solving

  1. Set ringBufferSize to 512, deploy, and observe whether the symptoms disappear
  2. In fact, Java's built-in BlockingQueue would also meet the requirements here (a minimal sketch follows)
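A minimal sketch of that alternative; the capacity of 512 mirrors the new ringBufferSize and is illustrative only:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockingQueueSketch {
    // Bounded queue sized for the real throughput (~2,000 records per minute),
    // instead of a million pre-allocated ring-buffer slots.
    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(512);

    // Producer side, e.g. inside the asynchronous HTTP callback.
    public void produce(Object event) throws InterruptedException {
        queue.put(event);    // blocks when full, which also gives natural backpressure
    }

    // Consumer side.
    public Object consume() throws InterruptedException {
        return queue.take(); // the element is removed, so it becomes eligible for GC after use
    }
}
```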

Supplement

1. More on the asynchronous data pulling mentioned above

  1. Looking at the code, there is a scheduled task that pulls data from the peripheral systems via asynchronous HTTP every 10 seconds. Roughly speaking, each peripheral system gets its own asynchronous request, though the number of threads is bounded
  2. In this task, data is pulled asynchronously from all peripheral systems in sequence. Here is the problem: with, say, 100 peripheral systems, because the calls are asynchronous we might fire all 100 requests within 0.1 seconds, after which the task is done until it runs again 10 seconds later. The responses may then arrive at roughly the same time and dump their data into the queue all at once. If processing those 100 responses keeps the consumer busy for 3 seconds, the remaining 7 seconds are idle, and during the busy burst the consumption logic may use a lot of CPU, increasing latency for other parts of the system. Would it be better to consume at a steady rate by spreading the requests across the whole window (see the sketch below)? That would, however, increase the delay before data is consumed, so the trade-off should be decided based on the specific business scenario
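One way to spread the load, sketched under the assumption of a fixed list of endpoints (the names and the pullAsync method are hypothetical, not the project's real code): give each peripheral system its own fixed-rate schedule with a staggered initial delay, so the pulls are evenly distributed across the 10-second period instead of all firing at once.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StaggeredPuller {
    private static final long PERIOD_MS = 10_000; // same 10s period as the original task

    public static void main(String[] args) {
        // Hypothetical stand-ins for the peripheral systems.
        List<String> endpoints = List.of("http://sys-a/data", "http://sys-b/data", "http://sys-c/data");

        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        long step = PERIOD_MS / endpoints.size(); // spacing between consecutive pulls

        for (int i = 0; i < endpoints.size(); i++) {
            String endpoint = endpoints.get(i);
            long initialDelay = i * step;         // stagger: 0, step, 2*step, ...
            scheduler.scheduleAtFixedRate(() -> pullAsync(endpoint),
                    initialDelay, PERIOD_MS, TimeUnit.MILLISECONDS);
        }
    }

    private static void pullAsync(String endpoint) {
        // Placeholder for the real asynchronous HTTP pull whose callback enqueues the data.
        System.out.println("pulling " + endpoint);
    }
}
```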