This article describes how we investigated and resolved frequent OOM incidents in the falcon-graph module.


Business background

Falcon-graph is responsible for persisting monitoring data so that users can query and aggregate it.

Since early April, Open-Falcon’s traffic has gradually grown from 0.29 billion counters to 0.32 billion today. Over that period, the graph cluster’s average memory usage increased by 8% (now 73%), and the average 1-minute machine load increased by 5% (now 18).

Summarizing the on-site conditions over three days shows that cluster memory climbs at 20:00 every day, with OOM occurring on some machines at that time; OOM also strikes some machines at irregular times.

The investigation process

1. Check the service itself

We used Go pprof to profile the service itself and captured the following at the time of the problem:

CPU profile:

Memory profile:


Compared with the normal state, the CPU share of each function showed no significant change, but memory usage kept rising.
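
For reference, a Go service typically exposes pprof through the net/http/pprof package. The sketch below shows how such an endpoint can be wired up; the port and wiring here are assumptions for illustration, not falcon-graph’s actual configuration:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve pprof on a side port so profiles can be pulled from a
        // live process without disturbing its main work.
        go func() {
            log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
        }()

        // ... the service's normal work would run here ...
        select {}
    }

With this in place, "go tool pprof http://127.0.0.1:6060/debug/pprof/profile" captures a CPU profile and "go tool pprof http://127.0.0.1:6060/debug/pprof/heap" captures a memory profile, which is the kind of data being compared above.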

Since the data inflow was stable, we suspected that blocking somewhere in the persistence path was slowing down disk writes and letting data pile up in memory.

We queried the block-profile information with Go pprof, as shown in the figure below:

The block profile total was 0, which rules out a blocked function stalling the write path inside the service itself. We then turned to the other services running on the machine.
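
One detail worth noting: Go’s block profile is disabled by default, so /debug/pprof/block reports nothing unless sampling is turned on in code. A minimal sketch, assuming it was enabled at startup:

    package main

    import "runtime"

    func main() {
        // A rate of 1 records every blocking event; production services
        // often use a higher value to reduce the sampling overhead.
        runtime.SetBlockProfileRate(1)

        // ... start the pprof HTTP endpoint and the service's work here ...
    }

The profile can then be inspected with "go tool pprof http://127.0.0.1:6060/debug/pprof/block"; a total of 0 means no goroutine-blocking events were recorded.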

2. Check other services on the machine

(1) At 20:00 every day, graph-clean consumes a large amount of CPU in a short burst, driving the load up rapidly (>30), as shown below:

(2) After on-site investigation and discussion with colleagues, we found that the transfer service occasionally sees a sudden jump in CPU consumption, caused by a burst of short-lived TCP connections plus data validation, which momentarily pushes the load to 32. See the diagram below:

The solutions

1. Unreasonable graph-clean code

  • Modify the graph-clean code to spread the peak out and lower the run frequency, thereby reducing CPU overhead (see the sketch after this list)

  • Verify on the test cluster (completed)

  • Canary the change on one production machine and observe (completed; the short-burst CPU consumption was resolved, and no graph-service OOM from this cause occurred after deployment)

  • Progressively roll the change out to the remaining machines (completed; observed for a week with no graph-service OOM from this cause)
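
As a hypothetical sketch of the "spread the peak" idea (cleanExpired, keys, and the batch parameters are illustrative stand-ins, not graph-clean’s actual code), the cleanup can walk its key set in small batches and yield the CPU between batches, trading one short, intense burst at 20:00 for a long, low-intensity background pass:

    package cleaner

    import "time"

    // cleanExpired stands in for graph-clean's real per-key expiry logic.
    func cleanExpired(key string) { /* ... */ }

    // cleanInBatches spreads cleanup over time instead of running it as
    // one CPU-intensive burst: it processes a fixed-size batch, sleeps,
    // then continues, which flattens the load peak.
    func cleanInBatches(keys []string, batchSize int, pause time.Duration) {
        for start := 0; start < len(keys); start += batchSize {
            end := start + batchSize
            if end > len(keys) {
                end = len(keys)
            }
            for _, key := range keys[start:end] {
                cleanExpired(key)
            }
            time.Sleep(pause) // yield the CPU between batches
        }
    }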

2. Transfer and graph services co-deployed on the same machines

  • Deploy transfer and graph separately (being split gradually as part of the internationalization rollout)

3. Review the graph code and, guided by profiling, rework unreasonable data structures to reduce system overhead.

This article was first published on the public account “MIUI Cloud Technology”.