Abstract: I recently ran into a memory leak at work; the ops team called asking for an urgent fix. Beyond solving the problem itself, I took the chance to write down the common approaches to troubleshooting memory leaks.

I recently ran into a memory leak at work, and the ops team called asking for an urgent fix. So besides solving the problem, I also systematically recorded the common approaches to troubleshooting memory leaks.

First, let's be clear about the symptoms:

1. The service had a single release on the 13th, but memory only started climbing on the 23rd, and after the instance was restarted it climbed even faster.

2. The service is deployed on both chip A and chip B. Apart from model inference, the pre-processing and post-processing code is shared between the two. Yet the memory leak alert fired only on chip B; chip A showed no anomaly.

Idea 1: Compare source code and library dependency differences between the old and new versions

Given the two observations above, the first suspect is something introduced by the release on the 13th, which could come from two places:

1. Our own application code

2. Library dependency code

From the above two perspectives:

  • On the one hand, we compared the source code of the two versions using the Git history and the BeyondCompare tool, paying particular attention to the parts where chip A and chip B are handled separately; nothing abnormal was found.
  • On the other hand, we used pip list to compare the packages installed in the two images; only the version of the pytz time zone library had changed (the sketch below shows one way to capture such a list from inside the service).
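If you want to dump the installed-package list from inside the service itself rather than running pip by hand, a minimal sketch (my addition, not part of the original troubleshooting) looks like this:

import sys
from importlib import metadata

# Write "name==version" for every installed distribution, one per line,
# so the files produced in the two images can be diffed directly.
with open("packages.txt", "w") as f:
    for dist in sorted(metadata.distributions(), key=lambda d: d.metadata["Name"].lower()):
        f.write("%s==%s\n" % (dist.metadata["Name"], dist.version))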

After some research, we judged that this package was unlikely to cause a memory leak, so we set it aside for the time being.

At this point, trying to find the leak by digging through the old and new source code was starting to feel like a stretch.

Idea 2: Monitor the memory change difference between the old and new versions

The commonly used Python memory inspection tools include pympler, objgraph, tracemalloc, and so on.

First, we used objgraph to observe the top 50 object types in the old and new services.

The relevant objgraph command is as follows:

objgraph.show_most_common_types(limit=50)

To make the trend easier to observe, I wrapped this in a small helper that exports the data straight to a CSV file:

import os
import time

import objgraph
import pandas as pd

stats = objgraph.most_common_types(limit=50)
stats_path = "./types_stats.csv"
tmp_dict = dict(stats)
req_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
tmp_dict['req_time'] = req_time
df = pd.DataFrame.from_dict(tmp_dict, orient='index').T

if os.path.exists(stats_path):
    # append without repeating the header row
    df.to_csv(stats_path, mode='a', header=False, index=False)
else:
    df.to_csv(stats_path, index=False)
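To actually see the change curve, one option (my own sketch, not part of the original write-up) is to plot a few suspect columns from the exported CSV:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("./types_stats.csv", parse_dates=["req_time"])
# pick whichever type columns appear in your CSV; these names are just examples
df.set_index("req_time")[["dict", "tuple", "traceback"]].plot()
plt.ylabel("object count")
plt.savefig("./types_trend.png")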

As shown in the chart below, I ran a batch of images through both the old and new versions for an hour, and everything was rock steady: no fluctuation at all in the type counts.

At that point I remembered that, before testing or going live, I usually also run a batch of images with abnormal formats as a boundary check.

Although QA must have verified these edge cases before release, I decided to try them anyway as a long shot.

The calm was broken. As you can see in the red box below, the counts of dict, function, method, tuple, traceback and other important types started to creep up.

At the same time, the container's memory also kept increasing, with no sign of converging.

This still didn't confirm whether it was the cause of the online problem, but at least one bug had been located. Then, while checking the logs, I noticed a strange phenomenon:

Normally, the check_image_type method should appear only once in the exception stack printed in the log.

In reality, check_image_type appeared repeatedly in the stack, and the number of repetitions grew with the number of test calls.

So we took another look at this piece of exception-handling code.

Exception declarations are as follows:

The exception throwing code is as follows:
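The original snippets were screenshots, so here is a minimal sketch of the pattern they showed. The class name and the format check are my reconstruction (only IMAGE_FORMAT_EXCEPTION and check_image_type appear in the article):

class ImageFormatException(Exception):
    pass

# The exception INSTANCE is created once, at import time, as a module-level global.
IMAGE_FORMAT_EXCEPTION = ImageFormatException("unsupported image format")

def check_image_type(image_bytes: bytes) -> None:
    # placeholder check; the real validation logic is not shown in the article
    if not image_bytes.startswith((b"\xff\xd8", b"\x89PNG")):
        # Every bad request raises the SAME long-lived global instance. Each raise
        # adds new frames to the instance's existing __traceback__, so the traceback
        # chain (and the request/image data referenced by those frames) keeps growing
        # and is never garbage-collected.
        raise IMAGE_FORMAT_EXCEPTION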

The problem

After thinking it through, I had a likely explanation for the root cause:

Each exception instance is defined as a global variable and is raised directly when the error occurs. Once raised, the global instance holds a reference to the exception stack (its traceback), so neither the instance nor the stack frames it references are ever reclaimed.

So as the number of calls with badly formatted images grows, so does the amount of information held in the exception stack. And because the exception also references the requested image data, memory grows by megabytes at a time.
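The article doesn't show the actual patch, but the straightforward fix implied by this analysis is to construct a fresh exception instance at each raise site instead of reusing a global one. A sketch, reusing the names from the reconstruction above:

def check_image_type(image_bytes: bytes) -> None:
    # placeholder check, as above
    if not image_bytes.startswith((b"\xff\xd8", b"\x89PNG")):
        # A new instance per raise: once the request handler finishes, nothing
        # long-lived references the exception or its traceback, so the frames
        # and the image data they hold can be garbage-collected normally.
        raise ImageFormatException("unsupported image format")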

However, this code had been online for a long time. If the problem really came from here, why had nothing gone wrong before, and why was chip A unaffected?

With those two questions in mind, we did two checks:

First, we confirmed that the problem can indeed be reproduced on the previous version and on chip A as well.

Second, we checked the online call records and found that a new customer had recently been onboarded and was calling the service at one particular site with a large number of similarly problematic images; most of that site's instances run on chip B. We pulled a few online instances and observed the same phenomenon in their logs.

With that, the questions above were basically answered. After this bug was fixed, the memory growth problem no longer appeared.

Going a step further

By rights, this seemed like a good place to call it a day. But I asked myself: what if that log line had never been printed in the first place, or the developer had been lazy and not logged the exception stack at all?

With that in mind, I kept digging with the objgraph and pympler tools.

We already know the leak shows up when abnormal images are processed, so let's focus on what is different at that moment:

1. Use the objgraph tool

Using the following command, we can see which object types are added and by how many instances each time an exception occurs.

objgraph.show_growth(limit=20)
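show_growth() prints the difference in per-type object counts since its previous call, so a simple before/after pattern works. A runnable sketch (the growing list is just a stand-in for the suspect request handling):

import objgraph

leaked = []                              # stand-in for whatever keeps references alive

objgraph.show_growth(limit=20)           # first call records the baseline (prints all current counts)

leaked.extend([] for _ in range(100))    # stand-in for handling one bad request
objgraph.show_growth(limit=20)           # now prints only the types whose counts grew, with the delta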

2. Use the pympler tool

from pympler import tracker

tr = tracker.SummaryTracker()
tr.print_diff()
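SummaryTracker.print_diff() reports the objects created since the previous call (the first call compares against the tracker's creation), so the same before/after pattern applies. A runnable sketch with a stand-in for the real request:

from pympler import tracker

leaked = []
tr = tracker.SummaryTracker()
tr.print_diff()                               # first call establishes the baseline (its output can be noisy)

leaked.extend({} for _ in range(100))         # stand-in for handling one bad request
tr.print_diff()                               # shows only the objects created since the previous call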

With the following code, we can also print out which objects reference these newly added variables, for further analysis.

import logging
import os

import objgraph

logger = logging.getLogger(__name__)
os.makedirs("./dots", exist_ok=True)

gth = objgraph.growth(limit=20)
for gt in gth:
    logger.info("growth type:%s, count:%s, growth:%s" % (gt[0], gt[1], gt[2]))
    # skip the very common types so the graphs stay readable
    if gt[2] > 100 or gt[1] > 300:
        continue
    objgraph.show_backrefs(objgraph.by_type(gt[0])[0], max_depth=10, too_many=5,
                           filename="./dots/%s_backrefs.dot" % gt[0])
    objgraph.show_refs(objgraph.by_type(gt[0])[0], max_depth=10, too_many=5,
                       filename="./dots/%s_refs.dot" % gt[0])
    objgraph.show_chain(
        objgraph.find_backref_chain(objgraph.by_type(gt[0])[0], objgraph.is_proper_module),
        filename="./dots/%s_chain.dot" % gt[0]
    )

Then use Graphviz's dot tool to convert the .dot files generated above into images:

dot -Tpng xxx.dot -o xxx.png
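If there are many files, a small batch-conversion loop saves some typing (my addition; it assumes the Graphviz dot executable is on PATH):

import glob
import subprocess

for dot_file in glob.glob("./dots/*.dot"):
    png_file = dot_file[:-len(".dot")] + ".png"
    subprocess.run(["dot", "-Tpng", dot_file, "-o", png_file], check=True)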

Because basic types such as dict, list, frame, tuple and method are so numerous, the graphs are hard to read, which is why the filtering above is applied first.

Reference chain of the newly added ImageReqWrapper objects:

Reference chain of the newly added traceback objects:

With the earlier findings as prior knowledge, it is natural to focus on the traceback objects and the IMAGE_FORMAT_EXCEPTION they belong to.

But even without that prior knowledge, the key question is why these variables are not reclaimed when the service call completes, and in particular why none of the traceback objects referenced by IMAGE_FORMAT_EXCEPTION can ever be collected. A few small experiments along those lines quickly lead to the same root cause; one such experiment is sketched below.
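A minimal experiment of that kind (my own reconstruction, not code from the article) shows how raising a shared global exception instance makes its traceback, and everything the traceback's frames reference, grow without bound:

class Boom(Exception):
    pass

BOOM = Boom("shared global instance")    # the same anti-pattern as IMAGE_FORMAT_EXCEPTION

def handle_request(payload):
    try:
        raise BOOM                       # every call raises the same instance
    except Boom:
        pass                             # swallowed, as a service handler would after logging

for i in range(3):
    handle_request(b"x" * 1024)
    # count how many frames are currently attached to the shared instance's traceback
    depth, tb = 0, BOOM.__traceback__
    while tb is not None:
        depth, tb = depth + 1, tb.tb_next
    print("after call %d: traceback depth = %d" % (i + 1, depth))

# The depth grows with every call: the global instance keeps accumulating traceback
# frames, and the payload held by each of those frames stays alive.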

So far, we can draw the following conclusions:

Because the raised exception cannot be collected, the corresponding exception stack, request body and other variables cannot be collected either; and since the request body contains the image data, every such request leaks memory on the order of megabytes.

In addition, Python 3 ships with a built-in memory analysis tool, tracemalloc, which can relate memory usage to individual lines of code. It is not perfectly precise, but it can provide useful clues.

import gc
import logging
import tracemalloc

logger = logging.getLogger(__name__)

tracemalloc.start(25)                      # keep up to 25 frames per allocation traceback
snapshot = tracemalloc.take_snapshot()     # baseline snapshot taken at startup

def log_memory_diff():
    # call this periodically (e.g. after each suspect request) to log line-level growth
    global snapshot
    gc.collect()
    snapshot1 = tracemalloc.take_snapshot()
    top_stats = snapshot1.compare_to(snapshot, 'lineno')
    logger.warning("[ Top 20 differences ]")
    for stat in top_stats[:20]:
        if stat.size_diff < 0:             # ignore lines whose memory shrank
            continue
        logger.warning(stat)
    snapshot = tracemalloc.take_snapshot()  # update the baseline for the next comparison

If the article helps you, give it a “like” before you leave