TensorFlow Serving is widely used in CTR (click-through rate) estimation scenarios thanks to its convenience and stability. However, its memory can grow steadily during operation, and related issues reported to the GitHub community remain open. This article shares two TensorFlow Serving memory leak problems that the iQiyi Deep Learning Platform discovered in practice, fixed, and submitted as PRs to the community. We describe the background of each problem and how it was solved, in the hope that the details are useful to others.

I. Background

TensorFlow Serving (TF Serving) is Google’s open-source, high-performance inference system for deploying TensorFlow models. It has the following characteristics:

  • Supports both gRPC and HTTP interfaces
  • Supports multiple models and multiple versions
  • Supports model hot updates and version switching

In addition, iQiyi has open-sourced [XGBoost Serving] to provide GBDT inference services based on TF Serving; it inherits the above features of TF Serving.

In general, inference services deployed with TF Serving offer good stability and performance, which matters especially for CTR estimation services with strict latency and stability requirements. However, on the [issue list] of the TF Serving GitHub repository there are frequent reports of memory growing during runtime until the process is OOM-killed. A typical memory issue, [Server is killed], was filed in 2018 and is still open.

The iQiyi Deep Learning Platform also ran into two such problems in practice. Our business's online inference services are deployed in Docker containers, and in some cases the TF Serving process was OOM-killed because of continuously growing memory. The background of each problem is described below.

II. Raw Serving Tensor input

Let’s first introduce the two feature-input formats of a TensorFlow SavedModel. In the first, the model has a single input: a string placeholder that carries serialized tf.Example protos. In the second, the model has multiple inputs, and each input placeholder corresponds to the tensor of one feature. Taking the tf.estimator API as an example, the two formats are produced by `tf.estimator.export.build_parsing_serving_input_receiver_fn` and `tf.estimator.export.build_raw_serving_input_receiver_fn`, respectively. Using the saved_model_cli command, the difference between the two models' inputs can be seen clearly.

The two formats are handled slightly differently on the client and on the TF Serving server. As shown in the figure below, with the tf.Example format the client must first serialize the features into a string; inside the model, the Parse Example op then deserializes that string back into the input feature tensors before the forward pass runs. With raw serving tensor inputs, both of those steps are skipped. Because serializing and deserializing tf.Example protos costs some end-to-end inference performance, our CTR services basically all use the second input format.
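To make the difference concrete, here is a minimal C++ sketch that builds a PredictRequest in each format using the generated protobuf APIs from the tensorflow and tensorflow_serving protos. The input key names, feature names, and values are made up for illustration; real models define their own signatures.

```cpp
#include <string>

#include "tensorflow/core/example/example.pb.h"    // tensorflow::Example
#include "tensorflow/core/framework/tensor.pb.h"   // tensorflow::TensorProto
#include "tensorflow_serving/apis/predict.pb.h"    // tensorflow::serving::PredictRequest

// Format 1: a single string input carrying a serialized tf.Example.
tensorflow::serving::PredictRequest MakeParsingRequest() {
  tensorflow::Example example;
  auto* features = example.mutable_features()->mutable_feature();
  (*features)["age"].mutable_float_list()->add_value(30.0f);
  (*features)["price"].mutable_float_list()->add_value(9.9f);

  tensorflow::TensorProto examples_tensor;
  examples_tensor.set_dtype(tensorflow::DT_STRING);
  examples_tensor.mutable_tensor_shape()->add_dim()->set_size(1);
  examples_tensor.add_string_val(example.SerializeAsString());

  tensorflow::serving::PredictRequest request;
  (*request.mutable_inputs())["examples"] = examples_tensor;  // one map entry
  return request;
}

// Format 2: one raw tensor per feature, no serialization step.
tensorflow::serving::PredictRequest MakeRawRequest() {
  tensorflow::serving::PredictRequest request;
  auto& inputs = *request.mutable_inputs();
  for (const auto& name : {"age", "price"}) {
    tensorflow::TensorProto t;
    t.set_dtype(tensorflow::DT_FLOAT);
    t.mutable_tensor_shape()->add_dim()->set_size(1);
    t.add_float_val(1.0f);
    inputs[name] = t;  // one map entry per feature
  }
  return request;
}
```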

Having covered the model input background, let's look at the symptom. One day, after a business change went online, the memory shown in the TF Serving container monitoring suddenly started to grow and showed no sign of stopping.

We quickly reproduced the problem offline and profiled the memory with gperftools. The following image is an excerpt from the profiling PDF; for more details, see the attachment in the [Pull Request].

It shows that `DirectSession::GetOrCreateExecutors` keeps creating strings through the hash table's `emplace`, causing memory to grow continuously. `executors_` is an `unordered_map` that maps a model-input signature string to `ExecutorsAndKeys`.

The logic inside `DirectSession::GetOrCreateExecutors` is roughly as follows:
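The original post shows the TensorFlow source at this point. As a stand-in, here is a small self-contained C++ illustration of the same caching pattern (not the actual TensorFlow code): the cache key is built by joining the input names in the order the caller supplies them, so the same set of features in a different order produces a brand-new entry.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for TensorFlow's ExecutorsAndKeys; the details don't matter here.
struct ExecutorsAndKeys {};

std::unordered_map<std::string, ExecutorsAndKeys> executors_;  // one entry per key

// Mimics the key construction: join the input names in caller-provided order.
ExecutorsAndKeys& GetOrCreateExecutors(const std::vector<std::string>& inputs) {
  std::string key;
  for (const auto& name : inputs) {
    key += name;
    key += ',';
  }
  // try_emplace creates a new entry (storing a new key string) whenever the
  // key has not been seen before.
  return executors_.try_emplace(key).first->second;
}

int main() {
  GetOrCreateExecutors({"age", "price", "city"});
  GetOrCreateExecutors({"price", "city", "age"});        // same features, new order
  std::cout << executors_.size() << " cache entries\n";  // prints 2, not 1
}
```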

Following this logic, if the model has 10 inputs, there are 10! = 3,628,800 possible orderings of the inputs. If each key string takes about 100 bytes, that alone is roughly 350 MB of memory, and with more than 10 inputs it exceeds 3 GB. This is the cause of the memory leak.
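Written out, with the roughly 100-byte key size assumed above:

$$
10! \times 100\,\text{B} = 3{,}628{,}800 \times 100\,\text{B} \approx 346\,\text{MiB},
\qquad
11! \times 100\,\text{B} = 39{,}916{,}800 \times 100\,\text{B} \approx 3.7\,\text{GiB}
$$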

The immediate source of the leak is that the order of the features in the `inputs` of the `PredictRequest` sent to TF Serving keeps changing, so `DirectSession::GetOrCreateExecutors` never finds a matching entry and keeps creating new strings to insert into `executors_`. The `PredictRequest` sent to TF Serving is generated from a protobuf definition in which `inputs` is a map from string to `TensorProto`.

A further look at the Protocol Buffers documentation shows that the iteration order of a map is undefined; code cannot rely on the items of a map being in any particular order.

Looking back at the two model input formats: a model that uses tf.Example has only one input, so this problem cannot occur there. Our models use multiple raw tensor inputs, with at least 10 input features, which is why we hit the problem.

We had found the cause of the leak, but why did memory suddenly start growing when the service had been running online for a long time? The business team's feedback was that the client sending the requests had changed its request-construction logic: the insertion order used to be fixed in the code, and now it is not. Although the Protocol Buffers specification says the iteration order of a map is undefined, in practice, if items are inserted in the same order, the iteration order usually comes out the same as well; but that is an implementation detail we cannot rely on. With that, the whole problem was explained.

Once the cause was clear, the fix was straightforward, and we submitted PRs to both TF and TF Serving. The TF change is in `GetOrCreateExecutors`: github.com/tensorflow/… The TF Serving change always sorts the inputs: github.com/tensorflow/… Sorting in TF Serving also saves the `strings::StrCat` and lookup work inside TF.
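As an illustration of the TF Serving side of the idea (a sketch, not the actual PR code), the helper below pulls the (name, tensor) pairs out of the request map and sorts them by name before they are handed to the session, so the resulting signature key no longer depends on the map's iteration order. `PredictRequest` and its `inputs()` accessor are the real generated protobuf API; the helper itself is hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

#include "tensorflow/core/framework/tensor.pb.h"
#include "tensorflow_serving/apis/predict.pb.h"

// Hypothetical helper: copy the request's input map into a vector and sort it
// by feature name, so the same feature set always yields the same ordering,
// no matter how the client filled the map.
std::vector<std::pair<std::string, tensorflow::TensorProto>> SortedInputs(
    const tensorflow::serving::PredictRequest& request) {
  std::vector<std::pair<std::string, tensorflow::TensorProto>> inputs;
  inputs.reserve(request.inputs().size());
  for (const auto& entry : request.inputs()) {
    inputs.emplace_back(entry.first, entry.second);
  }
  std::sort(inputs.begin(), inputs.end(),
            [](const auto& a, const auto& b) { return a.first < b.first; });
  return inputs;
}
```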

III. Sudden surge in concurrent requests

During traffic peaks, TF Serving containers were repeatedly OOM-killed, and the number of OOM events can be seen in the platform monitoring.

The logs also show that the TF Serving process in the container was killed.

We also noticed a strange phenomenon in the platform monitoring: the number of processes (threads) in the TF Serving container suddenly increased, and once it passed a certain point the container was OOM-killed.

This was puzzling: what processes (threads) were suddenly being created? We had analyzed TF Serving's threading model before; the main worker threads are controlled by the model's intra-op and inter-op parallelism settings and should not grow. So we checked the messages log on the physical machine and found that the killed threads were named grpcpp_sync_ser: a grpcpp_sync_ser thread requests physical memory and triggers a page fault, but the container's memory has already been fully allocated, so the cgroup kills it.
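(As a side note, these are the intra-op and inter-op settings referred to above. The sketch below only shows the knobs on TensorFlow's `ConfigProto`, with illustrative values; the point is that they are fixed when the model is loaded, so they cannot explain a sudden growth in thread count.)

```cpp
#include "tensorflow/core/protobuf/config.pb.h"

// The session-level thread pools that bound TF's own worker threads. The
// values here are illustrative.
tensorflow::ConfigProto MakeSessionConfig() {
  tensorflow::ConfigProto config;
  config.set_intra_op_parallelism_threads(8);  // threads used within one op
  config.set_inter_op_parallelism_threads(8);  // ops that may run in parallel
  return config;
}
```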

Looking inside the container confirmed it: the TF Serving process had spawned a large number of grpcpp_sync_ser threads.

Based on the logs and monitoring above, we roughly reconstructed the sequence of events: a burst of requests caused the gRPC server to spawn a large number of grpcpp_sync_ser threads, whose memory allocations pushed the container past its cgroup limit, at which point it was killed.

It was clear that the root of the problem lay in the gRPC server, so we read the gRPC source code to see how grpcpp_sync_ser threads are created to handle gRPC messages. Searching for the thread name quickly leads to the place where the threads are created. (Always giving threads descriptive names is good practice; it makes problems much easier to locate.)

Continuing into the `MainWorkLoop` function: the overall logic is fairly complex, but the relevant part is roughly as follows.
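The original post shows the gRPC source here. Below is a heavily simplified C++ paraphrase of the sync server's worker-loop pattern, not the actual grpc code; the two stub functions stand in for gRPC internals.

```cpp
#include <atomic>
#include <climits>
#include <thread>

// Stubs standing in for gRPC internals, just so the sketch is self-contained.
bool WaitForWork() { return true; }  // pretend: blocks until a request arrives
void ProcessRequest() {}             // pretend: runs the RPC handler (allocates memory)

std::atomic<int> num_threads{1};
const int max_threads = INT_MAX;     // effectively no cap without a ResourceQuota

void MainWorkLoop() {
  while (WaitForWork()) {
    // Hand polling over to a brand-new thread, then handle the request on
    // this one. Under a sudden burst, this is where threads pile up: each
    // queued request can end up with its own thread and its own memory.
    if (num_threads.load() < max_threads) {
      ++num_threads;
      std::thread(MainWorkLoop).detach();
    }
    ProcessRequest();
  }
  --num_threads;
}
```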

As the logic above shows, if no ResourceQuota is set for gRPC, the gRPC server keeps creating new worker threads to process the messages in the queue. Normally, when the load is not high, a worker thread exits soon after processing its message. But when the server is under heavy load and suddenly receives a large number of requests, many worker threads exist at the same time and request a large amount of memory, which results in the OOM.

So the fix is to set a ResourceQuota, and we submitted a PR to TF Serving: github.com/tensorflow/… The change adds a limit on the maximum number of gRPC threads.
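For reference, this is roughly what capping the thread count looks like with the C++ gRPC API. It is a minimal standalone server sketch; the quota name, the 64-thread cap, and the listening address are illustrative values, not the ones used in the TF Serving PR.

```cpp
#include <memory>

#include <grpcpp/grpcpp.h>
#include <grpcpp/resource_quota.h>

int main() {
  // Cap the number of threads the synchronous gRPC server may create.
  // Without a ResourceQuota, a traffic burst can spawn an unbounded number
  // of grpcpp_sync_ser threads.
  grpc::ResourceQuota quota("inference_quota");
  quota.SetMaxThreads(64);  // illustrative value

  grpc::ServerBuilder builder;
  builder.SetResourceQuota(quota);
  builder.AddListeningPort("0.0.0.0:8500", grpc::InsecureServerCredentials());
  // builder.RegisterService(&prediction_service);  // register your service here

  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
  server->Wait();
  return 0;
}
```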

One recommendation: any code that runs a gRPC server should set a maximum thread limit, so that the service is not overwhelmed by unexpected requests. More broadly, this problem reminded us that a stable online service must have flow control, so that it can still provide degraded service in an emergency instead of deteriorating further.

IV. Summary

This article described how the iQiyi Deep Learning Platform diagnosed and fixed two TensorFlow Serving memory leak problems in practice. Since the two fixes went in, the online TensorFlow Serving services have been running smoothly, with no further memory leaks.

## References

github.com/tensorflow/…

github.com/gperftools/…

developers.google.com/protocol-bu…

github.com/grpc/grpc