In the first two sessions, we covered common problems in front-end and mobile development; if you are interested, see the recommended reading at the end of this article. In this issue we share three topics: using a Golang object pool to reduce GC pressure, concurrency control in FFmpeg, and static vs. dynamic graphs in Paddle. We hope they help you improve your skills.

01 Golang object pool reduces GC pressure

sync.Pool is Golang’s built-in object pooling facility. It caches temporary objects to avoid the allocation cost and GC pressure caused by frequently creating them. Objects cached in sync.Pool may be cleared at any time without notice, so sync.Pool cannot be used for persistent object storage. sync.Pool is not only concurrency-safe but also lock-free: by using CAS operations from the atomic package, it replaces locks with cheaper atomic operations closer to the CPU and operating system, meeting the needs of concurrent scenarios.

1.1 Usage

Initializing a sync.Pool requires the user to provide an object constructor, New. Users call Get to fetch an object from the pool and Put to return one to it. Overall, usage is quite simple.
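As a minimal sketch of that Get/Put cycle (the buffer pool and the render helper are our own illustrative example, not from the original text):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// A pool of reusable bytes.Buffer objects. New is called only when the
// pool has no cached object, so Get never returns nil here.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf) // return the buffer to the pool for reuse
	buf.Reset()            // always reset: the buffer may hold stale data
	fmt.Fprintf(buf, "hello, %s", name)
	return buf.String()
}

func main() {
	fmt.Println(render("gopher")) // prints "hello, gopher"
}
```

Note the Reset call: because a pooled object may have been used before, the caller is responsible for clearing any previous state.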

1.2 Principle

In the GMP scheduling model, viewed from the thread dimension, the logic on a P runs in a single thread, which provides a way to handle the concurrency of goroutines on each P. sync.Pool takes full advantage of this GMP property: for the same sync.Pool, each P holds its own local object pool, poolLocal. Each poolLocal contains a private field and a poolChain. private is simply an interface{} value that takes priority over poolChain for both writes and reads. poolChain is a linked list of ring buffers. Ring buffers are used because the ring structure makes memory reuse convenient, and a ring buffer is contiguous memory, which is friendly to the CPU cache.

Each ring buffer in the poolChain stores its head and tail. head and tail are not independent variables; there is only a single headTail variable that packs the two together: the high 32 bits are head and the low 32 bits are tail. This is a very common lock-free optimization. A poolDequeue may be accessed by multiple Ps simultaneously, for example during the object-stealing logic in Get, which can cause races. For instance, if only one slot is occupied (head - tail = 1) and multiple Ps access the ring buffer at the same time, without any concurrency control both Ps might obtain the object, which is definitely not intended. sync.Pool avoids introducing a Mutex by using the CAS operation from the atomic package: both Ps may read the object, but when headTail is finally updated, only one P’s CAS succeeds and the other’s fails.

02 Concurrency control in FFmpeg

2.1 Problem Description

Recently, an exploratory project required video stitching and composition. Since the original video clips differ in format, size, and bit rate, getting a reasonably smooth stitched result requires first normalizing the video size and unifying the encoding format and bit rate. After researching FFmpeg commands, we performed a series of transformations (video cropping, video padding, video scaling, etc.), all with the expected results. But when the commands were integrated into the actual business scenario, we ran into processes killed for exhausting memory and tasks that ran too long, or even failed, under limited CPU capacity; further upgrading the CPU configuration did not help much. So we began investigating thread control in FFmpeg commands.

2.2 FFmpeg thread control

As a powerful multimedia processing tool, FFmpeg comprises several powerful libraries. Its process for handling a multimedia file is as follows:

The computationally heavy steps are decoding, encoding, and data modification (filtering), and FFmpeg provides three parameters to control threading for them. The FFmpeg documentation describes these parameters as follows:

-filter_threads nb_threads (global) 

Defines how many threads are used to process a filter pipeline. Each pipeline will produce a thread pool with this many threads available for parallel processing. The default is the number of available CPUs.

filter_threads controls threading for simple filters; the default thread count is the number of available CPU cores.

-filter_complex_threads nb_threads (global) 

Defines how many threads are used to process a filter_complex graph. Similar to filter_threads but used for -filter_complex graphs only. The default is the number of available CPUs.

filter_complex_threads controls threading for complex filter graphs; the default thread count is likewise the number of available CPU cores.

threads integer (decoding/encoding,video) 

Set the number of threads to be used, in case the selected codec implementation supports multi-threading. 

Possible values: 

‘auto, 0’ automatically select the number of threads

The default value is ‘auto’.

threads controls threading for codecs, provided the selected codec implementation supports multithreading. As the documentation notes, the default is auto.
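These flags can be combined in a single invocation. As an illustration, here is a small Go helper that assembles such an argument list (the flag names come from the FFmpeg documentation above; the helper itself, the file names, and the filter string are our own hypothetical example, and in real use the list would be passed to os/exec):

```go
package main

import (
	"fmt"
	"strconv"
)

// buildFFmpegArgs assembles an ffmpeg command line that caps both codec
// threads (-threads) and complex-filter threads (-filter_complex_threads).
// Flag placement is simplified; in practice -threads can be set per
// input/output stream.
func buildFFmpegArgs(in, filter, out string, codecThreads, filterThreads int) []string {
	return []string{
		"-threads", strconv.Itoa(codecThreads),
		"-filter_complex_threads", strconv.Itoa(filterThreads),
		"-i", in,
		"-filter_complex", filter,
		"-y", out,
	}
}

func main() {
	args := buildFFmpegArgs("in.mp4", "scale=1280:720,gblur=sigma=2", "out.mp4", 2, 1)
	fmt.Println(args)
}
```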

We ran the time command to measure each FFmpeg command’s elapsed time and CPU usage. The results on a 4-core machine are as follows:

-i ... -filter_complex ... -threads 1 -y                 4.54s user 0.17s system 110% cpu 4.278 total
-i ... -filter_complex ... -threads 2 -y                 4.61s user 0.29s system 189% cpu 2.581 total
-i ... -filter_complex ... -threads 4 -y                 4.92s user 0.22s system 257% cpu 1.993 total
-i ... -filter_complex ... -threads 6 -y                 4.73s user 0.21s system 302% cpu 1.634 total
-i ... -filter_complex ... -threads 8 -y                 4.72s user 0.19s system 315% cpu 1.552 total
-i ... -filter_complex ... -y                            4.72s user 0.22s system 306% cpu 1.614 total
-i ... -filter_complex ... -filter_complex_threads 1 -y  4.63s user 0.13s system 316% cpu 1.504 total
-i ... -filter_complex ... -filter_complex_threads 2 -y  4.62s user 0.20s system 304% cpu 1.583 total
-i ... -filter_complex ... -filter_complex_threads 4 -y  4.58s user 0.27s system 303% cpu 1.599 total

The experiments show that, left without thread control, our cropping + scaling + gblur normalization pipeline has almost no room for filter parallelism: raising filter_complex_threads only increases system time, with almost no gain in overall elapsed time or CPU utilization. For the codec part, CPU utilization rises and elapsed time falls as the thread count grows, but the scaling is not linear. For a single command, setting the thread count to 2 gives a relatively good trade-off between CPU consumption and elapsed time.

2.3 Summary

1. As a compute-intensive processing tool, FFmpeg has a large demand for CPU, and it provides three parallelism parameters to control concurrency for different kinds of commands. Whether a specific command can actually run in parallel, however, depends on how it is implemented and needs case-by-case analysis.

2. Codecs, as a key link in FFmpeg video processing, are relatively time- and CPU-consuming. Making good use of the threads parameter can speed up processing and keep CPU usage under control.

03 Static and dynamic graphs in Paddle

Concepts of static and dynamic graphs

Static graph: like C++, compile first, then run. There are therefore two phases, compile time and runtime. At compile time, the complete model must be defined; Paddle generates a ProgramDesc, which is then optimized by a Transpiler. At runtime, the executor runs the ProgramDesc.

Dynamic graph: similar to Python, there is no compilation phase, so the model does not need to be defined in advance. Each line of network code immediately produces its computation result.

Comparison of advantages and disadvantages:

Static graph: Paddle initially supported only static graphs, so they have more support and documentation, and they also outperform dynamic graphs; but debugging is more troublesome. Dynamic graph: easy to debug, and the model structure can be adjusted dynamically; but execution efficiency is lower.

Question 1: How to determine whether you are in static graph mode or dynamic graph mode

  • Static graph mode: the static module is used in the program, or you need to build an executor and run the defined model with executor.run(program).

  • Dynamic graph mode: the dygraph module is used in the program. Since Paddle 2.0, dynamic graph mode is enabled by default.

  • Note: some APIs only support static or dynamic graphs. For example, APIs that access variable values generally only support dynamic graphs. If an error mentions imperative/dygraph, check whether a dynamic-graph API is being called in static graph mode.

import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable

# Static graph mode
paddle.enable_static()
main_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(main_program=main_program, startup_program=startup_program):
    data_x = np.ones([2, 2], np.float32)
    data_y = np.ones([2, 2], np.float32)
    x = fluid.layers.data(name='x', shape=[2], dtype='float32')
    y = fluid.layers.data(name='y', shape=[2], dtype='float32')
    x = fluid.layers.elementwise_add(x, y)
    print('In static mode, after calling layers.data, x =', x)
    # In static mode, after calling layers.data, x = var elementwise_add_0.tmp_0 :
    #   LOD_TENSOR.shape(-1, 2).dtype(float32).stop_gradient(False)
    place = fluid.CPUPlace()
    exe = fluid.Executor(place=place)
    exe.run(fluid.default_startup_program())
    data_after_run = exe.run(fetch_list=[x], feed={'x': data_x, 'y': data_y})
    print('In static mode, data after run:', data_after_run)
    # In static mode, data after run: [array([[2., 2.], [2., 2.]], dtype=float32)]

# Dynamic graph mode
with fluid.dygraph.guard():
    x = np.ones([2, 2], np.float32)
    y = np.ones([2, 2], np.float32)
    x = fluid.dygraph.to_variable(x)
    y = fluid.dygraph.to_variable(y)
    print('In DyGraph mode, after calling dygraph.to_variable, x =', x)
    # In DyGraph mode, after calling dygraph.to_variable, x = Tensor(shape=[2, 2],
    #   dtype=float32, place=CUDAPlace(0), stop_gradient=True, [[1., 1.], [1., 1.]])
    x = fluid.layers.elementwise_add(x, y)
    print('In DyGraph mode, data after run:', x.numpy())
    # In DyGraph mode, data after run: [[2. 2.] [2. 2.]]

Question 2: How to debug in static graph mode

  • Create a Print operator and print the content of the tensor you want to inspect (e.g. via fluid.layers.Print).

Question 3: How to transform dynamic graph into static graph

  • Given the pros and cons above, dynamic graph mode can be used during model development, and static graph mode during training and inference.

  • Decorate functions that need dynamic-to-static conversion with @paddle.jit.to_static, or convert the whole network with the paddle.jit.to_static() function.

import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.jit import to_static

class MyNet(paddle.nn.Layer):
    def __init__(self):
        super(MyNet, self).__init__()
        # A fully connected layer with a ReLU activation
        self.fc = fluid.dygraph.Linear(input_dim=4, output_dim=2, act="relu")

    @to_static  # converts this dynamic-graph forward pass into a static graph
    def forward(self, x, y):
        x = fluid.dygraph.to_variable(x)
        x = self.fc(x)
        y = fluid.dygraph.to_variable(y)
        loss = fluid.layers.cross_entropy(input=x, label=y)
        return loss


net = MyNet()
x = np.ones([16, 4], np.float32)
y = np.ones([16, 1], np.int64)
net.eval()
out = net(x, y)

Recommended reading:

Baidu programmers’ guide to avoiding development pitfalls (mobile)

Baidu programmers’ guide to avoiding development pitfalls (front-end)

Baidu engineers’ tips for quickly improving R&D efficiency

Baidu front-line engineers talk about the evolution of cloud native

【Technology gas station】 Revealing the large-scale adoption of Baidu’s intelligent testing

A brief discussion of the three stages of Baidu’s intelligent testing