This article is compiled from a talk given by Tencent senior technical expert Bao Jinlong at a LiveVideoStack online session. Drawing on his own hands-on experience, he walks through high-performance video inference engine optimization techniques in detail.

By Bao Jinlong

Edited by LiveVideoStack

Good evening, everyone. It’s a great honor to have this opportunity to come back to LVS and discuss a few topics with you. I attended my first LVS back in 2017, so it has been almost four years now.

Today’s topic is inference engine optimization techniques, with one premise: the focus is on-device inference. The memory bottleneck of the von Neumann architecture has been the dominant constraint for decades. On an Nvidia graphics card, or on domestic NPUs, the solution is to use the fastest memory available — HBM with 2.5D/3D stacking — to raise bus bandwidth, plus several levels of cache in between. On device, by contrast, chip power and area budgets mean that whether it is a GPU or a DSP, the L2 cache is basically about 1 MB, and memory is LPDDR3 or LPDDR4 shared by all the chips. Power consumption is very low, but performance is 10 to 20 times worse than desktop DDR or HBM. Therefore, optimization on device has to start from the overall design of the inference engine, the execution speed of the operators themselves, and the possibility of replacing operators with cheaper equivalents — in other words, from the software side — because improving the hardware by 10 or 20 times in the short term is simply not realistic.

01. Optimization ideas

Although five optimization ideas are listed here, they actually fall into three categories.

The first category looks at things from the framework perspective — items one and two. The idea is to change the data layout so that the computation pipeline executes more efficiently; this is an optimization of cache utilization, which reduces bus traffic and power consumption. The third item is optimization of the operator itself: without changing the algorithmic principle of the operator, can it be made to execute faster? The fourth is bypass optimization, which works by equivalent substitution. Only two cases are listed here; there are more complicated ones, but we don’t have time to go into them. The fifth item follows from the fourth: if you want to use motion compensation, the question becomes where the motion vectors come from, which leads to fast motion estimation.

The first part is data layout. This is a very basic issue. Video data is a sequence of two-dimensional images that is continuous in time. In the spatial domain — within a single frame — there are the horizontal and vertical dimensions, and there is a great deal of redundancy in both time and content. One very strong characteristic of video data is that within a frame, the two-dimensional blocks into which the image is partitioned are strongly correlated: flat blocks sit next to flat blocks, and where there is texture, the texture is continuous. The second characteristic is that adjacent 2D blocks are also strongly correlated with each other. The third is that frames are linked to each other through motion vectors, forming strongly correlated patches across a three-dimensional array — strongly correlated in the sense that not only are the texture features similar, but the high-frequency and low-frequency components are also similar (that is, the spectral features are similar). So we look at the data in both the time domain and the spatial domain. This correlation creates many opportunities to speed up the algorithms, which we will come back to later.

At present, the conventional data processing mode of an inference engine is the planar layout, possibly multi-channel — RGB, or three-channel YUV in 420 or 444 format. Execution is a raster scan: one row from left to right, then the next row from left to right, the same way early televisions scanned, until a whole frame has been processed. This method is universal and works regardless of the type of data, but it has problems. First, because processing goes row by row, the local dependencies in the data cannot be exploited. Second, because the whole frame is scanned, the data throughput is very large. You can think of whole-frame scanning as treating the entire image as one block — the largest possible block. If the image is large — say the Y-channel data alone is 1 to 2 MB while the L2 cache is 1 MB — you get classic cache thrashing: by the time one filter has finished and the next filter starts, the data read the previous time has already been completely flushed out and has to be loaded again. So the problem is quite serious: speed, power consumption, and overall performance are all poor.

To solve this problem, we can partition the data the same way video codecs do; the usual block size is 8×8. In general, an 8×8 block carries texture, is local, and contains enough features; going much bigger or smaller each has pros and cons. Normally, block storage is strided: for an 8×8 block, the first 8 pixels sit in the first row, and successive rows are a large stride apart. If this is processed with today’s AI-oriented, wide-SIMD instructions — AVX512, for example, can do 64 int8 multiplications at a time, and DSPs are even wider, generally 128 to 768 lanes (768 corresponds to an RGB 16×16 block with three channels processed in one go; there are wider cases still, such as 32×32) — the strided layout becomes the bottleneck. Looking at NPU designs discussed publicly, such as Tesla’s self-driving NPU or Google’s TPU v1/v2/v3, the degree of parallelism keeps improving but basically stays at this order of magnitude — not too big, not too small. Reading across rows in 420 format is very inefficient at this width: an 8×8 block needs eight address computations, eight reads of eight bytes each, only one compute instruction, and then eight writes. That is extremely inefficient. So how can we improve it?

The improvement is actually quite simple: cancel the stride of the 8×8 block and store it contiguously. That is, pack the 8 rows of the block into 64 consecutive bytes, so the remaining 7 row strides disappear — all 8 rows of data sit on one line. The advantage is that to process an 8×8 block you now need only one address, and that address is neatly aligned; then one read, one compute, and one write — in principle 7 or 8 times faster. In practice we do not get the full 7 or 8 times, because processing TILE data efficiently requires data conversion, stacking, and rearrangement during the operations, as we’ll see later.
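As a concrete illustration, here is a minimal scalar sketch — my own, not the speaker’s code — of repacking a planar 8-bit image into this contiguous 8×8 TILE layout; the function name and the row-major tile ordering are assumptions.

#include <cstdint>
#include <cstring>

// Repack a planar 8-bit image (width/height multiples of 8) into TILE layout:
// each 8x8 block is stored as 64 contiguous bytes, tiles ordered row-major.
void planar_to_tile(const uint8_t* src, int width, int height, int stride,
                    uint8_t* dst) {
    int tiles_per_row = width / 8;
    for (int ty = 0; ty < height / 8; ++ty) {
        for (int tx = 0; tx < tiles_per_row; ++tx) {
            uint8_t* tile = dst + (ty * tiles_per_row + tx) * 64;
            for (int row = 0; row < 8; ++row) {
                // copy one 8-pixel row of the block into the contiguous tile
                std::memcpy(tile + row * 8,
                            src + (ty * 8 + row) * stride + tx * 8, 8);
            }
        }
    }
}

In a production engine this repacking would be fused with the mandatory border-padding copy and done with permute instructions, as described next, so its cost is close to that of a plain memcpy.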

The TILE format requires data rearrangement. First, the input data arrives in conventional planar RGB or YUV 420 format, and the conversion removes the stride. A typical inference engine has to copy the input once anyway — it does not process the raw buffer directly, it needs border padding (Pad), and there may also be a format conversion — so the rearrangement can be merged with that copy. The permutation instructions used for this are very fast, executing at essentially memcpy speed, which means the cost of the rearrangement is negligible. Second, the data accessed during computation is not always address-aligned, so blocks addressed by a motion vector may need to be rearranged from misaligned positions. The diagram above shows how to do the input format conversion more efficiently: instead of reading 8 bytes at a time, read 64 bytes at a time for 8 rows and rearrange them into eight 8×8 blocks. That keeps instruction-level parallelism saturated and efficiency at its maximum. Now let’s look at an example of address rearrangement — the simplest one; you can try more complicated cases yourself.

Assume block A is at address (0,0) and its neighbors B, C, and D are each 8 away, all address-aligned. I need to access block G at (6,5) — say a motion vector points there. In 420 format this is just an unaligned access, or a simple concatenation of aligned reads. In TILE format it takes a little more work. First, splice block E out of A and C, and block F out of B and D. These two blocks still differ from G, because G is also offset horizontally, so they must be realigned along the X coordinate as well. With AVX512 as it stands, moving across the 64-byte vector takes three instructions; if a future instruction does it in one, so much the better — clearly the TILE format was not considered when AVX512 was designed. Also note the X suffix on the instruction in the slide, which marks an extended variant: the basic Intel align/shift instructions take a constant count (a literal 5, say), but a constant is too inflexible here — we need a variable. The code below it is an equivalent implementation: by applying another permute to the a and b vectors, we get the effect of a variable-count align.
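To make the splicing concrete, here is a scalar reference sketch — my own illustration, not the slide’s code — that pulls an arbitrary, unaligned 8×8 block such as G out of a TILE-layout buffer. In the optimized engine the per-row copies are replaced by the valign/permute vector instructions discussed above.

#include <cstdint>

// Extract the 8x8 block whose top-left corner is at pixel (x, y) from a
// TILE-layout image (each 8x8 tile stored as 64 contiguous bytes).
// Scalar reference version; width is assumed to be a multiple of 8.
void load_block_unaligned(const uint8_t* tiles, int width, int x, int y,
                          uint8_t out[64]) {
    int tiles_per_row = width / 8;
    for (int row = 0; row < 8; ++row) {
        int py = y + row;
        const uint8_t* tile_row =
            tiles + (py / 8) * tiles_per_row * 64 + (py % 8) * 8;
        for (int col = 0; col < 8; ++col) {
            int px = x + col;
            // (px / 8) selects the tile in this tile-row, (px % 8) the byte in its row
            out[row * 8 + col] = tile_row[(px / 8) * 64 + (px % 8)];
        }
    }
}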

On Qualcomm’s HVX it is much easier. Qualcomm’s version is also five instructions, but the first four are one class of instruction, valign, which supports both constant and variable counts and performs exactly the same action. Finally, a vswap operation similar to AVX512’s blend produces the block. These are low-latency instructions and the accesses are much faster, so the cost is basically very small — and it is even better when that cost is amortized by further optimization in the subsequent execution.

02. Calculation pipeline

Next, let’s look at the second part. The computation needs a new framework. Previously, each filter was executed over the whole frame in raster-scan order — S0, S1, S2, S3, S4, one after another. In that scheme the cache is essentially never hit, and memory is very slow on mobile devices; according to our experimental data, the power consumed by DDR memory accesses is generally greater than that consumed by the compute instructions themselves. To solve this, we want to minimize the amount of intermediate data flowing between these stages and improve cache utilization. How? By designing a super-pipeline: the filters S0 through S4 execute in parallel as stages of a pipeline. Parallel here means the data-processing order changes from a per-frame dependency to a per-row dependency, so that viewed at the frame level the stages run concurrently. This way the working set basically stays in the cache, data is read without much bus traffic, and power consumption is very low.

The super-pipeline works on a TILE or a ROW: in 420 format the unit of work is one row, so each action processes one ROW; in TILE format there is a secondary TILE loop inside the ROW.

Before engine optimization proper, note that data prefetching matters for both a regular engine and an optimized, tightly pipelined one. The data cache on a mobile chip is very small; without prefetching, you almost never hit. The usual structure under row scanning is a software pipeline: before the loop starts, prefetch one row, and inside the loop prefetch the next row while processing the current one. The process_row function takes a long time — usually more than 1000 clocks — while loading data from the bus into the cache is roughly 100 clocks: perhaps 80 on a good device, 150 on a worse one. If process_row executes too fast, the prefetch cannot keep up — if the next row has not finished prefetching by the time the current row is done, the prefetch is wasted — but with a whole row to process there is plenty of time. When many filters run simultaneously in a multi-stage pipeline, prefetching is far more effective than under raster scan. As shown in the figure, the prefetch proceeds sequentially: the black line combines the prefetch with the data load, so the data is prefetched once, and since the intermediate data between stages is small it does not need to be prefetched again. Where you used to prefetch n times, you now effectively prefetch only the first time. In the figure, the prefetch pointer is the green line, the red line one step behind it is the row being processed, and the yellow lines are rows already processed. Prefetching is essential — without it an engine cannot reach very high performance.
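A minimal sketch of this prefetch-ahead row loop — the names process_row and prefetch_row are placeholders, and the structure assumes GCC/Clang’s __builtin_prefetch:

#include <cstdint>

static inline void prefetch_row(const uint8_t* row, int width) {
    // touch one cache line every 64 bytes so the whole row is on its way into cache
    for (int x = 0; x < width; x += 64)
        __builtin_prefetch(row + x, /*rw=*/0, /*locality=*/3);
}

void filter_frame(const uint8_t* src, uint8_t* dst, int width, int height,
                  int stride,
                  void (*process_row)(const uint8_t*, uint8_t*, int)) {
    prefetch_row(src, width);                             // prime the pipeline with row 0
    for (int y = 0; y < height; ++y) {
        if (y + 1 < height)
            prefetch_row(src + (y + 1) * stride, width);  // fetch the next row ...
        process_row(src + y * stride, dst + y * stride, width); // ... while processing this one
    }
}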

Next comes the design of the super-pipeline itself. Here is a relatively simple example with four stages: the first stage is a 1×1 load (prefetch), the green stage is a 5×5 convolution, the blue stage is a 3×3 deconvolution (assume one exists for the sake of the example), and the yellow stage is 7×7 post-processing. Each stage obviously has a data dependency on the one before it: if its tap requirement is not yet met, it cannot continue. The processing logic, shown in the code in the figure, is essentially an array of four function pointers, one per stage, tried in order from top to bottom, one row at a time. For instance, right after the second load row you cannot run the 5×5 filter — its data is not complete; it can only run at step 6, after step 5 has loaded enough rows, so the attempt fails and we keep loading up to 5; at 6 the data is complete and the convolution runs. After 6 we try the next stage at step 9, which obviously fails, so we break out, return to 7, and once 8 completes, the next attempt at 9 succeeds. The following stage then tries 15, which is incomplete, so we fall back to 11; 11 is followed by 12, 12 by 13 and 14 — since this is an upsampling stage, two rows are produced — at which point 15 can execute. After that, if 20 is not yet satisfied, we return to 16, and the logic carries on in the same way. Execution here is much more efficient than processing the whole frame stage by stage, and the code is very simple — and in my experience, the more beautiful the code, the more efficient it tends to be.
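A minimal sketch of such a driver, under my own assumptions about the interface: each stage reports how many input rows it needs before it can emit its next output row, and the scheduler simply walks the stage list top-down, firing whichever stage is ready.

#include <vector>
#include <cstddef>

struct Stage {
    int tap = 1;        // filter height in rows, e.g. 1, 5, 3, 7
    int rows_in = 0;    // rows delivered by the previous stage so far
    int rows_out = 0;   // rows this stage has produced so far
    void (*process_row)(Stage& self) = nullptr;  // produce one output row

    bool ready(int height) const {
        // enough input buffered for the next output row, or the input is complete
        return rows_in - rows_out >= tap || (rows_in >= height && rows_out < height);
    }
};

// Super-pipeline driver: keep walking the stage list top-down and fire any
// stage whose data dependency is satisfied. Border padding and the
// two-rows-per-step behaviour of an upsampler are ignored for brevity.
void run_pipeline(std::vector<Stage>& stages, int height) {
    stages[0].rows_in = height;  // the source frame is already in memory
    while (stages.back().rows_out < height) {
        for (std::size_t i = 0; i < stages.size(); ++i) {
            Stage& s = stages[i];
            if (s.rows_out >= height || !s.ready(height))
                continue;                      // dependency not met: try the next stage
            s.process_row(s);
            s.rows_out++;
            if (i + 1 < stages.size())
                stages[i + 1].rows_in++;       // hand one more row to the stage below
        }
    }
}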

The data dependencies are the same whether the format is 420 or TILE, but TILE has a few specifics. First, prefetching should be done per TILE rather than per row. As shown in the figure, green is the prefetch pointer, red is the processing pointer on the next block, and yellow marks processed blocks; process_tile_row prefetches the first TILE outside the loop, then inside the loop prefetches the next TILE while processing the current one. As for the dependency check: since a TILE is 8×8, the largest filter tap it can support is 17 — easy to work out, the center row plus 8 rows on either side. No convolution kernel or filter is that big; 7×7 or 9×9 is about the limit, and 3×3 or 5×5 is generally sufficient. So the check is simple: look at whether the next TILE the filter needs is available, and if not, fail.

In TILE format, one TILE row equals 8 original rows, which consumes more cache. Typically, a 1080p single channel under the row pipeline described earlier consumes about 50 KB — far less than you might expect; I know plenty of engines that consume 200 or 300 MB of buffer when processing data, so an engine designed this way that only touches 50 KB of working data is remarkable. In reality there is also format conversion, I/O conversion, and a minimum of two frames of buffering. And the 50 KB assumes 8-bit data: if the data type is floating point or int32, 50 × 4 = 200 KB is already sizeable, and multiplied by 8 for the TILE rows it exceeds the L2 capacity. The handling is very simple: when executing, split the row into multiple segments. If each segment is a quarter of the original width, the pipeline consumes a quarter of the cache. It is a very simple transformation.

03. Operator fusion

We’ve covered the TILE format and pipelining fairly quickly; now on to part three. The first topic is operator merging. For a very small operator, the cost of reading and writing dwarfs the cost of computing. There are also format conversions — int8 to int32, float to int, and so on. Simple addition and subtraction, and even square roots, are cheap as hardware instructions (software-emulated square roots need a lot of computation, but the hardware instruction’s latency is small). Consider a function that adds two arrays: the old approach executes row by row, and for each element the work is load A, 5 cycles; load B, 5 cycles; the add itself, 1 cycle; and the store, 5 cycles. That is really slow — and if you load without prefetching you might pay 50 or 100 cycles, a very bad case we see in a lot of production code; with prefetching it is about 5 cycles. On the right is a square-root function that does the same thing — read, compute, write — with the square root itself costing about 6 cycles. If we perform three such operations — the square root of A, the square root of B, and the sum of the two — the instruction costs add up to about 48 cycles per element.

So we write a new function that combines all of these operations into a single loop body. Now the cost per element is load A and load B at 5 cycles each, two square roots, one add, and one store — roughly 27 cycles in total, so it is nearly twice as fast. A very simple idea.
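A minimal sketch of the idea — my own illustration, with the array names and the use of std::sqrt as assumptions: the three separate passes each stream the data through memory, while the fused loop reads and writes each element once.

#include <cmath>
#include <cstddef>

// Unfused: three passes, each paying its own loads and stores per element.
void sqrt_a(const float* a, float* ra, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) ra[i] = std::sqrt(a[i]);      // pass 1
}
void sqrt_b(const float* b, float* rb, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) rb[i] = std::sqrt(b[i]);      // pass 2
}
void add(const float* ra, const float* rb, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = ra[i] + rb[i];       // pass 3
}

// Fused: one pass, one load of a and b, one store of the result per element.
void sqrt_add_fused(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::sqrt(a[i]) + std::sqrt(b[i]);
}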

The same applies to multi-channel convolution, which generally comes in widths of 8, 16, or 32. Take a convolution implemented with AVX512’s DPBUSD: each 32-bit accumulator lane consumes four int8 products, so a 3×3 kernel does not divide evenly — each kernel row’s three taps are spread across one vector, occupying the lower three byte slots of each four-byte group, and the data is organized the same way. Each vector is fed to DPBUSD, executed three times (once per kernel row) to obtain the sum. There are other costs too, such as the reads and the data shuffles, so try to do all the convolution sums in one pass: rearrangement has a cost, and once the data is rearranged it is best to keep it in registers and use it for everything at once. The example here uses eight convolution kernels — think of them as const arrays — applied in a single loop, with the result written out directly at the end.
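A minimal sketch of the DPBUSD inner loop — the data layout described in the comments is an assumption for illustration, and the code requires AVX512-VNNI (compile with -mavx512f -mavx512vnni):

#include <immintrin.h>
#include <cstdint>

// 3x3 convolution inner step with AVX512-VNNI, 16 output pixels at a time.
// Assumes the input has already been rearranged so that, for each of the 16
// output pixels, the 3 horizontal taps of one kernel row occupy the lower 3
// bytes of a 4-byte group (4th byte zero); w_row[k] holds the matching int8
// weights laid out the same way.
static inline __m512i conv3x3_16px(const uint8_t* row0, const uint8_t* row1,
                                   const uint8_t* row2, const __m512i w_row[3]) {
    __m512i acc = _mm512_setzero_si512();
    acc = _mm512_dpbusd_epi32(acc, _mm512_loadu_si512(row0), w_row[0]); // top row taps
    acc = _mm512_dpbusd_epi32(acc, _mm512_loadu_si512(row1), w_row[1]); // middle row taps
    acc = _mm512_dpbusd_epi32(acc, _mm512_loadu_si512(row2), w_row[2]); // bottom row taps
    return acc;  // 16 int32 partial sums, one per output pixel
}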

Operator fusion does raise a problem. When we build a model, each element — each operation, each standalone function — forms a static execution sequence. Once you start merging, the number of combinations explodes: suppose we support 180 operations, of which at least 90 can be fused — the number of permutations is astronomical, so a fully static engine is impossible. Instead, dedicated execution code has to be built for each model as it is loaded. In terms of when to build it, there are three options. First, on the development side — compiling on a Mac laptop or a server, generating an .so directly, and shipping that .so inside the APK at release time. Second, on the device: the mobile ARM CPU runs the compilation, merges the operator source code, and calls the LLVM compiler to generate optimized code. Third, on the Device in the OpenCL sense: the code is compiled for the GPU — a DSP works too; it is really up to you. From a security standpoint we need to distinguish Host and Device here; otherwise doing it on the Host is fine. But there is a catch: as anyone who has used OpenCL knows, compiling kernels on device takes quite a long time, usually a few hundred milliseconds, and when we run A/B experiments an app start-up that is a few hundred milliseconds longer is fatal. If that is a problem, there is another way, and it is the one I would suggest: skip the compiler at runtime entirely. Pre-compile each operator into a binary blob — an array of unsigned char — and before running, stitch the operator blobs together according to the model. If the stitching code is well optimized its cost is negligible; stitching generally finishes within a millisecond.
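A deliberately simplified sketch of the stitching idea, under heavy assumptions: each pre-compiled fragment is position-independent, self-contained straight-line code with no relocations and no trailing return, and the target is x86-64 Linux/Android. A real engine must also handle relocations, calling conventions, and instruction-cache maintenance.

#include <sys/mman.h>
#include <cstring>
#include <vector>

struct OpBlob { const unsigned char* code; size_t size; };  // pre-compiled fragment

using KernelFn = void (*)(void*);

// Concatenate pre-compiled operator fragments into one executable buffer.
KernelFn stitch_kernel(const std::vector<OpBlob>& ops) {
    size_t total = 1;                       // +1 for the final ret byte
    for (const OpBlob& op : ops) total += op.size;

    void* buf = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return nullptr;

    unsigned char* p = static_cast<unsigned char*>(buf);
    for (const OpBlob& op : ops) {          // stitch the fragments back to back
        std::memcpy(p, op.code, op.size);
        p += op.size;
    }
    *p = 0xC3;                              // x86-64 ret

    mprotect(buf, total, PROT_READ | PROT_EXEC);  // flip to executable (W^X)
    return reinterpret_cast<KernelFn>(buf);
}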

04. Operator bypass optimization

The next step is operator bypass optimization. The first two parts were about raising cache utilization and lowering power consumption, and about making the operator itself run faster without changing its algorithm. Bypass optimization rests on a basic principle: the content of an 8×8 block splits into two kinds of features. Simple features — flat regions, edges, gradient directions — can be obtained with simple, fast algorithms that need no complex model and are very cost-effective. Complex features, on the other hand, can only be computed by the deep method. The two are complementary and, under certain conditions, interchangeable: the fast algorithm can stand in for the deep one. In other words, for data blocks where the simple algorithm’s output error is acceptable, or zero, local substitution is feasible — changing how one block is computed does not affect the network’s overall output. Deciding whether to substitute generally requires a predictor of its own — for example, checking whether the motion-compensated residual is zero — and that predictor has a cost, which must be far lower than the benefit. If the substitution is 3 times faster and the prediction costs 10 percent, the net benefit is still large; otherwise it is not worth doing. From my own experience: I once built a texture-complexity predictor for an algorithm using the TV (Total Variation) method, but the implementation had problems — on a 720p image the prediction alone took 36 ms, while the whole algorithm without any bypass finished in about 36 ms, so with the prediction added the overall result was actually slower.

Next we look at texture-complexity analysis. Conventional complexity measurement is essentially a TV computation, and it can be implemented very quickly and cheaply. For an 8×8 block, to decide whether it is flat, first compute the mean AVG; if the deviation of every point from the mean — the variance, or equivalently the absolute difference — is below a threshold or zero, the texture is flat. The benefit: a convolution normally multiplies every point, but in the flat case it degenerates — every input equals AVG, so it factors out and the weighted sum on the right-hand side becomes a constant; O(n²) or O(n³) degrades to O(1). There are other, more complicated cases, but the idea is the same: when the bypass condition holds, make an equivalent substitution. Another example is gradient-direction detection, which is a directional form of the TV computation: take 8 points along the 45-degree direction, and if their variance about the mean is very small, or zero, that direction is clearly an edge. Gradients in multiple directions can also be computed with the Sobel or Laplacian operators, but for reliability the TV method is still recommended.
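A minimal sketch of the flat-block bypass — the threshold value and function names are my own, and a generic convolution stands in for the deep filter: if every pixel of the 8×8 block is within a small tolerance of the mean, the convolution output for the whole block collapses to mean × (sum of kernel weights).

#include <cstdint>
#include <cstdlib>

// Returns true and writes the block mean if the 8x8 block is flat
// (every pixel within `tol` of the mean).
static bool is_flat_block(const uint8_t block[64], int tol, int* mean_out) {
    int sum = 0;
    for (int i = 0; i < 64; ++i) sum += block[i];
    int mean = (sum + 32) / 64;
    for (int i = 0; i < 64; ++i)
        if (std::abs(block[i] - mean) > tol) return false;
    *mean_out = mean;
    return true;
}

// Bypass: for a flat block, conv(x) = mean * sum(weights) at every position,
// so the O(n^2) per-pixel convolution degrades to O(1) for the whole block.
void conv_block_with_bypass(const uint8_t block[64], const float* weights,
                            int taps, float out[64],
                            void (*conv_full)(const uint8_t*, const float*, int, float*)) {
    int mean;
    if (is_flat_block(block, /*tol=*/2, &mean)) {
        float wsum = 0.f;
        for (int i = 0; i < taps; ++i) wsum += weights[i];
        for (int i = 0; i < 64; ++i) out[i] = mean * wsum;   // constant output
    } else {
        conv_full(block, weights, taps, out);                // fall back to the full operator
    }
}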

Flat textures give a clear degradation; another obvious O(1) degradation is motion compensation. In a codec, motion compensation only copies pixels, but a deep inference engine has to compensate at every step: if only the pixels are copied, the network’s intermediate data cannot be ignored and the output is no longer equivalent, so full motion compensation is needed at all levels. First, pixel compensation: if the residual between two data blocks is zero, and both pass through the same filter, the residual between the two filtered blocks can also be taken as zero — so once F(b0) has been computed, F(b1) need not be; it can simply be copied. That is an O(1) operation. Second, convolution compensation: convolution has intermediate outputs, and if only pixel-level compensation is done while the intermediate data is left empty, the convolution’s neighborhood dependencies make the network output wrong. So the intermediate results of the convolution — the equivalent filter’s intermediate results — also need motion compensation, as do important intermediate outputs such as feature maps. These compensations have a cost, whether estimated as the cost of copying or computed directly. Generally, motion compensation pays off handsomely on a PC, or wherever bandwidth is high and power is not a concern. On a mobile platform with a slow memory bus, you have to consider how complex the filter being replaced is: there is usually a break-even point, beyond which there is a benefit — and if the filter is cheaper than that, motion compensation fails to pay off.
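A minimal sketch of the pixel-level case — the names and the SAD-based zero-residual test are my own assumptions: if the current block matches its motion-compensated reference exactly, both the filtered output and the cached feature-map slice are copied from the reference instead of being recomputed.

#include <cstdint>
#include <cstring>
#include <cstdlib>

// Sum of absolute differences between an 8x8 block and its reference.
static int block_sad(const uint8_t cur[64], const uint8_t ref[64]) {
    int sad = 0;
    for (int i = 0; i < 64; ++i) sad += std::abs(cur[i] - ref[i]);
    return sad;
}

// If the residual is zero, copy the already-computed outputs of the reference
// block (final pixels and the feature-map slice) instead of running the filter.
bool try_motion_compensate(const uint8_t cur[64], const uint8_t ref[64],
                           const float* ref_out, float* cur_out,     // filtered pixels
                           const float* ref_fmap, float* cur_fmap,   // intermediate feature map
                           int out_size, int fmap_size) {
    if (block_sad(cur, ref) != 0)
        return false;                       // residual present: run the real operator
    std::memcpy(cur_out, ref_out, out_size * sizeof(float));
    std::memcpy(cur_fmap, ref_fmap, fmap_size * sizeof(float));
    return true;                            // O(1) copy replaced the computation
}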

05. Fast motion estimation

Let’s move on to the last part: fast motion estimation. Motion compensation needs a block-match process, or optical flow; each pixel needs a motion vector, and in practice we use one motion vector per 8×8 block. In principle the algorithm can be coupled with the decoder to obtain motion vectors, but codec motion vectors carry residuals and may interfere with your algorithm enough that in most cases they cannot be used — so you need to obtain the motion vectors yourself. A conventional encoder’s motion estimation is far too expensive — it would be slower than your deep algorithm — so that route is pointless. In recent years, however, fast motion estimation algorithms have emerged, with a few characteristics. First, instead of the large search window of the original algorithm, the fast algorithm starts from an initialized predicted motion vector, and because of that prediction the window can be small. Suppose the matching target is as shown in the figure: the unoptimized search uses a large window, whereas the fast algorithm with a prediction vector can make the window — and hence the number of search points — very small and still converge. Because the prediction vector comes from a different source, it does not necessarily fit exactly; it is not like copying pixels, so the vector cannot be used directly — the search still has to be redone — but it is much faster than searching without a predicted window.

So how do you get the prediction vector? Assume a sequence frame0, frame1, frame2, and that mv1 has already been found by searching. To predict the search from frame0 to frame2, the current block can be assumed to move at constant velocity, so extend the motion vector: mv2 is simply mv1 multiplied by 2, set the window around that point, and search again. There are other methods too, such as searching in reverse from frame1 to frame0: the block at that position can reuse the motion vector, which can then be used to search frame0 in reverse, and the same can be done toward frame2, also in reverse. There are many possibilities — a similar scheme should exist in the new VVC standard — and it is not complicated.

The second way to obtain predictions is anchors (Ankor in the slides). What is an anchor? As shown in the figure, the data is divided into many tiles, but not all of them are searched: we search only alternating rows and columns — a quarter of the total — and use that quarter of the tiles as anchors for motion search. For the remaining three quarters, the motion vector found by a neighboring anchor is used as the prediction vector. The order in which anchors are selected and executed depends on the architecture. On a GPU there is no time dependency between anchors, so they can all be searched simultaneously. On a DSP or CPU you can use raster scan, so each anchor can reuse the previous anchor’s result, which converges faster; running in parallel is fine too, and searching only a quarter of the blocks keeps the cost down.

Then there is the question of failure, which can happen in two ways: the predicted vector’s direction is simply wrong, or the window is not big enough. In either case the anchor search fails. In general we do not enlarge the window or keep retrying in different directions in place; instead we perform exactly the same search at a lower resolution, with the search algorithm unchanged. Since the resolution is a quarter, the window is effectively doubled in each dimension. Search the tile at the lower resolution, multiply the resulting motion vector by 2, and use it to initialize the current tile: we get a new, well-placed window and avoid a slower, larger search. You can think of this as a two-level pyramid; more elaborate schemes use three levels, but since we already have three prediction methods, the third level does not pay off much — two levels are enough.

The last piece is the search itself. We now have a predicted motion vector and a window — how do we search? A traditional raster scan of a 16×16 window means 256 candidate positions, which is clearly unacceptable. We use a downhill search with a variable step size. The first step size is 4: within the 16×16 window, evaluate a diamond of 13 points and take the point with the smallest residual as the new center. The second pass reduces the step to 2 — the blue region becomes the new search window — and if the result gets dramatically worse the search fails. If the residual keeps shrinking, the step drops to 1 — the green window — and finally the motion vector with the minimum residual is output.
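A minimal sketch of this downhill search — the candidate pattern, SAD cost, and function names are my own stand-ins; a real implementation would also clamp candidates to the picture boundary and reject matches whose best SAD stays above a failure threshold:

#include <cstdint>
#include <cstdlib>

struct MV { int x, y; };

// SAD of the 8x8 block at (bx,by) in `cur` against (bx+mv.x, by+mv.y) in `ref`.
static int sad8x8(const uint8_t* cur, const uint8_t* ref, int stride,
                  int bx, int by, MV mv) {
    int sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad += std::abs(cur[(by + y) * stride + bx + x] -
                            ref[(by + y + mv.y) * stride + bx + x + mv.x]);
    return sad;
}

// Downhill search: start from the predicted vector, evaluate the neighbors,
// move to the best point, and shrink the step 4 -> 2 -> 1.
MV diamond_search(const uint8_t* cur, const uint8_t* ref, int stride,
                  int bx, int by, MV pred) {
    static const int dx[9] = {0, 1, -1, 0, 0, 1, 1, -1, -1};
    static const int dy[9] = {0, 0, 0, 1, -1, 1, -1, 1, -1};
    MV best = pred;
    int best_sad = sad8x8(cur, ref, stride, bx, by, best);
    for (int step = 4; step >= 1; step /= 2) {
        bool moved = true;
        while (moved) {                       // keep sliding downhill at this step size
            moved = false;
            for (int i = 1; i < 9; ++i) {
                MV cand = {best.x + dx[i] * step, best.y + dy[i] * step};
                int s = sad8x8(cur, ref, stride, bx, by, cand);
                if (s < best_sad) { best_sad = s; best = cand; moved = true; }
            }
        }
    }
    return best;
}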

06. Optimization gains

That’s it for fast motion estimation. Three optimization methods have been covered; the others we won’t discuss this time. With all three applied, the gains are very noticeable — for example, roughly 3x from the first method, 3x from the second, and 4x from the third. Two examples are shown here. The first is super-resolution. Using the traditional 420 format optimized with AVX2 instructions, the result is under 200 FPS (the algorithm is mobile-oriented; on a PC the numbers can look scarily high). With the optimizations — switching to the TILE format, even before adding texture analysis or motion compensation — it becomes very fast, up to about 1000 FPS. Going one step further and adding texture analysis and bypass analysis picks up another 40 to 60 percent. The second example is the traditional algorithm VBM3D, which also benefits enormously because we use the fast motion-estimation algorithm: its main cost is block matching, so speeding that up yields obvious gains. With the 420 format, the open-source implementation available online, a 5-frame sequence, and AVX2 optimization, 1080p performance is about 1 FPS; with the 8×8 TILE format and AVX512 it is more than 100 FPS. Adding Early Skip — which, unlike motion compensation, does not use a motion reference, but means that during the DCT 3D transform some data can be finished early in certain cases — roughly doubles the gain again, eventually reaching 220 FPS at 1080p. That is basically real-time: a video algorithm that ran at 1 FPS now runs at 220 FPS, which is the journey from offline to real-time.

That’s all for today’s share. Thank you.