This article is edited from a talk given by Tencent senior technical expert Bao Jinlong at a LiveVideoStack online session. Drawing on his own hands-on experience, he explains in detail the optimization techniques behind a high-performance video inference engine.

By Bao Jinlong

Transcript organized by LiveVideoStack

Good evening, everyone. It’s my great honor to have this opportunity to join LVS again and discuss these topics with you. My first LVS was in 2017, so it’s been nearly four years now.

Today’s topic is inference engine optimization, with one premise: we are mainly talking about the device side. The memory bottleneck of the von Neumann architecture has been the dominant constraint for decades. On NVIDIA graphics cards, or on domestic Suiyuan NPUs, the solution is to raise bus bandwidth with the fastest HBM memory (2.5D/3D stacked) and three to six levels of cache in between. On the device side, however, there are problems with chip power consumption and area. Whether GPU or DSP, the L2 cache is basically around 1 MB, and the memory is shared by all the chips. LPDDR3 and LPDDR4 have very low power consumption, but their performance is 10 to 20 times worse than desktop DDR and HBM. So optimization on the device still has to start from the overall design of the inference engine, the execution speed of the operators themselves, and the replaceability of the operators. In other words, the optimization has to come from the software side, because improving the hardware by 10 or 20 times in a short time is very difficult.

01. Optimization approach

There are five optimization ideas at present, but they actually fall into three categories.

The first category looks at things from the Framework’s perspective and covers the first and second items: change the data layout so the compute pipeline runs more effectively. This is optimization from the point of view of cache utilization, which reduces bus accesses and power consumption. The third item is to optimize the operator itself and see whether it can run faster without changing the operator’s algorithmic principle. The fourth is bypass optimization, carried out on the principle of equivalent substitution; only two techniques are listed here, and there are more complex ones, but we don’t have time to go deeper. The fifth item follows from motion compensation: if you want to use it, you need to know where the motion vector comes from, so a motion-vector estimation step is attached to it.

The first part is data layout. This is a very basic problem. Video data is a sequence of two-dimensional images that is continuous in time. In the spatial domain, that is, within a single frame, there are horizontal and vertical dimensions, and there is a lot of redundancy in both time and content. One very strong characteristic of video data is that within a frame, a two-dimensional block has very strong internal correlation: flat blocks stay flat, and textures continue as textures. The second characteristic is that adjacent two-dimensional blocks are also strongly correlated. The third is that frames are linked to each other through motion vectors: a patch, viewed as a three-dimensional array across frames, is strongly correlated. This strong correlation means not only that texture features are similar, but that the high-frequency and low-frequency components are each similar as well (that is, the spectral features are similar). That covers both the temporal and spatial domains. This correlation creates many opportunities to speed up the algorithms, and we will go into more detail later.

At present, the conventional inference engine handles data in a planar layout, which may be multi-channel, such as RGB, or three channels such as YUV in 420 or 444 format. It works by scanning line by line, one row from left to right, then the second row from left to right, the same way an old TV set scanned, until a whole frame is done. This method is universal: any type of data can be processed this way. But it has some problems. First, it operates row by row, so local data dependencies cannot be exploited. Second, the whole frame is swept, so the data throughput is very large. Scanning the whole frame effectively treats the frame as one block, the largest block possible. If the frame is large, say a Y channel of 1 or 2 MB, while the L2 cache is 1 MB, this is a typical case of cache thrashing: by the time one filter finishes and the next filter starts, the data read earlier has been completely flushed and must be reloaded. So the problem is quite serious, and both speed and power consumption are very poor.

To solve this problem, we can block the data the same way a video codec does, and the usual block size is 8×8. In general, an 8×8 block captures the texture, has locality, and contains enough features. Going too big or too small each has pros and cons. Typical blocked storage still has a stride: in an 8×8 block, the first 8 pixels are in the first row, and each subsequent row is a full stride away. Meanwhile, today’s AI-oriented, highly parallel instructions process much more at once: AVX512 can do 64 int8 multiplies per instruction, and DSPs are even wider, usually 128 to 768 pixels. A width of 768 corresponds to a three-channel RGB 16×16 block processed in one go. There are wider designs, such as 32×32; we often see NPU designs online, such as the NPU in Elon Musk’s cars, or Google’s TPU generations 1, 2 and 3. The level of parallelism keeps improving, but it basically stays at this scale, not too big and not too small. If you read strided 420-format data at this granularity, it is very inefficient: an 8×8 block requires 8 address computations, 8 reads of 8 bytes each, only one compute instruction, and then 8 writes. That is extremely inefficient. So how can it be improved?

The improvement is actually quite simple: remove the stride inside the 8×8 block and store it contiguously. That is, pack the 8 rows into one 64-byte run, so the strides of the next 7 rows disappear and that one run holds all 8 rows of data. The advantage is that when processing an 8×8 block you need only one address computation, on a neatly aligned address, then one read, one compute, and one write, which is 7 or 8 times faster. In practice we don’t get the full 7 or 8 times, because processing TILE data efficiently requires data conversion, splicing, and rearrangement (more on this later).
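
To make the layout concrete, here is a minimal sketch of the repack, assuming a single 8-bit channel whose width and height are multiples of 8; the function name and signature are illustrative, not the engine’s actual API.

```cpp
// Illustrative repack of a planar 8-bit channel into 8x8 TILEs stored as
// 64 contiguous bytes each, so a whole tile can be fetched with a single
// aligned 64-byte load. Assumes width and height are multiples of 8.
#include <cstdint>
#include <cstring>

void planar_to_tile(const uint8_t* src, int width, int height, int stride,
                    uint8_t* dst /* holds width*height bytes, tile-ordered */) {
    const int tiles_per_row = width / 8;
    for (int ty = 0; ty < height / 8; ++ty) {
        for (int tx = 0; tx < tiles_per_row; ++tx) {
            uint8_t* tile = dst + (ty * tiles_per_row + tx) * 64;
            for (int r = 0; r < 8; ++r)
                // Row r of the tile lands in bytes [r*8, r*8+8) of the tile.
                std::memcpy(tile + r * 8,
                            src + (ty * 8 + r) * stride + tx * 8, 8);
        }
    }
}
```

In practice this repack would be folded into the copy/pad pass the engine already performs, as described next.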

The TILE format requires data rearrangement. First, the input data arrives in a regular planar format, RGB or YUV 420, and conversion means removing the stride. Most inference engines need to make one copy anyway: they don’t operate directly on the raw input, they need pad (boundary padding), and possibly a format conversion, so the rearrangement can be folded into that pass. The permute instruction is very fast, executing about as fast as memcpy, so the cost of merging the rearrangement in is negligible. Second, the data touched during computation is not always address-aligned, so blocks addressed by unaligned motion vectors may need to be rearranged as well. The diagram above shows how to convert the input format more efficiently: instead of reading 8 bytes at a time, 8 rows per block, you read 64 bytes at a time across 8 rows, which, once rearranged, yields eight 8×8 blocks. That gives the highest efficiency, with instruction-level parallelism saturated. Now let’s look at an example of address rearrangement; this is the simplest case, and you can try more cases yourself.

Assume block A is at address (0,0), adjacent to blocks B, C, and D, all spaced 8 apart, with aligned addresses. I need to access block G at (6,5), i.e. I have a motion vector pointing to unaligned territory. In 420 format this is just a misaligned access, or a simple aligned splice. In TILE format there is a little more work. First, splice E, that is, block E is spliced from the two blocks A and C, and block F is spliced from the two blocks B and D. These two blocks still differ from G: they sit on the same horizontal band, and they must be rearranged again along the x coordinate. With today’s AVX512 instructions this takes three instructions to move the 64 bytes; if a future instruction does it in one go, so much the better. Obviously the TILE format was not considered when AVX512 was designed. Also notice that the instruction is suffixed with an X, meaning it is an extended form. The basic Intel align instruction only takes a constant shift count (for example, a fixed 5), but in practice the shift amount is a variable, so the constant form is of little use. The equivalent implementation below builds a variable-shift align out of two permutes of the vectors a and b.
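
As a scalar reference for the splice just described, here is a hedged sketch; the tile layout and the (dx, dy) convention are assumptions for illustration, and the vector versions replace these inner loops with a handful of permute/align and blend-style instructions as discussed above and below.

```cpp
// Scalar reference: splice a misaligned 8x8 block out of four aligned TILEs.
// A contains the block's origin, B is its right neighbour, C is below A,
// and D is below B; (dx, dy) is the offset of the wanted block inside A.
#include <cstdint>

void splice_tile(const uint8_t A[64], const uint8_t B[64],
                 const uint8_t C[64], const uint8_t D[64],
                 int dx, int dy, uint8_t G[64]) {
    uint8_t E[64], F[64];
    // Vertical splice: rows dy..7 of A followed by rows 0..dy-1 of C
    // (same for B/D), i.e. a shift over the concatenated 64-byte tiles.
    for (int r = 0; r < 8; ++r) {
        const uint8_t* srcE = (r + dy < 8) ? A + (r + dy) * 8 : C + (r + dy - 8) * 8;
        const uint8_t* srcF = (r + dy < 8) ? B + (r + dy) * 8 : D + (r + dy - 8) * 8;
        for (int c = 0; c < 8; ++c) { E[r * 8 + c] = srcE[c]; F[r * 8 + c] = srcF[c]; }
    }
    // Horizontal splice: within each row, bytes dx..7 come from E and the
    // remaining dx bytes come from F.
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            G[r * 8 + c] = (c + dx < 8) ? E[r * 8 + c + dx] : F[r * 8 + c + dx - 8];
}
```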

On Qualcomm’s HVX it is much easier. Qualcomm also executes 5 instructions, but the first 4 belong to a class called valign, which supports both constant and variable shift counts and performs exactly this action. Finally there is a vswap operation, similar to AVX512’s blend, that merges the results into one block. These are low-latency instructions, much faster than going to memory, so the cost is basically small. If the subsequent processing is further optimized so that this cost is amortized, the effect is even better.

02. Compute pipelining

Now, let’s look at the second part. With whole-frame raster scanning, the cache essentially never hits, and the cache hierarchy on mobile is slow to begin with. Generally speaking, according to our experimental data, the power consumed by DDR memory accesses is larger than the power consumed by the chip’s compute instructions. To solve this, we want to minimize all the intermediate data in the process and improve cache utilization. How? By designing a super pipeline: the filters at stages S0 to S4 execute in parallel on the pipeline. Parallelism here means the data-processing order changes from a per-frame sequential dependency to a per-row sequential dependency, so that from the frame’s point of view the filters run in parallel. This way the working data basically stays in the cache, bus accesses are few, and power consumption is very low.

Whether we use tiles or rows, super pipelining applies. With 420-format data, the pipeline unit is one row per action. With the TILE format you need a second-level structure: a row of tiles (a TILE row) on top of the individual tiles.

Before the engine optimization proper, note that data prefetching is very important, whether for an ordinary engine or an optimized, specialized, tightly packed one. The data cache on a mobile chip is very small, and without prefetching it will basically always miss. Line scanning is itself a pipeline structure: first prefetch a row before the loop starts, then inside the loop prefetch the next row and process the current row. The process_row function takes a long time, typically more than 1000 clocks, while loading data from the bus into the cache takes roughly 100 clocks, maybe 80 on a better machine and 150 on a worse one. So if process_row finishes too quickly, prefetching is not effective enough: if prefetch_row has not completed by the time the data is needed, the prefetch has failed. But when you are processing a whole row, there is plenty of time. With many filters running at once and a multi-level pipeline, data prefetching is far more effective than in a raster scan. As shown in the figure, the prefetches proceed in order: for the black line, the prefetch and the data load coincide, so after that one prefetch no further prefetching is needed, because the intermediate data is small; of the n prefetches issued, effectively only the first one does real work. In the figure, the prefetch pointer points to the green line, the red line follows one row of delay behind, and the yellow lines are the rows already processed. Prefetching is essential: without it the engine cannot reach very high performance.
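
A minimal sketch of that loop structure, assuming a planar layout and using the GCC/Clang __builtin_prefetch hint; prefetch_row, process_row, and the cycle figures mentioned above are illustrative.

```cpp
// Sketch of the row-processing loop with software prefetch: warm up the
// first row, then prefetch row y+1 while processing row y, so the long
// process_row body hides the DDR load latency of the next row.
#include <cstdint>

static inline void prefetch_row(const uint8_t* row, int width) {
    for (int x = 0; x < width; x += 64)          // one hint per cache line
        __builtin_prefetch(row + x, /*rw=*/0, /*locality=*/1);
}

void filter_plane(const uint8_t* src, uint8_t* dst, int width, int height,
                  int stride,
                  void (*process_row)(const uint8_t*, uint8_t*, int)) {
    prefetch_row(src, width);                    // prefetch the first row
    for (int y = 0; y < height; ++y) {
        if (y + 1 < height)
            prefetch_row(src + (y + 1) * stride, width);  // next row
        process_row(src + y * stride, dst + y * stride, width);  // current row
    }
}
```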

Next is the design of the super pipeline itself. The simple example here has four levels. The first level is the load (1×1 prefetch), green is a 5×5 convolution, blue is a 3×3 deconvolution (assume there is one), and yellow is 7×7 post-processing. Each level clearly has a data dependency on the one before it; if its tap requirement is not met, the subsequent operation basically cannot proceed, so you need scheduling logic here. Looking at the code in the figure above, processing one line at a time essentially means using four pointers to process functions, executed one at a time from top to bottom. For example, right after the load you try to execute line 2, and data completeness is obviously not satisfied: the second-level filter is 5×5, so step 6 cannot execute until step 5 has been processed, and it fails. So you continue to 5, then try 6 and find the data complete, so 6 executes. After 6, the next level tries 9, which obviously doesn’t work, so you break out of 6 and go back to execute 7. Once that is processed, 8 becomes complete; trying again, 9 is now satisfied. Then you go down to 15, which is not satisfied, so you go back to 11, then 12, then 13 and 14. Since this level is an upsampling operation, it can process two rows, after which 15 can execute. After that, 20 is not satisfied, so you return to 16, and the logic just keeps repeating. Execution is far more efficient than the original whole-frame processing, and the code is very simple. As you can see, the code is actually quite beautiful, and the more beautiful the code, the more efficient it tends to be.
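
The scheduling logic can be sketched roughly as follows; Stage, try_process_row, and the fall-back-to-the-top policy are a simplified reading of the figure, not the engine’s exact code.

```cpp
// Minimal super-pipeline driver: each stage exposes try_process_row(),
// which returns false when the input rows it depends on (its filter taps)
// are not ready yet; the driver then falls back to earlier stages so they
// can feed it more rows.
#include <vector>
#include <cstddef>

struct Stage {
    virtual bool try_process_row() = 0;   // false = dependencies not ready
    virtual ~Stage() = default;
};

void run_pipeline(std::vector<Stage*>& stages, int output_rows) {
    int produced = 0;
    while (produced < output_rows) {
        // Walk from the first stage (load) towards the last (post-process);
        // when a stage fails, break and restart from the top.
        for (std::size_t s = 0; s < stages.size(); ++s) {
            if (!stages[s]->try_process_row()) {
                if (s == 0) return;       // no more input rows at all
                break;
            }
            if (s + 1 == stages.size())
                ++produced;               // the last stage emitted one row
        }
    }
}
```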

Whether the data is in 420 format or TILE format, the data dependencies are the same. Tiles are a special format with a few particularities. First, prefetching should be done not per row of pixels but per TILE block. In the figure, the green pointer is the prefetch pointer, the red pointer is the processing pointer, and the yellow blocks are already processed. When process_tile_row processes a row of tiles, it prefetches the first TILE outside the loop, then inside the loop prefetches the next TILE and processes the current one. Since a TILE is 8×8, the maximum filter tap it supports is 17, which is easy to work out: the center point plus 8 on each side. Neither convolution kernels nor filters get that big; 7×7 and 9×9 are the limit, and 3×3 or 5×5 is generally enough. If a filter’s dependencies for the next TILE are not satisfied, it returns fail and the scheduler falls back; otherwise it continues.

In TILE format, a single TILE row is equivalent to eight of the original rows, so the cache consumption is larger. Generally speaking, a single 1080p channel consumes about 50 KB in the pipeline described above. Only 50 KB is probably far less than you would expect: I know plenty of engines that consume 200 or 300 MB of data buffers during processing. If you design an engine this way and it only consumes tens of kilobytes of working data, that is remarkable. In reality there is also format conversion, input/output conversion, and a minimum buffer of 2 frames. If the intermediate data is floating point, the consumption becomes 50 KB × 4 = 200 KB. In that case the tile row is split further into what we call Segments, each a quarter of the original width, so the pipeline consumption also drops to a quarter, back within the L2 cache budget. This is a very simple transformation.

03. Operator optimization

We’ll wrap up the TILE format and pipelining quickly and move on to the third part. The first technique is operator merging. For operators whose computation is tiny, the read/write cost dwarfs the compute: format conversions (int8 to int32, float to int), simple addition and subtraction, even square roots. A software-emulated square root is expensive, but as a hardware instruction its delay is small. The idea is this: take a function that is just an add, executed in a loop over a row. Load a takes 5 cycles, load b takes 5 cycles, the add takes 1 cycle, and the store takes 5 cycles. That is already slow, and if the loads are not prefetched they could cost 50 or 100 cycles each, which is a very bad situation we see in a lot of production code; with prefetching you still pay about 5 cycles. On the right of the figure is a square-root function that does a similar thing: read, compute, write. Say the square root takes 6 cycles. Here we perform three passes: take the square root of A, take the square root of B, and add the two results, so the instruction cost adds up to 48.

So we define a new function that combines all of these operations, written in a single line of code. Counting the cost again, load A and load B are 5 each and the total comes to about 27, so it is nearly twice as fast. A very simple idea.
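
Here is a before/after sketch of the merge, under the assumption that the operator is the element-wise sqrt(a) + sqrt(b) described above; the names are illustrative.

```cpp
// Operator merging: three element-wise passes (sqrt(a), sqrt(b), add)
// become one pass, so each element is loaded twice and stored once instead
// of being loaded/stored once per pass plus temporaries.
#include <cmath>
#include <cstddef>

// Unmerged: three loops, two intermediate buffers, 3x the memory traffic.
void sqrt_add_naive(const float* a, const float* b, float* tmp_a,
                    float* tmp_b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) tmp_a[i] = std::sqrt(a[i]);
    for (std::size_t i = 0; i < n; ++i) tmp_b[i] = std::sqrt(b[i]);
    for (std::size_t i = 0; i < n; ++i) c[i] = tmp_a[i] + tmp_b[i];
}

// Merged: one loop, no intermediate buffers.
void sqrt_add_fused(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = std::sqrt(a[i]) + std::sqrt(b[i]);
}
```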

The same applies to multi-channel work. Convolutions generally come in groups of 8, 16, or 32. Take a convolution with AVX512’s DPBUSD: in the 3×3 case the 9 taps don’t fit the 4-byte groups symmetrically, so they are spread three per vector across three vectors, occupying the lower three bytes of each 4-byte slot, and the pixel data is organized the same way. Each vector is processed with DPBUSD, executed three times to get the sum. There are other costs too, such as reading the data in, and scheduling has a cost, so you want to do all the convolution sums in one go. The data needs a swap/rearrangement first; once rearranged, it is best kept in registers and consumed at once. The example here uses eight channels, with the convolution kernels treated as a const array, processed one loop at a time; when it’s done, the result is written out directly.
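
As a rough illustration of the accumulation step only, here is a sketch assuming AVX512-VNNI is available and assuming the pixel and weight bytes have already been rearranged as described (three taps in the low three bytes of each 4-byte slot, the fourth byte zero); the data rearrangement itself is not shown.

```cpp
// 3x3 int8 convolution accumulation with AVX512-VNNI: each dpbusd
// multiplies 4 unsigned pixel bytes by 4 signed weight bytes per 32-bit
// lane and adds the dot product into that lane's accumulator.
#include <immintrin.h>

__m512i conv3x3_dpbusd(__m512i px_row0, __m512i px_row1, __m512i px_row2,
                       __m512i w_row0,  __m512i w_row1,  __m512i w_row2) {
    __m512i acc = _mm512_setzero_si512();
    acc = _mm512_dpbusd_epi32(acc, px_row0, w_row0);  // taps of row 0
    acc = _mm512_dpbusd_epi32(acc, px_row1, w_row1);  // taps of row 1
    acc = _mm512_dpbusd_epi32(acc, px_row2, w_row2);  // taps of row 2
    return acc;   // 16 int32 partial sums, one per output position
}
```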

Operator fusion raises a problem. When we build a model, a single element, operation, or function on its own executes statically and sequentially. Once you start combining them, there are many, many permutations: if we support 180 operations and even half of them can be fused, the number of combinations is astronomical, and a static engine cannot cover them all. So when a model is loaded, dedicated execution code is built for that model. In terms of when the build happens, there are three options. The first is on the development side: compile on a Mac laptop or a server, generate an SO directly, and ship that SO inside the APK. The second is on the device: the mobile ARM CPU runs the compilation process, combines the operator source code, and finally calls the LLVM compiler to generate optimized code; OpenCL compiles this way for the GPU, and the same is available on the DSP. Which one to choose is really up to you; from a security point of view you need to distinguish between Host and Device, otherwise doing it on the Host is fine. But there is a problem, as anyone who has used OpenCL knows: compiling code on the device takes quite a long time, usually a few hundred milliseconds. When we run A/B experiments, an app start-up that is a few hundred milliseconds longer can be fatal. If that is too painful, there is a third way: the optimized binary does not have to come from a compiler at run time. We can convert each operator into binary data (an array of unsigned char) ahead of time, and then simply splice these operator binaries together according to the model before running, with no optimization pass needed at that point. Given that the spliced code is already highly optimized, the splicing cost is negligible, generally on the order of a millisecond.
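
The binary-splicing approach concatenates pre-optimized machine code, which is hard to show briefly; as a much simplified stand-in for the same load-time idea, here is a sketch that merely builds a per-model chain of pre-compiled operator kernels from the model description. All names here are illustrative, and this is not the splicing mechanism itself.

```cpp
// Simplified stand-in: build a per-model execution chain from a table of
// pre-built operator kernels at model-load time. The real binary-splicing
// scheme goes further and concatenates the operators' machine code so even
// the per-call indirection disappears.
#include <vector>
#include <functional>

using OpKernel = std::function<void(float* buf, int len)>;

struct ModelDesc { std::vector<int> op_ids; };   // operators in model order

std::vector<OpKernel> stitch(const ModelDesc& model,
                             const std::vector<OpKernel>& op_table) {
    std::vector<OpKernel> chain;
    chain.reserve(model.op_ids.size());
    for (int id : model.op_ids)
        chain.push_back(op_table[id]);   // pick the pre-built kernel per op
    return chain;                        // built once at load time
}

void run(const std::vector<OpKernel>& chain, float* buf, int len) {
    for (const auto& op : chain) op(buf, len);
}
```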

04. Operator bypass optimization

Next is operator bypass optimization. So far we have discussed how to make the cache more efficient and consume less power, and how to make operators execute faster, without touching the operators’ algorithms themselves. Bypass optimization rests on a basic principle. On one hand, simple features, such as the gradient direction, or an 8×8 block split into two regions by an edge, can be obtained with a simple fast algorithm; no complex model is needed, and the cost-effectiveness is high. On the other hand, complex features can only be computed with the deep method. The two are complementary, and under certain conditions they are interchangeable: the fast algorithm can replace the deep one. That is, for data blocks where the simple algorithm’s output has acceptable error, or zero error, local substitution is feasible: swapping one block does not affect the overall network output. In general you need an algorithm to decide when to substitute, for example checking the motion vector or whether the residual is zero, and that prediction has a cost which must be far lower than the benefit. If the replacement makes things 3 times faster at a 10% extra cost, the trade-off is clearly worth it; otherwise it isn’t worth doing. I have personal experience with an algorithm where I did texture-complexity prediction using the TV (total variation) method, but the implementation was problematic: for a 720p frame the decision alone cost 3 to 6 ms, about the same as simply running the whole algorithm without any bypass, so with the prediction added the overall speed was actually slower.

Next, texture complexity analysis. Complexity analysis in general is a TV computation, and it can be implemented very quickly at very little cost. Take an 8×8 block: to decide whether it is flat, first compute the average AVG, then the deviation of each point from the mean (variance, or equivalently the absolute difference); if the total is below a threshold, or zero, the texture is flat. The nice consequence is that a convolution, which normally multiplies at every point, degenerates in the flat case: every input equals AVG, so factoring it out turns the sum on the right-hand side into a constant, and O(n²) or O(n³) degrades to O(1). There are other, more complicated cases where the idea is similar: if the bypass condition holds, you make the equivalent substitution. Consider another example, gradient direction detection, which is a TV computation along one direction. Say I take 8 points along the 45-degree direction; if the deviation of those 8 points from their mean is very small, or zero, then that direction is certainly an edge. You can also compute gradients in multiple directions with the Sobel or Laplace operator, but from a reliability point of view the TV method is recommended.
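
A minimal sketch of the flat-block test on one 8×8 TILE; the threshold and rounding are illustrative choices.

```cpp
// Flat-block test: compute the mean of the 64 pixels, then the total
// absolute deviation around it; below a threshold the block is treated
// as flat (exactly constant if the deviation is zero).
#include <cstdint>
#include <cstdlib>

bool is_flat_tile(const uint8_t tile[64], int threshold, int* avg_out) {
    int sum = 0;
    for (int i = 0; i < 64; ++i) sum += tile[i];
    const int avg = (sum + 32) / 64;               // rounded mean
    int tv = 0;
    for (int i = 0; i < 64; ++i) tv += std::abs(tile[i] - avg);
    *avg_out = avg;
    return tv <= threshold;
}
```

When the test passes, a convolution whose taps all fall inside the flat region collapses to AVG times the sum of the kernel weights, which is the O(1) case described above.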

Flat textures thus allow a significant degradation in cost. A similar O(1) degradation comes from motion compensation. A codec only needs to compensate pixels, but a deep inference engine must compensate every link in the chain: if only the pixels are copied, the intermediate data of the surrounding network cannot be ignored and the network output is no longer equivalent. So adequate compensation is needed at every level. The first is pixel compensation: if the residual between two data blocks is 0, then after filtering each block we can take the residual between the results to be 0 as well, so once you have computed F(b0) you do not need to compute F(b1), you just copy it. That is an O(1) operation. The second is convolution compensation. A convolution has intermediate outputs; if you only do pixel-level motion compensation and leave the intermediate data empty, the convolution’s pixel dependencies break and the network output is abnormal. Therefore the intermediate results of the convolution, that is, of the equivalent filter, also need motion compensation. Likewise, important intermediate outputs such as feature maps need to be compensated. All this copying has a cost, which you must weigh against the cost of simply doing the full computation. Generally speaking, on a PC, or wherever there is no power-consumption pressure and bandwidth is relatively high, the gains from motion compensation are large. On mobile platforms, if the memory bus is very slow, you have to weigh how complex the filter being replaced is; there is a balance point beyond which you gain, and below which the motion compensation is a loss.
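
Here is a sketch of the pixel-level skip on one TILE, assuming the filtered result of the reference block is already available; in a real network the same copy has to be applied to the convolution’s intermediate results and feature maps, as just described. Function names are illustrative.

```cpp
// Pixel-level motion-compensation skip: if the residual between the current
// block and its motion-compensated reference is zero, copy the reference's
// filtered output instead of re-running the filter.
#include <cstdint>
#include <cstring>

static bool residual_is_zero(const uint8_t cur[64], const uint8_t ref[64]) {
    return std::memcmp(cur, ref, 64) == 0;      // exact match, residual == 0
}

void filter_tile_with_mc(const uint8_t cur[64], const uint8_t ref[64],
                         const uint8_t ref_filtered[64], uint8_t out[64],
                         void (*filter)(const uint8_t* in, uint8_t* out)) {
    if (residual_is_zero(cur, ref))
        std::memcpy(out, ref_filtered, 64);     // O(1) copy instead of F(b1)
    else
        filter(cur, out);                       // fall back to the real filter
}
```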

05. Fast motion estimation

Now we move on to the final section, fast motion estimation. Motion compensation requires a block-match process, or optical flow, where every pixel needs a motion vector; we usually use 8×8 block motion vectors. Our algorithm can often obtain motion vectors from the decoder, but codec residuals can interfere with your algorithm enough that those vectors are unusable in most cases, so you have to derive motion vectors yourself. Full motion estimation as done in a regular encoder is extremely expensive, often slower than the deep algorithm itself, so there is no point obtaining vectors that way. In recent years, however, fast motion estimation algorithms have emerged, with a few characteristics. First, whereas the original algorithm used a large search window, the fast algorithm starts from an initialized prediction motion vector, and because of that prediction the window can be small. Suppose the match target is as shown in the figure above: the unoptimized search needs a large window, while the fast algorithm has a prediction vector and can make the window very small, meaning very few search steps before convergence. Because the prediction vector comes from a different source it does not necessarily fit exactly; unlike copying pixels, the vector cannot be used directly, so the search still has to run, but it is much faster than searching without a predicted window.

So, how do we get the prediction vector? Suppose we have a sequence, and MV1 from frame0 to frame1 has already been obtained by search. Now we want to search from frame0 to frame2. If we assume the current block moves at constant velocity, we can extrapolate the motion vector to frame2: MV2 is simply MV1 multiplied by 2, and we can center the window on that and search again. There are other methods as well, such as a reverse search from frame1 to frame0: the block containing that point can then be searched using this motion vector referred back to frame0, and frame2 can likewise be searched in reverse. There are many possibilities; a similar scheme should exist in the new VVC standard, and it is not complicated.
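
A tiny sketch of the constant-velocity extrapolation, with an illustrative clamp so the small search window stays inside the allowed range.

```cpp
// Prediction vector for frame0 -> frame2, extrapolated from the known
// frame0 -> frame1 vector under a constant-velocity assumption.
#include <algorithm>

struct MV { int x, y; };

MV predict_mv_frame2(MV mv1, int max_x, int max_y) {
    MV mv2 { mv1.x * 2, mv1.y * 2 };                   // constant velocity
    mv2.x = std::max(-max_x, std::min(max_x, mv2.x));  // clamp to the
    mv2.y = std::max(-max_y, std::min(max_y, mv2.y));  // allowed range
    return mv2;
}
```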

The second way to obtain predictions is the anchor method. What is an anchor? As shown in the figure, the data is divided into many tiles, but not all tiles are searched; we only search alternating rows and columns, a quarter of the total. That quarter of the tiles serve as anchors for motion search, and for the remaining three quarters the motion vector obtained by a neighboring anchor is used as the prediction vector. The order in which anchors are selected and processed depends on the architecture. On a GPU there are no timing dependencies between anchors, so they are all searched simultaneously. On a DSP or CPU using raster-scan order, the next anchor can reuse the previous anchor’s result, which is faster. Either way, searching only a quarter of the blocks keeps the cost relatively small.

Then there is the failure case, which has two possible causes: the prediction vector points in the wrong direction, or the window is not big enough. Either one can make the anchor search fail. In general we do not enlarge the window or spiral around changing directions; instead we run exactly the same search at low resolution. The search algorithm is unchanged, but since the resolution is a quarter, the effective window doubles in each dimension; the motion vector found on the low-resolution TILE is then multiplied by 2 and used to initialize the current TILE. We get a new window while avoiding a larger, slower search window. You can think of it as a 2-level pyramid, or a more complex 3-level one, but since we already have three prediction methods the payoff at level 3 is not that big; level 2 is enough.

The last piece is the search itself. We have motion vectors and windows, so how do we search? A traditional raster scan searches point by point, 256 positions in a 16×16 window, which is obviously unacceptable. We instead use a downhill search with a variable step size. The first step size is 4: within the 16×16 window, search a 13-point diamond, then take the point with the smallest residual as the new direction. The second pass changes the step size to 2, so the blue area becomes the new search window; if the result gets dramatically worse, the search fails. If the residual keeps shrinking, the step size drops to 1, which is the green window, and finally the motion vector with the minimum residual is output.
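
A simplified sketch of the variable-step search: for brevity it uses an 8-neighbour pattern per step instead of the 13-point diamond from the talk, and sad() is assumed to return the block residual for a candidate vector (handling out-of-range candidates itself).

```cpp
// Downhill search with variable step size: start coarse around the
// predicted vector, keep the lowest-residual candidate, halve the step,
// and stop at step 1; the best motion vector found is returned.
struct MV { int x, y; };

MV downhill_search(MV pred, int (*sad)(MV)) {
    MV best = pred;
    int best_cost = sad(best);
    for (int step = 4; step >= 1; step /= 2) {
        bool improved = true;
        while (improved) {                       // walk downhill at this step
            improved = false;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    MV cand { best.x + dx * step, best.y + dy * step };
                    int cost = sad(cand);
                    if (cost < best_cost) {
                        best_cost = cost; best = cand; improved = true;
                    }
                }
        }
    }
    return best;   // motion vector with the minimum residual
}
```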

06. Optimization gains

That concludes fast motion estimation. We have covered three optimization approaches; others are not discussed this time. In general, if you apply all three, the combined gain is significant: for instance 3× from the first, 3× from the second, and 4× from the third. The figure shows two examples. The first is super-resolution. With the traditional 420 format and AVX2 instruction optimization it runs at under 200 FPS (this algorithm is mobile-oriented; on a PC the numbers look scary). After optimization, using the TILE format but without motion compensation or texture analysis, it is very fast, reaching 1000 FPS or more; adding texture analysis and bypass analysis increases the speed by another 40 to 60 percent. The second example is the classical algorithm VBM3D, which also benefits enormously because we use the fast motion estimation: its main cost is in the block match, so speeding that up yields obvious gains. An open-source 420-format implementation with a 5-frame sequence and AVX2 optimization runs 1080p at about 1 FPS; with 8×8 TILEs and AVX512 it exceeds 100 FPS. Adding Early Skip mode helps further; unlike motion compensation, this is not a motion-reference technique but relates to the 3D DCT transform, where in some cases part of the data can be finished early. Early Skip roughly doubles the gain again, eventually reaching 220 FPS at 1080p. That basically reaches real-time: a video algorithm that ran at 1 FPS now runs at 220 FPS, which is the journey from offline to live.

That’s all for today’s share. Thank you very much.