This post summarizes the TCSVT 2020 article "A Low-power Motion Estimation Architecture for HEVC Based on a New Sum of Absolute Difference Computation."

In HEVC, motion estimation (ME) is the most power-consuming part of the encoder, and SAD computation accounts for about 50% of the power consumption of ME. This paper proposes a new SAD calculation method to reduce power consumption. (Note: in this paper, power consumption is equated with computational complexity.)

Two-stage fast IME

In integer motion estimation (IME), the PUs within a CU can be processed in parallel. Generally speaking, the optimal MV of a large block lies in a rectangular region determined by the maximum and minimum values of the optimal MVs of the small blocks composing it, as shown in Fig.1.

It can be seen that the large block's search window is restricted to this rectangular area, and the smaller search space reduces computation. Based on this, a two-stage algorithm can be used for motion estimation. In the first stage, Full Search (FS) is performed on all 4×4 sub-blocks of the CU. The search starting point of each sub-block, i.e., the MVP, is determined by the candidate MVs of the AMVP list, as shown in Fig.2. The search window is 64 pixels around the starting MVP. To reduce computational complexity and increase horizontal data reuse, the first stage scans the search window row by row.

In the second stage, the 4×4 sub-block MVs computed above limit the motion search range of each larger PU (the rectangular region in Fig.1), so the motion search of a large PU can be performed directly within this restricted rectangle. The second-stage search window is thus adapted to the result of the first-stage 4×4 PU motion estimation. In the second stage, a parallel IME (full search) computes the costs of all large PUs simultaneously; the cost of a large PU is obtained by adding up the SADs of its constituent sub-blocks.
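The window restriction can be sketched in a few lines of Python: the second-stage window of a large PU is the rectangle bounded by the minimum and maximum components of its 4×4 sub-block MVs. The function name and the optional margin parameter are illustrative assumptions, not the paper's interface.

```python
# Sketch (not the paper's hardware): derive the second-stage search window of
# a large PU from the optimal MVs of its 4x4 sub-blocks, as in Fig.1.

def second_stage_window(sub_block_mvs, margin=0):
    """Rectangular search region bounded by the min/max of the sub-block MVs.

    sub_block_mvs: list of (mvx, mvy) tuples from the first-stage 4x4 search.
    Returns (x_min, x_max, y_min, y_max), optionally padded by `margin`.
    """
    xs = [mv[0] for mv in sub_block_mvs]
    ys = [mv[1] for mv in sub_block_mvs]
    return (min(xs) - margin, max(xs) + margin,
            min(ys) - margin, max(ys) + margin)

# Example: the four 4x4 MVs of an 8x8 PU.
mvs = [(3, -1), (5, 0), (4, 2), (6, -2)]
print(second_stage_window(mvs))  # (3, 6, -2, 2)
```

The second stage then runs a full search only inside this rectangle, which on average contains far fewer points than the first-stage window.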

Experiments were performed on HM13.0 with test sequences classB to classE, configured as LDP with QP = {22, 27, 32, 37} and a first-stage search range of [-64, 64]. The results show that the average number of search points for large PUs in the second-stage IME is 70.

Constructing the usual AMVP candidate list requires the motion information of neighboring CUs, which prevents the individual CUs in a CTU from being processed in parallel. The proposed method solves this problem. As shown in Fig.1, the first stage already provides the motion information required by the second-stage IME (the location and size of the large-PU search window). In addition, the 4×4 PU MVs obtained in the first stage can serve as AMVP candidates in the second stage; that is, the MVs of the 256 4×4 blocks in a CTU can be used as MVP candidates, forming a Super Advanced Motion Vector Prediction (SAMVP) list. First, these 256 MVPs are used to determine the search center for all PUs in the CTU: the candidate with the minimum 64×64 SAD cost in the SAMVP list is selected as the search center of the second-stage IME. Second, the MVPs in the SAMVP list can be used to encode the motion information. Usually, the MV of the 4×4 sub-block in the upper-left corner of a large PU is used as its MVP; for example, the 8×16 PU in Fig.3 uses MV11 as its MVP.

There are 256 MVPs in the SAMVP list. By reusing the SADs of the small sub-blocks to compute the costs of large PUs, only 256 SAD costs need to be calculated.
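As a sketch of this SAD reuse (names and the data layout are illustrative assumptions, not the paper's), the cost of a large PU at each SAMVP candidate is the sum of the first-stage 4×4 SADs of its sub-blocks, and the candidate with the minimum summed cost becomes the second-stage search center:

```python
def large_pu_cost(sad4x4, mv, blocks):
    """Cost of a large PU at MV `mv` = sum of its 4x4 sub-block SADs.

    sad4x4[mv][(bx, by)] is assumed to hold the first-stage SAD of the 4x4
    sub-block at grid position (bx, by), evaluated at MV `mv`.
    """
    return sum(sad4x4[mv][b] for b in blocks)

def best_samvp_candidate(sad4x4, candidates, blocks):
    """Select the SAMVP candidate with the minimum summed SAD cost."""
    return min(candidates, key=lambda mv: large_pu_cost(sad4x4, mv, blocks))

# Toy example: two candidate MVs, a PU made of two 4x4 sub-blocks.
sad4x4 = {(0, 0): {(0, 0): 10, (1, 0): 12},
          (1, 0): {(0, 0): 3,  (1, 0): 4}}
blocks = [(0, 0), (1, 0)]
print(best_samvp_candidate(sad4x4, [(0, 0), (1, 0)], blocks))  # (1, 0)
```

No per-pixel SAD is recomputed for the large PU; only additions over already-stored sub-block SADs are needed.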

In addition to achieving higher coding performance with SAMVP, the two-stage parallel IME also achieves a higher data-reuse rate. Only 4 pixels need to be loaded for each new search point in the first stage, and 64 pixels in the second stage. Since the second-stage search window contains an average of 70 search points, the loading overhead is small. An experimental comparison between a traditional one-stage parallel IME and the two-stage parallel IME of this paper shows that the reference-pixel load of the traditional method is 6 times that of the proposed algorithm.

Low-complexity SAD calculation method

This paper proposes a new SAD calculation method that reduces the computational complexity by roughly half.

The SAD between the M×N block at position (l,k) in the current frame and a candidate block in the reference frame is computed as:

SAD(l,k;i,j) = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} |C(l+n, k+m) − R(l+n+i, k+m+j)|    (1)

where C and R denote current-frame and reference-frame pixels (first coordinate horizontal), so the block has N columns of M pixels each.

Here (i,j) is the displacement between the reference block and the current block. Formula (1) can be transformed as follows:

SAD(l,k;i,j) = −2 · Σ_{C<R} (C(l+n, k+m) − R(l+n+i, k+m+j)) + Σ C(l+n, k+m) − Σ R(l+n+i, k+m+j)    (2)

where the first sum runs only over the pixel positions with C < R, and the last two sums run over all M×N pixel positions of the block.

The second and third terms are the pixel sums of the current block and the reference block, denoted Sc and SR(i,j) respectively. The first term accumulates the differences C − R at the positions where the current-block pixel is smaller than the reference-block pixel, denoted SDCR(i,j). Dropping (l,k) to simplify the notation, formula (2) is rewritten as:

SAD(i,j) = −2·SDCR(i,j) + Sc − SR(i,j)    (3)

The SAD between the current block and the reference block at displacement (i+1,j) is:

SAD(i+1,j) = −2·SDCR(i+1,j) + Sc − SR(i+1,j)    (4)

Comparing formulas (4) and (3), some calculations are clearly repeated. First, Sc is the same for all (i,j), so it need not be recomputed while searching for the smallest SAD; it suffices to find the displacement that minimizes −2·SDCR(i,j) − SR(i,j).
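A quick numerical check that the decomposition SAD = −2·SDCR + Sc − SR of formula (3) holds; toy 1-D pixel rows stand in for a block, and the helper names are illustrative:

```python
# SDCR sums (c - r) only over positions with c < r, so it is non-positive.

def sad_direct(cur, ref):
    """Direct SAD: sum of absolute differences."""
    return sum(abs(c - r) for c, r in zip(cur, ref))

def sad_decomposed(cur, ref):
    """SAD via formula (3): -2*SDCR + Sc - SR."""
    sdcr = sum(c - r for c, r in zip(cur, ref) if c < r)  # negative part only
    sc = sum(cur)
    sr = sum(ref)
    return -2 * sdcr + sc - sr

cur = [10, 7, 3, 8]
ref = [6, 9, 5, 8]
assert sad_direct(cur, ref) == sad_decomposed(cur, ref)  # both equal 8
```

Since Sc is fixed, minimizing SAD over all displacements reduces to minimizing −2·SDCR − SR, exactly as stated above.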

Second, SR(i+1,j) can be computed incrementally:

SR(i+1,j) = SR(i,j) + Σ_{m=0}^{M−1} R(l+N+i, k+m+j) − Σ_{m=0}^{M−1} R(l+i, k+m+j)    (5)

i.e., the sum of the entering pixel column is added and the sum of the leaving column is subtracted.

When calculating SR at a search point, the already-computed SR of its neighboring search point can therefore be reused: instead of M×N − 1 additions, only about 2M additions/subtractions are needed.
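A minimal sketch of this incremental SR update: maintain column sums of the reference rows, then slide the window one pixel at a time, adding the entering column sum and subtracting the leaving one. Function and argument names are assumptions for illustration.

```python
def sliding_sr(ref_rows, n):
    """Yield SR for every horizontal displacement of an M x n window.

    ref_rows: M reference pixel rows of equal width.
    n: window width in pixels (columns).
    """
    m = len(ref_rows)
    width = len(ref_rows[0])
    # Column sums: each costs M-1 additions, computed once per column.
    col = [sum(ref_rows[r][c] for r in range(m)) for c in range(width)]
    sr = sum(col[:n])            # SR at displacement 0
    yield sr
    for i in range(width - n):   # displacements 1 .. width - n
        sr += col[i + n] - col[i]  # add entering column, drop leaving column
        yield sr

rows = [[1, 2, 3, 4],
        [4, 0, 2, 1]]
print(list(sliding_sr(rows, 2)))  # [7, 7, 10]
```

Each step touches only two column sums, matching the ~2M add/sub cost per search point instead of re-summing all M×N pixels.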

Finally, the number of operations required to calculate SDCR at adjacent search points can also be reduced. SDCR involves only two basic operations: compare and add. A comparison can be implemented as a subtraction followed by a sign-bit check, and only when (c − r) is negative (sign bit = 1) does the difference need to be added to the final result.

A one-dimensional example illustrates the process. Assume (C0, C1, …, C_{N−1}) are the N pixels of the current row and (R0, R1, …, R_{S+N−2}) is a reference pixel row covering S search points. Computing SDCR requires N comparison operations per search point. However, the results of previous search points can be reused through prediction. For example, if C0 ≥ C1 is already known and C1 ≥ R1 was obtained at the first search point, then C0 ≥ R1 follows directly without a comparison, as shown in Fig.4.

Assume Ti, i = 0, 1, …, S−1, denotes the i-th search point. The pre-comparison result of Ci and Ci+1 is written CiCi+1, as shown in Fig.4 (note that CiCi+1 only needs to be computed once per current frame). Combining CiCi+1 with the result Ci+1Ri+1 obtained at the previous search point yields the comparison result CiRi+1. A result of 1 means the difference is negative; otherwise it is non-negative. In particular, if CiCi+1 = 0 and Ci+1Ri+1 = 0, then Ci ≥ Ri+1 follows without any extra comparison.
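The comparison-reuse rule can be modeled in a few lines. This toy (not the paper's hardware) counts how many comparisons the transitivity trick skips during a 1-D scan; all names are illustrative:

```python
# If C_i >= C_{i+1} (pre-comparison bit 0) and C_{i+1} >= R (bit 0 from the
# previous search point), then C_i >= R follows with no subtraction.

def count_real_comparisons(cur, ref):
    """Return (real, skipped) comparison counts for a 1-D full search."""
    n, skipped = len(cur), 0
    adj = [cur[i] < cur[i + 1] for i in range(n - 1)]  # pre-comparisons CiCi+1
    prev = [c < r for c, r in zip(cur, ref)]           # first point: n real cmps
    real = n
    for d in range(1, len(ref) - n + 1):               # remaining search points
        bits = []
        for i in range(n):
            if i < n - 1 and not adj[i] and not prev[i + 1]:
                bits.append(False)  # inferred C_i >= R_{i+d}: no comparison
                skipped += 1
            else:
                bits.append(cur[i] < ref[i + d])       # real comparison
                real += 1
        prev = bits
    return real, skipped

# Monotone-decreasing current row against a flat reference row: most
# comparisons after the first search point are inferred.
print(count_real_comparisons([9, 5, 2], [1, 1, 1, 1, 1]))  # (5, 4)
```

Here prev[i+1] at displacement d is exactly the bit for C_{i+1} versus R_{i+d}, which is what makes the transitive inference for C_i versus R_{i+d} valid.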

Assume the joint probability of Ci ≥ Ci+1 and Ci+1 ≥ Ri+1 is p. After the N comparisons of the first search point, each of the remaining (S−1) search points requires on average 1 + (1−p)(N−1) comparison operations. If the (N−1) pre-comparison operations of the current row are included (cf. the rightmost-pixel comparisons C7R7, C7R8, …, C7R12 in Fig.4), the total number of comparisons required is:

T = N + (S−1)·[1 + (1−p)(N−1)] + (N−1)

Extending this to the two-dimensional case, assume the current block is M×N, the search window is S×S, and a full search is performed. The total number of comparisons required is:

C_cmp = M·S·{N + (S−1)·[1 + (1−p)(N−1)] + (N−1)}

since each of the M block rows repeats the one-dimensional process for each of the S vertical search positions.

Once the comparison results are available, only the negative (c − r) results need to be added into SDCR. Let q be the probability that (c − r) is negative. The number of additions needed to calculate SDCR is then:

A_SDCR = q·M·N·S²

The total number of operations needed to calculate SDCR is therefore:

O_SDCR = C_cmp + A_SDCR

Calculating SR requires 2M addition/subtraction operations per search point, and there are S×S search points in total, so the total number of additions is:

A_SR = 2M·S²

Finally, the number of operations required to find the current block's minimum SAD is:

O_proposed = C_cmp + q·M·N·S² + 2M·S²

The traditional method of finding the minimum SAD requires M·N·S² comparisons and (M·N − 1)·S² additions. The total number of operations required is:

O_traditional = (2·M·N − 1)·S²

The ratio of operations saved by the algorithm in this paper is then:

R = 1 − O_proposed / O_traditional    (6)

The ratio of reduced operations depends on the pixel statistics. To test the efficiency of the proposed algorithm, HM13.0 is used to encode the reference sequences classB to classE, configured as LDB with QP = 22, a ±64 search window, and a maximum CU of 64×64; hence M = N = 64 and S = 129. Experimental simulation gives average values p = 0.47 and q = 0.51 over all sequences. Substituting these into Formula (6) yields a reduction ratio of about 0.45; that is, the complexity of the SAD calculation is reduced by about 45%.
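Plugging the paper's statistics into the operation counts described above reproduces the reported reduction. This is a sanity-check sketch, assuming the counts as written; variable names are illustrative.

```python
# Operation-count check with M = N = 64, S = 129, p = 0.47, q = 0.51.

M = N = 64
S = 129
p, q = 0.47, 0.51

# Comparisons with prediction reuse (1-D count repeated per block row and
# per vertical search position).
cmp_ops = M * S * (N + (S - 1) * (1 + (1 - p) * (N - 1)) + (N - 1))
add_sdcr = q * M * N * S * S   # additions of the negative (c - r) terms
add_sr = 2 * M * S * S         # incremental SR updates, ~2M per search point
proposed = cmp_ops + add_sdcr + add_sr

# Traditional: MN comparisons plus (MN - 1) additions per search point.
traditional = (M * N + (M * N - 1)) * S * S

ratio = 1 - proposed / traditional
print(f"reduction = {ratio:.3f}")  # reduction = 0.455
```

The result, about 0.455, matches the paper's reported ~45% reduction in SAD operations.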
