Author: YanJianWen | Megvii (MegEngine) architect

Background

In digital signal and digital image processing, frequency-domain analysis is an important branch. Our everyday “processing” of images happens at the pixel level, i.e., on the image’s spatial-domain data. Spatial-domain data represent the details we can “read” directly. If we instead treat the same image as a signal and analyze its spectrum, we obtain the image’s frequency-domain data. Look at the following set of images (sources). The bright points in the frequency domain are the low-frequency signals, which carry most of the image’s energy, that is, its main information. The dark points are the high-frequency signals, which represent the image’s edges and noise. As this group of images shows, compared with the original Goofy, the low-frequency approximation of Goofy retains its “contour”, while boosting the high-frequency components makes the background noise much more visible. Frequency-domain analysis lets us understand the composition of an image and perform more abstract analysis and finer-grained processing.

Goofy and Degraded Goofy

The tool that converts images between the spatial and frequency domains is the Fourier transform. Since image data is spatially discrete, we use the DFT (Discrete Fourier Transform) and its inverse, the IDFT (Inverse Discrete Fourier Transform). Cooley and Tukey developed a faster algorithm based on the DFT: the FFT (Fast Fourier Transform).

DFT/FFT has further applications in digital imaging. For example, the DFT-based DCT (Discrete Cosine Transform) is used in the JPEG image-compression algorithm (source) and in image watermarking algorithms (source).

JPEG coding consists of color-space conversion, block sampling, DCT transformation, and quantized coding. The DCT transform separates the image’s low-frequency information from its high-frequency information; during quantization and coding, the low-frequency information is kept with little loss while much of the high-frequency information is discarded, which yields the size reduction. As the cat-face pictures show, image quality degrades as the compression ratio increases, but the subject information is still retained.

Cat face at different JPEG qualities (compression ratios)

The image watermarking algorithm converts the original image to the frequency domain through the DCT, embeds the watermark information at suitable positions, and converts back to the original image through the IDCT. The changes to the original image are thus hard to detect, yet the watermark can still be extracted by computation.

DFT/FFT also has extended applications in deep learning. For example, FFT can be used to reduce the amount of computation in convolution, and the resulting FFT_Conv algorithm has become a common deep learning convolution algorithm. In this article, we explore the principles of these frequency-domain algorithms and their optimization strategies.

Principle and optimization of DFT

The formula

Whether for multidimensional DFT operations or for the DCT/FFT_Conv built on DFT, the underlying computing unit is DFT_1D. The optimization of DFT_1D is therefore the basis for optimizing the entire family of FFT-class operators. The calculation formula of DFT_1D is:


X_{k}=\sum_{n=0}^{N-1} x_{n} e^{-j 2 \pi k \frac{n}{N}} \quad k=0, \ldots, N-1

$x_{n}$ is the input signal of length $N$, $e^{-j 2 \pi k \frac{n}{N}}$ is an $N$-th root of unity, and $X_{k}$ is the output signal of length $N$. The matrix form of the formula is:


\left[\begin{array}{c}X(0) \\ X(1) \\ \vdots \\ X(N-1)\end{array}\right]=\left[W_{N}^{n k}\right]\left[\begin{array}{c} x(0) \\ x(1) \\ \vdots \\ x(N-1)\end{array}\right]
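As a reference point, the matrix form translates directly into a naive $O(n^2)$ implementation. The sketch below (numpy; illustrative only, not the optimized kernels discussed later) builds the $\left[W_{N}^{nk}\right]$ matrix and checks the result against numpy’s FFT:

```python
import numpy as np

def dft_1d_naive(x):
    """Naive O(N^2) DFT_1D: X = [W_N^{nk}] @ x, with W_N = exp(-2j*pi/N)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * n / N)  # the [W_N^{nk}] matrix
    return W @ np.asarray(x, dtype=complex)

x = np.random.rand(8)
assert np.allclose(dft_1d_naive(x), np.fft.fft(x))
```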

Properties of unit complex roots

$W_{N}^{nk} = e^{-j 2 \pi k \frac{n}{N}}$ in DFT_1D is a complex $N$-th root of unity. Intuitively, the unit circle of the complex plane is divided into $N$ equal parts, and $W_{N}^{nk}$ steps around the circle by $nk$ of those parts (clockwise, because of the negative exponent).

The complex roots of unity are periodic and symmetric, which allows great simplification of the $W$ matrix and forms the basis of the fast algorithms for DFT_1D.

Periodicity: $W_{N}^{k+N}=W_{N}^{k}$

Symmetry: $W_{N}^{k+N/2}=-W_{N}^{k}$
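Both properties are easy to sanity-check numerically (a throwaway snippet of ours, not part of any library):

```python
import numpy as np

def W(N, k):
    """Complex root of unity W_N^k = exp(-2j*pi*k/N)."""
    return np.exp(-2j * np.pi * k / N)

N, k = 8, 3
assert np.isclose(W(N, k + N), W(N, k))         # periodicity
assert np.isclose(W(N, k + N // 2), -W(N, k))   # symmetry
```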

Cooley-Tukey FFT algorithm

Of the many fast algorithms for DFT_1D, the Cooley-Tukey FFT algorithm is the most frequently used. The algorithm follows a divide-and-conquer strategy: an input sequence of size N is decomposed into N/radix sub-sequences according to the chosen radix, and each sub-sequence is split again until it can no longer be divided. Each split produces a stage, and combining all stages bottom-up yields the final output sequence. Here we take N = 8, radix = 2 as an example to show the reasoning, where $x(k)$ is the sequence of N = 8 and $X^{F}(k)$ is its DFT output sequence. According to the DFT formula:


X^{F}(k)=W_{8}^{0} x_{0}+W_{8}^{k} x_{1}+W_{8}^{2 k} x_{2}+W_{8}^{3k} x_{3}+W_{8}^{4k} x_{4} + W_{8}^{5k} x_{5}+W_{8}^{6k} x_{6} +W_{8}^{7k} x_{7}

Splitting by even and odd terms gives two sequences of length 4, $G(k)$ and $H(k)$:


X^{F}(k)= W_{8}^{0} x_{0}+W_{8}^{2 k} x_{2}+W_{8}^{4 k} x_{4}+W_{8}^{6 k} x_{6}

+W_{8}^{k}\left(W_{8}^{0} x_{1}+W_{8}^{2 k} x_{3}+W_{8}^{4 k} x_{5}+W_{8}^{6 k} x_{7}\right)

=G^{F}(k)+W_{8}^{k} H^{F}(k)


X^{F}(k+4)=W_{8}^{0} x_{0}+W_{8}^{2(k+4)} x_{2}+W_{8}^{4(k+4)} x_{4}+W_{8}^{6(k+4)} x_{6}

+W_{8}^{(k+4)}\left(W_{8}^{0} x_{1}+W_{8}^{2(k+4)} x_{3}+W_{8}^{4(k+4)} x_{5}+W_{8}^{6(k+4)} x_{7}\right)

=G^{F}(k)+W_{8}^{k+4} H^{F}(k)

=G^{F}(k)-W_{8}^{k} H^{F}(k)

$G^{F}(k)$ and $H^{F}(k)$ are the DFT results of $G(k)$ and $H(k)$. Multiplying $H^{F}(k)$ by the corresponding twiddle factor $W_{8}^{k}$ and doing simple additions and subtractions with $G^{F}(k)$ yields the output $X^{F}(k)$. By the same token, $G(k)$ and $H(k)$ can be split in the same way: $A(k)$, $B(k)$, $C(k)$, $D(k)$ are the resulting sequences of N = 2, and combining their DFT results gives $G^{F}(k)$ and $H^{F}(k)$.


G^{F}(k)=A^{F}(k)+W_{4}^{k}B^{F}(k)

G^{F}(k+2)=A^{F}(k)-W_{4}^{k}B^{F}(k)

H^{F}(k)=C^{F}(k)+W_{4}^{k}D^{F}(k)

H^{F}(k+2)=C^{F}(k)-W_{4}^{k}D^{F}(k)

Finally, calculate the N = 2 sequences $A^{F}(k)$, $B^{F}(k)$, $C^{F}(k)$, $D^{F}(k)$. Since k = 0, the twiddle factor $W_{2}^{0} = 1$, so only additions and subtractions are needed:


\left[\begin{array}{l} A^{F}(0) \\ A^{F}(1) \end{array}\right]=\left[\begin{array}{ll} 1 & 1 \\ 1 & -1 \end{array}\right]\left[\begin{array}{l} x_{0} \\ x_{4} \end{array}\right]

\left[\begin{array}{l} B^{F}(0) \\ B^{F}(1) \end{array}\right]=\left[\begin{array}{ll} 1 & 1 \\ 1 & -1 \end{array}\right]\left[\begin{array}{l} x_{2} \\ x_{6} \end{array}\right]

\left[\begin{array}{l} C^{F}(0) \\ C^{F}(1) \end{array}\right]=\left[\begin{array}{ll} 1 & 1 \\ 1 & -1 \end{array}\right]\left[\begin{array}{l} x_{1} \\ x_{5} \end{array}\right]

\left[\begin{array}{l} D^{F}(0) \\ D^{F}(1) \end{array}\right]=\left[\begin{array}{ll} 1 & 1 \\ 1 & -1 \end{array}\right]\left[\begin{array}{l} x_{3} \\ x_{7} \end{array}\right]
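The whole N = 8 derivation can be verified numerically. A minimal sketch (ours, for illustration) that recombines the two-point DFTs exactly as above and checks against numpy:

```python
import numpy as np

W = lambda N, k: np.exp(-2j * np.pi * k / N)  # twiddle factor W_N^k
x = np.random.rand(8) + 1j * np.random.rand(8)

# N = 2 stage: W_2^0 = 1, so plain add/sub
A = np.array([x[0] + x[4], x[0] - x[4]])
B = np.array([x[2] + x[6], x[2] - x[6]])
C = np.array([x[1] + x[5], x[1] - x[5]])
D = np.array([x[3] + x[7], x[3] - x[7]])

# N = 4 stage: G(k) = A(k) + W_4^k B(k), G(k+2) = A(k) - W_4^k B(k); same for H
G = np.array([A[0] + B[0], A[1] + W(4, 1) * B[1],
              A[0] - B[0], A[1] - W(4, 1) * B[1]])
H = np.array([C[0] + D[0], C[1] + W(4, 1) * D[1],
              C[0] - D[0], C[1] - W(4, 1) * D[1]])

# N = 8 stage: X(k) = G(k) + W_8^k H(k), X(k+4) = G(k) - W_8^k H(k)
k = np.arange(4)
X = np.concatenate([G + W(8, k) * H, G - W(8, k) * H])
assert np.allclose(X, np.fft.fft(x))
```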

Drawing the algorithm as a graph, each stage produces a number of butterfly-shaped data flows, so the algorithm is also known as the butterfly algorithm. Below we introduce the basic components of the butterfly network, which will help the analysis that follows.

N=8 butterfly algorithm diagram

The N=8 computation is divided into 3 stages; each stage has one or more blocks (sections), and each section contains one or more butterflies. The butterfly computation is the kernel of the DFT operation. Each stage computes in the following order:

  • Read the input
  • Multiply by the twiddle factors
  • For each section, for each butterfly, run the radix-N kernel
  • Write the output

Look at the butterfly diagram for N=8. At stage 1, the operation is divided into 4 sections, each with butterfly_num = 1. At stage 2, section_num = 2 and butterfly_num = 2. At stage 3, section_num = 1 and butterfly_num = 4. Going from left to right, section_num decreases and butterfly_num increases, the butterflies “get bigger and denser”, while the total number of butterflies per stage stays the same. In fact, for an algorithm of length N and radix = r, at stage S we can deduce that:


S e c \text Sec
_
n u m \text num
=
N / r S N / r^ {S}


B u t t e r f l y \text Butterfly
_
n u m \text num
=
r S 1 r^{S-1}


S e c \text Sec
_
s t r i d e = r S \text stride =r^{S}


B u t t e r f l y \text Butterfly
_
s t r i d e stride
=
1 1

S is the current stage, and SEC /butterfly_stride is the interval of each section/butterfly.Copy the code
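To make the stage/section/butterfly structure concrete, here is a minimal iterative radix-2 sketch (ours, assuming N is a power of 2; names and structure are illustrative, not the MegEngine kernels). Its loops follow the sec_num/butterfly_num formulas above:

```python
import numpy as np

def fft_radix2(x):
    """Iterative radix-2 Cooley-Tukey FFT, organized as stage/section/butterfly."""
    N = len(x)
    stages = int(np.log2(N))
    # the first-stage input ordering in the diagram is the bit-reversal permutation
    rev = [int(bin(i)[2:].zfill(stages)[::-1], 2) for i in range(N)]
    y = np.asarray(x, dtype=complex)[rev]
    for S in range(1, stages + 1):
        sec_num = N >> S                 # N / r^S
        butterfly_num = 1 << (S - 1)     # r^(S-1)
        sec_stride = 1 << S              # r^S
        W = np.exp(-2j * np.pi * np.arange(butterfly_num) / sec_stride)
        for s in range(sec_num):
            base = s * sec_stride
            for b in range(butterfly_num):           # butterfly_stride = 1
                lo, hi = base + b, base + b + butterfly_num
                t = W[b] * y[hi]                     # twiddle multiply
                y[lo], y[hi] = y[lo] + t, y[lo] - t  # radix-2 kernel
    return y

x = np.random.rand(16) + 1j * np.random.rand(16)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```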

This algorithm reduces the complexity from O(n^2) to O(nlogn), which is efficient and elegant. On top of the butterfly algorithm, we classify and optimize the different radix algorithms in two main categories: radices that are powers of 2 and radices that are not.

Optimization for power-of-2 radices

In matrix form, the kernel of DFT_1D is the $W_{N}^{nk}$ matrix. Here we analyze the kernels of the power-of-2 radices.

As mentioned in the background, the matrix form of the DFT formula is:


\left[\begin{array}{c}X(0) \\ X(1) \\ \vdots \\ X(N-1)\end{array}\right]=\left[W_{N}^{n k}\right]\left[\begin{array}{c} x(0) \\ x(1) \\ \vdots \\ x(N-1)\end{array}\right]

where $x(0)$ through $x(N-1)$ are the inputs after multiplication by the twiddle factors $W_{N}^{kn}$.

When radix = 2, since $W_{2}^{1} = -1$ and $W_{2}^{2} = 1$, the DFT matrix form of radix_2 can be written as:

\left[\begin{array}{c}\mathrm{X}_{\mathrm{k}} \\ \mathrm{X}_{\mathrm{k}+\mathrm{N}/2}\end{array}\right]=\left[\begin{array}{cc}1 & 1 \\ 1 & -1\end{array}\right]\left[\begin{array}{l}\mathrm{W}_{\mathrm{N}}^{0} \mathrm{A}_{\mathrm{k}} \\ \mathrm{W}_{\mathrm{N}}^{\mathrm{k}} \mathrm{B}_{\mathrm{k}}\end{array}\right]

When radix = 4, since $W_{4}^{1} = -j$, $W_{4}^{2} = -1$, $W_{4}^{3} = j$, $W_{4}^{4} = 1$, the DFT matrix form of radix_4 can be written as:


\left[\begin{array}{c}\mathrm{X}_{\mathrm{k}} \\ \mathrm{X}_{\mathrm{k}+\mathrm{N} / 4} \\ \mathrm{X}_{\mathrm{k}+\mathrm{N} / 2} \\ \mathrm{X}_{\mathrm{k}+3 \mathrm{~N} / 4}\end{array}\right]=\left[\begin{array}{cccc}1 & 1 & 1 & 1 \\ 1 & -\mathrm{j} & -1 & \mathrm{j} \\ 1 & -1 & 1 & -1 \\ 1 & \mathrm{j} & -1 & -\mathrm{j}\end{array}\right]\left[\begin{array}{c}\mathrm{W}_{\mathrm{N}}^{0} \mathrm{A}_{\mathrm{k}} \\ \mathrm{W}_{\mathrm{N}}^{\mathrm{k}} \mathrm{B}_{\mathrm{k}} \\ \mathrm{W}_{\mathrm{N}}^{2 \mathrm{k}} \mathrm{C}_{\mathrm{k}} \\ \mathrm{W}_{\mathrm{N}}^{3 \mathrm{k}} \mathrm{D}_{\mathrm{k}}\end{array}\right]

Similarly, the kernel of radix_8 can be deduced as:


\left[\begin{array}{cccccccc}1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & \mathrm{~W}_{8}^{1} & -j & \mathrm{~W}_{8}^{3} & -1 & -\mathrm{W}_{8}^{1} & j & -\mathrm{W}_{8}^{3} \\ 1 & -j & -1 & j & 1 & -j & -1 & j \\ 1 & \mathrm{~W}_{8}^{3} & j & \mathrm{~W}_{8}^{1} & -1 & -\mathrm{W}_{8}^{3} & -j & -\mathrm{W}_{8}^{1} \\ 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ 1 & -\mathrm{W}_{8}^{1} & -j & -\mathrm{W}_{8}^{3} & -1 & \mathrm{~W}_{8}^{1} & j & \mathrm{~W}_{8}^{3} \\ 1 & j & -1 & -j & 1 & j & -1 & -j \\ 1 & -\mathrm{W}_{8}^{3} & j & -\mathrm{W}_{8}^{1} & -1 & \mathrm{~W}_{8}^{3} & -j & \mathrm{~W}_{8}^{1}\end{array}\right]

Let’s start with memory access. Modern processors are much better at optimizing compute than memory access, so in scenarios where the amounts of computation and access are comparable, access is usually the performance bottleneck.

In DFT_1D, the R-2 / R-4 / R-8 algorithms all perform the same number of accesses per stage: 2 * butterfly_num * radix = 2N. The number of stages, however, clearly differs between radices ($\log_2 N$ vs $\log_4 N$ vs $\log_8 N$).

Therefore, for DFT, a larger kernel gains a clear memory-access advantage without significantly increasing the amount of computation. From the kernels deduced above: each butterfly of the R-2 kernel corresponds to 4 load/store operations and 2 complex floating-point additions/subtractions. Each butterfly of the R-4 kernel corresponds to 8 loads/stores and 8 complex floating-point additions/subtractions (after merging identical operations); the computation grows slightly, but the stage count drops from $\log_2 N$ to $\log_4 N$, reducing the total number of memory accesses, so performance improves. Each butterfly of the R-8 kernel corresponds to 16 loads/stores, 24 complex floating-point additions/subtractions, and 8 floating-point multiplications. The floating-point multiplications raise the computation cost, and although the stage count drops further from $\log_4 N$ to $\log_8 N$, the reduction is not significant because N is rarely very large in practice, so the gain is limited.
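As a quick illustrative calculation (our numbers, not the article’s benchmarks): per-stage accesses are fixed at $2N$, so total accesses scale with the stage count. For $N = 4096$:

2N \log_{2} N = 2 \cdot 4096 \cdot 12 = 98304 \quad (r=2)

2N \log_{4} N = 2 \cdot 4096 \cdot 6 = 49152 \quad (r=4)

2N \log_{8} N = 2 \cdot 4096 \cdot 4 = 32768 \quad (r=8)

Going from R-2 to R-4 halves the accesses, while R-4 to R-8 saves only a further third, matching the diminishing return described above.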

Now let’s look at the computation cost. There are two common ways to reduce it: eliminating redundant operations and parallelization.

Taking the R-4 algorithm as an example, the kernel calculation is organized as follows:

  • radix_4_first_stage(src, dst, sec_num, butterfly_num)
  • radix_4_other_stage(src, dst, sec_num, butterfly_num)
    • for sec_num
      • for butterfly_num
        • radix_4_kernel

Since k = 0 for the data of radix_4_first_stage, the twiddle factor is 1, so this part of the complex multiplication can be omitted and optimized separately. In radix_4_other_stage, from the second stage onward, butterfly_num = 4^(s-1) is a multiple of 4, and each butterfly reads/stores at fixed intervals. The innermost loop can therefore be unrolled and vectorized to run 4 or more butterflies in parallel. Loop unrolling and SIMD instructions not only raise parallelism but also improve cache-line utilization, which brings a large performance improvement. On SM8150 (ARMv8), the parallel-optimized R-4 reaches 1.6x the performance of R-2.

Size: 1 * 2048 (R2C) Environment: SM8150 core
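A numpy analogue of this unrolling/vectorization (array slices stand in for NEON lanes; a sketch of ours, shown for radix-2 for brevity, not the MegEngine kernel):

```python
import numpy as np

def radix2_stage_vectorized(y, S):
    """One radix-2 stage with the innermost butterfly loop vectorized:
    all butterfly_num butterflies of a section are computed at once."""
    N = len(y)
    butterfly_num = 1 << (S - 1)
    sec_stride = 1 << S
    W = np.exp(-2j * np.pi * np.arange(butterfly_num) / sec_stride)
    for base in range(0, N, sec_stride):                # loop over sections
        lo = y[base : base + butterfly_num].copy()
        hi = y[base + butterfly_num : base + sec_stride]
        t = W * hi                                      # vectorized twiddle multiply
        y[base : base + butterfly_num] = lo + t
        y[base + butterfly_num : base + sec_stride] = lo - t
```

Dropping this in place of the scalar butterfly loop of the fft_radix2 sketch above gives the same result; on real hardware, the same shape maps onto unrolled NEON registers.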

In conclusion, for power-of-2 radices, choosing an appropriate radix reduces the memory-access cost of running many stages, and exploiting the properties of the complex roots of unity together with parallelization reduces the computation cost; together they bring a large performance improvement.

Optimization for non-power-of-2 radices

When the input length N = radix1^m1 * radix2^m2 * ... contains radices that are not powers of 2, performance drops sharply if the naive O(n^2) algorithm is used. Common solutions include zero-padding the input to a power-of-2 length, using a radix-N algorithm, or using a special algorithm (the Chirp-Z transform). Zero-padding to a power of 2 adds a lot of computation and storage for large inputs, while the Chirp-Z transform computes the DFT via convolution and the algorithm is complex. Therefore, directly optimizing non-power-of-2 radix-N is also necessary.

The calculation flow of radix-N is the same as that of the power-of-2 radices, and the periodicity and symmetry of the complex roots of unity can likewise simplify the kernel. Taking radix-5 as an example, the DFT kernel of radix-5 is:


\left[\begin{array}{ccccc} 1 & 1 & 1 & 1 & 1 \\ 1 & \mathrm{W}_{5}^{1} & \mathrm{W}_{5}^{2} & \mathrm{W}_{5}^{-2} & \mathrm{W}_{5}^{-1} \\ 1 & \mathrm{W}_{5}^{2} & \mathrm{W}_{5}^{-1} & \mathrm{W}_{5}^{1} & \mathrm{W}_{5}^{-2} \\ 1 & \mathrm{W}_{5}^{-2} & \mathrm{W}_{5}^{1} & \mathrm{W}_{5}^{-1} & \mathrm{W}_{5}^{2} \\ 1 & \mathrm{W}_{5}^{-1} & \mathrm{W}_{5}^{-2} & \mathrm{W}_{5}^{2} & \mathrm{W}_{5}^{1} \end{array}\right]

$W_{5}^{k}$ and $W_{5}^{-k}$ are symmetric about the x-axis of the complex plane: they have the same real part and opposite imaginary parts. Using this property, for each stage the common terms A, B, C, D below can be computed first, and the stage output then follows from them (here $W_{5}^{k} \cdot r$ and $W_{5}^{k} \cdot i$ denote the real and imaginary parts of $W_{5}^{k}$):


A=\left(x_{1}+x_{4}\right) * W_{5}^{1} \cdot r+\left(x_{2}+x_{3}\right) * W_{5}^{2} \cdot r

B=(-j) * \left[\left(x_{1}-x_{4}\right) * W_{5}^{1} \cdot i+\left(x_{2}-x_{3}\right) * W_{5}^{2} \cdot i\right]

C=\left(x_{1}+x_{4}\right) * W_{5}^{2} \cdot r+\left(x_{2}+x_{3}\right) * W_{5}^{1} \cdot r

D=j * \left[\left(x_{1}-x_{4}\right) * W_{5}^{2} \cdot i-\left(x_{2}-x_{3}\right) * W_{5}^{1} \cdot i\right]

X(k)=x_{0}+\left(x_{1}+x_{4}\right)+\left(x_{2}+x_{3}\right)

X(k+N/5)=x_{0}+A-B

X(k+2N/5)=x_{0}+C+D

X(k+3N/5)=x_{0}+C-D

X(k+4N/5)=x_{0}+A+B

This eliminates many repeated operations. Meanwhile, for stage >= 2, the butterfly loop is also unrolled and parallelized to further reduce the computation cost. The radix-5 idea extrapolates to radix-N. Each stage of radix_N computes as follows (a kernel sketch follows the list):

  • Read the input
  • Multiply by the corresponding twiddle factors
  • Compute the common terms; radix_N has N - 1 of them
  • Run the parallelized radix_N_kernel
  • Write the output
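A sketch of the radix-5 kernel built from the common terms above (ours, for illustration; the $\cdot r$ / $\cdot i$ parts become .real / .imag, and the inputs are assumed to be already multiplied by their twiddle factors). It is checked against numpy for a single length-5 DFT:

```python
import numpy as np

def radix5_kernel(x0, x1, x2, x3, x4):
    """Radix-5 butterfly using the shared terms A/B/C/D."""
    W51, W52 = np.exp(-2j * np.pi / 5), np.exp(-4j * np.pi / 5)
    s14, d14 = x1 + x4, x1 - x4          # symmetric / antisymmetric pairs
    s23, d23 = x2 + x3, x2 - x3
    A = s14 * W51.real + s23 * W52.real
    B = -1j * (d14 * W51.imag + d23 * W52.imag)
    C = s14 * W52.real + s23 * W51.real
    D = 1j * (d14 * W52.imag - d23 * W51.imag)
    return (x0 + s14 + s23,   # X(k)
            x0 + A - B,       # X(k +  N/5)
            x0 + C + D,       # X(k + 2N/5)
            x0 + C - D,       # X(k + 3N/5)
            x0 + A + B)       # X(k + 4N/5)

x = np.random.rand(5) + 1j * np.random.rand(5)
assert np.allclose(radix5_kernel(*x), np.fft.fft(x))
```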

Other optimizations

The two sections above describe the general optimization of DFT_1D; more detailed optimization can be done on this basis. You can refer to the papers cited at the end of this article.

  • For an all-real input, the imaginary part is 0, so the complex multiplications with the twiddle factors and inside radix_N_kernel contain redundant computation and storage. The split R2C algorithm treats the input as a complex sequence of length N/2, computes its DFT, and splits the result to obtain the DFT of the length-N real sequence (see the sketch after this list).
  • For the power-of-2 radix algorithms, recomputing the input/output strides of each stage to cancel the bit-reversal of the first stage further reduces memory-access cost.
  • For the radix-N algorithm with N = radix1^m1 * radix2^m2 under a mixed-radix framework, merging small radices into larger ones reduces the number of stages.
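For the first point, a minimal split R2C sketch (ours; it assumes even N and returns only the first N/2 outputs, since the rest follow from conjugate symmetry):

```python
import numpy as np

def rfft_via_split(x):
    """N-point real FFT via one N/2-point complex FFT (split R2C)."""
    N = len(x)
    z = x[0::2] + 1j * x[1::2]          # pack the real input as N/2 complex samples
    Z = np.fft.fft(z)
    Zc = np.conj(Z[(-np.arange(N // 2)) % (N // 2)])  # conj(Z[(N/2 - k) mod N/2])
    Ze = (Z + Zc) / 2                   # DFT of the even-indexed samples
    Zo = (Z - Zc) / 2j                  # DFT of the odd-indexed samples
    k = np.arange(N // 2)
    return Ze + np.exp(-2j * np.pi * k / N) * Zo      # X[0 .. N/2-1]

x = np.random.rand(16)
assert np.allclose(rfft_via_split(x), np.fft.fft(x)[:8])
```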

Principle and optimization of DFT extension algorithm

DCT and FFT_conv are two typical algorithms extended from DFT, and the DFT_1D/2D optimizations can be applied directly to them.

DCT

The DCT (Discrete Cosine Transform) can be thought of as taking the cosine components of the DFT and applying a normalization correction. The calculation formula of DCT_1D is:


X[k]=C(k) \sum_{n=0}^{N-1} x[n] \cos \left(\frac{(2 n+1) \pi k}{2 N}\right), \quad C(k)=\sqrt{\frac{1}{N}} \text{ for } k=0, \quad C(k)=\sqrt{\frac{2}{N}} \text{ for } k \neq 0

The naive implementation of this algorithm is O(n^2); by converting it to DFT_1D, we can reduce the complexity to O(nlogn). The DFT-based DCT algorithm flow is:

  • From the DCT input sequence x[n], build an input sequence y[n] of length 2N such that y[n] = x[n] + x[2N-1-n], i.e., a mirror-symmetric extension.
  • Run the DFT on the input sequence y[n] to obtain the output sequence Y[k].
  • Compute the output X[k] of the original input sequence from Y[k].

Let’s try to derive this algorithm:


y[n]=x[n]+x[2 N-1-n]

Y[k]=\sum_{n=0}^{N-1} x[n]\cdot e^{-j \frac{2 \pi k n}{2 N}} +\sum_{n=N}^{2 N-1} x[2 N-1-n] \cdot e^{-j \frac{2 \pi k n}{2 N}}

=\sum_{n=0}^{N-1} x[n]\cdot e^{-j \frac{2 \pi k n}{2 N}} +\sum_{n=0}^{N-1} x[n] \cdot e^{-j \frac{2 \pi k (2N-1-n)}{2 N}}

=e^{j \frac{\pi k}{2 N}} \cdot \sum_{n=0}^{N-1} x[n] \left(e^{-j \frac{2\pi}{2 N} kn} \cdot e^{-j \frac{\pi}{2 N}k}+e^{j \frac{2\pi}{2 N} kn} \cdot e^{j \frac{\pi}{2 N}k}\right)

=e^{j \frac{\pi k}{2 N}} \cdot \sum_{n=0}^{N-1} x[n] \cdot 2\cdot\cos\left(\frac{2n+1}{2N} k\pi\right)

=e^{j \frac{\pi k}{2 N}} \cdot \frac{2}{C(k)} \cdot X[k]

Y [n] is expanded according to DFT formula, the two expanded terms are summarized and the common term e−j2πk2Ne^{-j \frac{2 \ PI k}{2 n}}e−j2N2πk is extracted. According to Euler’s formula and induction function, Finishing non-public item (e – j2 PI 2 NKN ⋅ e – j ej2 PI PI 2 nk + 2 NKN ⋅ ej PI 2 nk) (e ^ {- j \ frac {2 \ PI} {N} 2 kn} \ cdot e ^ {- j \ frac {\ PI}} {N} 2 k + e ^ {j \ frac {2 \ PI} {2 Kn} \ cdot e ^ N} {j \ frac {\ PI}} {N} 2 k), (e – j2N2 PI kn ⋅ e – j2N ej2N2 PI PI, k + kn ⋅ ej2N PI k). You can see that the result is the product of x[k] and the coefficients associated with k. The DCT output of x[n] x[k] x[k] x[k] x[k] can be obtained by first calculating Y[k]Y[k]Y[k] Y[k].

With this understanding, our DFT_1D optimizations apply fully to DCT. DCT_2D is computed by doing DCT_1D over the rows and then the columns; parallelizing DCT_1D across multiple threads optimizes the algorithm further.

FFT_conv

Conv is the most common operation in deep learning. Common ways to compute it include IM2COL+GEMM, Winograd, and FFT_conv; each of the three algorithms has its own applicable scenarios.

The mathematical basis of FFT_conv is the convolution theorem: circular convolution in the time domain corresponds to the product of the discrete Fourier transforms in the frequency domain. As shown below, the convolution of f and g equals the result of taking the Fourier transforms of f and g, multiplying them pointwise, and applying the inverse Fourier transform.


f \underset{\text{Circ}}{*} g=\mathcal{F}^{-1}(\mathcal{F}(f) \cdot \mathcal{F}(g))

An intuitive proof of the theorem is as follows (source):


\mathcal{F}[f * g]

=\int_{-\infty}^{\infty}\left[\left(\int_{-\infty}^{\infty}g(z)f(x-z)dz\right)e^{-i k x}\right]dx

=\int_{-\infty}^{\infty} g(z)\left[\int_{-\infty}^{\infty} f(x-z) e^{-i k x} d x\right] d z

=\int_{-\infty}^{\infty} g(z)\left[\int_{-\infty}^{\infty} f(y) e^{-i k(y+z)} d y\right] d z

=\left[\int_{-\infty}^{\infty} g(z) e^{-i k z} d z\right]\left[\int_{-\infty}^{\infty} f(y) e^{-i k y} d y\right]

=\mathcal{F}[f] \cdot \mathcal{F}[g]

Expanding the convolution formula and the Fourier transform, changing the order of integration, and substituting variables proves the conclusion. Note that the convolution here is circular convolution, as opposed to the linear convolution we usually use in deep learning. For circular convolution to yield the linear convolution result, the length must satisfy $L \geqslant |f|+|g|-1$. Therefore, we zero-pad the Feature Map and Kernel and extract the valid linear result from the final output.

The process of FFT_conv algorithm:

  • Zero-pad the Feature Map and Kernel to the same size and apply the DFT.
  • Multiply the two matrices pointwise.
  • Compute the result through the IDFT.

This algorithm converts convolution into a pointwise product. Its complexity is O(nlogn), lower than convolution’s O(n^2), so it reduces the amount of computation when the input is relatively large, making it suitable for conv with large kernels.
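A minimal 1D sketch of this process (ours; np.fft.fft(x, n=L) performs the zero-padding, and the valid linear result has length |f|+|g|-1):

```python
import numpy as np

def fft_conv1d(f, g):
    """Linear convolution via FFT, zero-padding both inputs to L >= |f| + |g| - 1."""
    L = len(f) + len(g) - 1
    F = np.fft.fft(f, n=L)               # DFT with zero-padding to length L
    G = np.fft.fft(g, n=L)
    return np.real(np.fft.ifft(F * G))   # pointwise product, then IDFT

f, g = np.random.rand(64), np.random.rand(7)
assert np.allclose(fft_conv1d(f, g), np.convolve(f, g))
```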

In deep learning, the Kernel is much smaller than the Feature Map, so the zero-padding in the first step of FFT_conv has a large overhead. Paper 2 points out that the Feature Map can be partitioned into blocks: the partitioned Feature Map and Kernel need far less padding, which greatly reduces this overhead. After this optimization, the FFT_conv calculation process becomes:

  • Compute an appropriate block size and partition the original image
  • Zero-pad each block and the kernel, and run the DFT
  • Multiply the small matrices pointwise
  • Run the inverse transform and recombine the blocks into the output image
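In 1D, this blocking is the classic overlap-add scheme; the sketch below (ours, a 1D simplification of the paper’s 2D tiling) transforms the kernel once, convolves each tile with a small FFT, and adds the overlapping tails:

```python
import numpy as np

def fft_conv1d_blocked(f, g, block=32):
    """Blocked FFT convolution (overlap-add): many small FFTs instead of one big one."""
    L = block + len(g) - 1                 # padded tile size
    G = np.fft.fft(g, n=L)                 # kernel transformed once, reused per tile
    out = np.zeros(len(f) + len(g) - 1)
    for start in range(0, len(f), block):
        tile = f[start:start + block]
        seg = np.real(np.fft.ifft(np.fft.fft(tile, n=L) * G))
        n = min(L, len(out) - start)       # clip the last tile's tail
        out[start:start + n] += seg[:n]    # overlap-add into the output
    return out

f, g = np.random.rand(100), np.random.rand(7)
assert np.allclose(fft_conv1d_blocked(f, g), np.convolve(f, g))
```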

At the same time, note that the core computing module of FFT_conv is still the DFT on small tiles, so the DFT optimizations from the previous chapter can be applied here, supplemented by multithreading, to further improve the efficiency of FFT_conv.

References

  1. Chen Dun, Li Zhihao, Jia Haipeng, Zhang Yunquan. Research on Implementation and Optimization of Multi-dimensional FFT Based on the ARMv8 Platform.
  2. Qinglin Wang, Dongsheng Li. Optimizing FFT-Based Convolution on ARMv8 Multi-core CPUs.
  3. Aleksandar Zlateski, Zhen Jia, Kai Li, Frédo Durand. FFT Convolutions are Faster than Winograd on Modern CPUs, Here's Why.

Appendix:

GitHub: MegEngine Tianyuan

MegEngine: deep learning, simple development

Welcome to join the MegEngine technical exchange QQ group: 1029741705