Author: YanJianWen | Megvii MegEngine Architect

Background

In digital signal processing and digital image processing, the study of the frequency domain is an important branch.

The images we “process” in daily life are at the pixel level; this is called the spatial-domain data of an image, and it carries the details we can “read” directly. If we instead treat the image as a signal and perform spectral analysis, we obtain its frequency-domain data. Look at the following set of pictures (source): the bright spots in the frequency-domain map are the low-frequency signals, which carry most of the energy of the image, that is, its main information; the dark dots are the high-frequency signals, representing the edges and the noise. Comparing Degraded Goofy with Goofy, the largely unchanged low-frequency signal preserves Goofy’s “contour”, while the increased high-frequency signal makes the background noise more obvious. Frequency-domain analysis lets us understand the composition of an image and, from there, perform more abstract analysis and fine-grained processing.



Goofy and Degraded Goofy

The Fourier transform is the tool for converting images between the spatial and frequency domains. Since image data is spatially discrete, we use the discrete forms: the DFT (Discrete Fourier Transform) and its inverse, the IDFT (Inverse Discrete Fourier Transform). Cooley and Tukey developed a faster algorithm, the FFT (Fast Fourier Transform), on the basis of the DFT.

spatial domain --DFT--> frequency domain --IDFT--> spatial domain

The DFT/FFT has extended applications in digital imaging. For example, the DCT (Discrete Cosine Transform), which is based on the DFT, is used in the JPEG image-compression algorithm (source) and in image-watermarking algorithms (source).

JPEG encoding consists of color-space conversion, downsampling, block splitting, the DCT, and quantization coding. The DCT separates the low-frequency information from the high-frequency information; during quantization coding, the low-frequency information is compressed lightly and the large amount of high-frequency information is compressed heavily, which yields the size reduction. As the cat-face images show, image quality deteriorates as the compression ratio increases, but the main information is still preserved.



Cat face at different JPEG qualities (compression ratios)

The image-watermarking algorithm converts the original image into the frequency domain via the DCT, embeds the watermark information at suitable positions, and converts back via the IDCT. The change to the original image is small and imperceptible, yet the watermark can be extracted by the reverse operation.

The DFT/FFT also has extended applications in deep learning. For example, the FFT can reduce the amount of computation in convolution, and the FFT_CONV algorithm has become a common deep-learning convolution algorithm. This article explores the principles and optimization strategies of these frequency-domain algorithms.

Principle and optimization of DFT

The formula

Whether for multi-dimensional DFT operations or for the DCT/FFT_CONV algorithms built on the DFT, the underlying computing unit is DFT_1D. Optimizing DFT_1D is therefore the basis for optimizing the whole family of FFT-class operators. The formula of DFT_1D is:

$$ X_{k}=\sum_{n=0}^{\mathrm{N}-1} x_{n} e^{-j 2 \pi k \frac{n}{N}} \quad k=0, \ldots, N-1 $$

where $x_{n}$ is the input signal of length $N$, $e^{-j 2 \pi k \frac{n}{N}}$ is the $N$-th unit complex root, and $X_{k}$ is the output signal of length $N$. The matrix form of this formula is:

$$\left[\begin{array}{c}X(0) \\ X(1) \\ \vdots \\ X(N-1)\end{array}\right]=\left[W_{N}^{n k}\right]\left[\begin{array}{c}x(0) \\ x(1) \\ \vdots \\ x(N-1)\end{array}\right]$$
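As a baseline, the formula transcribes directly into code. A minimal naive O(N^2) sketch in C++ (the function name is ours, not from any library):

```cpp
// Naive O(N^2) DFT_1D: a direct transcription of the formula above.
#include <complex>
#include <cmath>
#include <vector>

std::vector<std::complex<double>> dft_1d_naive(const std::vector<std::complex<double>>& x) {
    const double pi = 3.14159265358979323846;
    const size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t n = 0; n < N; ++n)
            // W_N^{nk} = e^{-j 2*pi*k*n/N}
            X[k] += x[n] * std::exp(std::complex<double>(0, -2.0 * pi * k * n / N));
    return X;
}
```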

Properties of unit complex roots

In DFT_1D, $W_{N}^{nk}=e^{-j 2 \pi k \frac{n}{N}}$ is the $N$-th unit complex root. Intuitively, it divides the unit circle of the complex plane into $N$ parts and steps around the circle according to $n \times k$.

The unit complex roots have periodicity and symmetry. With these two properties we can greatly simplify the W matrix, and they form the basis of the fast DFT_1D algorithms.

Periodicity: $$W_{N}^{k+N}=W_{N}^{k}$$

Symmetry: $$W_{N}^{k+N / 2}=-W_{N}^{k}$$

Cooley-Tukey FFT algorithm

Among the many fast DFT_1D algorithms, the Cooley-Tukey FFT is used most frequently. The algorithm applies divide and conquer: an input sequence of size N is decomposed, according to the chosen radix, into N/radix subsequences, and each subsequence is divided again until it can no longer be divided. Each division produces one stage, and all stages are combined bottom-up to compute the final output sequence. We take N = 8, radix = 2 as an example to demonstrate the derivation, where $x(k)$ is the input sequence of length N = 8 and $X^{F}(k)$ is its DFT output sequence. From the DFT formula:

$$X^{F}(k)=W_{8}^{0} x_{0}+W_{8}^{k} x_{1}+W_{8}^{2 k} x_{2}+W_{8}^{3 k} x_{3}+W_{8}^{4 k} x_{4}+W_{8}^{5 k} x_{5}+W_{8}^{6 k} x_{6}+W_{8}^{7 k} x_{7}$$

Split it by even and odd terms into two sequences $G(k)$ and $H(k)$ of length 4:

$$X^{F}(k)=W_{8}^{0} x_{0}+W_{8}^{2 k} x_{2}+W_{8}^{4 k} x_{4}+W_{8}^{6 k} x_{6}$$

$$+W_{8}^{k}\left(W_{8}^{0} x_{1}+W_{8}^{2 k} x_{3}+W_{8}^{4 k} x_{5}+W_{8}^{6 k} x_{7}\right)$$

$$=G^{F}(k)+W_{8}^{k} H^{F}(k)$$

$$X^{F}(k+4)=W_{8}^{0} x_{0}+W_{8}^{2(k+4)} x_{2}+W_{8}^{4(k+4)} x_{4}+W_{8}^{6(k+4)} x_{6}$$

$$+W_{8}^{(k+4)}\left(W_{8}^{0} x_{1}+W_{8}^{2(k+4)} x_{3}+W_{8}^{4(k+4)} x_{5}+W_{8}^{6(k+4)} x_{7}\right)$$

$$=G^{F}(k)+W_{8}^{k+4} H^{F}(k)$$

$$=G^{F}(k)-W_{8}^{k} H^{F}(k)$$

$G^{F}(k)$ and $H^{F}(k)$ are the DFT results of $G(k)$ and $H(k)$. Multiplying $H^{F}(k)$ by the corresponding twiddle factor $W_{8}^{k}$ and performing simple additions and subtractions yields the output $X^{F}(k)$. Similarly, the same iteration applies to $G(k)$ and $H(k)$: $A(k)$, $B(k)$, $C(k)$, $D(k)$ are sequences of length N = 2, and combining their DFT results gives $G^{F}(k)$ and $H^{F}(k)$.

$$ \begin{aligned} &G^{F}(k)=A^{F}(k)+W_{4}^{k}B^{F}(k)\\ &G^{F}(k+2)=A^{F}(k)-W_{4}^{k}B^{F}(k)\\ &H^{F}(k)=C^{F}(k)+W_{4}^{k}D^{F}(k)\\ &H^{F}(k+2)=C^{F}(k)-W_{4}^{k}D^{F}(k)\\ \end{aligned} $$

For the length-2 sequences, computing $A^{F}(k)$, $B^{F}(k)$, $C^{F}(k)$, $D^{F}(k)$ needs only additions and subtractions, because k = 0 and the twiddle factor $W_{2}^{0}=1$:

$$ \left[\begin{array}{l} A^{F}(0) \\ A^{F}(1) \end{array}\right]=\left[\begin{array}{ll} 1 & 1 \\ 1 & -1 \end{array}\right]\left[\begin{array}{l} x_{0} \\ x_{4} \\ \end{array}\right] $$

$$ \left[\begin{array}{l} B^{F}(0) \\ B^{F}(1) \end{array}\right]=\left[\begin{array}{ll} 1 & 1 \\ 1 & -1 \end{array}\right]\left[\begin{array}{l} x_{2} \\ x_{6} \\ \end{array}\right] $$

$$ \left[\begin{array}{l} C^{F}(0) \\ C^{F}(1) \end{array}\right]=\left[\begin{array}{ll} 1 & 1 \\ 1 & -1 \end{array}\right]\left[\begin{array}{l} x_{1} \\ x_{5} \\ \end{array}\right] $$

$$ \left[\begin{array}{l} D^{F}(0) \\ D^{F}(1) \end{array}\right]=\left[\begin{array}{ll} 1 & 1 \\ 1 & -1 \end{array}\right]\left[\begin{array}{l} x_{3} \\ x_{7} \\ \end{array}\right] $$

When the algorithm is drawn as a diagram, each layer of computation produces a number of butterfly-shaped data flows, so it is also called the butterfly algorithm.

Before the analysis below, we introduce the basic structure of the butterfly network.

N=8 butterfly algorithm diagram

The computation for N = 8 is divided into 3 stages. Each stage contains one or more sections, and each section contains one or more butterflies. The butterfly computation is the kernel of the DFT operation. The calculation order within each stage:

  • Read the input
  • Multiply by the twiddle factors
  • For each of sec_num sections and butterfly_num butterflies, execute radixN_kernel
  • Write the output

Look at the butterfly diagram for N = 8. When stage = 1, the computation is divided into 4 sections, each with butterfly_num = 1. When stage = 2, sec_num = 2 and butterfly_num = 2. When stage = 3, sec_num = 1 and butterfly_num = 4. From left to right, sec_num keeps decreasing while butterfly_num keeps increasing: the butterfly groups get “bigger and denser”, while the total number of butterflies per stage stays the same. In general, for length N and radix = r, we can derive:

$$sec\_num = N / r^{S}$$

$$butterfly\_num = r^{S-1}$$

$$sec\_stride = r^{S}$$

$$butterfly\_stride = 1$$

where S is the current stage, and sec_stride / butterfly_stride are the strides between adjacent sections / butterflies.
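To make the stage/section/butterfly structure concrete, here is a minimal iterative radix-2 FFT sketch in C++ whose loop bounds follow the formulas above (sec_num = N/2^S, butterfly_num = 2^(S-1)); it is illustrative, not MegEngine's implementation:

```cpp
// Minimal iterative radix-2 DIT FFT, in place, organized by
// stage -> section -> butterfly exactly as described in the text.
#include <complex>
#include <cmath>
#include <cstddef>
#include <vector>

void fft_radix2(std::vector<std::complex<float>>& x) {
    using cpx = std::complex<float>;
    const size_t N = x.size();  // assumed to be a power of 2
    // Bit-reversal permutation so every stage can work in place.
    for (size_t i = 1, j = 0; i < N; ++i) {
        size_t bit = N >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(x[i], x[j]);
    }
    const float pi = 3.14159265f;
    for (size_t stride = 2; stride <= N; stride <<= 1) {  // stride = 2^S, one stage
        const size_t butterfly_num = stride / 2;          // = 2^(S-1)
        const size_t sec_num = N / stride;                // = N / 2^S
        for (size_t s = 0; s < sec_num; ++s) {            // sections, stride 2^S
            for (size_t b = 0; b < butterfly_num; ++b) {  // butterflies, stride 1
                // Twiddle factor W_stride^b, then the radix-2 add/sub kernel.
                const cpx w = std::polar(1.0f, -2.0f * pi * b / stride);
                cpx& g = x[s * stride + b];
                cpx& h = x[s * stride + b + butterfly_num];
                const cpx t = w * h;
                h = g - t;
                g = g + t;
            }
        }
    }
}
```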

This algorithm reduces the complexity from O(n^2) to O(nlogn); it is efficient and elegant. Building on the butterfly algorithm, we specialize and optimize for different radices, in two main categories: power-of-2 radices and non-power-of-2 radices.

Optimization for power-of-2 radices

The kernel of DFT_1D is the $W_{N}^{nk}$ matrix of the matrix form. We now analyze the kernels for radix $2^n$.

As mentioned in the background, the matrix form of the DFT formula is:

$$\left[\begin{array}{c}X(0) \\ X(1) \\ \vdots \\ X(N-1)\end{array}\right]=\left[W_{N}^{n k}\right]\left[\begin{array}{c}x(0) \\ x(1) \\ \vdots \\ x(N-1)\end{array}\right]$$

where $x(0)$ ~ $x(N-1)$ are the inputs after multiplication by the twiddle factors $W_{N}^{kn}$.

When radix = 2, since $W_{2}^{1}=-1$ and $W_{2}^{2}=1$, the radix-2 DFT matrix form can be written as:

$$\left[\begin{array}{c}\mathrm{X}_{\mathrm{k}} \\ \mathrm{X}_{\mathrm{k}+\mathrm{N} / 2}\end{array}\right]=\left[\begin{array}{cc}1 & 1 \\ 1 & -1\end{array}\right]\left[\begin{array}{l}\mathrm{W}_{\mathrm{N}}^{0} \mathrm{A}_{\mathrm{k}} \\ \mathrm{W}_{\mathrm{N}}^{\mathrm{k}} \mathrm{B}_{\mathrm{k}}\end{array}\right]$$

When radix = 4, since $W_{4}^{1}=-j$, $W_{4}^{2}=-1$, $W_{4}^{3}=j$, $W_{4}^{4}=1$, the radix-4 DFT matrix form can be written as:

$$\left[\begin{array}{c}\mathrm{X}_{\mathrm{k}} \\ \mathrm{X}_{\mathrm{k}+\mathrm{N} / 4} \\ \mathrm{X}_{\mathrm{k}+\mathrm{N} / 2} \\ \mathrm{X}_{\mathrm{k}+3 \mathrm{~N} / 4}\end{array}\right]=\left[\begin{array}{cccc}1 & 1 & 1 & 1 \\ 1 & -\mathrm{j} & -1 & \mathrm{j} \\ 1 & -1 & 1 & -1 \\ 1 & \mathrm{j} & -1 & -\mathrm{j}\end{array}\right]\left[\begin{array}{c}\mathrm{W}_{\mathrm{N}}^{0} \mathrm{A}_{\mathrm{k}} \\ \mathrm{W}_{\mathrm{N}}^{\mathrm{k}} \mathrm{B}_{\mathrm{k}} \\ \mathrm{W}_{\mathrm{N}}^{2 \mathrm{k}} \mathrm{C}_{\mathrm{k}} \\ \mathrm{W}_{\mathrm{N}}^{3 \mathrm{k}} \mathrm{D}_{\mathrm{k}}\end{array}\right]$$

Similarly, the kernel of radix-8 is:

$$\left[\begin{array}{cccccccc}1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & \mathrm{~W}_{8}^{1} & -j & \mathrm{~W}_{8}^{3} & -1 & -\mathrm{W}_{8}^{1} & j & -\mathrm{W}_{8}^{3} \\ 1 & -j & -1 & j & 1 & -j & -1 & j \\ 1 & \mathrm{~W}_{8}^{3} & j & \mathrm{~W}_{8}^{1} & -1 & -\mathrm{W}_{8}^{3} & -j & -\mathrm{W}_{8}^{1} \\ 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ 1 & -\mathrm{W}_{8}^{1} & -j & -\mathrm{W}_{8}^{3} & -1 & \mathrm{~W}_{8}^{1} & j & \mathrm{~W}_{8}^{3} \\ 1 & j & -1 & -j & 1 & j & -1 & -j \\ 1 & -\mathrm{W}_{8}^{3} & j & -\mathrm{W}_{8}^{1} & -1 & \mathrm{~W}_{8}^{3} & -j & \mathrm{~W}_{8}^{1}\end{array}\right]$$

Let’s start with memory access. Modern processors are much better at computation than at memory access, and in scenarios where the amounts of computation and memory access are comparable, memory access is usually the performance bottleneck.

In DFT_1D, for the R-2 / R-4 / R-8 algorithms, each stage has the same amount of memory access, 2 × sec_num × butterfly_num × radix = 2N, while the number of stages differs significantly ($\log_2 N$ vs $\log_4 N$ vs $\log_8 N$).

Therefore, for the DFT, choosing a larger kernel yields an obvious memory-access advantage as long as the computation does not increase significantly. Looking at the derived kernels: each R-2 butterfly corresponds to 4 load/store operations and 2 complex floating-point additions/subtractions. Each R-4 butterfly corresponds to 8 load/store operations and 8 complex floating-point additions/subtractions (after merging common operations). With only a slight increase in computation, the number of stages drops from $\log_2 N$ to $\log_4 N$, which reduces the total number of memory accesses and so improves performance. Each R-8 butterfly corresponds to 16 load/store operations, 24 complex floating-point additions/subtractions, and 8 floating-point multiplications. The floating-point multiplications raise the computation cost, and the number of stages falls further from $\log_4 N$ to $\log_8 N$. However, since N is usually not very large in practice, the stage reduction from R-4 to R-8 is small, so the gain is limited.
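For reference, the merged radix-4 kernel described above can be sketched in scalar C++. The inputs are assumed to be already multiplied by their twiddle factors, and multiplication by ±j is implemented as a swap-and-negate rather than a full complex multiply:

```cpp
// One radix-4 butterfly with common operations merged. The inputs a..d are
// W_N^{0}A_k, W_N^{k}B_k, W_N^{2k}C_k, W_N^{3k}D_k from the matrix form above.
// Total: 8 complex additions/subtractions, matching the count in the text.
#include <complex>

using cpx = std::complex<float>;

inline cpx mul_neg_j(cpx v) { return cpx(v.imag(), -v.real()); }  // v * (-j)

void radix4_kernel(cpx a, cpx b, cpx c, cpx d, cpx out[4]) {
    const cpx t0 = a + c;             // shared by X_k and X_{k+N/2}
    const cpx t1 = a - c;             // shared by X_{k+N/4} and X_{k+3N/4}
    const cpx t2 = b + d;             // shared by X_k and X_{k+N/2}
    const cpx t3 = mul_neg_j(b - d);  // shared by X_{k+N/4} and X_{k+3N/4}
    out[0] = t0 + t2;  // X_k
    out[1] = t1 + t3;  // X_{k+N/4}
    out[2] = t0 - t2;  // X_{k+N/2}
    out[3] = t1 - t3;  // X_{k+3N/4}
}
```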

Now consider the computation cost. There are usually two ways to reduce it: eliminating redundant computation, and parallelizing.

Taking the R-4 algorithm as an example, the kernel part is computed as follows:

  • radix_4_first_stage(src, dst, sec_num, butterfly_num)
  • radix_4_other_stage(src, dst, sec_num, butterfly_num)

    • for sec_num

      • for butterfly_num

        • radix_4_kernel

In radix_4_first_stage, k = 0 and the twiddle factors are 1, so the complex multiplications can be eliminated and this stage optimized separately. In radix_4_other_stage, from the second stage onward, butterfly_num = 4^(s-1) is always a multiple of 4, and each butterfly reads and stores its data independently. The innermost loop can therefore be unrolled and vectorized to process 4 or more butterflies in parallel (a vectorized sketch follows the figure). Loop unrolling combined with SIMD instructions not only increases parallelism but also improves cache-line utilization, which brings a significant performance improvement. On SM8150 (ARMv8), for example, the parallel-optimized R-4 reaches 1.6x the performance of R-2.



Performance comparison, sizes 1~2048 (R2C); environment: SM8150 large core
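As a hypothetical illustration of this vectorization (assuming a split-complex layout with separate real and imaginary arrays, which maps complex arithmetic directly onto vector lanes; this is not MegEngine's actual code), four radix-2 butterflies can be processed per iteration with ARM NEON:

```cpp
// Hypothetical sketch: four radix-2 butterflies per iteration with ARM NEON,
// assuming a split-complex layout (separate real/imaginary arrays).
#include <arm_neon.h>

// gr/gi: real/imag parts of four G(k) values; hr/hi: four H(k) values;
// wr/wi: real/imag parts of the four twiddle factors W^k.
void radix2_butterflies_x4(float* gr, float* gi, float* hr, float* hi,
                           const float* wr, const float* wi) {
    float32x4_t Gr = vld1q_f32(gr), Gi = vld1q_f32(gi);
    float32x4_t Hr = vld1q_f32(hr), Hi = vld1q_f32(hi);
    float32x4_t Wr = vld1q_f32(wr), Wi = vld1q_f32(wi);
    // t = W * H: one complex multiply across 4 lanes
    float32x4_t Tr = vmlsq_f32(vmulq_f32(Wr, Hr), Wi, Hi);  // Wr*Hr - Wi*Hi
    float32x4_t Ti = vmlaq_f32(vmulq_f32(Wr, Hi), Wi, Hr);  // Wr*Hi + Wi*Hr
    // X(k) = G + t, X(k + N/2) = G - t (stored back in place)
    vst1q_f32(gr, vaddq_f32(Gr, Tr));
    vst1q_f32(gi, vaddq_f32(Gi, Ti));
    vst1q_f32(hr, vsubq_f32(Gr, Tr));
    vst1q_f32(hi, vsubq_f32(Gi, Ti));
}
```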

In short, for radix-2^n optimization, choosing an appropriate radix reduces the memory-access cost incurred by multiple stages, while exploiting the unit-complex-root properties and parallelization reduces the computation cost; together they bring a large performance improvement.

Optimization for non-power-of-2 radices

When the input length N = radix1^m1 * radix2^m2 * … contains factors that are not powers of 2, using the naive O(n^2) algorithm degrades performance dramatically. Common solutions are: zero-padding the input to a power-of-2 length, using a radix-N algorithm, or using a special algorithm (the Chirp-Z transform). For large inputs, zero-padding to a power of 2 adds considerable computation and storage, while the Chirp-Z transform computes the DFT via convolution and is rather complicated. Therefore, optimizing non-power-of-2 radix-N kernels is also necessary.

The radix-N calculation flow is the same as in the power-of-2 case, and we can likewise use the periodicity and symmetry of the unit complex roots to simplify the kernel computation. Taking radix-5 as an example, the DFT kernel of radix-5 is:

$$ \left[\begin{array}{ccccc} 1&1&1&1&1\\ 1 &\mathrm{W}_{\mathrm{5}}^{1} & \mathrm{W}_{\mathrm{5}}^{2} & \mathrm{W}_{\mathrm{5}}^{-2} & \mathrm{W}_{\mathrm{5}}^{-1} \\ 1 &\mathrm{W}_{\mathrm{5}}^{2} & \mathrm{W}_{\mathrm{5}}^{-1} & \mathrm{W}_{\mathrm{5}}^{1} & \mathrm{W}_{\mathrm{5}}^{-2} \\ 1 &\mathrm{W}_{\mathrm{5}}^{-2} & \mathrm{W}_{\mathrm{5}}^{1} & \mathrm{W}_{\mathrm{5}}^{-1} & \mathrm{W}_{\mathrm{5}}^{2} \\ 1 &\mathrm{W}_{\mathrm{5}}^{-1} & \mathrm{W}_{\mathrm{5}}^{-2} & \mathrm{W}_{\mathrm{5}}^{2} & \mathrm{W}_{\mathrm{5}}^{1} \\ \end{array}\right] $$

$W_{5}^{k}$ and $W_{5}^{-k}$ are symmetric about the x-axis of the complex plane: they have the same real part and opposite imaginary parts. Using this property, as shown below, the common terms A, B, C and D can be computed first in each stage, and the stage output then obtained from them.

$$ \begin{array}{l} A=\left(x_{1}+x_{4}\right) * W_{5}^{1} \cdot r+\left(x_{2}+x_{3}\right) * W_{5}^{2} \cdot r\\ B=(-j) * \left[\left(x_{1}-x_{4}\right) * W_{5}^{1} \cdot i+\left(x_{2}-x_{3}\right) * W_{5}^{2} \cdot i\right] \\ C=\left(x_{1}+x_{4}\right) * W_{5}^{2} \cdot r+\left(x_{2}+x_{3}\right) * W_{5}^{1} \cdot r\\ D=j * \left[\left(x_{1}-x_{4}\right) * W_{5}^{2} \cdot i-\left(x_{2}-x_{3}\right) * W_{5}^{1} \cdot i\right] \\ X(k)=x_{0}+\left(x_{1}+x_{4}\right)+\left(x_{2}+x_{3}\right)\\ X(k+N/5)=x_{0}+\mathrm{A}-\mathrm{B}\\ X(k+2N/5)=x_{0}+\mathrm{C}+\mathrm{D}\\ X(k+3N/5)=x_{0}+C-D\\ X(k+4N/5)=x_{0}+\mathrm{A}+\mathrm{B}\\ \end{array} $$

This scheme removes a great deal of repeated computation. In addition, when stage >= 2, the butterflies are loop-unrolled and parallelized to further reduce the computation cost. The radix-5 optimization idea extends to radix-N. For each stage of radix-N, the calculation flow is as follows (a scalar sketch of the radix-5 kernel follows the list):

  • Read the input
  • Multiply by the corresponding twiddle factors
  • Compute the common terms; radix-N has N-1 of them
  • Execute the parallelized radix_N_kernel
  • Write the output
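To make the common-term trick concrete, the radix-5 equations above transcribe directly into scalar code (one butterfly; inputs assumed already multiplied by their twiddle factors):

```cpp
// One radix-5 butterfly using the common terms A, B, C, D from the text.
// w1r/w1i and w2r/w2i are the real/imaginary parts of W_5^1 and W_5^2.
#include <complex>
#include <cmath>

using cpx = std::complex<float>;

void radix5_kernel(const cpx x[5], cpx out[5]) {
    const float pi = 3.14159265358979f;
    const float w1r = std::cos(2 * pi / 5), w1i = -std::sin(2 * pi / 5);
    const float w2r = std::cos(4 * pi / 5), w2i = -std::sin(4 * pi / 5);
    const cpx s14 = x[1] + x[4], d14 = x[1] - x[4];
    const cpx s23 = x[2] + x[3], d23 = x[2] - x[3];
    // Common terms shared across the five outputs
    const cpx A = s14 * w1r + s23 * w2r;
    const cpx B = cpx(0, -1) * (d14 * w1i + d23 * w2i);
    const cpx C = s14 * w2r + s23 * w1r;
    const cpx D = cpx(0, 1) * (d14 * w2i - d23 * w1i);
    out[0] = x[0] + s14 + s23;  // X(k)
    out[1] = x[0] + A - B;      // X(k + N/5)
    out[2] = x[0] + C + D;      // X(k + 2N/5)
    out[3] = x[0] + C - D;      // X(k + 3N/5)
    out[4] = x[0] + A + B;      // X(k + 4N/5)
}
```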

Other optimizations

The two sections above described the general optimization of DFT_1D; more detailed optimizations can be built on top of it. See the papers cited in the references.

  • For purely real input, the imaginary part is 0, so the complex multiplications with the twiddle factors and inside radix_N_kernel contain redundant computation and storage. The split R2C algorithm treats the length-N real sequence as a complex sequence of length N/2, computes its DFT, and then performs a split operation to obtain the DFT of the length-N real sequence (see the sketch after this list).
  • For the power-of-2 radix algorithms, recomputing the input/output strides of each stage eliminates the bit-reversal of the first stage, further reducing memory-access overhead.
  • For radix-N algorithms with N = radix1^m1 * radix2^m2 under a mixed-radix framework, smaller radices are merged into larger ones to reduce the number of stages.
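A sketch of the split R2C idea: pack the real input into a half-length complex sequence, run one complex DFT, then split the interleaved spectra. The helper names are ours, and a naive DFT stands in for the optimized complex FFT:

```cpp
#include <complex>
#include <cmath>
#include <vector>

using cpx = std::complex<double>;

// Naive stand-in for an optimized length-M complex FFT.
static std::vector<cpx> dft_c2c(const std::vector<cpx>& z) {
    const double pi = 3.14159265358979323846;
    const size_t M = z.size();
    std::vector<cpx> Z(M);
    for (size_t k = 0; k < M; ++k)
        for (size_t n = 0; n < M; ++n)
            Z[k] += z[n] * std::exp(cpx(0, -2.0 * pi * k * n / M));
    return Z;
}

std::vector<cpx> dft_r2c(const std::vector<double>& x) {
    const double pi = 3.14159265358979323846;
    const size_t N = x.size(), H = N / 2;  // N assumed even
    // Even samples become real parts, odd samples imaginary parts.
    std::vector<cpx> z(H);
    for (size_t m = 0; m < H; ++m) z[m] = cpx(x[2 * m], x[2 * m + 1]);
    const std::vector<cpx> Z = dft_c2c(z);
    std::vector<cpx> X(N);
    for (size_t k = 0; k < H; ++k) {
        const cpx Zk = Z[k], Zs = std::conj(Z[(H - k) % H]);
        const cpx Xe = (Zk + Zs) * 0.5;           // spectrum of even samples
        const cpx Xo = (Zk - Zs) * cpx(0, -0.5);  // spectrum of odd samples
        const cpx w = std::exp(cpx(0, -2.0 * pi * k / N));
        X[k] = Xe + w * Xo;      // split step recombines the two spectra
        X[k + H] = Xe - w * Xo;
    }
    return X;
}
```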

Principle and optimization of DFT extension algorithm

DCT and FFT_CONV are two typical algorithms built on the DFT, and the DFT_1D/2D optimizations carry over to them directly.

DCT

The DCT (Discrete Cosine Transform) can be viewed as a DFT of a real, mirror-extended sequence: the sine (imaginary) components vanish and a scaling correction is applied. The DCT_1D formula is:

$$ X[k]=C(k) \sum_{n=0}^{N-1} x[n] \cos \left(\frac{(2 n+1) \pi k}{2 N}\right), \quad C(k)=\begin{cases}\sqrt{\frac{1}{N}}, & k=0 \\ \sqrt{\frac{2}{N}}, & k \neq 0\end{cases} $$

The naive implementation is O(n^2); by converting it into DFT_1D, the complexity drops to O(nlogn). The DFT-based DCT flow is as follows (a numerical sketch follows the list):

  • From the DCT input sequence x[n], create an input sequence y[n] of length 2N with y[n] = x[n] + x[2N-1-n], i.e., a mirror-symmetric extension.
  • Perform the DFT on y[n] to obtain the output sequence Y[k].
  • Compute the output X[k] of the original input sequence from Y[k].
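A small numerical sketch of this flow, using a naive O(N^2) DFT for clarity and the orthonormal C(k) scaling from the formula above:

```cpp
// DFT-based DCT check: mirror-extend, take the length-2N DFT,
// then remove the phase factor e^{j*pi*k/2N} to recover X[k].
#include <complex>
#include <cmath>
#include <cstdio>
#include <vector>

using cpx = std::complex<double>;

int main() {
    const double pi = 3.14159265358979323846;
    std::vector<double> x = {1, 2, 3, 4};
    const int N = (int)x.size();
    // Step 1: y[n] = x[n] + x[2N-1-n], length 2N (mirror extension).
    std::vector<double> y(x);
    for (int n = N - 1; n >= 0; --n) y.push_back(x[n]);
    for (int k = 0; k < N; ++k) {
        // Step 2: k-th bin of the length-2N DFT of y.
        cpx Y = 0;
        for (int n = 0; n < 2 * N; ++n)
            Y += y[n] * std::exp(cpx(0, -2.0 * pi * k * n / (2 * N)));
        // Step 3: X[k] = C(k)/2 * e^{-j*pi*k/2N} * Y[k].
        const double Ck = (k == 0) ? std::sqrt(1.0 / N) : std::sqrt(2.0 / N);
        const double Xk = Ck / 2.0 * std::real(std::exp(cpx(0, -pi * k / (2 * N))) * Y);
        std::printf("X[%d] = %f\n", k, Xk);
    }
}
```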

Let’s try to derive this algorithm:

$$ \begin{aligned} y[n]&=x[n]+x[2 N-1-n] \\ Y[k]&=\sum_{n=0}^{N-1} x[n] \cdot e^{-j \frac{2 \pi k n}{2 N}}+\sum_{n=N}^{2 N-1} x[2 N-1-n] \cdot e^{-j \frac{2 \pi k n}{2 N}} \\ &=\sum_{n=0}^{N-1} x[n] \cdot e^{-j \frac{2 \pi k n}{2 N}}+\sum_{n=0}^{N-1} x[n] \cdot e^{-j \frac{2 \pi k(2 N-1-n)}{2 N}} \\ &=e^{j \frac{\pi k}{2 N}} \cdot \sum_{n=0}^{N-1} x[n]\left(e^{-j \frac{2 \pi}{2 N} k n} \cdot e^{-j \frac{\pi}{2 N} k}+e^{j \frac{2 \pi}{2 N} k n} \cdot e^{j \frac{\pi}{2 N} k}\right) \\ &=e^{j \frac{\pi k}{2 N}} \cdot \sum_{n=0}^{N-1} x[n] \cdot 2 \cos \left(\frac{2 n+1}{2 N} k \pi\right) \\ &=e^{j \frac{\pi k}{2 N}} \cdot \frac{2}{C(k)} \cdot X[k] \end{aligned} $$

Expand Y[k] according to the DFT formula, rearrange the two sums, and extract the common factor $e^{j \frac{\pi k}{2 N}}$; by Euler's formula and trigonometric identities, the bracketed term $\left(e^{-j \frac{2 \pi}{2 N} k n} \cdot e^{-j \frac{\pi}{2 N} k}+e^{j \frac{2 \pi}{2 N} k n} \cdot e^{j \frac{\pi}{2 N} k}\right)$ collapses into a cosine. The result is X[k] multiplied by a coefficient that depends only on k, so the DCT output X[k] of x[n] can be obtained by first computing Y[k].

With this understanding, our DFT_1D optimizations apply to the DCT unchanged. DCT_2D is computed by doing DCT_1D over the rows and then over the columns; running the DCT_1D passes on multiple threads optimizes the algorithm further.

FFT_conv

Convolution is the most common operation in deep learning. Common ways to compute it include IM2COL+GEMM, Winograd, and FFT_CONV, each with its own usage scenarios.

The mathematical principle behind FFT_CONV is that circular convolution in the time domain corresponds to pointwise multiplication of the discrete Fourier transforms: as shown below, the convolution of f and g equals the result of taking the Fourier transforms of f and g, forming their pointwise product, and applying the inverse Fourier transform.

$$ f \underset{\text { Circ }}{*} g=\mathcal{F}^{-1}(\mathcal{F}(f) \cdot \mathcal{F}(g)) $$

An intuitive proof is given below (source).

$$ \begin{aligned} \mathcal{F}[f * g] &=\int_{-\infty}^{\infty}\left[\left(\int_{-\infty}^{\infty}g(z)f(x-z)dz\right)e^{-i k x}\right]dx\\ &=\int_{-\infty}^{\infty} g(z)\left[\int_{-\infty}^{\infty} f(x-z) e^{-i k x} d x\right] d z \\ &=\int_{-\infty}^{\infty} g(z)\left[\int_{-\infty}^{\infty} f(y) e^{-i k(y+z)} d y\right] d z \\ &=\left[\int_{-\infty}^{\infty} g(z) e^{-i k z} d z\right]\left[\int_{-\infty}^{\infty} f(y) e^{-i k y} d y\right] \\ &=\mathcal{F}[f] \cdot \mathcal{F}[g] \end{aligned} $$

The result is proved by expanding the convolution and the Fourier transform, exchanging the order of integration, and substituting variables. Note that the convolution here is circular convolution, as opposed to the linear convolution we are used to in deep learning. Circular convolution produces the same result as linear convolution when the circular convolution length L ⩾ |f| + |g| − 1. Therefore, the Feature Map and Kernel must be zero-padded, and the valid linear-convolution results extracted from the final output.

FFT_CONV algorithm flow (a small end-to-end sketch follows the list):

  • Zero-pad the Feature Map and the Kernel to the same size and apply the DFT to each.
  • Form the pointwise (dot) product of the two matrices.
  • Compute the result with the IDFT.
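A tiny end-to-end sketch of this flow in 1-D, with a naive DFT standing in for the FFT (illustrative only):

```cpp
// FFT_CONV flow: zero-pad to L >= |f| + |g| - 1, pointwise-multiply the
// spectra, inverse-transform, and compare with direct linear convolution.
#include <complex>
#include <cmath>
#include <cstdio>
#include <vector>

using cpx = std::complex<double>;

static std::vector<cpx> dft(const std::vector<cpx>& a, bool inverse) {
    const double pi = 3.14159265358979323846;
    const size_t N = a.size();
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<cpx> A(N);
    for (size_t k = 0; k < N; ++k) {
        for (size_t n = 0; n < N; ++n)
            A[k] += a[n] * std::exp(cpx(0, sign * 2.0 * pi * k * n / N));
        if (inverse) A[k] /= (double)N;
    }
    return A;
}

int main() {
    std::vector<double> f = {1, 2, 3, 4, 5}, g = {1, -1, 2};  // feature map, kernel
    const size_t L = f.size() + g.size() - 1;  // padding condition L >= |f|+|g|-1
    std::vector<cpx> F(L), G(L);               // zero-padded copies
    for (size_t i = 0; i < f.size(); ++i) F[i] = f[i];
    for (size_t i = 0; i < g.size(); ++i) G[i] = g[i];
    std::vector<cpx> Ff = dft(F, false), Fg = dft(G, false);
    for (size_t k = 0; k < L; ++k) Ff[k] *= Fg[k];  // pointwise product
    const std::vector<cpx> conv = dft(Ff, true);    // IDFT
    for (size_t i = 0; i < L; ++i)
        std::printf("conv[%zu] = %.3f\n", i, conv[i].real());
    // Expected: 1 1 3 5 7 3 10 (direct linear convolution of f and g)
}
```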

This converts convolution into a dot product, with complexity O(nlogn), lower than the O(n^2) of direct convolution. It reduces the amount of computation when the input is relatively large, and so suits convolutions with large kernels.

In deep-learning workloads the Kernel is much smaller than the Feature Map, so the zero-padding in the first step of FFT_CONV incurs a lot of overhead. As mentioned in paper 2, the Feature Map can be split into tiles; the tiled Feature Map and the Kernel need a much smaller padding size, which greatly reduces this overhead. The optimized FFT_CONV flow is:

  • Arrange the cache sensibly, compute an appropriate tile size, and split the original image into tiles
  • Zero-pad each tile together with the kernel and perform the DFT
  • Form the pointwise matrix product for each tile
  • Apply the inverse transform and assemble the tiles back into the full image

Note that the core computation of FFT_CONV is still the DFT of small tiles, so the DFT optimizations from the previous chapter apply here directly; with the help of multi-threading, the efficiency of FFT_CONV can be improved further.

References

  1. Chen Tun, Li Zhihao, Jia Haipeng, Zhang Yunquan. Research on Implementation and Optimization of Multi-dimensional FFT Based on the ARMv8 Platform.
  2. Qinglin Wang, Dongsheng Li. Optimizing FFT-Based Convolution on ARMv8 Multi-core CPUs.
  3. Aleksandar Zlateski, Zhen Jia, Kai Li, Fredo Durand. FFT Convolutions are Faster than Winograd on Modern CPUs, Here's Why.

Appendix:

MegEngine: Deep Learning, Easy Development