Faster Neural Networks Directly from JPEG

Abstract

The simple, elegant approach of training convolutional neural networks (CNNs) directly from RGB pixels has achieved overwhelming empirical success. But can network performance be improved by using a different input representation? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed in the middle of the JPEG codec. Intuitively, when processing JPEG images with a CNN, there seems to be little point in decompressing the blockwise frequency representation into pixels, transferring it from the CPU to the GPU, and then processing it with a CNN whose first layers will learn something resembling a transform back to a frequency representation. Why not skip both steps and feed the frequency domain into the network directly? In this paper, we modify libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided inputs, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as a network about as accurate as ResNet-50 but 1.77 times faster.

1 Introduction

The remarkable progress of neural networks, particularly convolutional neural networks [14], which achieve strong performance on a wide range of tasks [13,19,20,10], has led to their widespread adoption in academia and industry. When CNNs are trained on image data, the data is usually provided as an array of red-green-blue (RGB) pixels. The convolutional layers compute features starting from these pixels; early layers typically learn Gabor-like filters, and later layers learn higher-level, more abstract features [13,27].

In this paper, we propose and explore a simple idea for accelerating neural network training and inference in the common scenario where networks are applied to images encoded in the JPEG format. In this setting, images are typically decoded from the compressed format into an array of RGB pixels and then fed into a neural network. Here, we propose a more direct approach. First, we modify the libjpeg library to decode JPEG images only partially, resulting in an image representation consisting of three tensors of discrete cosine transform (DCT) coefficients. Due to the way the JPEG codec works, these tensors have different spatial resolutions. We then design and train networks that operate directly on this representation; as one might suspect, this turns out to work quite well.
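To make the representation concrete, the following minimal sketch (in Python, using PIL and SciPy rather than our modified libjpeg) reconstructs an equivalent set of tensors from an already-decoded image: YCbCr conversion, 4:2:0 chroma downsampling, and an 8×8 blockwise DCT. The file name and the simple 2×2 averaging used for chroma are illustrative assumptions, and the codec's quantization step is omitted.

```python
# Sketch only: approximates the partially-decoded JPEG representation from a
# decoded image. Not the modified libjpeg; quantization is omitted.
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def blockwise_dct(channel):
    """Split a (H, W) channel into 8x8 blocks and apply a 2-D DCT-II,
    returning a (H/8, W/8, 64) tensor of coefficients."""
    h, w = channel.shape
    blocks = channel.reshape(h // 8, 8, w // 8, 8).transpose(0, 2, 1, 3)
    coeffs = dct(dct(blocks, axis=2, norm='ortho'), axis=3, norm='ortho')
    return coeffs.reshape(h // 8, w // 8, 64)

img = Image.open('image.jpg').convert('YCbCr').resize((224, 224))  # placeholder file
y, cb, cr = [np.asarray(c, dtype=np.float32) - 128.0 for c in img.split()]

# 4:2:0 chroma subsampling (plain 2x2 averaging stands in for the codec's filter).
cb = cb.reshape(112, 2, 112, 2).mean(axis=(1, 3))
cr = cr.reshape(112, 2, 112, 2).mean(axis=(1, 3))

d_y, d_cb, d_cr = blockwise_dct(y), blockwise_dct(cb), blockwise_dct(cr)
print(d_y.shape, d_cb.shape, d_cr.shape)  # (28, 28, 64) (14, 14, 64) (14, 14, 64)
```

Note how the luma and chroma tensors end up with different spatial resolutions, which is the mismatch the architectures in Section 3 must handle.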

1.1 Related Work

When training and/or inference speed is critical, much effort has focused on accelerating computation by reducing the number of parameters or by computing more efficient operations on graphics processing units (GPUs) [12,3,9]. Some studies process images using spatial frequency decompositions and other compressed representations without deep learning [22,18,8,5,7]. Others have combined deep learning with compressed representations other than JPEG and achieved good results [24,1]. The closest parallels to our work are [6] and [25]. [6] compresses DCT coefficients, but via a simpler truncation approach rather than the full JPEG encoder. [25] trains on a similar input representation, but without the complete early JPEG stack, in particular omitting the Cb/Cr downsampling step. Our work thus builds on the shoulders of many previous studies, extending them to the complete early JPEG stack, to deeper networks, and to training on larger datasets and harder tasks. We carefully time the relevant operations and perform the ablation studies needed to understand where the performance improvements come from.

Figure 1: (a) The three steps of JPEG image encoding: the RGB image is first converted to the YCbCr color space and the chroma channels are downsampled; the channels are then projected through the DCT and quantized; finally, the quantized coefficients are losslessly compressed. See Section 2 for details. (b) JPEG decoding follows the reverse process. In this paper, we perform only the first step of decoding and feed the DCT coefficients directly into a neural network. This saves time in three ways: the final steps of normal JPEG decoding are skipped, half as much data is transferred from the CPU to the GPU, and the image is already in the frequency domain. Because the early layers of a neural network in effect learn a conversion to the frequency domain, providing that representation directly allows the network to use fewer layers.

The rest of this article makes the following contributions. We review the JPEG codec in more detail to build intuition for which steps of the process produce representations suitable for neural network training (Section 2). Because the Y and Cb/Cr DCT blocks have different resolutions, we consider several architectures inspired by ResNet-50 [10] for combining information from these channels, each with different speed and accuracy trade-offs (Sections 3 and 5). Some combinations yield faster networks at the same accuracy as the baseline RGB model, or better accuracy at a more modest speedup (Figure 5). Having found faster and more accurate networks in DCT space, we ask whether comparable speed or accuracy gains could be obtained simply by searching nearby ResNet architectures that operate in RGB space. We find that simple mutations of ResNet-50 do not produce competitive networks (Section 4). Finally, given the strong performance of the DCT representation, we perform an ablation study to examine whether it is due to the different color space or to the specific first-layer filters. We find that the exact DCT transform works remarkably well, even better than learning a transform of the same dimensionality (Sections 4.3 and 5.3)! So that others can reproduce the experiments and benefit from the speedups reported here, we release our code at github.com/uber-resear…

2 JPEG Compression

2.1 The JPEG Encoder

The JPEG standard (ISO/IEC 10918) was created in 1992, the result of an effort that began in 1986 [11]. It supports 8-bit grayscale images and 24-bit color images and, despite being more than 30 years old, remains the dominant image representation in consumer electronics and on the Internet. In this article we consider only the 24-bit color version, which starts from RGB pixels encoded with 8 bits per color channel.
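As a brief illustration of the lossy portion of the encoder, the sketch below level-shifts one 8×8 block, applies the 2-D DCT, and quantizes the result. The uniform quantization table is a placeholder; real encoders use quality-dependent tables specified by the standard.

```python
# Sketch of the lossy half of JPEG encoding for a single 8x8 luma block.
import numpy as np
from scipy.fftpack import dct

block = np.random.randint(0, 256, size=(8, 8)).astype(np.float32)  # dummy pixel block
shifted = block - 128.0                                            # level shift to [-128, 127]
coeffs = dct(dct(shifted, axis=0, norm='ortho'), axis=1, norm='ortho')

quant_table = np.full((8, 8), 16.0)          # placeholder; the standard's tables vary per frequency
quantized = np.round(coeffs / quant_table)   # this rounding is where information is lost
```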

3 Designing CNN models for DCT input

Figure 2: (a) The 64 orthonormal DCT basis vectors used to decompose single-channel 8×8 pixel blocks in the JPEG standard [26]. (b) The 64 first-layer convolutional filters of size 7×7 learned by the baseline ResNet-50 network operating on RGB pixels [10]. (c) The 64 convolutional filters of size 8×8 learned from random initialization by the DCT-Learn network described in Section 4.3. (d) The 64 convolutional filters from the DCT-Ortho network, similar to (c) but with an added orthonormality regularization.

In this section, we describe the transforms that allow DCT coefficients to be consumed by a conventional CNN architecture such as ResNet-50 [10]. Some care is needed because the DCT coefficients from the luma channel, DY, generally have larger spatial dimensions than those from the chroma channels, DCb and DCr, as shown in Figure 1a, where the actual shapes are computed for an input image of size 224 × 224. Special transforms are therefore required to match the spatial dimensions before the resulting activations are concatenated and fed into a conventional CNN. We consider two abstract transforms (T1, T2), acting on the different coefficient channels, that produce spatially matched activations aY = T1(DY), aCb = T2(DCb), and aCr = T2(DCr). Figure 3 illustrates this process.

In addition to aligning the sizes of the convolutional feature maps, it is important to consider the receptive field size and stride (denoted R and S below) of each unit throughout the network once the transform is applied. For a typical network with RGB input, the receptive field and stride of each unit are the same across the input channels (red, green, and blue). Here, by contrast, information flowing through the Y channel may end up with a different receptive field, measured in the original pixel space, than information flowing through the Cb and Cr channels, which may not be what we want. Examining the sizes produced by the DCT operation alongside the activation sizes of the ResNet-50 blocks under the standard parameter settings (table below), we find that the spatial dimensions of DY match the activation dimensions of Block 3, while the spatial dimensions of DCb and DCr match those of Block 4. This suggests skipping some ResNet blocks when designing the network, but skipping them without further modification results in a less capable network (fewer layers and fewer parameters) whose final layer has a smaller receptive field.
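For reference, the R and S values discussed here follow the standard receptive-field recursion: a layer with kernel size k and stride s updates R ← R + (k − 1)·S and S ← S·s. The helper below implements this recursion; the example layer list is illustrative, not the exact ResNet-50 schedule.

```python
# Standard receptive-field arithmetic for the R and S values discussed above.
def receptive_field(layers, r=1, s=1):
    """layers: list of (kernel_size, stride) pairs, applied in order."""
    for k, stride in layers:
        r = r + (k - 1) * s
        s = s * stride
    return r, s

# e.g. a 7x7 stride-2 conv, a 3x3 stride-2 pool, then a 3x3 stride-1 conv
print(receptive_field([(7, 2), (3, 2), (3, 1)]))   # -> (19, 4)
```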

The transforms (T1, T2) are generic, giving us considerable freedom in how to make the DCT coefficients compatible. In choosing them, we considered the following design dimensions. A transform can be (1) non-parametric and/or hand-designed, such as upsampling or downsampling the raw DCT coefficients, (2) learned, for example represented simply as a convolutional layer, or (3) a combination of layers, such as a ResNet block itself. Starting with the simplest, we explore seven different choices of (T1, T2), from upsampling to deconvolution, combined with different options for the subsequent stack of ResNet blocks. Each is described in detail in Section S1 of the supplementary information:

  • UpSampling. The chroma DCT coefficients DCb and DCr are upsampled by replicating each value in height and width to match the dimensions of DY. The three tensors are then concatenated along the channel dimension and passed through a batch-normalization layer before entering ResNet ConvBlock 3 (CB3), run with stride 1, followed by the standard CB4 and CB5 (a minimal sketch of this upsample-and-concatenate front end appears after this list).

  • UpSampling-RFA. This setup is similar to UpSampling, but here we retain ResNet CB2 (instead of removing it) and adjust CB2 and CB3 so that they mimic the growth of R and S observed in the original ResNet-50 blocks; we call this "receptive field aware", or RFA. As Figure 4 shows, without this modification the jump in R from the input to the first block is large, and R later in the network never becomes as large as in the baseline ResNet (green line). By instead keeping CB2 but lowering its stride, the transition to large R is more gradual, and by the time CB3 is reached, R and S match the baseline ResNet through the remaining layers. This architecture is shown in Figures 3b and S1.

  • Deconvolution-RFA. An alternative to upsampling is a learnable deconvolution layer. In this design, we use two independent deconvolution layers on DCb and DCr to increase their spatial dimensions. The rest of the design is identical to UpSampling-RFA.

Figure 3: (a) The first layers of the original ResNet-50 architecture [10]. (b) The UpSampling-RFA architecture, illustrated with coefficients DY of size 28 × 28 × 64 and DCb, DCr of size 14 × 14 × 64. The abbreviations NT and U stand for the No Transform and Upsampling operations, respectively. (c) The Late-Concat architecture passes the luma coefficients DY through ResNet Block 3 and the chroma coefficients through a single convolution. This results in more total computation along the luma path than along the chroma paths, and it tends to work well.

  • DownSampling. Instead of upsampling the smaller chroma coefficients, the larger DY can be downsampled spatially with a convolutional layer. The rest of the design is similar to UpSampling, with some changes to handle the smaller input spatial sizes. As we will see in Section 5, this network, operating on a smaller total input, is faster but at the cost of higher error.

  • Late-Concat. In this design, we run DY through two ConvBlocks (CBs) and three IdentityBlocks (IBs) of ResNet-50. DCb and DCr pass in parallel through one CB before being concatenated with the DY path. The concatenated representation is then fed into the standard ResNet stack starting at CB4. This architecture is shown in Figures 3c and S1. The effect is more total computation along the luma path than along the chroma path, and the result is a fast network with good performance.

  • Late-Concat-RFA. The receptive-field-aware version of Late-Concat passes DY through three CBs whose kernel sizes are tuned so that the growth of R mimics that of the original ResNet-50. In parallel, DCb and DCr take the same path as in Late-Concat before being concatenated with the result of the DY path. The average receptive fields are compared in Figure 4, which shows that the receptive field of Late-Concat-RFA grows more smoothly than that of Late-Concat. As shown in Figure S1, we use more channels in the earlier blocks because their spatial size is smaller than in the standard ResNet.

  • Late-Concat-RFA-Thinner. This architecture is the same as Late-Concat-RFA, but with a modified number of channels. The channel count is reduced in the first two CBs of the DY path and increased in the third, changing the counts from {1024, 512, 512} to {384, 384, 768}. The DCb and DCr components are fed through a 256-channel CB instead of a 512-channel one. All other parts of the network are identical to Late-Concat-RFA. These changes were made to preserve the accuracy of the Late-Concat-RFA model while recovering some of the speed advantage of Late-Concat. As Figure 5 shows, it strikes an attractive balance.
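As a concrete, deliberately simplified illustration of the two simplest front ends above, the PyTorch sketch below upsamples, or deconvolves, the chroma coefficients to match DY and concatenates the three tensors. Channel counts and layer choices are assumptions for illustration, not the released model.

```python
# Minimal sketch of the UpSampling / Deconvolution-RFA front ends, assuming DCT
# inputs of shape (N, 64, 28, 28) for D_Y and (N, 64, 14, 14) for D_Cb and D_Cr.
import torch
import torch.nn as nn

class DCTFrontEnd(nn.Module):
    def __init__(self, mode='upsample'):
        super().__init__()
        if mode == 'upsample':
            # UpSampling: replicate chroma coefficients 2x in height and width.
            self.up_cb = nn.Upsample(scale_factor=2, mode='nearest')
            self.up_cr = nn.Upsample(scale_factor=2, mode='nearest')
        else:
            # Deconvolution-RFA: independent learnable transposed convs per chroma channel.
            self.up_cb = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
            self.up_cr = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
        self.bn = nn.BatchNorm2d(192)   # 3 * 64 channels after concatenation

    def forward(self, d_y, d_cb, d_cr):
        x = torch.cat([d_y, self.up_cb(d_cb), self.up_cr(d_cr)], dim=1)  # (N, 192, 28, 28)
        return self.bn(x)               # would then feed a stride-1 ConvBlock 3

front = DCTFrontEnd('upsample')
out = front(torch.randn(1, 64, 28, 28), torch.randn(1, 64, 14, 14), torch.randn(1, 64, 14, 14))
print(out.shape)   # torch.Size([1, 192, 28, 28])
```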

Figure 4: Average receptive field size within each ResNet block, together with the corresponding block stride. Both axes are logarithmic. Measurements for several of the DCT architectures are reported and compared with the receptive field growth observed in ResNet-50. These plots highlight how the receptive-field-aware (RFA) versions of the DCT-based architectures produce growth similar to that of the baseline network.

4 RGB Network Controls

As we will see in Section 5 and Figure 5, many networks with DCT input achieve lower error and/or higher speed than the baseline RGB ResNet-50. In this section, we examine whether this is simply because we tried many architectural tweaks, some of which happened to work better than the baseline ResNet. Here, we start from the baseline ResNet and attempt small architectural changes to reduce error and/or speed up execution. The input is an RGB image of size 224 × 224 × 3.

4.1 Removing ID Blocks

First, we tested the simple idea of removing convolutional layers from ResNet-50. We remove the identity blocks from Block 2 and Block 3 one by one, generating six experiments as up to six layers are removed. We never remove the convolutional (projection) blocks between Block 2 and Block 3, so the number of channels and the representation size within each block remain unchanged.

In this series of experiments, the first identity (ID) block of Block 2 is removed; next, the first and second ID blocks are removed; and so on, until all the ID blocks of Blocks 2 and 3 have been removed. In the final configuration, the network resembles the UpSampling architecture, in that the RGB signal is converted to a 28 × 28 × 512 representation through a small number of convolutions: the RGB input passes through a convolution, max pooling, and the last identity block of Block 3. The trade-off between inference speed and accuracy is shown under the legend "Baseline, Remove ID Blocks" (six gray squares) in Figure 5. As can be seen, the networks become slightly faster, but accuracy drops dramatically.
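A hedged sketch of this ablation using torchvision's ResNet-50 is shown below. The mapping between the paper's Block 2 / Block 3 and torchvision's layer2 / layer3 is our assumption; the snippet only illustrates the mechanism of dropping trailing identity blocks while keeping each stage's first (projection) block, which preserves channel counts and activation sizes.

```python
# Sketch: remove trailing identity (Bottleneck) blocks from ResNet-50 stages.
import torch.nn as nn
from torchvision.models import resnet50

def drop_identity_blocks(stage, n):
    """Return the stage with its last n identity blocks removed."""
    blocks = list(stage.children())
    return nn.Sequential(*blocks[:len(blocks) - n])

model = resnet50()
model.layer2 = drop_identity_blocks(model.layer2, 1)   # one ablation step
model.layer3 = drop_identity_blocks(model.layer3, 1)   # assumed stage mapping
```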

4.2 Reducing the Number of Channels

Since reducing the number of layers was not effective, we also study thinner networks: reducing the number of channels in each layer to speed up inference. The final fully connected layer is modified to match the size of its input while keeping the number of outputs constant. We reduce the number of channels by dividing the original channel counts by a fixed ratio, running three experiments with ratios {1.1, √2, 2}. The trade-off between speed (or GFLOPS) and accuracy is shown under the legend "Reduced Channel #" in Figure 5. As with reducing the number of layers, the networks become slightly faster, but accuracy drops considerably. Perhaps both results could have been anticipated, since the authors of ResNet-50 presumably tuned the depth and width of the network well; nevertheless, it is important to verify that the observed performance improvements cannot be obtained through these much simpler approaches.

4.3 Learning the DCT Transform

The final set of experiments, the four "YCbCr pixels, DCT layers" diamonds shown in Figure 5, examines whether we can obtain benefits similar to those of the DCT architectures while starting from RGB pixels, by using a convolutional layer to replicate the DCT transform exactly or approximately. RGB images are first converted to YCbCr space, and each channel is then fed independently through a convolutional layer. To mimic the DCT, we use convolutional filters of size 8×8 with stride 8 and 64 output channels (or, in some cases, more). The resulting activations are concatenated before being fed into ResNet Block 2. In DCT-Learn, we initialize the filters randomly and train them in the standard way. In DCT-Ortho, we regularize the convolutional weights toward orthonormality, as described in [2], to discourage them from discarding information, inspired by the orthonormality of the DCT transform. In DCT-Frozen, we simply use the exact DCT coefficients without training, and in DCT-Frozen×2 we change the stride to 4 instead of 8 to increase the representation size of the layer and allow the filters to overlap. Surprisingly, despite lacking the speedup of the Deconvolution-RFA approach, this last network, averaged over three runs, performed on par with the other best methods (6.98%). This is interesting because it runs counter to currently popular network design principles: the first-layer filters are large rather than small, hard-coded rather than learned, operate on YCbCr space rather than RGB, and process channels depthwise (individually) rather than jointly. Future work could assess the extent to which these atypical choices should be adopted as standard practice.
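A minimal sketch of the frozen variant is given below: a single-channel 8×8, stride-8 convolution whose 64 filters are set to the orthonormal 2-D DCT basis and frozen. Only the first layer is shown; the rest of the network and the DCT-Ortho regularizer are omitted, and hyperparameters follow the text.

```python
# Sketch of a DCT-Frozen first layer: fixed 2-D DCT basis as 8x8, stride-8 filters,
# applied to one YCbCr channel at a time.
import numpy as np
import torch
import torch.nn as nn

def dct_basis(n=8):
    """Return the n*n orthonormal 2-D DCT-II basis functions, shape (n*n, n, n)."""
    k = np.arange(n)
    # 1-D DCT-II basis matrix (rows are basis vectors)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    # outer products of 1-D bases give the 2-D basis functions
    return np.einsum('ux,vy->uvxy', c, c).reshape(n * n, n, n)

conv = nn.Conv2d(1, 64, kernel_size=8, stride=8, bias=False)
with torch.no_grad():
    conv.weight.copy_(torch.from_numpy(dct_basis()).float().unsqueeze(1))
conv.weight.requires_grad_(False)        # DCT-Frozen; leave trainable for DCT-Learn

y = torch.randn(1, 1, 224, 224)          # one channel of a YCbCr image
print(conv(y).shape)                     # torch.Size([1, 64, 28, 28])
```

For the DCT-Frozen×2 variant described above, the same layer would be built with stride 4, doubling the spatial size of the output and letting the filters overlap.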