Computer vision researchers, always trying to "see" every pixel of a busy world, have found that they cannot get around the resolution limit. Today I'm going to talk about resolution and super-resolution reconstruction.


Resolution limit


The resolution limit, whether for researchers of image reconstruction or of image post-processing algorithms, is an unavoidable technical index. Temporal resolution determines the frame rate of the output video, that is, how real-time it feels; spatial resolution determines whether an image is 720p, 1080p, or 4K; intensity (gray-level) resolution determines the fullness and granularity of the colors displayed in the image. Resolution is therefore at the core of any image or video.


I’m going to talk a little bit about spatial resolution today.

In practical applications, the cost of image acquisition equipment, the bandwidth available for video transmission, or technical bottlenecks of the imaging modality itself mean that we cannot always obtain large, sharp-edged, artifact-free high-definition images. Super-resolution reconstruction technology emerged against this background.



Figure 1


Application Scenario I: image compression and transmission. Images are encoded at a low bit rate, which greatly reduces the traffic bandwidth of the forwarding server during transmission; a relatively low-definition image is decoded on the client, and a high-resolution image is finally obtained through super-resolution reconstruction.



Figure 2


Application Scenario II: biological tissue imaging. Left: photoacoustic microscopy image. Right: photoacoustic super-resolution microscopy image, with the fine texture of a bee wing clearly visible [5].


Traditional super-resolution reconstruction techniques can be broadly divided into four categories [1, 2]: prediction-based, edge-based, statistical, and patch-based (example-based) methods.

At present, patch-based methods dominate. Here we select four classic papers on deep-learning-based, patch-based super-resolution reconstruction and analyze their key technical points. From these papers we can see the researchers' different understandings of, and approaches to, the super-resolution task.


In 2012, AlexNet won the ImageNet Large Scale Visual Recognition Challenge with a record-low top-5 classification error rate of 15.4%, sounding the horn for the explosive development of deep learning in computer vision. Super-resolution reconstruction has likewise adopted deep learning to obtain better algorithm performance.

Article 1: Image Super-Resolution Using Deep Convolutional Networks

Authors: Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang

SRCNN is a pioneering work in deep-learning-based super-resolution reconstruction. It inherits the idea of sparse coding from traditional machine learning and uses three convolutional layers to perform: 1. image patch extraction and sparse dictionary construction; 2. nonlinear mapping between low- and high-resolution image features; 3. reconstruction of high-resolution image patches.


Specifically, suppose the low-resolution image to be processed has size H × W × C, where H, W, and C are its height, width, and number of channels. The first SRCNN layer uses convolution kernels of size C × f1 × f1 × n1, which can be understood as sliding an f1 × f1 window over the low-resolution image and applying n1 distinct convolutions to each patch. Over the whole image, each convolution outputs one feature map, and the n1 feature maps together constitute a sparse representation dictionary of the low-resolution image with dimensions H1 × W1 × n1. The second layer uses kernels of size n1 × 1 × 1 × n2 to establish the nonlinear mapping between the low-resolution and high-resolution sparse representation dictionaries; the output high-resolution sparse dictionary has dimensions H1 × W1 × n2. It is worth noting that SRCNN does not use a fully connected layer for this mapping but a 1 × 1 convolution, so the mapping shares parameters across all spatial positions, i.e. every position is mapped nonlinearly in the same way. The third layer uses kernels of size n2 × f3 × f3 × C: the n2 × 1 vector at each position of the high-resolution sparse dictionary is used to reconstruct an f3 × f3 image patch, and the overlapping patches are combined to produce the final super-resolved image.



Figure 3 The three-layer convolutional structure of SRCNN [1]
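To make the three-layer pipeline concrete, here is a minimal PyTorch sketch. The 9-1-5 kernel sizes and the 64/32 filter counts follow the base configuration reported in the paper; the zero padding that keeps spatial size fixed is a convenience for this sketch rather than the paper's exact setting, and the input is assumed to be a low-resolution image already upscaled to the target size by bicubic interpolation, as SRCNN expects.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self, channels=1, n1=64, n2=32):
        super().__init__()
        # Layer 1: patch extraction and sparse representation (f1 = 9)
        self.extract = nn.Conv2d(channels, n1, kernel_size=9, padding=4)
        # Layer 2: 1x1 convolution, the per-position nonlinear mapping
        # between low- and high-resolution sparse dictionaries
        self.map = nn.Conv2d(n1, n2, kernel_size=1)
        # Layer 3: reconstruction of overlapping HR patches (f3 = 5)
        self.reconstruct = nn.Conv2d(n2, channels, kernel_size=5, padding=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.extract(x))
        x = self.relu(self.map(x))
        return self.reconstruct(x)

# Usage: bicubically upscale the LR input to H x W first, then refine.
model = SRCNN()
hr_estimate = model(torch.randn(1, 1, 128, 128))  # -> (1, 1, 128, 128)
```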


Article 2: Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

Authors: Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, Zehan Wang

After SRCNN introduced CNNs into super-resolution reconstruction, researchers began to consider how to use "convolution" to push the problem further. If Gaussian smoothing and downsampling of a high-resolution image can be modeled as a convolution, then restoring a high-resolution image from the downsampled low-resolution image amounts to a deconvolution. The computational task then becomes learning an appropriate deconvolution kernel that recovers high-resolution images from low-resolution ones. The standard implementation of a deconvolution layer in a CNN is shown in Figure 4: the low-resolution image is zero-filled, i.e. zeros are inserted in the 2×2 or 3×3 neighborhood around each pixel position, and the result is convolved with a kernel of a certain size.



Figure 4 Schematic diagram of the standard deconvolution layer implementation
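As a quick illustration (not taken from the papers themselves), a transposed convolution in PyTorch implements exactly this zero-insertion-then-convolve upsampling; the kernel size and stride below are arbitrary example values.

```python
import torch
import torch.nn as nn

# Transposed convolution ("deconvolution"): zeros are implicitly
# inserted between input pixels, then a learned kernel is convolved
# over the zero-filled grid to produce the upscaled output.
deconv = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                            kernel_size=4, stride=2, padding=1)
lr = torch.randn(1, 1, 64, 64)
hr = deconv(lr)
print(hr.shape)  # torch.Size([1, 1, 128, 128]) -- 2x upscaling
```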


However, the disadvantages of standard deconvolution are obvious. First, the filled-in zeros carry no information about the image. Second, convolving the zero-filled, enlarged image increases the computational complexity. Against this background, the image and video compression research group at Twitter introduced the concept of sub-pixel convolution into the SRCNN framework.



Figure 5 Network structure of the Efficient Sub-Pixel Convolutional Neural Network (ESPCN) [2]

The core idea of sub-pixel convolution is this: for an input of size H × W × C, a standard deconvolution that enlarges the image by the super-resolution factor r (the multiple of image-size enlargement) outputs a feature map of size rH × rW × C, whereas sub-pixel convolution outputs a feature map of size H × W × C·r². Keeping the spatial size equal to that of the input while increasing the number of output channels both makes effective use of the information of neighboring pixels in the input image and avoids the extra computation introduced by zero filling. The C·r² channels at each position are then rearranged into an r × r block of the high-resolution output.
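A minimal sketch of this design in PyTorch follows. The 5-3-3 kernel sizes and tanh activations mirror the configuration described in the paper, and `nn.PixelShuffle` performs the sub-pixel rearrangement from H × W × C·r² to rH × rW × C.

```python
import torch
import torch.nn as nn

class ESPCN(nn.Module):
    def __init__(self, channels=1, r=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.Tanh(),
            # Final layer: C * r^2 channels, still at LR spatial size.
            nn.Conv2d(32, channels * r * r, kernel_size=3, padding=1),
        )
        # Sub-pixel rearrangement: (C*r^2, H, W) -> (C, rH, rW).
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):
        return self.shuffle(self.body(x))

model = ESPCN(r=3)
print(model(torch.randn(1, 1, 60, 60)).shape)  # torch.Size([1, 1, 180, 180])
```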

Article 3: Perceptual Losses for Real-Time Style Transfer and Super-Resolution

Authors: Justin Johnson, Alexandre Alahi, and Li Fei-Fei


Compared with other machine learning tasks, such as object detection or instance segmentation, the loss function in super-resolution reconstruction is usually defined rather crudely. Since the goal of reconstruction is to make the peak signal-to-noise ratio (PSNR) between the reconstructed high-resolution image and the real high-definition image as large as possible, the vast majority of deep-learning super-resolution studies simply use the mean squared error (MSE) as the loss, i.e. the mean squared difference over all corresponding pixel positions of the two images. Because MSE loss requires a one-to-one correspondence between pixel positions, it is also called per-pixel loss.
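For reference, a few lines of PyTorch make the relationship explicit: minimizing the per-pixel MSE directly maximizes PSNR, since for images with values in [0, 1], PSNR = 10·log₁₀(1/MSE).

```python
import torch
import torch.nn.functional as F

def psnr(sr, hr, max_val=1.0):
    """PSNR in dB, derived directly from the per-pixel MSE."""
    mse = F.mse_loss(sr, hr)
    return 10 * torch.log10(max_val ** 2 / mse)

sr, hr = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(F.mse_loss(sr, hr).item(), psnr(sr, hr).item())
```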


However, as the technology developed, researchers gradually found the limitations of per-pixel loss. Consider an extreme case: shift the original high-definition image by one pixel in any direction. The resolution and style of the image barely change, yet the per-pixel loss increases significantly because of that one-pixel offset. A per-pixel constraint therefore cannot capture the high-level feature information of an image. For this reason, researchers working on image style transfer proposed perceptual loss, in contrast to per-pixel loss, at CVPR 2016.



Figure 6 Structure of the fully convolutional network trained with perceptual loss [3]


A super-resolution network based on per-pixel loss directly minimizes the difference between the original high-definition image and the reconstructed image, so that the reconstruction gradually approaches the clarity of the original. Perceptual loss instead minimizes the difference between the feature maps of the original and reconstructed images. For computational efficiency, these feature maps are extracted by a convolutional network with fixed weights, such as a VGG16 network pre-trained on the ImageNet dataset, as shown in Figure 7. Convolution layers at different depths extract different feature information and different image textures.



Figure 7 Image features extracted by convolution layers at different depths [3]
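A hedged sketch of such a feature-reconstruction loss, assuming a recent torchvision API: both images pass through a frozen, ImageNet-pretrained VGG16, and the MSE is taken between feature maps at one chosen depth (here the first nine layers, up to relu2_2; the paper experiments with several depths). Input normalization to ImageNet statistics is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen feature extractor: VGG16 up to relu2_2.
features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:9].eval()
for p in features.parameters():
    p.requires_grad = False  # the VGG weights stay fixed

def perceptual_loss(sr, hr):
    """MSE between feature maps rather than between raw pixels."""
    return F.mse_loss(features(sr), features(hr))
```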


When training the super-resolution network, the researchers therefore construct a fully convolutional network (FCN) that uses strided convolution layers instead of pooling layers. Residual blocks are added between the convolutional layers to deepen the network while preserving its ability to fit (see the sketch below). Finally, the VGG16 network extracts features from both the original image and the reconstructed image, and minimizing the difference between the two feature maps drives the super-resolution reconstruction ever closer to the resolution of the original.
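The residual block mentioned above can be sketched in a few lines; this is the generic form (two 3×3 convolutions plus an identity skip), not the paper's exact layer configuration.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection, so added depth
    cannot make the network worse than the identity mapping."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))
```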

Article 4: RAISR: Rapid and Accurate Image Super Resolution

Authors: Yaniv Romano, John Isidoro, and Peyman Milanfar

The typical patch-based (also called example-based) super-resolution techniques discussed above all learn a mapping from low-resolution to high-resolution patches from paired low/high-resolution training patches. In practice this mapping is usually a set of filters: for each pixel position of the input image, an appropriate filter is selected according to the local texture and used for the super-resolution reconstruction.


Building on this idea and on earlier super-resolution work such as SRCNN, A+, and ESPCN, Google released the RAISR algorithm in 2016. Its distinguishing features are high-speed, real-time performance and extremely low computational complexity. The core idea is to train a bank of filters from paired low- and high-resolution image patches; at test time, an appropriate filter is selected according to local gradient statistics at each position of the input image to complete the super-resolution reconstruction. RAISR therefore consists of two parts: training the LR-to-HR mapping filters, and establishing the filter indexing mechanism, sketched below.
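The indexing mechanism can be sketched as follows: eigenanalysis of each patch's 2×2 gradient structure tensor yields an angle, a strength, and a coherence, which are quantized into a hash key that selects one filter from the trained bank. The bucket counts and thresholds below are illustrative placeholders, not the paper's exact values.

```python
import numpy as np

def raisr_hash_key(patch, n_angle=24):
    """Quantize a patch's local gradient statistics into a filter index."""
    gy, gx = np.gradient(patch.astype(np.float64))
    # 2x2 gradient structure tensor accumulated over the patch.
    T = np.array([[(gx * gx).sum(), (gx * gy).sum()],
                  [(gx * gy).sum(), (gy * gy).sum()]])
    eigvals, eigvecs = np.linalg.eigh(T)            # ascending order
    l1, l2 = max(eigvals[1], 0.0), max(eigvals[0], 0.0)
    vx, vy = eigvecs[:, 1]                          # dominant direction
    angle = np.arctan2(vy, vx) % np.pi              # orientation in [0, pi)
    strength = np.sqrt(l1)
    coherence = (np.sqrt(l1) - np.sqrt(l2)) / (np.sqrt(l1) + np.sqrt(l2) + 1e-8)
    # Quantize into (angle, strength, coherence) buckets; the
    # thresholds here are illustrative only.
    q_angle = min(int(angle / np.pi * n_angle), n_angle - 1)
    q_strength = int(strength > 1.0) + int(strength > 4.0)      # 3 buckets
    q_coherence = int(coherence > 0.25) + int(coherence > 0.5)  # 3 buckets
    return q_angle, q_strength, q_coherence  # index into the filter bank
```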



Figure 8 RAISR 2× upsampling filters [4]



Figure 9 Comparison of RAISR with SRCNN, A+, and other super-resolution algorithms at 2× upscaling

The left plot shows PSNR vs. runtime; the right plot shows SSIM vs. runtime [4]

Conclusion



Super-resolution reconstruction has broad application prospects in medical image processing and compressed-image enhancement, and has been a hot research topic in deep learning in recent years. Improvements to convolutional and residual components, further analysis of different types of perceptual loss, and the exploration of generative adversarial networks (GANs) for super-resolution reconstruction all deserve attention.


References



[1] Dong, Chao, et al. "Image Super-Resolution Using Deep Convolutional Networks." IEEE Transactions on Pattern Analysis and Machine Intelligence 38.2 (2016): 295-307.

[2] Shi, Wenzhe, et al. "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016: 1874-1883.

[3] Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. "Perceptual Losses for Real-Time Style Transfer and Super-Resolution." European Conference on Computer Vision (ECCV). 2016: 694-711.

[4] Romano, Yaniv, John Isidoro, and Peyman Milanfar. "RAISR: Rapid and Accurate Image Super Resolution." IEEE Transactions on Computational Imaging 3.1 (2017): 110-125.

[5] Conkey, Donald B., et al. "Super-Resolution Photoacoustic Imaging Through a Scattering Wall." Nature Communications 6 (2015): 7902.