• Semantic Segmentation — U-net (Part 1)
  • Original article by Kerem Turgutlu
  • Translation from: The Gold Project
  • Permalink to this article: github.com/xitu/gold-m…
  • Translator: JohnJiangLA
  • Proofreaders: Haiyang-Tju, Leviding

To myself, six months ago

In this post I will focus on semantic segmentation, a pixel-level classification task, and on one particular algorithm for it. I will also share some case studies I have been working on recently.

By definition, semantic segmentation is the partitioning of an image into coherent parts: for example, classifying every pixel as belonging to a person, a car, a tree, or any other entity in the dataset.
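To make "pixel-level classification" concrete, here is a minimal NumPy sketch. The score values, the image size, and the class list are all made up for illustration; in a real model the scores would come from the network's last layer.

```python
import numpy as np

# Hypothetical per-pixel class scores for a 2x3 image and 3 classes
# (0 = background, 1 = person, 2 = car); layout is (classes, height, width).
scores = np.array([
    [[0.90, 0.2, 0.1], [0.8, 0.1, 0.3]],   # background
    [[0.05, 0.7, 0.2], [0.1, 0.8, 0.1]],   # person
    [[0.05, 0.1, 0.7], [0.1, 0.1, 0.6]],   # car
])

# Semantic segmentation assigns every pixel the class with the highest score.
label_map = scores.argmax(axis=0)
print(label_map)
```

The result is a single label map with one class id per pixel; note that it says nothing about which person or which car a pixel belongs to.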

Semantic segmentation vs. instance segmentation

Semantic segmentation is much easier than its big brother, instance segmentation.

In instance segmentation, our goal is not only to make pixel-level predictions for every person, car, or tree, but also to identify each entity separately as Person 1, Person 2, Tree 1, Tree 2, Car 1, Car 2, and so on. At the time of writing, the state-of-the-art algorithm for instance segmentation is Mask-RCNN: a two-stage approach in which several sub-networks work together, namely an RPN (Region Proposal Network), an FPN (Feature Pyramid Network), and an FCN (Fully Convolutional Network) [5, 6, 7, 8].

Figure 4. Semantic segmentation

Figure 5. Instance segmentation

Case study: Data Science Bowl 2018

Data Science Bowl 2018 has just ended, and I learned a lot during the competition. Perhaps the most important lesson is that even with deep learning, which is more automated than traditional machine learning, pre-processing and post-processing can be the key to good results. These are important skills for practitioners to master, as is the way you structure and model the problem for the network.

Since there is already plenty of discussion and explanation on Kaggle about this task and the methods used during the competition, I won't go into every detail. But since the winning solution is closely related to the subject of this post, I will cover it briefly.

Data Science Bowl 2018, like its predecessors, was organized by the Booz Allen Foundation. This year's task was to identify the nuclei in a given microscope image and draw a separate segmentation mask for each of them.

Now, take a minute or two to guess which kind of segmentation this task requires: semantic or instance?

This is a sample mask image and the original microscopic image.

Figure 6. Nuclei mask (left) and original image (right)

Although this task initially sounds like semantic segmentation, it actually requires instance segmentation. We need to treat each nucleus in the image independently and identify them as Nucleus 1, Nucleus 2, Nucleus 3, and so on, just like Car 1, Car 2, Person 1, and so on in the earlier example. The motivation for the task was perhaps to keep track of the size, count, and characteristics of the nuclei in cell samples. Automating this kind of tracking and recording matters because it can further speed up research into treatments for various diseases.

You may now be wondering: if this post is about semantic segmentation, and Data Science Bowl 2018 is an instance segmentation task, why do I keep talking about this particular competition? If so, you are absolutely right: the end goal of this competition is not an example of semantic segmentation. However, as we go on you will see how this instance segmentation problem can be transformed into a multi-class semantic segmentation task. This is the approach I tried, and although it failed for me in practice, the same idea also contributed to the solution that eventually won.

During the three-month competition, only two models (or variants of them) were shared or at least explicitly discussed on the forums: Mask-RCNN and U-Net. As mentioned above, Mask-RCNN is the state-of-the-art object detection algorithm; it detects individual objects and predicts a mask for each, which is exactly what instance segmentation needs. However, because Mask-RCNN is a two-stage approach that first has to optimize an RPN (Region Proposal Network) and then predict bounding boxes, classes, and masks simultaneously, it is quite difficult to implement and train.

U-Net, on the other hand, is a very popular end-to-end encoder-decoder network for semantic segmentation [9]. It was originally invented for and used in biomedical image segmentation, a task very similar to this Data Science Bowl. There was no silver bullet in this competition: neither architecture gave good predictions without post-processing, pre-processing, or subtle architectural tweaks. I didn't get a chance to try Mask-RCNN in this competition, so I ran my experiments with U-Net and learned a lot.

Also, since our topic is semantic segmentation, I will leave Mask-RCNN for other blogs to explain. If you want to try it in your own CV application, though, here are two popular GitHub repositories with implementations in TensorFlow and PyTorch. [10, 11]

Now, let's move on to U-Net and dive into its details…

Let’s start with its architecture:

Figure 7. Vanilla U-Net

For those familiar with traditional convolutional neural networks, the first part (the down path) will look very familiar. You can think of it as the encoder: it applies convolution blocks followed by max-pool downsampling to encode the input image into feature representations at multiple levels.
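As a small illustration of the downsampling step in the encoder, here is a 2×2 max pooling with stride 2 written in plain NumPy. This is only a sketch for a single 2-D feature map; the helper name `max_pool_2x2` is my own, and in practice you would use the pooling layer of your framework.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a single 2-D feature map."""
    h, w = x.shape
    # Drop a trailing odd row/column, then group pixels into 2x2 blocks.
    blocks = x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    # Take the maximum inside every block.
    return blocks.max(axis=(1, 3))

feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each pooling step halves the height and width, which is why deeper encoder levels see a coarser but more abstract view of the image.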

The second part of the network consists of upsampling and concatenation, followed by ordinary convolutions. Upsampling in a CNN may be a new concept for some readers, but the idea is simple: we expand the feature maps until they have the same size as the corresponding block on the left that they will be concatenated with. The gray and green arrows in the figure show where the two feature maps are concatenated. Compared with other fully convolutional segmentation networks, the main contribution of U-Net here is that, while upsampling and going deeper into the network, we concatenate the high-resolution features from the down path with the upsampled features, so that the following convolutions can better localize and learn representations. Since upsampling is a sparse operation, we need a good prior from the earlier stages to represent location information well. A similar idea of combining matching levels appears in FPNs (Feature Pyramid Networks) [7].
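The shape bookkeeping of one such skip connection can be sketched in NumPy. The channel counts and spatial sizes below are assumptions chosen for illustration, and nearest neighbor repetition stands in for whichever upsampling method the network uses.

```python
import numpy as np

# Hypothetical feature maps in (channels, height, width) layout.
encoder_features = np.random.rand(64, 8, 8)   # high-resolution features from the down path
decoder_features = np.random.rand(64, 4, 4)   # coarser features coming from the level below

# Upsample by a factor of 2 so the spatial sizes match
# (nearest neighbor repetition, chosen here only for simplicity).
upsampled = decoder_features.repeat(2, axis=1).repeat(2, axis=2)

# Concatenate along the channel axis: 64 + 64 = 128 channels.
merged = np.concatenate([encoder_features, upsampled], axis=0)
print(merged.shape)  # (128, 8, 8)
```

The convolutions that follow see both the precise locations from the encoder and the coarser context from the decoder.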

Figure 7. Tensor diagram of the vanilla U-Net

We can define a building block for the down path as “convolution → downsampling”.

import torch.nn as nn
import torch.nn.functional as F

# A sample down block built from (conv -> batch norm -> ReLU) layers
def make_conv_bn_relu(in_channels, out_channels, kernel_size=3, stride=1, padding=1):
    return [
        nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True)
    ]

# In the model's __init__: the first down block, two conv layers with 64 channels
self.down1 = nn.Sequential(
    *make_conv_bn_relu(in_channels, 64, kernel_size=3, stride=1, padding=1),
    *make_conv_bn_relu(64, 64, kernel_size=3, stride=1, padding=1),
)

# In forward(): convolve, then halve the resolution with max pooling
down1 = self.down1(x)
out1 = F.max_pool2d(down1, kernel_size=2, stride=2)

U-Net downsampling block

We can also define a building block for the up path: “upsampling → concatenation → convolution”.

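import torch
import torch.nn as nn
import torch.nn.functional as F

# A sample up block built from (conv -> batch norm -> ReLU) layers
def make_conv_bn_relu(in_channels, out_channels, kernel_size=3, stride=1, padding=1):
    return [
        nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True)
    ]

# In the model's __init__: 128 input channels = 64 (skip from down1) + 64 (upsampled);
# the block ends with 32 channels so that it matches final_conv
self.up4 = nn.Sequential(
    *make_conv_bn_relu(128, 64, kernel_size=3, stride=1, padding=1),
    *make_conv_bn_relu(64, 32, kernel_size=3, stride=1, padding=1),
)
self.final_conv = nn.Conv2d(32, num_classes, kernel_size=1, stride=1, padding=0)

# In forward(): upsample out_last, concatenate with down1, and convolve
# (newer PyTorch releases deprecate F.upsample in favor of F.interpolate)
out = F.upsample(out_last, scale_factor=2, mode='bilinear')
out = torch.cat([down1, out], 1)
out = self.up4(out)

# 1x1 convolution for the final prediction
final_out = self.final_conv(out)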

U-Net upsampling block

If you look closely at the architecture diagram, you will notice that the output size (388 × 388) does not match the input size (572 × 572). If you want the output to have the same size as the input, you can use padded convolutions to keep the dimensions consistent across the concatenation levels, as we did in the sample code above.
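The 572 → 388 shrinkage can be checked with a few lines of arithmetic. The sketch below assumes the layout of the original paper: four levels, two 3×3 convolutions per level, 2×2 pooling on the way down and 2× upsampling on the way up. The function name `unet_output_size` and its parameters are my own.

```python
def unet_output_size(size, depth=4, kernel_size=3, padding=0):
    """Spatial output size of a U-Net with two convolutions per level."""
    # Pixels lost by one convolution (0 when the convolution is padded).
    shrink = kernel_size - 1 - 2 * padding
    for _ in range(depth):      # down path: two convs, then 2x2 pooling
        size = (size - 2 * shrink) // 2
    size -= 2 * shrink          # bottleneck: two convs
    for _ in range(depth):      # up path: 2x upsampling, then two convs
        size = size * 2 - 2 * shrink
    return size

print(unet_output_size(572))             # unpadded convolutions, as in the paper
print(unet_output_size(256, padding=1))  # padded convolutions keep the size
```

With padding, the size is only preserved when the input is divisible by 2 at every pooling step (by 16 for four levels), which is one reason inputs are often resized or padded to such sizes.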

When reading about this kind of upsampling, you may come across any of the following terms: transposed convolution, up-convolution, deconvolution, or upsampling. Many people, myself and the PyTorch documentation included, dislike the term deconvolution, because during upsampling we are actually performing a regular convolution, not a literal "inverse" of one. Before going any further, if you are not familiar with basic convolution operations and their arithmetic, I strongly recommend reading [12].

I will explain the upsampling methods from the simplest to the most complex. Here are three ways to upsample a two-dimensional tensor in PyTorch:

Nearest neighbor interpolation

This is the simplest way to fill in the missing pixel values when you resize a tensor to a larger one, for example from 2×2 to 4×4, 5×5, or 6×6.

Let's implement this basic computer vision algorithm step by step with NumPy:

import numpy as np
from collections import Counter

def nn_interpolate(A, new_size):
    """Nearest neighbor interpolation, step by step"""
    # Get the original size
    old_size = A.shape

    # Calculate the row and column scaling ratios
    row_ratio, col_ratio = new_size[0]/old_size[0], new_size[1]/old_size[1]

    # Define the new (1-based) row and column positions
    new_row_positions = np.array(range(new_size[0]))+1
    new_col_positions = np.array(range(new_size[1]))+1

    # Normalize the new row and column positions by the ratios
    new_row_positions = new_row_positions / row_ratio
    new_col_positions = new_col_positions / col_ratio

    # Apply ceil to map every new position to a source row or column
    new_row_positions = np.ceil(new_row_positions)
    new_col_positions = np.ceil(new_col_positions)

    # Count how many times each source row and column must be repeated
    row_repeats = np.array(list(Counter(new_row_positions).values()))
    col_repeats = np.array(list(Counter(new_col_positions).values()))

    # Repeat the rows (interpolate along each column of the matrix)
    row_matrix = np.dstack([np.repeat(A[:, i], row_repeats)
                            for i in range(old_size[1])])[0]

    # Repeat the columns (interpolate along each row of the result)
    nrow, ncol = row_matrix.shape
    final_matrix = np.stack([np.repeat(row_matrix[i, :], col_repeats)
                             for i in range(nrow)])

    return final_matrix


def nn_interpolate(A, new_size):
    """Vectorized nearest neighbor interpolation"""
    old_size = A.shape
    row_ratio, col_ratio = np.array(new_size) / np.array(old_size)

    # Source row index for every new row
    row_idx = (np.ceil(np.arange(1, 1 + new_size[0]) / row_ratio) - 1).astype(int)

    # Source column index for every new column
    col_idx = (np.ceil(np.arange(1, 1 + new_size[1]) / col_ratio) - 1).astype(int)

    # Index rows first, then columns, so non-square inputs also work
    final_matrix = A[row_idx, :][:, col_idx]

    return final_matrix

[PyTorch] F.upsample(..., mode='nearest')

>>> input = torch.arange(1, 5).view(1, 1, 2, 2)
>>> input

(0 ,0 ,.,.) =
  1  2
  3  4
[torch.FloatTensor of size (1,1,2,2)]

>>> m = nn.Upsample(scale_factor=2, mode='nearest')
>>> m(input)

(0 ,0 ,.,.) =
  1  1  2  2
  1  1  2  2
  3  3  4  4
  3  3  4  4
[torch.FloatTensor of size (1,1,4,4)]

Bilinear interpolation

Bilinear interpolation is less computationally efficient than nearest neighbor interpolation, but it gives a more accurate approximation. Each output pixel value is calculated as a distance-weighted average of the nearest input pixel values (four of them in the 2-D case).
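Here is a minimal NumPy sketch of bilinear upsampling for a single 2-D array. It assumes pixel values sit at pixel centers and that coordinates are clamped at the image edges, which to my understanding corresponds to `align_corners=False` in PyTorch; the function name `bilinear_upsample` is my own.

```python
import numpy as np

def bilinear_upsample(A, scale):
    """Bilinear upsampling of a 2-D array by an integer factor."""
    A = np.asarray(A, dtype=float)
    h, w = A.shape
    # Source coordinates of every output pixel center, clamped to the image.
    ys = np.clip((np.arange(h * scale) + 0.5) / scale - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * scale) + 0.5) / scale - 0.5, 0, w - 1)
    # Indices of the two neighboring rows and columns.
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    # Interpolation weights: distance from the upper/left neighbor.
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    # Interpolate horizontally on both rows, then vertically between them.
    top = A[y0][:, x0] * (1 - wx) + A[y0][:, x1] * wx
    bottom = A[y1][:, x0] * (1 - wx) + A[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

print(bilinear_upsample([[1, 2], [3, 4]], 2))
```

For the 2×2 input used in this section, the first output row comes out as 1.00, 1.25, 1.75, 2.00: the corner values are preserved and the values in between are blends of their neighbors.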

[PyTorch] F.upsample(..., mode='bilinear')

>>> input = torch.arange(1, 5).view(1, 1, 2, 2)
>>> input

(0 ,0 ,.,.) =
  1  2
  3  4
[torch.FloatTensor of size (1,1,2,2)]

>>> m = nn.Upsample(scale_factor=2, mode='bilinear')
>>> m(input)

(0 ,0 ,.,.) =
  1.0000  1.2500  1.7500  2.0000
  1.5000  1.7500  2.2500  2.5000
  2.5000  2.7500  3.2500  3.5000
  3.0000  3.2500  3.7500  4.0000
[torch.FloatTensor of size (1,1,4,4)]

Transposed convolution

In transposed convolution, we have weights that are learned through backpropagation. In papers I have come across all of these upsampling methods used in various situations, and in practice you may change your architecture and try each of them to see which works best for your problem. I personally prefer transposed convolution because it gives more control, but you can also go with bilinear or nearest neighbor interpolation for simplicity.

[PyTorch] nn.ConvTranspose2d(..., stride=..., padding=...)
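To see what a transposed convolution computes, here is a single-channel NumPy sketch (no bias, no dilation, no output padding). The function name `conv_transpose2d` is my own, and the kernel here is fixed rather than learned; in a real network the kernel values are what backpropagation updates.

```python
import numpy as np

def conv_transpose2d(x, kernel, stride=1, padding=0):
    """Single-channel transposed convolution of a 2-D array."""
    h, w = x.shape
    kh, kw = kernel.shape
    # Full output size before cropping: (in - 1) * stride + kernel.
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            # Every input pixel "stamps" a scaled copy of the kernel on the output.
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * kernel
    # Padding crops the border, so the final size is
    # (in - 1) * stride - 2 * padding + kernel.
    if padding:
        out = out[padding:-padding, padding:-padding]
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
upsampled = conv_transpose2d(x, np.ones((2, 2)), stride=2)
print(upsampled)
```

With an all-ones 2×2 kernel and stride 2, the result is the same as nearest neighbor upsampling; a learned kernel lets the network pick its own upsampling filter instead.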

Figure 8. Example of transposed convolution with different parameters, from github.com/vdumoulin/c… [12]

In the specific case of this Data Science Bowl, the main drawback of a vanilla U-Net is overlapping nuclei. As the earlier figure shows, if we create a binary mask and use it as the target, U-Net will learn to predict masks just like it, so overlapping or adjacent nuclei end up in a single merged mask.
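The merging problem is easy to reproduce with a toy binary mask. The sketch below counts 4-connected foreground regions with a simple breadth-first search; the helper `count_components` and the tiny example masks are made up for illustration.

```python
import numpy as np
from collections import deque

def count_components(mask):
    """Count 4-connected foreground regions in a binary mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    count = 0
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        # A new, unvisited foreground pixel starts a new region.
        count += 1
        seen[start] = True
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                if inside and mask[nr, nc] and not seen[nr, nc]:
                    seen[nr, nc] = True
                    queue.append((nr, nc))
    return count

# Two nuclei with a one-pixel gap, and the same two nuclei touching.
separated = np.array([[1, 1, 0, 1, 1],
                      [1, 1, 0, 1, 1]])
touching = np.array([[1, 1, 1, 1],
                     [1, 1, 1, 1]])
print(count_components(separated), count_components(touching))
```

Once two nuclei touch, a binary mask alone cannot tell them apart, no matter how accurate the per-pixel prediction is.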

Figure 9. Overlapping nuclei masks

To deal with overlapping instances, the authors of the U-Net paper use a weighted cross entropy that makes the network focus on learning the cell borders. This helps to separate overlapping instances. The basic idea is to put more weight on the borders so that the network learns the gaps between adjacent instances.
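The weight map from the U-Net paper has the form w(x) = w_c(x) + w0 · exp(−(d1(x) + d2(x))² / (2σ²)), where d1 and d2 are the distances to the nearest and second nearest cell, with w0 = 10 and σ ≈ 5 pixels. Below is a brute-force NumPy sketch for small images. It makes two simplifying assumptions of my own: the class-balancing term w_c is fixed to 1, and the border term is applied to background pixels only. The function name is hypothetical.

```python
import numpy as np

def unet_weight_map(labels, w0=10.0, sigma=5.0):
    """Pixel weights that emphasize the gaps between neighboring instances.

    labels: 2-D integer array, 0 = background, 1..K = instance ids.
    """
    ids = [i for i in np.unique(labels) if i != 0]
    if len(ids) < 2:
        return np.ones(labels.shape)
    # Coordinates of every pixel, shape (N, 2).
    pixels = np.argwhere(np.ones(labels.shape, dtype=bool))
    distances = []
    for i in ids:
        points = np.argwhere(labels == i)
        # Distance from every pixel to the closest pixel of instance i.
        diff = pixels[:, None, :] - points[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
        distances.append(d.reshape(labels.shape))
    # d1, d2: distances to the nearest and second nearest instance.
    d1, d2 = np.sort(np.stack(distances), axis=0)[:2]
    border_term = w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
    return 1.0 + border_term * (labels == 0)

labels = np.zeros((5, 9), dtype=int)
labels[1:4, 1:4] = 1   # first nucleus
labels[1:4, 5:8] = 2   # second nucleus, one pixel to the right
weights = unet_weight_map(labels)
print(weights.round(2))
```

Pixels in the narrow gap between the two nuclei get the largest weights, so mistakes there cost the most during training.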

Figure 10. Weight map

Figure 11. (a) Original image (b) ground truth overlay, with a different color for each instance (c) generated segmentation mask (d) pixel-wise weight map

Another way to tackle this kind of problem is to convert the binary mask into a multi-class target, an approach used by many competitors, including the winning solution. One advantage of U-Net is that you can build a network with any number of outputs by using a 1×1 convolution in the last layer, so it can represent multiple classes.
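A 1×1 convolution is simply the same linear layer applied at every pixel, which is why changing the number of output classes only changes the shape of one weight matrix. Here is a NumPy sketch with random weights and made-up sizes (32 input channels, 3 classes).

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 3
features = rng.normal(size=(32, 4, 4))        # (channels, height, width) from the last up block
weights = rng.normal(size=(num_classes, 32))  # one weight vector per output class
bias = np.zeros(num_classes)

# 1x1 convolution: the same (num_classes x 32) linear map at every pixel.
logits = np.einsum('oc,chw->ohw', weights, features) + bias[:, None, None]
print(logits.shape)  # (3, 4, 4)
```

Switching from a binary mask to a three-class target therefore only means changing `num_classes`; the rest of the network stays as it is.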

Here is a quote from the winning solution of the Data Science Bowl [13]:

2-channel masks for networks with sigmoid activation, i.e. (mask - boundary, boundary); 3-channel masks for networks with softmax activation, i.e. (mask - boundary, boundary, 1 - mask - boundary); 2-channel full masks, i.e. (mask, boundary).

After these predictions, traditional image processing algorithms such as watershed can be used as post-processing to further separate individual nuclei. [14]
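As an illustration of the 3-channel target quoted above, here is a NumPy sketch that turns an instance label image into (mask - boundary, boundary, 1 - mask). The boundary definition is a simplification of my own: a boundary pixel is any instance pixel with a 4-neighbor that carries a different label. The winning solution may well have defined its boundaries differently, and the function name is hypothetical.

```python
import numpy as np

def make_three_channel_target(labels):
    """Build (interior, boundary, background) channels from instance labels.

    labels: 2-D integer array, 0 = background, 1..K = instance ids.
    """
    h, w = labels.shape
    mask = labels > 0
    padded = np.pad(labels, 1, mode='edge')
    differs = np.zeros((h, w), dtype=bool)
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        # The label of the neighbor in direction (dy, dx) for every pixel.
        neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        differs |= neighbor != labels
    boundary = mask & differs
    interior = mask & ~boundary
    background = ~mask
    return np.stack([interior, boundary, background]).astype(np.uint8)

labels = np.zeros((5, 5), dtype=int)
labels[1:4, 1:4] = 1   # a single 3x3 nucleus
target = make_three_channel_target(labels)
print(target)
```

Every pixel belongs to exactly one channel, so the target can be trained with a softmax over three classes, and the predicted interiors can serve as seeds for watershed.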

Figure 12. Visualized classes: foreground (green), contour (yellow), background (black)

This was the first official computer vision competition on Kaggle that I had the courage to enter, and it happened to be the Data Science Bowl. Even though I only finished in the top 20% (which is usually considered average for a competition), I enjoyed taking part and learned things that you can only learn by actually participating and trying. Active learning is far more productive than just watching and reading online course material.

As a deep learning practitioner who had only recently started with the fast.ai courses, I see this as an important step in a long learning journey, and one that gave me valuable experience. So I encourage you to deliberately take on challenges you have never seen or solved before, and to feel the great joy of learning the unknown.

Another valuable lesson I learned during the competition is that in computer vision (and in NLP as well), it is important to inspect individual predictions with your own eyes to see what works and what doesn't. If your dataset is small enough, you should check every single output. This can lead you to better ideas, or help you debug your code when something goes wrong.

Transfer learning and beyond

So far, we have looked at the building blocks of the vanilla U-Net architecture and at how to transform the target to tackle instance segmentation. Now let's discuss the flexibility of this kind of encoder-decoder network a bit further. By flexibility, I mean the freedom and creativity you have in designing your network.

Transfer learning is such a powerful idea that nobody working with deep learning can avoid it. Simply put, transfer learning means using a network that was pre-trained on a similar task with abundant data when you don't have a large dataset of your own. Even when you do have enough data, transfer learning can improve performance to some extent, and it works not only in computer vision but in NLP as well.

Transfer learning is also a powerful technique for U-Net-like architectures. We previously defined the two main parts of U-Net: the down path and the up path. Here we will think of them as the encoder and the decoder. The encoder takes the input and encodes it into a lower-dimensional feature space. So imagine replacing this encoder with your favorite ImageNet winner: VGG, ResNet, Inception, NasNet, whatever you want. These carefully engineered networks all do one thing: encode natural images in the best possible way, and their weights pre-trained on ImageNet are available online.

So why not use one of these architectures as our encoder and build the decoder on top of it, giving us something that works just like the original U-Net, only more powerful?

TernausNet, a network architecture that was part of the winning solution of the Kaggle Carvana Image Masking Challenge, uses the same idea, with VGG11 as the encoder. [15, 16]

TernausNet by Vladimir Iglovikov and Alexey Shvets

Fast.ai: Dynamic U-Net

Inspired by the TernausNet paper and many other excellent sources, I wanted to generalize the idea of using pre-trained or custom encoders in U-Net-like architectures. So I came up with a general architecture: Dynamic U-Net.

Dynamic U-Net is an implementation of this idea: it does all the calculations and matching for you and automatically creates the decoder for any given encoder. The encoder can be either an off-the-shelf pre-trained network or a custom architecture.
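To give a feel for the kind of bookkeeping involved, here is a small pure-Python sketch that derives the decoder block shapes from the channel counts of an encoder. This is not the fast.ai implementation or its API; the function `plan_decoder` is hypothetical, and it assumes upsampling by interpolation (which keeps the channel count) and decoder blocks that output as many channels as the skip connection they receive.

```python
def plan_decoder(encoder_channels):
    """Return (in_channels, out_channels) for every decoder block.

    encoder_channels: output channels of each encoder stage, from the
    shallowest stage to the bottleneck.
    """
    skips = encoder_channels[:-1][::-1]   # skip connections, deepest first
    current = encoder_channels[-1]        # channels leaving the bottleneck
    plan = []
    for skip in skips:
        in_channels = current + skip      # upsampled features + skip features
        out_channels = skip
        plan.append((in_channels, out_channels))
        current = out_channels
    return plan

# Channel counts per stage as in the original U-Net encoder.
print(plan_decoder([64, 128, 256, 512, 1024]))
```

Once the channel counts of an encoder are known (for example by passing a dummy input through it), the matching decoder can be generated automatically in this fashion.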

It is written in PyTorch and currently lives in the fast.ai library. You can refer to the documentation for example usage, or look at the source code. The main goal of Dynamic U-Net is to save development time and to make it easy to experiment with different encoders using as little code as possible.

In Part 2, I will explain an encoder-decoder model for 3D data such as MRI (magnetic resonance imaging) scans, and present real-world cases that I have been working on.

References

[5] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks: arxiv.org/abs/1506.01…

[6] Mask R-CNN: https://arxiv.org/abs/1703.06870

[7] Feature Pyramid Networks for Object Detection: https://arxiv.org/abs/1612.03144

[8] Fully Convolutional Networks for Semantic Segmentation: https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf

[9] U-Net: Convolutional Networks for Biomedical Image Segmentation: https://arxiv.org/abs/1505.04597

[10] TensorFlow Mask-RCNN: https://github.com/matterport/Mask_RCNN

[11] PyTorch Mask-RCNN: https://github.com/multimodallearning/pytorch-mask-rcnn

[12] Convolution Arithmetic: https://github.com/vdumoulin/conv_arithmetic

[13] Data Science Bowl 2018 Winning Solution, ods-ai: https://www.kaggle.com/c/data-science-bowl-2018/discussion/54741

[14] Watershed Algorithm: docs.opencv.org/3.3.1/d3/db…

[15] Carvana Image Masking Challenge: www.kaggle.com/c/carvana-i…

[16] TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation: https://arxiv.org/abs/1801.05746

Thanks to Prince Grover and Serdar Ozsoy.

If you find errors in this translation or places that could be improved, you are welcome to submit a pull request with your changes to the translation project, which also earns the corresponding reward points. The permalink at the top of this article is the Markdown source of this article on GitHub.
