This article is reposted from Zhihu and was written by one of the authors of the UNet++ model, who analyzed UNet++ in detail at the end of 2019 and explained it very well. What follows is that write-up plus my personal understanding; the bold parts are my own notes. Friends who want to discuss and exchange ideas can add my WeChat: CYx645016617, or join the AI algorithm discussion group I set up, which has a great atmosphere. I'm just a fresh graduate on the road to intelligent algorithms.

Table of Contents: [TOC]

1 Groundwork

In the field of computer vision, the fully convolutional network (FCN) is a well-known image segmentation network. In medical image processing, U-Net is arguably even more popular: for basically every segmentation problem we first try U-Net to get a baseline result, and then apply all kinds of "magic modifications" to it.

U-Net is very similar to FCN. U-Net was proposed a little later than FCN, and both were published in 2015. Compared with FCN, the first feature of U-Net is that it is completely symmetric: the left and right halves look very much alike, whereas FCN's decoder is comparatively simple. The second difference is the skip connection: FCN uses summation while U-Net uses concatenation. These are details; the key point is that both structures use a classical idea, the encoder-decoder, which Hinton proposed in 2006 and published in Science.
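As a minimal illustration of that second difference (my own sketch, not from the original article), here is a hedged PyTorch-style snippet contrasting the two skip-connection flavors; `enc_feat` and `dec_feat` are hypothetical feature maps with matching spatial size:

```python
import torch

# hypothetical feature maps: same spatial size, same channel count
enc_feat = torch.randn(1, 64, 56, 56)   # feature from the encoder (skip branch)
dec_feat = torch.randn(1, 64, 56, 56)   # feature from the decoder path

# FCN-style skip: element-wise summation (channel count stays 64)
fcn_skip = enc_feat + dec_feat

# U-Net-style skip: concatenation along the channel dimension (channels become 128)
unet_skip = torch.cat([enc_feat, dec_feat], dim=1)

print(fcn_skip.shape, unet_skip.shape)
# torch.Size([1, 64, 56, 56]) torch.Size([1, 128, 56, 56])
```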

At that time, the main purpose of this structure was not segmentation but image compression and denoising. The input is an image, which the encoder downsamples into a set of features smaller than the original image, which amounts to compression; then, through a decoder, the image is ideally restored to the original. That way, to store an image we only need to keep the small feature and a decoder. I personally find it a beautiful idea. The same idea can also be used for denoising: during training, noise is artificially added to the original image, the noisy image is fed into the encoder-decoder network, and the target is to recover the clean original.
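A minimal sketch of that denoising setup (my own illustration, not from the original article), assuming a generic toy encoder-decoder and a batch of clean images:

```python
import torch
import torch.nn as nn

# hypothetical tiny encoder-decoder; any autoencoder-shaped model would do
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),  # encode: downsample
    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),    # decode: upsample back
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

clean = torch.rand(8, 1, 64, 64)               # stand-in for a batch of clean images
noisy = clean + 0.1 * torch.randn_like(clean)  # artificially add noise

recon = model(noisy)            # the network sees the noisy image...
loss = loss_fn(recon, clean)    # ...but is trained to reproduce the clean one
opt.zero_grad(); loss.backward(); opt.step()
```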

Later, this idea was applied to image segmentation, giving the U-Net structure we see today. In the three years since it was proposed, there have been many papers on improving U-Net or FCN, but the essential topology of this kind of segmentation network has not changed. For example, Mask R-CNN, proposed by Kaiming He at ICCV last year (2017), is effectively an integration of detection, classification and segmentation; if you look carefully at its segmentation branch, it actually uses this simple FCN structure. This shows that the "U-shaped" encode-decode structure is really simple and, most importantly, easy to use.

Let's take a quick look at the network structure. We first extract its topology, which makes it easier to analyze its essence and strips away a lot of distracting detail.

The input is an image and the output is the segmentation of the target. Simplifying further: you take an image, encode it (downsample), then decode it (upsample), and output a segmentation result. Based on the difference between that result and the ground-truth segmentation, the segmentation network is trained by backpropagation. We can say that the essential parts of U-Net are these three:

  • Down-sampling (the encoder);
  • Up-sampling (the decoder);
  • Skip connections (concatenating, along the channel dimension, feature maps of the same scale obtained from down-sampling and up-sampling).

The lines in this figure give an intuitive picture of the processing flow: the output and input of the segmentation task have the same size, the downward solid lines mean the image is downsampled (the encode step), and the upward solid lines mean the image is upsampled (the decode step). This was covered well in the U-Net article explained earlier.
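To make the topology concrete, here is a hedged, heavily simplified sketch of the U-shaped pattern (my own toy with arbitrary channel sizes, not the structure from the original paper): two downsampling steps, two upsampling steps, and concatenation skips.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # two 3x3 convolutions, the usual U-Net building block
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """A two-level toy U-Net: encode (downsample) -> decode (upsample) with concat skips."""
    def __init__(self, n_classes=1):
        super().__init__()
        self.enc0 = conv_block(1, 32)
        self.enc1 = conv_block(32, 64)
        self.bottom = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)          # 64 (skip) + 64 (upsampled) channels in
        self.up0 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec0 = conv_block(64, 32)           # 32 (skip) + 32 (upsampled) channels in
        self.head = nn.Conv2d(32, n_classes, 1)  # 1x1 conv to the segmentation map

    def forward(self, x):
        e0 = self.enc0(x)                        # full resolution
        e1 = self.enc1(self.pool(e0))            # 1/2 resolution
        b = self.bottom(self.pool(e1))           # 1/4 resolution
        d1 = self.dec1(torch.cat([e1, self.up1(b)], dim=1))  # skip: concat, not sum
        d0 = self.dec0(torch.cat([e0, self.up0(d1)], dim=1))
        return self.head(d0)                     # same spatial size as the input

print(TinyUNet()(torch.rand(1, 1, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```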


2 A Question

The basic groundwork has been laid. Looking at this topology, a very natural question is:

Is there really nothing wrong with this three-year-old static topology?

In these three years, U-Net has received over 2500 citations and FCN nearly 6000. What are all those improvements doing? If you were to improve on this classic structure, where would you focus?

The first question is: how deep is appropriate?

Here I would like to emphasize something: many papers present a network structure together with a pile of details, such as what kind of convolution, how many layers, how to downsample, how much to upsample, which optimizer, and so on. These are intuitive hyperparameters, and the values given in a paper are not necessarily optimal, so obsessing over them is not very meaningful. What we really care about in a network structure is what its design is saying. Take U-Net as an example: the structure in the original paper downsamples the input image four times and upsamples four times to obtain the segmentation result. Why four? It could simply be the author's preference, or four downsamplings happened to work well on the dataset the author used at the time. We could also sound more professional and say that the receptive field after four downsamplings is appropriate for the images at hand, or that four downsamplings suit the size of the input image, and so on for a bunch of reasons. But do you really believe that? Not really.

Let me first show a segmentation network called PSPNet, published at CVPR 2017. You will find that its overall architecture is similar to U-Net, but the number of downsampling steps is reduced; of course, they also deliberately increased the complexity of the feature extraction in the middle.

And if you don't think that work is enough to show that four downsamplings are not necessary, take a look at a recent image segmentation paper from Yoshua Bengio's group. Their proposed structure is called Tiramisu. It is also U-shaped, but as you can see, they only use three downsamplings. So how deep should we go? Is deeper always better? It is still an open question.

The first message I want to share is this: focus on the big picture and don't let the details limit your creativity. Tweaking such detailed parameters is low-level deep-learning work; it can easily eat up a lot of time without improving your research.

Okay, let's get back to the question of how deep we need to go. In fact, this is very flexible. One of the points involved is the feature extractor: one reason U-Net and FCN were so successful is that they provide a network framework in which the specific feature extractor is up to you. This is where the highly cited papers came in: all kinds of micro-innovations on the encoder appeared in an endless stream, the most direct being to plug in the star structures from ImageNet, bottleneck and residual blocks a few years ago and last year's DenseNet, with everyone racing to publish first. Papers like this go from 1 to 10, while proposing the underlying U-Net structure goes from 0 to 1. As an aside, 1-to-10 papers tend not to get as many citations as 0-to-1 papers, because they unconsciously limit their own scope: for example, if I write a paper claiming the feature extractors have to be dense blocks, or that residual blocks are the way to go, and name it DenseUNet or ResUNet, that is about as far as it goes. So exactly which backbone to use is not what I want to talk about this time.
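As a hedged illustration of that "pluggable feature extractor" point (my own sketch, not from the original article), the encoder block of the toy U-Net above could be swapped for a residual-style block without touching the U-shaped framework:

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    # the vanilla U-Net style extractor: two 3x3 convs
    def __init__(self, cin, cout):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class ResidualBlock(nn.Module):
    # a ResNet-flavoured extractor: same interface, different internals
    def __init__(self, cin, cout):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1),
        )
        self.proj = nn.Conv2d(cin, cout, 1) if cin != cout else nn.Identity()
    def forward(self, x):
        return torch.relu(self.body(x) + self.proj(x))

# the U-shaped framework stays the same; only the block factory changes
def build_encoder(block_cls, channels=(1, 32, 64, 128)):
    return nn.ModuleList(
        [block_cls(channels[i], channels[i + 1]) for i in range(len(channels) - 1)]
    )

encoder = build_encoder(ResidualBlock)  # or build_encoder(PlainBlock)
```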

An extension of the "how deep" question is: is downsampling even necessary for a segmentation network at all? The reason for asking is this: if the input and output are images of the same size, why bother downsampling and then upsampling?

The most direct answer, of course, is the theoretical significance of downsampling; let me go through it briefly. It increases robustness to small perturbations of the input image, such as translation and rotation, reduces the risk of over-fitting, reduces the amount of computation, and enlarges the receptive field. The main role of upsampling is to decode the abstract features back to the original image size and finally obtain the segmentation result.
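A tiny sketch of the size bookkeeping (my own illustration): pooling halves the spatial resolution, which is what shrinks the computation and grows the receptive field, and a transposed convolution (or interpolation) brings the map back to the input size.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 32, 128, 128)                      # a hypothetical feature map
down = nn.MaxPool2d(2)(x)                            # downsample: 128x128 -> 64x64
up = nn.ConvTranspose2d(32, 32, 2, stride=2)(down)   # upsample back: 64x64 -> 128x128
print(x.shape, down.shape, up.shape)
```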

These theoretical explanations make sense. My understanding is that in the feature-extraction stage, shallow layers capture simple features of the image, such as edges and colors, while deep layers, because of their larger receptive field and the accumulated convolution operations, capture abstract features of the image that are hard to put into words; it gets more and more like metaphysics. In short, shallow layers have their focus and deep layers have their advantages. Then I have to ask a sharp question: since both shallow and deep features are important, why does U-Net only come back up after four downsamplings, i.e. only after capturing the deepest features? I don't know if I've made the question clear. Looking at this figure, the problem is really this: since the features captured at $X^{1,0}$, $X^{2,0}$, $X^{3,0}$ and $X^{4,0}$ are all important, why do I have to go all the way down to $X^{4,0}$ before I start upsampling?

3 Main Body

If you can guess what I'm going to put on the next slide, then you are following my logic. You can pause here and think. Can you see it coming? If you strip away all the other distractions: since we don't know how deep we need to go, don't we get a series of U-Nets like this, with varying depths? This is not hard to understand. To find out whether deeper is better, we should run experiments with them and see how each behaves:

Ignore the two UNet++ columns for now and just look at the U-Nets of different depths. We can see that deeper is not always better. The message behind this is that the importance of features at different levels differs across datasets; it is not the case that a 4-level U-Net, the structure given in the original paper, is optimal for segmentation on every dataset.

So here's the key point: we now have a clear goal, which is to use both shallow and deep features! But you can't train all of these U-Nets separately; that's too expensive. OK, you can pause here and think about how to take advantage of these U-Nets of different depths, each of which captures features at a different level.

It's actually quite easy to sketch this out.

See, this connects the 1-level to 4-level U-Nets together. Looking at a subset of it, it contains a 1-level U-Net, a 2-level U-Net, and so on. The nice thing about this structure is: whatever depth of features turns out to be useful, I simply use them all and let the network learn how important the features of each depth are. The second benefit is that it shares one feature extractor, i.e. you don't need to train a pile of U-Nets; you only need to train one encoder, whose features at different levels are restored by different decoder paths. This encoder can still be flexibly swapped out for various backbones.
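A hedged sketch of that "one encoder, several decoder paths" idea (my own simplification with made-up channel sizes, not the exact structure in the figure): the encoder is computed once, and decoders of depth 1 to 3 each produce a segmentation output from the levels they need.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class SharedEncoderUNets(nn.Module):
    """Hypothetical: one shared encoder, decoders of depth 1..3, one output per depth."""
    def __init__(self, ch=(32, 64, 128, 256), n_classes=1):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.enc = nn.ModuleList(
            [conv_block(1 if i == 0 else ch[i - 1], ch[i]) for i in range(len(ch))]
        )
        # dec[d-1][i] refines level i inside the depth-d decoder path
        self.dec = nn.ModuleList([
            nn.ModuleList([conv_block(ch[i] + ch[i + 1], ch[i]) for i in range(d)])
            for d in range(1, len(ch))
        ])
        self.heads = nn.ModuleList([nn.Conv2d(ch[0], n_classes, 1) for _ in range(len(ch) - 1)])

    def forward(self, x):
        feats = []
        for i, enc in enumerate(self.enc):           # the encoder runs only once
            x = enc(x if i == 0 else self.pool(x))
            feats.append(x)
        outs = []
        for d, (path, head) in enumerate(zip(self.dec, self.heads), start=1):
            y = feats[d]                             # start from encoder level d
            for i in range(d - 1, -1, -1):           # walk back up to level 0
                y = path[i](torch.cat([feats[i], self.up(y)], dim=1))
            outs.append(head(y))                     # one segmentation map per depth
        return outs

outs = SharedEncoderUNets()(torch.rand(1, 1, 64, 64))
print([o.shape for o in outs])   # three maps, all at the 64x64 input resolution
```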

Unfortunately, this network structure cannot be trained as it is: no gradient flows through the red area, because during backpropagation it is disconnected from the place where the loss function is computed. Pause for a moment and think about whether that is really the case.

The problem with gradient-based training can be seen by looking closely at the dotted lines in the figure below: the elements inside the triangle are not connected to the final output.

So, if it were you, how would you solve this problem?

In fact, once the structure has been laid out like this, the solutions come quite naturally. I offer two alternatives.

The first is to force gradients in with deep supervision; I'll come back to that later. The second is to change the structure into something like this:

This structure might not be easy to come up with without the reasoning above, but if you follow it step by step it makes sense, even if it is not obvious. I haven't finished my story yet, but let me mention that this structure was proposed by a UC Berkeley team and published as an oral paper at CVPR 2018, entitled "Deep Layer Aggregation".

Some of you may be shocked: that's it? And it got into CVPR?! Yes and no. This is only the segmentation-network improvement they presented in the paper; they also did other work, including classification and edge detection. But the main idea is what we just walked through: the goal is to integrate features from different levels.

That was just a digression. Back to this structure: what do you think is wrong with it?

To answer that, let's compare this structure with the one above. It is not hard to see that it throws away the long connections of the original U-Net and replaces them with a series of short connections. So let's look at what U-Net's much-praised skip connections actually buy us.

We believe the long connections in U-Net are necessary: they bring in a lot of information from the input image and help recover the information lost to downsampling. To some extent I think they are very similar to the residual operation, i.e. x + f(x). I don't know whether you agree. Therefore, my suggestion is to combine long and short connections. What does that plan look like in your mind? Pause here and think.

What I came up with is, frankly, crazy feature concatenation, concatenation, concatenation…

It is basically a structure that leaves nothing out. Compare it with the original U-Net, which is in fact a subset of this structure. And this structure is what we published at MICCAI as UNet++. Enthusiastic readers may ask: isn't your UNet++ too similar to the CVPR structure you just mentioned? Let me note that this work and the UC Berkeley research are completely independent, which is a nice coincidence. The idea of UNet++ had already taken shape at the beginning of this year; we saw the CVPR paper at a meeting in July, by which time UNet++ had already been accepted, so they were effectively proposed at the same time. Moreover, compared with that CVPR paper, I have a more delightful point saved for later; consider it foreshadowing.

Well, as you stare at UNet++ now, I wonder whether a question suddenly pops into your head:

This network performs better than U-Net, but how many parameters does it add? Could the improvement simply come from having more parameters than U-Net?

That is a very sharp question, and answering it actually requires designing an experiment. How do you design it? What we did was forcibly increase the number of parameters in U-Net by making it wider, i.e. increasing the number of convolution kernels per layer. For this we designed a reference structure called wide U-Net. UNet++ has 9.04M parameters and U-Net has 7.76M, so UNet++ has about 16% more. We therefore designed wide U-Net to have almost the same number of parameters as UNet++, in fact slightly more, to test whether mindlessly adding parameters by itself makes the model better.
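For reference, counting parameters in PyTorch is a one-liner. The sketch below uses the toy models from earlier in this post (assuming they are in scope) rather than the real U-Net / wide U-Net / UNet++ implementations, so the numbers are only illustrative:

```python
def count_params(model):
    # total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"TinyUNet: {count_params(TinyUNet()) / 1e6:.2f}M parameters")
print(f"SharedEncoderUNets: {count_params(SharedEncoderUNets()) / 1e6:.2f}M parameters")
```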

I don't think this is a very strong answer, because just adding parameters is a bit perfunctory; there should be a better way to set up the comparison. Shortcomings aside, let's look at the results first.

Based on the existing results, my conclusion is that simply widening the network and adding parameters does not help much; how the parameters are used matters a great deal. The design of UNet++ is a way of spending parameters where they count.

Coming back to UNet++: for this main structure we give some interpretation in the paper. Simply put, the originally hollow U-Net has been filled in. The advantage is that features at different levels can be captured and integrated through feature stacking. Objects of different sizes are sensitive to different features: features with a large receptive field can easily recognize large objects, but in practice the edge information and the small objects themselves are easily lost through the repeated downsampling and upsampling of a deep network, and that is when features with a small receptive field can help. Another interpretation: if you look sideways at the feature-stacking process of one level, it looks like the DenseNet structure that was so hot last year; coincidentally, the original U-Net, viewed sideways, looks like a Residual structure, which is interesting. The reason UNet++ improves U-Net's segmentation may be the same reason DenseNet improves ResNet's classification. Therefore, in the interpretation we also borrow the advantages of dense connections, such as feature reuse.
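Written out in the notation used in the UNet++ paper (as I recall it, so treat the exact form as a sketch), each node $X^{i,j}$ stacks all earlier nodes at the same level together with the upsampled deeper node:

$$
x^{i,j} =
\begin{cases}
\mathcal{H}\!\left(x^{i-1,j}\right), & j = 0\\[4pt]
\mathcal{H}\!\left(\left[\,x^{i,0},\, x^{i,1},\, \dots,\, x^{i,j-1},\, \mathcal{U}\!\left(x^{i+1,j-1}\right)\right]\right), & j > 0
\end{cases}
$$

where $\mathcal{H}(\cdot)$ is a convolution block, $\mathcal{U}(\cdot)$ is upsampling, and $[\,\cdot\,]$ denotes concatenation along the channel dimension.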

All these interpretations are intuitive. In deep learning, the reason one structure beats another, or why adding some operation helps, often has a whiff of metaphysics about it, and a lot of work has gone into exploring the interpretability of deep networks. I won't spend more time on the main structure of UNet++.

4 Climax

Now, the part I'm about to talk about is the really interesting one. If I had three minutes to share this work, I would probably spend two and a half of them here. Remember the foreshadowing: the middle part of this structure receives no gradient during backpropagation if only the loss on the far right is used.

As I said, a very direct solution is deep supervision. The concept is not new, and there are quite a few papers that use it to improve U-Net. The concrete implementation is to add a 1×1 convolution after $X^{0,1}$, $X^{0,2}$ and $X^{0,3}$ in the figure, which amounts to supervising the output of the U-Net at each level, or each branch.
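A hedged sketch of what deep supervision looks like in training code (my own illustration; the equal weighting of the branch losses is a free choice, not something prescribed here). It assumes a model that, like the toy models above, returns one segmentation map per supervised branch:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def deep_supervision_loss(outputs, target):
    # outputs: list of segmentation logits, one per supervised branch;
    # every branch is compared against the same ground-truth mask,
    # so gradients reach every part of the nested structure
    return sum(bce(o, target) for o in outputs) / len(outputs)

# usage with the SharedEncoderUNets sketch from above (any multi-output model works)
model = SharedEncoderUNets()
x = torch.rand(2, 1, 64, 64)
y = (torch.rand(2, 1, 64, 64) > 0.5).float()
loss = deep_supervision_loss(model(x), y)
loss.backward()
```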

I offer three structures with deep supervision for comparison: the first is this one, the second adds it to the structure proposed by UC Berkeley, and the last adds it to UNet++. Which one do you think is best? This is an open question and I won't go into it here; in the paper we use the last one.

The specific form of deep supervision is not what I want to focus on; its advantages have been explained in many papers. What I want to emphasize is the huge advantage it brings when combined with such a filled-in U-Net structure. I strongly recommend pausing to think: what would be the benefit of adding this kind of deep supervision to the sub-network at every level during training?

In one word: pruning.

This raises three questions at once:

  • Why can UNet++ be pruned?
  • How is it pruned?
  • What are the benefits?

Let's first see why it can be pruned; this really is a nice figure. Look at the part being cut away: in the test phase, removing it has no effect on the outputs in front of it, because the input image only propagates forward; in the training phase, however, because there is both forward and backward propagation, the part being cut away helps update the weights of the remaining parts. These two statements are equally important, so let me repeat them: at test time, pruning does not affect the rest of the structure; at training time, it does. What does that mean?

Because with deep supervision, the output of each sub-network is already a segmentation of the image, so if the output of a small sub-network is good enough, we can freely cut away the redundant parts.

Take a look at this GIF. For ease of reference, we name the remaining sub-networks UNet++ L1, L2, L3, L4 according to their depth, abbreviated L1-L4. What is the ideal situation? L1, of course: if L1's output is good enough, the segmentation network becomes tiny after pruning.

This pruning is actually very simple compared with, say, the pruning rules of decision trees. Since UNet++ produces four outputs, if the output of the third level is accurate enough, the fourth level is simply not needed in the inference phase. I remember the idea of auxiliary outputs going back to GoogLeNet, a classification network, where it is called an auxiliary classifier (if I remember correctly). The more formal concept is deep supervision; look it up if you are interested.

In the training stage we need the feature guidance of all four levels, so we still train the complete model. In the inference stage, the authors reckon that a very small sacrifice in accuracy can buy a large reduction in model parameters. OK, let's go on with the author's explanation.
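To make the "train full, prune at inference" idea concrete, here is a hedged toy re-implementation (my own sketch with made-up channel sizes, not the official UNet++ code): the `level` argument selects which pruned sub-network L1-L4 to run, and only the nodes that sub-network needs are computed.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNetPP(nn.Module):
    """Toy UNet++-style model: nested dense skips, deep supervision, prunable depth."""
    def __init__(self, ch=(32, 64, 128, 256, 512), n_classes=1):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # backbone (encoder) nodes X^{i,0}
        self.enc = nn.ModuleList(
            [block(1 if i == 0 else ch[i - 1], ch[i]) for i in range(5)]
        )
        # nested decoder nodes X^{i,j}: j same-level inputs + one upsampled deeper node
        self.dec = nn.ModuleDict({
            f"{i}_{j}": block(ch[i] * j + ch[i + 1], ch[i])
            for j in range(1, 5) for i in range(5 - j)
        })
        # deep-supervision heads: one 1x1 conv per top-row node X^{0,1..4}
        self.heads = nn.ModuleList([nn.Conv2d(ch[0], n_classes, 1) for _ in range(4)])

    def forward(self, x, level=4):
        # level = which pruned sub-network (L1..L4) to run; train with level=4,
        # then at inference pick a smaller level so deeper nodes are never computed
        X = {(0, 0): self.enc[0](x)}
        for i in range(1, level + 1):
            X[(i, 0)] = self.enc[i](self.pool(X[(i - 1, 0)]))
        outs = []
        for j in range(1, level + 1):
            for i in range(level + 1 - j):
                feats = [X[(i, k)] for k in range(j)] + [self.up(X[(i + 1, j - 1)])]
                X[(i, j)] = self.dec[f"{i}_{j}"](torch.cat(feats, dim=1))
            outs.append(self.heads[j - 1](X[(0, j)]))
        return outs  # deep-supervision outputs of L1..L<level>

model = TinyUNetPP()
full = model(torch.rand(1, 1, 96, 96), level=4)    # training: all four outputs
pruned = model(torch.rand(1, 1, 96, 96), level=2)  # inference as L2: far less computation
print(len(full), len(pruned))                      # 4 2
```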

Here I would like to ask two questions:

  • Why do we prune at test time instead of directly training only L1, L2 or L3?
  • How do we decide how much to cut?

As for why we prune at test time rather than training L1, L2 or L3 directly: the explanation is actually in an earlier slide. The part that gets cut away contributes during backpropagation in training; if you trained with L1, L2 or L3 directly, it would be equivalent to training separate U-Nets of those depths, and the final results would be much worse.

The second question, how to decide how much to cut, is easier to answer. When training a model, the data is split into a training set, a validation set and a test set. The training set will of course be fit well, and the test set must not be touched, so we decide how much to cut according to the results of each sub-network on the validation set. The so-called validation set is data held out from the training set at the start in order to monitor the training process.
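As a hedged sketch of that selection rule (the function name, the tolerance and the example scores are all my own choices, not numbers from the paper): evaluate every pruned level on the validation set and keep the smallest one whose score stays within a tolerance of the best.

```python
def choose_prune_level(val_scores, tolerance=0.01):
    """val_scores: dict like {1: 0.72, 2: 0.91, 3: 0.92, 4: 0.92} mapping pruned level
    L1..L4 to its validation metric (e.g. IoU/Dice); returns the smallest level whose
    score is within `tolerance` of the best."""
    best = max(val_scores.values())
    for level in sorted(val_scores):
        if val_scores[level] >= best - tolerance:
            return level
    return max(val_scores)  # fallback: keep the full model

print(choose_prune_level({1: 0.72, 2: 0.91, 3: 0.92, 4: 0.92}))  # -> 2
```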

Ok, so with the idea out of the way, let’s look at the results.

First, look at the number of parameters of L1 through L4; the difference is large. L1 has only 0.1M parameters while L4 has 9M. In theory, then, if I am satisfied with L1's results, 98.8% of the parameters can be cut out of the model. But on our four datasets L1 does not work that well, because it is too shallow. However, three of the datasets show that L2's results are very close to L4's. That is to say, for those three datasets we do not need the 9M network; a network of about half a million parameters is enough.

If you think back to the question I asked at the beginning, how deep the network needs to be, this figure makes it obvious: the required depth is tied to the difficulty of the dataset. Of the four datasets, the second one, polyp segmentation, is the hardest. Looking at the vertical axis, which is the overall evaluation metric (higher is better), the other tasks reach quite high scores while polyp segmentation only gets to around 30, and for that difficult dataset you can see the curve keeps climbing as the network gets deeper. For most simple segmentation problems, you do not need a very deep, very large network to reach very good accuracy.

The horizontal axis is the time it took to segment 10,000 images during the test phase on a 12 GB TITAN X (Pascal). We can see that the inference time varies a lot with model size; comparing L2 and L4, the difference is about a factor of three.

For inference speed, this figure is clearer: we measured how many images per second can be segmented with each model. If you replace L4 with L2, you roughly triple the speed.
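A hedged sketch of how such a throughput number can be measured (my own illustration using the TinyUNetPP toy from above, on CPU for simplicity; on a GPU one would also call `torch.cuda.synchronize()` before reading the clock):

```python
import time
import torch

model = TinyUNetPP().eval()
x = torch.rand(1, 1, 96, 96)

with torch.no_grad():
    for level in (2, 4):
        start = time.perf_counter()
        for _ in range(50):                 # segment 50 images and average
            model(x, level=level)
        elapsed = time.perf_counter() - start
        print(f"L{level}: {50 / elapsed:.1f} images/second")
```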

Pruning is most useful on mobile devices. Judging by the parameter counts, if L2 achieves results similar to L4, the model's memory footprint can be reduced by a factor of about 18. That is still a substantial amount.

I think the pruning part is a great change relative to the original U-Net. The original structure was too rigid and did not make good use of the features at different levels.

To sum up briefly, the first advantage of UNet++ is improved accuracy, which presumably comes from integrating features at different levels. The second advantage is a flexible network structure combined with deep supervision, which lets a deep network with a huge number of parameters be slimmed down dramatically while staying within an acceptable accuracy range.

5 Final Remarks

Finally, here is our UNet++ structure once more.