nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation (nnU-Net: an adaptive medical image segmentation framework based on U-Net)

See other posts in this column for the implementation. Direct links 👇

[Click to download the paper] | [Click for the GitHub code]

What is nnU-Net and how good is it?

nnU-Net is a medical image segmentation framework developed by researchers (Fabian Isensee, Jens Petersen, Andre Klein) at the German Cancer Research Center, the University of Heidelberg, and Heidelberg University Hospital that adapts itself to any new dataset. The framework automatically adjusts all of its hyperparameters to the properties of a given dataset without any human intervention. Relying only on a plain U-Net architecture (the original U-Net) and a robust training scheme, nnU-Net achieves state-of-the-art performance on six well-established segmentation challenges.

Without any network modifications, the framework won first place in multiple challenges purely through adaptive handling of the data, which shows how strong it is. It is the king of medical image segmentation.

Listen to me: if you are going to do medical image segmentation, you must know how to use nnU-Net. There are two reasons:

  • First, test your task with nnU-Net to see roughly how well it can be solved and to have a baseline expectation in mind.
  • Second, when writing papers you usually have to compare against nnU-Net, especially for English-language journals, SCI papers and so on. It is so popular that reviewers know it well, so don't take any chances: either beat it or use it 😂. Its appearance really is a mixed blessing; it has made getting papers accepted harder ~

If you are interested in it, let’s read the paper together.

Abstract

U-Net was released in 2015. With its straightforward and successful architecture, it quickly became a common benchmark in medical image segmentation. However, adapting U-Net to a new problem leaves many degrees of freedom regarding the exact architecture, preprocessing, training, and inference. These choices are not independent of each other and have a significant impact on overall performance.

This article introduces nnU-Net ("no new-Net") (the name is fun, signalling that it is not a new network 😃), an adaptive framework built on 2D and 3D vanilla U-Nets. We make a strong case for cutting out many of the complexities of fancy network design and focusing on the other aspects that determine the performance and generalizability of an approach.

We evaluated nnU-Net in the Medical Segmentation Decathlon challenge. Result: on the challenge's online leaderboard, nnU-Net achieves the highest mean Dice scores for all classes of the seven phase 1 tasks (except class 1 of BrainTumour).

Note: in papers, the original U-Net is usually referred to as the vanilla U-Net.

1 Introduction

At present, deep convolutional neural networks (CNNs) dominate medical image segmentation. However, each segmentation benchmark seems to require a specific architecture design and training scheme to achieve competitive performance [1, 2, 3, 4, 5]. This has led to a large number of publications in the field, usually validated on only a few or even a single dataset, making it increasingly difficult for researchers to judge whether a method's promised superiority holds beyond the limited scope in which it was demonstrated.

The Medical Segmentation Decathlon is designed specifically to address this problem: participants are asked to create a segmentation algorithm that generalizes across 10 datasets corresponding to different entities of the human body. The algorithm may adapt dynamically to the specifics of a particular dataset, but is only allowed to do so in a fully automated manner. The challenge is divided into two consecutive phases: 1) a development phase, in which participants have access to seven datasets to optimize their method and must submit segmentations for the corresponding seven test sets produced with one final, frozen method; 2) a second phase that evaluates the exact same method on three previously undisclosed datasets.

Conclusion: most segmentation models are aimed at a specific task (such as heart segmentation) and require a task-specific network architecture and training setup; they solve one particular problem rather than a whole series of problems. The Medical Segmentation Decathlon aims to develop one algorithm for multiple segmentation tasks.

In this paper, we propose the nnU-Net ("no new-Net") framework. It is built on a set of three relatively simple U-Net models that contain only minor modifications to the original U-Net [6]. We omit recently proposed extensions such as residual connections [7, 8], dense connections [5], or attention mechanisms [4]. nnU-Net automatically adapts its architectures to the given image geometry. More importantly, though, the nnU-Net framework thoroughly defines all the other steps around them.

It is in these steps that most of a network's performance is gained or lost: preprocessing (e.g., resampling and normalization), training (e.g., loss, optimizer settings, and data augmentation), inference (e.g., patch-based strategies and ensembling across test-time augmentations and models), and potential post-processing (e.g., enforcing single connected components, if applicable).

Summary: these data-handling steps are the essence of nnU-Net; if you have time, look at the source code and learn how it processes the data.

2 Methods

2.1 Network architectures

Medical images usually have three dimensions, which is why we consider a pool of basic U-Net architectures made up of three models: a 2D U-Net, a 3D U-Net, and a U-Net Cascade. The 2D and 3D U-Nets take full-resolution images as input, whereas the Cascade first produces a coarse segmentation on low-resolution images and then refines it at full resolution. Compared with the original U-Net formulation, our architectural modifications are negligible; instead, we focus our efforts on designing an automatic training pipeline for these models.

Summary: when we look at the code, we will see that nnU-Net provides exactly these three architectures.

U-Net [6] is a successful encoder-decoder network that has attracted a lot of attention in recent years. Its encoder works like a traditional classification CNN, successively aggregating semantic information at the expense of spatial information. Since both semantic and spatial information are crucial for segmentation, the lost spatial information must somehow be recovered. U-Net does this through the decoder, which receives the semantic information from the bottom of the "U" and recombines it with higher-resolution feature maps obtained directly from the encoder via skip connections. Unlike other segmentation networks, such as FCN [9] and previous iterations of DeepLab [10], this allows U-Net to segment fine structures particularly well.

Summary: this paragraph summarizes the advantages of U-Net. It is worth noting down; you can refer to it when writing your own papers.

Just like the original U-Net, we use two plain convolutional layers between poolings in the encoder and transposed convolution operations in the decoder. We deviate from the original architecture in that we replace ReLU activation functions with leaky ReLUs (negative slope 1e-2) and use instance normalization [11] instead of the more popular batch normalization [12].

Summary: This illustrates two minor changes to the network architecture, modifying the activation function and the normalization method.
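To make these two changes concrete, here is a minimal PyTorch sketch (my own illustration, not the official nnU-Net code; the class name `ConvBlock3D` is made up): two plain convolutions, each followed by instance normalization and a leaky ReLU with negative slope 1e-2.

```python
import torch
import torch.nn as nn

class ConvBlock3D(nn.Module):
    """Two plain convolutions, each followed by InstanceNorm and LeakyReLU."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_channels),
            nn.LeakyReLU(negative_slope=1e-2, inplace=True),
            nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_channels),
            nn.LeakyReLU(negative_slope=1e-2, inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# Example: a batch of two single-channel 3D patches
x = torch.randn(2, 1, 32, 64, 64)
print(ConvBlock3D(1, 30)(x).shape)  # torch.Size([2, 30, 32, 64, 64])
```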

2.1.1 2D U-Net

Intuitively, using a 2D U-Net for 3D medical image segmentation does not seem like the best choice, because valuable information along the z-axis cannot be aggregated and taken into account. However, there is evidence [13] that the performance of conventional 3D segmentation methods deteriorates if the dataset is anisotropic (see the prostate dataset of the Decathlon challenge).

2.1.2 3D U-Net

A 3D U-Net seems like the natural choice for 3D image data. In an ideal world, we would train on the whole image of each patient. In practice, however, we are limited by the available GPU memory, which only allows us to train the architecture on image patches. While this is not a problem for datasets consisting of smaller images (in terms of voxels per patient), such as the Brain Tumour, Hippocampus, and Prostate datasets of this challenge, patch-based training, as dictated by datasets with large images such as Liver, may impede training. This is due to the limited field of view of the architecture, which cannot gather enough contextual information to, for example, correctly distinguish parts of the liver from parts of other organs.

Conclusion: 3D medical images are very large, so the whole image cannot be fed into the network; instead, the image is cut into patches. For a large structure like the liver, cutting it up loses a lot of context.

2.1.3 U-Net Cascade

To address this practical drawback of the 3D U-Net on datasets with large images, we additionally propose a cascade model. First, a 3D U-Net is trained on downsampled images (stage 1). The segmentation produced by this U-Net is then upsampled to the original voxel spacing and passed as an additional (one-hot encoded) input channel to a second 3D U-Net, which is trained at full resolution (stage 2). See Figure 1.

Conclusion: the advantages and disadvantages of the three variants are given here; you can choose according to your own task, or simply run all of them and see which works best.
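The data flow of the cascade can be sketched roughly as follows (a hypothetical illustration, not the real nnU-Net pipeline; `stage1` and `stage2` stand for the two trained 3D U-Nets, and in practice stage 2 operates on full-resolution patches rather than whole volumes):

```python
import torch
import torch.nn.functional as F

def cascade_forward(stage1, stage2, image_lowres, image_fullres, num_classes):
    # Stage 1: coarse segmentation on the downsampled image (logits -> labels)
    seg_lowres = stage1(image_lowres).argmax(dim=1)                  # (B, D, H, W)
    # Upsample the label map to the full-resolution grid (nearest neighbour)
    seg_up = F.interpolate(seg_lowres.unsqueeze(1).float(),
                           size=image_fullres.shape[2:], mode="nearest")
    # One-hot encode and append as additional input channels
    one_hot = F.one_hot(seg_up.squeeze(1).long(), num_classes)       # (B, D, H, W, C)
    one_hot = one_hot.permute(0, 4, 1, 2, 3).float()                 # (B, C, D, H, W)
    stage2_input = torch.cat([image_fullres, one_hot], dim=1)
    # Stage 2: refine at full resolution
    return stage2(stage2_input)
```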

2.1.4 Dynamically Adjusting the Network Topology

Because image sizes vary widely (the median shape is 482×512×512 for Liver but only 36×50×35 for Hippocampus), the input patch size and the number of pooling operations along each axis (and hence, implicitly, the number of convolutional layers) must be adapted automatically for each dataset so that spatial information can be fully aggregated. Besides accommodating the image geometry, there are technical constraints such as the available GPU memory to respect. Our guiding principle here is to dynamically trade off batch size against network capacity, as described below:

We start from network configurations known to work with our hardware setup. For the 2D U-Net, this configuration is an input patch size of 256×256, a batch size of 42, and 30 feature maps in the highest layers (the number of feature maps doubles with each downsampling). We automatically adapt these parameters to the median plane size of each dataset (here we use the plane with the smallest in-plane spacing, corresponding to the highest resolution), so that the network effectively trains on entire slices. We pool along each axis until the feature map size for that axis is smaller than 8 (but with at most 6 pooling operations).

Just like the 2D U-Net, our 3D U-Net uses 30 feature maps in the highest-resolution layers. Here we start from a base configuration with an input patch size of 128×128×128 and a batch size of 2. Due to memory constraints, we do not increase the input patch volume beyond 128³ voxels, but instead match the aspect ratio of the input patch size to the median shape of the dataset in voxels. If the median shape of the dataset is smaller than 128³, we use the median shape as the input patch size and increase the batch size (so that the total number of voxels processed stays the same as with 128×128×128 and a batch size of 2). As with the 2D U-Net, we pool along each axis (at most 5 times) until the feature maps have size 8.

For any network, we limit the total number of voxels processed per optimizer step (defined as input patch volume times batch size) to at most 5% of the dataset. For cases exceeding this, we reduce the batch size (with a lower bound of 2).
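The pooling heuristic described above can be sketched roughly like this (my own simplification with assumed defaults, not the official implementation): keep halving each axis as long as the result stays at least 8 voxels, up to a maximum number of pooling operations.

```python
import numpy as np

def num_pool_per_axis(patch_size, min_feature_map_size=8, max_num_pool=5):
    """For each axis, count how often it can be halved before the feature map
    would become smaller than `min_feature_map_size` (capped at `max_num_pool`)."""
    pools = []
    for axis_len in patch_size:
        n = 0
        while axis_len / 2 >= min_feature_map_size and n < max_num_pool:
            axis_len /= 2
            n += 1
        pools.append(n)
    return pools

print(num_pool_per_axis((128, 128, 128)))  # [4, 4, 4] -> bottleneck feature maps of size 8
print(num_pool_per_axis((36, 50, 35)))     # fewer poolings on the short axes
```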

The table below lists all network topologies generated for the phase 1 datasets.



Table: network topologies automatically generated for the seven phase 1 tasks of the Medical Segmentation Decathlon challenge. "3D U-Net low resolution" refers to the first stage of the U-Net Cascade. The configuration of the second stage of the U-Net Cascade is identical to that of the 3D U-Net.

2.2 Preprocessing

Preprocessing is part of the fully automated segmentation pipeline our method consists of, so the steps described below are carried out without any user intervention.

Crop: all data is cropped to the region of nonzero values. This has no effect on most datasets (e.g., liver CT), but reduces the size (and thus the computational burden) of skull-stripped brain MRI.

Resampling: CNNs do not natively interpret voxel spacing. In medical images, different scanners or acquisition protocols usually produce datasets with heterogeneous voxel spacings. To enable our networks to properly learn spatial semantics, all patients are resampled to the median voxel spacing of their respective dataset, using third-order spline interpolation for the image data and nearest-neighbour interpolation for the corresponding segmentation masks.

Whether the U-Net Cascade is needed is determined by the following heuristic: if the median shape of the resampled data contains more than 4 times the number of voxels that the 3D U-Net can process as an input patch (with a batch size of 2), the dataset qualifies for the U-Net Cascade, and it is additionally resampled to a lower resolution. This is done by repeatedly doubling the voxel spacing (halving the resolution) until the criterion above is satisfied. If the dataset is anisotropic, the higher-resolution axes are downsampled first until they match the low-resolution axis, and only then are all axes downsampled simultaneously. The following phase 1 datasets fall under this heuristic and therefore trigger the U-Net Cascade: Heart, Liver, Lung, and Pancreas.
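A rough sketch of this downsampling heuristic, under the assumptions stated in the paragraph above (the function name and the simple spacing-doubling loop are my own; the real code handles more corner cases):

```python
import numpy as np

def lowres_target_spacing(median_shape, spacing, patch_size=(128, 128, 128), factor=4):
    """Double the spacing (halve the resolution) while the resampled median shape
    still holds more than `factor` times the voxels of one 3D U-Net input patch."""
    median_shape = np.array(median_shape, dtype=float)
    spacing = np.array(spacing, dtype=float)
    patch_voxels = np.prod(patch_size)
    while np.prod(median_shape) > factor * patch_voxels:
        # downsample the highest-resolution axis first, so anisotropic datasets
        # become more isotropic before all axes shrink together
        axis = int(np.argmin(spacing))
        spacing[axis] *= 2
        median_shape[axis] /= 2
    return spacing, median_shape
```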

Normalization: because CT intensity values are quantitative (absolute), all CT images are normalized automatically based on statistics of the entire respective dataset: if the modality recorded in the dataset's json descriptor file is "CT", all intensity values occurring in the training data are collected, the whole dataset is normalized by clipping intensity values to the [0.5, 99.5] percentiles of these collected values, and a z-score normalization is then applied using their mean and standard deviation. For MRI or other modalities (i.e., if the string "CT" is not found in the modality field), simple z-score normalization is applied to each patient individually.

If cropping reduces the average patient size (in voxels) in a dataset by 1/4 or more, normalization is performed only within the mask of nonzero elements, and all values outside the mask are set to 0.
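Put together, the preprocessing steps above might look roughly like this (a condensed sketch with hypothetical helper names, not the official nnU-Net preprocessing code):

```python
import numpy as np
from scipy import ndimage

def crop_to_nonzero(image, seg):
    """Crop image and segmentation to the bounding box of nonzero image values."""
    coords = np.argwhere(image != 0)
    lo, hi = coords.min(axis=0), coords.max(axis=0) + 1
    sl = tuple(slice(l, h) for l, h in zip(lo, hi))
    return image[sl], seg[sl]

def resample(image, seg, spacing, target_spacing):
    """Resample to the dataset's median spacing."""
    zoom = np.array(spacing) / np.array(target_spacing)
    image_r = ndimage.zoom(image, zoom, order=3)   # 3rd-order spline for images
    seg_r = ndimage.zoom(seg, zoom, order=0)       # nearest neighbour for masks
    return image_r, seg_r

def normalize_ct(image, dataset_intensities):
    """CT: clip to dataset-wide [0.5, 99.5] percentiles, then global z-score.
    `dataset_intensities` is a 1D array of intensity values collected over the training set."""
    lo, hi = np.percentile(dataset_intensities, [0.5, 99.5])
    clipped = np.clip(image, lo, hi)
    return (clipped - dataset_intensities.mean()) / dataset_intensities.std()

def normalize_other(image):
    """MRI and other modalities: per-patient z-score."""
    return (image - image.mean()) / image.std()
```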

2.3 Training Procedures

All models are trained from scratch and evaluated using five-fold cross-validation on the training set. We train our networks with a combination of Dice loss and cross-entropy loss:


$$L_{total} = L_{dice} + L_{CE}$$

For the 3D U-Nets that operate on (nearly) entire patients (the first stage of the U-Net Cascade, and the plain 3D U-Net when no cascade is needed), we compute the Dice loss for each sample in the batch and average over the batch. For all other networks, we interpret the samples in a batch as a pseudo-volume and compute the Dice loss over all voxels in the batch.
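A minimal PyTorch sketch of the combined loss (my own formulation of $L_{total} = L_{dice} + L_{CE}$ for the pseudo-volume case, i.e., the soft Dice is computed over all voxels in the batch; not the exact nnU-Net implementation):

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-5):
    """logits: (B, C, ...) raw network output; target: (B, ...) integer labels."""
    ce = F.cross_entropy(logits, target)

    probs = torch.softmax(logits, dim=1)
    num_classes = logits.shape[1]
    one_hot = F.one_hot(target, num_classes).movedim(-1, 1).float()

    # soft Dice over all voxels in the batch (the "pseudo-volume" case)
    dims = (0,) + tuple(range(2, logits.ndim))
    intersection = (probs * one_hot).sum(dims)
    denominator = probs.sum(dims) + one_hot.sum(dims)
    dice = (2 * intersection + eps) / (denominator + eps)

    return ce + (1 - dice.mean())
```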

We use the Adam optimizer with an initial learning rate of $3\times10^{-4}$ in all experiments. We define an epoch as 250 training batches. During training, we maintain exponential moving averages of the training and validation losses... (too many numbers to transcribe here, I give up ~~)
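For orientation, a bare-bones training loop matching the numbers given above (Adam with an initial learning rate of 3e-4, one epoch = 250 batches); the learning-rate schedule and stopping criteria from the paper are omitted here, and the infinite batch generator is an assumption of mine:

```python
import torch
import torch.nn.functional as F

def train(network, batch_generator, num_epochs, device="cuda"):
    """`batch_generator` is assumed to yield (image, target) batches indefinitely."""
    optimizer = torch.optim.Adam(network.parameters(), lr=3e-4)  # initial LR from the paper
    batches_per_epoch = 250                                      # one epoch = 250 batches
    for epoch in range(num_epochs):
        for _ in range(batches_per_epoch):
            images, targets = next(batch_generator)
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            # nnU-Net uses the combined Dice + cross-entropy loss here;
            # plain cross-entropy keeps this sketch self-contained
            loss = F.cross_entropy(network(images), targets)
            loss.backward()
            optimizer.step()
```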

2.3.1 Data Augmentation

When training large neural networks from limited training data, special care must be taken to prevent overfitting. We address this problem with a variety of data augmentation techniques. The following augmentations are applied on the fly during training: random rotations, random scaling, random elastic deformations, gamma correction, and mirroring. Data augmentation is done with our own in-house framework, which is publicly available at github.com/MIC-DKFZ/ba…

We define a separate set of data augmentation parameters for the 2D and 3D U-Nets. These parameters are not modified between datasets. If the maximum edge length of the input patch size of a 3D U-Net is more than twice the shortest edge length, applying three-dimensional data augmentation may be suboptimal. For datasets that meet this criterion, we use 2D data augmentation instead and apply it slice by slice to each sample.

The second stage of the U-Net Cascade receives the segmentation from the previous stage as an additional input channel. To prevent strong co-adaptation, we apply random morphological operators (erosion, dilation, opening, closing) and randomly remove connected components of these segmentations.
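As an illustration of this idea (my own sketch for a binary segmentation channel, not the nnU-Net code), one could randomly corrupt the stage-1 segmentation before feeding it to stage 2:

```python
import numpy as np
from scipy import ndimage

def corrupt_segmentation(seg, p_morph=0.4, p_remove=0.2, rng=np.random):
    """Randomly apply a morphological operator and randomly drop connected components."""
    seg = seg.copy().astype(bool)
    if rng.rand() < p_morph:
        op = rng.choice([ndimage.binary_erosion, ndimage.binary_dilation,
                         ndimage.binary_opening, ndimage.binary_closing])
        seg = op(seg)
    labels, n = ndimage.label(seg)          # connected components of the (possibly altered) mask
    for i in range(1, n + 1):
        if rng.rand() < p_remove:
            seg[labels == i] = False        # randomly remove this component
    return seg.astype(np.float32)
```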

2.3.2 Patch Sampling

To improve the stability of network training, we force more than one third of the samples in each batch to contain at least one randomly chosen foreground class.
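A self-contained sketch of this oversampling rule (the function and its defaults are my own illustration):

```python
import numpy as np

def sample_patch_centers(seg, batch_size, rng=np.random):
    """Return one center voxel per patch; at least 1/3 of them are forced to
    lie on a randomly chosen foreground class."""
    n_forced = int(np.ceil(batch_size / 3))
    fg_classes = np.unique(seg)
    fg_classes = fg_classes[fg_classes > 0]
    centers = []
    for i in range(batch_size):
        if i < n_forced and fg_classes.size > 0:
            cls = rng.choice(fg_classes)
            candidates = np.argwhere(seg == cls)                 # voxels of that class
            centers.append(candidates[rng.randint(len(candidates))])
        else:
            centers.append(np.array([rng.randint(s) for s in seg.shape]))
    return centers
```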

2.4 Inference

Due to the patch-based nature of our training, all inference is also patch-based. Since network accuracy decreases towards the borders of a patch, voxels near the center are weighted more heavily than those near the border when predictions across patches are aggregated. Patches are chosen to overlap by patch size / 2, and we further make use of test-time data augmentation (TTA, a trick commonly used in competitions) by mirroring all patches along all valid axes.

For the test cases, we ensemble the five networks obtained from the five-fold cross-validation on the training set to further increase robustness.
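The two inference-time tricks, center weighting and mirroring TTA, can be sketched as follows (my own simplified illustration; nnU-Net's actual importance map and flip combinations differ in detail):

```python
import torch

def center_weight_map(patch_size, sigma_scale=0.125):
    """Weights close to 1 at the patch center, decaying towards the border."""
    coords = torch.meshgrid(*[torch.arange(s, dtype=torch.float32) for s in patch_size],
                            indexing="ij")
    weight = torch.ones(patch_size)
    for c, s in zip(coords, patch_size):
        center, sigma = (s - 1) / 2, sigma_scale * s
        weight = weight * torch.exp(-((c - center) ** 2) / (2 * sigma ** 2))
    return weight

def predict_with_mirroring(network, patch):
    """Average softmax predictions over the un-flipped patch and each single-axis flip."""
    axes = list(range(2, patch.ndim))                # spatial axes of (B, C, ...)
    preds = torch.softmax(network(patch), dim=1)
    for axis in axes:
        flipped = torch.flip(patch, dims=[axis])
        pred = torch.softmax(network(flipped), dim=1)
        preds = preds + torch.flip(pred, dims=[axis])  # flip the prediction back
    return preds / (len(axes) + 1)
```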

2.5 Postprocessing

A connected component analysis of all ground truth segmentation labels is performed on the training data (in the Chinese literature this is usually called connected component analysis). If in all training cases a class lies within a single connected component, this behaviour is interpreted as a general property of the dataset. Accordingly, all but the largest connected component of that class are automatically removed from the predicted images of the corresponding dataset.

Conclusion: through connected component analysis, all connected components except the largest are removed, reducing false-positive regions.
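A minimal sketch of this post-processing rule using SciPy's connected component labelling (my own helper, not the official code):

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(pred, class_id):
    """For a class known to form a single connected component, keep only the
    largest predicted component of that class and discard the rest."""
    mask = pred == class_id
    labels, n = ndimage.label(mask)
    if n <= 1:
        return pred
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))  # voxel count per component
    largest = int(np.argmax(sizes)) + 1
    cleaned = pred.copy()
    cleaned[mask & (labels != largest)] = 0                   # remove the smaller components
    return cleaned
```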

2.6 Ensembling and Submission

To further boost segmentation performance and robustness, we ensemble all possible pairs of the three models for each dataset. For the final submission, the model (or ensemble) that achieves the highest mean foreground Dice score in the training-set cross-validation is selected automatically.
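Conceptually, ensembling a pair of models just means averaging their softmax outputs before taking the argmax; a toy sketch (hypothetical, not the nnU-Net submission code):

```python
import torch

def ensemble_pair(model_a, model_b, image):
    """Average the softmax probabilities of two models and return the label map."""
    probs_a = torch.softmax(model_a(image), dim=1)
    probs_b = torch.softmax(model_b(image), dim=1)
    return ((probs_a + probs_b) / 2).argmax(dim=1)
```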

3 Experiments and Results

As can be seen from Table 2, our phase 1 cross-validation results carry over robustly to the held-out test sets, indicating the desired absence of over-fitting. The only dataset with a performance drop across all foreground classes is BrainTumour. The data of this phase 1 dataset stems from the BRATS challenge [16], where such performance drops between validation and testing are a common phenomenon, attributable to large shifts in the corresponding data and/or ground-truth distributions.

4 Discussion

In this paper, we presented the nnU-Net segmentation framework for the medical domain, which is built directly around the original U-Net architecture [6] and dynamically adapts itself to the specifics of any given dataset. Based on our hypothesis that non-architectural modifications can be much more powerful than some of the recently proposed architectural tweaks, the essence of the framework is a thorough design of adaptive preprocessing, training scheme, and inference.

All design choices required to adapt to a new segmentation task are made fully automatically, without human intervention. For each task, nnU-Net automatically runs a five-fold cross-validation for three different auto-configured U-Net models, and the model (or ensemble) with the highest mean foreground Dice score is chosen for final submission.

In the context of the Medical Segmentation Decathlon, we demonstrate that nnU-Net performs competitively on the held-out test sets of seven highly diverse medical datasets, achieving the highest mean Dice scores for all classes of all tasks (except class 1 of the BrainTumour dataset) on the online leaderboard at the time of manuscript submission.

We acknowledge that training three models and independently picking the best one for each dataset is not the cleanest solution. Given a larger time frame, suitable heuristics could be devised to determine the best model for a given dataset before training. Our current observation is that the trend favours the U-Net Cascade (or the 3D U-Net if the cascade cannot be applied), with the only (close) exceptions being the Prostate and Liver tasks. In addition, the added benefit of several of our design choices, such as using leaky ReLUs instead of regular ReLUs, and of our data augmentation parameters, was not properly validated.


If you want to know how to do connected component analysis, what TTA is, how to implement nnU-Net, or have other questions, follow my WeChat official account and let me know in the comments below. I will keep answering your questions ~~~

Welcome to follow the WeChat official account: Medical Image Artificial Intelligence Practical Camp