This article describes the work and research of Gidi Sheperber on the Greenscreen.AI project, compiled and introduced by Lei Feng AI Technology Review.

Introduction

For the past few years, while machine learning has been in the ascendancy, I have wanted to build a practical, machine-learning-based product myself. A few months ago, after taking the deep learning course offered by Fast.AI, I realized the opportunity had arrived: advances in deep learning now make possible many things that could not be done before, and new tools are being developed that make deployment easier than ever.

In that course I met Alon Burg, an experienced web developer, and we found we were on the same wavelength. With such a product in mind, we set some goals for ourselves:

  1. Improve our deep learning skills

  2. Improve our AI product deployment skills

  3. Develop products with practical value according to market demand

  4. Make a product that is fun to use

  5. Share our experience

Taking the above objectives into consideration, we brainstormed:

  1. Something that hasn't been done yet (or hasn't been done well enough)

  2. The design and implementation should not be too difficult – our plan was to spend 2-3 months on it, working one day a week

  3. The user interface needs to be simple and easy to use – we want people to actually use the product, not just look at a scientific demonstration

  4. Data for training deep learning models is easy to come by — as anyone with experience of machine learning knows, sometimes data is more important than algorithms

  5. Will use cutting-edge deep learning techniques (which have yet to be commercialized by Google, Amazon, etc.), but not too new (so we can find examples online)

  6. The product will have production potential

We started by considering a medical project, since the medical field was close to what we wanted to do and we thought (and still think) there is a lot of opportunity for deep learning there. However, we realized this would violate our principle of easily obtainable data, so we chose background removal as the next best thing.

Background removal is an easy task to do by hand, or semi-manually with tools such as Photoshop and PowerPoint. However, fully automated background removal is quite challenging, and as far as we know there is still no product that does it very well, although some vendors have tried.

So the question becomes: which backgrounds do we need to remove? This is an important question, because the more specific a model is (in terms of objects, angles, and so on), the better the segmentation will be. When we started, we defined the scope quite broadly: a general-purpose background remover that would automatically identify foreground and background in any type of image. But after training our first model, we realized we would get better results by focusing on a specific set of images. So we decided to focus on portraits and selfie-type photos.

Example of background elimination

A selfie has a prominent, concentrated foreground (one or more "people"), which allows a good separation between the object (face and upper body) and the background, along with a relatively stable angle.

With these basic assumptions in place, we embarked on a journey of theoretical research, code implementation, and hours of model training to create a one-click background removal service. The most important part of the work was training the model, but the importance of deploying it correctly should not be underestimated. Moreover, the best segmentation models currently available are not as compact as classification models such as SqueezeNet, so we actively examined both server-side and browser deployment options.

If you want to learn more about the deployment process of our product, I recommend you read these two articles, which describe the project from the server-side and client-side perspectives, respectively.

If you just want to learn about the model and its training process, read on.

Semantic segmentation

When we started to consider the various techniques in deep learning and computer vision, it quickly became obvious that the technique best suited to the background removal task is semantic segmentation.

Of course, other strategies exist, such as separating foreground from background with a depth detector, but they did not seem mature enough for our purposes.

Semantic segmentation is a well-known computer vision task, one of the three major challenges in the field alongside image classification and object detection. Segmentation can be regarded as classifying every pixel of the image into one of the object categories present, so it is really a pixel-wise classification task. Unlike image classification or detection, semantic segmentation truly shows an "understanding" of the image: it does not simply say "there is a cat in this image," it indicates, at the pixel level, where the cat is and how much of the image it occupies.
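
To make the per-pixel classification idea concrete, here is a minimal NumPy sketch (with made-up shapes and random scores, not our model's output) of how per-pixel class scores become a segmentation mask:

    import numpy as np

    # Hypothetical network output: a score for each of 2 classes
    # (background, person) at every pixel of a 224x224 image.
    scores = np.random.rand(224, 224, 2)

    # Per-pixel classification: take the highest-scoring class at each pixel.
    label_map = np.argmax(scores, axis=-1)   # shape (224, 224), values 0 or 1

    # The binary "person" mask is what background removal ultimately needs.
    person_mask = (label_map == 1)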

So how does segmentation work? To better understand this, we’ll review some of the early work in this area.

The original idea was to reuse some of the early classification networks, such as VGG and AlexNet. VGG was the state-of-the-art model for image classification in 2014, and it is still very useful today because of its simple, straightforward architecture. Looking at VGG's higher layers, there is strong activation centered around the object being classified, and deeper layers show even stronger activation; however, these activations are relatively coarse because of the repeated pooling operations. With this basic understanding, it was assumed that a model trained for classification could, with some modifications, also be used to find or segment objects in images.

Early semantic segmentation results emerged alongside classification algorithms in this way. In this article you can see some segmentation results obtained with VGG; they are rough, but they prove the concept.

Results from later layers:

School bus segmentation results, light purple (29) represents the school bus category

After bilinear up-sampling:

These results come merely from converting (and keeping) the fully connected layer into its original shape, so that its spatial features are preserved. In the example shown above, a 768*1024 image is fed into VGG and yields a 24*32*1000 layer, where 24*32 is the pooled version of the image and 1000 is the number of ImageNet categories, from which the segmentation above can be derived.

To achieve smooth predictions, the researchers used a simple bilinear upsampling layer.
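
As a rough illustration of that step (not the paper's code), the following sketch upsamples a coarse 24*32 class map back to the 768*1024 input resolution with bilinear interpolation, here using SciPy:

    import numpy as np
    from scipy.ndimage import zoom

    # Coarse per-class score map, like the 24*32*1000 VGG output described above.
    coarse = np.random.rand(24, 32, 1000)

    # Bilinear (order=1) upsampling back to the 768*1024 input resolution;
    # the class dimension is left unchanged.
    smooth = zoom(coarse, zoom=(768 / 24, 1024 / 32, 1), order=1)
    print(smooth.shape)   # (768, 1024, 1000)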

In the FCN paper, the researchers improved on this idea. They connected several layers together to allow richer interpretation of the features, and named the variants FCN-32, FCN-16, and FCN-8 according to the up-sampling rate.

Adding skip connections between the layers allows the prediction to capture finer details from the original image, and further training improves the results.

The technique is not as bad as one might have thought, and it demonstrated the potential of deep learning for semantic segmentation.

FCN processing results, images from the paper

FCN opened the door to semantic segmentation, and researchers have since tried many different network architectures for the task. The core idea, however, remains the same: use a known classification architecture, up-sampling, and skip connections between layers. These techniques are still common in newer models.

You can learn more about the development of semantic segmentation in these posts:

  • blog.qure.ai/notes/seman…

  • blog.athelas.com/a-brief-his…

  • meetshah1995.github.io/semantic-se…

In addition, you may have noticed that most semantic segmentation methods maintain the encoder-decoder architecture.

Back to the project

After some research, we settled on three candidate models: FCN, Unet, and Tiramisu, the last of which uses a very deep encoder-decoder architecture. We also had some thoughts about Mask-RCNN, but implementing it seemed beyond the scope of this project.

We discarded FCN first because its results were not as good as we had hoped. The other two models gave good results, and the main advantages of Tiramisu and Unet are their compact size and fast computation. Unet was very easy to implement (we used Keras), and Tiramisu was also workable; we started from the Tiramisu implementation used in the last lesson of Jeremy Howard's deep learning course.

With two candidate models in place, we moved on to the next step: training. I must say that when I first tried the Tiramisu model, I was impressed by its results, as it was able to capture sharp edges in images. Unet, on the other hand, did not seem quite good enough, and its results were noticeably weaker.

Data

Now that we had a general direction for the model, we started looking for suitable training data. Data for training segmentation models is not as common as data for classification or detection. In fact, the most commonly used datasets for image segmentation are COCO, which contains about 80,000 images across 90 categories; VOC PASCAL, with 20 categories and 11,000 images; and the recently released ADE20K.

Ultimately we chose to go with the COCO dataset because it contained many images of the “people” category that we were interested in.

Given the task the product is intended to accomplish, we also had to decide whether to use only the part of the dataset most relevant to the task, or a more general set of data. On the one hand, a more general dataset with more images and categories would let the model handle more scenes and more challenging images; on the other hand, an overnight training session let us get through about 150,000 images, so if we used the entire COCO dataset each image would be seen only about twice, on average, per session. A trimmed-down subset would therefore be beneficial for training, and it would also make the model more focused.

It is also worth mentioning that the Tiramisu model was originally trained on the CamVid dataset, which has some flaws; most importantly, its images are very monotonous – all of them are roads and cars. As you can imagine, learning from such a dataset (even if it includes people) was of no use for our purpose, so after a short period of consideration we gave it up and switched to COCO.

A sample from the CamVid dataset

The COCO dataset comes with a very straightforward API that lets us know exactly what objects are contained in each image (according to 90 predefined categories).

After some experimentation, we decided to trim the dataset down. First, we selected only the images containing people, which left 40,000 images. We then dropped images with too many people, keeping only those with one or two, since those best fit our product goal. Finally, we kept only images in which people occupy 20-70% of the image area, and removed images where the person was too small against the background or something strange was going on. In the end, the trimmed dataset contained 11,000 images, which we felt was sufficient at this stage (a rough sketch of this filtering appears after the examples below).

Left: good example, middle: contains too many elements, right: target body is too small
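
For illustration, here is a rough sketch of how such filtering can be done with the pycocotools API; the annotation path, thresholds, and area computation are simplified stand-ins, not our actual script:

    from pycocotools.coco import COCO

    coco = COCO('annotations/instances_train2014.json')   # illustrative path
    person_id = coco.getCatIds(catNms=['person'])[0]

    kept = []
    for img_id in coco.getImgIds(catIds=[person_id]):
        img = coco.loadImgs(img_id)[0]
        ann_ids = coco.getAnnIds(imgIds=img_id, catIds=[person_id], iscrowd=None)
        anns = coco.loadAnns(ann_ids)
        if not 1 <= len(anns) <= 2:                        # keep one or two people only
            continue
        person_ratio = sum(a['area'] for a in anns) / (img['height'] * img['width'])
        if 0.2 <= person_ratio <= 0.7:                     # people cover 20-70% of the image
            kept.append(img_id)

    print(len(kept), 'images kept')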

Tiramisu model

As mentioned above, we learned about the Tiramisu model in Jeremy Howard's course. Although its full name, "100-layer Tiramisu", may suggest a huge model, it is in fact very economical, with only about 9 million parameters, compared with over 130 million for VGG16.

The Tiramisu model is based on DenseNet, a recently proposed image classification model in which all layers are interconnected. In addition, like Unet, Tiramisu adds skip connections to its up-sampling layers.

If you recall, this architecture is in line with the idea behind FCN: using a classification architecture, up-sampling, and skip connections for refinement.

Common architecture for Tiramisu

The DenseNet model can be seen as a natural evolution of ResNet, but instead of "remembering" each layer only until the next one, DenseNet remembers all layers throughout the model. These connections are called highway connections. They cause an inflation in the number of filters, which is controlled by a "growth rate". Tiramisu's growth rate is 16, so with every layer 16 new filters are added, until layers of 1072 filters are reached. You might expect 1600 filters, since this is the "100-layer Tiramisu", but the up-sampling layers drop some filters along the way.

Sketch of the DenseNet model: early filters are stacked throughout the model
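
To make the growth-rate idea concrete, here is a minimal Keras sketch of a dense block with growth rate 16; the real Tiramisu layers also include dropout and transition layers, which are omitted here:

    from keras.layers import Input, Conv2D, Activation, BatchNormalization, Concatenate
    from keras.models import Model

    GROWTH_RATE = 16                      # each layer adds 16 new feature maps

    def dense_block(x, num_layers):
        for _ in range(num_layers):
            y = BatchNormalization()(x)
            y = Activation('relu')(y)
            y = Conv2D(GROWTH_RATE, (3, 3), padding='same')(y)
            x = Concatenate()([x, y])     # every layer sees all previous feature maps
        return x

    inputs = Input(shape=(224, 224, 3))
    x = Conv2D(48, (3, 3), padding='same')(inputs)   # initial convolution
    x = dense_block(x, num_layers=4)                 # 48 + 4 * 16 = 112 filters out
    Model(inputs, x).summary()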

Training

We trained the model as described in the original paper: standard cross-entropy loss, an RMSProp optimizer with a 1e-3 learning rate, and a small decay. The trimmed set of 11,000 images was split into 70% training, 20% validation, and 10% test. All images shown below come from the test set.

To keep our training schedule consistent with the original paper, we set the epoch size at 500 images. This also allowed us to save the model periodically with every improvement in the results, since we trained on much more data than the paper did (the CamVid dataset used in the original paper contains fewer than 1,000 images).

In addition, we trained on only two categories, background and person, while the paper used 12 categories. We tried training with more of COCO's categories at first, but found that it did not help much.
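
A minimal sketch of this training setup in Keras is shown below; the model and the train/validation generators are assumed to already exist, and the constants are illustrative rather than our exact values:

    from keras.optimizers import RMSprop

    BATCH_SIZE = 4            # illustrative
    EPOCH_IMAGES = 500        # an "epoch" is defined as 500 images, as above

    # Standard cross entropy, RMSprop with a 1e-3 learning rate and a small decay.
    model.compile(optimizer=RMSprop(lr=1e-3, decay=1e-5),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # train_gen / val_gen are assumed to yield (image, per-pixel label) batches
    # from the 70% / 20% splits of the 11,000 trimmed COCO images.
    model.fit_generator(train_gen,
                        steps_per_epoch=EPOCH_IMAGES // BATCH_SIZE,
                        epochs=300,
                        validation_data=val_gen,
                        validation_steps=50)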

Data problems

Some flaws in the dataset hurt the model's performance:

  • Animals – our model sometimes segments animals, which naturally results in a lower IoU. Adding animals (or other objects) to the main category of our task would make the results worse.

  • Body parts – because we filtered the dataset programmatically, we had no way to tell whether an image labeled as containing a person actually shows a whole person or just some body part, such as a hand or a foot. Such images were outside our target scope, but they ended up in the dataset anyway.

Animals, body parts, hand-held objects

  • Hand-held objects – many images in the dataset are sports-related, and the people in them are often accompanied by baseball bats, tennis rackets, skis, and so on. Our model gets somewhat confused about how to separate these objects from the people holding them. As with the animals, adding them to the main category or treating them as a separate category would help improve the model's performance.

  • Coarse ground truth – the COCO images are not labeled pixel by pixel but with polygons. Sometimes this level of annotation is sufficient, but at other times the ground truth is too coarse for the model to learn from.

Original images and their coarse ground truth

Results

Our results were satisfying, though not perfect: our IoU on the test set reached 84.6, while the current state of the art is 85. This figure is tricky, though, because IoU fluctuates across datasets and categories. Some categories are inherently easier to segment, such as houses or roads, where many models easily reach an IoU of 90; for more challenging categories, such as trees and people, most models only reach around 60 IoU. To gauge this difficulty, remember that we helped our network focus on a single category and a limited type of photos.

While we still do not feel the results are production-ready, it is a good time to pause and discuss them, since about 50 percent of the photos give good results.

Here are some good examples.

From left to right: image, ground truth, output (from the test set)

Debugging and logging

Debugging is an important part of training neural networks. It is tempting to just get the data into the network, start training, and see what comes out, but we found that it is extremely important to track every step of training, so we built our own tools for checking the results at each step.

Here are some of the common challenges and what we did about them:

  1. Early problems – the model may not even train. This can be due to some inherent problem, or to some kind of preprocessing error, such as forgetting to normalize a chunk of the data. In any case, visualizing the results helps a lot. Here is a post on the subject.

  2. Debugging the network itself – once there are no critical problems, training starts with predefined losses and metrics. In semantic segmentation, the main measurement is Intersection over Union (IoU). It took us a while to start using IoU, rather than cross-entropy loss, as the primary indicator of model training. Another useful practice is to show some of the model's predictions at every epoch. This post is a good introduction to tuning machine learning models. Note also that IoU is not a standard metric or loss in Keras, but implementations are easy to find online, such as this one; we also used this gist to plot the loss and some predictions at every epoch. A sketch of such an IoU metric appears after this list.

  3. Versioning in machine learning – a model usually has many parameters, some of which are hard to keep track of. I must say we have not found a perfect solution, other than writing out our configurations frequently (and automatically saving the best models with a Keras callback, see below).

  4. Debugging tools – after doing all of the above we were able to examine our work at every step, but still not seamlessly. Therefore, the most important step was to combine all of the above into a Jupyter notebook that let us seamlessly load any model and any image and quickly inspect the results. This made it easy to spot differences between models, flaws in the models, and other problems.
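
For reference, here is one common Keras-backend formulation of a soft IoU metric and the corresponding loss, similar in spirit to (but not necessarily identical with) the implementation we found online:

    from keras import backend as K

    def IOU_calc(y_true, y_pred, smooth=1.0):
        # "Soft" intersection over union computed on predicted probabilities;
        # the smooth term avoids division by zero on empty masks.
        y_true_f = K.flatten(y_true)
        y_pred_f = K.flatten(y_pred)
        intersection = K.sum(y_true_f * y_pred_f)
        union = K.sum(y_true_f) + K.sum(y_pred_f) - intersection
        return (intersection + smooth) / (union + smooth)

    def IOU_calc_loss(y_true, y_pred):
        # Usable as a loss, or simply monitored (e.g. 'val_IOU_calc_loss' below).
        return 1 - IOU_calc(y_true, y_pred)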

Below are examples of improvements to our model, along with parameter adjustments and additional training.

Save the model with the best IoU score on the validation set (Keras provides a nice callback function that makes this very easy):

callbacks = [keras.callbacks.ModelCheckpoint(hist_model, verbose=1, save_best_only=True, monitor='val_IOU_calc_loss'), plot_losses]

In addition to the usual debugging of code errors, we also noticed that model errors were "predictable". Examples include "cutting off" body parts that extend beyond the normal torso range, unnecessarily extending the torso, poor lighting, poor photo quality, and too much detail in the photo. Some of these were handled by adding specific images to the training data, but others remain challenges to be addressed. To improve the results in the next version, we will use augmentation specifically targeted at such "hard" images.

We’ve talked about data sets before. Now let’s look at some of the challenges the model faces:

  1. Clothes – very bright or very dark clothing is sometimes interpreted as background

  2. "Holes" – some otherwise good results looked as if holes had been dug out of them

Clothes and "holes"

  3. Lighting – poor lighting and blurry images are common in photos, but not in the COCO dataset. As a result, beyond the inherent difficulty such images pose for models like ours, our model was simply never prepared for them. This can be addressed by obtaining more data, or by adding data augmentation (a sketch of one option follows the example below).

Example of insufficient light
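
One plausible way to add such augmentation is Keras' ImageDataGenerator, with photometric jitter applied only to the images; the arrays and parameter values below are illustrative assumptions, not our configuration:

    import numpy as np
    from keras.preprocessing.image import ImageDataGenerator

    # X: (N, 224, 224, 3) images, Y: (N, 224, 224, 1) masks, assumed already loaded.
    X = np.random.rand(16, 224, 224, 3)
    Y = np.random.randint(0, 2, (16, 224, 224, 1)).astype('float32')

    # Brightness and channel shifts simulate poor lighting; geometric options are
    # mirrored on the masks with the same seed so images and labels stay aligned.
    image_datagen = ImageDataGenerator(brightness_range=(0.4, 1.2),
                                       channel_shift_range=30.0,
                                       horizontal_flip=True,
                                       zoom_range=0.1)
    mask_datagen = ImageDataGenerator(horizontal_flip=True,
                                      zoom_range=0.1)

    seed = 1
    image_gen = image_datagen.flow(X, batch_size=4, seed=seed)
    mask_gen = mask_datagen.flow(Y, batch_size=4, seed=seed)
    train_gen = zip(image_gen, mask_gen)   # yields (augmented image, augmented mask) pairs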

In the future

Further training

Our model was trained for over 300 epochs, after which it began to overfit. Since our results were already so close to the published score, we did not get the chance to apply the basic practice of data augmentation.

The input images used for training were uniformly resized to 224*224. Training the model with more data and with larger images (COCO images are originally around 600*1000) would also be expected to improve the final results.

CRF and other enhancements

At some stages, we noticed that our results had noisy edges. A CRF (conditional random field) model can be used to smooth this out; in this blog post, the author shows a simple example of using a CRF.

However, it was not very helpful for our work, perhaps because it only helps when the results are coarser.

Matting

Even with the results we have achieved so far, the segmentation is still not perfect. Delicate objects such as hair, fine clothing, and branches can never be segmented perfectly by this approach. In fact, this very fine-grained segmentation task is called matting, and it defines a different kind of challenge. Here is an example of state-of-the-art matting work presented at the NVIDIA conference earlier this year.

A matting example; the input includes a trimap

The matting task is different from other image-related tasks because its input includes not only the original image but also a trimap: an outline of the image's edges that marks each region as definite foreground, definite background, or unknown. This makes matting a "semi-supervised" problem.

We experimented a little with matting, using our segmentation output as the trimap, but did not get meaningful results.
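
For completeness, this is roughly how a trimap can be derived from a binary segmentation mask (a sketch using OpenCV; the band width is arbitrary). The gray "unknown" band around the predicted edge is what a matting algorithm is then asked to resolve:

    import cv2
    import numpy as np

    def mask_to_trimap(mask, band_width=10):
        """mask: uint8 binary mask (0 = background, 255 = person)."""
        kernel = np.ones((band_width, band_width), np.uint8)
        sure_fg = cv2.erode(mask, kernel)              # definitely foreground
        possible_fg = cv2.dilate(mask, kernel)         # foreground plus an edge band
        trimap = np.full(mask.shape, 128, np.uint8)    # 128 marks the "unknown" band
        trimap[possible_fg == 0] = 0                   # definitely background
        trimap[sure_fg == 255] = 255                   # definitely foreground
        return trimap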

Another problem was the lack of suitable training datasets for matting.

Conclusion

As stated at the beginning, our goal was to develop a meaningful deep learning product. As you can see in Alon's posts, deployment is relatively easy and fast. Training the model, on the other hand, is the tricky part, especially when it requires overnight training runs, which demand careful planning, debugging, and recording of results.

It turns out that it is not easy to balance research and trying new things against training and improving the model. Because of the nature of deep learning, we always felt that the best model, or the model best suited to us, was just around the corner, and that one more Google search or one more paper would lead us to it. But in practice, our actual improvements came from "squeezing" more and more out of our original model. And, as mentioned above, we still feel there is more to squeeze.

All in all, we had a lot of fun doing this work; a few months ago it would have sounded like science fiction to us. We are happy to discuss and answer any questions, and we look forward to seeing you on our website.

Via "Background removal with deep learning", Lei Feng Net AI Technology Review
