How close are we to solving 2D&3D face alignment?

The authors are Adrian Bulat and Georgios Tzimiropoulos of the University of Nottingham.

Abstract

This paper investigates how close a very deep neural network comes to saturating performance on existing 2D and 3D face alignment datasets. To this end, we make the following five contributions: (a) By combining a state-of-the-art landmark localization architecture with a state-of-the-art residual block, we construct, for the first time, a very strong baseline, train it on a very large 2D facial landmark dataset, and evaluate it on all other 2D facial landmark datasets. (b) We create a method that converts 2D landmark annotations into 3D and unifies all existing datasets, leading to LS3D-W, the largest and most challenging 3D facial landmark dataset to date (about 230,000 images). (c) We then train a neural network for 3D face alignment and evaluate it on the new LS3D-W dataset. (d) We further investigate all the “traditional” factors that affect face alignment performance, such as large pose, initialization, and resolution, and introduce a “new” factor, namely network size. (e) Our research shows that both the 2D and the 3D face alignment networks achieve very high performance, which is likely close to saturating the datasets used. The training and testing code, as well as the dataset, can be downloaded from https://www.adrianbulat.com/face-alignment/.

Paper: https://arxiv.org/pdf/1703.07332.pdf

GitHub: https://github.com/1adrianb/face-alignment

Model downloads:

2D-FAN: https://www.adrianbulat.com/downloads/facealignment/2d-fan-300w.t7

3D-FAN: https://www.adrianbulat.com/downloads/facealignment/3d-fan.t7

2D-to-3D-FAN: https://www.adrianbulat.com/downloads/FaceAlignment/2D-to-3D-FAN.tar.gz

3D-FAN-depth: https://www.adrianbulat.com/downloads/facealignment/3d-fan-depth

Face alignment is one of the most studied topics in computer vision over the past few decades

With the advent of deep learning and the development of large-scale annotated datasets, recent work has shown unprecedented accuracy on even the most challenging computer vision tasks. In this work, the authors focus on landmark localization, in particular facial landmark localization, also known as face alignment, which is arguably one of the most studied topics in computer vision over the past few decades.

Recent work on landmark localization using convolutional neural networks (CNNs) has pushed the boundaries in other areas, such as human pose estimation, but it is not yet clear what has been achieved for face alignment.

Historically, different techniques have been used for landmark localization depending on the task. For example, before the advent of neural networks, human pose estimation relied largely on pictorial structures and various sophisticated extensions, because they could model large changes in appearance and accommodate a wide range of human poses. These methods, however, were never shown to reach the high accuracy of the cascaded regression methods used for face alignment. Cascaded regression, in turn, deteriorates when the initialization is inaccurate, when many landmarks are self-occluded, or when there is large in-plane rotation.

More recently, fully convolutional neural network architectures based on heatmap regression have revolutionized human pose estimation, achieving very high accuracy even on the most challenging datasets. Because they are trained end to end and require little manual engineering, such methods can be readily applied to the face alignment problem.
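To make the idea of heatmap regression concrete, here is a minimal sketch in plain Python. It assumes one heatmap per landmark, with the training target rendered as a 2D Gaussian and the prediction decoded as the location of the peak. The function names, the 64x64 grid, and the sigma value are illustrative assumptions and are not taken from the paper.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Render a 2D Gaussian centred on landmark (cx, cy) as a list of rows."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_heatmap(heatmap):
    """Return the (x, y) position of the strongest response in the heatmap."""
    best_value, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy

# Hypothetical training target for one landmark at x=20, y=12.
target = gaussian_heatmap(64, 64, cx=20, cy=12, sigma=1.5)
print(decode_heatmap(target))  # (20, 12)
```

A network trained this way regresses one such map per landmark, so the landmark coordinates are read off the output maps rather than predicted directly as numbers.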

Building a strong baseline for the first time, and using 2D-to-3D conversion to build the largest dataset to date

Following this path, the authors say their main contribution is to build and train such a robust face alignment network, and to investigate for the first time how close it comes to saturating performance on all existing 2D face alignment datasets and on a newly introduced large-scale 3D dataset.

More specifically, their contributions are:

1. For the first time, we build a very strong baseline by combining a state-of-the-art landmark localization architecture with a state-of-the-art residual block, and train it on a very large, synthetically expanded 2D facial landmark dataset. We then evaluate it on all the other 2D datasets (about 230,000 images) to investigate how close we are to solving the 2D face alignment problem.

2. To address the scarcity of 3D face alignment datasets, we further propose a CNN, guided by 2D landmarks, that converts 2D annotations into 3D annotations, and use it to create LS3D-W, the largest and most challenging 3D facial landmark dataset to date (about 230,000 images). This is achieved by unifying almost all existing datasets.

3. We then train a 3D face alignment network and evaluate it on a new large 3D face feature point dataset to investigate how close we are to solving the 3D face alignment problem.

4. We further investigate all the “traditional” factors that affect face alignment performance, such as large pose, initialization and resolution, and introduce a “new” factor, namely network size.

5. Our results show that both the 2D and the 3D face alignment networks achieve very high accuracy, which may be close to saturating the datasets used.

2D-FAN structure: the Face Alignment Network (FAN) is constructed by stacking four Hourglass (HG) networks, in which all bottleneck blocks (the rectangular blocks in the figure) are replaced with a new hierarchical, parallel, multi-scale block.

Methods and data: 2D and 3D alignment and 2D-to-3D conversion all approach saturation performance

The authors first construct a Face Alignment Network (FAN) and then, based on FAN, construct 2D-to-3D-FAN, which converts the 2D facial landmarks of a given image into 3D. The authors say that, to their knowledge, this is the first time a network as powerful as FAN has been trained and evaluated in large-scale 2D/3D face alignment experiments.

They built FAN on Hourglass (HG), one of the most advanced architectures for human pose estimation, and replaced HG’s bottleneck block with a hierarchical, parallel, multi-scale block proposed in earlier work.

2D-to-3D-FAN network architecture: based on the Hourglass human pose estimation architecture. The input is an RGB image together with its 2D facial landmarks, and the output is the corresponding 3D facial landmarks.
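As a rough illustration of how an image and its 2D landmarks can be fed to one network together, the sketch below stacks the three RGB planes with one heatmap plane per landmark. This encoding is an assumption made for illustration; the caption above only states that both are inputs. The names `landmark_plane` and `build_input`, the 8x8 toy image, and the three landmarks are all hypothetical.

```python
import math

def landmark_plane(size, cx, cy, sigma=1.0):
    """One size x size plane holding a Gaussian bump at landmark (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]

def build_input(rgb_planes, landmarks_2d, sigma=1.0):
    """Stack the 3 RGB planes with one heatmap plane per 2D landmark."""
    size = len(rgb_planes[0])
    heatmaps = [landmark_plane(size, x, y, sigma) for x, y in landmarks_2d]
    return rgb_planes + heatmaps

# Toy 8x8 black image with three 2D landmarks (a real model uses many more).
rgb = [[[0.0] * 8 for _ in range(8)] for _ in range(3)]
stacked = build_input(rgb, [(2, 3), (5, 3), (4, 6)])
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # 6 8 8
```

With this layout the network sees the landmarks as extra image channels, so the same convolutional architecture can consume both kinds of input.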

2D-FAN landmark localization results

3D-FAN landmark localization results

Here is a comparison with existing methods (red), which shows the accuracy of the new method more clearly:

In addition to building FAN, the authors aimed to create the first very large 3D facial landmark dataset. 3D facial landmark data are scarce, which is what makes this work useful. Given the excellent results of 2D-FAN, the authors decided to use 2D-to-3D-FAN to generate a 3D facial landmark dataset.

However, this raises a problem: the 2D-to-3D conversion is difficult to evaluate. The largest dataset of this kind available is AFLW2000-3D. The authors therefore first use 2D-FAN to generate 2D facial landmarks, then use 2D-to-3D-FAN to convert them into 3D facial landmarks, and finally compare the generated 3D landmarks with the AFLW2000-3D annotations.
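A comparison of this kind needs a numeric measure of disagreement between two sets of landmarks. The sketch below uses a normalized mean error: the average point-to-point distance divided by the square root of the face bounding-box area. The choice of this metric and normalization is an assumption, as the text above does not state one, and the function name and sample coordinates are hypothetical.

```python
import math

def normalized_mean_error(predicted, reference, bbox_width, bbox_height):
    """Mean point-to-point distance, normalised by sqrt(bbox area)."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("landmark lists must be non-empty and equal length")
    norm = math.sqrt(bbox_width * bbox_height)
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / (len(predicted) * norm)

# Hypothetical landmarks: generated annotations vs. reference annotations.
generated = [(10.0, 10.0), (20.0, 10.0), (15.0, 20.0)]
reference = [(10.0, 13.0), (24.0, 10.0), (15.0, 20.0)]
error = normalized_mean_error(generated, reference, 100.0, 100.0)
print(f"{error:.4f}")  # 0.0233
```

Normalizing by the bounding box makes errors comparable across faces of different sizes, which matters when ranking images by how much two annotations disagree.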

The comparison revealed clear discrepancies between the two sets of annotations. The following figure shows the 8 images with the largest discrepancies (white is the paper's result):

The main reason for the discrepancies, the authors say, is that the previous approach did not produce accurate results for some difficult poses. After refining the data, they therefore merged AFLW2000-3D with existing datasets to create LS3D-W (Large Scale 3D Faces in-the-Wild), which contains about 230,000 annotated images and is the largest 3D face alignment dataset to date.

The authors then evaluated performance on the LS3D-W dataset from several angles. The results show that their network is close to the “saturation performance” of the dataset and is highly resilient to pose, resolution, initialization, and even the number of network parameters. See the paper for more information.

The authors say that although they have not yet explored the effect of some rare poses in these datasets, they are confident that, given enough data, the network would perform just as well.
