This article was first published on: Walker AI

With the rise of anime and 2D (ACG) culture, virtual idols are becoming more and more popular.

Virtual idol technology mainly covers singing voice synthesis and dance generation, that is, making the idol sing and dance.

This article introduces Dance Revolution: Long Sequence Dance Generation with Music via Curriculum Learning (ICLR 2021), a paper co-authored by Fudan University, Microsoft, Meituan and Rinna AI.

1. Dance generation

Dance generation refers to taking a music sequence (usually audio features) as input and producing a meaningful motion sequence of the same duration, that is, generating a dance for the music. Below is a brief introduction to audio features and motion sequences.

1.1 Audio Features

An audio file consists of a large number of sample points: one second of audio can contain tens of thousands of samples, which makes model training very difficult. In practice, raw audio is generally not used directly as the model's input or output; instead, features extracted from the audio are used.

Common features of audio are:

MFCC

MFCC delta

constant-Q chromagram

tempogram

In practice, these audio features do not need to be understood in depth here; the key point is that feature extraction shortens the audio time series by roughly a hundredfold.
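
To make the feature step concrete, here is a minimal extraction sketch using librosa. The feature set follows the list above; the hop length, number of MFCCs, and file name are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import librosa

y, sr = librosa.load("music_clip.wav")            # hypothetical input file
hop = 512                                         # ~23 ms per frame at 22.05 kHz

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop)
mfcc_delta = librosa.feature.delta(mfcc)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop)
tempogram = librosa.feature.tempogram(y=y, sr=sr, hop_length=hop)

# One feature vector per frame: tens of thousands of samples per second
# collapse to roughly 43 frames per second.
features = np.concatenate([mfcc, mfcc_delta, chroma, tempogram], axis=0).T
print(features.shape)                             # (n_frames, feature_dim)
```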

1.2 Action Sequence

Figure 1. Action sequence extraction diagram

A motion sequence is time-series data produced by pose estimation: the data at each time step consists of body key points that represent a human pose. As shown in Figure 1, the solid-colored points are the key points of the current pose, and the lines connecting them capture the body movement of the character. (See github.com/CMU-Percept…

A motion sequence captures only the character's body movement, excluding interfering factors such as the character's appearance and the background; it isolates the dance movement itself, which is why motion sequences are used to represent dances.

Dance is a strongly rhythmic kind of movement. In addition to the common features above, the authors also use a one-hot encoding of the drum beats as an audio feature for model training.

1.3 Dance generation problem definition

Having covered audio features and motion sequences, we can now define the dance generation problem.

Given a music-dance dataset D consisting of paired sequences of music and dance movements, where music clips and dance-motion clips correspond one to one, let X be a music clip and Y the corresponding dance-motion clip. The goal is to train a model g(·) on D such that g(X) = Y.
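
Written out compactly (with N paired clips of length T, a notation assumed here rather than copied from the paper), the problem reads:

```latex
D = \{(X_i, Y_i)\}_{i=1}^{N}, \qquad
X_i = (x_{i,1}, \dots, x_{i,T}), \quad
Y_i = (y_{i,1}, \dots, y_{i,T}), \qquad
\text{learn } g \text{ such that } g(X_i) \approx Y_i .
```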

This input-output sequence problem is a sequence-to-sequence (seq2seq) problem. Compared with machine translation, however, a piece of music can be matched with many dances; there is no unique answer, and the music acts more like a style condition. The authors start from the seq2seq formulation, and their optimization ideas also come from seq2seq.

1.4 Background and motivation

By surveying current methods, the authors find two main approaches to dance generation:

The first is stitching (splicing existing dance movement clips together).

The second is autoregressive models (such as LSTMs).

The first approach lacks naturalness, while the second can only generate short motion sequences because errors accumulate in the autoregressive model. The authors therefore propose two techniques to address these problems:

Curriculum learning

Local attention

These two approaches are described in the model structure.

2. Model structure

Figure 2. Structure diagram of dance generation model

As shown in Figure 2, the model takes audio features as input and produces a motion sequence through an encoder and a decoder.

The encoder and decoder are introduced below.

2.1 Encoder

The encoder follows a standard seq2seq design, using a Transformer-like architecture consisting of N Transformer blocks, where each block is composed of multi-head self-attention and a feed-forward network.

Local attention

Audio is a temporal signal, and its sequences are much longer than text: a sentence has at most a few hundred words, whereas an audio clip can have millions of sample points and is still thousands of steps long even after feature extraction. The computational complexity of the Transformer's multi-head self-attention is quadratic in the sequence length, so longer sequences require far more resources. Audio, however, is short-term invariant: a meaningful clip can be cut at an arbitrary position and the resulting fragment is still meaningful. The authors therefore replace fully connected self-attention with local self-attention. As shown in the upper-left corner of Figure 2, a window of size k is set for each time step's attention, so the model only attends to the k/2 steps before and after it.

Figure 3. Local attention formula

Figure 3 shows the implementation of local multi-head self-attention. Compared with standard multi-head self-attention, the local version only restricts the range of the summation, and in practice this just requires multiplying by a mask matrix.
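
As a rough illustration, here is a single-head sketch in PyTorch of masking attention to a local band of width k; the paper's implementation is multi-head and differs in detail.

```python
import torch

def local_attention_mask(seq_len, window):
    # True where query i may attend key j, i.e. |i - j| <= window // 2
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

def local_self_attention(q, k, v, window):
    # q, k, v: (batch, seq_len, d); single-head scaled dot-product attention
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (batch, L, L)
    mask = local_attention_mask(q.size(1), window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))     # weights outside the band become zero
    return torch.softmax(scores, dim=-1) @ v
```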

2.2 Decoder

Figure 4. Decoder formula

Figure 4 shows the decoder. The decoder backbone is an ordinary RNN: the previous hidden state h_{i-1} and the previous output pose y_{i-1} are fed into the RNN cell to obtain the current hidden state h_i, which is then concatenated with z_i, the encoder output at the current time step, and passed through a linear layer to produce the final output y_i. Unlike natural-language problems such as machine translation, dance generation does not sample at each time step during inference.
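
A minimal sketch of one decoder step is shown below; the dimensions and the choice of an LSTM cell are assumptions for illustration, and the paper's decoder may differ in detail.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    # Hypothetical dimensions for illustration only.
    def __init__(self, pose_dim=34, hidden_dim=512, enc_dim=512):
        super().__init__()
        self.cell = nn.LSTMCell(pose_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim + enc_dim, pose_dim)

    def forward(self, y_prev, state_prev, z_i):
        # y_prev: previous output pose (batch, pose_dim)
        # state_prev: previous LSTM state (h_{i-1}, c_{i-1})
        # z_i: encoder output at the current time step (batch, enc_dim)
        h_i, c_i = self.cell(y_prev, state_prev)
        y_i = self.out(torch.cat([h_i, z_i], dim=-1))   # deterministic output, no sampling
        return y_i, (h_i, c_i)
```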

Curriculum learning

Curriculum learning holds that a model, like a person, should learn from easy to hard. For classification, the curriculum learns simple samples first and difficult samples later; for seq2seq problems, it starts with short sequences and with predicting only one step ahead from the ground truth (teacher forcing). As training progresses, the number of steps the model must predict on its own is gradually increased, for example the next two time steps after the current one. The authors adapt curriculum learning to the dance generation problem; the details are not expanded here.
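
As a hypothetical illustration of such a schedule (the numbers and the exact growth rule here are made up; the paper defines its own curriculum), the number of steps the decoder must predict from its own outputs could grow like this:

```python
def autoregressive_steps(epoch, start_epoch=10, grow_every=5, max_steps=100):
    """Return how many time steps the decoder feeds back its own predictions
    at this epoch; before start_epoch it is pure teacher forcing (0 steps)."""
    if epoch < start_epoch:
        return 0
    return min((epoch - start_epoch) // grow_every + 1, max_steps)

# e.g. epochs 0-9 -> 0 (teacher forcing), epoch 10 -> 1, epoch 15 -> 2, ...
```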

3. Evaluation criteria

The model proposed in the paper is relatively simple; its highlight lies in the evaluation criteria.

Dance generation is similar to speech synthesis: commonly used evaluation methods fall into two categories, objective and subjective.

Objective evaluation assesses, from the test data, the realism of the generated dances, their style consistency, and how well they match the music.

3.1 Frechet Inception Distance (FID)

Figure 5. FID calculation formula

Figure 5 shows the calculation formula for Frechet Inception Distance (FID), which measures the similarity of two distributions. The idea is that two Gaussian distributions are more similar the closer their means and covariances are. In the formula, μ is the mean, Σ is the covariance, and Tr denotes the trace, the sum of the diagonal elements of a matrix.

In dance generation, FID measures how similar the generated dances are to real dances overall, that is, the realism of the dance.
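
For reference, the standard Fréchet distance between two Gaussians fitted to real and generated features can be computed as below; this is a generic sketch, not the paper's exact evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    # real_feats, gen_feats: (n_samples, feat_dim) feature matrices
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(cov_mean):      # sqrtm can return tiny imaginary noise
        cov_mean = cov_mean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * cov_mean))
```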

3.2 ACC

ACC assesses how consistent the generated dance is with the musical style. An MLP (multi-layer perceptron) is trained to classify real dances by genre (e.g., ballet, hip hop, pop, with categories derived from the music categories); the MLP is then used to classify the generated dances, and their consistency with the music category is measured as accuracy.
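
A toy sketch of this procedure with scikit-learn and dummy data (the feature representation and genre labels are placeholders, not the paper's setup):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Placeholder features: each dance flattened to a fixed-length vector, 3 genres.
real_feats, real_genres = rng.normal(size=(300, 120)), rng.integers(0, 3, 300)
gen_feats, music_genres = rng.normal(size=(100, 120)), rng.integers(0, 3, 100)

clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=300)
clf.fit(real_feats, real_genres)            # train the genre classifier on real dances
acc = clf.score(gen_feats, music_genres)    # ACC: do generated dances match the music's genre?
print(acc)
```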

3.3 Beat Coverage

Beat Coverage is the ratio of the total number of kinematic (dance-movement) beats to the total number of music beats. The higher the beat coverage, the stronger the rhythm of the dance.

3.4 Beat Hit Rate

Beat Hit Rate is the ratio of the number of kinematic beats that coincide with a music beat to the total number of kinematic beats. The higher the hit rate, the better the dance movements fit the music.
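
A hedged sketch of both beat metrics is given below; kinematic beats are approximated as local minima of average joint speed, a common heuristic, and the original papers' exact beat definitions may differ.

```python
import numpy as np
import librosa

def beat_metrics(motion, music_path, fps=15, tol_frames=1):
    # motion: (n_frames, n_joints, 2) pose key points at the given fps
    y, sr = librosa.load(music_path)
    _, beat_times = librosa.beat.beat_track(y=y, sr=sr, units="time")
    music_beats = beat_times * fps                      # music beats in motion-frame units

    speed = np.linalg.norm(np.diff(motion, axis=0), axis=-1).mean(axis=-1)
    kin_beats = [t for t in range(1, len(speed) - 1)
                 if speed[t] < speed[t - 1] and speed[t] < speed[t + 1]]

    coverage = len(kin_beats) / max(len(music_beats), 1)            # Beat Coverage
    hits = sum(any(abs(t - mb) <= tol_frames for mb in music_beats)
               for t in kin_beats)
    hit_rate = hits / max(len(kin_beats), 1)                        # Beat Hit Rate
    return coverage, hit_rate
```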

3.5 Diversity

Diversity: generate dances for the different pieces of music in the test set and evaluate the overall diversity across these dances.

3.6 Multimodality

Multimodality: generate multiple dances from a single piece of music and evaluate the overall diversity of these dances.
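
One common way to turn both of these into numbers (an assumption here, not necessarily the exact metric used in the papers) is the average pairwise distance between feature vectors of the generated dances:

```python
import numpy as np

def average_pairwise_distance(feats):
    # feats: (n_dances, feat_dim) feature vectors of generated dances.
    # Across the test set this measures Diversity; across multiple dances
    # generated from one piece of music it measures Multimodality.
    n = len(feats)
    dists = [np.linalg.norm(feats[i] - feats[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists)) if dists else 0.0
```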

Many of the evaluation criteria above come from Dancing to Music (NeurIPS 2019), which is recommended reading for a fuller understanding of the dance generation field.

3.7 Subjective evaluation criteria

The criteria described in 3.1 to 3.6 are objective. Many aspects of dance quality are hard to quantify and require manual, that is, subjective, assessment. For subjective evaluation, a number of professional dancers are recruited and shown the dances generated by the proposed model, the dances generated by comparison models, and real dances; they then score the following three questions.

  1. Realism: regardless of the music, which dance looks more realistic?
  2. Smoothness: regardless of the music, which dance is smoother?
  3. Matching: in terms of style, which dance matches the music better?

The scores for each of the three criteria are averaged to obtain the final evaluation result.

4. Summary

This paper proposes a SOTA dance generation model whose generated dances are closer to human dances than those of previous work. The authors provide an example video: www.youtube.com/watch?v=lmE…

From this video, we can also see the shortcomings of the current dance generation:

  1. The generated dance can violate human physiology: the video contains many twisted-arm poses that a normal person cannot perform.
  2. The frame rate is too low: the model in this paper generates at 15 FPS, far below a normal video frame rate.
  3. There is no real-time generation: model inference is too slow to meet real-time requirements, and most models adopt a seq2seq architecture that encodes the whole piece of music first and then decodes step by step, without considering real-time use.
  4. The overall quality still needs to improve: even an amateur can spot many dissonant body movements in the paper's sample video.

Dance generation has been booming since last year, and the quality of generated dances keeps improving. Recently, Google released its music-conditioned 3D dance generation work with the AIST++ dataset, and Huiye Technology released DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer. Dance generation, as a key technology for realizing virtual idols, has received a great deal of attention.

From the current research on dance generation, the author also identifies several hot directions:

  1. Better datasets
  2. 3D dance generation
  3. Transformer-based model structures
  4. Higher frame rates

Dance generation is part of Walker AI's virtual idol research. The author and fellow researchers are continually exploring new methods and evaluation criteria; we welcome exchange and cooperation with peers, and you are welcome to join us.


PS: For more technical content, follow the official account [xingzhe_ai] and discuss with Walker AI!