Siraj's video source code

Today I wanted to see how AI composes music.

This article will use TensorFlow to write a music generator.

What happens when you say to a robot: I want a song that expresses hope and wonder?

The computer first converts your speech into text and extracts the keywords as word vectors.

It then uses a dataset of music tagged with labels describing human emotions. By training a model on this data, it can generate music that matches the requested keywords.

The final output of the program is a set of chords: it picks and outputs the chords that most closely match the emotional keywords.

Of course, you can do more than just listen to it; you can also use it as a reference for composing, making it easier to write music even if you haven’t deliberately practiced for 10,000 hours.

Machine learning is really about expanding our brains and expanding our capabilities.


DeepMind published a paper called WaveNet, which covers both music generation and text-to-speech.

Generally speaking, speech generation models are concatenative: to generate speech from a sample of text, you need a very large database of speech fragments, and you piece parts of them together to form complete sentences.

The same goes for generating music, but there is one big difficulty: when you stitch static fragments together, the resulting sound still has to be natural and expressive, which is very hard.

Ideally, we could store all the information needed to generate the audio in the parameters of the model, and that is what the paper does.

Instead of passing the output through a signal-processing algorithm to obtain the speech signal, the model works directly on the raw waveform of the speech signal.

The model they used is a CNN. Its hidden layers use dilation factors that grow exponentially from layer to layer, so the connections span ever-larger stretches of the input. The sample generated at each step is fed back into the network and used to generate the next one.

We can look at the diagram of this model. The input is a raw sound wave, which first needs to be preprocessed to make the following operations easier.

It is then encoded into a tensor with sample and channel dimensions.

This is fed into the first layer of the CNN, which expands the number of channels for easier processing.

All of the outputs are then combined, and the dimensionality is expanded back up to the number of output channels.

This result is fed into the loss function to measure how well the model is trained.

Finally, the result is fed back into the network to generate the audio data for the next point in time.

Repeat this process to generate more speech.
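
To make the idea of exponentially growing dilation factors a bit more concrete, here is a minimal toy sketch of a stack of dilated causal 1-D convolutions in TensorFlow/Keras. This is my own illustration, not DeepMind's implementation, and the layer count, filter sizes, and 256 output channels are assumptions:

import tensorflow as tf

def dilated_conv_stack(inputs, num_layers=8, filters=32):
    #inputs has shape [batch, time, channels]
    x = inputs
    skips = []
    for i in range(num_layers):
        #the dilation rate doubles at every layer: 1, 2, 4, 8, ...
        x = tf.keras.layers.Conv1D(filters, kernel_size=2,
                                   dilation_rate=2**i,
                                   padding='causal',
                                   activation='tanh')(x)
        skips.append(x)
    #combine the outputs of all layers, then project to the output channels
    combined = tf.keras.layers.Add()(skips)
    return tf.keras.layers.Conv1D(256, kernel_size=1, activation='softmax')(combined)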

The network is huge: it takes 90 minutes on their GPU cluster to generate just one second of audio.


Next, we will use a simpler model to implement an audio generator in TensorFlow.

1. Import the packages:

tqdm generates a progress bar that displays the progress of training.

import numpy as np
import pandas as pd
import msgpack
import glob
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops
from tqdm import tqdm
import midi_manipulation

We will use an RBM (Restricted Boltzmann Machine), a type of neural network, as the generative model. It is a two-layer network: the first layer is visible and the second is hidden. Nodes within the same layer are not connected to each other, while nodes in different layers are connected. Each node randomly decides whether it should pass the data it receives on to the next layer.

2. Define hyperparameters:

First, define the range of notes that the model needs to generate:

lowest_note = midi_manipulation.lowerBound #the index of the lowest note on the piano roll
highest_note = midi_manipulation.upperBound #the index of the highest note on the piano roll
note_range = highest_note-lowest_note #the note range

Next, define the number of timesteps and the sizes of the visible and hidden layers.

num_timesteps  = 15 #This is the number of timesteps that we will create at a time
n_visible      = 2*note_range*num_timesteps #This is the size of the visible layer. 
n_hidden       = 50 #This is the size of the hidden layer

Then the number of training epochs, the batch size, and the learning rate.

num_epochs = 200 #The number of training epochs that we are going to run. For each epoch we go through the entire data set.
batch_size = 100 #The number of training examples that we are going to send through the RBM at a time. 
lr         = tf.constant(0.005, tf.float32) #The learning rate of our model

3. Define variables:

x is the data fed into the network, and W stores the weight matrix, i.e. the connections between the two layers. In addition, two bias vectors are required: bh for the hidden layer and bv for the visible layer.

x  = tf.placeholder(tf.float32, [None, n_visible], name="x") #The placeholder variable that holds our data
W  = tf.Variable(tf.random_normal([n_visible, n_hidden], 0.01), name="W") #The weight matrix that stores the edge weights
bh = tf.Variable(tf.zeros([1, n_hidden],  tf.float32, name="bh")) #The bias vector for the hidden layer
bv = tf.Variable(tf.zeros([1, n_visible],  tf.float32, name="bv")) #The bias vector for the visible layer

Next, use gibbs_sample to create a sample from the input data x, and then sample the hidden layer:

gibbs_sample is an algorithm that can draw samples from complex, multivariate probability distributions.

It builds a chain in which each state depends on the previous one, and it randomly generates samples that follow the target distribution.

#The sample of x
x_sample = gibbs_sample(1) 
#The sample of the hidden nodes, starting from the visible state of x
h = sample(tf.sigmoid(tf.matmul(x, W) + bh)) 
#The sample of the hidden nodes, starting from the visible state of x_sample
h_sample = sample(tf.sigmoid(tf.matmul(x_sample, W) + bh))
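
The two helpers used above, sample and gibbs_sample, are defined elsewhere in the accompanying source code. A rough sketch of what they look like for a standard RBM, reusing the x, W, bh, and bv variables defined earlier (treat the details as an approximation rather than the exact original):

def sample(probs):
    #Takes a vector of probabilities and returns a random vector of 0s and 1s sampled from it
    return tf.floor(probs + tf.random_uniform(tf.shape(probs), 0, 1))

def gibbs_sample(k):
    #Runs k steps of alternating Gibbs sampling between the visible and hidden layers, starting from x
    def gibbs_step(count, k, xk):
        hk = sample(tf.sigmoid(tf.matmul(xk, W) + bh)) #sample the hidden values from the visible values
        xk = sample(tf.sigmoid(tf.matmul(hk, tf.transpose(W)) + bv)) #sample the visible values from the hidden values
        return count+1, k, xk
    count = tf.constant(0)
    [_, _, x_sample] = control_flow_ops.while_loop(lambda count, num_iter, *args: count < num_iter,
                                                   gibbs_step, [count, tf.constant(k), x])
    #Stop TensorFlow from propagating gradients back through the Gibbs sampling procedure
    x_sample = tf.stop_gradient(x_sample)
    return x_sample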

4. Update variables:

size_bt = tf.cast(tf.shape(x)[0], tf.float32) #the batch size
#The contrastive divergence deltas for the weights and the two bias vectors
W_adder  = tf.mul(lr/size_bt, tf.sub(tf.matmul(tf.transpose(x), h), tf.matmul(tf.transpose(x_sample), h_sample)))
bv_adder = tf.mul(lr/size_bt, tf.reduce_sum(tf.sub(x, x_sample), 0, True))
bh_adder = tf.mul(lr/size_bt, tf.reduce_sum(tf.sub(h, h_sample), 0, True))
#When we do sess.run(updt), TensorFlow will run all 3 update steps
updt = [W.assign_add(W_adder), bv.assign_add(bv_adder), bh.assign_add(bh_adder)]
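
These three deltas implement the single-step contrastive divergence (CD-1) update rule for an RBM. Writing x_sample and h_sample as $\tilde{x}$ and $\tilde{h}$, and the batch size as $N$, the updates are:

$$\Delta W = \frac{lr}{N}\left(x^{\top}h - \tilde{x}^{\top}\tilde{h}\right), \qquad \Delta b_v = \frac{lr}{N}\sum_{\text{batch}}\left(x - \tilde{x}\right), \qquad \Delta b_h = \frac{lr}{N}\sum_{\text{batch}}\left(h - \tilde{h}\right)$$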

5. Next, run the graph:

1. First, initialize the variables:
with tf.Session() as sess:
    #First, we train the model
    #initialize the variables of the model
    init = tf.initialize_all_variables()
    sess.run(init)

Each song first needs to be reshaped so that the corresponding vector representation can be used to train the model.
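
Note that the training loop below iterates over a list called songs, which is not defined in the snippets above. In the original source it is loaded from a folder of MIDI files; a rough sketch of that step is given below (the folder name Pop_Music_Midi and the midiToNoteStateMatrix helper come from Siraj's repository, so treat them as assumptions):

def get_songs(path):
    files = glob.glob('{}/*.mid*'.format(path))
    songs = []
    for f in tqdm(files):
        try:
            song = np.array(midi_manipulation.midiToNoteStateMatrix(f))
            if song.shape[0] > 50: #skip songs that are too short to slice into timesteps
                songs.append(song)
        except Exception as e:
            raise e
    return songs

songs = get_songs('Pop_Music_Midi') #the folder containing the training MIDI files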

    for epoch in tqdm(range(num_epochs)):
        for song in songs:
            #The songs are stored in a time x notes format. The size of each song is timesteps_in_song x 2*note_range
            #Here we reshape the songs so that each training example is a vector with num_timesteps x 2*note_range elements
            song = np.array(song)
            song = song[:int(np.floor(song.shape[0]/num_timesteps)*num_timesteps)]
            song = np.reshape(song, [song.shape[0]//num_timesteps, song.shape[1]*num_timesteps])
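
As a concrete illustration of this reshape (assuming note_range works out to 78, which is what the note bounds in the accompanying midi_manipulation module give): with num_timesteps = 15, a song matrix of shape (157, 156) is truncated to 150 rows and then reshaped to (10, 2340), i.e. ten training vectors whose length equals n_visible = 2*78*15 = 2340.
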
2. Next, train the RBM model, one batch at a time:
            for i in range(1, len(song), batch_size): 
                tr_x = song[i:i+batch_size]
                sess.run(updt, feed_dict={x: tr_x})

Once the model is fully trained, it can be used to generate music.

3. Run the Gibbs chain

The visible nodes are first initialized to zero in order to generate some samples. The resulting vectors are then reshaped into a better format for playback.

    sample = gibbs_sample(1).eval(session=sess, feed_dict={x: np.zeros((10, n_visible))})
    for i in range(sample.shape[0]):
        if not any(sample[i,:]):
            continue
        #Here we reshape the vector to be time x notes, and then save the vector as a midi file
        S = np.reshape(sample[i,:], (num_timesteps, 2*note_range))
4. Finally, save the generated chords as MIDI files
        midi_manipulation.noteStateMatrixToMidi(S, "generated_chord_{}".format(i))

To sum up: WaveNet's CNN generates audio by storing the parameterization of the sound wave in its model,

an RBM makes it easy to generate audio samples from training data,

and Gibbs sampling lets us draw samples that follow the learned probability distribution.

