
Last year, I spent some time studying speech recognition. Mostly out of power-consumption considerations, I focused on Sphinx, a traditional HMM-based implementation. The limitations of the HMM approach are quite obvious, though: today’s state-of-the-art speech recognition is basically built on DNNs, and RNNs in particular are well suited to processing speech sequences. A while ago I stumbled upon a speech recognition learning project on Github, which provides some labeled speech data and also implements some demo code. However, the author of that project wrapped TensorFlow in several layers of his own abstractions, which makes the code convoluted and hard for beginners to follow. So I wanted to implement a simple speech recognition program using the native TensorFlow API.

To be honest, I don’t know much about RNNs either, so I won’t go into the theory here. Intuitively, the structure of an RNN mirrors the sequential relationship in its input, which gives it a good ability to describe sequence models. In the book Deep Learning with TensorFlow, an RNN is used to train a classifier on the MNIST data set. Although MNIST is an image data set, if we regard each row of pixels as an input vector, the rows form a sequence in order from top to bottom. Experiments show that an RNN can classify MNIST this way as well.
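To make this concrete, here is a minimal sketch (my own illustration, using the same TF 1.x API as the code later in this post) of feeding an MNIST image to an LSTM as a sequence of 28 row vectors:

import tensorflow as tf
from tensorflow.contrib import rnn

# Treat each 28x28 image as a sequence: 28 time steps, one 28-dim row per step.
n_steps, n_input, n_hidden, n_classes = 28, 28, 128, 10

x = tf.placeholder("float", [None, n_steps, n_input])
W = tf.Variable(tf.random_normal([n_hidden, n_classes]))
b = tf.Variable(tf.random_normal([n_classes]))

rows = tf.unstack(x, n_steps, 1)  # list of 28 tensors, each [batch, 28]
cell = rnn.BasicLSTMCell(n_hidden)
outputs, _ = rnn.static_rnn(cell, rows, dtype=tf.float32)
logits = tf.matmul(outputs[-1], W) + b  # classify from the last time step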

1. Speech feature extraction

Among speech feature extraction methods, MFCC (Mel frequency cepstrum coefficient) is probably the most common. Simply put, MFCCs are short-time frequency-domain features, and in Python we can easily extract them with the librosa library. The extraction works roughly as follows: the speech signal is first split into short frames along the time axis; a fast Fourier transform is applied to each frame, yielding its spectrum; and the energy envelope of that spectrum is then discretized into a vector. This vector is the MFCC vector for that frame.
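For example, with librosa the whole pipeline collapses to a couple of calls (a minimal sketch of my own; the file name is hypothetical):

import librosa

# load one utterance as a mono float waveform
wave, sr = librosa.load("data/spoken_numbers_pcm/0_Agnes_100.wav", mono=True)
mfcc = librosa.feature.mfcc(wave, sr)
print(mfcc.shape)  # (n_mfcc, n_frames): by default, one 20-dim MFCC vector per frame

So each utterance becomes a matrix with one MFCC vector per short-time frame, which is exactly the kind of vector sequence an RNN expects.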

2. RNN model training

With features in hand, we can use TensorFlow to build and train the model. The model itself is very simple: the input layer is an LSTM, and the output layer is a single softmax layer. The output labels are one-hot encoded, and the input is a set of multidimensional vectors arranged in chronological order.

Since the model is very simple, I’ll just post the code:

import os
import re
import sys
import wave
import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn
from random import shuffle
import librosa

path = "data/spoken_numbers_pcm/"
# learning_rate = 0.00001
# training_iters = 300000  # steps
# batch_size = 64
height = 20   # mfcc features
width = 80    # (max) length of utterance
classes = 10  # digits

n_input = 20
n_steps = 80
n_hidden = 128
n_classes = 10
learning_rate = 0.001
training_iters = 100000
batch_size = 50
display_step = 10

x = tf.placeholder("float", [None, n_steps, n_input])
y = tf.placeholder("float", [None, n_classes])

weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, n_classes]))
}
biases = {
    'out': tf.Variable(tf.random_normal([n_classes]))
}

def mfcc_batch_generator(batch_size=10):
    batch_features = []
    labels = []
    files = os.listdir(path)
    while True:
        shuffle(files)
        for file in files:
            if not file.endswith(".wav"):
                continue
            wave, sr = librosa.load(path + file, mono=True)
            mfcc = librosa.feature.mfcc(wave, sr)
            # the file name starts with the spoken digit, which is the label
            label = dense_to_one_hot(int(file[0]), 10)
            labels.append(label)
            # zero-pad every utterance along the time axis to the fixed width
            mfcc = np.pad(mfcc, ((0, 0), (0, width - len(mfcc[0]))),
                          mode='constant', constant_values=0)
            batch_features.append(np.array(mfcc).T)
            if len(batch_features) >= batch_size:
                yield np.array(batch_features), np.array(labels)
                batch_features = []  # reset for next batch
                labels = []

def dense_to_one_hot(labels_dense, num_classes=10):
    return np.eye(num_classes)[labels_dense]

def RNN(x, weights, biases):
    # unstack (batch, n_steps, n_input) into a list of n_steps tensors
    x = tf.unstack(x, n_steps, 1)
    lstm_cell = rnn.BasicLSTMCell(n_hidden, forget_bias=1.0)
    outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
    # feed the output of the last time step into the softmax layer
    return tf.matmul(outputs[-1], weights['out']) + biases['out']

pred = RNN(x, weights, biases)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    step = 1
    batch = mfcc_batch_generator(batch_size)  # create the generator once
    while step * batch_size < training_iters:
        batch_x, batch_y = next(batch)
        batch_x = batch_x.reshape((batch_size, n_steps, n_input))
        sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})
        if step % display_step == 0:
            acc = sess.run(accuracy, feed_dict={x: batch_x, y: batch_y})
            loss = sess.run(cost, feed_dict={x: batch_x, y: batch_y})
            print("Iter " + str(step * batch_size) + ", Minibatch Loss = " +
                  "{:.6f}".format(loss) + ", Training Accuracy = " +
                  "{:.5f}".format(acc))
        step += 1
    print("Optimization Finished!")

Data sets can be downloaded at http://pannous.net/files/. For this example, you can simply download the spoken_numbers_pcm.tar file and extract it into the ./data/ directory.
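Note that the generator above reads the label from the first character of each file name, so it assumes every wav file is named starting with the digit it contains. A quick sanity check after unpacking (a small sketch of my own, assuming the directory layout above):

import os

path = "data/spoken_numbers_pcm/"
files = [f for f in os.listdir(path) if f.endswith(".wav")]
print(len(files), "wav files")
print(sorted({f[0] for f in files}))  # ideally ['0', '1', ..., '9']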

The training result was a bit embarrassing: accuracy on the training set only reached 80+%. I also found something very interesting. At first I had accidentally swapped the axes, feeding the MFCC coefficient dimension as the time steps (and the time frames as the input features), with everything else pretty much the same, and the training accuracy reached 100% very early on. As for why convergence is slower in the “correct” orientation, I suspect the zero padding of the time series is to blame: utterances shorter than n_steps are padded out with zeros, which appends all-zero steps to the end of every sequence and seriously contaminates the data. This conjecture remains to be tested, though.
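To make the two orientations concrete, here is a small illustration (my own sketch, not part of the training code) of where the padding zeros end up in each case:

import numpy as np

mfcc = np.random.randn(20, 50)  # 20 MFCC coefficients, 50 time frames (< 80)

# zero-pad the time axis up to width=80, as the generator does
padded = np.pad(mfcc, ((0, 0), (0, 80 - mfcc.shape[1])),
                mode='constant', constant_values=0)

# "correct" orientation: frames as RNN steps -> the last 30 steps are all zeros
as_time_steps = padded.T  # shape (80, 20)

# swapped orientation: each coefficient row becomes a "step" -> no all-zero steps;
# the zeros hide at the tail of every input vector instead
as_feature_steps = padded  # shape (20, 80)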

 

References:

[1]. librosa.github.io/

[2]. www.speech.cs.cmu.edu/15-492/slid…

[3]. github.com/pannous/ten…