Implementing an RNN with Python, NumPy, and Theano


In this section we will implement a full RNN from scratch in Python and optimize our implementation using Theano (a library that performs operations on the GPU). The full code is available on Github. I’ll skip some boilerplate code that isn’t necessary to understand recurrent neural networks, but all of that code is also on Github.

Language model

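Our goal is to build a language model using a recurrent neural network. Here is what that means. Say we have a sentence of m words. A language model lets us predict the probability of observing the sentence (in a given dataset) as:

$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$$

In words, the probability of a sentence is the product of the probabilities of each word given the words that came before it. So the probability of the sentence “He went to buy some chocolate” is the probability of “chocolate” given “He went to buy some”, multiplied by the probability of “some” given “He went to buy”, and so on.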
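To make the product concrete, here is a minimal sketch with made-up numbers. The conditional probabilities below are hypothetical and do not come from any trained model; they only show how the per-word terms combine. Because multiplying many small probabilities underflows quickly, the sketch also sums log-probabilities, which is a common practical workaround.

```python
import math

# Hypothetical values of P(word | preceding words) for each of the six
# words in "he went to buy some chocolate". These numbers are made up.
conditional_probs = [0.10, 0.05, 0.40, 0.02, 0.20, 0.01]

# Sentence probability: the product of the conditional probabilities.
sentence_prob = 1.0
for p in conditional_probs:
    sentence_prob *= p

# Equivalent, but numerically safer for long sentences: sum the logs.
log_prob = sum(math.log(p) for p in conditional_probs)

print("P(sentence) = %.3e" % sentence_prob)
print("log P(sentence) = %.4f" % log_prob)
```

The two quantities carry the same information: exponentiating the log-probability recovers the product.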

Why is this useful? Why do we assign a probability to observing a sentence?

First, such a model can be used as a scoring mechanism. For example, a machine translation system typically generates multiple candidates for an input sentence. You can use a language model to pick the most probable one. Intuitively, the most probable sentence is likely to be grammatically correct. Similar scoring happens in speech recognition systems.

Solving the language modeling problem also has a cool side effect. Because we can predict the probability of a word given the preceding words, we can generate new text. It is a generative model. Given an existing sequence of words, we sample the next word from the predicted probabilities and repeat the process until we have a complete sentence. Andrej Karpathy has a great blog post that demonstrates what language models are capable of. His models are trained on single characters rather than whole words, and can generate anything from Shakespeare to Linux code.

Note that in the equation above, the probability of each word is conditioned on all previous words. In practice, many models have a hard time representing such long-term dependencies because of computational or memory constraints. They are typically limited to looking at only a few of the previous words. In theory, RNNs can capture such long-term dependencies, but in practice it is a bit more complicated. We will explore this in a later post.

Training data and preprocessing

To train our language model, we need text to learn from. Fortunately, we don’t need any labels to train a language model, just raw text. I downloaded 15,000 longish Reddit comments from a dataset available on Google’s BigQuery. Text generated by our model will (hopefully) sound like a Reddit comment! But as with most machine learning projects, we first need to do some pre-processing to get the data into the right format.

Tokenize text. We have raw text, but we want to make predictions on a per-word basis. This means we must split our comments into sentences, and sentences into words. We could simply split each sentence on spaces, but that would not handle punctuation correctly. The sentence “He left!” should be 3 tokens: “He”, “left”, “!”. We’ll use NLTK’s word_tokenize and sent_tokenize methods, which do most of the hard work for us.
Remove infrequent words. Most words in our text appear only once or twice. It’s a good idea to remove these infrequent words. Having a huge vocabulary makes our model slow to train (we’ll talk about why later), and because we don’t have many contextual examples for such words, we wouldn’t learn how to use them correctly anyway. This is quite similar to how humans learn: to really understand how to use a word properly, you need to have seen it in different contexts. In our code, we limit the vocabulary to the vocabulary_size most common words (I set it to 8000, but feel free to change it). We replace all words not included in the vocabulary with UNKNOWN_TOKEN. For example, if the word “nonlinearity” is not in our vocabulary, the sentence “nonlinearity is important in neural networks” becomes “UNKNOWN_TOKEN is important in neural networks”. The word UNKNOWN_TOKEN becomes part of our vocabulary, and we predict it just like any other word. When we generate new text, we can replace UNKNOWN_TOKEN again, for example by taking a randomly sampled word that is not in our vocabulary, or we can just keep generating sentences until we get one that does not contain an unknown token.
Prepend special start and end tokens. We also want to learn which words tend to start and end a sentence. To do this, we prepend a special SENTENCE_START token and append a special SENTENCE_END token to each sentence. This lets us ask: given that the first token is SENTENCE_START, what is the most likely next word (the actual first word of the sentence)?
Build training data matrices. The input to our RNN is vectors, not strings. So we create a mapping between words and indices, index_to_word and word_to_index. For example, the word “friendly” may be at index 2001. A training example x may look like [0, 179, 341, 416], where 0 corresponds to SENTENCE_START. The corresponding label y would be [179, 341, 416, 1]. Remember that our goal is to predict the next word, so y is just the x vector shifted by one position, with the last element being the SENTENCE_END token. In other words, the correct prediction for word 179 above would be 341, the actual next word.
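The four steps can be sketched on a toy corpus. This is only an illustration under stated assumptions: the corpus is hand-tokenized (no NLTK), toy_vocabulary_size and the encode helper are hypothetical names, and out-of-vocabulary words are mapped during encoding rather than in a separate pass.

```python
from collections import Counter

sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"
unknown_token = "UNKNOWN_TOKEN"
toy_vocabulary_size = 6  # hypothetical, tiny on purpose

# Step 1 (already done by hand here): sentences split into tokens.
raw = [["he", "left", "!"], ["she", "left", "!"], ["he", "stayed", "."]]

# Step 3: add the start and end tokens to every sentence.
tokenized = [[sentence_start_token] + s + [sentence_end_token] for s in raw]

# Step 2: keep the most frequent words, reserving one slot for the unknown token.
counts = Counter(w for sent in tokenized for w in sent)
index_to_word = [w for w, _ in counts.most_common(toy_vocabulary_size - 1)]
index_to_word.append(unknown_token)
word_to_index = dict((w, i) for i, w in enumerate(index_to_word))

def encode(sent):
    # Words outside the vocabulary map to the unknown token's index.
    return [word_to_index.get(w, word_to_index[unknown_token]) for w in sent]

# Step 4: x is the sentence without its last token, y is x shifted by one.
X_toy = [encode(s[:-1]) for s in tokenized]
y_toy = [encode(s[1:]) for s in tokenized]

print(X_toy[2])
print(y_toy[2])
```

In the third sentence, “stayed” and “.” each appear only once and fall outside the six-word vocabulary, so both are encoded as the unknown token.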

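# Imports used by the code in this post
import csv
import itertools
import operator
import sys
from datetime import datetime

import nltk
import numpy as np

vocabulary_size = 8000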
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
print "Reading CSV file..."
with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print "Parsed %d sentences." % (len(sentences))

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print "Found %d unique words tokens." % len(word_freq.items())

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print "Using vocabulary size %d." % vocabulary_size
print "The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1])

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

print "\nExample sentence: '%s'" % sentences[0]
print "\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0]

# Create the training data
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences])

Here’s an actual training example:

x:
SENTENCE_START what are n't you understanding about this ? !
[0, 51, 27, 16, 10, 856, 53, 25, 34, 69]

y:
what are n't you understanding about this ? ! SENTENCE_END
[51, 27, 16, 10, 856, 53, 25, 34, 69, 1]

Building the RNN

For a general overview of RNNs, see the first part of this tutorial.

Let’s get concrete and see what the RNN for our language model looks like. The input x is a sequence of words (just like the example above), and each x_t is a single word. There is one more thing to note: because of how matrix multiplication works, we can’t simply use a word index (such as 36) as input. Instead, we represent each word as a one-hot vector of size vocabulary_size. For example, the word with index 36 is a vector that is 0 everywhere except at position 36, where it is 1. So each x_t becomes a vector, and x becomes a matrix in which each row represents a word. We will do this transformation in our neural network code. The network output o has a similar format. Each o_t is a vector of vocabulary_size elements, and each element represents the probability of that word being the next word in the sentence.

Let’s recall the equations for the RNN:

$$s_t = \tanh(U x_t + W s_{t-1})$$
$$o_t = \mathrm{softmax}(V s_t)$$

I find it useful to write down the dimensions of the matrices and vectors. Suppose we choose a vocabulary size C = 8000 and a hidden layer size H = 100. You can think of the hidden layer size as the “memory” of our network. Making it larger lets us learn more complex patterns, but it also requires more computation. Then we have:

$$x_t \in \mathbb{R}^{8000}, \quad o_t \in \mathbb{R}^{8000}, \quad s_t \in \mathbb{R}^{100}$$
$$U \in \mathbb{R}^{100 \times 8000}, \quad V \in \mathbb{R}^{8000 \times 100}, \quad W \in \mathbb{R}^{100 \times 100}$$

That’s valuable information. Remember, U, V, and W are the parameters of the network that we want to learn from the data. Therefore, we need to learn a total of 2HC + H^2 parameters. With C = 8000 and H = 100, that is 1,610,000. The dimensions also tell us where the bottleneck of our model is. Note that because x_t is a one-hot vector, multiplying it by U is essentially the same as selecting a column of U, so we don’t need to perform the full multiplication. The largest matrix multiplication in the network is then V s_t, which is why we want to keep the vocabulary as small as possible.
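Both claims are easy to check numerically. The sketch below counts the parameters for the sizes used in the text and verifies, on a small random matrix (U_demo is a stand-in, not the model’s actual U), that multiplying by a one-hot vector gives the same result as reading one column.

```python
import numpy as np

C, H = 8000, 100  # vocabulary size and hidden layer size from the text

# U is H x C, V is C x H, W is H x H
num_params = H * C + C * H + H * H
print("Number of parameters: %d" % num_params)

# A small stand-in for U keeps the demo cheap; the identity holds at any size.
rng = np.random.RandomState(0)
U_demo = rng.uniform(-1, 1, (4, 10))
word_index = 3
one_hot = np.zeros(10)
one_hot[word_index] = 1.0

full_product = U_demo.dot(one_hot)     # full matrix-vector multiplication
column_lookup = U_demo[:, word_index]  # just read one column

print("Same result: %s" % np.allclose(full_product, column_lookup))
```

The column lookup touches H values instead of H × C, which is why the input side of the network is cheap and the output multiplication by V dominates.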

With that, we can start training.

Initialize training parameters

We start by declaring an RNN class and initializing our parameters. I’m calling this class RNNNumpy because we will implement a Theano version later. Initializing the parameters U, V, and W is a bit tricky. We can’t just initialize them to 0, because that would result in symmetric calculations in all our layers. We have to initialize them randomly. Because proper initialization affects training results, there has been a lot of research in this area. It turns out that the best initialization depends on the activation function (tanh in our case), and one recommended approach is to initialize the weights randomly in the interval [−1/√n, 1/√n], where n is the number of incoming connections from the previous layer. This may sound overly complicated, but don’t worry too much about it. As long as you initialize the parameters to small random values, it usually works fine.

class RNNNumpy:

    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))

Above, word_dim is the size of our vocabulary, and hidden_dim is the size of our hidden layer (we can choose it freely). Don’t worry about the bptt_truncate parameter for now; we’ll explain it later.

Forward propagation

Next, let’s implement forward propagation (predicting word probabilities) as defined by the equations above:

def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    # During forward propagation we save all hidden states in s because we need them later.
    # We add one additional element for the initial hidden state, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    # For each time step...
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]

RNNNumpy.forward_propagation = forward_propagation
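The forward pass calls a softmax helper that this post does not show; it presumably belongs to the boilerplate code kept on Github. A minimal sketch of what such a helper might look like, assuming a 1-D input vector (this is an illustration, not necessarily the author’s version):

```python
import numpy as np

def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability.
    # This shifts every exponent by the same constant and leaves the
    # resulting probabilities unchanged.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

scores = np.array([1.0, 2.0, 3.0])
probs = softmax(scores)
print(probs)
```

The output is a vector of positive values that sums to 1, so each o_t can be read as a probability distribution over the vocabulary.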

We return not only the computed outputs, but also the hidden states. We will use them later to calculate the gradients, and returning them here avoids computing them twice. Each o_t is a vector of probabilities over all the words in the vocabulary, but sometimes, for example when evaluating our model, all we want is the next word with the highest probability. We call this function predict:

def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)

RNNNumpy.predict = predict

Let’s try the implementation and see a sample output:

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
o, s = model.forward_propagation(X_train[10])
print o.shape
print o

(45, 8000)
[[0.00012408 0.0001244 0.00012603 ... 0.0001255 0.0001288 0.00012508]
 [0.0001236 0.0001282 0.00012436 ... 0.00012482 0.00012456 0.00012451]
 [0.00012387 0.0001252 0.00012474 ... 0.00012559 0.00012588 0.00012551]
 ...
 [0.00012414 0.00012455 0.0001252 ... 0.00012487 0.00012494 0.0001263]
 [0.0001252 0.00012393 0.00012509 ... 0.00012472 0.0001253 0.00012487 ... 0.00012463 0.00012536 0.00012665]]

For each word in the sentence (45 in all), our model produced 8,000 probabilities, one for each candidate next word. Note that these predictions are completely random at this point, because we initialized U, V, and W to random values. The following gives the index of the highest-probability prediction for each word:

predictions = model.predict(X_train[10])
print predictions.shape
print predictions

(45,)
[1284 5221 7653 7430 1013 3562 7366 4860 2212 6601 7299 4556 2481 238 2539
 21 6548 261 1780 2005 1810 5376 4146 477 7051 4832 4991 897 3485 21
 7291 2007 6006 760 4864 2182 6569 2800 2752 6821 4437 7021 7875 6912 3575]

Calculate loss

To train our network, we need a way to measure the errors it makes. We call this the loss function L, and our goal is to find the parameters U, V, and W that minimize the loss function on our training data. A common choice of loss function is the cross-entropy loss. If we have N training examples (words in our text) and C classes (the size of our vocabulary), then the loss with respect to our predictions o and the true labels y is given by:

$$L(y, o) = -\frac{1}{N} \sum_{n \in N} y_n \log o_n$$

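The formula looks a little complicated, but all it really does is sum over our training examples and add to the loss based on how far off our predictions are. The further apart y (the correct words) and o (our predictions) are, the greater the loss. We implement the function calculate_loss: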

def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples
    N = np.sum([len(y_i) for y_i in y])
    return self.calculate_total_loss(x,y)/N

RNNNumpy.calculate_total_loss = calculate_total_loss
RNNNumpy.calculate_loss = calculate_loss
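To see the cross-entropy formula in action without the full model, here is a toy calculation. The predicted distributions and targets below are hypothetical values over a four-word vocabulary; because each y_n is one-hot, only the probability assigned to the correct word contributes to the sum.

```python
import math

# Hypothetical predicted distributions, one row per time step,
# over a 4-word vocabulary.
predicted = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.10, 0.80],
]
# Index of the correct next word at each time step.
targets = [0, 2, 3]

# Pick out the probability given to each correct word and sum the
# negative logs, mirroring what calculate_total_loss does per sentence.
total_loss = -sum(math.log(predicted[t][targets[t]]) for t in range(len(targets)))

# Divide by the number of predicted words to get the mean loss.
mean_loss = total_loss / len(targets)

print("Mean cross-entropy loss: %f" % mean_loss)
```

Confident correct predictions (0.70 and 0.80) add little to the loss, while the uncertain middle step (0.25) contributes the most.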

Let’s take a step back and think about what the loss should be for random predictions. That gives us a baseline and helps confirm that our implementation is correct. There are C words in our vocabulary, so each word should be predicted, on average, with probability 1/C, which yields a loss of:

$$L = -\frac{1}{N} N \log \frac{1}{C} = \log C$$

# Limit to 1000 examples to save time
print "Expected Loss for random predictions: %f" % np.log(vocabulary_size)
print "Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000])

Expected Loss for random predictions: 8.987197
Actual loss: 8.987440

Pretty close! But keep in mind that evaluating the loss on the complete data set is an expensive operation, and it can take hours if you have a lot of data!

Train RNN with SGD and BPTT

Remember, we want to find the parameters U, V, and W that minimize the total loss on the training data. The most common way to do this is SGD, Stochastic Gradient Descent. The idea behind SGD is simple. We iterate over all the training examples, and in each iteration we nudge the parameters in a direction that reduces the error. These directions are given by the gradients of the loss: ∂L/∂U, ∂L/∂V, ∂L/∂W. SGD also needs a learning rate, which defines how big a step we take in each iteration. SGD is the most popular optimization method not only for neural networks, but also for many other machine learning algorithms. As a result, there has been a lot of research on how to optimize SGD using batching, parallelism, and adaptive learning rates. Even though the basic idea is simple, implementing SGD in a really efficient way can become very complex. If you want to learn more about SGD this is a good place to start. Here I will implement a simple version of SGD that should be understandable even without a background in optimization.
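The update rule is easier to see on a problem far smaller than an RNN. The sketch below is a toy example, not part of the tutorial’s code: it fits a single parameter theta to minimize the average of (theta − x_i)^2 over a few samples, whose exact minimizer is the sample mean.

```python
# Toy data; the minimizer of the average squared distance is the mean, 2.5.
samples = [1.0, 2.0, 3.0, 4.0]
theta = 0.0
learning_rate = 0.01

for epoch in range(200):
    # One pass over the data: one parameter update per training example.
    for x_i in samples:
        # Gradient of the per-example loss (theta - x_i)^2 w.r.t. theta
        grad = 2.0 * (theta - x_i)
        # Step against the gradient, scaled by the learning rate
        theta -= learning_rate * grad

print("theta after training: %.4f" % theta)
```

With a constant learning rate, theta settles close to the mean rather than exactly on it, because each update chases a single example. That is one reason schedules that shrink the learning rate over time are common.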

But how do we calculate the gradients we mentioned above? In a traditional neural network, we do this with the backpropagation algorithm. In RNNs, we use a slightly modified version of it called Backpropagation Through Time (BPTT). Because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also on those of the previous time steps. If you know calculus, it really is just an application of the chain rule. The next part of this tutorial is all about BPTT, so I won’t go into detail here. Check this and this article for a general introduction to backpropagation. For now, you can treat BPTT as a black box. It takes a training example (x, y) as input and returns the gradients ∂L/∂U, ∂L/∂V, ∂L/∂W.

def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])              
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]

RNNNumpy.bptt = bptt

Gradient checking

Whenever you implement backpropagation, it’s a good idea to also implement gradient checking, which is a way to verify that your implementation is correct. The idea behind gradient checking is that the derivative of the loss with respect to a parameter equals the slope at that point, which we can approximate by changing the parameter slightly and dividing by the change:

$$\frac{\partial L}{\partial \theta} \approx \lim_{h \to 0} \frac{L(\theta + h) - L(\theta - h)}{2h}$$

We then compare the gradient calculated using backpropagation with the gradient estimated using the method above. If there is no large difference, our implementation is correct. The approximation needs to compute the total loss for every parameter, so gradient checking is very expensive (remember, we have over a million parameters in the example above). So it’s a good idea to perform it on a model with a smaller vocabulary.
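The same comparison can be tried on a function whose derivative is known in closed form. This is a toy sketch, independent of the RNN code: the loss theta^3 and the helper names are illustrative assumptions, while the relative error formula matches the one used for the real check.

```python
def loss(theta):
    # Toy loss with a known derivative: d/dtheta (theta^3) = 3 * theta^2
    return theta ** 3

def analytic_gradient(theta):
    # Stands in for the gradient that backpropagation would return.
    return 3.0 * theta ** 2

def numerical_gradient(f, theta, h=0.001):
    # Central difference: (f(theta + h) - f(theta - h)) / (2h)
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

def relative_error(a, b):
    # |a - b| / (|a| + |b|)
    return abs(a - b) / (abs(a) + abs(b))

theta = 2.0
exact = analytic_gradient(theta)
estimated = numerical_gradient(loss, theta)
error = relative_error(exact, estimated)

print("exact=%f estimated=%f relative error=%e" % (exact, estimated, error))
```

For this function the central difference overshoots the true gradient by exactly h^2, so the relative error is tiny. A buggy analytic gradient would typically produce an error orders of magnitude larger, which is what the threshold is meant to catch.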

def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    # Calculate the gradients using backpropagation. We want to check if these are correct.
    bptt_gradients = self.bptt(x, y)
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        # Get the actual parameter value from the model, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print "Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape))
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            # Save the original value so we can reset it later
            original_value = parameter[ix]
            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = self.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = self.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)
            # Reset parameter to original value
            parameter[ix] = original_value
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
            # Calculate the relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient)/(np.abs(backprop_gradient) + np.abs(estimated_gradient))
            # If the error is too large, fail the gradient check
            if relative_error > error_threshold:
                print "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix)
                print "+h Loss: %f" % gradplus
                print "-h Loss: %f" % gradminus
                print "Estimated_gradient: %f" % estimated_gradient
                print "Backpropagation gradient: %f" % backprop_gradient
                print "Relative Error: %f" % relative_error
                return
            it.iternext()
        print "Gradient check for parameter %s passed." % (pname)

RNNNumpy.gradient_check = gradient_check

# To avoid performing millions of expensive calculations we use a smaller vocabulary size for checking.
grad_check_vocab_size = 100
np.random.seed(10)
model = RNNNumpy(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0,1,2,3], [1,2,3,4])

Perform SGD

Now that we can calculate the gradients for our parameters, we can implement SGD. I like to do this in two steps: 1. A function sgd_step that calculates the gradients and performs the updates for one batch. 2. An outer loop that iterates through the training set and adjusts the learning rate.

# Performs one step of SGD.
def numpy_sdg_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNNNumpy.sgd_step = numpy_sdg_step

# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    for epoch in range(nepoch):
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print "%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss)
            # Adjust the learning rate if loss increases
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5
                print "Setting learning rate to %f" % learning_rate
            sys.stdout.flush()
        # For each training example...
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1
    # Return the recorded losses so the caller can inspect or plot them
    return losses

Done! Let’s get an idea of how long it takes to train the network:

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)

Uh-oh, bad news. One step of SGD takes about 350 milliseconds on my laptop. We have about 80,000 examples in our training data, so one epoch (an iteration over the whole data set) would take several hours. Multiple epochs would take days, or even weeks! And we are still working with a small data set compared to what many companies and researchers use. Now what?

Fortunately, there are many ways to speed up our code. We can stick with the same model and make the code run faster, or we can modify the model to be less computationally expensive, or both. Researchers have identified many ways to make models cheaper to compute, for example by using a hierarchical softmax or by adding projection layers to avoid the large matrix multiplications (also see here or here). But I want to keep our model simple, so I went the first route: make the implementation run faster using a GPU. Before doing that, let’s run SGD on a small data set and check that the loss actually decreases:

np.random.seed(10)
# Train on a small subset of the data to see what happens
model = RNNNumpy(vocabulary_size)
losses = train_with_sgd(model, X_train[:100], y_train[:100], nepoch=10, evaluate_loss_after=1)

2015-09-30 10:08:19: Loss after num_examples_seen=0 epoch=0: 8.987425
2015-09-30 10:08:35: Loss after num_examples_seen=100 epoch=1: 8.976270
2015-09-30 10:08:50: Loss after num_examples_seen=200 epoch=2: 8.960212
2015-09-30 10:09:06: Loss after num_examples_seen=300 epoch=3: 8.930430
2015-09-30 10:09:22: Loss after num_examples_seen=400 epoch=4: 8.862264
2015-09-30 10:09:38: Loss after num_examples_seen=500 epoch=5: 6.913570
2015-09-30 10:09:53: Loss after num_examples_seen=600 epoch=6: 6.302493
2015-09-30 10:10:07: Loss after num_examples_seen=700 epoch=7: 6.014995
2015-09-30 10:10:24: Loss after num_examples_seen=800 epoch=8: 5.833877
2015-09-30 10:10:39: Loss after num_examples_seen=900 epoch=9: 5.710718

Good, it seems our implementation is at least doing something useful and decreasing the loss, just as we wanted.

Training our network with Theano and the GPU

I wrote a tutorial on Theano before, and since all our logic stays exactly the same, I won’t go through the optimized code again here. I defined an RNNTheano class that replaces the NumPy calculations with the corresponding calculations in Theano. Like the rest of this post, the code is available on Github.

np.random.seed(10)
model = RNNTheano(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)

This time, one SGD step takes 70 ms on my Mac (without a GPU) and 23 ms on a g2.2xlarge Amazon EC2 instance (with a GPU). That is a 15x improvement over our initial implementation, which means we can train our model in hours or days instead of weeks. There are still many optimizations we could make, but this is good enough for now.

To help you avoid spending days training a model, I have pre-trained a Theano model with a hidden layer dimensionality of 50 and a vocabulary size of 8000. I trained it for 50 epochs in about 20 hours. The loss was still decreasing, and training longer would probably have produced a better model. Feel free to try it out yourself and train for longer. You can find the model parameters in data/trained-model-theano.npz in the Github repository and load them using the load_model_parameters_theano method:

from utils import load_model_parameters_theano, save_model_parameters_theano

model = RNNTheano(vocabulary_size, hidden_dim=50)
# losses = train_with_sgd(model, X_train, y_train, nepoch=50)
# save_model_parameters_theano('./data/trained-model-theano.npz', model)
load_model_parameters_theano('./data/trained-model-theano.npz', model)

Generating text

Now that we have the model, we can use it to generate new text! Start by implementing a helper function to generate a new sentence:

def generate_sentence(model):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    # Repeat until we get an end token
    while not new_sentence[-1] == word_to_index[sentence_end_token]:
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs[-1])
            sampled_word = np.argmax(samples)
        new_sentence.append(sampled_word)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    return sentence_str

num_sentences = 10
senten_min_length = 7

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model)
    print " ".join(sent)
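The sampling step inside generate_sentence can be looked at in isolation. In the sketch below, next_word_probs is a hypothetical distribution over a five-word vocabulary rather than real model output. A single multinomial draw returns a one-hot count vector, and argmax recovers the index of the sampled word.

```python
import numpy as np

np.random.seed(0)

# Hypothetical next-word distribution over a 5-word vocabulary
next_word_probs = np.array([0.1, 0.6, 0.1, 0.1, 0.1])

# multinomial(1, p) draws one sample and returns a one-hot count vector;
# argmax turns that vector back into a word index.
samples = np.random.multinomial(1, next_word_probs)
sampled_word = int(np.argmax(samples))

# Many draws at once approximate the distribution itself.
counts = np.random.multinomial(10000, next_word_probs)

print("Sampled word index: %d" % sampled_word)
print("Counts over 10000 draws: %s" % counts)
```

Sampling, rather than always taking the most probable word, is what lets the model produce a different sentence on each call.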

Here are a few of the generated sentences. I added capitalization.

Anyway, to the city scene you’re an idiot teenager.
What ? ! ! ! ! ignore!
Screw Fitness, you’re saying: HTTPS
Thanks for the advice to keep my thoughts around girls.
Yep, please disappear with the terrible generation.

Looking at the generated sentences, there are a few interesting things to note. The model successfully learned syntax. It places commas properly (usually before “and” and “or”) and ends sentences with punctuation. Sometimes it mimics internet speech, such as multiple exclamation marks or emojis.

However, the vast majority of the generated sentences don’t make sense or contain grammatical errors (I picked the best ones above). One reason could be that we did not train the network long enough (or did not use enough training data). That may be true, but it is most likely not the main reason. Our vanilla RNN can’t generate meaningful text because it is unable to learn dependencies between words that are several steps apart. That is also why RNNs failed to gain popularity when they were first invented. They were beautiful in theory but didn’t work well in practice, and it wasn’t immediately understood why.

Fortunately, the difficulties in training RNNs are much better understood now. In the next part of this tutorial, we will explore the Backpropagation Through Time (BPTT) algorithm in more detail and demonstrate what is called the vanishing gradient problem. This will motivate our move to more sophisticated RNN models, such as LSTMs, which are the current state of the art for many tasks in NLP (and can generate much better Reddit comments!). Everything you learned in this tutorial also applies to LSTMs and other RNN models, so don’t be discouraged if the results of a vanilla RNN are worse than you expected.