
Introduction:

This is a transcript of my earlier blog post on CSDN: NLP Study Notes (VI) Text Generation.

Text Generation

This lesson introduces one application of RNNs: text generation. We can train an RNN to generate text automatically.

Main idea

Let’s start with an example. Suppose you type in half a sentence, “the cat sat on the ma”, and ask the network to predict the next character. We can train a neural network to do this. The training data is a large amount of text. The text is split into characters, and each character is represented with one-hot encoding. These one-hot vectors are fed into the RNN one at a time, and the RNN’s state vector h accumulates the information seen so far. The RNN returns the final state vector h; on top of it sits a Softmax classifier that multiplies h by a parameter matrix W and applies Softmax to output a probability for each category.
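As a minimal sketch of this prediction step (the dimensions, and names such as h and W with a 57-character dictionary, are illustrative assumptions, not from the original post):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

h = np.random.rand(128)      # hypothetical final RNN state (dimension 128)
W = np.random.rand(128, 57)  # hypothetical parameter matrix: 57 characters
probs = softmax(h @ W)       # probability of each candidate character
```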

Assuming the network has been trained, the Softmax layer outputs probability values like these. We either choose the character with the maximum probability, or sample randomly according to the probabilities. The chosen character, here “t”, is then appended to the end of the input text.

Then take “the cat sat on the mat” as input and compute the output for the next character, which is probably a period. Repeating this process over and over, the generated text can grow very long.

Now let’s see how to train this RNN. The training data is text, such as all the articles in the English Wikipedia. Divide the text into segments; these segments can overlap.

For example, the red fragment is the input text, and the blue character “a” that follows it is the label. Setting seg_len=40 means the red fragment is 40 characters long; setting stride=3 means the next fragment is shifted 3 characters to the right, and so on.

If the article has 3,000 characters, we get about 1,000 red fragments and 1,000 blue labels.
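A minimal sketch of this slicing step, assuming a plain Python string as the corpus (the placeholder text below just repeats a pattern to reach 3,000 characters):

```python
text = "abc" * 1000  # placeholder corpus of 3,000 characters

seg_len, stride = 40, 3
segments, next_chars = [], []
for i in range(0, len(text) - seg_len, stride):
    segments.append(text[i:i + seg_len])  # red fragment (input)
    next_chars.append(text[i + seg_len])  # blue character (label)

print(len(segments))  # ~1,000 fragments, matching the estimate above
```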

The purpose of training the neural network is to predict the next character given an input fragment. This is actually a multi-class classification problem: if there are 50 distinct characters, including spaces, letters, and punctuation, then the number of categories is 50.

The style of the trained network’s output depends on the text used for training. If you train on Shakespeare, the output is Shakespearean text.

Text generation can be used to do interesting things, like generating new English names.

If Linux source code is used as training data, the neural network generates text that looks like source code.

Similarly, a network trained on LaTeX source code automatically generates LaTeX source.

Training a Text Generator

Prepare Training Data

We use Python to read the text of a book and convert it to lowercase, ignoring case. Each training example takes 60 characters as input and the following character as the label.
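A minimal sketch of this preparation, assuming the book is a plain-text file (the file name and the stride of 3 are assumptions; the post only fixes the segment length of 60):

```python
with open("book.txt", encoding="utf-8") as f:
    text = f.read().lower()  # read the book and ignore case

seg_len, stride = 60, 3
starts   = range(0, len(text) - seg_len, stride)
segments = [text[i:i + seg_len] for i in starts]  # 60-character inputs
labels   = [text[i + seg_len] for i in starts]    # 1-character labels
```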

Converting a Character to a Vector

First build a dictionary of the characters. The text is then converted into one-hot vectors.

After one-hot encoding, each string becomes a numeric matrix: the number of rows is the string length and the number of columns is the dictionary size. Word embedding is not used here because the dictionary of characters is very small, only 57 entries, so the one-hot vectors are just 57-dimensional.
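Continuing the preparation sketch above (text, segments, labels, and seg_len come from that sketch), the encoding might look like this:

```python
import numpy as np

chars = sorted(set(text))                          # character dictionary
char_to_idx = {c: i for i, c in enumerate(chars)}
v = len(chars)                                     # dictionary size, e.g. 57

X = np.zeros((len(segments), seg_len, v), dtype=bool)
y = np.zeros((len(labels), v), dtype=bool)
for i, seg in enumerate(segments):
    for t, c in enumerate(seg):
        X[i, t, char_to_idx[c]] = 1                # one row per character
    y[i, char_to_idx[labels[i]]] = 1               # one-hot label
```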

Build a Network

The Keras implementation is omitted in the original post.
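Since the code is omitted, here is a hedged reconstruction of a typical character-level model: one LSTM layer over the (seg_len, v) one-hot inputs, then a Dense Softmax over the v characters. The layer sizes and training settings are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(seg_len, v)),       # one segment of one-hot rows
    layers.LSTM(128),                       # final state vector h
    layers.Dense(v, activation="softmax"),  # probability of each character
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, batch_size=128, epochs=10)
```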

Predict the Next Character

After the model is built, we have three strategies for selecting the next character.

Option 1: Greedy Selection

The first strategy is greedy selection: simply pick the character with the highest probability. However, the text generated this way is deterministic, so the generated articles are fixed, and their readability is very poor.
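In code (probs being the Softmax output for one step, as in the earlier sketch):

```python
import numpy as np

next_idx = np.argmax(probs)  # always take the most probable character
```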

Option 2: Multinomial Sampling

The second strategy is multinomial sampling according to the probability of each output character. This introduces randomness, and the generated text is much better.
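A one-line sketch of this sampling step:

```python
import numpy as np

next_idx = np.random.choice(len(probs), p=probs)  # sample by probability
```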

Option 3: Adjust the Multinomial Distribution

The third strategy raises each probability in the original distribution to a power and renormalizes. In the adjusted distribution, characters that already had high probability in method 2 become even more likely, and in practice this works better.
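A sketch with a temperature parameter (the name and default value are assumptions; a temperature below 1 raises the probabilities to a power greater than 1, sharpening the distribution):

```python
import numpy as np

def sample(probs, temperature=0.5):
    p = np.power(probs, 1.0 / temperature)  # raise to a power > 1
    p /= p.sum()                            # renormalize
    return np.random.choice(len(p), p=p)
```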

Summary

To generate text, you first need to train a recurrent neural network. Step one: divide the text into many segments; each segment is an input and next_char is its label. Step two: one-hot encode the characters, so each segment of length l becomes an l × v matrix, where v is the dictionary size. Step three: build and train the neural network. To generate text, give a seed fragment as input and then repeat the following steps (a sketch of this loop follows the list):

  • Feed the segment into the neural network
  • The network outputs a probability for each character
  • Sample from the probabilities to get next_char
  • Append the newly generated character to the end of the segment
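Putting the steps together (this sketch reuses model, chars, char_to_idx, seg_len, v, and sample from the earlier sketches):

```python
import numpy as np

def generate(seed, length=200, temperature=0.5):
    segment, out = seed[-seg_len:], seed
    for _ in range(length):
        x = np.zeros((1, seg_len, v), dtype=bool)  # encode the segment
        for t, c in enumerate(segment):
            x[0, t, char_to_idx[c]] = 1
        probs = model.predict(x, verbose=0)[0]     # probability per character
        next_char = chars[sample(probs, temperature)]
        out += next_char                           # append the new character
        segment = segment[1:] + next_char          # slide the input window
    return out
```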