OpenAI proposed the GPT model in the paper Improving Language Understanding by Generative Pre-Training. GPT is built from unidirectional Transformer decoders and was pre-trained by the OpenAI team, without supervision, on a very large book data set, the Toronto Book Corpus.

Later, the OpenAI team proposed the GPT-2 model, the successor of GPT. GPT-2 is an expanded version of GPT with a larger training set and more parameters. GPT-2's training set consists of 8 million web pages, about 40 gigabytes of text, crawled from the web by the researchers. The training task is: given the preceding text, the model must predict the next word.

The difference between GPT and GPT-2 is that GPT-2 uses more training data and more model parameters; there is no significant difference in their structure. Therefore, we will mainly introduce the GPT-2 model, which performs better in practice.

The core idea of GPT-2

According to the researchers' observation, natural language is flexible enough to express a task, its input, and its output all as a single string, and a model trained on strings of this form can learn the corresponding task. For example, in a translation task, a training sample can be written as

(translate to french, english text, french text)

Similarly, in a reading comprehension task, a training sample can be written as

(answer the question, document, question, answer)

Moreover, training samples from multiple tasks, all in the above format, can be used to train one model at the same time, so that the model acquires the ability to perform multiple tasks at once.
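For illustration only, here is a tiny Python sketch of how samples from different tasks might be serialized into plain strings; the separator and the example sentences are assumptions, not the exact format used by the GPT-2 authors.

# Illustrative only: serialize (task, input, output) tuples from different tasks
# into plain strings so that a single language model can be trained on all of them.
samples = [
    ("translate to french", "good morning", "bonjour"),
    ("answer the question", "GPT-2 was released by OpenAI.", "Who released GPT-2?", "OpenAI"),
]

for sample in samples:
    # The separator is an arbitrary choice for this sketch.
    print(" | ".join(sample))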

So the OpenAI researchers hypothesized that a language model with sufficient capacity would learn to infer and perform the tasks demonstrated in its training data, simply in order to predict that data better. A language model that can do this is, in effect, performing unsupervised multitask learning. The researchers decided to test this hypothesis by analyzing the performance of a language model on a wide variety of tasks, and hence GPT-2 was born.

And, as they predicted, GPT-2 performed well on a variety of tasks, as shown below:

[Figure: GPT-2's performance on a variety of downstream tasks]

The breakthrough of GPT-2 over other pre-trained models is that it performs well on downstream tasks such as reading comprehension, machine translation, question answering, and text summarization without being trained for those specific tasks. This shows that unsupervised training can produce models that perform well on a variety of downstream tasks when the model is large enough and the training data is sufficient.

Supervised learning requires a lot of carefully cleaned data, which is expensive to obtain. Unsupervised learning overcomes this shortcoming because it does not require manual annotation and has a huge amount of data readily available. This is part of the significance of the GPT-2 research.

With the idea behind building the GPT-2 model in mind, let’s look at the structure of the GPT-2 model in detail.

GPT-2 model structure

The overall structure of GPT-2 is described below. GPT-2 is built on the Transformer: it preprocesses data with byte pair encoding and pre-trains the language model on the task of predicting the next word. Let's take a step-by-step look at GPT-2, starting from its preprocessing method.

Byte pair encoding

The GPT-2 model uses Byte Pair Encoding (BPE) for data preprocessing. BPE is a method that handles out-of-vocabulary words and keeps the vocabulary small, combining the advantages of word-level and character-level encoding. For example, suppose we want to encode the following string,

aaabdaaabac

The byte pair aa occurs most often, so we replace it with a character Z that does not appear in the string,

ZabdZabac
Z=aa

And then we repeat the process, replacing ab with Y,

ZYdZYac
Y=ab
Z=aa

So let’s go ahead and replace ZY with X,

XdXac
X=ZY
Y=ab
Z=aa

This process is repeated until no byte pair occurs more than once. When decoding is required, the substitutions are reversed.

The original BPE paper (Sennrich et al., 2016) includes a short Python reference implementation of this algorithm.
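Below is a lightly commented sketch that follows that reference implementation; the toy vocabulary is illustrative, and </w> marks the end of a word.

import re
import collections

def get_stats(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Merge every occurrence of the chosen pair into a single new symbol.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[pattern.sub(''.join(pair), word)] = v_in[word]
    return v_out

# Toy vocabulary: each word is a sequence of characters, </w> marks the word end.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for _ in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # the most frequent pair is merged next
    vocab = merge_vocab(best, vocab)
    print(best)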

Unidirectional Transformer decoder structure

The GPT-2 model consists of stacked unidirectional Transformer decoder blocks and is essentially an autoregressive model. Autoregressive means that each time a new word is generated, it is appended to the original input sentence, and the result becomes the new input. The Transformer decoder structure is shown below:

[Figure: the Transformer decoder structure]

The GPT-2 model uses only stacked masked self-attention and feed-forward neural network modules, as shown below:

[Figure: GPT-2 built from stacked masked self-attention and feed-forward modules]

As you can see, the GPT-2 model feeds a sentence into the structure shown above, predicts the next word, then appends the new word to the input and continues predicting. The loss function measures the deviation between the predicted word and the actual next word.
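As a conceptual sketch (not GPT-2's actual code), the autoregressive loop can be written like this, where predict_next_word is a hypothetical stand-in for a trained language model:

# Conceptual sketch of autoregressive generation; predict_next_word is a
# hypothetical stand-in for a trained language model.
def generate(predict_next_word, prompt_tokens, max_new_tokens=20, eos_token="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_word(tokens)  # the model only sees the tokens so far
        if next_token == eos_token:
            break
        tokens.append(next_token)               # the new word becomes part of the input
    return tokens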

As we know from the previous section, BERT is built on a bidirectional Transformer structure, while GPT-2 is based on a unidirectional Transformer. Bidirectional and unidirectional here mean that when computing attention, BERT considers the influence of the words on both sides of a masked word, while GPT-2 only considers the words to the left of the word to be predicted.
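To make the "left-only" attention concrete, here is a small PyTorch sketch of the causal mask a GPT-2-style decoder applies before the softmax; the sequence length and scores are toy values.

import torch

seq_len = 5
# Lower-triangular matrix: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
scores = torch.randn(seq_len, seq_len)                        # toy attention scores
masked = scores.masked_fill(causal_mask == 0, float("-inf"))  # hide "future" words
weights = torch.softmax(masked, dim=-1)                       # each row sums to 1
print(weights)  # entries above the diagonal are 0: no attention to words on the right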

The GPT-2 model was trained with the data preprocessing method and model structure described above, together with a large amount of data. For safety reasons, the OpenAI team did not initially open-source the full set of trained parameters, but it did provide a smaller pre-trained model. Next, we will conduct experiments based on this GPT-2 pre-trained model.

GPT-2 text generation

GPT-2 is a language model that predicts the next word from the preceding text, so it can use what it learned during pre-training to generate text, such as news. It can also be fine-tuned on other data to produce text with a specific format or theme, such as poetry or drama. So next, we will use the GPT-2 model to conduct a text generation experiment.

Pre-trained models generate news

The easiest way to run the pre-trained GPT-2 model is simply to let it run freely and generate random text. In other words, we give it a prompt, a predetermined starting phrase, and then let it generate the subsequent text on its own.

This can sometimes lead to problems, such as the model falling into a loop and generating the same word over and over. To avoid this, GPT-2 uses a top-k parameter, so that the model randomly selects the next word from the k words with the highest probability.
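A minimal sketch of such a top-k helper is shown below; it assumes the model's logits for the next position are passed in as a 1-D tensor, and it may differ from the exact function used in the original course materials.

import torch

def select_top_k(logits, k=10):
    # Keep only the k highest-scoring candidates for the next word.
    values, indices = torch.topk(logits, k)
    # Renormalize over the top k and sample one of them at random.
    probs = torch.softmax(values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return indices[choice].item()   # id of the chosen token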

Let's now load the GPT-2 model through the GPT2Tokenizer() and GPT2LMHeadModel() classes packaged in the PyTorch-Transformers model library, to actually see GPT-2's ability to predict the next word after pre-training. First, you need to install PyTorch-Transformers.
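If the library is not already present in the environment, it can be installed with pip (the PyPI package name is pytorch-transformers, the predecessor of today's transformers library):

! pip install pytorch-transformers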

Next, use GPT2LMHeadModel() to build the model and set it to evaluation mode. Because the pre-trained model has a large number of parameters and is hosted on an external network, for this experiment the pre-trained weights were first downloaded from the Lanbridge cloud class mirror server; this step is not required when experimenting locally.
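The steps might look like the following sketch, which loads the tokenizer and model and predicts one next word; it downloads the weights from the library's default source rather than the course mirror, and the prompt text is an arbitrary example.

import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the pre-trained tokenizer and model (weights are downloaded on first use).
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()   # evaluation mode: disables dropout

text = "Yesterday, a man named Jack said he saw an alien,"   # arbitrary prompt
indexed_tokens = tokenizer.encode(text)
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]        # logits for every position in the input

# Greedy choice of the next word (top-k sampling could be used here instead).
predicted_index = torch.argmax(predictions[0, -1, :]).item()
print(tokenizer.decode(indexed_tokens + [predicted_index]))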

At the end of the run, we look at the text generated by the model. At first glance it reads roughly like normal text, but a closer look reveals logical problems in the sentences, something researchers continue to work on.

In addition to generating text directly from the pre-trained model, we can also use fine-tuning methods to make the GPT-2 model generate text with a specific style and format.

Fine-tuning to generate dramatic text

Next, we'll fine-tune GPT-2 with some dramatic scripts. The GPT-2 parameters open-sourced by the OpenAI team were obtained by pre-training on English data sets; a Chinese data set could be used for fine-tuning, but it would require a large amount of data and time to achieve good results. Therefore, an English data set is used for fine-tuning here, so as to better demonstrate the capability of the GPT-2 model.

First, download the training data set, which uses Shakespeare's Romeo and Juliet as the training sample. We have downloaded the data set in advance and placed it on the Lanbridge cloud class server. You can download it with the following commands.

! wget -nc "https://labfile.oss.aliyuncs.com/courses/1372/romeo_and_juliet.zip"
! unzip -o "romeo_and_juliet.zip"
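A much-simplified fine-tuning sketch is shown below; the file name romeo_and_juliet.txt, the block size, the learning rate, and the number of epochs are all assumptions, and batching/GPU handling are omitted for brevity.

import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.train()

# Read the play and encode it as one long token sequence (file name is an assumption).
with open('romeo_and_juliet.txt', 'r', encoding='utf-8') as f:
    text = f.read()
tokens = tokenizer.encode(text)

# Split into fixed-length chunks so each example fits the model's context window.
block_size = 512
chunks = [tokens[i:i + block_size] for i in range(0, len(tokens) - block_size, block_size)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
for epoch in range(2):
    for chunk in chunks:
        inputs = torch.tensor([chunk])
        # With labels equal to the inputs, the model returns the language-modeling loss.
        loss = model(inputs, labels=inputs)[0]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('epoch', epoch, 'loss', loss.item())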
