OpenAI GPT was proposed before Google's BERT. The biggest difference from BERT is that GPT is trained as a traditional language model, that is, it predicts each word from the words that precede it, while BERT predicts a word from both the preceding and the following words. As a result, GPT is better suited to natural language generation (NLG) tasks, and BERT is better suited to natural language understanding (NLU) tasks.

1. OpenAI GPT

OpenAI proposed the GPT model in the paper Improving Language Understanding by Generative Pre-Training. The GPT-2 model was later put forward in the paper Language Models are Unsupervised Multitask Learners. GPT-2's model structure differs little from GPT's, but it is trained on a much larger data set. Both GPT and BERT use the Transformer model. For those unfamiliar with Transformer and BERT, please refer to the previous articles “A Detailed Explanation of Transformer Model” and “A Thorough Understanding of Google BERT Model”.

The training method adopted by GPT is divided into two steps: the first step is to train a language model on unlabeled text data sets, and the second step is to fine-tune the model for specific downstream tasks, such as QA, text classification, etc. BERT also uses this two-step training method. Let's first look at the main differences between GPT and BERT.

Pretraining: GPT is pretrained in the same way as a traditional language model, predicting the next word from the preceding context, while BERT is pretrained with a Masked LM, which uses both the preceding and the following words to predict a word. For example, given a sentence [u1, u2, ..., un], GPT only uses the information in [u1, u2, ..., u(i-1)] to predict the word ui, whereas BERT also uses the information in [u1, u2, ..., u(i-1), u(i+1), ..., un]. This is shown in the figure below, and written out as formulas after these three comparison points.

Model performance: GPT is better suited to natural language generation (NLG) tasks because it is a traditional language model, and these tasks typically generate the next piece of information from what has already been produced. BERT is better suited to natural language understanding (NLU) tasks.

Model structure: GPT uses the Transformer Decoder, while BERT uses the Transformer Encoder. GPT relies on the Masked Multi-Head Attention structure in the Decoder: when predicting the word ui from [u1, u2, ..., u(i-1)], the words after ui are masked out.
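The difference between the two pretraining objectives can be written compactly as follows (ui denotes the i-th word of the sentence; the formulas simply restate the comparison above):

```latex
% GPT: traditional (left-to-right) language model
P(u_i \mid u_1, u_2, \ldots, u_{i-1})

% BERT: Masked LM, conditioning on both sides of the masked word
P(u_i \mid u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_n)
```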

2. GPT model structure

GPT uses the Transformer Decoder structure, with one change to the original Decoder: the original Decoder contains two Multi-Head Attention structures, and GPT keeps only the Masked Multi-Head Attention, as shown below.

GPT uses the sentence sequence to predict the next word, so Masked Multi-Head Attention is used to block out the words that follow and prevent information leakage. For example, given a sentence containing four words [A, B, C, D], GPT needs to use A to predict B, use [A, B] to predict C, and use [A, B, C] to predict D.

In Self-Attention, the Mask is applied before the Softmax: the scores at the masked positions are replaced with -inf (an extremely small value), and Softmax is then applied, as shown below.

It can be seen that after the Mask and the Softmax, GPT can only use the information of word A when predicting word B, and only the information of [A, B] when predicting word C. This prevents information from later positions from leaking.
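A minimal NumPy sketch of this mask-then-Softmax step (the attention scores below are random placeholders; only the masking logic matters):

```python
import numpy as np

def masked_softmax(scores):
    """scores: an (n, n) matrix of attention scores, where row i is the query at position i."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future words
    scores = np.where(future, -np.inf, scores)          # Mask: block attention to future words
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)            # Softmax: masked positions get weight 0

# For the 4-word sentence [A, B, C, D]:
weights = masked_softmax(np.random.randn(4, 4))
print(np.round(weights, 2))  # row 0 attends only to A, row 1 only to A and B, and so on
```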

Below is the overall GPT model, which consists of 12 Decoder blocks.
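As a rough PyTorch sketch of this stacked structure (the hyperparameters and layer details are illustrative assumptions, not the exact released GPT configuration):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """A Decoder block that keeps only Masked Multi-Head Attention plus a feed-forward layer."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):
        n = x.size(1)
        # Causal mask: position i may only attend to positions <= i
        mask = torch.triu(torch.full((n, n), float("-inf"), device=x.device), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + a)
        return self.ln2(x + self.ff(x))

class MiniGPT(nn.Module):
    def __init__(self, vocab=40000, max_len=512, dim=768, n_blocks=12):
        super().__init__()
        self.we = nn.Embedding(vocab, dim)    # word Embedding (We)
        self.wp = nn.Embedding(max_len, dim)  # position Embedding (Wp)
        self.blocks = nn.ModuleList(DecoderBlock(dim) for _ in range(n_blocks))

    def forward(self, ids):                   # ids: (batch, seq_len)
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.we(ids) + self.wp(pos)       # input representation h0
        for block in self.blocks:             # pass through the 12 Decoder blocks in turn
            h = block(h)
        return h @ self.we.weight.T           # logits over the vocabulary for the next word

logits = MiniGPT()(torch.randint(0, 40000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 40000])
```

The forward pass mirrors the formulas in the next section: an Embedding sum, a chain of Decoder blocks, and a projection back to the vocabulary using the word Embedding matrix.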

3. GPT training process

The GPT training process is divided into two parts: unsupervised pre-training of the language model, and supervised fine-tuning on downstream tasks.

3.1 Pre-trained language model

Given a sentence U = [u1, u2, ..., un], GPT maximizes the following likelihood function when training the language model.
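Following the original paper, where k is the size of the context window and Θ denotes the model parameters:

```latex
L_1(U) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```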

It can be seen that GPT is a unidirectional model. The input of GPT is represented as h0, which is calculated as follows.
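In the paper's notation, where U is the matrix of (one-hot) token vectors for the input sentence:

```latex
h_0 = U W_e + W_p
```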

We is the word Embedding matrix and Wp is the position Embedding matrix. voc denotes the vocabulary size, pos the maximum sentence length, and dim the Embedding dimension, so Wp is a pos × dim matrix and We is a voc × dim matrix.
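A small NumPy sketch of these shapes (the sizes and token ids below are arbitrary placeholders):

```python
import numpy as np

voc, pos, dim = 40000, 512, 768           # illustrative sizes, not GPT's exact values
We = np.random.randn(voc, dim)            # word Embedding matrix, voc x dim
Wp = np.random.randn(pos, dim)            # position Embedding matrix, pos x dim

token_ids = np.array([17, 4052, 391])     # a 3-word input sentence (dummy ids)
h0 = We[token_ids] + Wp[:len(token_ids)]  # each word gets its word Embedding plus position Embedding
print(h0.shape)                           # (3, 768)
```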

After obtaining the input h0, it is passed through all of GPT's Transformer Decoder blocks in turn, finally producing hn.
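Writing each Decoder block as transformer_block, with n blocks in total (n = 12 for GPT):

```latex
h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall\, l \in [1, n]
```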

hn is then used to predict the probability distribution of the next word.
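The output distribution reuses the word Embedding matrix We as the projection back to the vocabulary:

```latex
P(u) = \mathrm{softmax}\left(h_n W_e^{\top}\right)
```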

3.2 Fine-tuning on downstream tasks

After pre-training, GPT fine-tunes the model for specific downstream tasks. Fine-tuning uses supervised learning: each training sample consists of a word sequence [x1, x2, ..., xm] and a class label y. During fine-tuning, GPT predicts the label y from the word sequence [x1, x2, ..., xm].

Wy denotes the parameter matrix used to predict the output. The following functions need to be maximized during fine-tuning.
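Following the paper, with h_l^m denoting the final Decoder block's output at the last position m:

```latex
P\left(y \mid x^1, \ldots, x^m\right) = \mathrm{softmax}\left(h_l^m W_y\right)

L_2(C) = \sum_{(x,\, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
```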

GPT also keeps the pre-training loss function during fine-tuning, so the final objective to be optimized is:
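Following the paper, where λ weights the auxiliary language-model loss L1 against the fine-tuning loss L2:

```latex
L_3(C) = L_2(C) + \lambda \cdot L_1(C)
```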

4. GPT summary

In pre-training, GPT uses the preceding words to predict the next word, while BERT predicts a word from the context on both sides, so GPT performs worse than BERT on many NLU tasks. However, GPT is better suited to text generation tasks, which typically produce the next word from the information already available.

I recommend reading Hugging Face's GitHub code, which includes many Transformer-based models, including RoBERTa and ALBERT.

5. References

  • Improving Language Understanding by Generative Pre-Training
  • Language Models are Unsupervised Multitask Learners
  • The Illustrated GPT-2