GPT2-Chinese is a Chinese GPT2 training codebase that you are free to play around with at your leisure, and the results are really interesting. This post records its installation and usage, for when I have forgotten them later.

Start by installing Python 3.7

Versions 3.5-3.8 should work, but I used 3.7 with PyCharm to minimize errors

Create the project directory + git clone

  1. Create the GPT2Chinese folder on drive F

  2. Pull the source code from github.com/Morizeyao/G…

    Enter F:\GPT2Chinese, then Shift + right-click to open PowerShell

```shell
# Pull the code
git clone https://github.com/Morizeyao/GPT2-Chinese.git

# Create and activate a virtual environment (verify that Python 3.5-3.8 is used)
python -m venv venv
venv\Scripts\activate

# Install the dependencies
pip install -r requirements.txt

# If torch reports "could not be found", install the CPU version separately:
pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
```

Problems encountered during pip install

Several errors were encountered at this step, including a failure to import get_config during an otherwise successful run, which is related. The transformers, TensorFlow, and Keras versions must be pinned, and the Keras version must match the TensorFlow version; otherwise you will hit the `from tensorflow.python.eager.context import get_config` import error.

```shell
pip install keras==2.3.1
pip install tensorflow==2.0.1
# re-pin keras in case the tensorflow install changed it
pip install keras==2.3.1
```
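To confirm the pins actually took effect, a small check like the following can help. This is a minimal sketch using the standard-library `importlib.metadata`; the package names and versions are simply the ones pinned above:

```python
# Sketch: report the installed version of each pinned package, or None if missing.
from importlib import metadata

def check_pins(pins):
    """Map package name -> (installed version or None, pinned version)."""
    results = {}
    for pkg, wanted in pins.items():
        try:
            results[pkg] = (metadata.version(pkg), wanted)
        except metadata.PackageNotFoundError:
            results[pkg] = (None, wanted)
    return results

# The versions pinned above; a mismatch here explains the get_config import error.
report = check_pins({"keras": "2.3.1", "tensorflow": "2.0.1"})
```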

After the installation, the project structure is as follows; the data and dist directories were created by me

Purpose of each directory

  • cache holds the vocabularies: vocab.txt is the original BERT vocabulary, vocab_all.txt additionally includes archaic-Chinese characters, and vocab_small.txt is a reduced vocabulary. make_vocab.py is a script that assists in building a vocabulary from a train.json corpus file.

  • config stores the parameter configuration files

  • scripts contains sample training and generation scripts

  • dist is self-created, to hold the generated text

  • data is self-created, used to store the original corpus

  • model is the model directory, where downloaded pytorch_model.bin models are placed

  • generate.py and train.py are the generation and training scripts, respectively.

  • train_single.py is an extension of train.py that can be used with a large list of individual elements (such as training on one whole book, e.g. Battle Through the Heavens).

  • eval.py is used to evaluate the PPL (perplexity) score of the generated model.

  • generate_texts.py is an extension of generate.py that can generate several sentences from a list of starting keywords and output them to a file.

  • train.json is an example of the training-sample format, for reference.

  • The tokenizations folder provides three tokenizers: the default BERT tokenizer, the word-segmented BERT tokenizer, and the BPE tokenizer.
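To give a feel for what the vocabulary-building step does, here is a minimal, hypothetical sketch. It is not the repo's actual make_vocab.py; it assumes train.json is a JSON list of article strings, as in the sample train.json:

```python
# Hypothetical sketch of building a character-level vocabulary from a train.json
# corpus (assumed to be a JSON list of article strings); the repo's real
# make_vocab.py may work differently.
import json
from collections import Counter

def build_vocab(articles, max_size=10000):
    """Count characters across all articles and keep the most frequent ones."""
    counts = Counter(ch for article in articles for ch in article)
    # BERT-style special tokens are reserved at the front of the vocabulary.
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    chars = [ch for ch, _ in counts.most_common(max_size - len(specials))]
    return specials + chars

articles = json.loads('["你好世界", "你好你好"]')  # tiny stand-in corpus
vocab = build_vocab(articles)
```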

Download the model

I used three models

  1. General Chinese model: pan.baidu.com/s/16x0hfBCe…

    After downloading, unzip it into the model/tongyong folder in the project directory; it contains three files: pytorch_model.bin, vocab.txt, and config.json

  2. Prose model: pan.baidu.com/s/1nbrW5iw3…

    After downloading, unzip it into the model/sanwen folder; it should likewise contain pytorch_model.bin, vocab.txt, and config.json. If the last two files are missing, copy them directly from model/tongyong

Model placement figure

Issues that need attention

  1. Use the package versions mentioned above
  2. Save all txt files and config.json files with UTF-8 encoding, otherwise you may encounter various encoding errors
  3. If GBK-decoding and similar errors are still reported during generation and training, open the corresponding .py file and add the parameter encoding="utf-8" to the open() call
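Item 3 in miniature (a generic illustration; the file name here is made up):

```python
# Demonstrates passing encoding="utf-8" to open(); without it, Windows falls
# back to the locale codec (often GBK) and a UTF-8 Chinese corpus fails to decode.
import os
import tempfile

text = "难道我一生就这样度过"                          # sample Chinese text
path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file

with open(path, "w", encoding="utf-8") as f:   # explicit encoding on write
    f.write(text)

with open(path, encoding="utf-8") as f:        # and on read
    loaded = f.read()
```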

Generate articles using the downloaded models

  1. Try the prose model first; personally it is my favorite
```shell
python ./generate.py
  --length=500              # length of the generated text
  --nsamples=2              # number of samples
  --prefix='Am I going to live my life like this?'   # the rest is left to GPT2
  --model_path model/sanwen                  # model path
  --tokenizer_path model/sanwen/vocab.txt    # model vocabulary
  --model_config model/sanwen/config.json    # model configuration
  --fast_pattern
  --save_samples            # save the generated text; otherwise it is only printed to the console
  --save_samples_path=./dist                 # write the generated text to the dist directory
```
```shell
python ./generate.py --length=500 --nsamples=1 --prefix='Am I going to spend my whole life alone' --tokenizer_path model/sanwen/vocab.txt --model_path model/sanwen --model_config model/sanwen/config.json --fast_pattern --save_samples --save_samples_path=./dist
```
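If you want several prefixes in one go, a simple loop over the same command works. This is a dry-run sketch that only echoes each invocation; drop the `echo` to actually run them (and note that generate_texts.py above also covers this use case):

```shell
# Dry-run: print one generate.py invocation per starting prefix.
# Remove "echo" to execute for real (assumes the model/sanwen files exist).
for prefix in "prefix one" "prefix two"; do
  echo python ./generate.py --length=500 --nsamples=1 \
    --prefix="$prefix" \
    --model_path model/sanwen \
    --tokenizer_path model/sanwen/vocab.txt \
    --model_config model/sanwen/config.json \
    --fast_pattern --save_samples --save_samples_path=./dist
done
```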

Generated effect

Prose model generates effect

General Chinese model effect

Open source projects used

GPT2-Chinese