GPT2-Chinese is a Chinese GPT2 training codebase that you are free to play around with at your leisure, and the results are really interesting. This post records its installation and usage, for when I have forgotten them later.

Start by installing Python 3.7

Versions 3.5-3.8 should work, but I used 3.7 with PyCharm to minimize errors

Create the project directory + git clone

  1. Create the GPT2Chinese folder on drive F

  2. Pull the source code from github.com/Morizeyao/G…

    Enter F:\GPT2Chinese, then Shift + right-click to open PowerShell

```shell
# Pull the code
git clone https://github.com/Morizeyao/GPT2-Chinese.git

# Create and activate a virtual environment (verify that Python 3.5-3.8 is used)
python -m venv venv
venv\Scripts\activate

# Install the dependencies
pip install -r requirements.txt

# If torch reports "could not be found", install the CPU version separately:
pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
```

Problems encountered during pip install

Several errors were encountered at this step, including a failure to import get_config during an otherwise successful run, which is related. The transformers, TensorFlow, and Keras versions must be pinned, and the Keras version must match the TensorFlow version; otherwise you will hit the `from tensorflow.python.eager.context import get_config` import error.

```shell
pip install keras==2.3.1
pip install tensorflow==2.0.1
# re-pin keras in case the tensorflow install changed it
pip install keras==2.3.1
```
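To confirm the pins actually took effect, a small check like the following can help. This is a minimal sketch using the standard-library `importlib.metadata`; the package names and versions are simply the ones pinned above:

```python
# Sketch: report the installed version of each pinned package, or None if missing.
from importlib import metadata

def check_pins(pins):
    """Map package name -> (installed version or None, pinned version)."""
    results = {}
    for pkg, wanted in pins.items():
        try:
            results[pkg] = (metadata.version(pkg), wanted)
        except metadata.PackageNotFoundError:
            results[pkg] = (None, wanted)
    return results

# The versions pinned above; a mismatch here explains the get_config import error.
report = check_pins({"keras": "2.3.1", "tensorflow": "2.0.1"})
```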

After the installation, the project structure is as follows; the data and dist directories were created by me

Purpose of each directory

  • cache holds the vocabularies: vocab.txt is the original BERT vocabulary, vocab_all.txt additionally includes archaic-Chinese characters, and vocab_small.txt is a reduced vocabulary. make_vocab.py is a script that assists in building a vocabulary from a train.json corpus file.

  • config stores the parameter configuration files

  • scripts contains sample training and generation scripts

  • dist is self-created, to hold the generated text

  • data is self-created, used to store the original corpus

  • model is the model directory, where downloaded pytorch_model.bin models are placed

  • generate.py and train.py are the generation and training scripts, respectively.

  • train_single.py is an extension of train.py that can be used with a large list of individual elements (such as training on one whole book, e.g. Battle Through the Heavens).

  • eval.py is used to evaluate the PPL (perplexity) score of the generated model.

  • generate_texts.py is an extension of generate.py that can generate several sentences from a list of starting keywords and output them to a file.

  • train.json is an example of the training-sample format, for reference.

  • The tokenizations folder provides three tokenizers: the default BERT tokenizer, the word-segmented BERT tokenizer, and the BPE tokenizer.
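To give a feel for what the vocabulary-building step does, here is a minimal, hypothetical sketch. It is not the repo's actual make_vocab.py; it assumes train.json is a JSON list of article strings, as in the sample train.json:

```python
# Hypothetical sketch of building a character-level vocabulary from a train.json
# corpus (assumed to be a JSON list of article strings); the repo's real
# make_vocab.py may work differently.
import json
from collections import Counter

def build_vocab(articles, max_size=10000):
    """Count characters across all articles and keep the most frequent ones."""
    counts = Counter(ch for article in articles for ch in article)
    # BERT-style special tokens are reserved at the front of the vocabulary.
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    chars = [ch for ch, _ in counts.most_common(max_size - len(specials))]
    return specials + chars

articles = json.loads('["你好世界", "你好你好"]')  # tiny stand-in corpus
vocab = build_vocab(articles)
```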

Download the model

I used three models

  1. General Chinese model: pan.baidu.com/s/16x0hfBCe…

    After downloading, unzip it into the model/tongyong folder in the project directory; it contains three files: pytorch_model.bin, vocab.txt, and config.json

  2. Prose model: pan.baidu.com/s/1nbrW5iw3…

    After downloading, unzip it into the model/sanwen folder; it should likewise contain pytorch_model.bin, vocab.txt, and config.json. If the last two files are missing, copy them directly from model/tongyong

Model placement figure

Issues that need attention

  1. Use the package versions mentioned above
  2. Save all txt files and config.json files with UTF-8 encoding, otherwise you may encounter various encoding errors
  3. If GBK-decoding and similar errors are still reported during generation and training, open the corresponding .py file and add the parameter encoding="utf-8" to the open() call
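Item 3 in miniature (a generic illustration; the file name here is made up):

```python
# Demonstrates passing encoding="utf-8" to open(); without it, Windows falls
# back to the locale codec (often GBK) and a UTF-8 Chinese corpus fails to decode.
import os
import tempfile

text = "难道我一生就这样度过"                          # sample Chinese text
path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file

with open(path, "w", encoding="utf-8") as f:   # explicit encoding on write
    f.write(text)

with open(path, encoding="utf-8") as f:        # and on read
    loaded = f.read()
```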

Generate articles using the downloaded models

  1. Try the prose model first; personally it is my favorite
```shell
python ./generate.py
  --length=500              # length of the generated text
  --nsamples=2              # number of samples
  --prefix='Am I going to live my life like this?'   # the rest is left to GPT2
  --model_path model/sanwen                  # model path
  --tokenizer_path model/sanwen/vocab.txt    # model vocabulary
  --model_config model/sanwen/config.json    # model configuration
  --fast_pattern
  --save_samples            # save the generated text; otherwise it is only printed to the console
  --save_samples_path=./dist                 # write the generated text to the dist directory
```
```shell
python ./generate.py --length=500 --nsamples=1 --prefix='Am I going to spend my whole life alone' --tokenizer_path model/sanwen/vocab.txt --model_path model/sanwen --model_config model/sanwen/config.json --fast_pattern --save_samples --save_samples_path=./dist
```
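If you want several prefixes in one go, a simple loop over the same command works. This is a dry-run sketch that only echoes each invocation; drop the `echo` to actually run them (and note that generate_texts.py above also covers this use case):

```shell
# Dry-run: print one generate.py invocation per starting prefix.
# Remove "echo" to execute for real (assumes the model/sanwen files exist).
for prefix in "prefix one" "prefix two"; do
  echo python ./generate.py --length=500 --nsamples=1 \
    --prefix="$prefix" \
    --model_path model/sanwen \
    --tokenizer_path model/sanwen/vocab.txt \
    --model_config model/sanwen/config.json \
    --fast_pattern --save_samples --save_samples_path=./dist
done
```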

Generated effect

Prose model generates effect

General Chinese model effect

Open source projects used

GPT2-Chinese