I am taking part in the "Spring Festival Creative Submission Contest". For details, see: Spring Festival Creative Submission Contest

The Year of the Tiger

The Year of the Tiger has arrived, so the couplets should be tiger-themed too. Here I am.

Reference: aistudio.baidu.com/aistudio/pr…

In a previous project we looked at how to write a Seq2Seq model with PaddlePaddle (click for quick access), and then tweaked Seq2Seq to expose more tunable parameters (quick access 1, quick access 2). In this project we use that Seq2Seq model to train an acrostic-poem generator, just in time for the Chinese New Year. A sample of what it generates:

New wind rain rain rain, thousands of miles of spring sunset oblique. Spring breeze year after year, spring breeze a cup of wine. Quick happy don't know where to come, spring breeze full of eyes spring breeze. Lele did not know where to come, spring breeze blowing spring breeze. Cow cow head not see human, I do not know where no human language. Spring breeze blows year after year, spring breeze blows down spring breeze. Big adults do not know where people, do not know where not to meet. When ji Ji where do not meet, meet a smile do not meet.

My research focuses on text-processing methods based on complex networks, and I am also committed to exploring ways of combining them with deep learning. I will update my work from time to time. If you share this research direction or are simply interested, please support with a like, favorite, and follow. Come to AI Studio and be a fan; I'm waiting for you.

More project links: A collection of projects for graduate students with no barrier to entry


Second, data processing

In this project, a dataset of classical Chinese poems is used as the training set. The encoder receives the first character (the head) of each line, and the decoder generates the whole line from the encoder's information. To keep consecutive lines coherent, the encoder input also prefixes the head with the lines that came before it. For example:

The couplet "白日依山尽，黄河入海流。欲穷千里目，更上一层楼。" ("The sun sets behind the mountains, the Yellow River flows into the sea; to see a thousand miles further, climb one more storey") yields two samples:

Sample 1: encoder input "白"; decoder input "白日依山尽，黄河入海流。"

Sample 2: encoder input "白日依山尽，黄河入海流。欲"; decoder input "欲穷千里目，更上一层楼。"
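To make this concrete, here is a minimal sketch (plain Python, nothing project-specific) of how the head, prefix, and target of each sample line up for the couplet above; the extraction loop in subsection 2 below does the same thing over the whole corpus.

# Illustration only: build (head, prefix, target) triples for one couplet
example = "白日依山尽，黄河入海流。欲穷千里目，更上一层楼。"
sentences = [s for s in example.split("。") if s]   # split on the Chinese full stop

heads, prefixes, targets = [], [], []
for i, sentence in enumerate(sentences):
    heads.append(sentence[0])                  # first character of the line
    prefixes.append("。".join(sentences[:i]))   # everything that came before
    targets.append(sentence + "。")             # the full line the decoder must produce

for h, p, t in zip(heads, prefixes, targets):
    print("head:{} prefix:{} target:{}".format(h, p, t))
# head:白 prefix: target:白日依山尽，黄河入海流。
# head:欲 prefix:白日依山尽，黄河入海流 target:欲穷千里目，更上一层楼。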

1. Upgrade PaddleNLP

!pip install -U paddlenlp
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting paddlenlp
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/17/9b/4535ccf0e96c302a3066bd2e4d0f44b6b1a73487c6793024475b48466c32/paddlenlp-2.2.3-py3-none-any.whl (1.2 MB)
Requirement already satisfied, skipping upgrade: h5py, colorlog, colorama, seqeval, jieba, multiprocess (and their dependencies) in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages
Installing collected packages: paddlenlp
  Found existing installation: paddlenlp 2.1.1
    Uninstalling paddlenlp-2.1.1:
      Successfully uninstalled paddlenlp-2.1.1
Successfully installed paddlenlp-2.2.3

2. Extract the head characters

import re
poems_file = open("./data/data70759/poems_zh.txt", encoding="utf8")
# For each line of verse read, record the head character and the prefix of each sentence
poems_samples = []
poems_prefix = []
poems_heads = []
for line in poems_file.readlines():
    line_ = re.sub('。', ' ', line)  # replace the Chinese full stop so split() separates the sentences
    line_ = line_.split()
    # Generate training samples
    for i, p in enumerate(line_):
        poems_heads.append(p[0])
        poems_prefix.append('。'.join(line_[:i]))
        poems_samples.append(p + '。')


# Print the first 20 samples to check the result
for i in range(20):
    print("poems heads:{}, poems_prefix: {}, poems:{}".format(poems_heads[i], poems_prefix[i], poems_samples[i]))
(Output: the first 20 head / prefix / poem triples, printed in Chinese.)
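A quick sanity check on the three lists (an optional addition, assuming they were filled by the loop above) before building the vocabulary:

# Optional check: the three lists must stay aligned one-to-one,
# and every head must be the first character of its target line.
assert len(poems_samples) == len(poems_heads) == len(poems_prefix)
assert all(h == s[0] for h, s in zip(poems_heads, poems_samples))
print("number of samples:", len(poems_samples))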

3. Generate word lists

# Use PaddleNLP to build the vocabulary. Since poem lines are short, single characters are used as the token unit.
from paddlenlp.data import Vocab

vocab = Vocab.build_vocab(poems_samples, unk_token="<unk>", pad_token="<pad>", bos_token="<", eos_token=">")
vocab_size = len(vocab)

print("vocab size", vocab_size)
print("word to idx:", vocab.token_to_idx)
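With the vocabulary built, tokens and ids can be converted in both directions through the `token_to_idx` and `idx_to_token` mappings used throughout this project; a small round-trip check (illustration only):

# Round-trip the first few characters of a training sample through the vocab.
line = poems_samples[0][:7]
ids = [vocab.token_to_idx[ch] for ch in line]        # characters -> ids
back = "".join(vocab.idx_to_token[i] for i in ids)   # ids -> characters
print(line)
print(ids)
print(back)  # should match the original characters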

4. Define the dataset

# Define a data reader
from paddle.io import Dataset, BatchSampler, DataLoader
import numpy as np

class PoemDataset(Dataset):
    def __init__(self, poems_data, poems_heads, poems_prefix, vocab, encoder_max_len=128, decoder_max_len=32):
        super(PoemDataset, self).__init__()
        self.poems_data = poems_data
        self.poems_heads = poems_heads
        self.poems_prefix = poems_prefix
        self.vocab = vocab
        self.tokenizer = lambda x: [vocab.token_to_idx[x_] for x_ in x]
        self.encoder_max_len = encoder_max_len
        self.decoder_max_len = decoder_max_len

    def __getitem__(self, idx):
        eos_id = vocab.token_to_idx[vocab.eos_token]
        bos_id = vocab.token_to_idx[vocab.bos_token]
        pad_id = vocab.token_to_idx[vocab.pad_token]
        # Make sure the encoder and decoder inputs do not exceed the maximum lengths
        poet = self.poems_data[idx][:self.decoder_max_len - 2]  # -2 contains bos_id and eos_id
        prefix = self.poems_prefix[idx][-(self.encoder_max_len - 3):]  # -3 reserves room for bos_id, eos_id, and the head
        # Encode input and output

        sample = [bos_id] + self.tokenizer(poet) + [eos_id]
        prefix = self.tokenizer(prefix) if prefix else []
        heads = prefix + [bos_id] + self.tokenizer(self.poems_heads[idx]) + [eos_id] 
        sample_len = len(sample)
        heads_len = len(heads)
        sample = sample + [pad_id] * (self.decoder_max_len - sample_len)
        heads = heads + [pad_id] * (self.encoder_max_len - heads_len)
        mask = [1] * (sample_len - 1) + [0] * (self.decoder_max_len - sample_len) # -1 to make equal to out[2]
        out = [np.array(d, "int64") for d in [heads, heads_len, sample, sample, mask]]
        out[2] = out[2][:-1]
        out[3] = out[3][1:, np.newaxis]
        return out

    def shape(self):
        return [([None, self.encoder_max_len], 'int64', 'src'),
                ([None, 1], 'int64', 'src_length'),
                ([None, self.decoder_max_len - 1], 'int64', 'trg')], \
               [([None, self.decoder_max_len - 1, 1], 'int64', 'label'),
                ([None, self.decoder_max_len - 1], 'int64', 'trg_mask')]


    def __len__(self):
        return len(self.poems_data)

dataset = PoemDataset(poems_samples, poems_heads, poems_prefix, vocab)
batch_sampler = BatchSampler(dataset, batch_size=2048)
data_loader = DataLoader(dataset, batch_sampler=batch_sampler)
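As a quick check (not strictly needed), one batch can be drawn from `data_loader` to see the tensors the model will receive, in the order src, src_length, trg, label, trg_mask:

# Peek at one batch and print the shape of each tensor.
batch = next(iter(data_loader))
for name, tensor in zip(["src", "src_length", "trg", "label", "trg_mask"], batch):
    print(name, tensor.shape)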

Third, define the model and train it

1. Model definition

from Seq2Seq.models import Seq2SeqModel
from paddlenlp.metrics import Perplexity
from Seq2Seq.loss import CrossEntropyCriterion
import paddle
from paddle.static import InputSpec

# parameters
lr = 1e-6
max_epoch = 20
models_save_path = "./checkpoints"

encoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4, "dropout": 0.2, "direction": "bidirectional", "mode": "GRU"}
decoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4, "direction": "forward", "dropout": 0.2, "mode": "GRU", "use_attention": True}

# inputs shape and label shape
inputs_shape, labels_shape = dataset.shape()
inputs_list = [InputSpec(input_shape[0], input_shape[1], input_shape[2]) for input_shape in inputs_shape]
labels_list = [InputSpec(label_shape[0], label_shape[1], label_shape[2]) for label_shape in labels_shape]

net = Seq2SeqModel(encoder_attrs, decoder_attrs)
model = paddle.Model(net, inputs_list, labels_list)

model.load("./final_models/model")

opt = paddle.optimizer.Adam(learning_rate=lr, parameters=model.parameters())

model.prepare(opt, CrossEntropyCriterion(), Perplexity())
W0122 21:03:30.616776   166 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0122 21:03:30.620450   166 device_context.cc:465] device: 0, cuDNN Version: 7.6.

2. Model training

# Train, starting from the parameters loaded above (./final_models/model)
model.fit(train_data=data_loader, epochs=max_epoch, eval_freq=1, save_freq=5, save_dir=models_save_path, shuffle=True)

3. Save the model

# save
model.save("./final_models/model")
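Assuming the standard `paddle.Model.save` behaviour, the prefix above expands into a parameter file (`.pdparams`) and an optimizer-state file (`.pdopt`); a quick way to confirm what was written:

import os
# model.pdparams holds the network weights (what inference needs);
# model.pdopt holds the optimizer state (only useful when resuming training).
print(os.listdir("./final_models"))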

Fourth, generating acrostic poems

import warnings

def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
    """ Post-process the decoded sequence. """
    eos_pos = len(seq) - 1
    for i, idx in enumerate(seq):
        if idx == eos_idx:
            eos_pos = i
            break
    seq = [idx for idx in seq[:eos_pos + 1]
           if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)]
    return seq
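A tiny illustration of what `post_process_seq` does, with made-up ids where 1 is `bos` and 2 is `eos`:

# Everything after the first eos is dropped, and bos/eos themselves are
# removed unless output_bos / output_eos is set.
print(post_process_seq([1, 37, 52, 41, 2, 0, 0], bos_idx=1, eos_idx=2))
# -> [37, 52, 41]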

# Define the class used to generate the message
from paddlenlp.data.tokenizer import JiebaTokenizer

class GenPoems():
    # content (STR): The STR to generate poems, like "Gong Xi CAI"
    # vocab: the instance of paddlenlp.data.vocab.Vocab
    # model: the Inference Model
    def __init__(self, vocab, model):
        self.bos_id = vocab.token_to_idx[vocab.bos_token]
        self.eos_id = vocab.token_to_idx[vocab.eos_token]
        self.pad_id = vocab.token_to_idx[vocab.pad_token]
        self.tokenizer = lambda x: [vocab.token_to_idx[x_] for x_ in x]
        self.model = model
        self.vocab = vocab

    def gen(self, content, max_len=128):
        # max_len is the encoder_max_len in Seq2Seq Model.
        out = []
        vocab_list = list(vocab.token_to_idx.keys())
        for w in content:
            if w in vocab_list:
                content = re.sub("([，。])", ' ', content)  # strip punctuation from the input heads
                heads = out[- (max_len - 3):] + [self.bos_id] + self.tokenizer(w) + [self.eos_id]
                len_heads = len(heads)
                heads = heads + [self.pad_id] * (max_len - len_heads)
                x = paddle.to_tensor([heads], dtype="int64")
                len_x = paddle.to_tensor([len_heads], dtype='int64')
                pred = self.model.predict_batch(inputs = [x, len_x])[0]
                out += self._get_results(pred)[0]
            else:
                warnings.warn("{} is not in vocab list, so it is skipped.".format(w))
                pass
        out = ''.join([self.vocab.idx_to_token[id] for id in out])
        return out
    
    def _get_results(self, pred) :
        pred = pred[:, :, np.newaxis] if len(pred.shape) == 2 else pred
        pred = np.transpose(pred, [0, 2, 1])
        outs = []
        for beam in pred[0]:
            id_list = post_process_seq(beam, self.bos_id, self.eos_id)
            outs.append(id_list)
        return outs
# Load the inference model
from Seq2Seq.models import Seq2SeqInferModel
import paddle

encoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4, "dropout": 0.2, "direction": "bidirectional", "mode": "GRU"}
decoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4, "direction": "forward", "dropout": 0.2, "mode": "GRU", "use_attention": True}

infer_model = paddle.Model(Seq2SeqInferModel(encoder_attrs,
                                             decoder_attrs,
                                             bos_id=vocab.token_to_idx[vocab.bos_token],
                                             eos_id=vocab.token_to_idx[vocab.eos_token],
                                             beam_size=10,
                                             max_out_len=256))
infer_model.load("./final_models/model")
# Happy New Year
# Of course, confessions of love are ok
generator = GenPoems(vocab, infer_model)

content = "生龙活虎"  # "alive and kicking"
poet = generator.gen(content)
for line in poet.strip().split('。'):
    try:
        print("{}\t{}。".format(line[0], line))
    except:
        pass
Life is not visible, where not to meet. Long Long tiger do not know where, the world does not see the world. The living are not human things, unconsciously human can not recognize. Tiger tiger leopard meet cannot be found, I do not know where do not know each other.

Fifth, conclusion

This project describes how to train a model that generates acrostic poems. The results show that the model has some ability to produce poems, but given the limited training data and training time there is still plenty of room for improvement. Further optimization of the model is left for future work.