1. Optimization of Chinese address element parsing based on PaddleNLP pre-trained ERNIE model

Intel Innovation Masters Cup Deep Learning Challenge Track 2: CCKS2021 Chinese NLP address elements analysis – Tianchi Competition – Alibaba Cloud Tianchi

1. Question description

The goal of the Chinese address element parsing task is to decompose an address into its component elements (province, city, district, town, road and so on) and label each one, for example:

Input: Building 5, Taobaocheng, No.969, West Wenyi Road, Wuchang Street, Yuhang District, Hangzhou city, Zhejiang Province
Output: Province = Zhejiang Province, City = Hangzhou, District = Yuhang, Town = Wuchang Street, Road = Wenyi West Road
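
The competition data labels every character with a B-/I-/E-/S- prefixed tag or O (see the label table later in this article), so turning a tagged address into named elements is a matter of grouping consecutive tags. The following is a minimal sketch of that grouping, not code from the competition kit; the function name tags_to_elements and the sample address are assumptions made for illustration.

```python
def tags_to_elements(chars, tags):
    """Group per-character BIOES tags into (element_type, text) pairs."""
    elements = []
    buffer, current = [], None
    for ch, tag in zip(chars, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-character element.
            elements.append((etype, ch))
            buffer, current = [], None
        elif prefix == "B":
            # Start of a new multi-character element.
            buffer, current = [ch], etype
        elif prefix in ("I", "E") and current == etype:
            buffer.append(ch)
            if prefix == "E":
                elements.append((etype, "".join(buffer)))
                buffer, current = [], None
        else:
            # 'O' or an inconsistent tag: drop any partial element.
            buffer, current = [], None
    return elements


# Hypothetical sample: a province, a city and a district.
chars = list("浙江省杭州市余杭区")
tags = ["B-prov", "I-prov", "E-prov",
        "B-city", "I-city", "E-city",
        "B-district", "I-district", "E-district"]
print(tags_to_elements(chars, tags))
# [('prov', '浙江省'), ('city', '杭州市'), ('district', '余杭区')]
```

Malformed sequences (for example an I- tag with no preceding B-) are silently dropped here; a production version might prefer to log them.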

2. Data description

The annotated data set consists of a training set, a validation set and a test set, about 20,000 records in total. The addresses were collected from public sources (such as Yellow Pages websites) and labeled through crowdsourcing. Detailed annotation specifications are released along with the data.

3. Introduction to named entity recognition

Named entity recognition (NER) is a fundamental NLP task and an important building block for information extraction, question answering, syntactic analysis and machine translation; its accuracy directly affects those downstream tasks. There are two common approaches to NER. One is LSTM/GRU + CRF: an RNN-type model extracts features from the text, and a CRF (conditional random field) learns the dependencies between token labels. The other predicts the token labels directly with a pre-trained model such as ERNIE or BERT.

This project demonstrates how to use ERNIE, the semantic pre-trained model in PaddleNLP, to extract structured information such as name, phone number, province, city, district and detailed address from express orders. This helps logistics practitioners extract useful information and reduces the cost of filling in forms for customers; the same approach is used here to complete the competition.

RNN-based named entity recognition

Prior to 2017, industry and academia mostly relied on recurrent neural networks (RNNs), a family of sequence models, for NLP text processing.

The project "Waybill information extraction based on BiGRU+CRF" introduces how to use a sequence model to complete the waybill information extraction task.

In recent years, with the development of deep learning, the number of model parameters has grown rapidly, and larger data sets are needed to train them without overfitting. However, for most NLP tasks, especially syntactic and semantic ones, building large-scale annotated data sets is difficult and prohibitively expensive. Large-scale unlabeled corpora, in contrast, are relatively easy to build. To take advantage of this data, a model can first learn a good representation from it and then apply that representation to other tasks.

Many studies have shown that pre-trained models (PTMs) trained on large unlabeled corpora learn universal language representations that benefit downstream NLP tasks and avoid training models from scratch. With growing computing power, the emergence of deep models (e.g. the Transformer) and improved training techniques, PTMs have evolved from shallow to deep.

This example shows how to fine-tune a pre-trained model, represented by ERNIE (Enhanced Representation through Knowledge Integration), to complete a sequence labeling task.

3. Data analysis

1. PaddleNLP environment preparation

!pip install --upgrade paddlenlp

from functools import partial

import paddle
from paddlenlp.datasets import MapDataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import ErnieTokenizer, ErnieForTokenClassification
from paddlenlp.metrics import ChunkEvaluator
from utils import convert_example, evaluate, predict, load_dict

2. Data preparation

!unzip "data/data94613/'Intel innovation masters cup' deep learning challenge track 2: CCKS2021中文NLP address element analysis.zip"

# The extracted names below appear garbled because the Chinese file names in the archive were decoded with the wrong encoding.
!mv 'б ░ ╙ kind guide ╠ ╪ ┤ ╢) ┤ ╨ ┬ ┤ є ╩ ж ▒ н б ▒ ╔ ю ╢ ╚ ╤ з ╧ ░ ╠ Ї ╒ ╜ ╚ № ╚ № ╡ └ 2 f ║ CCKS2021 ╓ ╨ ╬ ─ NLP ╡ ╪ ╓ ╖ ╥ seem ╦ ╪ ╜ т ╬ Ў' dataset
!mv 'dataset/╓ ╨ ╬ ─ ╡ ╪ ╓ ╖ ╥ seem ╦ ╪ ╜ т ╬ Ў ▒ are ╫ kind guide ╣ ц ╖ ╢.pdf' 'dataset/Annotation specification for Chinese address elements parsing.pdf'

3. Data viewing

!head -n10 dataset/train.conll

浙 B-prov
江 E-prov
杭 B-city
州 I-city
市 E-city
江 B-district
干 I-district
区 E-district
九 B-town
堡 I-town

!head -n10 dataset/dev.conll

杭 B-city
州 E-city
五 B-poi
洲 I-poi
国 I-poi
际 E-poi

浙 B-prov
江 I-prov
省 E-prov

!head dataset/final_test.txt

1 1000-0, Xiaoguan Beili, Chaoyang District
2 00, Huixin East Street, Chaoyang District
3, Southeast Corner of nanmofang Road and West Dawang Road, Chaoyang District
4, Panjiayuan Nanli, Chaoyang District
5, Xiangjun Nanli 2nd Lane, Chaoyang District near 0
6, Multiple business outlets in Chaoyang District
7, Multiple business outlets in Chaoyang District
8, Multiple business outlets in Chaoyang District
Floor 0, Shangfang Building, no. 00, Beisanhuan Middle Road, Chaoyang District

4. Data format adjustment

import os

def format_data(source_filename, target_filename):
    # Convert CoNLL data (one "character label" pair per line, blank line
    # between samples) to one sample per line: characters and labels are each
    # joined by '\002', and the two fields are separated by a tab.
    datalist = []
    with open(source_filename, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    words = ''
    labels = ''
    flag = 0
    for line in lines:
        if line == '\n':
            # A blank line ends the current sample.
            item = words + '\t' + labels + '\n'
            # print(item)
            datalist.append(item)
            words = ''
            labels = ''
            flag = 0
            continue
        word, label = line.strip('\n').split(' ')
        if flag == 1:
            words = words + '\002' + word
            labels = labels + '\002' + label
        else:
            # First character of a sample: no separator in front.
            words = words + word
            labels = labels + label
            flag = 1
    with open(target_filename, 'w', encoding='utf-8') as f:
        f.writelines(datalist)
    print(f'{source_filename} file format conversion is complete, saved as {target_filename}')

format_data('dataset/dev.conll', 'dataset/dev.txt')
format_data(r'dataset/train.conll', r'dataset/train.txt')

dataset/dev.conll file format conversion is complete, saved as dataset/dev.txt
dataset/train.conll file format conversion is complete, saved as dataset/train.txt

!head dataset/dev.txt

Boka Garment, No. 0, Boka Road, Qiosi Street, Yuhang city, Zhejiang Province, China B-provI-provE-provB-cityI-cityE-cityB-districtE-districtB-townI-townI-townE-townB-roadI-roadE-roadB-roadnoE-roadnoB-poiI Building 00, Jiyang Bayi New Village, Zhuji City, Zhejiang Province B-provE-provB-districtI-districtE-districtB- towne-town B-poiI- poiE- housenoI- Housenoe-Houseno, 9 / F, Block A, Hangzhou Mansion Mall, Wulin Plaza, Hangzhou city B-cityI-cityE-cityB-poiI-poiI-poiE-poiB-subpoiI-subpoiI-subpoiE-subpoiB-subpoiE-subpoiB-housenoE-housenoB-floornoE-floor Time Electronic Market, 0000 Dengyun Road, Gongshu District, Hangzhou City, Zhejiang Province B-provI-provE-provB-cityI-cityE-cityB-districtI-districtE-districtB-roadI-roadE-roadB-roadnoI-roadnoI-roadnoI-roadnoE-ro Building 00, Lianfeng Gongyu, Zonghan Street, Cixi City, Ningbo City, Zhejiang Province B-provI-provE-provB-cityI-cityE-cityB-districtI-districtE-districtB-townI-townI-townE-townB-poiI-poiI-poiE-poiB-housenoI Louyi Network Technology Co., LTD., Lucheng District Labor Market Cross-border E-commerce Park, Wenzhou city, Zhejiang Province B-provI-provE-provB-cityI-cityE-cityB-districtI-districtE-districtB-poiI-poiI-poiE-poiB-devzoneI-devzoneI-devzoneI-devzo NeE - devzoneB - floornoI - floornoE - floornoB - subpoiI - subpoiI - subpoiI - subpoiI - subpoiI - subpoiI - subpoiE - subpoi kang road # 00 00 0 building cannes industrial park B-roadI-roadE-roadB-roadnoI-roadnoE-roadnoB-devzoneI-devzoneI-devzoneI-devzoneE-devzoneB-housenoI-housenoE-housenoB-floo Rnoe-floorno, Lantian Road, West Industrial Zone, Yongkang city, Jinhua B-cityE-cityB-districtI-districtE-districtB-devzoneI-devzoneI-devzoneI-devzoneE-devzoneB-roadI-roadE-roadB-poiI-poiI-poi E-poi Tissue Factory, Rear Building, 0000 Renmin Road, Yishan B-townE-townB-roadI-roadE-roadB-roadnoI-roadnoI-roadnoI-roadnoE-roadnoB-housenoE-housenoB-poiI-poiE-poiCopy the code

5. Load the custom data set

Wrapping the data in MapDataset() is the recommended way to build a custom dataset.

def load_dataset(datafiles):
    def read(data_path):
        with open(data_path, 'r', encoding='utf-8') as fp:
            # NOTE: this skips the first line of the file. The files written by
            # format_data() have no header, so the first sample is dropped.
            next(fp)  # Skip header
            for line in fp.readlines():
                words, labels = line.strip('\n').split('\t')
                words = words.split('\002')
                labels = labels.split('\002')
                yield words, labels

    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

# Create dataset, tokenizer and dataloader.
train_ds, dev_ds = load_dataset(datafiles=(
        './dataset/train.txt', './dataset/dev.txt'))

for i in range(5):
    print(train_ds[i])

([' zhejiang ', 'jiang', 'province', 'temperature', 'state', ', ', 'flat' and 'male', 'county', 'sea', 'the west', 'town', 'song', 'port' and 'male', 'garden', 'south', 'road' and '0', '0', '0', '0', '#'), ['B-prov', 'I-prov', 'E-prov', 'B-city', 'I-city', 'E-city', 'B-district', 'I-district', 'E-district', 'B-town', 'I-town', 'E-town', 'B-poi', 'I-poi', 'I-poi', 'E-poi', 'B-road', 'E-road', 'B-roadno', 'I-roadno', 'I-roadno', 'I - roadno', 'E - roadno]] ([' zhejiang', 'jiang', 'province', 'yu', 'yao', ', ', 'die', ' 'and' city ', 'gold', 'type', 'road' and '0', '0', '0', 'no.', '_', 'sample', 'sample', 'red' and '0', 'A', 'A', 'print'], [' B - prov ', 'I - prov', 'E - prov', 'B - district', 'I - district', 'E - district', 'B - poi, 'I-poi', 'E-poi', 'B-road', 'I-road', 'E-road', 'B-roadno', 'I-roadno', 'I-roadno', 'E-roadno', 'O', 'B-subpoi', 'I - subpoi', 'I - subpoi', 'I - subpoi', 'I - subpoi', 'I - subpoi', 'E - subpoi]] ([' zhejiang', 'river', 'province', 'hang', 'state', ', ', 'jiang', 'dry', 'area', 'white', 'Yang', 'street', 'said', 'the', 'sand', 'a', 'send', 'area', 'the', 'he', 'jiang', 'bank', 'spend', 'garden', 'completed', 'scene', 'bay', '0', '0', 'building'], [' B - prov ', 'I-prov', 'E-prov', 'B-city', 'I-city', 'E-city', 'B-district', 'I-district', 'E-district', 'B-town', 'I-town', 'I-town', 'E-town', 'B-devzone', 'I-devzone', 'I-devzone', 'I-devzone', 'E-devzone', 'B-poi', 'I-poi', 'I-poi', 'I - poi', 'I - poi', 'E poi -', 'B - subpoi', 'I - subpoi', 'E - subpoi', 'B - houseno', 'I - houseno', 'E - houseno]] ([' autumn', 'ling', 'road', 'zhejiang', 'river', 'blue', 'stream', 'gold', 'stand', 'to', 'box' and 'industry', 'a', 'limited' and 'male', 'department'], [' B - road ', 'I - road', 'E - road', 'B - poi', 'I - poi, 'I - poi', 'I - poi', 'I - poi', 'I - poi', 'I - poi', 'I - poi', 'I - poi', 'I - poi', 'I - poi', 'I - poi', 'E poi -']) ([' south ', 'lake', 'area', 'in', 'ring', 'south', 'road' and 'and', 'spend', 'garden' and 'way', '/', 'x', 
'mouth', 'fine', 'and', ', ', 'city', 'home' and 'rules',' cross 'and' building ', 'a', 'tube', 'cut', 'committee' and 'part', 'will'], [' B - district ', 'I - district', 'E - district', 'B - road', 'I - road', 'I - road', 'E - road', 'O', 'B - road', 'I - road, 'E-road', 'B-intersection', 'I-intersection', 'E-intersection', 'B-city', 'I-city', 'E-city', 'B-poi', 'I-poi', 'I-poi', 'I-poi', 'I-poi', 'I-poi', 'I-poi', 'I-poi', 'I-poi', 'I-poi', 'E-poi'])

6. Building the label table

Each sample contains a piece of text and a label for every Chinese character and digit in it. For the meaning of each label, see the annotation specification PDF for Chinese address element parsing.

The input sentences then need further processing, such as tokenization and mapping tokens to vocabulary IDs.

def gernate_dic(source_filename1, source_filename2, target_filename):
    # Collect every distinct label, in order of first appearance, from both files.
    data_list = []

    for source_filename in (source_filename1, source_filename2):
        with open(source_filename, 'r', encoding='utf-8') as f:
            lines = f.readlines()

        for line in lines:
            if line != '\n':
                # The label is the last space-separated field on the line.
                dic = line.strip('\n').split(' ')[-1]
                if dic + '\n' not in data_list:
                    data_list.append(dic + '\n')

    with open(target_filename, 'w', encoding='utf-8') as f:
        f.writelines(data_list)

# Generate the label dictionary from the train and dev files
gernate_dic('dataset/train.conll', 'dataset/dev.conll', 'dataset/mytag.dic')
# gernate_dic('dataset/dev.conll', 'dataset/mytag_dev.dic')

# View the generated dictionary file
!cat dataset/mytag.dic

B-prov
E-prov
B-city
I-city
E-city
B-district
I-district
E-district
B-town
I-town
E-town
B-community
I-community
E-community
B-poi
E-poi
I-prov
I-poi
B-road
E-road
B-roadno
I-roadno
E-roadno
I-road
O
B-subpoi
I-subpoi
E-subpoi
B-devzone
I-devzone
E-devzone
B-houseno
I-houseno
E-houseno
B-intersection
I-intersection
E-intersection
B-assist
I-assist
E-assist
B-cellno
I-cellno
E-cellno
B-floorno
E-floorno
S-assist
I-floorno
B-distance
I-distance
E-distance
B-village_group
E-village_group
I-village_group
S-poi
S-intersection
S-district
S-community
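
The label list follows a BIOES scheme: B-, I- and E- mark the first, inner and last character of a multi-character element, S- marks a single-character element, and O marks characters outside any element. As an illustration only, the sketch below produces such tags from character spans; the helper name spans_to_bioes and the (start, end, type) span format are assumptions, not part of the competition data format.

```python
def spans_to_bioes(length, spans):
    """Build a BIOES tag list from (start, end, element_type) spans.

    `end` is exclusive; characters not covered by any span get 'O'.
    """
    tags = ["O"] * length
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags


# A 7-character address: 3-char province, 2-char city, 1 untagged char, 1-char poi.
print(spans_to_bioes(7, [(0, 3, "prov"), (3, 5, "city"), (6, 7, "poi")]))
# ['B-prov', 'I-prov', 'E-prov', 'B-city', 'E-city', 'O', 'S-poi']
```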

7. Data processing

The pre-trained ERNIE model processes Chinese text character by character. PaddleNLP has a built-in tokenizer for each pre-trained model; specifying the model name is enough to load the matching one.

The tokenizer converts the raw input text into the input format the model accepts.

label_vocab = load_dict('./dataset/mytag.dic')
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

trans_func = partial(convert_example, tokenizer=tokenizer, label_vocab=label_vocab)

train_ds.map(trans_func)
dev_ds.map(trans_func)
print(train_ds[0])

[2021-06-28 13:26:34,755] [INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/vocab.txt
100%|██████████| 90/90 [00:00<00:00, 4654.25it/s]
([1, 1382, 409, 244, 565, 404, 99, 157, 507, 308, 233, 213, 484, 945, 3074, 53, 509, 219, 216, 540, 540, 540, 540, 500, 2], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 25, [24, 0, 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 17, 17, 15, 18, 19, 20, 21, 21, 21, 22, 24])
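
Each converted sample is a tuple of (input_ids, token_type_ids, seq_len, label ids). The helpers load_dict and convert_example come from a utils module that this article does not show, so the following is only a sketch of how the label ids could be produced, assuming the label id is the line index in mytag.dic and that an 'O' label is added for the [CLS] and [SEP] positions. The names build_label_vocab and encode_labels are hypothetical.

```python
# First 25 entries of dataset/mytag.dic, in file order (index = label id).
LABELS = """B-prov E-prov B-city I-city E-city B-district I-district E-district
B-town I-town E-town B-community I-community E-community B-poi E-poi I-prov I-poi
B-road E-road B-roadno I-roadno E-roadno I-road O""".split()


def build_label_vocab(labels):
    """Map each label to the position of its first appearance."""
    vocab = {}
    for label in labels:
        if label not in vocab:
            vocab[label] = len(vocab)
    return vocab


def encode_labels(labels, label_vocab, no_entity="O"):
    """Add an 'O' label for [CLS] and [SEP], then map labels to ids."""
    padded = [no_entity] + list(labels) + [no_entity]
    return [label_vocab[label] for label in padded]


label_vocab = build_label_vocab(LABELS)
# Labels of the first training sample printed earlier (23 characters).
sample = ("B-prov I-prov E-prov B-city I-city E-city B-district I-district "
          "E-district B-town I-town E-town B-poi I-poi I-poi E-poi B-road "
          "E-road B-roadno I-roadno I-roadno I-roadno E-roadno").split()
print(encode_labels(sample, label_vocab))
# [24, 0, 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 17, 17, 15, 18, 19, 20, 21, 21, 21, 22, 24]
```

The 23 character labels become 25 ids, which agrees with the seq_len of 25 in the printed sample.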

Data loading

Use the paddle.io.DataLoader interface to load the data in batches; it supports asynchronous loading with multiple workers.

ignore_label = -1
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(),  # seq_len
    Pad(axis=0, pad_val=ignore_label)  # labels
): fn(samples)

train_loader = paddle.io.DataLoader(
    dataset=train_ds,
    batch_size=300,
    return_list=True,
    collate_fn=batchify_fn)
dev_loader = paddle.io.DataLoader(
    dataset=dev_ds,
    batch_size=300,
    return_list=True,
    collate_fn=batchify_fn)
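
The batchify function pads every field of a batch to the length of its longest sample: input ids and token type ids with the tokenizer's pad values, labels with ignore_label so that padded positions do not contribute to the loss. The pure-Python sketch below mimics that behaviour to make it concrete; it is not the PaddleNLP implementation, and the default pad values of 0 are assumptions.

```python
def pad_batch(sequences, pad_val):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def collate(samples, pad_token_id=0, pad_token_type_id=0, ignore_label=-1):
    """Turn a list of (input_ids, token_type_ids, seq_len, labels) into batch fields."""
    input_ids, token_type_ids, seq_lens, labels = zip(*samples)
    return (
        pad_batch(input_ids, pad_token_id),
        pad_batch(token_type_ids, pad_token_type_id),
        list(seq_lens),                    # stacked as-is, no padding needed
        pad_batch(labels, ignore_label),   # padded labels are ignored by the loss
    )


batch = collate([
    ([1, 5, 2], [0, 0, 0], 3, [24, 0, 24]),
    ([1, 5, 6, 7, 2], [0, 0, 0, 0, 0], 5, [24, 0, 16, 1, 24]),
])
print(batch[0])  # [[1, 5, 2, 0, 0], [1, 5, 6, 7, 2]]
print(batch[3])  # [[24, 0, 24, -1, -1], [24, 0, 16, 1, 24]]
```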

4. Loading a pre-trained model with PaddleNLP in one step

1. Load the pre-trained model

Express order information extraction is essentially a sequence labeling task, and PaddleNLP provides built-in fine-tune networks for such downstream tasks on top of its pre-trained models. The following uses ERNIE as the pre-trained model to complete the sequence labeling task.

paddlenlp.transformers.ErnieForTokenClassification() loads, in one line of code, the ERNIE fine-tune network for sequence labeling. It appends a fully connected layer to the ERNIE model for classification.

The paddlenlp.transformers.ErnieForTokenClassification.from_pretrained() method only needs the name of the model to use and the number of label classes to define the model network.

# Define the model network and its loss
model = ErnieForTokenClassification.from_pretrained("ernie-1.0", num_classes=len(label_vocab))

[2021-06-28 13:26:34,864] [INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-1.0
[2021-06-28 13:26:34,866] [INFO] - Downloading ernie_v1_chn_base.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie/ernie_v1_chn_base.pdparams
100%|██████████| 392507/392507 [00:08<00:00, 48559.94it/s]
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))

PaddleNLP supports not only ERNIE but also BERT, RoBERTa, ELECTRA and other pre-trained models, which can be used for tasks such as text classification, sequence labeling and question answering. It also provides pre-trained weights for many models, including more than 20 Chinese language models: bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, gpt2-base-cn, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, chinese-xlnet-base, chinese-xlnet-mid, chinese-xlnet-large, unified_transformer-12L-cn, unified_transformer-12L-cn-luge, etc.

For more pre-trained models, refer to the PaddleNLP Transformer API.

For more examples of fine-tuning pre-trained models on downstream tasks, refer to the examples.

2. Set up the fine-tune optimization strategy and model configuration

Transformer models such as ERNIE and BERT are usually fine-tuned with a dynamic learning rate that includes warmup. The code in this section keeps things simple and passes a constant learning rate of 2e-5 to AdamW.

metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True)
loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label)
optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters())
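
For reference, a warmup schedule of the kind mentioned in this section raises the learning rate linearly from zero over the first steps and then decays it linearly back to zero. The sketch below shows the arithmetic in plain Python; the function name linear_warmup_decay and the 10% warmup ratio are assumptions chosen for illustration, and the training code in this article does not use it. PaddleNLP provides a ready-made scheduler for this purpose (LinearDecayWithWarmup in paddlenlp.transformers, to the best of my knowledge; check the API documentation before relying on its exact signature).

```python
def linear_warmup_decay(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warmup phase: scale up from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale down from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total))
```

With 1000 total steps the rate peaks at step 100 and is back to half its peak at step 550.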

5. Model training and evaluation

1. Training the model

Model training usually includes the following steps:

Take a batch of data from the DataLoader.
Feed the batch to the model for the forward pass.
Pass the forward results to the loss function to compute the loss, and to the evaluation method to compute the metrics.
Back-propagate the loss and update the parameters. Repeat the steps above.

After each epoch, the program evaluates the current model.

step = 0
for epoch in range(50):
    for idx, (input_ids, token_type_ids, length, labels) in enumerate(train_loader):
        logits = model(input_ids, token_type_ids)
        loss = paddle.mean(loss_fn(logits, labels))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        step += 1
        print("epoch:%d - step:%d - loss: %f" % (epoch, step, loss))
    # Evaluate on the dev set and save a checkpoint after every epoch.
    evaluate(model, metric, dev_loader)

    paddle.save(model.state_dict(),
                './checkpoint/model_%d.pdparams' % step)

epoch:49 - step:1832 - loss: 0.057792
epoch:49 - step:1833 - loss: 0.053191
epoch:49 - step:1834 - loss: 0.051053
epoch:49 - step:1835 - loss: 0.054221
epoch:49 - step:1836 - loss: 0.036712
epoch:49 - step:1837 - loss: 0.038394
epoch:49 - step:1838 - loss: 0.045484
epoch:49 - step:1839 - loss: 0.068006
epoch:49 - step:1840 - loss:
epoch:49 - step:1841 - loss: 0.049253
epoch:49 - step:1842 - loss: 0.049330
epoch:49 - step:1843 - loss:
epoch:49 - step:1844 - loss: 0.041376
epoch:49 - step:1846 - loss: 0.051696
epoch:49 - step:1844 - loss: 0.042183
epoch:49 - step:1845 - loss: 0.041376
epoch:49 - step:1846 - loss: 0.040038
epoch:49 - step:1847 - loss: 0.046694
epoch:49 - step:1848 - loss: 0.043038
epoch:49 - step:1849 - loss: 0.046348
epoch:49 - step:1850 - loss: 0.007658
eval precision: 0.997797 - recall: 0.998420 - f1: 0.998109
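
The precision, recall and F1 reported by the evaluation are chunk-level: a predicted element counts as correct only if its type and both boundaries match the gold annotation. The sketch below illustrates that computation for BIOES tags in plain Python. It is not the ChunkEvaluator implementation, whose handling of malformed sequences may differ, and the names extract_chunks and chunk_f1 are hypothetical.

```python
def extract_chunks(tags):
    """Return the set of (type, start, end) chunks in a BIOES tag sequence."""
    chunks, start, current = set(), None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            chunks.add((etype, i, i))
            start, current = None, None
        elif prefix == "B":
            start, current = i, etype
        elif prefix == "E" and current == etype:
            chunks.add((etype, start, i))
            start, current = None, None
        elif prefix == "I" and current == etype:
            continue  # still inside the current chunk
        else:
            start, current = None, None  # 'O' or inconsistent tag
    return chunks


def chunk_f1(gold_seqs, pred_seqs):
    """Micro-averaged chunk precision, recall and F1 over all sequences."""
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_chunks(gold), extract_chunks(pred)
        n_gold += len(g)
        n_pred += len(p)
        n_correct += len(g & p)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-prov", "E-prov", "B-city", "I-city", "E-city"]]
pred = [["B-prov", "E-prov", "B-city", "E-city", "O"]]
print(chunk_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In the example the province is matched exactly, while the predicted city ends one character early and therefore counts as wrong.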

2. Save the model

!mkdir ernie_result
model.save_pretrained('./ernie_result')
tokenizer.save_pretrained('./ernie_result')

6. Prediction

The trained and saved model can now be used for prediction on custom data by calling a predict function, as shown in the following example code.

import numpy as np
import paddle
from paddle.io import DataLoader
import paddlenlp as ppnlp
from paddlenlp.datasets import load_dataset
from paddlenlp.data import Stack, Tuple, Pad, Dict
from paddlenlp.datasets import MapDataset
from paddlenlp.transformers import ErnieTokenizer, ErnieForTokenClassification
from paddlenlp.metrics import ChunkEvaluator
from utils import convert_example, evaluate, predict, load_dict
from functools import partial

!head -n20 dataset/final_test.txt

1 1000-0, Xiaoguan Beili, Chaoyang District 2 00, Huixin East Street, Chaoyang District 3, Southeast Corner of nanmofang Road and West Dawang Road, Chaoyang District 4, Panjiayuan Nanli, Chaoyang District 5, Xiangjun Nanli 2nd Lane, Chaoyang District near 0 6, Multiple business outlets in Chaoyang District 7, Multiple business outlets in Chaoyang District 8, Multiple business outlets in Chaoyang District Nine north third ring road chaoyang district commercial housing building, 00 0 f 10 chaoyang river township home Kang Ying 00 area 11 chaoyang north base business will Taiwan township harmony home village 12 13 village road chaoyang district LangXinZhuang north road chaoyang district home floor 14 0 0 court jiuxianqiao road, chaoyang district building a layer of chaoyang district 15 television building in the southern district of north 0 0 f 16 chaoyang district build the hospital 18 Tower A, Aocheng Rongfu Center, Chaoyang District, 0000 19, Room 0000, Building A0, Inte Apartment, 00 Xibahe Xili, Chaoyang District, 20 Yard 00, Yaojiayuan Road, Chaoyang DistrictCopy the code

1. Define the test dataset

def load_dataset(datafiles):
    def read(data_path):
        with open(data_path, 'r', encoding='utf-8') as fp:
            # next(fp)  # Skip header
            for line in fp.readlines():
                # Each line is "<id>\001<address>".
                ids, words = line.strip('\n').split('\001')
                words = [ch for ch in words]
                # The data set to be predicted has no labels, so fill in 'O'.
                labels = ['O' for x in range(0, len(words))]

                yield words, labels
                # yield words

    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

# Create dataset, tokenizer and dataloader.
test_ds = load_dataset(datafiles=('./dataset/final_test.txt'))

for i in range(20):
    print(test_ds[i])

(['the', 'Yang', 'area', 'small', 'closed', 'north', 'in', '0', '0', '0', '-', '0', 'no.'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])
(['the', 'Yang', 'area', ' ', 'new', 'east', 'street', '0', '0', 'no.'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])
(['to', 'male', 'area', 'south', 'grind', 'room', 'road', 'and', 'the west', 'big', 'look', 'road', '/', 'mouth', 'east', 'south', 'Angle'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])

2. Load the trained model

label_vocab = load_dict('./dataset/mytag.dic')
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

trans_func = partial(convert_example, tokenizer=tokenizer, label_vocab=label_vocab)
test_ds.map(trans_func)
print(test_ds[0])

# Same padding value for labels as in training.
ignore_label = -1
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(),  # seq_len
    Pad(axis=0, pad_val=ignore_label)  # labels
): fn(samples)

test_loader = paddle.io.DataLoader(
    dataset=test_ds,
    batch_size=30,
    return_list=True,
    collate_fn=batchify_fn)

def my_predict(model, data_loader, ds, label_vocab):
    model.eval()  # Disable dropout for inference
    pred_list = []
    len_list = []
    for input_ids, seg_ids, lens, labels in data_loader:
        logits = model(input_ids, seg_ids)
        # print(len(logits[0]))
        # Pick the highest-scoring label for every token.
        pred = paddle.argmax(logits, axis=-1)
        pred_list.append(pred.numpy())
        len_list.append(lens.numpy())
    # parse_decodes comes from utils and maps label ids back to tag strings.
    preds, tags = parse_decodes(ds, pred_list, len_list, label_vocab)
    return preds, tags
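
The parse_decodes helper lives in the utils module, which is not shown in this article. As a rough idea of what the decoding step has to do, the sketch below converts batches of predicted label ids into space-separated tag strings, using the sequence lengths to drop padding and the [CLS]/[SEP] positions. The function name decode_predictions and its signature are assumptions; the real helper returns two values and may work differently.

```python
def decode_predictions(pred_batches, len_batches, id2label):
    """Turn batches of predicted label ids into one tag string per sample."""
    results = []
    for preds, lens in zip(pred_batches, len_batches):
        for row, seq_len in zip(preds, lens):
            # Position 0 is [CLS] and position seq_len - 1 is [SEP];
            # everything after seq_len is padding.
            ids = row[1:int(seq_len) - 1]
            results.append(" ".join(id2label[int(i)] for i in ids))
    return results


id2label = {0: "B-district", 1: "I-district", 2: "E-district", 3: "O"}
pred_batches = [[[3, 0, 1, 2, 3, 3], [3, 0, 2, 3, 3, 3]]]  # one batch, two samples
len_batches = [[5, 4]]
print(decode_predictions(pred_batches, len_batches, id2label))
# ['B-district I-district E-district', 'B-district E-district']
```

The same function works unchanged on the NumPy arrays collected in pred_list and len_list, since it only relies on indexing and slicing.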

# Define the model network and its loss
model = ErnieForTokenClassification.from_pretrained("ernie-1.0", num_classes=len(label_vocab))

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))

model_dict = paddle.load('ernie_result/model_state.pdparams')
model.set_dict(model_dict)

3. Predict and save

from utils import *
preds, tags = my_predict(model, test_loader, test_ds, label_vocab)

file_path = "ernie_results.txt"
with open(file_path, "w", encoding="utf8") as fout:
    fout.write("\n".join(preds))
# Print some examples
print(
    "The results have been saved in the file: %s, some examples are shown below: "
    % file_path)

The results have been saved in the file: ernie_results.txt, some examples are shown below: 

print("\n".join(preds[:20]))

B-district I-district E-district B-road I-road I-road E-road B-roadno I-roadno I-roadno I-roadno I-roadno E-roadno
B-district I-district E-district B-road I-road I-road E-road B-roadno I-roadno E-roadno
B-district I-district E-district B-road I-road I-road E-road O B-road I-road I-road E-road B-intersection E-intersection B-assist I-assist E-assist
B-district I-district E-district B-poi I-poi E-poi B-road E-road B-houseno I-houseno E-houseno
B-district I-district E-district B-road I-road I-road E-road B-road E-road B-roadno E-roadno B-assist E-assist
B-district I-district E-district B-poi I-poi I-poi I-poi I-poi E-poi
B-district I-district E-district B-poi I-poi B-poi I-poi I-poi E-poi
B-district I-district E-district B-poi I-poi I-poi I-poi I-poi E-poi
B-district I-district E-district B-road I-road I-road I-road E-road B-roadno I-roadno E-roadno B-poi I-poi I-poi E-poi B-floorno E-floorno
B-district I-district E-district B-town I-town E-town B-poi I-poi I-poi E-poi B-subpoi I-subpoi E-subpoi B-assist E-assist O E-subpoi
B-district I-district E-district B-town I-town E-town B-community I-community E-community
B-district I-district E-district B-community I-community I-community E-community O
B-district I-district E-district B-road I-road I-road I-road E-road
B-district I-district E-district B-road I-road I-road E-road B-poi I-poi E-poi B-houseno I-houseno E-houseno O O
B-district I-district E-district B-poi I-poi E-poi I-poi E-poi B-subpoi E-poi B-houseno I-houseno E-houseno B-floorno E-floorno
B-district I-district E-district B-poi I-poi I-poi E-poi
B-district I-district E-district B-road I-road I-road I-road E-road B-roadno I-roadno E-roadno B-poi I-poi I-poi I-poi I-poi I-poi I-poi E-poi B-houseno I-houseno E-houseno B-cellno I-cellno E-cellno O I-houseno I-houseno I-houseno E-houseno
B-district I-district E-district B-poi I-poi I-poi I-poi I-poi E-poi B-houseno E-houseno O O O O
B-district I-district E-district B-road I-road I-road B-road E-road B-roadno I-roadno E-roadno B-poi I-poi I-poi E-poi B-houseno I-houseno E-houseno O O O O E-floorno
B-district I-district E-district B-road I-road I-road E-road B-poi I-poi I-poi E-poi

!head ernie_results.txt

B-district I-district E-district B-road I-road I-road E-road B-roadno I-roadno I-roadno I-roadno I-roadno E-roadno
B-district I-district E-district B-road I-road I-road E-road B-roadno I-roadno E-roadno
B-district I-district E-district B-road I-road I-road E-road O B-road I-road I-road E-road B-intersection E-intersection B-assist I-assist E-assist
B-district I-district E-district B-poi I-poi E-poi B-road E-road B-houseno I-houseno E-houseno
B-district I-district E-district B-road I-road I-road E-road B-road E-road B-roadno E-roadno B-assist E-assist
B-district I-district E-district B-poi I-poi I-poi I-poi I-poi E-poi
B-district I-district E-district B-poi I-poi B-poi I-poi I-poi E-poi
B-district I-district E-district B-poi I-poi I-poi I-poi I-poi E-poi
B-district I-district E-district B-road I-road I-road I-road E-road B-roadno I-roadno E-roadno B-poi I-poi I-poi E-poi B-floorno E-floorno
B-district I-district E-district B-town I-town E-town B-poi I-poi I-poi E-poi B-subpoi I-subpoi E-subpoi B-assist E-assist O E-subpoi

!head ./dataset/final_test.txt

1 1000-0, Xiaoguan Beili, Chaoyang District
2 00, Huixin East Street, Chaoyang District
3, Southeast Corner of nanmofang Road and West Dawang Road, Chaoyang District
4, Panjiayuan Nanli, Chaoyang District
5, Xiangjun Nanli 2nd Lane, Chaoyang District near 0
6, Multiple business outlets in Chaoyang District
7, Multiple business outlets in Chaoyang District
8, Multiple business outlets in Chaoyang District
Floor 0, Shangfang Building, no. 00, Beisanhuan Middle Road, Chaoyang District

4. Transform and save the results

def main():
    # Read the predicted tag sequences, one per line.
    data_list = []
    with open('ernie_results.txt', encoding='utf-8') as f:
        data_list = f.readlines()
    return data_list


if __name__ == "__main__":
    # Sample of the expected line format: id ^A sentence ^A tags
    print('1^A浙江杭州阿里^AB-prov E-prov B-city E-city B-poi E-poi')
    sentence_list = main()
    print(len(sentence_list))

    final_test = []
    with open('dataset/final_test.txt', encoding='utf-8') as f:
        final_test = f.readlines()
    test_data = []
    print(f'{len(final_test)}\t\t{len(sentence_list)}')
    for i in range(len(final_test)):
        # test_data.append(final_test[i].strip('\n') + '\001' + sentence_list[i] + '\n')
        # Append the tags to "<id>\001<sentence>" with another '\001' separator.
        test_data.append(final_test[i].strip('\n').strip(' ') + '\001' + sentence_list[i].strip(' '))
    with open('predict.txt', 'w', encoding='utf-8') as f:
        f.writelines(test_data)
    print(50 * '*')
    print('write result ok!')
    print(50 * '*')

1^A浙江杭州阿里^AB-prov E-prov B-city E-city B-poi E-poi
50000
50000		50000
**************************************************
write result ok!
**************************************************

!head predict.txt

1 B-district I-district E-District B-road I-road I-Road E-Road B-Roadno I-Roadno I-Roadno I-Roadno I-Roadno No. 00 Huixin East Street, Chaoyang District b-District E-District B-road I-road I-Road I-Roadno I-Roadno E-Roadno 3 B-District i-District E-District B-road I-road E-road O B-road I-road I-road E-Road, southeast Corner of nanmofang Road and West Dawang Road, Chaoyang District Intersection e-intersection B-assist I-Assist e-assist 4 No. 00 Panjiayuan Nanli B-District i-District e-District b-poi i-Poi B-district I-District E-Road I-Road I-Houseno 5 Near No. 0, No.2 Lane, Xiangjunnanli, Chaoyang District B-road E-Road B-Roadno e-Roadno B-Assist E-Assist 6 Multiple branches in Chaoyang District B-district i-district e-district b-poi i-poi i-poi I-poi E-POi 7 Multiple branches in Chaoyang District B-District i-District e-District B-Poi i-Poi B-Poi i-Poi 8 Multiple branches in Chaoyang District B-district B-district E-District B-Road, 0 / F, Shangfang Building, 00 Beisanhuan Zhong Lu, Chaoyang District I-road I-road I-road E-road B-roadno I-roadno E-roadno B-poi I-poi I-poi E-poi B-floorno E-floorno Address b-district I-district E-district B-town I-town e-town B-poi i-poi i-poi E-poi B-mystery I i-Mystery I E-subpoi B-assist E-assist O E-subpoiCopy the code

5. Check the submission format

import linecache


def check(submit_path, test_path, max_num=50000):
    '''
    :param submit_path: name of the file submitted by the participant
    :param test_path: name of the original test data file
    :param max_num: size of the test data
    :return:
    '''
    N = 0
    with open(submit_path, 'r', encoding='utf-8') as fin:
        for line in fin:
            line = line.strip()
            if line == '':
                continue
            N += 1
            parts = line.split('\001')  # id, sent, tags
            if len(parts) != 3:
                raise AssertionError(f"The separator is not correct; write the file with '\\001' separating ID, sentence and predicted tags! Error Line:{line.strip()}")
            elif len(parts[1]) != len(parts[2].split(' ')):
                print(line)
                raise AssertionError(f"Please make sure that the sentence length equals the number of labels, and that the labels are separated by spaces. ID:{parts[0]} Sent:{parts[1]}")
            elif parts[0] != str(N):
                raise AssertionError(f"Please ensure that the ID of the test data is valid! ID:{parts[0]} Sent:{parts[1]}")
            else:
                for tag in parts[2].split(' '):
                    if (tag == 'O' or tag.startswith('S-')
                        or tag.startswith('B-')
                        or tag.startswith('I-')
                        or tag.startswith('E-')) is False:
                        raise AssertionError(f"The prediction result has an invalid label! ID:{parts[0]} Tag:{parts[2]}")

                # Compare the sentence with the same line of the original test file.
                test_line = linecache.getline(test_path, int(parts[0]))
                test_sent = test_line.strip().split('\001')[1]
                if test_sent.strip() != parts[1].strip():
                    raise AssertionError(f"Please do not change the original test data! ID:{parts[0]} Sent:{parts[1]}")

    if N != max_num:
        raise AssertionError(f"Please ensure the test data is complete ({max_num} lines); do not lose or add data!")

    print('Well Done!!!!!')


check('predict.txt', 'dataset/final_test.txt')

Well Done!!!!!

7. Final submission succeeded
