
[NLP Practice 02] Named Entity Recognition

Main contents: Named Entity Recognition (NER) is the task of finding the relevant entities in a sentence and identifying their positions. Entity types are typically people's names, institutions, and places; depending on business needs, they can also be things like gender or product model.

For example, in "Liu Yuanyuan was admitted to Tsinghua University", Liu Yuanyuan is a person's name and Tsinghua University is an institution.

SimpleTransformers installation reference: [NLP Practice 01] SimpleTransformers installation and a simple text classification implementation

Dataset

We use a dataset from CLUE as the benchmark.

Selected dataset:

(1) CLUENER: fine-grained NER

Chinese Language Understanding Evaluation (CLUE)

www.cluebenchmarks.com/dataSet_sea…

The dataset is built on THUCTC, the open-source text classification dataset from Tsinghua University: a subset of the data was selected and annotated with fine-grained named entities. The original data comes from Sina News RSS.

Training set: 10,748 examples; validation set: 1,343 examples; label categories: 10

The labels are:

  • Address (address)
  • Book (book)
  • Company (company)
  • Game (game)
  • Government (government)
  • Movie (movie)
  • Name (name)
  • Organization (organization)
  • Position (position)
  • Scene (scene)
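For sequence labelling, each category is usually expanded into position-prefixed tags. A minimal sketch, assuming the B-/I-/S- prefixes that the data processing code later in this post produces; the helper name `build_tag_set` is hypothetical:

```python
# Category keys as they appear in the CLUENER label list.
categories = ["address", "book", "company", "game", "government",
              "movie", "name", "organization", "position", "scene"]

def build_tag_set(categories, prefixes=("B", "I", "S")):
    """Hypothetical helper: expand categories into prefixed tags plus the outside tag 'O'."""
    tags = ["O"]
    for category in categories:
        for prefix in prefixes:
            tags.append(f"{prefix}-{category}")
    return tags

tags = build_tag_set(categories)
print(len(tags))  # 31 = 10 categories x 3 prefixes + "O"
print(tags[:4])   # ['O', 'B-address', 'I-address', 'S-address']
```

Not every combination has to occur in the data; in practice the tag list can simply be collected from the training set.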

CLUENER download link: Data download

Task details: CLUENER2020

{"text": "Dr. Ye Laogui from the enterprise credit department of Zheshang Bank interpreted the five thresholds from another angle. Ye Laogui believes that for current domestic commercial banks,", "label": {"name": {"Ye Laogui": [[9, 11]]}, "company": {"Zheshang Bank": [[0, 3]]}}}
{"text": "Live forever, CSOL Biogenic Frenzy fills you with bullets.", "label": {"game": {"CSOL": [[4, 7]]}}}

Note: the offsets are inclusive character positions in the original Chinese text, so they do not line up with the translated sentences shown here.

Label definitions and rules:

Address (address): ** No., ** Road, ** Street, ** Village, ** District, ** City, ** Province, etc. (also labeled when appearing alone). Note: an address should be labeled in full, down to the finest granularity.
Book (book): novels, magazines, problem sets, textbooks, teaching aids, atlases, cookbooks, and other kinds of books sold in bookstores, including e-books.
Company (company): ** Company, ** Group, ** Bank (except the central bank and the People's Bank of China, which are government agencies), e.g. New Oriental; also includes Xinhuanet, China Military Net, etc.
Game (game): common games. Some games are adapted from novels or TV dramas, so check the specific context to decide whether the mention refers to the game.
Government (government): covers two levels, central and local administration. Central administrative organs include the State Council, its ministries and commissions (including ministries, commissions, the People's Bank of China, and the National Audit Office), agencies directly under the State Council (such as customs, taxation, industry and commerce, and environmental protection), and the armed forces.
Movie (movie): films, including some documentaries released in cinemas. If a film is adapted from a book, use the context to decide whether the mention refers to the film or the book.
Name (name): generally a person's name, including characters in novels such as Song Jiang, Wu Song, and Guo Jing, and character nicknames such as Timely Rain and Flower Monk; a nickname of a famous figure counts when it can be mapped to a specific person.
Organization (organization): basketball teams, football teams, bands, clubs, etc., including sects and gangs in novels such as Shaolin Temple, Gai Gang, Tiezhang Gang, Wudang, and Emei.
Position (position): historical titles such as governor, prefect, and state preceptor; modern ones such as general manager, journalist, CEO, artist, and collector.
Scene (scene): common tourist attractions such as Changsha Park, Shenzhen Zoo, aquariums, botanical gardens, the Yellow River, and the Yangtze River.

Data processing

SimpleTransformers requires the data to be a Pandas DataFrame with at least three columns.

The columns must be named sentence_id, words, and labels so that SimpleTransformers can process the data. The first column contains the sentence ID (type int). The second column contains the words (type str). The third column contains the labels (type str, e.g. "B-name").
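A minimal sketch of a DataFrame in that shape. The two sentences and their tags are made up for illustration; only the column names come from the description above:

```python
import pandas as pd

# Each row is one token: [sentence_id, word, label].
rows = [
    [0, "Liu", "B-name"],
    [0, "Yuanyuan", "I-name"],
    [0, "studies", "O"],
    [1, "CSOL", "S-game"],
    [1, "rocks", "O"],
]
df = pd.DataFrame(rows, columns=["sentence_id", "words", "labels"])

# Rows that share a sentence_id belong to the same sentence.
sentences = df.groupby("sentence_id")["words"].apply(list)
print(df)
print(sentences.loc[0])
```

The processing code in this post splits text into single characters rather than words, which suits Chinese; the layout is the same either way.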

import json
import pandas as pd

def load_cluener2020_data(path):
    """Load CLUENER2020 data in the format SimpleTransformers expects."""
    data = []
    labels_list = []
    with open(path, "r", encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = json.loads(line.strip())
            text = line["text"]
            label_entities = line.get("label", None)
            words = list(text)
            labels = ['O'] * len(words)
            if label_entities:
                for key, value in label_entities.items():
                    for sub_name, sub_index in value.items():
                        for start_index, end_index in sub_index:
                            assert "".join(words[start_index:end_index+1]) == sub_name
                            if start_index == end_index:
                                labels[start_index] = "S-" + key
                            else:
                                labels[start_index] = "B-" + key
                                labels[start_index+1:end_index+1] = ["I-"+key] * (len(sub_name) - 1)
            for word, label in zip(words, labels):
                data.append([idx, word, label])
                if label not in labels_list:
                    labels_list.append(label)
    data_df = pd.DataFrame(data, columns=["sentence_id", "words", "labels"])
    return data_df, labels_list

After processing, the form is as follows:
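As a rough illustration of that per-character layout, the sketch below applies the same span-to-tag logic to a made-up English sentence. The helper name `spans_to_tags`, the sample text, and its offsets are assumptions for illustration, not part of the dataset:

```python
import pandas as pd

def spans_to_tags(text, label_entities):
    """Hypothetical helper: turn inclusive [start, end] spans into per-character tags."""
    tags = ["O"] * len(text)
    for category, mentions in label_entities.items():
        for mention, spans in mentions.items():
            for start, end in spans:
                if start == end:
                    # Single-character entity.
                    tags[start] = "S-" + category
                else:
                    tags[start] = "B-" + category
                    tags[start + 1:end + 1] = ["I-" + category] * (end - start)
    return tags

text = "Tom works at Acme"
label = {"name": {"Tom": [[0, 2]]}, "company": {"Acme": [[13, 16]]}}

tags = spans_to_tags(text, label)
# One row per character, all sharing sentence_id 0.
df = pd.DataFrame({"sentence_id": 0, "words": list(text), "labels": tags})
print(df.head(4))
```

Because the end offset is inclusive, a span `[0, 2]` covers three characters: one `B-` tag followed by two `I-` tags.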

Model building and training

First, configure the parameters. Simple Transformers takes its options as a dict of args; for details on each argument, see: simpletransformers.ai/docs/tips-a…

1) Parameter configuration

# Configuration
import argparse

def data_config(parser):
    parser.add_argument("--trainset_path", type=str, default="data/CLUENER2020/train.json", help="Training set path")
    parser.add_argument("--devset_path", type=str, default="data/CLUENER2020/dev.json", help="Validation set path")
    parser.add_argument("--testset_path", type=str, default="data/CLUENER2020/test.json", help="Test set path")
    parser.add_argument("--reprocess_input_data", type=bool, default=True, help="If True, input data will be reprocessed even if a cache file for input data exists in cache_dir.")
    parser.add_argument("--overwrite_output_dir", type=bool, default=True, help="If True, the trained model will be saved to output_dir and will overwrite existing saved models in the same directory.")
    parser.add_argument("--use_cached_eval_features", type=bool, default=True, help="Evaluation during training uses cached features; setting this to False recomputes the features at each evaluation step.")
    parser.add_argument("--output_dir", type=str, default="outputs/", help="Directory where all outputs (model checkpoints and evaluation results) are stored.")
    parser.add_argument("--best_model_dir", type=str, default="outputs/best_model/", help="Directory where the best model found during evaluation is saved.")
    return parser

def model_config(parser):
    parser.add_argument("--max_seq_length", type=int, default=200, help="Maximum sequence length supported by the model")
    parser.add_argument("--model_type", type=str, default="bert", help="Model type: bert/roberta")
    parser.add_argument("--model_name", type=str, default="../pretrainmodel/bert", help="Which pretrained model to use")
    parser.add_argument("--manual_seed", type=int, default=2021, help="Random seed, set so that results are reproducible.")
    return parser

def train_config(parser):
    parser.add_argument("--evaluate_during_training", type=bool, default=True, help="Set to True to evaluate while training; make sure the evaluation data is passed to the training method.")
    parser.add_argument("--num_train_epochs", type=int, default=3, help="Number of training epochs")
    parser.add_argument("--evaluate_during_training_steps", type=int, default=100, help="Run an evaluation every given number of steps; a checkpoint and the evaluation results are saved.")
    parser.add_argument("--save_eval_checkpoints", type=bool, default=True)
    parser.add_argument("--save_model_every_epoch", type=bool, default=True, help="Save the model at the end of every epoch")
    parser.add_argument("--n_gpu", type=int, default=1, help="Number of GPUs used in training")
    parser.add_argument("--train_batch_size", type=int, default=16)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    return parser

def set_args():
    parser = argparse.ArgumentParser()
    parser = data_config(parser)
    parser = model_config(parser)
    parser = train_config(parser)

    # parse_known_args ignores unrecognized command-line arguments
    args, unknown = parser.parse_known_args()
    return args
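One caveat about the boolean options above: with `type=bool`, argparse calls `bool()` on the raw string, and any non-empty string (including "False") is truthy. The defaults are unaffected, but overriding a flag from the command line may not do what you expect. A minimal sketch of one common workaround; the `str2bool` helper is hypothetical and not part of argparse or Simple Transformers:

```python
import argparse

def str2bool(value):
    """Hypothetical helper: parse common true/false spellings from the command line."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("true", "1", "yes", "y"):
        return True
    if lowered in ("false", "0", "no", "n"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

# With type=bool, the string "False" is converted with bool("False"), which is True.
naive = argparse.ArgumentParser()
naive.add_argument("--overwrite_output_dir", type=bool, default=True)
naive_value = naive.parse_args(["--overwrite_output_dir", "False"]).overwrite_output_dir

# With the helper, "False" is parsed as the boolean False.
fixed = argparse.ArgumentParser()
fixed.add_argument("--overwrite_output_dir", type=str2bool, default=True)
fixed_value = fixed.parse_args(["--overwrite_output_dir", "False"]).overwrite_output_dir

print(naive_value, fixed_value)  # True False
```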

2) Model building and training

import logging
from simpletransformers.ner import NERModel

# Tag list; it can also be derived from the training set (see below)
labels_list = ["B-company", "I-company", "O", "B-name", "I-name", "B-game", "I-game",
               "B-organization", "I-organization", "B-movie", "I-movie",
               "B-position", "I-position", "B-address", "I-address",
               "B-government", "I-government", "B-scene", "I-scene", "B-book", "I-book",
               "S-company", "S-address", "S-name", "S-position"]
# training
args = set_args()
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
# fetch data
train_df, labels_list = load_cluener2020_data(args.trainset_path)
dev_df, _ = load_cluener2020_data(args.devset_path)

# Create named entity recognition model
model = NERModel(args.model_type, args.model_name, labels=labels_list, args=vars(args))
model.save_model(model=model.model)  # Saves the loaded pretrained model to output_dir

# Train the model and evaluate it during training
model.train_model(train_df, eval_data=dev_df)

Prediction results

After 3 epochs of training, the best F1 score was 0.772964.
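The post does not say how this F1 was computed. For orientation, here is a minimal sketch of an entity-level F1, assuming exact-match scoring on (type, start, end) spans. The helper names `extract_entities` and `entity_f1` are hypothetical, and this is not Simple Transformers' own evaluation code:

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans encoded by B-/I-/S-/O tags."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last span
        prefix, _, label = tag.partition("-")
        # An open span continues only on an I- tag of the same type.
        if start is not None and not (prefix == "I" and label == etype):
            entities.add((etype, start, i - 1))
            start, etype = None, None
        if prefix == "S":
            entities.add((label, i, i))
        elif prefix == "B":
            start, etype = i, label
    return entities

def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 for one tag sequence."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["B-name", "I-name", "O", "S-company"]
pred = ["B-name", "I-name", "O", "O"]
print(entity_f1(gold, pred))  # precision 1.0, recall 0.5
```

Under this scoring an entity counts as correct only if its type and both boundaries match, so a partially recognized name earns no credit.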

Complete code

import json
import argparse
import numpy as np
import pandas as pd
import logging
from simpletransformers.ner import NERModel

def data_config(parser):
    parser.add_argument("--trainset_path", type=str, default="data/CLUENER2020/train.json", help="Training set path")
    parser.add_argument("--devset_path", type=str, default="data/CLUENER2020/dev.json", help="Validation set path")
    parser.add_argument("--testset_path", type=str, default="data/CLUENER2020/test.json", help="Test set path")
    parser.add_argument("--reprocess_input_data", type=bool, default=True, help="If True, input data will be reprocessed even if a cache file for input data exists in cache_dir.")
    parser.add_argument("--overwrite_output_dir", type=bool, default=True, help="If True, the trained model will be saved to output_dir and will overwrite existing saved models in the same directory.")
    parser.add_argument("--use_cached_eval_features", type=bool, default=True, help="Evaluation during training uses cached features; setting this to False recomputes the features at each evaluation step.")
    parser.add_argument("--output_dir", type=str, default="outputs/", help="Directory where all outputs (model checkpoints and evaluation results) are stored.")
    parser.add_argument("--best_model_dir", type=str, default="outputs/best_model/", help="Directory where the best model found during evaluation is saved.")
    return parser

def model_config(parser):
    parser.add_argument("--max_seq_length", type=int, default=200, help="Maximum sequence length supported by the model")
    parser.add_argument("--model_type", type=str, default="bert", help="Model type: bert/roberta")
    parser.add_argument("--model_name", type=str, default="../pretrainmodel/bert", help="Which pretrained model to use")
    parser.add_argument("--manual_seed", type=int, default=2021, help="Random seed, set so that results are reproducible.")
    return parser

def train_config(parser):
    parser.add_argument("--evaluate_during_training", type=bool, default=True, help="Set to True to evaluate while training; make sure the evaluation data is passed to the training method.")
    parser.add_argument("--num_train_epochs", type=int, default=3, help="Number of training epochs")
    parser.add_argument("--evaluate_during_training_steps", type=int, default=100, help="Run an evaluation every given number of steps; a checkpoint and the evaluation results are saved.")
    parser.add_argument("--save_eval_checkpoints", type=bool, default=True)
    parser.add_argument("--save_model_every_epoch", type=bool, default=True, help="Save the model at the end of every epoch")
    parser.add_argument("--n_gpu", type=int, default=1, help="Number of GPUs used in training")
    parser.add_argument("--train_batch_size", type=int, default=16)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    return parser

def set_args():
    parser = argparse.ArgumentParser()
    parser = data_config(parser)
    parser = model_config(parser)
    parser = train_config(parser)

    # parse_known_args ignores unrecognized command-line arguments
    args, unknown = parser.parse_known_args()
    return args

def load_cluener2020_data(path):
    """Load CLUENER2020 data in the format SimpleTransformers expects."""
    data = []
    labels_list = []
    with open(path, "r", encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = json.loads(line.strip())
            text = line["text"]
            label_entities = line.get("label", None)
            words = list(text)
            labels = ['O'] * len(words)
            if label_entities:
                for key, value in label_entities.items():
                    for sub_name, sub_index in value.items():
                        for start_index, end_index in sub_index:
                            assert "".join(words[start_index:end_index+1]) == sub_name
                            if start_index == end_index:
                                labels[start_index] = "S-" + key
                            else:
                                labels[start_index] = "B-" + key
                                labels[start_index+1:end_index+1] = ["I-"+key] * (len(sub_name) - 1)
            for word, label in zip(words, labels):
                data.append([idx, word, label])
                if label not in labels_list:
                    labels_list.append(label)
    data_df = pd.DataFrame(data, columns=["sentence_id", "words", "labels"])
    return data_df, labels_list

def train_model():
    # Tag list; it can also be derived from the training set (see below)
    labels_list = ["B-company", "I-company", "O", "B-name", "I-name", "B-game", "I-game",
                   "B-organization", "I-organization", "B-movie", "I-movie",
                   "B-position", "I-position", "B-address", "I-address",
                   "B-government", "I-government", "B-scene", "I-scene", "B-book", "I-book",
                   "S-company", "S-address", "S-name", "S-position"]
    # training
    args = set_args()
    logging.basicConfig(level=logging.INFO)
    transformers_logger = logging.getLogger("transformers")
    transformers_logger.setLevel(logging.WARNING)
    # fetch data
    train_df, labels_list = load_cluener2020_data(args.trainset_path)
    dev_df, _ = load_cluener2020_data(args.devset_path)
    print(train_df.head(10))
    print(dev_df.head(10))

    # Create named entity recognition model
    model = NERModel(args.model_type, args.model_name, labels=labels_list, args=vars(args))
    model.save_model(model=model.model)  # Saves the loaded pretrained model to output_dir

    # Train the model and evaluate it during training
    model.train_model(train_df, eval_data=dev_df)

if __name__ == '__main__':
    train_model()

Reference: juejin.cn/post/684490…
Code reference: github.com/sharejing/S…

Simple Transformers is quick to use, but it is best suited to rapid applications or writing baselines. If you need to change the model structure or combine components flexibly, you still need to master a more flexible Python library such as Transformers.

I am new to NLP and still learning; if you spot mistakes or shortcomings, corrections are welcome!