Open source: github.com/xiaosongshi…

Welcome to Xiao Song’s official account **“Minimalist AI”**, which walks you through deep learning:

The account focuses on deep learning theory and application development, and the author regularly shares practical deep learning content. If you run into questions while learning or applying deep learning, you are welcome to reach out to me here.

In recent years, neural network-based deep learning methods have achieved great success in computer vision, speech recognition and other fields, and also made great progress in natural language processing. In the research of Named Entity Recognition (NER), a key basic task of NLP, deep learning has also achieved good results.

Contents

0. Concept explanation

0.1 Introduction to NER

0.2 Application of deep learning methods to NER

2. Programming practice

2.1 Overview

2.2 Data Preprocessing

2.3 Model Building

2.4 Model Training

2.5 Model Application

3. Summary & to be continued

References

0. Concept explanation

0.1 Introduction to NER

NER (Named Entity Recognition), also called proper-name recognition, is a fundamental task in natural language processing. Named entities are text spans with specific meaning or strong referential value, typically including person names, place names, organization names, dates and times, proper nouns, and so on. **An NER system extracts these entities from unstructured input text and can be extended to recognize more entity categories as the business requires,** such as product names, model numbers, prices, etc. The notion of “entity” can therefore be very broad: any text fragment the business cares about can be treated as an entity.

The academic definition of named entities in NER generally covers 3 broad classes (entities, times, numbers) and 7 subclasses (person names, place names, organization names, times, dates, currencies, percentages).

In practical applications, NER models usually only need to recognize person names, place names, organization names, and dates/times; some systems also output proper nouns (such as abbreviations, conference names, product names, etc.). Numeric entities such as currencies and percentages can be handled with regular expressions. In some application scenarios, domain-specific entities are also required, such as book titles, song names, and journal names.

NER is a fundamental and critical task in NLP. **From the perspective of the NLP pipeline, NER can be viewed as a kind of unknown-word recognition within lexical analysis; unknown words are the most numerous, the hardest to recognize, and the ones with the greatest impact on word segmentation quality.** NER is also the foundation of many NLP tasks such as relation extraction, event extraction, knowledge graphs, machine translation, and question answering.

NER is not an especially hot research topic at present, because some academics consider it a solved problem. Others argue that it is far from solved, mainly for the following reasons: NER only achieves good results on limited text types (mainly news corpora) and entity categories (mainly person, place, and organization names); compared with other information-retrieval tasks, the evaluation corpora for NER are small, which makes overfitting easy; NER tends to emphasize high recall, whereas information retrieval cares more about high precision; and general-purpose systems that recognize many entity types still perform poorly.

In short, NER extracts the key nouns (entities) from a sentence.

0.2 Application of deep learning methods to NER

NER has always been a research hotspot in the field of NLP. From early methods based on dictionaries and rules, to traditional machine learning methods, to methods based on deep learning in recent years, the general trend of NER research progress is shown in the figure below.

Figure 1: Trends in NER development

In machine-learning-based approaches, NER is treated as a sequence labeling problem: a labeling model is learned from a large-scale corpus and then used to tag each position of a sentence. **Common models for NER include the generative HMM and the discriminative CRF.** The Conditional Random Field (CRF) is the current mainstream non-neural model for NER. Its objective function considers not only the input state feature functions but also label transition feature functions, and the model parameters can be learned with SGD. Once the model is trained, computing the predicted output sequence for an input sequence, i.e. the optimal sequence that maximizes the objective function, is a dynamic programming problem that can be decoded with the Viterbi algorithm. The strength of the CRF is that it can exploit rich internal and contextual features when labeling a position.

Figure 2: A linear-chain conditional random field
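For reference, the conditional probability defined by a linear-chain CRF can be written in the standard feature-function form (generic notation, not taken from this article's figures):

$$
P(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{n} \sum_{k} \lambda_k\, t_k(y_{t-1}, y_t, x, t) + \sum_{t=1}^{n} \sum_{l} \mu_l\, s_l(y_t, x, t) \right)
$$

where the $t_k$ are label transition feature functions, the $s_l$ are input state feature functions, $\lambda_k$ and $\mu_l$ are learned weights, and $Z(x)$ normalizes over all possible label sequences.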

In recent years, with the growth of hardware computing power and the development of word embeddings, neural networks have become effective for many NLP tasks. The approach is similar for sequence labeling tasks such as CWS, POS tagging, and NER: tokens are mapped from discrete one-hot representations into dense, low-dimensional embeddings; the embedding sequence of a sentence is fed into an RNN, which automatically extracts features; and a softmax layer predicts the tag of each token.

This makes model training an end-to-end process rather than a traditional pipeline: it does not depend on feature engineering and is data-driven. However, there are many network variants, performance depends heavily on hyperparameter settings, and the models are hard to interpret. Another drawback is that each token is labeled independently; the previously predicted tags cannot be used directly (their information is only passed along through the hidden state), so the predicted tag sequence may be invalid. For example, an I-ORG tag cannot directly follow a B-PER tag, but the softmax has no way to use this constraint.

To address this, the DL-CRF model was proposed for sequence labeling: a CRF layer is attached to the output layer of the neural network (its key contribution is the label transition probabilities), so that tag prediction is made at the sentence level and labeling is no longer an independent classification of each token.

0.2.1 BiLSTM-CRF (RNN-based)

Long Short-Term Memory networks, usually just called LSTMs, are a special type of RNN that can learn long-range dependencies. The LSTM was proposed by Hochreiter & Schmidhuber (1997) and was later refined and popularized by Alex Graves. LSTMs have achieved considerable success on many problems and are widely used. The LSTM solves the long-range dependency problem through a clever design. All RNNs have the form of a chain of repeating neural network cells; in a standard RNN, this repeating cell has a very simple structure, such as a single tanh layer.

Figure 3: Traditional RNN structure

The LSTM has the same chain structure, but its repeating cell is different: instead of a single neural network layer, there are four, interacting in a very particular way.

Figure 4: LSTM structure

Through three gate structures (the input gate, forget gate and output gate), the LSTM selectively forgets part of the historical information, adds part of the current input, and finally integrates these into the current cell state and produces the output state.
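In equation form, the standard LSTM cell updates are:

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i[h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{C}_t &= \tanh\!\left(W_C[h_{t-1}, x_t] + b_C\right) && \text{(candidate state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state)}\\
o_t &= \sigma\!\left(W_o[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(output state)}
\end{aligned}
$$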

Figure 5: Each gating structure of LSTM

The BiLSTM-CRF model applied to NER mainly consists of an Embedding layer (word vectors, character vectors, and possibly extra features), a bidirectional LSTM layer, and a final CRF layer. **Experimental results show that BiLSTM-CRF has reached or surpassed CRF models built on rich hand-crafted features, and it has become the most mainstream deep-learning-based NER model.** In terms of features, the model inherits the advantages of deep learning: no feature engineering is needed, word and character vectors alone achieve good results, and high-quality dictionary features can improve it further.

Figure 6: Schematic diagram of the BiLSTM-CRF structure

0.2.2 IDCNN-CRF (CNN-based)

For sequence labeling, a plain CNN has a shortcoming: after convolution, a neuron in the last layer may only see a small window of the original input. **For NER, every word in the input sentence may affect the tag at the current position, which is the so-called long-range dependency problem.** To cover all of the input, more convolution layers must be stacked, making the network deeper and adding more parameters. To prevent overfitting, regularization such as dropout is added, introducing more hyperparameters and making the model bulky and hard to train. Because of these drawbacks, most sequence labeling work still chooses BiLSTM-style architectures, relying on the network's memory to remember the whole sentence when labeling the current word.

This, however, raises another problem: BiLSTM is inherently a sequential model and cannot exploit GPU parallelism as well as a CNN. Can we let the GPU run at full throttle like a CNN, while still remembering as much of the input as an LSTM does, with a simple structure?

Fisher Yu and Vladlen Koltun proposed the dilated CNN in 2015. The idea is not complicated: the filter of an ordinary CNN acts on a contiguous region of the input matrix and keeps sliding to perform convolution; a dilated CNN adds a dilation width to the filter, so that when it is applied to the input matrix, the input positions inside the dilation gaps are skipped. The filter size itself stays the same, so the filter sees a broader region of the input matrix, as if it had been "dilated".

In practice, the dilation width grows exponentially with the layer depth. As the number of layers increases, the number of parameters grows linearly while the receptive field grows exponentially, so it quickly covers the entire input.

Figure 7: Schematic diagram of the IDCNN

Figure 7 shows the receptive field expanding at an exponential rate. The original receptive field is the 1×1 region at the center point:

(a) Diffusing outward from the original receptive field with a dilation width of 1 adds eight 1×1 regions, giving a new 3×3 receptive field;

(b) After diffusion with a dilation width of 2, the 3×3 receptive field from the previous step expands to 7×7;

(c) After diffusion with a dilation width of 4, the 7×7 receptive field expands to 15×15. The parameters of each layer are independent of one another; the receptive field grows exponentially while the parameter count grows only linearly.

For text, the input is a one-dimensional sequence whose elements are character embeddings:

Figure 8: An IDCNN block with a maximum dilation width of 4

IDCNN produces logits for every word of the input sentence, exactly like the logits of the BiLSTM model; a CRF layer is then added and the label sequence is decoded with the Viterbi algorithm.

The CNN-based method uses dilated convolutions plus stacked layers to capture the whole sentence, while also gaining the speedup of parallel computation (for the speed difference between CNN and RNN, see my blog post "CNN RNN parallel understanding").
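As a minimal sketch (not the original project's code), a stack of dilated 1-D convolutions can be written in Keras as follows; the function name, filter sizes, and the dilation schedule (1, 2, 4) are illustrative assumptions consistent with Figure 8:

```python
# Sketch of an IDCNN-style block: stacked Conv1D layers whose dilation rate
# grows exponentially, so the receptive field quickly covers the whole sentence.
from tensorflow.keras.layers import Input, Embedding, Conv1D, Dense
from tensorflow.keras.models import Model

def build_idcnn_sketch(vocab_size=16000, emb_dim=256, seq_len=80,
                       filters=128, num_classes=9):
    inp = Input(shape=(seq_len,))
    x = Embedding(vocab_size, emb_dim, input_length=seq_len)(inp)
    # dilation widths 1, 2, 4: receptive field grows exponentially,
    # parameters grow only linearly
    for dilation in (1, 2, 4):
        x = Conv1D(filters, kernel_size=3, padding="same",
                   dilation_rate=dilation, activation="relu")(x)
    # per-token scores; a CRF layer could be attached here instead of softmax
    out = Dense(num_classes, activation="softmax")(x)
    return Model(inp, out)
```

Because `padding="same"` is used, every layer keeps the sequence length of 80, so one score vector is produced per token.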

Attaching a CRF layer to the end of a network such as BiLSTM or IDCNN is a common approach for sequence labeling: the BiLSTM or IDCNN computes the score of each tag for each word, while the CRF layer adds the tag transition scores of the sequence; the final loss is then computed and back-propagated into the network.
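As a concrete illustration of the decoding step (a NumPy sketch under the assumption of additive emission and transition scores, not the project's actual CRF code):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, num_tags) scores from BiLSTM/IDCNN.
    transitions: (num_tags, num_tags) score of moving from tag i to tag j."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag so far
    backpointers = []
    for t in range(1, seq_len):
        # score of every (previous tag -> current tag) pair at step t
        all_scores = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(np.argmax(all_scores, axis=0))
        score = np.max(all_scores, axis=0)
    # follow the backpointers from the best final tag
    best_path = [int(np.argmax(score))]
    for bp in reversed(backpointers):
        best_path.append(int(bp[best_path[-1]]))
    return best_path[::-1]
```

Given the per-token scores from the network and a learned transition matrix, this returns the single highest-scoring tag path instead of picking each tag independently.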

That leaves just one question: what is the CRF layer, and why do we need it?

0.2.3 CRF layer explanation

Next, a brief introduction to the model. The schematic diagram is as follows:

  • First, each word in sentence X is represented as a vector that combines its word embedding and character embedding; the character embeddings are randomly initialized, the word embeddings are usually initialized from a pre-trained model, and all embeddings are fine-tuned during training.
  • Second, the input to the BiLSTM-CRF model is the sequence of embeddings described above, and the output is the predicted label for each word in sentence X.

Although we are discussing the CRF layer and do not need the internal details of the BiLSTM layer, to understand the CRF layer we must know what the BiLSTM layer's output means.

As the figure above shows, the output of the BiLSTM layer is a score for each tag. For the word w₀, for example, the BiLSTM outputs 1.5 (B-Person), 0.9 (I-Person), 0.1 (B-Organization), 0.08 (I-Organization) and 0.05 (O). These scores are the input to the CRF layer: feed the scores predicted by the BiLSTM into the CRF layer, and the tag sequence with the highest overall score is the model's prediction.

What if there were no CRF layer?

From the above, if there were no CRF layer, we would be training a plain BiLSTM named entity recognition model, as shown in the following figure:

Because the BiLSTM outputs a score for every tag of every word, we could simply pick the highest-scoring tag of each word as the prediction. For w₀, for example, "B-Person" has the highest score (1.5), so we choose "B-Person" as its predicted label; similarly, w₁ gets "I-Person", w₂ gets "O", w₃ gets "B-Organization" and w₄ gets "O". Although this happens to give the correct labels for sentence X here, in general it will not, as in the example below:

Obviously, output label sequences such as "I-Organization I-Person" and "B-Organization I-Person" are invalid.

CRF can learn constraints from training data

The CRF layer can add constraints on the final predicted tags to ensure that they are valid. These constraints are learned automatically by the CRF layer from the training data. Possible constraints include:

  • The first word of a sentence should be tagged "B-" or "O", never "I-";
  • In the pattern "B-label1 I-label2 I-label3 …", label1, label2, label3, … should refer to the same entity type. For example, "B-Person I-Person" is valid, but "B-Person I-Organization" is not;
  • "O I-label" is invalid: the first tag of a named entity must start with "B-", not "I-"; in other words, the valid pattern is "O B-label".

With these constraints, the number of invalid predicted tag sequences drops dramatically.
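To make the constraints concrete, here is a small hand-written sketch (hypothetical, for illustration only; in a real CRF layer the transition scores are learned rather than hard-coded) that assigns a very low score to invalid BIO transitions:

```python
import numpy as np

tags = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG']
NEG = -1e9  # effectively forbids a transition

def allowed(prev_tag, next_tag):
    # "I-x" may only follow "B-x" or "I-x" of the same entity type;
    # "O" and any "B-x" may follow anything
    if next_tag.startswith('I-'):
        return prev_tag in ('B-' + next_tag[2:], 'I-' + next_tag[2:])
    return True

transitions = np.array([[0.0 if allowed(p, n) else NEG for n in tags]
                        for p in tags])
print(transitions)
```

A transition matrix like this, fed to a Viterbi-style decoder such as the one sketched earlier, rules out sequences like "O I-PER" or "B-PER I-ORG".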

The CRF layer, then, adds constraints that make the output better-formed, at some extra computational cost (somewhat like the role of beam search). Let's look at how the CRF layer works.

Frame-by-frame softmax

CRF is mainly used for sequence labeling, which can be understood simply as classifying every frame in the sequence. Since it is a classification problem, it is natural to encode the sequence with a CNN or RNN and then add a fully connected layer with a softmax activation, as shown in the figure below.

Frame-by-frame softmax does not directly consider the context of the output

Conditional random field

However, when we design the tags, e.g. the S, B, M, E scheme for word segmentation, the target output sequence itself has contextual constraints: S cannot be immediately followed by M or E, and so on. Tag-by-tag softmax does not consider these output-level dependencies, which means pushing them into the encoding layer and hoping the model learns them on its own; sometimes even a strong model fails to do so.

CRF, on the other hand, is more direct: it models the dependencies at the output level explicitly, which makes them "easier" for the model to learn:

CRF explicitly considers context correlation at the output

Mathematics

Of course, merely introducing correlations between outputs is not all there is to CRF. What is really elegant about CRF is that it reasons in terms of complete paths, i.e. the probability of an entire label path.

Model overview

If an input has n frames and each frame's label has k possibilities, then in theory there are k^n different output sequences. This can be visualized with the network diagram below: each dot represents a possible label, the lines between dots represent associations between labels, and each labeling result corresponds to a complete path through the graph.

Output network graph of the 4-tag word segmentation model

In a sequence labeling task, the correct answer is generally unique. For example, for the sentence "今天天气不错" ("the weather is nice today"), if the desired segmentation is "今天/天气/不/错" (roughly "today / weather / not / bad"), then the target output sequence is "bebess" and no other path meets the requirement. In other words, in sequence labeling the basic unit we should reason about is the path: we have to choose the correct path out of k^n candidates, which, viewed as a classification problem, is a k^n-way classification.
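Written out with generic symbols, the score of one path and its probability among all k^n candidate paths are:

$$
\mathrm{score}(x, y) = \sum_{t=1}^{n} E_{t,\,y_t} + \sum_{t=2}^{n} T_{y_{t-1},\,y_t},
\qquad
P(y \mid x) = \frac{\exp\big(\mathrm{score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{score}(x, y')\big)}
$$

where $E_{t,y_t}$ is the network's emission score for tag $y_t$ at position $t$, $T$ is the tag transition matrix, and the denominator sums over all $k^n$ possible paths. Training maximizes the probability of the correct path, and prediction selects the highest-scoring path with the Viterbi algorithm.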

To summarize: the CRF layer optimizes the correlations between output tags.

2. Programming practice

2.1 Overview

  • This hands-on project is based on a reference blog post (see References below)
  • The project uses the conll2003_v2 dataset (CoNLL-2003), which uses nine entity tags in total:
```python
['O', 'B-LOC', 'B-PER', 'B-ORG', 'I-PER', 'I-ORG', 'B-MISC', 'I-LOC', 'I-MISC']
```

We implement a model that recognizes the named entities in an input sentence, for example:

```python
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
```

```python
['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
```

2.2 Data Preprocessing

Download the data and unpack it for training; download address: files.deeppavlov.ai/deeppavlov_…

After downloading and unpacking, you will see three files: test.txt, train.txt, valid.txt

Opening them, you can see the data format below: we only need the first and last column of each line, which are the word and its named entity tag respectively.

Data reading and preprocessing

We need to process the data into a form that the network can receive.

Read the data and test the output

```python
from tqdm import tqdm

class NerDatasetReader:
    def read(self, data_path):
        data_parts = ['train', 'valid', 'test']
        extension = '.txt'
        dataset = {}
        for data_part in tqdm(data_parts):
            file_path = data_path + data_part + extension
            dataset[data_part] = self.read_file(str(file_path))
        return dataset

    def read_file(self, file_path):
        samples = []
        tokens, tags = [], []
        fileobj = open(file_path, 'r', encoding='utf-8')
        for content in fileobj:
            content = content.strip('\n')
            if content == '-DOCSTART- -X- -X- O':
                continue                      # skip document-start markers
            elif content == '':               # blank line: end of the current sentence
                if len(tokens) != 0:
                    samples.append((tokens, tags))
                    tokens, tags = [], []
            else:                             # "word POS chunk NER-tag": keep first and last columns
                contents = content.split(' ')
                tokens.append(contents[0])
                tags.append(contents[-1])
        if len(tokens) != 0:                  # flush the last sentence if the file has no trailing blank line
            samples.append((tokens, tags))
        fileobj.close()
        return samples


if __name__ == "__main__":
    ds_rd = NerDatasetReader()
    data1 = ds_rd.read("./conll2003_v2/")
    for sample in data1['train'][:2]:
        print(sample)
```

The output

```
(['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'])
(['Peter', 'Blackburn'], ['B-PER', 'I-PER'])
```

We can see that the data has been organized: each sentence is stored as two lists, one of words and one of tags.

But two problems remain: 1. the network cannot consume word-level (string) data directly, so it must be converted to numbers; 2. sentences have different lengths, so they cannot be batched without normalizing them to a fixed length.

For problem 1, we build dictionaries that map words and tags to integer indices.

```python
def get_dicts(datas):
    w_all_dict, n_all_dict = {}, {}
    for sample in datas:
        for token, tag in zip(*sample):
            if token not in w_all_dict.keys():
                w_all_dict[token] = 1
            else:
                w_all_dict[token] += 1
            if tag not in n_all_dict.keys():
                n_all_dict[tag] = 1
            else:
                n_all_dict[tag] += 1

    # sort by frequency, keep the most common words and all tags
    sort_w_list = sorted(w_all_dict.items(), key=lambda d: d[1], reverse=True)
    sort_n_list = sorted(n_all_dict.items(), key=lambda d: d[1], reverse=True)
    w_keys = [x for x, _ in sort_w_list[:15999]]
    w_keys.insert(0, "UNK")       # index 0 is "UNK"; it also doubles as the padding value
    n_keys = [x for x, _ in sort_n_list]

    w_dict = {x: i for i, x in enumerate(w_keys)}
    n_dict = {x: i for i, x in enumerate(n_keys)}
    return w_dict, n_dict


if __name__ == "__main__":
    ds_rd = NerDatasetReader()
    data1 = ds_rd.read("./conll2003_v2/")
    w_dict, n_dict = get_dicts(data1["train"])
    print(len(w_dict), n_dict)
```

Test output

```
8000 {'O': 0, 'B-LOC': 1, 'B-PER': 2, 'B-ORG': 3, 'I-PER': 4, 'I-ORG': 5, 'B-MISC': 6, 'I-LOC': 7, 'I-MISC': 8}
```

We kept the first 15,999 commonly used words and added “UNK” for unknown words.

Now we’re going to use these dictionaries to replace words with numbers

```python
def w2num(datas, w_dict, n_dict):
    ret_datas = []
    for sample in datas:
        num_w_list, num_n_list = [], []
        for token, tag in zip(*sample):
            if token not in w_dict.keys():
                token = "UNK"      # map out-of-vocabulary words to UNK
            num_w_list.append(w_dict[token])
            num_n_list.append(n_dict[tag])
        # keep the sentence length as the last element of the tuple
        ret_datas.append((num_w_list, num_n_list, len(num_n_list)))
    return ret_datas


if __name__ == "__main__":
    data_num = {}
    ds_rd = NerDatasetReader()
    dataset = ds_rd.read("./conll2003_v2/")
    w_dict, n_dict = get_dicts(dataset["train"])
    data_num["train"] = w2num(dataset["train"], w_dict, n_dict)
    print(data_num["train"][:4])
    print(dataset["train"][:4])
```

The test output shows that the data is now fully numeric. For convenient length statistics later, the last element of each tuple stores the sentence length.

```
[([6, 957, 11983, 233, 762, 4147, 209, 6182, 1], [3, 0, 6, 0, 0, 0, 6, 0, 0], 9), ([732, 2068], [2, 4], 2), ([1379, 134], [1, 0], 2), ([18, 226, 455, 13, 12, 66, 35, 8127, 24, 233, 4148, 6, 2476, 6, 11984, 209, 6182, 407, 3542, 2069, 499, 1789, 1920, 651, 287, 39, 8128, 6, 1921, 1], [0, 3, 5, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 30)]
[(['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']), (['Peter', 'Blackburn'], ['B-PER', 'I-PER']), (['BRUSSELS', '1996-08-22'], ['B-LOC', 'O']), (['The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 'shun', 'British', 'lamb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.'], ['O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])]
```

Computing statistics over the sentence lengths, we find that the maximum is 113 and the minimum is 1. For convenient batch training, we normalize the length to 80.

data_num["train"] = w2num(dataset["train"],w_dict,n_dict)
w_lens = [data[-1] for data in  data_num["train"]]
print(max(w_lens),min(w_lens))
Copy the code

Length normalization: we pad with 0, which corresponds to "UNK" for words and "O" for tags; a masking approach could also be used here.

```python
def len_norm(data_num, lens=80):
    ret_datas = []
    for sample1 in list(data_num):
        sample = list(sample1)
        ls = sample[-1]                      # stored sentence length
        if ls < lens:
            # pad with 0, which maps to "UNK" (word) and "O" (tag)
            sample[0] = sample[0] + [0] * (lens - ls)
            sample[1] = sample[1] + [0] * (lens - ls)
        else:
            # truncate long sentences to `lens`
            sample[0] = sample[0][:lens]
            sample[1] = sample[1][:lens]
        ret_datas.append(sample[:2])
    return ret_datas


if __name__ == "__main__":
    data_num, data_norm = {}, {}
    ds_rd = NerDatasetReader()
    dataset = ds_rd.read("./conll2003_v2/")
    w_dict, n_dict = get_dicts(dataset["train"])
    data_num["train"] = w2num(dataset["train"], w_dict, n_dict)
    data_norm["train"] = len_norm(data_num["train"])
    print(data_norm["train"][:4])
```

The test output below shows each sample padded with zeros to length 80 (the padding is abbreviated here):

```
[[[6, 957, 11983, 233, 762, 4147, 209, 6182, 1, 0, 0, ..., 0],
  [3, 0, 6, 0, 0, 0, 6, 0, 0, 0, 0, ..., 0]],
 [[732, 2068, 0, 0, ..., 0],
  [2, 4, 0, 0, ..., 0]],
 [[1379, 134, 0, 0, ..., 0],
  [1, 0, 0, 0, ..., 0]],
 [[18, 226, 455, 13, 12, 66, 35, 8127, 24, 233, 4148, 6, 2476, 6, 11984, 209, 6182, 407, 3542, 2069, 499, 1789, 1920, 651, 287, 39, 8128, 6, 1921, 1, 0, 0, ..., 0],
  [0, 3, 5, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0]]]
```

2.3 Model Building

We build the model with a bidirectional RNN, specifically a BiLSTM. To keep the explanation simple, we use RNN + softmax instead of a CRF output layer; a CRF version will be added in a later update. The network structure is as follows:

Model building code

```python
import numpy as np
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam


def build_model(num_classes=9):
    model = Sequential()
    # vocabulary of 16000 words, 256-dim embeddings, sentences padded to length 80
    model.add(Embedding(16000, 256, input_length=80))
    # two stacked bidirectional LSTMs; "concat" gives 256 features per time step
    model.add(Bidirectional(LSTM(128, return_sequences=True), merge_mode="concat"))
    model.add(Bidirectional(LSTM(128, return_sequences=True), merge_mode="concat"))
    model.add(Dense(128, activation='relu'))
    # per-token softmax over the 9 tag classes
    model.add(Dense(num_classes, activation='softmax'))
    model.summary()
    return model
```

Output model structure

```
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 80, 256)           4096000
_________________________________________________________________
bidirectional (Bidirectional (None, 80, 256)           394240
_________________________________________________________________
bidirectional_1 (Bidirection (None, 80, 256)           394240
_________________________________________________________________
dense (Dense)                (None, 80, 128)           32896
_________________________________________________________________
dense_1 (Dense)              (None, 80, 9)             1161
=================================================================
Trainable params: 4918537
_________________________________________________________________
```

2.4 Model Training

if __name__ == "__main__": ds_rd = NerDatasetReader() dataset = ds_rd.read("./conll2003_v2/") w_dict,n_dict = get_dicts(dataset["train"]) data_num["train"] = w2num(dataset["train"],w_dict,n_dict) data_norm["train"] = len_norm(data_num["train"]) model.compile(loss="sparse_categorical_crossentropy",optimizer=opt) train_data = np.array(data_norm["train"]) train_x = train_data[:,0,:] train_y = train_data[:,1,:] The model fit (x = train_x, y = train_y, epochs = 10, batch_size = 200, verbose = 1, validation_split = 0.1)Copy the code

Training 10 epochs takes about five minutes on an MX150 GPU, and both the training loss and val_loss keep decreasing:

```
12636/12636 [==============================] - 68s 5ms/sample - loss: 0.3199 - val_loss: 0.1359
12636/12636 [==============================] - 58s 5ms/sample - loss: 0.1274 - val_loss: 0.1201
12636/12636 [==============================] - 63s 5ms/sample - loss: 0.1099 - val_loss: 0.0957
12636/12636 [==============================] - 58s 5ms/sample - loss: 0.0681 - val_loss: 0.0601
12636/12636 [==============================] - 63s 5ms/sample - loss: 0.0372 - val_loss: 0.0498
```

2.5 Model Application

After training for 10 epochs, we load the saved weights and check the predictions:

if __name__ == "__main__": ds_rd = NerDatasetReader() dataset = ds_rd.read("./conll2003_v2/") w_dict,n_dict = get_dicts(dataset["train"]) data_num["train"] = w2num(dataset["train"],w_dict,n_dict) data_norm["train"] = len_norm(data_num["train"]) model.compile(loss="sparse_categorical_crossentropy",optimizer=opt) train_data = np.array(data_norm["train"]) train_x = train_data[:,0,:] train_y = train_data[:,1,:] The model fit (x = train_x, y = train_y, epochs = 10, batch_size = 200, verbose = 1, validation_split = 0.1) model. The load_weights (" model. The h5)" pre_y = model.predict(train_x[:4]) pre_y = np.argmax(pre_y,axis=-1) for i in range(0,len(train_y[0:4])): print("label "+str(i),train_y[i]) print("pred "+str(i),pre_y[i])Copy the code

Checking the output, the predictions on the first four training samples match the labels well (the zero padding to length 80 is abbreviated below):

```
label 0 [3 0 6 0 0 0 6 0 0 0 0 ... 0]
pred  0 [3 0 6 0 0 0 6 0 0 0 0 ... 0]
label 1 [2 4 0 0 0 0 0 0 0 0 0 ... 0]
pred  1 [2 4 0 0 0 0 0 0 0 0 0 ... 0]
label 2 [1 0 0 0 0 0 0 0 0 0 0 ... 0]
pred  2 [1 0 0 0 0 0 0 0 0 0 0 ... 0]
label 3 [0 3 5 0 0 0 0 0 0 6 0 0 0 0 0 6 0 0 0 ... 0]
pred  3 [0 3 5 0 0 0 0 0 0 6 0 0 0 0 0 6 0 0 0 ... 0]
```
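To read the numeric predictions more easily, the ids can be mapped back to tag names with the inverse of n_dict. A small helper along these lines (hypothetical, not part of the original code) would be:

```python
# Hypothetical helper: convert predicted tag ids back to BIO tag strings
# using the n_dict built in section 2.2.
def num2tag(pred_ids, n_dict):
    id2tag = {i: tag for tag, i in n_dict.items()}
    return [id2tag[int(i)] for i in pred_ids]

# e.g. num2tag(pre_y[0][:9], n_dict)
# -> ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
```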

3. Summary & to be continued

To keep things simple, this article only uses the RNN + softmax approach and only evaluates on the training set. There is still plenty of room for improvement, such as adding a CRF layer, using masking, and using all three data splits; these will be updated when time permits. You are also welcome to discuss and improve the project together.

References

  1. www.jiqizhixin.com/articles/20…
  2. blog.csdn.net/suan2014/ar…
  3. spaces.ac.cn/archives/55…
  4. blog.csdn.net/chinateleco…