Preface

NLP is often called the jewel in the crown of artificial intelligence, which reflects its importance within AI. Named entity recognition (NER) has long been an active research topic in NLP, so it is a task that anyone working in NLP should know.

Early NER implementations were based mainly on dictionaries and rules. These were followed by traditional machine learning methods such as HMM, MEMM, and CRF. With the rise of deep learning, many approaches combined CRF with recurrent or convolutional neural networks. The most recent approaches are based on attention models and transfer learning.

The mainstream core algorithm for NER is the conditional random field (CRF): the later deep learning and attention-based models are still generally combined with a CRF. This article therefore looks at how CRF performs named entity recognition.

About conditional random fields

A conditional random field (CRF) is a model of the conditional probability distribution of one set of output random variables given another set of input random variables. It is a discriminative probabilistic undirected graphical model; being discriminative, it models the conditional probability distribution directly.

In NLP, CRF is a probabilistic model used for labeling and segmenting sequence data. Following the definition of CRF, given an observation sequence X and an output sequence Y, the model is described by defining the conditional probability P(Y | X).

See the previous article, Conditional Random Fields for Machine Learning (CRF), for more details.
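
For the linear-chain case commonly used in sequence labeling, P(Y | X) is usually written in the form below. This is the standard textbook form rather than a formula taken from this article; the symbols t_k (transition feature functions), s_l (state feature functions), their weights λ_k and μ_l, and the normalizer Z(X) are notation introduced here for illustration.

```latex
P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{i}\sum_{k} \lambda_k \, t_k(y_{i-1}, y_i, X, i) + \sum_{i}\sum_{l} \mu_l \, s_l(y_i, X, i) \right)

Z(X) = \sum_{Y'} \exp\left( \sum_{i}\sum_{k} \lambda_k \, t_k(y'_{i-1}, y'_i, X, i) + \sum_{i}\sum_{l} \mu_l \, s_l(y'_i, X, i) \right)
```

Z(X) sums over every possible label sequence Y', so the probabilities of all label sequences for a given X add up to 1.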

NER corpus

To keep things simple, we use the named entity recognition corpus provided by NLTK directly. Download it as follows.

>>> import nltk
>>> nltk.download('conll2002')
[nltk_data] Downloading package conll2002 to
[nltk_data]     C:\Users\84958\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\conll2002.zip.

Read the corpus:

train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
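
Each sentence in the corpus is a list of (token, POS tag, IOB label) triples. The sketch below uses a small hand-written sample in that shape (so it runs without downloading anything) and groups the B-/I- labels back into entity spans. The names `sample_sent` and `entity_spans` are made up for this illustration and are not part of NLTK.

```python
# Hand-written sample in the same shape as one item of iob_sents():
# a sentence is a list of (token, POS tag, IOB entity label) triples.
sample_sent = [
    ("Melbourne", "NP", "B-LOC"),
    ("(", "Fpa", "O"),
    ("Australia", "NP", "B-LOC"),
    (")", "Fpc", "O"),
]


def entity_spans(sent):
    """Group IOB labels into (entity_type, text) spans (illustrative helper)."""
    spans = []
    current_type, current_tokens = None, []
    for token, _postag, label in sent:
        # An I- label continues the open entity only if the type matches.
        if label.startswith("I-") and label[2:] == current_type:
            current_tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        if label == "O":
            current_type, current_tokens = None, []
        else:
            current_type, current_tokens = label[2:], [token]
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


print(entity_spans(sample_sent))
```

B- marks the first token of an entity, I- marks a continuation, and O marks tokens outside any entity.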

Feature functions

What we define here is really a template for feature functions rather than the feature functions themselves: the actual feature functions are generated from the template, there are usually a great many of them, and the weight of each one is determined through training.

The following code shows the features selected: the lowercased word, the last 3 and last 2 characters of the word, whether the word is all uppercase, whether it is title case, whether it is a number, the part-of-speech tag, the first two characters of the tag, and the corresponding attributes of the previous and next words.
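
To make the template idea concrete, here is a rough sketch of how a template can expand into many feature functions: each distinct (attribute, label) pair seen in the training data becomes one indicator function with its own weight. The template `toy_template`, the data `toy_sents`, and the function `generated_state_features` are invented for this illustration; a real toolkit also generates transition features between adjacent labels and may differ in detail.

```python
from collections import Counter


def toy_template(sent, i):
    """A tiny template: attribute strings for the token at position i."""
    word = sent[i][0]
    return [
        "bias",
        "word.lower=" + word.lower(),
        "word.istitle=%s" % word.istitle(),
    ]


# Made-up training data: (token, POS tag, label) triples.
toy_sents = [
    [("Madrid", "NP", "B-LOC"), ("es", "V", "O")],
    [("Paris", "NP", "B-LOC"), ("es", "V", "O")],
]


def generated_state_features(sents):
    """Count every (attribute, label) pair the template produces."""
    counts = Counter()
    for sent in sents:
        for i, (_token, _postag, label) in enumerate(sent):
            for attribute in toy_template(sent, i):
                counts[(attribute, label)] += 1
    return counts


features = generated_state_features(toy_sents)
print(len(features))  # 7 distinct (attribute, label) pairs
for pair, count in sorted(features.items()):
    print(pair, count)
```

Even this three-line template over four tokens yields seven candidate feature functions, which is why a realistic template over a full corpus produces a very large number of them.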

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1,
            '-1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('BOS')
    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1,
            '+1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('EOS')
    return features
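
The training, prediction, and evaluation code in this article calls `sent2features`, `sent2labels`, and `sent2tokens`, which are not shown here. A minimal sketch of what they might look like, assuming each sentence is a list of (token, POS tag, label) triples as returned by `iob_sents`; the linked source file may define them differently.

```python
def sent2features(sent):
    # One feature list per token; relies on word2features defined above.
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    # The gold IOB label of every token.
    return [label for _token, _postag, label in sent]


def sent2tokens(sent):
    # The surface form of every token.
    return [token for token, _postag, _label in sent]
```

For a sentence such as `[("Melbourne", "NP", "B-LOC"), ("(", "Fpa", "O")]`, `sent2labels` gives the label sequence and `sent2tokens` gives the words, while `sent2features` gives one list of attribute strings per word.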

Training the model

Next, create a Trainer: convert every sentence in the corpus into a list of features and a list of labels, set the Trainer's parameters, add the samples to the Trainer, and start training. The trained model is saved to model_path.

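import pycrfsuite


def train():
    # Turn each sentence into a feature sequence and a label sequence.
    X_train = [sent2features(s) for s in train_sents]
    y_train = [sent2labels(s) for s in train_sents]

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.set_params({
        'c1': 1.0,  # coefficient for L1 regularization
        'c2': 1e-3,  # coefficient for L2 regularization
        'max_iterations': 50,  # upper limit on training iterations
        # include transitions that are possible but not observed
        'feature.possible_transitions': True
    })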

    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)

    trainer.train(model_path)
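
The `c1` and `c2` parameters passed to `set_params` are the L1 and L2 regularization coefficients. As a rough sketch (the exact constant factors depend on the implementation, and the weight vector w and the index n over the N training sentences are notation introduced here), training minimizes a regularized negative log-likelihood of the following kind:

```latex
\min_{w} \; -\sum_{n=1}^{N} \log P\big(Y^{(n)} \mid X^{(n)}; w\big) \;+\; c_1 \lVert w \rVert_1 \;+\; c_2 \lVert w \rVert_2^2
```

A larger `c1` pushes more feature weights to exactly zero, which keeps the model small; `c2` discourages very large weights.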

Prediction

Create a Tagger, load the model, and pick a sentence from the test set to label.

def predict():
    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    example_sent = test_sents[3]
    print(' '.join(sent2tokens(example_sent)), end='\n\n')
    print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
    print("Correct: ", ' '.join(sent2labels(example_sent)))
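
`tagger.tag` returns the most likely label sequence for the sentence. For a linear-chain CRF this is typically found with the Viterbi algorithm. The sketch below is a toy version with hand-picked scores, not python-crfsuite's implementation; the function `viterbi` and all the numbers are assumptions made for illustration.

```python
def viterbi(state_scores, transition_scores, labels):
    """Return the highest-scoring label sequence.

    state_scores: one dict per position, label -> score (missing = 0.0).
    transition_scores: dict (prev_label, label) -> score (missing = 0.0).
    """
    # best[t][label] = best score of any path ending in label at position t.
    best = [{lab: state_scores[0].get(lab, 0.0) for lab in labels}]
    back = []  # back[t - 1][label] = best previous label for label at t
    for t in range(1, len(state_scores)):
        scores, pointers = {}, {}
        for lab in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transition_scores.get((p, lab), 0.0),
            )
            scores[lab] = (
                best[-1][prev]
                + transition_scores.get((prev, lab), 0.0)
                + state_scores[t].get(lab, 0.0)
            )
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-PER", "I-PER"]
state = [{"B-PER": 2.0, "O": 1.0}, {"I-PER": 1.0, "O": 1.2}, {"O": 2.0}]
trans = {("B-PER", "I-PER"): 1.5, ("O", "I-PER"): -5.0, ("O", "O"): 0.5}
print(viterbi(state, trans, labels))  # ['B-PER', 'I-PER', 'O']
```

Looking at the second token alone, O scores higher than I-PER (1.2 against 1.0), but the transition score from B-PER to I-PER makes the sequence B-PER I-PER O the best overall. This is the advantage of scoring whole sequences instead of labeling each token independently.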

For example, the prediction looks like this:

Garcia Aranda Presento a la Prensa el Sistema Amadeus, Que Utilizan la Mayor Parte de Las Agencias de Viajes Espanolas para Reservar Billetes de Avion o Tren, Asi Como Plazas de Hotel, Y Que Ahora Pueden Utilizar Tambien Los Usuarios Finales a traves de Internet.

Predicted: B-PER I-PER O O O O O O B-MISC O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O B-MISC O
Correct:  B-PER I-PER O O O O O O B-MISC O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O B-MISC O

Evaluation

Finally, evaluate the model as a whole: feed every sentence in the test set to the trained model, compare the predictions with the gold labels of the test set, and print the metrics.

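from itertools import chain

from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelBinarizer


def bio_classification_report(y_true, y_pred):
    # Flatten the per-sentence label lists and one-hot encode them.
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))

    # Report every tag except 'O', grouped by entity type.
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])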
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}

    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels=[class_indices[cls] for cls in tagset],
        target_names=tagset,
    )


def evaluate():
    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    X_test = [sent2features(s) for s in test_sents]
    y_test = [sent2labels(s) for s in test_sents]
    y_pred = [tagger.tag(xseq) for xseq in X_test]
    print(bio_classification_report(y_test, y_pred))
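
The report is computed per tag at the token level: each token's predicted tag is compared with its gold tag. The following pure-Python sketch shows how the columns of such a report are computed. It is an illustration on made-up label sequences, not a replacement for scikit-learn's `classification_report`; the function name `per_tag_scores` is hypothetical.

```python
from collections import Counter


def per_tag_scores(y_true, y_pred):
    """Token-level (precision, recall, f1, support) for every tag except 'O'."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_tag, pred_tag in zip(true_seq, pred_seq):
            if true_tag == pred_tag:
                tp[true_tag] += 1
            else:
                fp[pred_tag] += 1  # predicted this tag, but it was wrong
                fn[true_tag] += 1  # missed the gold tag
    report = {}
    for tag in sorted(set(tp) | set(fp) | set(fn)):
        if tag == "O":
            continue
        predicted = tp[tag] + fp[tag]
        support = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / support if support else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[tag] = (precision, recall, f1, support)
    return report


# Made-up gold and predicted labels for one sentence.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
for tag, scores in per_tag_scores(gold, pred).items():
    print(tag, scores)
```

Precision is the share of predicted tags that are correct, recall is the share of gold tags that were found, F1 is their harmonic mean, and support is the number of gold tokens with that tag.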

The result looks like this:

             precision    recall  f1-score   support

      B-LOC       0.78      0.75      0.76      1084
      I-LOC       0.66      0.60      0.63       325
     B-MISC       0.69      0.47      0.56       339
     I-MISC       0.61      0.49      0.54       557
      B-ORG       0.79      0.81      0.80      1400
      B-PER       0.82      0.87      0.84       735
      I-PER       0.87      0.93      0.90       634

avg / total       0.77      0.76      0.76      6178

GitHub

https://github.com/sea-boat/nlp_lab/blob/master/crf_ner/crf_ner.py

Summary

CRF is more flexible than HMM in feature design. As an undirected graphical model, a CRF can use more features than a directed graphical model such as HMM, so it generally performs better overall.
