1. The principle of CRF

1.1 CRF by example

Simply put, a CRF is a model in which adjacent variables in a probabilistic graph satisfy feature functions. For example, the following is a CRF application for merchant identification: the input is merchant text, and the output tags the address, name, keywords, business scope, and other fields, using the BIO tagging scheme.
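As a purely hypothetical illustration (the tokens and entity types below are assumed, not from the original example), a merchant record tagged in BIO style might look like:

tokens = ['Meijia', 'Nail', 'Salon', 'No.5', 'Main', 'Street']
labels = ['B-KEYWORDS', 'B-NAME', 'I-NAME', 'B-ADDRESS', 'I-ADDRESS', 'I-ADDRESS']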

  • Transition feature function: t(y_{i-1}, y_i, x, i)

  • State feature function: s(y_i, x, i)

  • The transition feature function t takes four arguments and the state feature function s takes three:

    • x, the sentence to be tagged
    • i, the position of the i-th word in sentence x
    • y_i, the tag assigned to the i-th word
    • y_{i-1}, the tag assigned to the (i-1)-th word

    The output of each function is 0 or 1: 1 means the candidate tag sequence conforms to this feature, 0 means it does not. λ and μ are the weights of the transition feature function t and the state feature function s, respectively.

1.2 In the merchant identification task above

  • When a KEYWORDS tag is followed by a BUSINESS tag (I-KEYWORDS → B-BUSINESS), we can give it a positive score with the transition feature function:

    t(y_{i-1} = "KEYWORDS", y_i = "BUSINESS", x, i) = 1

  • To tag "meijia" as KEYWORDS, we can give it a positive score with the state feature function:

    s(y_i = "KEYWORDS", x_i = "meijia", i) = 1
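As a minimal sketch (the function names are hypothetical, for illustration only), these two features can be written as binary indicator functions:

def t_keywords_business(y_prev, y_cur, x, i):
    # transition feature: fires when a KEYWORDS tag is followed by a BUSINESS tag
    return 1 if y_prev == 'KEYWORDS' and y_cur == 'BUSINESS' else 0

def s_meijia_keywords(y_cur, x, i):
    # state feature: fires when the i-th word is 'meijia' and is tagged KEYWORDS
    return 1 if y_cur == 'KEYWORDS' and x[i] == 'meijia' else 0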

1.3 Parameterizing the above process

Score of a tag sequence:

score(y|x) = \sum_{i,k}{\lambda_{k}}t_{k}(y_{i-1},y_{i},x,i)+\sum_{i,l}{\mu_{l}}s_{l}(y_{i},x,i)

Probabilization (using the softmax function):

P(y|x) = \frac{1}{Z(x)}\exp\left( \sum_{i,k}{\lambda_{k}}t_{k}(y_{i-1},y_{i},x,i)+\sum_{i,l}{\mu_{l}}s_{l}(y_{i},x,i) \right)

Combining the transition feature functions and state feature functions, and writing the parameters as w, the formula above can be written as:

P(y|x) = \frac{1}{Z(x)}\exp\left( w\cdot F(y,x) \right)

where Z(x) is the normalization term:

Z(x) = \sum_{y}\exp\left( \sum_{i,k}{\lambda_{k}}t_{k}(y_{i-1},y_{i},x,i)+\sum_{i,l}{\mu_{l}}s_{l}(y_{i},x,i) \right)
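As a minimal sketch (toy labels, weights, and feature functions, all hypothetical) of how score(y|x) is turned into P(y|x) by normalizing over every possible tag sequence:

import itertools
import math

LABELS = ['KEYWORDS', 'BUSINESS']

# hypothetical weighted feature functions: (weight, function) pairs
transition_features = [
    (1.5, lambda y_prev, y_cur, x, i: 1 if y_prev == 'KEYWORDS' and y_cur == 'BUSINESS' else 0),
]
state_features = [
    (2.0, lambda y_cur, x, i: 1 if y_cur == 'KEYWORDS' and x[i] == 'meijia' else 0),
]

def score(y, x):
    # score(y|x): weighted sum of state features plus transition features
    total = 0.0
    for i in range(len(x)):
        for mu, s in state_features:
            total += mu * s(y[i], x, i)
        if i > 0:
            for lam, t in transition_features:
                total += lam * t(y[i-1], y[i], x, i)
    return total

def prob(y, x):
    # P(y|x): softmax over all tag sequences; Z(x) computed by brute force
    Z = sum(math.exp(score(y2, x)) for y2 in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / Z

x = ['meijia', 'nail-art']
print(prob(('KEYWORDS', 'BUSINESS'), x))  # ~0.78, the most probable sequence under these weights

Real implementations compute Z(x) with the forward algorithm rather than this brute-force enumeration, which is only feasible for toy inputs.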

2. CRF feature construction

The CRF model involves the following two types of feature templates:

2.1 Basic features, commonly used in CRF models, include the following four categories (a sketch of extracting them follows this list):

  • Whether the token is a number

    • Arabic numerals: 1-10

    • Chinese numerals: 一, 二, 三, 四, 五, 六, 七, 八, 九, 十

    • Chinese financial (uppercase) numerals: 壹, 贰, 叁, 肆, 伍, 陆, 柒, 捌, 玖, 拾

  • Whether the token is uppercase/lowercase

  • Whether the token is at the start/end of the text

  • Whether the previous word is lowercase/uppercase; whether the next word is lowercase/uppercase
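A minimal sketch (the helper name and feature keys are assumptions, not from any library) of extracting these basic features for the i-th token of a sentence:

CN_DIGITS = set('一二三四五六七八九十')
CN_FINANCIAL_DIGITS = set('壹贰叁肆伍陆柒捌玖拾')

def basic_features(sent, i):
    # sent: list of tokens; returns the basic features of the i-th token
    word = sent[i]
    feats = {
        'is_arabic_digit': word.isdigit(),
        'is_cn_digit': word != '' and all(ch in CN_DIGITS for ch in word),
        'is_cn_financial_digit': word != '' and all(ch in CN_FINANCIAL_DIGITS for ch in word),
        'is_upper': word.isupper(),
        'is_lower': word.islower(),
        'is_text_start': i == 0,
        'is_text_end': i == len(sent) - 1,
    }
    if i > 0:
        feats['-1:is_upper'] = sent[i-1].isupper()
        feats['-1:is_lower'] = sent[i-1].islower()
    if i < len(sent) - 1:
        feats['+1:is_upper'] = sent[i+1].isupper()
        feats['+1:is_lower'] = sent[i+1].islower()
    return feats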

2.2 N-gram features

An n-gram is a sequence of N characters or words; the items are ordered but are not required to be distinct from one another. A sketch of generating such features follows.
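A minimal sketch (the helper name is assumed) of generating character n-gram features in a window around position i:

def ngram_features(sent, i, n=2):
    # character n-grams that contain the i-th character
    feats = {}
    for start in range(i - n + 1, i + 1):
        if 0 <= start and start + n <= len(sent):
            feats[f'{n}gram[{start - i}]'] = sent[start:start + n]
    return feats

print(ngram_features('meijia', 2))  # {'2gram[-1]': 'ei', '2gram[0]': 'ij'}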

3. Application of CRF in NER

CRF is widely used in sequence labeling. Below, the sklearn-crfsuite package is used to build a CRF sequence labeling model in four steps: data import, feature generation, training, and evaluation. The code can be run directly.

3.1 Data Preparation

Import the dependency packages

import sklearn
import scipy.stats

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

Download CoNLL 2002 data using NLTK

import nltk
nltk.download('conll2002')

>>> [nltk_data] Downloading package conll2002 to /root/nltk_data...
    [nltk_data]   Package conll2002 is already up-to-date!
    True

Load conll2002 data

%%time
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

Viewing a piece of data

train_sents[0]

>>> [('Melbourne', 'NP', 'B-LOC'),
    ('(', 'Fpa', 'O'),
    ('Australia', 'NP', 'B-LOC'),
    (')', 'Fpt', 'O'),
    (',', 'Fc', 'O'),
    ('25', 'Z', 'O'),
    ('may', 'NC', 'O'),
    ('(', 'Fpa', 'O'),
    ('EFE', 'NC', 'B-ORG'),
    (')', 'Fpt', 'O'),
    ('.', 'Fp', 'O')]

Usually our data has only text and NER annotations, so we keep just the text and BIO tags from the data above and view one sample

train_sents_ner = [[(i[0], i[2]) for i in row] for row in train_sents]
test_sents_ner = [[(i[0], i[2]) for i in row] for row in test_sents]
train_sents_ner[0]

>>> [('Melbourne', 'B-LOC'),
    ('(', 'O'),
    ('Australia', 'B-LOC'),
    (')', 'O'),
    (',', 'O'),
    ('25', 'O'),
    ('may', 'O'),
    ('(', 'O'),
    ('EFE', 'B-ORG'),
    (')', 'O'),
    ('.', 'O')]

3.2 Feature generation

Generate features using the template from the official sklearn-crfsuite documentation

def word2features(sent, i):
    word = sent[i][0]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit()
    }
    if i > 0:
        # features of the previous word
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper()
        })
    else:
        features['BOS'] = True  # beginning of sentence

    if i < len(sent)-1:
        # features of the next word
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper()
        })
    else:
        features['EOS'] = True  # end of sentence

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]

What do the transformed features look like?

sent2features(train_sents_ner[0])[2]
# the features of the third word (Australia) in the first training sample

>>> {'+1:word.istitle()': False,
    '+1:word.isupper()': False,
    '+1:word.lower()': ')',
    '-1:word.istitle()': False,
    '-1:word.isupper()': False,
    '-1:word.lower()': '(',
    'bias': 1.0,
    'word.isdigit()': False,
    'word.istitle()': True,
    'word.isupper()': False,
    'word.lower()': 'australia',
    'word[-2:]': 'ia',
    'word[-3:]': 'lia'}

Convert both the training and test data to their feature representations

X_train = [sent2features(s) for s in train_sents_ner]
y_train = [sent2labels(s) for s in train_sents_ner]

X_test = [sent2features(s) for s in test_sents_ner]
y_test = [sent2labels(s) for s in test_sents_ner]

3.3 Model training

%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',              # L-BFGS optimization
    c1=0.1,                         # L1 regularization coefficient
    c2=0.1,                         # L2 regularization coefficient
    max_iterations=100,
    all_possible_transitions=True)  # also learn transitions not seen in the data

crf.fit(X_train, y_train)

>>> CPU times: user 35 s, sys: 21.8 ms, total: 35.1 s
    Wall time: 35.1 s

3.4 Model prediction

labels = list(crf.classes_)
labels.remove('O')  # exclude the dominant 'O' tag from evaluation
labels

>>> ['B-LOC', 'B-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']

y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)

>>> 0.7860514251609507
# sort the labels so that B- and I- results of the same entity type are grouped together
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))

print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3))

>>>             precision   recall   f1-score   support

       B-LOC      0.800     0.778     0.789      1084
       I-LOC      0.672     0.631     0.651       325
      B-MISC      0.721     0.534     0.614       339
      I-MISC      0.686     0.582     0.630       557
       B-ORG      0.804     0.821     0.812      1400
       I-ORG      0.846     0.776     0.810      1104
       B-PER      0.832     0.865     0.849       735
       I-PER      0.884     0.935     0.909       634

   micro avg      0.803     0.775     0.789      6178
   macro avg      0.781     0.740     0.758      6178
weighted avg      0.800     0.775     0.786      6178
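
To connect back to the weights λ and μ from section 1: sklearn-crfsuite exposes the learned transition and state weights through the transition_features_ and state_features_ attributes (as in its official tutorial), so we can peek at what the model learned:

from collections import Counter

# strongest learned transition weights: {(label_from, label_to): weight}
print(Counter(crf.transition_features_).most_common(3))

# strongest learned state weights: {(attribute, label): weight}
print(Counter(crf.state_features_).most_common(3))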


       
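Finally, the scipy.stats imported at the top is not used above; in the official sklearn-crfsuite tutorial it serves randomized hyperparameter search over c1 and c2. A minimal sketch along those lines (the cv and n_iter values here are arbitrary):

from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

# sample c1/c2 from exponential distributions, scoring by weighted flat F1
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=labels)

rs = RandomizedSearchCV(crf, params_space, cv=3, n_iter=10, scoring=f1_scorer)
rs.fit(X_train, y_train)
print(rs.best_params_, rs.best_score_)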