Basic knowledge

Supervised learning

one-hot representation

A word is represented as a vector in which the element corresponding to that word is 1 and all other elements are 0. The dimension of the vector equals the number of words in the vocabulary.

  • All one-hot vectors are orthogonal to each other, so they cannot capture the similarity between two words.
  • The vector dimension is very large.
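As a minimal illustration of the representation described above (the three-word vocabulary is made up):

```python
import numpy as np

# Made-up toy vocabulary; in practice this is the whole thesaurus.
vocab = ["apple", "banana", "time"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length len(vocab) with a 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("banana"))                    # [0. 1. 0.]
# Different one-hot vectors are orthogonal, so their dot product is always 0:
print(one_hot("apple") @ one_hot("time"))   # 0.0
```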

Bag-of-words model

Bag of Words, or BoW for short. The bag-of-words model assumes that we consider only the weights of the words in a text and ignore the contextual relationships between them; a word's weight is related to how often it appears in the text.

After word segmentation, we can obtain word-based features of a text by counting how many times each word appears in it. Putting the words of each text sample together with their corresponding frequencies is called vectorization. After vectorization, TF-IDF is generally used to reweight the features, which are then normalized. After some additional feature engineering, the data can be fed into a machine learning algorithm for classification or clustering.

The bag-of-words model can thus be summarized in three steps: tokenizing, counting, and normalizing.

A model very similar to the bag-of-words model is the set-of-words model (SoW for short). The only difference is that it considers only whether a word appears in the text, not its frequency: a word occurring once is treated the same as occurring many times. Most of the time we use the bag-of-words model, and the rest of the discussion focuses on it.

Of course, the bag-of-words model has clear limitations: because it considers only word frequency and not context, it loses part of the text's semantics. But for many classification and clustering tasks, it works well enough.

The following uses sklearn's CountVectorizer to implement the set-of-words model and the bag-of-words model.

Set-of-words model code

from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
import matplotlib.pyplot as plt

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']

one_hot_vectorizer = CountVectorizer(binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()

vocab = one_hot_vectorizer.get_feature_names()  # vocabulary in column order

print(one_hot_vectorizer.vocabulary_)
print(one_hot)

sns.heatmap(one_hot, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])
plt.show()

Output result:

{'time': 6, 'flies': 3, 'like': 5, 'an': 0, 'arrow': 1, 'fruit': 4, 'banana': 2}
[[1 1 0 1 0 1 1]
 [0 0 1 1 1 1 0]]

Bag-of-words model code

If binary is set to False (the default for CountVectorizer), word frequencies are counted as well, which gives the bag-of-words model. The output is as follows:

{'time': 6, 'flies': 3, 'like': 5, 'an': 0, 'arrow': 1, 'fruit': 4, 'banana': 2}
[[1 1 0 2 0 1 1]
 [0 0 1 1 1 1 0]]

Hash Trick

In large-scale text processing, the feature dimension equals the size of the segmented vocabulary and can therefore be frighteningly large, so we often reduce the dimensionality rather than vectorize directly. The most commonly used text dimensionality-reduction method is the hash trick.

After the hash trick, we no longer know the name or meaning of each feature, so the hash trick is much less interpretable.
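As a sketch, sklearn's HashingVectorizer implements the hash trick; here it is applied to the same toy corpus used below, with an arbitrarily small number of hash buckets:

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']

# n_features fixes the hashed dimension up front; 8 is a toy value.
# alternate_sign=False and norm=None keep the raw counts readable.
hashing_vectorizer = HashingVectorizer(n_features=8, alternate_sign=False, norm=None)
hashed = hashing_vectorizer.fit_transform(corpus).toarray()

print(hashed)  # columns are hash buckets, not named words
```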

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique in information retrieval and text mining. The importance of a word increases with the number of times it appears in a document, but decreases with how often it appears across the whole corpus.

Term frequency (TF) is the frequency with which a term (keyword) appears in a text. It is usually normalized (typically the word count divided by the total number of words in the document) to prevent a bias toward long documents.

IDF: divide the total number of documents by the number of documents containing the term, and take the logarithm of the quotient. Rare words have a high IDF, while high-frequency words have a low IDF.

TF-IDF is simply TF * IDF.
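Written out in the classical form (sklearn uses a smoothed variant, which is verified later in this article), with N the total number of documents and df(t) the number of documents containing term t:

$$
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
$$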

Applications

  • Keyword extraction: compute the TF-IDF value of every word in an article, sort them, and take the several words with the highest values as keywords.
  • Article similarity: extract the keywords of each article and merge them into one set; compute each article's term-frequency vector over that set; then measure the distance between the two vectors with Euclidean distance or cosine similarity (the larger the cosine similarity, the more similar the articles).

Computing TF-IDF with sklearn

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from pprint import pprint
import seaborn as sns
import matplotlib.pyplot as plt

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']

one_hot_vectorizer = CountVectorizer()
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()

pprint(one_hot)  # Output word frequencies

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(one_hot)

vocab = one_hot_vectorizer.get_feature_names()

print(vocab)  # Print the vocabulary
pprint(transformer.idf_)  # Output inverse document frequencies
pprint(tfidf.toarray())  # Output TF-IDF

sns.heatmap(tfidf.toarray(), annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])

plt.show()

The output is as follows:

array([[1, 1, 0, 2, 0, 1, 1],
       [0, 0, 1, 1, 1, 1, 0]], dtype=int64)
['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']
array([1.40546511, 1.40546511, 1.40546511, 1.        , 1.40546511,
       1.        , 1.40546511])
array([[0.42519636, 0.42519636, 0.        , 0.60506143, 0.        ,
        0.30253071, 0.42519636],
       [0.        , 0.        , 0.57615236, 0.40993715, 0.57615236,
        0.40993715, 0.        ]])

Simple verification

Take document 1's raw term counts as TF. With smooth_idf=True (the default), sklearn computes IDF as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t; the resulting TF-IDF vector is then L2-normalized:


import numpy as np

tf = np.array([1, 1, 0, 2, 0, 1, 1])   # raw term counts of document 1
x = np.array([1, 1, 1, 2, 1, 2, 1])    # document frequency of each term

idf = np.log(3 / (1 + x)) + 1          # (1 + n) = 3, since n = 2 documents

print(idf)
tfidf = tf * idf
print(tfidf / np.linalg.norm(tfidf))   # L2-normalize, as sklearn does

The output is as follows:

[1.40546511 1.40546511 1.40546511 1.         1.40546511 1.         1.40546511]
[0.42519636 0.42519636 0.         0.60506143 0.         0.30253071 0.42519636]

TfidfVectorizer can also do this in one step:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf2 = TfidfVectorizer()
result = tfidf2.fit_transform(corpus)
print(result)

A few questions

  • Why is the IDF value always bounded?
  • What is the IDF value of a term that appears in every document?
  • Can a term's TF-IDF weight exceed 1?
  • How does the base of the logarithm in IDF affect TF-IDF?
  • Assuming IDF uses base 2, give a simple approximation of IDF.

Vector Space Model (VSM)

Using TF-IDF, a document can be represented as a vector in which each component corresponds to a term; this representation ignores the relative order of terms. Representing documents as such vectors is the vector space model. Document similarity can then be measured with cosine similarity, or with distance metrics such as the following (a short sketch follows the list):

  • Euclidean distance

  • Manhattan distance
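A minimal sketch of document similarity under the vector space model, reusing the toy corpus from the TF-IDF section; cosine_similarity and euclidean_distances are both from sklearn.metrics.pairwise:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']

tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

# Cosine similarity: larger means more similar.
print(cosine_similarity(tfidf_matrix))
# Euclidean distance: smaller means more similar.
print(euclidean_distances(tfidf_matrix))
```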

Top-k similarity computation and optimization

Topic model

In traditional information retrieval there are already many ways to measure document similarity, such as the classical VSM. However, these methods usually rest on a basic assumption: the more words two documents have in common, the more similar they are likely to be. In practice this is not always true; relevance often depends on underlying semantic connections rather than surface word overlap. For example, suppose we want to know whether two sentences are related:

The first was: “Jobs is no longer with us.”

The second is: “Will apple prices come down?”

There are no common words between these two sentences, but they are still very related.

A topic model, as its name implies, models the topics implied by a text. A topic is a conditional probability distribution over the words of a vocabulary: the more closely a word is related to the topic, the higher its conditional probability, and vice versa.

A topic model starts from a large known "word-document" matrix and, through training, infers the "word-topic" matrix φ and the "topic-document" matrix θ.

Advantages of the topic model:

1) It can measure the semantic similarity between documents. For a document, its topic distribution can be regarded as an abstract representation of it. Given two such probability distributions, we can compute the semantic distance between the two documents with a distance measure (such as KL divergence) and thus obtain their similarity.

2) It can handle polysemy. Recall the earlier example: "apple" could be the fruit or the Apple company. Through the "word-topic" probability distribution we can tell which topic "apple" belongs to, and compute its similarity to other words through the matching of topics.

3) It can reduce the influence of noise in a document. Noise usually falls into minor topics, so we can ignore them and keep only the document's main topics.

4) It is unsupervised and fully automatic. We only need to provide the training documents, and it learns the various probability distributions without any manual labeling.

5) It is language-independent. As long as the text of a language can be segmented into words, its topic distribution can be trained.

There are two main approaches to training and inference in topic models: pLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation). pLSA mainly uses the EM (expectation-maximization) algorithm, while LDA is typically trained with Gibbs sampling.

pLSA

pLSA is trained with the EM (expectation-maximization) algorithm, which iterates between two steps: the E (expectation) step and the M (maximization) step.

LDA

LDA uses the bag-of-words model: for a document, we consider only which words appear (and how often), not the order in which they appear. Under the bag-of-words model, "I like you" and "you like me" are equivalent; in contrast, the n-gram model takes word order into account. LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also described as a three-layer Bayesian probability model with a word, topic and document structure. The generative view holds that every word of an article is obtained by a process of "choosing a topic with a certain probability, and then choosing a word from that topic with a certain probability". Documents-to-topics and topics-to-words both follow multinomial distributions.
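A minimal sketch with sklearn's LatentDirichletAllocation (note that sklearn trains LDA with online variational Bayes rather than Gibbs sampling); the corpus and the number of topics are toy choices:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']

# LDA works on the bag-of-words (word count) matrix.
counts = CountVectorizer().fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)   # "topic-document" distribution (theta)
word_topic = lda.components_            # unnormalized "word-topic" matrix (phi)

print(doc_topic)
print(word_topic.shape)
```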

SVD

LSA

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

# Here the list entries stand in for raw document texts
# (TfidfVectorizer treats each string as a document's content).
documents = ["doc1.txt", "doc2.txt", "doc3.txt"]

# raw documents to tf-idf matrix:
vectorizer = TfidfVectorizer(stop_words='english',
                             use_idf=True,
                             smooth_idf=True)
# SVD to reduce dimensionality:
svd_model = TruncatedSVD(n_components=100,   # num dimensions
                         algorithm='randomized',
                         n_iter=10)
# pipeline of tf-idf + SVD, fit to and applied to documents:
svd_transformer = Pipeline([('tfidf', vectorizer),
                            ('svd', svd_model)])
svd_matrix = svd_transformer.fit_transform(documents)

NMF

Feature engineering

Feature engineering mainly includes word segmentation, lemmatization, stop-word removal, and so on. This part differs between Chinese and English; the focus here is on Chinese practice.
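A minimal Chinese example with the jieba segmenter; the stop-word set here is a tiny made-up one, while real projects usually load a full stop-word list from a file:

```python
import jieba

text = "自然语言处理是人工智能的一个重要方向"
stop_words = {"是", "的", "一个"}  # toy stop-word set, for illustration only

tokens = jieba.lcut(text)                             # word segmentation
tokens = [t for t in tokens if t not in stop_words]   # stop-word removal
print(tokens)
```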

Text classification

Classification metrics

  • TP: the predicted value is 1 and the actual value is 1; the prediction is correct.
  • FP: the predicted value is 1 but the actual value is 0; the prediction is wrong.
  • FN: the predicted value is 0 but the actual value is 1; the prediction is wrong.
  • TN: the predicted value is 0 and the actual value is 0; the prediction is correct.

Accuracy

Accuracy is the proportion of all samples whose outcome is predicted correctly:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision

Precision is defined over the predicted results: among all samples predicted to be positive, it is the proportion that are actually positive:

Precision = TP / (TP + FP)

Recall

Recall is defined over the original samples: among all samples that are actually positive, it is the proportion that are predicted to be positive:

Recall = TP / (TP + FN)

F1 score

The F1 score combines precision and recall so that both are taken into account:

F1 = 2 * Precision * Recall / (Precision + Recall)
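A minimal check of these formulas against sklearn.metrics, on made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Count the four cases directly.
TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, accuracy_score(y_true, y_pred))    # each pair should agree
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```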

ROC curve and AUC

The true positive rate (TPR) and the false positive rate (FPR) are defined as:

TPR = TP / (TP + FN), FPR = FP / (FP + TN)

The ROC curve plots the false positive rate (FPR) on the horizontal axis against the true positive rate (TPR) on the vertical axis.

AUC (Area Under Curve) is the area under the ROC curve and is used to judge how good a model is. The diagonal corresponds to an area of 0.5, i.e. purely random predictions in which positive and negative samples are each covered 50% of the time. The steeper the ROC curve (the closer it hugs the top-left corner), the better; the ideal value is 1. So AUC usually lies between 0.5 and 1.

Probabilistic interpretation of AUC: it is the probability that, for a randomly chosen pair of one positive and one negative sample, the positive sample receives a higher score than the negative one. AUC is insensitive to the ratio of positive to negative samples, so it is often used as an evaluation metric for imbalanced datasets.
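A minimal numeric check of this interpretation against sklearn's roc_auc_score, on made-up scores:

```python
import itertools
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.5, 0.7]

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# Fraction of (positive, negative) pairs in which the positive sample scores
# higher (ties count as half); this equals the AUC.
pairs = list(itertools.product(pos, neg))
auc_pairwise = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(auc_pairwise, roc_auc_score(y_true, scores))  # both are ~0.778
```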

AUC values can be roughly interpreted as follows:

  • 0.5-0.7: poor.
  • 0.7-0.85: fair.
  • 0.85-0.95: good.
  • 0.95-1: excellent.

Focal Loss

TextCNN

Convolutional Neural Networks for Sentence Classification

TextRNN

Paper: Recurrent Neural Network for Text Classification with Multi-Task Learning

RCNN

Paper: Recurrent Convolutional Neural Networks for Text Classification (Keras implementation)

HAN

Paper: Hierarchical Attention Networks for Document Classification

HAN is mainly aimed at document-level classification. Each word in a sentence is first encoded with a BiGRU followed by attention; the resulting sentence vectors are then encoded with another BiGRU followed by attention, and the final representation is classified.

Keras implementation

Sequence Labeling

HMM

Conditional Random Field (CRF)

Viterbi decoding

Language model

Given a vocabulary V, a sentence can be viewed as a sequence of words (w_1, w_2, ..., w_T). Write the probability of the sentence as P(w_1, w_2, ..., w_T); such a joint probability distribution is the language model. The vocabulary of a language is very large (for Chinese, on the order of 100,000 words), and the joint distribution has on the order of |V|^T parameters, so the model is huge and sparse and hard to estimate. To make the language model more compact, the Markov assumption can be introduced.

N-gram language model

First, decompose the joint distribution into a product of conditional probabilities:

P(w_1, w_2, ..., w_T) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_T | w_1, ..., w_{T-1})

In these conditional probabilities, each word is conditioned on all the words before it. In practice, the dependence between words that are far apart is weak. The Markov assumption is that the probability of each word depends only on the few words immediately preceding it.

When n = 1, each word is independent of the words before it, giving the unigram model:

P(w_1, w_2, ..., w_T) = P(w_1) P(w_2) ... P(w_T)

When n = 2, each word depends only on the previous word, giving the bigram model:

P(w_1, w_2, ..., w_T) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_T | w_{T-1})
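A minimal count-based bigram sketch on a toy corpus (maximum-likelihood estimates, no smoothing):

```python
from collections import Counter

corpus = [["i", "like", "nlp"], ["i", "like", "deep", "learning"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def bigram_prob(w1, w2):
    """P(w2 | w1) estimated as count(w1, w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("i", "like"))    # 1.0
print(bigram_prob("like", "nlp"))  # 0.5
```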

Perplexity

Perplexity is commonly used to evaluate language models. The basic idea is that a language model that assigns higher probability to the sentences in the test set is better. Since the test sentences are all normal text, a better model gives them higher probability, and hence lower perplexity. For a test sentence S = w_1 w_2 ... w_N:

PP(S) = P(w_1 w_2 ... w_N)^(-1/N)

that is, the inverse of the sentence probability, normalized by the number of words.

  • In the best case, the model always assigns probability 1 to the correct word, and the perplexity is 1.
  • In the worst case, the model assigns probability 0 to the correct word, and the perplexity is infinite.
  • As a baseline, a model that always predicts a uniform distribution over all categories has perplexity equal to the number of categories.

Clearly, the perplexity of any useful model must be less than the number of categories; for a language model, the perplexity must be less than the vocabulary size vocab_size.
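A minimal numeric sketch of these statements; the per-step probabilities the model assigns to the correct word are made up:

```python
import numpy as np

# Probabilities a hypothetical model assigned to the correct next word at each step.
p_correct = np.array([0.2, 0.5, 0.1, 0.4])

# Perplexity is the inverse probability of the sequence normalized by its length,
# i.e. exp of the average negative log-probability.
perplexity = np.exp(-np.mean(np.log(p_correct)))
print(perplexity)  # ~3.97

# Baseline: always predicting a uniform distribution over V words gives perplexity V.
V = 1000
print(np.exp(-np.mean(np.log(np.full(4, 1.0 / V)))))  # ~1000
```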

Seq2seq

The Seq2Seq model, as its name implies, takes a sequence as input, encodes it into a vector u with one RNN (the encoder), and then decodes that vector into an output sequence with another RNN (the decoder); the length of the output sequence is variable.

Teacher forcing is used in the training phase to prevent errors made at earlier steps from propagating to the current step. It blocks error accumulation, corrects the model during training, and speeds up parameter convergence.

Once the model is trained, teacher forcing cannot be used at test time, because the target output sequence is not available there; we have to wait for the word emitted at the previous step before deciding what to feed in at the next step.

Beam Search

Beam search is only used in the test phase. When a vanilla seq2seq model generates its output sequence, it picks only the top-1 most probable word at each step (a local optimum) and strings these words together into the final output. This is a greedy strategy.

With beam search, the top k candidates are kept at each step. Each of them is fed into the next step, giving k*L continuations (where L is the vocabulary size), from which the top k are kept again, and so on. At the last step, the top-1 hypothesis is returned as the final output. This is effectively a pruned search strategy.
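A minimal beam search sketch; next_probs is a made-up stand-in for the decoder's softmax output given a prefix:

```python
import math

def next_probs(prefix):
    """Toy stand-in for a decoder step: P(word | prefix) over a tiny vocabulary."""
    return {"a": 0.5, "b": 0.3, "</s>": 0.2}

def beam_search(k=2, max_len=3):
    beams = [(0.0, [])]  # each hypothesis is (log probability, word sequence)
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            if seq and seq[-1] == "</s>":          # keep finished hypotheses as-is
                candidates.append((log_p, seq))
                continue
            for w, p in next_probs(seq).items():   # expand to at most k * L candidates
                candidates.append((log_p + math.log(p), seq + [w]))
        # Prune: keep only the top-k candidates at every step.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams[0]  # top-1 hypothesis at the final step

print(beam_search())
```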

Sequence Loss

In fact, computing the loss on "_PAD" is useless: "_PAD" itself has no meaning, the decoder is not expected to output it, and it is only used for padding; including it in the loss has side effects and disturbs parameter optimization. Therefore the loss is multiplied by a mask matrix that zeroes out the loss at the "_PAD" positions.
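A minimal numpy sketch of the idea: per-position cross-entropy multiplied by a mask that zeroes out the "_PAD" positions (the pad id, shapes and logits are toy assumptions):

```python
import numpy as np

PAD_ID = 0
targets = np.array([5, 2, 7, PAD_ID, PAD_ID])   # one padded target sequence (time,)
logits = np.random.randn(5, 10)                 # decoder outputs (time, vocab)

# Softmax over the vocabulary dimension.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Per-position cross-entropy of the target token.
ce = -np.log(probs[np.arange(5), targets])

# Mask out the loss at "_PAD" positions before averaging.
mask = (targets != PAD_ID).astype(float)
loss = (ce * mask).sum() / mask.sum()
print(loss)
```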

Attention

The attention mechanism in Seq2Seq

Transformer

This part mainly follows teacher Li Hongyi's lecture slides. Let's first review the traditional Seq2Seq model.

  • RNN: it is hard to parallelize and suffers from long-term dependency problems.
  • CNN: the lower layers of a CNN have difficulty seeing distant information.

Self-attention can replace RNN and CNN; it is also a sequence-to-sequence layer.

For self-attention, the input x is linearly transformed to obtain q, k and v.

Then q is dotted with k (transposed) to obtain the attention matrix.

Softmax normalization is applied to the attention matrix.

Then the product of the attention weights and v gives the output.
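A minimal numpy sketch of the single-head computation just described; the dimensions are toy values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model, d_k = 4, 8, 8
x = np.random.randn(seq_len, d_model)

# Linear transformations of the input into q, k, v.
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

# q dot k^T gives the attention matrix; scale by sqrt(d_k), then softmax.
attention = softmax(q @ k.T / np.sqrt(d_k))

# The product of the attention weights and v gives the output.
output = attention @ v
print(output.shape)  # (4, 8)
```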

Vectorization of the Attention process

Multi-head attention: first split the input into several heads, perform attention for each head, then concatenate the outputs.

Positional encoding

Layer Norm and residual connection

BERT

Contextualized word embedding (dynamic word embedding)

BERT

Other knowledge points

  • Discriminative models: KNN, SVM, decision tree, perceptron, LDA, LR (linear regression, logistic regression), neural networks, CRF (conditional random field), boosting. Generative models: naive Bayes, HMM, GMM (Gaussian mixture model), topic generation models (LDA), restricted Boltzmann machine.

  • The raw_input function was removed in Python 3; use the input function to get user input.

  • Immutable objects: numbers, strings, tuples. Mutable objects: lists, dictionaries. An immutable object cannot be changed through its reference.

  • Bagging is an ensemble of many learners that are trained independently; Dropout is an ensemble of many independently trained subnetworks that share some weights; Boosting does not train learners independently but in a fixed order, so they depend on each other; Stacking uses a second layer of learners trained on the outputs of the first layer.

  • The differences between Bagging and Boosting are as follows. 1) Sample selection: Bagging draws each training set from the original set with replacement, and the training sets are independent of each other; Boosting keeps the same training set in every round but changes the weight of each sample according to the previous round's classification results. 2) Sample weights: Bagging uses uniform sampling, so every sample has equal weight; Boosting keeps adjusting sample weights according to the error rate, and the larger the error, the larger the weight. 3) Prediction functions: in Bagging, all prediction functions have equal weight; in Boosting, each weak classifier has its own weight, and classifiers with smaller classification error get larger weights. 4) Parallelism: in Bagging, the prediction functions can be generated in parallel; in Boosting, they can only be generated sequentially, because each model's parameters depend on the results of the previous round.

QA

  1. Why does Transformer need positional encoding? What about RNN and CNN?

Without positional information, the Transformer model cannot capture word order, which means it would give similar results no matter how the sentence is scrambled; in other words, it would just be a more powerful bag-of-words model. An RNN is a linear sequential structure, so it naturally encodes position information. A CNN's convolution kernels preserve the relative positions between features.

  1. What is the use of the residual structure in Transformer?

  1. Why multi-head Attention?

If the input is regarded as a concatenation of multi-dimensional information, multi-head attention is equivalent to an ensemble of N different self-attention heads; instead of treating the input as a whole, it applies different attention weights to different parts of the vector.

  1. In the self-attention formula, why divide by the square root of d?

To prevent the dot products from becoming too large (which would push the softmax into regions with very small gradients), they are divided by the scaling factor sqrt(d_k).

  1. LSTM, GRU structure and parameter calculation

  2. Dying ReLU. The dying ReLU phenomenon refers to the situation where, with ReLU as the activation function, a large learning rate or some other cause makes the bias of a layer learn a large negative value, so that the layer's output is always 0 after the ReLU activation. Once a unit enters this state it can hardly return to normal, because its output is zero and its gradient is also zero.

  3. CRF and HMM?

  4. L2 regularization and L1 regularization? Why does L1 produce sparsity?

  5. Law of Large numbers and central limit theorem?

The law of large numbers tells us that the sample mean converges to the population mean, which is essentially the expectation. The central limit theorem tells us that when the sample is large enough, the distribution of the sample mean approaches a normal distribution.

  1. Inverse matrix?
  2. SVD decomposition?
  3. Clustering methods?
  4. Conditional probability calculation?
  5. Reservoir sampling?

Problem description: when memory cannot hold all the data, how do we randomly select k items from a data stream of unknown size so that every item is selected with equal probability?

Simple random sampling selects k elements from a fixed set of n elements so that each element is chosen with equal probability k/n. It is the simplest and most commonly used sampling algorithm, but one of its preconditions is that the size n of the target population must be known in advance.

Unlike simple random sampling, reservoir sampling is a dynamic (streaming) sampling method; a sketch of the algorithm follows.
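A minimal sketch of the standard reservoir sampling algorithm (Algorithm R): keep the first k items, then replace a random slot with decreasing probability:

```python
import random

def reservoir_sample(stream, k):
    """Return k items from an iterable of unknown size, each with probability k/n."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)     # item i survives with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(100000), 5))
```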

  1. Merging trees?
  2. The maximum product substring?
  3. Two Sum?

code

  1. Keras Attention
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints
from keras.layers.merge import _Merge


class Attention(Layer):
    def __init__(self, step_dim, W_regularizer=None, b_regularizer=None, W_constraint=None, b_constraint=None, bias=True, **kwargs):
        """ Keras Layer that implements an Attention mechanism for temporal data. Supports Masking. Follows the work of Raffel Et al. [https://arxiv.org/abs/1512.08756] # Input shape 3 d tensor with shape: `(samples, steps, features)`. # Output shape 2D tensor with shape: `(samples, features)`. :param kwargs: Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True. The dimensions are inferred based on  the output shape of the RNN. Example: model.add(LSTM(64, return_sequences=True)) model.add(Attention()) """
        self.supports_masking = True
        # self.init = initializations.get('glorot_uniform')
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        input_shape = K.int_shape(x)

        features_dim = self.features_dim
        # step_dim = self.step_dim
        step_dim = input_shape[1]

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                              K.reshape(self.W, (features_dim, 1))),
                        (-1, step_dim))

        if self.bias:
            eij += self.b[:input_shape[1]]

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        # A workaround is to add a very small positive number epsilon to the sum.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        # print weigthted_input.shape
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        # return input_shape[0], input_shape[-1]
        return input_shape[0], self.features_dim
# end Attention
  1. Attention RNN Model
import keras
from keras import Model
from keras.layers import *
from Attention import Attention  # assumes the Attention layer above is saved as Attention.py

class TextClassifier(object):

    def model(self, embeddings_matrix, maxlen, word_index, num_class):
        inp = Input(shape=(maxlen,))
        encode = Bidirectional(CuDNNGRU(128, return_sequences=True))
        encode2 = Bidirectional(CuDNNGRU(128, return_sequences=True))
        attention = Attention(maxlen)
        x_4 = Embedding(len(word_index) + 1,
                        embeddings_matrix.shape[1],
                        weights=[embeddings_matrix],
                        input_length=maxlen,
                        trainable=True)(inp)
        x_3 = SpatialDropout1D(0.2)(x_4)
        x_3 = encode(x_3)
        x_3 = Dropout(0.2)(x_3)
        x_3 = encode2(x_3)
        x_3 = Dropout(0.2)(x_3)
        avg_pool_3 = GlobalAveragePooling1D()(x_3)
        max_pool_3 = GlobalMaxPooling1D()(x_3)
        attention_3 = attention(x_3)
        x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3], name="fc")
        x = Dense(num_class, activation="sigmoid")(x)

        adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08,amsgrad=True)
        model = Model(inputs=inp, outputs=x)
        model.compile(
            loss='categorical_crossentropy',
            optimizer=adam)
        return model