Many downstream tasks in NLP (text categorization, sentiment analysis, intent detection, and so on) rely on a first step: transforming text strings into sentence feature vectors.

There are two ways to obtain sentence vectors in NLP:

(1) The sentence vector is obtained by post-processing word vectors;

(2) The sentence vector is obtained directly.

Word vector techniques such as GloVe will not be introduced here.

1. Obtaining the sentence vector by post-processing word vectors

As we all know, sentences are made up of words, and word vector techniques only turn single words into fixed-dimensional vectors.

So how do we get a vector for a sentence made up of multiple words?

This article introduces the following unsupervised methods for generating sentence vectors from word vectors: the accumulation method, the average method, the TF-IDF weighted average method, and the SIF embedding method.

1.1 accumulation method

Summation is the easiest way to get a sentence vector.

Imagine a sentence like this: "I am very happy."

To process a piece of text, NLP first performs word segmentation and stop-word removal. After stop-word removal, the text above yields the following word list:

["I", "very", "happy"]

This article uses the Word2vec model from Python's Gensim library to obtain word vectors. The words above map to the following word vectors (5-dimensional vectors are used so the demonstration is easier to read):

[[-0.46499524 -2.8825798 1.1845024 -1.6874554 -0.05758076]]

[[-2.26874 0.99428487 -0.9092457 -0.67786723 4.244918]]

[[-1.0627153 -0.7416505 0.41102988 -0.39201248 0.6933297]]

The accumulation method simply adds up the word vectors of all non-stop words in a sentence. If the sentence has n non-stop words, the sentence vector is obtained as follows:

Vsentence = Vword1 + Vword2 + ... + Vwordn

According to this method, the sentence vector of "I am very happy" is obtained as follows:

Vsentence = Vword1 + Vword2 + Vword3

[[-3.79645047 -2.62994546 0.68628651 -2.75733513 4.8806668]]
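As a quick sanity check, the summation can be reproduced with NumPy from the three word vectors printed above (a minimal sketch; the tiny differences in the last digits come from rounding in the printout):

import numpy as np

# word vectors of "I", "very", "happy" as printed above (rounded)
v_i     = np.array([-0.46499524, -2.8825798,  1.1845024,  -1.6874554,  -0.05758076])
v_very  = np.array([-2.26874,     0.99428487, -0.9092457, -0.67786723,  4.244918])
v_happy = np.array([-1.0627153,  -0.7416505,  0.41102988, -0.39201248,  0.6933297])

# accumulation method: simply sum the word vectors
v_sentence = v_i + v_very + v_happy
print(v_sentence)  # ≈ [-3.79645047 -2.62994546 0.68628651 -2.75733513 4.8806668]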

Analysis: the result cannot express the full meaning of the sentence. Adding up the vectors of several words simply blends them together, even though each word means something different.

Overall code:

import numpy as np
from gensim.models.word2vec import Word2Vec
import jieba
from sklearn.model_selection import train_test_split

# Read the labelled corpus: each line starts with a label (1 or 0) followed by the text
def readgh(path):
    res = []
    label = []
    f = open(path, "r", encoding='utf-8-sig')
    for line in f:
        if int(line.split(' ', 1)[0]) == 1:
            label.append([1, 0])
        else:
            label.append([0, 1])
        res.append(line.split(' ', 1)[1])
    return res, label

# Split the corpus into training and test sets
def divide(result, label):
    x_train, x_test, y_train, y_test = train_test_split(result, label, test_size=0.3, random_state=666)
    return x_train, x_test, y_train, y_test

# Segment each text with jieba and remove stop words
def fenci1(data, stopword):
    result = []
    for text in data:
        word_list = ' '.join(jieba.cut(text)).split(" ")
        result.append(list(filter(lambda x: x not in stopword, word_list)))
    return result

# Average method: average the word vectors of all non-stop words in the sentence
def buildWordVector(sentence, size, w2v_model):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in sentence:
        try:
            vec += w2v_model[word].reshape((1, size))
            count += 1
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

# Accumulation method: sum the word vectors of all non-stop words in the sentence
def buildWordVector1(sentence, size, w2v_model):
    data_vec = np.zeros(size).reshape((1, size))
    for word in sentence:
        try:
            data_vec += w2v_model[word].reshape((1, size))
        except KeyError:
            continue
    return data_vec

# Train Word2vec and build sentence vectors for the training and test sets
def get_train_vecs(x_train, x_test):
    n_dim = 5
    # Initialize model and build vocab
    w2v_model = Word2Vec(size=n_dim, min_count=10)
    w2v_model.build_vocab(x_train)
    w2v_model.train(x_train, total_examples=w2v_model.corpus_count, epochs=w2v_model.iter)
    train_vecs = np.concatenate([buildWordVector(line, n_dim, w2v_model) for line in x_train])
    print("Train word_vector shape:", train_vecs.shape)
    test_vecs = np.concatenate([buildWordVector(line, n_dim, w2v_model) for line in x_test])
    print("Test word_vector shape:", test_vecs.shape)
    # Save the trained model
    w2v_model.save('D:\\learning data\\project\\mail gateway\\zh_cnn_text_classify-master\\test_model.pkl')
    return train_vecs, test_vecs

text = ["I am very happy"]
with open('D:\\learning data\\project\\emotion analysis\\stopwords.txt', encoding='utf8') as f:
    stopword = f.read().splitlines()
result = fenci1(text, stopword)
model = Word2Vec.load('D:\\learning data\\project\\mail gateway\\zh_cnn_text_classify-master\\test_model.pkl')
test_vec = np.concatenate([buildWordVector1(line, 5, model) for line in result])
print(test_vec)

Method of accumulation:

# Accumulation method: sum the word vectors of every non-stop word in the sentence
def buildWordVector1(sentence, size, w2v_model):
    data_vec = np.zeros(size).reshape((1, size))
    for word in sentence:
        try:
            data_vec += w2v_model[word].reshape((1, size))
        except KeyError:
            continue
    print(data_vec)
    return data_vec

1.2 average method

The average method is similar to the accumulation method in that the vectors of all non-stop words in the sentence are added up, but at the end the summed vector is divided by the number of non-stop words. The sentence vector is obtained by:

Vsentence = (Vword1 + Vword2 + ... + Vwordn) / n

[[-1.26548349 -0.87664849 0.22876217 -0.91911171 1.62688893]]
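The same sanity check works here: dividing the summed vector from the accumulation example by the number of non-stop words (3) reproduces the result above (a minimal sketch):

import numpy as np

# summed vector of "I", "very", "happy" from the accumulation example
v_sum = np.array([-3.79645047, -2.62994546, 0.68628651, -2.75733513, 4.8806668])

# average method: divide by the number of non-stop words
v_sentence = v_sum / 3
print(v_sentence)  # ≈ [-1.26548349 -0.87664849 0.22876217 -0.91911171 1.62688893]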

Code:

# Average method: sum the word vectors and divide by the number of non-stop words
def buildWordVector(sentence, size, w2v_model):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in sentence:
        try:
            vec += w2v_model[word].reshape((1, size))
            count += 1
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

1.3 TF-IDF weighted average method

The TF-IDF weighted average method relies on TF-IDF, a common text-processing technique. The TF-IDF model is often used to evaluate how important a word is to a document and is widely used in search and information retrieval. The TF-IDF value of a word is proportional to its frequency in the document and inversely proportional to its frequency in the corpus. TF-IDF is obtained by multiplying TF (Term Frequency) by IDF (Inverse Document Frequency).

The TF-IDF weighting method needs not only the word vector of each non-stop word in the sentence but also the TF-IDF value of each non-stop word. The TF part is easy to compute, while the IDF part depends on which corpus is used: for query retrieval, the corpus for the IDF part is the set of all query sentences; for self-similar text clustering, it is all the sentences to be clustered. The TF-IDF weighted sentence vector is then obtained as follows:

Vsentence = TFIDFword1 * Vword1 + TFIDFword2 * Vword2 + ... + TFIDFwordn * Vwordn

Suppose "I am very happy" is a query; then the corpus for computing TF-IDF is the set of all query sentences. Assume there are 100 query sentences in total: 60 of them contain the word "I", 65 contain the word "very", and 7 contain the word "happy". The TF-IDF value of each non-stop word in this sentence is then:

I: 1/(1+1+1) * log(100/(1+60))

very: 1/(1+1+1) * log(100/(1+65))

happy: 1/(1+1+1) * log(100/(1+7))

Therefore, the TF-IDF weighted sentence vector of this sentence is:

Vsentence = TFIDFI * VI + TFIDFvery * Vvery + TFIDFhappy * Vhappy
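As a quick check, these three weights can be computed directly (a minimal sketch; natural log is used here, and the choice of log base only changes the overall scale):

import math

tf = 1 / (1 + 1 + 1)   # each word appears once in a three-word sentence
w_i     = tf * math.log(100 / (1 + 60))
w_very  = tf * math.log(100 / (1 + 65))
w_happy = tf * math.log(100 / (1 + 7))
print(w_i, w_very, w_happy)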

TF-IDF code:

import math
from collections import Counter
from collections import defaultdict

s1_words = ['today', 'on', 'NLP', 'course']
s2_words = ['today', 'the', 'course', 'a', 'meaning']
s3_words = ['data', 'course', 'and', 'a', 'meaning']
data_set = [s1_words, s2_words, s3_words]
word_dict = ['today', 'on', 'NLP', 'the', 'course', 'a', 'meaning', 'data', 'and']

N = len(data_set)            # total number of documents
In_doc = defaultdict(int)    # number of documents containing each word
for word in word_dict:
    for doc in data_set:
        if word in doc:
            In_doc[word] += 1

tfidfs_all = []
for doc in data_set:
    cont = Counter(doc)
    tfidf = []
    for word in word_dict:
        if word not in cont:
            tfidf.append(0)
        else:
            tf = cont[word] / len(doc)                 # term frequency in this document
            idf = math.log(N / (1 + In_doc[word]))     # inverse document frequency
            tfidf.append(tf * idf)
    tfidfs_all.append(tfidf)
print(tfidfs_all)
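The snippet above only computes the TF-IDF weights. Below is a minimal sketch of how those weights could then be combined with word vectors to build the weighted sentence vector of the formula above; word_vectors and tfidf_weights are assumed to be dicts mapping each word to its vector (e.g. from a trained Word2vec model) and to its TF-IDF value, respectively:

import numpy as np

def build_tfidf_sentence_vector(sentence, size, word_vectors, tfidf_weights):
    # Vsentence = sum over words of TFIDF_word * V_word
    vec = np.zeros(size).reshape((1, size))
    for word in sentence:
        if word in word_vectors and word in tfidf_weights:
            vec += tfidf_weights[word] * word_vectors[word].reshape((1, size))
    return vec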

1.4 SIF embedding method

The SIF (smooth inverse frequency) weighted average method is similar to the TF-IDF weighted average method in that it builds a good sentence vector from the word vectors of the individual words. The SIF embedding method additionally uses principal component analysis and the estimated probability of each word. The specific steps of the SIF embedding method are as follows:

First, the inputs of the algorithm are: (1) the word vector of each word, (2) all sentences in the corpus, (3) an adjustable parameter a, and (4) the estimated probability of each word.

The output of the whole algorithm is: a sentence vector

The specific steps of the algorithm are as follows: (1) Get the preliminary sentence vector

Traverse each sentence in the corpus; for the current sentence s, compute its preliminary sentence vector with the following formula:

Vs = (1/|s|) * Σ(w ∈ s) [a / (a + p(w))] * Vw

In this weighted average, each word vector is multiplied by the coefficient a/(a+p(w)) and the results are summed; the summed vector is then divided by the number of words in sentence s. For the adjustable parameter a, the authors used 0.001 and 0.0001. p(w) is the unigram probability of the word in the whole corpus, i.e. the frequency of word w divided by the total frequency of all words in the corpus.

(2) Principal component calculation: the first principal component u of all preliminary sentence vectors is computed by principal component analysis.

(3) The target sentence vector is obtained by post-processing each preliminary sentence vector: the projection onto the first principal component u is subtracted, i.e. Vs = Vs - u uᵀ Vs.
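A minimal sketch of these three steps is given below. It assumes word_vectors maps each word to its vector and word_prob maps each word to its estimated unigram probability p(w); sklearn's TruncatedSVD is used here to obtain the first principal component:

import numpy as np
from sklearn.decomposition import TruncatedSVD

def sif_embedding(sentences, word_vectors, word_prob, size, a=0.001):
    # step 1: weighted average with coefficient a / (a + p(w))
    vs = np.zeros((len(sentences), size))
    for i, sentence in enumerate(sentences):
        words = [w for w in sentence if w in word_vectors]
        for w in words:
            vs[i] += (a / (a + word_prob[w])) * word_vectors[w]
        if words:
            vs[i] /= len(words)

    # step 2: first principal component u of all preliminary sentence vectors
    svd = TruncatedSVD(n_components=1, n_iter=7, random_state=0)
    svd.fit(vs)
    u = svd.components_[0]

    # step 3: subtract the projection onto u from every preliminary sentence vector
    vs = vs - np.outer(vs.dot(u), u)
    return vs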

I have only roughly run the official code (linked below), which mainly targets English sentences and words; I will study it more carefully when I get a chance to use it in practice.

Code: github.com/PrincetonML…

2. Obtaining the sentence vector directly

One of the biggest defects of the methods above is that they ignore the influence of word order on the meaning of the whole sentence.

The commonly used methods for obtaining a sentence vector directly are Doc2vec and BERT.

2.1 doc2vec

Doc2vec is an improvement on Word2vec that considers not only the semantics of words but also word order. Doc2vec has two models:

PV-DM: Distributed Memory Model of Paragraph Vectors

PV-DBOW: Distributed Bag of Words version of Paragraph Vector. The PV-DM model predicts the probability of a word given its context and the document vector.

For the specific principles, please refer to other materials; they are not covered in detail here.

The Doc2vec DM model is similar to the Word2vec CBOW model, and the DBOW model is similar to Word2vec skip-gram. Doc2vec trains vectors of the same length for paragraphs of different lengths; paragraph vectors are not shared between different paragraphs, while word vectors have the same meaning across the training set and are shared.

Code:

# coding:utf-8
import jieba
import gensim
from gensim.models.doc2vec import Doc2Vec

TaggededDocument = gensim.models.doc2vec.TaggedDocument

# Read the corpus and segment each line with jieba
def cut_files(path):
    f = open(path, "r", encoding='utf-8-sig')
    res = []
    text = []
    for line in f:
        res.append(line.split(' ', 1)[1])
    for line in res:
        curLine = ' '.join(list(jieba.cut(line)))
        text.append(curLine)
    return text

# Wrap each segmented sentence as a TaggedDocument for Doc2vec training
def get_datasest(data):
    x_train = []
    for i, text in enumerate(data):
        word_list = text.split(' ')
        l = len(word_list)
        word_list[l - 1] = word_list[l - 1].strip()
        document = TaggededDocument(word_list, tags=[i])
        x_train.append(document)
    return x_train

# Train the Doc2vec (DM) model
def train(x_train, size=50):
    model_dm = Doc2Vec(x_train, min_count=5, window=3, size=size,
                       sample=1e-3, negative=5, workers=4)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=70)
    model_dm.save('model_dm_doc2vec')
    return model_dm

# Infer a vector for new sentences and find the most similar training documents
def test():
    model_dm = Doc2Vec.load("model_dm_doc2vec")
    test_text = ["I don't want to work", "I like you"]
    text = []
    for line in test_text:
        curLine = ' '.join(list(jieba.cut(line)))
        text.append(curLine)
    inferred_vector_dm = model_dm.infer_vector(text)
    sims = model_dm.docvecs.most_similar([inferred_vector_dm], topn=10)
    return sims

if __name__ == '__main__':
    path = "D:\\learning materials\\emotional analysis\\ng.txt"
    path1 = "D:\\learning materials\\emotional analysis\\pos.txt"
    text1 = cut_files(path)
    text2 = cut_files(path1)
    text = text1 + text2
    x_train = get_datasest(text)
    model_dm = train(x_train)
    sims = test()
    for count, sim in sims:
        sentence = x_train[count]
        words = ''
        for word in sentence[0]:
            words = words + word + ' '
        print(words, sim, len(sentence[0]))

2.2 BERT

The biggest disadvantage of obtaining sentence vectors from word vectors through the various post-processing methods above is that they cannot understand the semantics of the context. The same word may have different meanings in different contexts but is still represented by the same word vector, which hurts the semantic computation of sentences containing polysemous words. Sentence vectors generated by BERT do not have this defect: the vector of the same word differs across contexts, so polysemous words produce different vectors depending on their context, and the sentence vectors produced by BERT can genuinely be applied to semantic computation. The advantage of BERT-generated sentence vectors is not only that sentence meaning can be captured, but also that the error introduced by weighting word vectors is eliminated.

Code:

from bert_serving.client import BertClient

def main():
    bc = BertClient()
    # Encode several sentences; each returns a fixed-length sentence vector
    doc_vecs = bc.encode(['The sky is blue today and it is sunny',
                          'The weather is good today, sunny',
                          'How is the weather now',
                          'Natural language processing (NLP)',
                          'Machine learning tasks'])
    print(len(doc_vecs[0]))

if __name__ == '__main__':
    main()

At present, the common way to use BERT is to download a pre-trained Chinese model from GitHub, call it, and obtain the vectors from it.

BERT is built on the Transformer, using the encoder part of the Transformer. So how does it solve the problem that ordinary language models can only use information from one direction? The answer is that its pretraining does not train an ordinary language model but a masked language model.
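To make the idea of a masked language model concrete, here is a toy sketch of the masking step used in pretraining: a fraction of the input tokens (about 15% in BERT) is replaced by a [MASK] token, and the model is trained to predict the original tokens at those positions. The tokens and masking rate below are illustrative only:

import random

def mask_tokens(tokens, mask_prob=0.15):
    # randomly replace ~15% of tokens with [MASK] and remember the originals as labels
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok
            masked.append('[MASK]')
        else:
            masked.append(tok)
    return masked, targets

print(mask_tokens(['I', 'am', 'very', 'happy', 'today']))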

The BERT model requires a fixed sequence length, such as 128. Shorter inputs are padded and longer inputs have their extra tokens cut off, so that the input is always a fixed-length token sequence.
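A minimal sketch of that padding/truncation step (the pad token and length here are illustrative):

def pad_or_truncate(tokens, max_seq_len=128, pad_token='[PAD]'):
    # force the token sequence to a fixed length: pad short inputs, cut off extra tokens
    if len(tokens) >= max_seq_len:
        return tokens[:max_seq_len]
    return tokens + [pad_token] * (max_seq_len - len(tokens))

print(len(pad_or_truncate(['I', 'am', 'very', 'happy'])))  # 128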

I will study BERT in more detail when I have time.