Word sense disambiguation is the basis of sentence and text understanding, so it must be solved. Languages contain a large number of words with multiple meanings. The problem can be tackled with machine learning in two ways: supervised classification algorithms determine which sense an occurrence of a word takes, while unsupervised clustering algorithms group occurrences of a word into multiple clusters, one sense per cluster.

Supervised word sense disambiguation methods. The mutual-information-based method compares two languages: a disambiguation model can be trained on a large parallel Chinese-English corpus. In information theory, mutual information measures how much information one random variable contains about another (here, how much the English side tells us about the Chinese side). Given two random variables X and Y with distributions p(x) and p(y) and joint distribution p(x, y), mutual information is

I(X; Y) = ∑_x ∑_y p(x, y) log( p(x, y) / (p(x) p(y)) ).

Equivalently, mutual information is the reduction in uncertainty (entropy) about one variable once the other is known: I(X; Y) = H(X) − H(X | Y). The corpus is iterated over during training, I(X; Y) decreases continuously, and the algorithm terminates when I(X; Y) stops decreasing. Mutual-information-based disambiguation works well for machine translation systems. Disadvantages: bilingual corpora are limited, and only ambiguities that differ between the two languages can be resolved (if the same word carries the same ambiguity in both Chinese and English, the method fails).
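As a quick illustration of the formula, here is a minimal Python sketch that computes I(X; Y) from a joint distribution; the word-alignment probabilities in the toy example are made up for illustration only.

```python
import math

def mutual_information(joint):
    """I(X; Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ).

    `joint` maps (x, y) pairs to probabilities and must sum to 1.
    """
    # Marginal distributions p(x) and p(y)
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Toy joint distribution over a Chinese word and two English translations
joint = {("zh_1", "bank_fin"): 0.40, ("zh_1", "river_bank"): 0.10,
         ("zh_2", "bank_fin"): 0.05, ("zh_2", "river_bank"): 0.45}
print(mutual_information(joint))
```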

Disambiguation method based on a Bayesian classifier. The idea is conditional probability over context: the sense of a polysemous word is determined by the context around it. Let c be the context, s a sense, and w the polysemous word. The probability that w takes sense s in context c is p(s | c), and by Bayes' rule p(s | c) = p(c | s) p(s) / p(c). We pick the sense s with the maximum p(s | c); since p(c) is fixed for a given context, it suffices to maximize the numerator: ŝ = argmax_s p(c | s) p(s). In natural language the context c is made up of words; writing it as a collection of words v and assuming they are conditionally independent given the sense, ŝ = argmax_s p(s) ∏_v p(v | s).

p(s) is the probability that the polysemous word w takes sense s, computed by maximum likelihood estimation over a large corpus: p(s) = N(s) / N(w), where N(s) counts occurrences of w with sense s and N(w) counts all occurrences of w. p(v | s) is the conditional probability of context word v given that w has sense s: p(v | s) = N(v, s) / N(s). After training p(s) and p(v | s), disambiguating an occurrence of w means finding the sense with the maximum p(c | s) p(s).
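A minimal naive Bayes disambiguation sketch built from the counts above. The training snippets are toy data, and the add-one smoothing is an extra assumption (not part of the derivation above) to keep unseen context words from zeroing out the product.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes word sense disambiguation from a sense-labeled corpus."""

    def __init__(self):
        self.sense_count = Counter()                   # N(s)
        self.word_given_sense = defaultdict(Counter)   # N(v, s)

    def train(self, labeled_contexts):
        # labeled_contexts: iterable of (sense, [context words])
        for sense, context in labeled_contexts:
            self.sense_count[sense] += 1
            for v in context:
                self.word_given_sense[sense][v] += 1

    def disambiguate(self, context):
        total = sum(self.sense_count.values())         # N(w)
        vocab = {v for c in self.word_given_sense.values() for v in c}
        best, best_score = None, float("-inf")
        for s, n_s in self.sense_count.items():
            score = math.log(n_s / total)              # log p(s) = log N(s)/N(w)
            denom = sum(self.word_given_sense[s].values()) + len(vocab)
            for v in context:
                # add-one smoothed p(v | s)
                score += math.log((self.word_given_sense[s][v] + 1) / denom)
            if score > best_score:
                best, best_score = s, score
        return best

wsd = NaiveBayesWSD()
wsd.train([("finance", ["money", "loan", "interest"]),
           ("river", ["water", "fishing", "shore"])])
print(wsd.disambiguate(["loan", "money"]))   # -> "finance"
```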

Unsupervised word sense disambiguation methods. Completely unsupervised disambiguation is impossible, because senses cannot be defined without labels; what an unsupervised method can do is word sense discrimination. One approach is a Bayesian classifier whose parameters are not estimated from a labeled training corpus. Instead, p(v | s) is randomly initialized, and the EM algorithm estimates the probabilities: compute p(c | s) for each context c of w, obtain the likelihood of the observed data, re-estimate p(v | s), recompute the likelihood, and keep updating the model parameters until a final classification model is obtained that can group word occurrences. Ambiguous words are then assigned to different classes in different contexts. Another approach is based on monolingual context vectors, compared by vector similarity: cos(a, b) = ∑ a_i b_i / sqrt(∑ a_i² ∑ b_i²).
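A one-function sketch of the context-vector similarity just given, with made-up bag-of-words vectors:

```python
import math

def cosine(a, b):
    """cos(a, b) = sum(a_i * b_i) / sqrt(sum(a_i^2) * sum(b_i^2))"""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Two bag-of-words context vectors for occurrences of an ambiguous word
print(cosine([1, 0, 2, 1], [1, 1, 1, 0]))
```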

Shallow semantic labeling is an effective language analysis technique. The semantic-role-based shallow analysis method describes the relations between the semantic roles in a sentence: the predicate, the agent, the patient, the time of the event, quantities, and so on. Semantic role labeling extracts this role information, so that a computer can pull out the important structural information and understand the meaning of the language.

Semantic role labeling relies on syntactic analysis, which includes phrase structure parsing, shallow (chunk) parsing, and dependency parsing; accordingly there are semantic role labeling methods based on the phrase structure tree, on shallow parsing results, and on dependency parsing results. The process is: parsing -> argument pruning -> argument identification -> argument labeling -> semantic role labeling result. Argument pruning removes, from the many candidates, constituents that are definitely not arguments. Argument identification is binary classification: argument or not an argument. Argument labeling is multi-class classification.

Semantic role labeling based on the phrase structure tree. The phrase structure tree expresses structural relations, and the labeling process relies on those relations, so fairly involved strategies are designed. The argument pruning strategy: semantic roles are centered on the predicate, so the walk through the phrase structure tree is centered on the predicate node. Starting from it, at each level, any sibling of the current node that is not in a syntactic coordination relation with it becomes a candidate argument; then move up to the parent node and repeat (a sketch follows below). Argument identification is binary classification, machine-learned from an annotated corpus; the typical features are fixed: the predicate itself, the path in the phrase structure tree, the phrase type, the argument's position relative to the predicate, the predicate's voice, the argument's head word, the dependency category, the first and last words of the argument, and combinations of these features. Argument labeling uses a machine-learned multi-class classifier.
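A minimal sketch of this pruning walk. The `Node` class and the `is_coordinated` test are hypothetical stand-ins; a real system would check conjunctions and punctuation between constituents.

```python
class Node:
    """Phrase-structure tree node (hypothetical minimal representation)."""
    def __init__(self, label, children=None):
        self.label = label
        self.parent = None
        self.children = children or []
        for c in self.children:
            c.parent = self

def is_coordinated(a, b):
    # Placeholder coordination test; a real system would look for a
    # conjunction or comma marking the two constituents as juxtaposed.
    return a.label == "CC"

def prune_candidates(predicate_node):
    """Walk from the predicate node up to the root; at each level, keep
    the siblings of the current node that are not coordinated with it."""
    candidates, node = [], predicate_node
    while node.parent is not None:
        for sib in node.parent.children:
            if sib is not node and not is_coordinated(sib, node):
                candidates.append(sib)
        node = node.parent
    return candidates

# Toy tree: (S (NP the company) (VP (V acquired) (NP the startup)))
v, np_obj = Node("V"), Node("NP")
vp = Node("VP", [v, np_obj])
np_subj = Node("NP")
root = Node("S", [np_subj, vp])
print([n.label for n in prune_candidates(v)])   # -> ['NP', 'NP']
```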

Semantic role labeling based on dependency parsing results (and chunk-based labeling). Argument pruning differs with the underlying syntactic structure. Labeling based on dependency parsing can extract predicate-argument relations directly from the dependency tree. The pruning strategy: take the predicate as the current node; all children of the current node are candidate arguments; then make the parent of the current node the new current node and repeat, until the root is reached (sketched below). The feature design for argument identification in the dependency-based method uses more features of the parent and child nodes.
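A small sketch of that pruning loop over a hypothetical dependency tree, represented here simply as head and child maps:

```python
def prune_dependency_candidates(predicate, head_of, children_of):
    """Candidate arguments per the strategy above: take all children of the
    current node, then move to its parent, repeating until the root."""
    candidates, node = [], predicate
    while node is not None:
        candidates.extend(c for c in children_of.get(node, []) if c != predicate)
        node = head_of.get(node)   # parent of the current node; None at the root
    return candidates

# Hypothetical dependency tree for "The company acquired the startup quickly"
head_of = {"company": "acquired", "The": "company",
           "startup": "acquired", "the": "startup", "quickly": "acquired"}
children_of = {"acquired": ["company", "startup", "quickly"],
               "company": ["The"], "startup": ["the"]}
print(prune_dependency_candidates("acquired", head_of, children_of))
# -> ['company', 'startup', 'quickly']
```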

Fusion methods, such as weighted summation and interpolation, can combine the outputs of the different labeling methods.

At present, semantic role labeling does not perform very well: it depends heavily on the accuracy of syntactic analysis, and its domain adaptability is poor. A newer approach uses bilingual parallel corpora to compensate for the accuracy problem, but at a much higher cost.

Neither Google nor Baidu can do without the TF-IDF algorithm for information retrieval; it is simple and effective, but it lacks semantic features.

TF-IDF. Term frequency (TF) is how often a word appears in a document. Inverse document frequency (IDF) is based on the number of documents in which the word appears. A word that appears equally often in a short document and a long document is worth more to the short document. A word with a low overall probability of occurrence, once it appears in a document, is worth more than commonly occurring words. Using TF-IDF with the vector space model for similarity calculation is very effective in information retrieval; it used to be one of Google's killer technologies. For chatbots, however, it only considers isolated words, without any semantic information.
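A minimal sketch of one common TF-IDF variant (tf normalized by document length, idf = log(N / df); other weightings and smoothing schemes exist). The documents are toy data.

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF weights: (tf(w, d) / len(d)) * log(N / df(w)).

    docs: list of tokenized documents (lists of words).
    """
    n_docs = len(docs)
    df = Counter()                 # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (tf[w] / len(doc)) * math.log(n_docs / df[w])
                        for w in tf})
    return weights

docs = [["search", "engine", "ranking"],
        ["search", "query", "terms"],
        ["neural", "ranking", "models", "ranking"]]
for w in tfidf(docs):
    print(w)
```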

Latent semantic indexing model. In the TF-IDF model, all words span a high-dimensional semantic space and each document is mapped to a point in it. The dimensionality is generally very high, and because each word occupies its own dimension, the relations between words are severed. Latent semantic indexing instead treats words and documents as equals and constructs a low-dimensional semantic space in which each word and each document is mapped to a point. Mathematically it involves document probabilities, word probabilities, and their joint probability. Assume a set of latent classes z sitting between documents and words: choose a document with probability p(d), choose a latent class with probability p(z | d), then generate a word w with probability p(w | z). The joint probability p(d, w) is estimated from the observed data; z is a latent variable that expresses a semantic feature. The observed p(d, w) is used to estimate p(d), p(z | d), and p(w | z); from these a more accurate p(d, w) is derived, which gives the correlation between words and documents. The optimization objective is the log-likelihood function L = ∑_d ∑_w n(d, w) log p(d, w). Here p(d, w) = p(d) p(w | d) and p(w | d) = ∑_z p(w | z) p(z | d); using p(z | d) = p(z) p(d | z) / ∑_z p(z) p(d | z), this simplifies to p(d, w) = ∑_z p(z) p(w | z) p(d | z).

The EM algorithm follows the maximum likelihood principle: first randomly initialize the distribution parameters, assign each observation to a class according to the current distributions, recount according to the assignment, re-estimate the parameters by maximum likelihood, then reassign, adjust the parameters, and re-estimate, finally arriving at the optimal solution. E-step: for each training observation (d, w), compute the class posterior p(z | d, w) from the current p(z), p(d | z), and p(w | z):

p(z | d, w) = p(z) p(d | z) p(w | z) / ∑_z' p(z') p(d | z') p(w | z'),

where the numerator is for one z and the denominator sums over all z. M-step: with the observations classified, use the counts n(d, w) to update the parameters by maximum likelihood:

p(z) = (1/R) ∑_{d,w} n(d, w) p(z | d, w), where R = ∑_{d,w} n(d, w);

p(d | z) = ∑_w n(d, w) p(z | d, w) / ∑_d ∑_w n(d, w) p(z | d, w), the numerator over a single d and the denominator over all d;

p(w | z) = ∑_d n(d, w) p(z | d, w) / ∑_d ∑_w n(d, w) p(z | d, w), the numerator over a single w and the denominator over all w.

Then recompute p(z | d, w) and repeat the E and M steps to maximize the log-likelihood L = ∑_d ∑_w n(d, w) log p(d, w). These iterations yield p(d, w), the relevance between each word and each document, which is then used for retrieval.
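A compact NumPy sketch of this EM loop on a toy count matrix; the matrix values, topic count, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def plsa(n_dw, n_topics, n_iter=100, seed=0):
    """EM for PLSA on a document-word count matrix n(d, w).

    Maximizes L = sum_{d,w} n(d,w) log p(d,w),
    with p(d,w) = sum_z p(z) p(d|z) p(w|z).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    R = n_dw.sum()
    for _ in range(n_iter):
        # E-step: p(z|d,w) proportional to p(z) p(d|z) p(w|z)
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        p_z_dw = joint / joint.sum(axis=0, keepdims=True)      # shape (z, d, w)
        # M-step: re-estimate parameters from expected counts n(d,w) p(z|d,w)
        weighted = n_dw[None, :, :] * p_z_dw
        p_z = weighted.sum(axis=(1, 2)) / R                    # p(z)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)              # p(d|z)
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)              # p(w|z)
    return p_z, p_d_z, p_w_z

n_dw = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 2, 3]], dtype=float)
p_z, p_d_z, p_w_z = plsa(n_dw, n_topics=2)
p_dw = np.einsum("z,zd,zw->dw", p_z, p_d_z, p_w_z)   # reconstructed p(d, w)
print(p_dw)
```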

Word-word correlation: multiply p(w, d) by its transpose, p(w, w) = p(w, d) × p(w, d)^T. The query keywords form a word vector Wq, and document d is represented by a word vector Wd. The relevance of the query to document d is R(query, d) = Wq × p(w, w) × Wd. Sorting the documents from most to least relevant gives the search result.
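A short sketch of that scoring, with a toy p(w, d) matrix standing in for the one produced by the PLSA iterations above:

```python
import numpy as np

# p(w, d): word-document relevance from the PLSA model (toy values here)
p_wd = np.array([[0.20, 0.15, 0.00],
                 [0.05, 0.10, 0.00],
                 [0.10, 0.00, 0.15],
                 [0.00, 0.00, 0.25]])
p_ww = p_wd @ p_wd.T                  # p(w, w) = p(w, d) x p(w, d)^T

Wq = np.array([1.0, 0.0, 1.0, 0.0])   # query as a bag-of-words vector
Wd = p_wd.T                           # each document as a word vector
scores = Wd @ p_ww @ Wq               # R(query, d) = Wq x p(w, w) x Wd
ranking = np.argsort(scores)[::-1]    # documents, most relevant first
print(scores, ranking)
```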

Compared with TF-IDF, the latent semantic indexing model adds semantic information, takes word-word relations into account, and retrieves according to meaning, so it is better suited to corpus training and analysis in chatbot development. TF-IDF is better suited to retrieval over completely independent words, that is, to pure text search engines.

References:

Python Natural Language Processing

http://www.shareditor.com/blogshow?blogId=88

http://www.shareditor.com/blogshow?blogId=89

http://www.shareditor.com/blogshow?blogId=90

Recommendations for machine learning opportunities in Shanghai are welcome. My WeChat ID is Qingxingfengzi.