Paper notes: Learning Deep Structured Semantic Models for Web Search using Clickthrough Data (2013)

background

This paper proposes a latent semantic model with a deep structure: documents and queries are projected into a low-dimensional space (each represented as a low-dimensional vector), and query-document similarity is computed from these vectors. The model is trained on clickthrough data. To handle the large vocabulary, the paper uses a letter n-gram word hashing technique.

Modeling

model

The overall model (also known as the two-tower model) is shown in the figure below. The inputs are the query and the document, each one-hot encoded, reduced in dimension with word hashing, and then passed through three fully connected layers to produce a 128-dimensional vector. The similarity between query and document is computed from these vectors and converted into a probability.

The activation function used in the paper is tanh, and the final similarity is the cosine similarity, as shown below.


$$R(Q, D)=\operatorname{cosine}\left(y_{Q}, y_{D}\right)=\frac{y_{Q}^{\top} y_{D}}{\left\|y_{Q}\right\|\left\|y_{D}\right\|}$$
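To make the architecture concrete, here is a minimal PyTorch sketch of the two towers and the cosine score. The layer sizes follow the paper's description (30k word-hashing input, two 300-unit hidden layers, 128-dim output, tanh activations); the separate query/document towers and the random inputs are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSSMTower(nn.Module):
    """One DSSM tower: word-hashing input -> 300 -> 300 -> 128, all with tanh."""
    def __init__(self, hash_dim=30000, hidden_dim=300, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hash_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, out_dim), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

query_tower, doc_tower = DSSMTower(), DSSMTower()

# x_q, x_d: word-hashed (letter-trigram count) vectors of a query and a document
x_q, x_d = torch.rand(1, 30000), torch.rand(1, 30000)
y_q, y_d = query_tower(x_q), doc_tower(x_d)

# R(Q, D) = cosine(y_Q, y_D)
r_qd = F.cosine_similarity(y_q, y_d)
print(r_qd.item())
```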

word hashing

Words are split into letter trigrams (groups of three letters, with # marking the beginning and end of a word). For example, good becomes #good#, which is split into #go, goo, ood, and od#. The result is then encoded as a 0/1 (count) vector, which greatly reduces the encoding space. Letter trigrams often correspond to English prefixes and suffixes, which tend to carry general semantics. The experiments in the paper show that groups of 3 letters give a good trade-off between the compression ratio and the probability of collisions.
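A minimal sketch of the trigram split and the resulting count vector. The toy trigram vocabulary below is only for illustration; in the paper the vocabulary is built from the whole corpus (around 30k trigrams).

```python
def letter_trigrams(word):
    """Split a word into letter trigrams, with '#' marking word boundaries:
    'good' -> '#good#' -> ['#go', 'goo', 'ood', 'od#']."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_hash_vector(word, trigram_index):
    """Encode a word as a sparse count vector over a trigram vocabulary."""
    vec = [0] * len(trigram_index)
    for g in letter_trigrams(word):
        if g in trigram_index:
            vec[trigram_index[g]] += 1
    return vec

# toy trigram vocabulary built from a few words
trigrams = sorted({g for w in ["good", "web", "search"] for g in letter_trigrams(w)})
trigram_index = {g: i for i, g in enumerate(trigrams)}

print(letter_trigrams("good"))            # ['#go', 'goo', 'ood', 'od#']
print(word_hash_vector("good", trigram_index))
```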

Experiments

For training, relevance is computed between the query and the document title, and the relevance scores of the documents returned for a query are normalized with a softmax, as shown below:


$$P(D \mid Q)=\frac{\exp (\gamma R(Q, D))}{\sum_{D^{\prime} \in \boldsymbol{D}} \exp \left(\gamma R\left(Q, D^{\prime}\right)\right)}$$
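Here γ is a smoothing factor set empirically in the paper. A minimal numeric sketch of this normalization (the cosine scores and the γ value below are made up):

```python
import numpy as np

def posterior(cos_scores, gamma=10.0):
    """Softmax posterior P(D | Q) over smoothed cosine scores R(Q, D).
    gamma is the smoothing factor; 10.0 is a placeholder, the paper tunes it
    on a held-out set."""
    scaled = gamma * np.asarray(cos_scores, dtype=float)
    exp = np.exp(scaled - scaled.max())   # subtract the max for numerical stability
    return exp / exp.sum()

# cosine scores of one query against a set of candidate documents
print(posterior([0.9, 0.3, 0.1, -0.2, 0.0]))
```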

The candidate set consists of $D^{+}$ and $D^{-}$, where $D^{+}$ is a clicked document and $D^{-}$ are unclicked documents. For each clicked document, four unclicked documents $\{D_{j}^{-};\ j=1,\dots,4\}$ are sampled. The model is trained by maximum likelihood estimation, i.e. by minimizing the negative log-likelihood below, using SGD.


$$L(\Lambda)=-\log \prod_{\left(Q, D^{+}\right)} P\left(D^{+} \mid Q\right)$$
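A minimal sketch of this objective for a single (query, clicked document) pair, with 1 clicked document and 4 sampled unclicked ones. The cosine values and the γ value are made up, and the SGD line refers to the hypothetical towers from the model sketch above.

```python
import torch
import torch.nn.functional as F

def dssm_loss(cos_pos, cos_negs, gamma=10.0):
    """-log P(D+ | Q) where the candidate set is the clicked document plus the
    4 sampled unclicked documents.
    cos_pos:  scalar tensor, cosine(y_Q, y_{D+})
    cos_negs: tensor of shape (4,), cosines against the sampled D-
    gamma:    smoothing factor (10.0 is a placeholder)."""
    logits = gamma * torch.cat([cos_pos.view(1), cos_negs])
    # index 0 is the clicked document; cross_entropy returns -log softmax(logits)[0]
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))

# illustrative values for one training pair
loss = dssm_loss(torch.tensor(0.8), torch.tensor([0.2, 0.1, -0.3, 0.0]))
print(loss.item())

# In training, this loss is summed over (Q, D+) pairs and minimized with SGD, e.g.
# torch.optim.SGD(list(query_tower.parameters()) + list(doc_tower.parameters()), lr=...)
```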

The experimental results are given in the paper; see the article for the comparison methods and parameter settings. The evaluation metric is NDCG, which measures the quality of the ranking (higher is better; see the NDCG reference in the resources below). Methods 1-8 in the experiments come from other work, and 9-12 are variants of the DSSM model proposed in this paper (differing in specific details).
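For reference, a minimal sketch of NDCG@k. Several formulations exist; this one uses the exponential-gain variant common in web-search evaluation, and the toy relevance labels are made up.

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query. `relevances` are the graded relevance labels of the
    returned documents, in the order the model ranked them."""
    rel = np.asarray(relevances, dtype=float)

    def dcg(r):
        r = r[:k]
        return np.sum((2 ** r - 1) / np.log2(np.arange(2, r.size + 2)))

    ideal = dcg(np.sort(rel)[::-1])       # DCG of the perfect ordering
    return dcg(rel) / ideal if ideal > 0 else 0.0

# toy example: graded labels of the top results as ranked by a model
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=3))
```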

The specific conclusions are as follows

  1. The DSSM model proposed in this paper performs best (compare L-WH DNN with methods 1-9).
  2. Supervised training on clickthrough data gives better ranking results (compare DNN with DAE).
  3. Word hashing handles large vocabularies well: it improves the ranking metric while reducing the number of model parameters (compare L-WH DNN with DNN).
  4. Deep models represent semantic information better (unsupervised: compare DAE with LSA; supervised: compare L-WH DNN with L-WH Linear and L-WH Non-Linear).

conclusion

Deep semantic matching model DSSM and its variants

Advantages: DSSM uses character-level vectors as input, which reduces the dependence on word segmentation and improves the model's generalization ability, because the semantics expressed by each Chinese character can be reused. In contrast, a traditional input layer maps words directly with embeddings (e.g. Word2Vec word vectors) or topic models (e.g. LDA topic vectors) and then sums or concatenates the per-word vectors. Since Word2Vec and LDA are trained without supervision, this introduces errors into the overall model. DSSM is trained end-to-end with supervision, with no unsupervised mapping step in the middle, so its accuracy is comparatively higher.

Disadvantages: As mentioned above, DSSM uses a bag-of-words (BOW) representation, so word order and context information are lost. In addition, DSSM is a weakly supervised, end-to-end model, so its predictions are not easily controllable.

Resources

  1. Detailed explanation of the deep semantic matching model DSSM and its variants
  2. NDCG and its implementation