Ranking is at the core of recommendation, search, and advertising, and Learning to Rank (LTR) applies supervised machine learning to ranking tasks. LTR therefore still follows the traditional machine-learning paradigm: construct features, define a learning objective, train a model, and predict. LTR methods generally fall into three types: PointWise, PairWise, and ListWise. These are not three specific algorithms but three design ideas, and their main differences lie in the loss function, the labeling scheme, and the optimization method.

## 1. PointWise

For search tasks, PointWise considers only the absolute relevance of the current query to each document, ignoring the relevance of the other documents to the query. PointWise methods usually encode each document as a feature vector, train a classification or regression model on the training data, and score documents directly in the prediction stage. Ranking the documents by this score gives the search result.

The processing logic is as follows:

- Implementation details

The training data are triples $(q_i, d_j, y_{ij})$, where $y_{ij}$ is binary, indicating relevant or irrelevant. Train a binary classifier or a regression model to fit $y_{ij}$ directly. Loss function: a classification model can use cross entropy; a regression model can use mean squared error (MSE). In the prediction stage, the score is used directly for sorting.
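
A minimal sketch of this pipeline, using a toy least-squares linear regressor (all data and names here are illustrative, not from the article):

```python
import numpy as np

# Toy pointwise data: each row is a (query, document) feature vector,
# and y holds the binary relevance labels y_ij.
X = np.array([[1.0, 0.2], [0.3, 0.9], [0.8, 0.1], [0.1, 0.4]])
y = np.array([1.0, 0.0, 1.0, 0.0])

# Fit a linear model w by minimizing MSE (closed form via least squares);
# a classifier trained with cross entropy would serve the same role.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction stage: score each document, then sort by score descending.
scores = X @ w
ranking = np.argsort(-scores)
print(ranking)
```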

- The problem of PointWise

- PointWise only considers the relevance $sim_{q,d}$ between the query and a single document, and ignores the relationships among the candidate documents. Since the goal is to rank the candidates, what we really want is a relative score, and using $sim_{q,d}$ directly is often not accurate enough. In fact, $sim_{q,d}$ is only the probability that a document is relevant, not the probability of the true relative order.
- PointWise does not account for dependencies among the documents belonging to the same query. This leads to two problems: 1. The samples in the input space are not independent and identically distributed (i.i.d.), which violates a basic assumption of machine learning. 2. When different queries have different numbers of documents, the overall loss is easily dominated by the query groups with more documents (training data).
- Ranking cares about the accuracy of the top-K results, so the loss function needs to incorporate relative position information.

## 2. PairWise

The basic idea of PairWise is to compare samples in pairs, build partially ordered document pairs, and learn the order from these comparisons. As the PointWise analysis showed, what we need for a query is the correct order of the search results, not their relevance scores against the query. PairWise aims to recover the correct overall order by correctly estimating the order of each pair of documents. For example, if the correct order is "A>B>C", PairWise learns it through the pairwise relations "A>B", "B>C", and "A>C".
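
The decomposition of a total order into pairwise relations can be sketched as follows (the helper name is my own):

```python
from itertools import combinations

def order_to_pairs(ranking):
    """Decompose a total order like ['A', 'B', 'C'] (best first)
    into all partial-order pairs (winner, loser)."""
    return list(combinations(ranking, 2))

print(order_to_pairs(['A', 'B', 'C']))
# -> [('A', 'B'), ('A', 'C'), ('B', 'C')]
```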

The processing logic is as follows:

- Implementation details

The training data format is $(q_i, d_i^+, d_i^-)$: a positive and a negative example for the query, also commonly called (anchor, positive, negative). PairWise is actually a metric-learning approach: it learns relative distances directly, regardless of the absolute values. Two kinds of loss function are common: 1. A pairwise (contrastive) ranking loss over an input pair: $L(r_0, r_1, y) = y \, d(r_0, r_1) + (1 - y) \max(0, \text{margin} - d(r_0, r_1))$, where $y$ takes the value 0 or 1. 2. A triplet loss over an input triplet: $L(r_a, r_p, r_n) = \max(0, \text{margin} + d(r_a, r_p) - d(r_a, r_n))$. As with PointWise, the scores are used directly for sorting.
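
The two losses can be sketched in a few lines (an illustration only, taking $d$ to be Euclidean distance; the function names are hypothetical):

```python
import numpy as np

def pair_ranking_loss(r0, r1, y, margin=1.0):
    # y = 1: the pair should be close; y = 0: push the pair apart
    # until their distance exceeds the margin.
    d = np.linalg.norm(r0 - r1)
    return y * d + (1 - y) * max(0.0, margin - d)

def triplet_loss(ra, rp, rn, margin=1.0):
    # Pull the positive closer to the anchor than the negative, by margin.
    return max(0.0, margin + np.linalg.norm(ra - rp) - np.linalg.norm(ra - rn))
```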

- The problem of PairWise

- Datasets must be constructed in pair format, and the number of pairs can be many times the number of docs (depending on the construction strategy). The PointWise problem that "when different queries have different numbers of documents, the overall loss is easily dominated by query groups with more documents (training data)" therefore still exists, and is even amplified.
- PairWise is more sensitive to label noise than PointWise: a single mislabeled document corrupts every pair it participates in.
- PairWise still only considers the relative position within one pair of docs; the loss function still ignores the relationships among all the candidate documents. It can be seen as an optimized version of PointWise; the basic idea is unchanged.
- Likewise, PairWise does not consider dependencies among the documents of the same query. As a result, the samples in the input space are still not independent and identically distributed (i.i.d.), which violates a basic assumption of machine learning.

## 3. ListWise

PointWise directly learns whether each sample is relevant, and PairWise learns which of two samples is more relevant, much like metric learning; both try to infer the global ranking from sampled, local comparisons. This line of thinking has a fundamental drawback. The basic idea of ListWise is instead to optimize ranking metrics such as NDCG directly, learning the best ranking as a whole.

- Implementation details

One input sample is a query together with all of its candidate docs. Given $q_i$, its candidate docs $C = (d_{i1}, \dots, d_{im})$ and labels $Y = (y_{i1}, \dots, y_{im})$. The label vector $Y$ encodes the order of all the candidate docs. For example, for a candidate set $\{a, d, c, b, e\}$, if the correct order is the natural (alphabetical) one, the corresponding labels are $\{5, 2, 3, 4, 1\}$. The model is trained with one of the various ListWise algorithms. In the prediction stage, sort by score.
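
One concrete listwise surrogate, a ListNet-style top-one cross-entropy, shows how a whole candidate list is scored at once (an illustrative sketch; the function names are my own):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def listnet_top_one_loss(scores, labels):
    # Cross entropy between the top-one probability distributions
    # induced by the true labels and by the model scores.
    p_true = softmax(labels)
    p_pred = softmax(scores)
    return -np.sum(p_true * np.log(p_pred))
```

The loss is minimized when the model's score distribution matches the label distribution, so the whole list, not a single doc or pair, drives each gradient step.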

- ListWise has three basic ideas:

- The first is measure-specific

This approach tries to optimize metrics such as NDCG directly. It is a typical case of "the ideal is appealing, the reality is harsh", because ranking metrics such as NDCG, MAP, and AUC are mathematically non-continuous and non-differentiable. Given this reality, there are usually three workarounds. First, find a continuous, differentiable surrogate function that approximates NDCG, and optimize NDCG by optimizing the surrogate. Representative algorithms: SoftRank and AppRank. Second, derive a mathematical bound on NDCG-type metrics and optimize the bound instead; for example, once an upper bound is derived, optimize by minimizing that upper bound. Representative algorithms: SVM-MAP and SVM-NDCG. Third, use optimization algorithms that can directly handle non-continuous, non-differentiable objectives. Representative algorithms: AdaRank and RankGP.

- The second is non-measure-specific

This approach reconstructs an order against a known optimal order and measures the gap between the two; the model is optimized to shrink that gap, for example by using KL divergence as the loss.

- The third designs a surrogate objective function whose core goal is still to optimize NDCG-class ranking metrics; with the surrogate in place, optimization and computation are handled in a PairWise fashion. Representative algorithms: LambdaRank and LambdaMART.

- ListWise’s pros and cons
- Constructing training data is difficult in many scenarios.
- Since sorting loss needs to be calculated, the computational complexity is usually higher.
- With enough high-quality data, ListWise tends to outperform PairWise and PointWise, because it learns and optimizes the target task, ranking, directly.

## 4. Common evaluation metrics

nDCG

For a detailed explanation of nDCG, see the reference at the end of this article.
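
As a minimal sketch of the standard definition (the $2^{rel} - 1$ gain with a $\log_2$ position discount is one common convention, stated here from general knowledge rather than from this article): DCG sums discounted graded gains, and nDCG normalizes by the ideal DCG.

```python
import numpy as np

def dcg_at_k(rels, k):
    # rels: graded relevance of the results, in ranked order.
    rels = np.asarray(rels, dtype=float)[:k]
    positions = np.arange(2, rels.size + 2)  # discount log2(i + 1), i from 1
    return np.sum((2 ** rels - 1) / np.log2(positions))

def ndcg_at_k(rels, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```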

Mean Average Precision (MAP)

In a ranking task, each query has a ranked list. As the name implies, MAP is the average of AP over all queries in the test set. Let's look at AP first:

$AP(\pi,l)=\frac{\sum^m_{k=1}{P@k*I_{\{ l_{\pi^{-1}(k)}=1\}}}}{m_1}$

$\pi$ denotes the ranked item list, i.e., the list of returned results. $m$ is the total number of items in the result list, and $m_1$ is the number of query-relevant items in it. $I_{\{l_{\pi^{-1}(k)}=1\}}$ indicates whether the label at position $k$ is relevant: 1 if relevant, 0 otherwise. $P@k$ is the top-k precision: $P@k(\pi, l) = \frac{\sum_{t \le k} I_{\{l_{\pi^{-1}(t)}=1\}}}{k}$
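
$P@k$ itself is a one-liner; a sketch matching the formula above (the helper name is hypothetical):

```python
def precision_at_k(ranked_list, ground_truth, k):
    # Fraction of the top-k ranked results that are relevant.
    topk = ranked_list[:k]
    return sum(1 for item in topk if item in ground_truth) / k
```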


Code implementation:

```python
def _ap(ranked_list, ground_truth):
    # ranked_list: ranked results, e.g. ['a', 'b', 'd', 'c', 'e']
    # ground_truth: list of relevant items, e.g. ['a', 'd']
    hits = 0
    sum_precs = 0.0
    for n, item in enumerate(ranked_list):
        if item in ground_truth:
            hits += 1
            sum_precs += hits / (n + 1.0)
    return sum_precs / max(1.0, len(ground_truth))
```

- Reference

- Search evaluation index — NDCG