First of all, what are we talking about when we talk about this topic?

Why should I care?

Hey, friend, since you clicked in, you must be a rare natural talent, motivated, young and naive… ah no, you're just a good kid (or girl) who got caught by the keyword and wants to learn something. Come on, let's chat ~

With the rapid development of networks and automation, this era is flooded with unprecedented large data sets, and extracting valuable information from them with explicit, hand-written instructions is clearly no longer the best approach. Think of search ranking on Google/Bing/Baidu, recommendation apps like Toutiao/Douyin, spam filtering, malicious phone number flagging, and so on. If every one of these behaviors "tailored" to individual users had to be written as a specific program, the amount of code behind them would be impossible to estimate. So, aren't you curious what's going on behind the scenes?

What is machine learning?

For the record, machine learning can be defined as "the science of getting computers to learn without being explicitly programmed". Don't be scared off by the AI craze; it's mostly capitalist hype, and it doesn't matter whether the wave is coming or going, riding high or not. Problem solving is king!

A simple classification: two kinds! Supervised learning & unsupervised learning

Supervised learning: the training data set given to the algorithm consists of m pairs of "standard inputs" and their corresponding "standard answers". The hope is that the algorithm learns the connection between standard input and standard answer, so that the trained result can take new inputs and produce the corresponding predicted outputs.

For example 🌰: in our exam-oriented education, "what emotion does the author want to express with this sentence?" You must have answered questions like that. Why could you answer them? Because you knew which scoring points were involved. Why did you know? Because you had found a pattern between questions and answers, and that pattern is the result of your learning. Ahem, in all seriousness, the goal of a supervised algorithm is to find patterns that can be used to give the correct result for a new input.

Subdividing further, supervised learning covers regression problems and classification problems. The "formula" learned for a regression problem produces a continuous result, for example a continuous function that outputs a value based on the area of a house. The "formula" for a classification problem produces a discrete result, for example a discrete function with values 0 or 1, like whether your first child will be a boy or a girl.
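To make the regression/classification split concrete, here is a minimal sketch (assuming scikit-learn and made-up toy numbers; not anything from the original post):

```python
# Regression vs. classification in scikit-learn (toy, made-up data).
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value from house area.
areas = [[50], [80], [100], [120]]   # "standard inputs"
values = [150, 240, 300, 360]        # "standard answers" (continuous)
reg = LinearRegression().fit(areas, values)
print(reg.predict([[90]]))           # a continuous prediction, roughly 270

# Classification: predict a discrete 0/1 label.
X = [[1.0], [2.0], [8.0], [9.0]]
y = [0, 0, 1, 1]                     # "standard answers" (discrete)
clf = LogisticRegression().fit(X, y)
print(clf.predict([[7.5]]))          # a discrete prediction, [1]
```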

Unsupervised learning, in contrast, gives the algorithm no standard answers at all. The algorithm has to find some kind of structure in the data set by itself and divide it into different clusters. Clustering is the typical unsupervised learning algorithm: it is used to find closely related information and is often applied where there is a large amount of data, for example (a small code sketch follows this list):

  1. Large data centers: in a big computing cluster, grouping machines that tend to work together, to increase speed.
  2. Social network analysis: identifying circles of friends and how close they are.
  3. Market/user data: automatic market segmentation, dividing customers into different market segments.
  4. Analysis of astronomical data.
  5. NN (Nearest Neighbor Search), commonly used in recommendation systems: the task of unsupervised nearest neighbor search is to find, among the training samples, a predetermined number of points (or all points within a given range) closest to a query point.
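Here is the promised sketch (a toy example assuming scikit-learn and made-up 2-D points, not tied to any scenario above): KMeans groups the points into clusters without any "standard answers", and NearestNeighbors finds the points closest to a query.

```python
# Minimal unsupervised sketch: clustering + nearest-neighbor search (toy data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

points = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],   # one natural group
                   [8, 8], [8.1, 7.9], [7.9, 8.2]])  # another natural group

# Clustering: no labels given, the algorithm splits the data into clusters itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)                       # e.g. [0 0 0 1 1 1]

# Nearest-neighbor search: find the 2 training points closest to a query point.
nn = NearestNeighbors(n_neighbors=2).fit(points)
dist, idx = nn.kneighbors([[1.0, 1.0]])
print(idx)                              # indices of the closest points
```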

What is implicit recall?

Recommendation system

First, a quick introduction to recommendation systems. A recommendation system is a kind of information filtering system that predicts a user's ratings of, or preferences for, items and then delivers personalized recommendations. Toutiao, Douyin, Baidu's feed, Google's feed and so on all have such a recommendation system at their core.

What does "recall" mean?

The recommendation system itself is also a very large topic, covering recall at the bottom, coarse ranking on top of recall, and fine ranking before results are returned, so that in the end only a dozen or so items out of billions are fed back to the user. The word "recall" can be hard to grasp from its literal meaning when you first meet recommendation systems. Recall can be understood as a coarse selection of the items that might be recommended to the user; in other words, it provides a coarsely selected candidate set.
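Roughly, the funnel looks like this (a hypothetical sketch; every name and number here is made up just to show the shape of the pipeline):

```python
# Hypothetical sketch of the recommendation funnel: recall -> coarse rank -> fine rank.
# Items are just (item_id, score) tuples here; real systems use separate models per stage.
import random

all_items = [(i, random.random()) for i in range(100_000)]   # stand-in for "billions"

def recall(items, k=1000):
    # Coarse selection: cheaply pick a candidate set (here: just a random sample).
    return random.sample(items, k)

def coarse_rank(candidates, k=100):
    # A cheap ranking pass keeps only the top few hundred.
    return sorted(candidates, key=lambda it: it[1], reverse=True)[:k]

def fine_rank(candidates, k=20):
    # The expensive ranking pass picks the dozen or so items actually shown.
    return sorted(candidates, key=lambda it: it[1], reverse=True)[:k]

shown = fine_rank(coarse_rank(recall(all_items)))
print(len(shown))   # -> 20
```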

Implicit recall is part of the recall module in a recommendation system.

The core algorithm of implicit recall: at its heart, implicit recall relies on the unsupervised clustering algorithms mentioned above.

What does implicit mean?

Implicit is relative to explicit.

Explicit recall is easy to understand: think of it as keyword recall. For example, searching the keyword "ANN" may return some related Approximate Nearest Neighbor (ANN) papers, or the latest on Artificial Neural Networks (ANN).

So what is an implicit recall?

When it comes to "implicit", let's start from support vector machines and other learning algorithms: they use a huge number of attributes/features/clues to make predictions. That much information can blow up memory, or be impossible to compute over at all. The hope is that by mapping all of this information into a multi-dimensional vector, the individual pieces of information get compressed. Such a vector is called "implicit" (latent) because its individual dimensions do not carry any clear meaning on their own.
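One way to picture that compression (a minimal sketch; TruncatedSVD from scikit-learn is only a stand-in for whatever embedding method a real system uses):

```python
# Minimal sketch: squeeze high-dimensional sparse features into a short dense vector.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# 100 items, each described by 10,000 mostly-empty features.
X = sparse_random(100, 10_000, density=0.001, random_state=0)

svd = TruncatedSVD(n_components=16, random_state=0)
dense = svd.fit_transform(X)   # each item becomes a 16-dim "latent" vector
print(dense.shape)             # -> (100, 16)
print(dense[0])                # individual dimensions have no explicit meaning
```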

Word vectors: a common way to prepare data for a clustering algorithm is to define a common set of numeric features that can be used to compare data items.

One-hot vector: the position corresponding to the word is set to 1 and every other position is set to 0. For example, over the vocabulary King, Queen, Man, Woman, the vector for Queen can be written as [0, 1, 0, 0]. The shortcoming of one-hot encoding is obvious: because it is so sparse, a single vector has a very high dimension while carrying very little information.
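In code, the one-hot vectors from the example above look like this (a tiny numpy sketch):

```python
# One-hot encoding for the tiny vocabulary {King, Queen, Man, Woman}.
import numpy as np

vocab = ["King", "Queen", "Man", "Woman"]

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1          # only the word's own position is 1
    return vec

print(one_hot("Queen"))                 # -> [0. 1. 0. 0.]
# With a real vocabulary of, say, 100k words, each vector would be 100k-dimensional
# with a single 1: extremely sparse and almost information-free on its own.
```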

To get around the problems of one-hot encoding, the concept of the word vector is introduced.

Word vectors: word vectors solve the one-hot problems above by training a mapping from each word to a much shorter vector:

  1. Every word in the vocabulary is mapped to its own word vector, and together these vectors form a vector space.
  2. Word vectors can be used to measure similarity between words (context is introduced during training). The process above is called word embedding, i.e. embedding the high-dimensional word vectors into a low-dimensional space.
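In practice, "measuring similarity" usually means cosine similarity between the dense vectors (a tiny numpy sketch with made-up 4-dimensional vectors, not real trained embeddings):

```python
# Cosine similarity between (made-up) low-dimensional word vectors.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([0.8, 0.1, 0.7, 0.2])   # toy embeddings
queen = np.array([0.7, 0.2, 0.8, 0.1])
man   = np.array([0.9, 0.8, 0.1, 0.1])

print(cosine(king, queen))   # high: similar contexts, similar meaning
print(cosine(king, man))     # lower
```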

Word2Vec is a model commonly used in industry for generating word vectors (semantic vectorization). It is a simplified neural network that represents each word as a k-dimensional vector of real numbers (each real number corresponding to one feature), mapping similar words to nearby positions in the k-dimensional vector space. Word2Vec has two important variants, CBOW and Skip-Gram. CBOW takes the word vectors of a given word's context words as input and predicts the word itself, and is better suited to smaller corpora. Skip-Gram takes the word vector of the given word as input and predicts its context, and is better suited to larger corpora.
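For reference, a minimal training sketch (assuming gensim 4.x and a made-up micro-corpus; nothing like a real training run):

```python
# Minimal Word2Vec sketch with gensim 4.x on a made-up micro-corpus.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

model = Word2Vec(
    sentences,
    vector_size=16,   # k: dimension of each word vector
    window=2,         # context window used during training
    min_count=1,
    sg=0,             # 0 = CBOW, 1 = Skip-Gram
    epochs=200,
)

print(model.wv["queen"])                      # the learned 16-dim vector
print(model.wv.most_similar("king", topn=2))  # nearest words in the vector space
```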


Hey hey, to sum up ~

Implicit recall is part of the recall stage of a recommendation system: through semantic vectorization it retrieves the top-K most similar objects as the candidate set for recommendation.

At its core, implicit recall uses machine-learning algorithms such as clustering, i.e. an unsupervised learning process.

That's all for today; I'll keep updating these study notes in follow-up posts ~