Personalized recommendation was one of the hottest concepts of 2016-2017. It probably started with Toutiao: the whole Internet was swept up in a wave of "personalization", and it seemed that no matter the product, adding a personalized recommendation system could dramatically improve operational efficiency and conversion rates. The practice was especially successful in content distribution, e-commerce, and social products (Weibo, the news portals, and JD.com all achieved good results). Personalized recommendation has become product infrastructure, and by now it has even been upgraded to "artificial intelligence".

So what exactly is "personalized recommendation"? This series discusses personalized recommendation systems in the context of artificial intelligence. This article is the first in the series, "How to build a personalized recommendation system from 0 to 1?", and later installments will continue to share and discuss ideas and practices for optimizing a recommendation system.

Let's first look at the modules a complete recommendation system requires. The core ones are: content source, content processing, user mining, the algorithm, the recommendation search engine, and the ABtest system. This article introduces each module of the recommendation architecture in turn.

First, a large pool of recommendable content, i.e. the recommendation SKUs

The essence of personalized recommendation is improving the efficiency of information filtering. If the pool is small, personalization makes little sense (for example, if a video site can only produce 10 new items a day, "personalization" can only shuffle within those 10 items and makes no real difference to users). Personalized recommendation needs SKUs at least in the thousands or tens of thousands; in theory, the more quality content there is and the wider the spread of categories, the better the recommendation effect.

This content can come from crawled licensed content, UGC, PGC under copyright cooperation, and other sources. Because the sources differ, style and quality can vary greatly, so content capture, cleaning, transcoding and so on are usually needed to keep the presentation consistent, and a user management system, anti-spam, and other supporting systems may be needed to build the content ecosystem. The recommendation systems of different products may look similar, but different content leads to different user scenarios and different product barriers. Content is, in essence, a kind of resource.

Second, content standardization

The first step prepares the content; the next step processes it into features that machines and algorithms can understand (such as categories, tags, a product library, and so on). How to process it depends on the business, and the technology differs accordingly: articles, news, and microblogs need natural language processing; pictures and video involve image recognition and processing; for songs, movies, commodities, and the like, it is difficult for a machine to label the content directly, so it is best to establish a labeling mechanism, either by having users tag items, by manual annotation, or by crawling existing labels.
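As a rough illustration of the text side of this step, the following is a minimal sketch of turning article text into category labels with TF-IDF features and a simple classifier. The tiny hand-labeled sample set and the category names are assumptions for illustration only; a real system would use a much larger corpus, proper word segmentation, and a richer label taxonomy.

```python
# A minimal sketch of processing article text into label features.
# The sample set and categories below are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled samples: (article text, category label)
samples = [
    ("the team won the championship final last night", "sports"),
    ("the new phone ships with a faster chip", "tech"),
    ("the central bank adjusted interest rates", "finance"),
]
texts, labels = zip(*samples)

# TF-IDF features plus a linear classifier stand in for the NLP step
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["a faster chip powers the new laptop"]))  # likely "tech"
```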

But whatever the content, we should first establish our own label system; this is the process of defining the standard. For example, to label movies we must first define how many kinds of movies there are. This label system is usually a tree or graph structure. Next, a large number of training samples may need to be collected: to label images, for instance, tens of thousands of images have to be labeled by hand for machine learning, and the labeled samples have to be updated constantly, which involves a lot of repetitive, tedious human labor. Insiders often joke that the key to "artificial intelligence" is actually the "artificial".
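To make the "define the standard first" idea concrete, here is a minimal sketch of a tree-structured label taxonomy. The categories are illustrative, not a real production taxonomy.

```python
# A minimal sketch of a tree-structured label system; categories are illustrative.
label_tree = {
    "movie": {
        "genre": ["comedy", "action", "romance", "sci-fi"],
        "region": ["domestic", "foreign"],
    },
    "article": {
        "topic": ["sports", "tech", "finance"],
    },
}

def all_leaf_labels(node):
    """Flatten the taxonomy into the set of leaf labels an annotator may pick from."""
    if isinstance(node, list):
        return set(node)
    leaves = set()
    for child in node.values():
        leaves |= all_leaf_labels(child)
    return leaves

print(all_leaf_labels(label_tree))
```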

Third, user behavior log collection, transmission, mining, and storage

The basis of recommendation is data. The first two steps mine content data; the third step mines user behavior to produce user portraits.

Collection: user behaviors such as clicks, shares, and favorites are usually reported through front-end event tracking ("buried points"). Log collection is a crucial part of data mining: if collection is missing or wrong (which is quite likely), then nothing you do afterwards will have any effect. At the same time, changes on the front end can also affect the logs, and without effective coordination this has a large impact on the back end.
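The sketch below illustrates this point: a hypothetical client-side event payload and a server-side sanity check at ingestion time. The field names and action list are assumptions, not a standard schema; the point is that malformed or missing events should be caught as early as possible.

```python
# A minimal sketch of a behavior event and an ingestion-time sanity check.
# Field names and the action whitelist are illustrative assumptions.
import json
import time

REQUIRED_FIELDS = {"user_id", "item_id", "action", "ts"}
ALLOWED_ACTIONS = {"impression", "click", "share", "favorite"}

def report(user_id, item_id, action):
    """What the front-end buried point would send."""
    return json.dumps({"user_id": user_id, "item_id": item_id,
                       "action": action, "ts": int(time.time())})

def validate(raw):
    """Reject events that would silently break everything downstream."""
    event = json.loads(raw)
    if not REQUIRED_FIELDS <= event.keys():
        raise ValueError(f"missing fields: {REQUIRED_FIELDS - event.keys()}")
    if event["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {event['action']}")
    return event

print(validate(report("u_42", "item_1001", "click")))
```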

Transmission: the sooner user interest signals are collected, the better, so that a user's action can quickly feed back into the next recommendation. This requires stable, timely log transmission and profile updates. For cost reasons, however, user profiles cannot always be updated in real time: some may be delayed by an hour, some by a day, a week, or even longer.

Mining: this is the process of computing over user data and processing it into the features we want (commonly known as the "user portrait"; the industry usually calls it the user profile). User mining is usually combined with the algorithm: features cannot be dug out of thin air, and a user profile that no algorithm applies has no value.
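As a toy illustration of this mining step, the sketch below turns raw behavior events into a tag-weight user profile. The action weights and the item-to-tag mapping are illustrative assumptions; in practice the tags come from the content processing step and the weights from the algorithm.

```python
# A minimal sketch of aggregating behavior logs into a tag-weight user profile.
# Action weights and item tags are illustrative assumptions.
from collections import defaultdict

ACTION_WEIGHT = {"click": 1.0, "share": 3.0, "favorite": 5.0}
ITEM_TAGS = {"item_1001": ["sports"], "item_1002": ["tech", "gadgets"]}  # from content processing

def build_profile(events):
    profile = defaultdict(float)
    for e in events:
        for tag in ITEM_TAGS.get(e["item_id"], []):
            profile[tag] += ACTION_WEIGHT.get(e["action"], 0.0)
    return dict(profile)

events = [
    {"user_id": "u_42", "item_id": "item_1001", "action": "click"},
    {"user_id": "u_42", "item_id": "item_1002", "action": "favorite"},
]
print(build_profile(events))  # {'sports': 1.0, 'tech': 5.0, 'gadgets': 5.0}
```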

Storage: a user's interests do not change much over time, so a user profile can be accumulated from long-term behavior, and these profiles need to be stored. With a large number of users the required storage is also massive, which calls for a database that can hold large amounts of data in a distributed, reliable, and cheap way, such as HDFS. To compute user interests in real time, you also need a database that can be accessed quickly, such as Redis. Server purchases are therefore a major expense for companies like Weibo and Toutiao.

Of course, a user's interests are not constant, so interest weights need to "decay" over time. Setting a reasonable decay coefficient is also important for the user profile.
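One common way to implement this decay is exponential decay with a half-life, sketched below. The 7-day half-life is an illustrative assumption; the right decay constant depends on how fast interests actually shift in your product.

```python
# A minimal sketch of time decay for user interest weights.
# The 7-day half-life is an illustrative assumption.
import math

HALF_LIFE_DAYS = 7.0

def decay(weight, days_since_last_action):
    """Exponentially decay a tag weight: after one half-life it drops to 50%."""
    return weight * math.pow(0.5, days_since_last_action / HALF_LIFE_DAYS)

print(decay(5.0, 0))   # 5.0   (acted on today)
print(decay(5.0, 7))   # 2.5   (one half-life later)
print(decay(5.0, 21))  # 0.625
```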

In addition, user behavior mining has a long-standing problem, user cold start, which deserves a separate article.

Fourth, the ranking algorithm

With content and user data from the first three steps, the fourth step uses an algorithm to match the two. Personalized recommendation is essentially a Top N ranking, and usually includes two modules, "recall" and "ranking". For example, suppose I have 100,000 items, but a user may only read 10 a day; which 10 should be recommended? I could sort all 100,000 items from 1 to 100,000 so that, however many the user wants to see, I just pick from the front. This process is called ranking. However, doing that sort in the real-time index is far too expensive and would cause high latency, so we first select a relatively reliable 1,000 from the 100,000 with a comparatively simple method, and then rank those 1,000. The process of selecting the 1,000 from the 100,000 is called "recall".
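Below is a minimal sketch of this two-stage "recall then rank" structure. The scoring functions are deliberately simplistic stand-ins for real models: recall keeps any item sharing a tag with the user's interests, and ranking scores candidates by summed tag weights.

```python
# A minimal sketch of the two-stage recall-then-rank structure described above.
# Scoring functions are simplistic stand-ins for real models.
def recall(all_items, user_profile, k=1000):
    """Cheap first pass: keep items sharing at least one tag with the user's interests."""
    candidates = [it for it in all_items if set(it["tags"]) & set(user_profile)]
    return candidates[:k]

def rank(candidates, user_profile, n=10):
    """Expensive second pass: score each candidate against the profile and take Top N."""
    def score(item):
        return sum(user_profile.get(tag, 0.0) for tag in item["tags"])
    return sorted(candidates, key=score, reverse=True)[:n]

all_items = [
    {"id": "item_1001", "tags": ["sports"]},
    {"id": "item_1002", "tags": ["tech", "gadgets"]},
    {"id": "item_1003", "tags": ["finance"]},
]
profile = {"sports": 1.0, "tech": 5.0, "gadgets": 5.0}
print(rank(recall(all_items, profile), profile, n=2))
```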

There are many ways to improve the algorithm; a later article will introduce in detail the most effective algorithms commonly used in today's recommendation systems. In addition, every algorithm relies on "dynamic metrics" (such as CTR) gathered after content has been recommended, but how do we obtain such metrics before anything has been recommended? That is the content cold-start problem, only touched on here, which will also be discussed separately later.

Fifth, the recommendation search engine

Why a search engine? Yes, you read that right. Personalized recommendation and search are in fact very similar fields: both are ways of filtering information, both are "relevance" rankings, and their objective functions are very similar (click-through rate). The difference is that search focuses on relevance to the user's current query keywords, while recommendation focuses on relevance between the content and the user's profile. Every user browse is a real-time request, so the content that currently best matches the user's interest must be computed in real time; this step is handled by the online search engine. Because of performance requirements, the online index should not do very time-consuming computation: generally the ranking algorithm computes preliminary results, and the online engine does algorithm scheduling and normalized ranking. The online side also handles receiving requests, returning data, and logging exposures and clicks, and it typically performs a secondary re-ranking for business and product requirements (such as inserting ads and breaking up runs of same-type content).
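The following is a minimal sketch of that secondary, business-driven re-ranking: breaking up adjacent items of the same category and inserting an ad slot. The rules, positions, and item fields are illustrative assumptions, not how any particular online engine works.

```python
# A minimal sketch of business re-ranking after the algorithm's ordering:
# break up runs of same-category items, then insert an ad slot.
def break_up_same_type(ranked):
    """Avoid showing two adjacent items of the same category by postponing duplicates."""
    result, pending = [], list(ranked)
    while pending:
        for i, item in enumerate(pending):
            if not result or item["category"] != result[-1]["category"]:
                result.append(pending.pop(i))
                break
        else:  # every remaining item repeats the last category; just append
            result.append(pending.pop(0))
    return result

def insert_ad(feed, ad, position=3):
    """Insert an ad at a fixed illustrative position."""
    return feed[:position] + [ad] + feed[position:]

ranked = [
    {"id": "a", "category": "tech"},
    {"id": "b", "category": "tech"},
    {"id": "c", "category": "sports"},
    {"id": "d", "category": "finance"},
]
feed = insert_ad(break_up_same_type(ranked), {"id": "ad_1", "category": "ad"})
print([item["id"] for item in feed])
```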

Sixth, ABtest system

The ABtest system is not strictly a required module of a personalized recommendation system, but a recommendation system without ABtest is surely a fake one! Optimizing a recommendation system is really about y = f(x): y is the objective function, which must be clear and a quantifiable metric; f(x) is made up of the chosen algorithms, their features and parameters, algorithm scheduling, and so on. In practice there are only a handful of effective algorithms and principles in the industry, so how you choose features and parameters for your own product scenario becomes the key factor in recommendation accuracy. With an ABtest system we can try many combinations of parameters and features and find the best y through ABtest experiments, so the recommendation system can be iterated and optimized continuously.
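A core piece of any ABtest system is deterministic bucketing: the same user always lands in the same group, so each group sees a consistent strategy across sessions. Below is a minimal sketch; the experiment name, salt, and split ratio are illustrative assumptions.

```python
# A minimal sketch of deterministic A/B bucketing by hashing the user ID.
# The experiment name and 50/50 split are illustrative assumptions.
import hashlib

def ab_bucket(user_id, experiment="rank_model_v2", treatment_ratio=0.5):
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the hash onto [0, 1) and compare with the treatment ratio
    return "B" if int(digest, 16) / 16**32 < treatment_ratio else "A"

print(ab_bucket("u_42"))  # always the same group for this user and experiment
print(ab_bucket("u_43"))
```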

Of course, optimizing the algorithm is not as simple as changing parameters. The people doing recommendation need to be sensitive to data and able to abstract complex problems into quantifiable metrics, and then iterate quickly with ABtest experiments. The optimization process I have summarized is: "analyze the data to find the problem, form a reasonable hypothesis, design the experiment, implement it, analyze the data, reach a conclusion or a new hypothesis", repeated over and over. Modifying parameters is only the "implement" step, and it is also the simplest one. Most people focus only on "implement" and pay too little attention to analysis and hypothesis, so the effect of optimization is not guaranteed; some product and technical people fall into the trap of blind, aimless ABtesting, run test after test that shows no difference between groups A and B, and even conclude that ABtest itself is inefficient. These analytical skills are what open the gap between algorithm engineers.

Apsara Clouder Big Data specialty skills certification

This certification explains the concepts, applications, and algorithm principles of recommendation systems, and introduces Alibaba's recommendation engine product RecEng in detail. Finally, through a small project, students build a recommendation system by hand. The whole process has four parts: data upload, data preprocessing, recommendation system configuration, and testing and going live. Students can refer to this experiment and put what they have learned into practice in combination with their own business and needs.

Course contents

01 Recommendation system concepts and application scenarios
Describes the background, concepts, features, and application scenarios of the recommendation engine.

02 Algorithm principles of the recommendation engine
Introduces the commonly used recommendation engine algorithms, and the principles, advantages, and disadvantages of each.

03 Introduction to the recommendation engine RecEng
Introduces the features, capabilities, and data model of RecEng.

04 Basic operation demonstration of the recommendation engine RecEng
Demonstrates the basic operation of the recommendation engine RecEng.

05 Practice: build an e-commerce recommendation system
Introduces how to use RecEng to build a recommendation system that supports an enterprise's recommendation business needs.

06 Experiment manual: build an e-commerce recommendation system
A detailed experiment manual that walks you step by step through building the e-commerce recommendation system.

For details, see the Apsara Clouder Big Data specialty skills certification "Build a personalized recommendation system" on the official website of Alibaba Cloud University, the innovative talent workshop under the cloud ecosystem.