Practical guide | It is never too late to become a recommendation system engineer

Author | Chen Kaijiang

Coordinating editor | He Yongcan

Recommendation system engineer skill tree

Mastery of core principles

Mathematics: calculus, statistics, linear algebra
Related subjects: fundamentals of information theory
Recommendation algorithms: CF, LR, SVM, FM, FTRL, GBDT, RF, SVD, RBM, RNN, LSTM, RL
Data mining: classification, clustering, regression, dimensionality reduction, feature selection, model evaluation

Engineering skills to implement systems and test ideas:

Operating system: Linux
Programming languages: Python/R, Java/C++/C, SQL, shell
RPC frameworks: Thrift, Dubbo, gRPC
Web services: Tornado, Django, Flask
Data stores: Redis, HBase, Cassandra, MongoDB, MySQL, HDFS, Hive, Kafka, Elasticsearch
Machine learning/deep learning: Spark MLlib, GraphLab/GraphChi, Angel, MXNet, TensorFlow, Caffe, XGBoost, VW, libxxx
Text processing: Word2vec, fastText, Gensim, NLTK
Matrix factorization: Spark ALS, GraphChi, Implicit, QMF, libfm
Similarity computation: KGraph, Annoy, NMSLIB, GraphChi, columnSimilarities (Spark RowMatrix)
Real-time computation: Spark Streaming, Storm, Samza

The ability to take responsibility for results

Familiarity with common offline metrics: accuracy, recall, AUC, Gini coefficient
Ability to define product metrics: click-through rate, retention rate, conversion rate, completed-view rate
Ability to run A/B tests and analyze experimental results: visualization of metric data
Knowledge of the differences between common recommendation products: feed recommendation, related recommendation, Top-N recommendation, personalized push

Other soft skills

English reading: read papers from top conferences, classic papers and tech blogs from leading companies and industry experts, and take part in discussions on Quora and Stack Overflow.
Code reading: be able to read open source code and learn how excellent projects implement classic algorithms.
Communication and expression: be able to communicate with people in other roles, explain the principles and methods behind your own modules, understand the needs and thinking of non-technical colleagues, tell real requirements from false ones, and reach consensus.

Figure 1 Recommendation system engineer skill tree

Development roadmap for the recommendation system engineer

The paper "Item-based Collaborative Filtering Recommendation Algorithms" was published in 2001. According to Google Scholar, it has been cited 6,599 times, which shows its great influence on the recommendation system field.

After more than 20 years of development, item-based methods have become standard in recommendation systems, and recommendation systems have become standard in Internet products. Investors or founders often require even the first version of a product to be "personalized". Recommender systems are clearly in high demand, and growing into a recommendation system engineer is easier than it used to be. When I first started working, even fellow R&D engineers such as PHP engineers (no offense intended) looked puzzled when I said "I do recommendations", wondering why "recommendations" was an engineering position.

Nowadays, even though the words "big data" and "AI" bombard us from every direction every day and easily make us restless and anxious, we have to admit that this is a good time to be a recommendation system engineer.

Unlike ordinary programmers, recommendation system engineers do not have to implement the PM's requirements down to the pixel, piling up mountains of code. Unlike machine learning researchers, they do not have to immerse themselves in mathematical derivation to produce an elegant, self-consistent model that wins academic debates. Unlike data analysts, they do not have to draw beautiful charts and build slick PowerPoint presentations for the CEO on the way to the top.

Where does the recommendation system engineer fit? Why are these skills needed? Let me draw on my own experience to answer these questions one by one. I divide the skills of a recommendation system engineer into four dimensions:

Core principles: the basic skill of knowing not only what works but why;
Hands-on skills: solid engineering skills are required to implement systems and test ideas;
Responsibility for results: this is the biggest difference between recommendation system engineers and other roles;
Soft skills: every engineer needs to keep growing and to work as part of a team.

Master the most basic principles

Thanks to open source, there are many out-of-the-box tools that make it easy to build a recommendation system. But you cannot build a tower on shifting sand: the fundamentals are a must, otherwise each new wave of buzzwords in the industry will leave you disoriented. Of all the foundations, the most basic is of course mathematics.

Being able to read classic papers is very helpful when implementing a system: from basic assumptions to formal definitions, from derivation to algorithm flow, from experimental design to result analysis. This requires a basic knowledge of calculus, which is also needed to understand gradient descent and other basic optimization methods.

Probability and statistics give recommendation system engineers their most basic worldview: do not see things in absolutes, and think about every event in the product in terms of uncertainty, because a recommendation system cannot be verified as clearly as a button on an interface that responds to a click. Big data forms a high-dimensional data space, and almost everything from the data to the recommendation objective can be formalized in terms of matrices, as in common recommendation algorithms such as collaborative filtering and matrix factorization.

Viewing machine learning algorithms in terms of matrix operations helps us understand the big difference between vectorization and the loops of traditional software engineering. Dot products between high-dimensional vectors and operations between matrices are far more efficient when implemented in vectorized form than with loops. Building this mindset also requires linear algebra.
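To make the contrast concrete, here is a minimal sketch that computes the same dot product twice, once with an explicit Python loop and once with NumPy. The function names and the array size are illustrative assumptions, not from any particular system, and the timings will vary by machine.

```python
import time

import numpy as np


def dot_loop(a, b):
    """Dot product computed element by element in a Python loop."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_vectorized(a, b):
    """The same dot product, delegated to NumPy's compiled routine."""
    return float(np.dot(a, b))


# Two random high-dimensional vectors (size chosen only for illustration).
rng = np.random.default_rng(0)
a = rng.random(200_000)
b = rng.random(200_000)

start = time.perf_counter()
loop_result = dot_loop(a, b)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = dot_vectorized(a, b)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.4f}s")
```

Both versions return the same value up to floating-point rounding; only the way the work is expressed differs.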

Besides learning basic mathematics, we should extend our study to some basic subjects of information science, especially information theory. Information theory is based on probability, and it provides a basic framework for many problems in the computer field: problems are regarded as communication problems.

The problem a recommendation system solves is also a communication problem: users tell our product what they like and dislike in a very noisy way, and we have to decode the message once we receive it and then reply to them. If the communication does not go smoothly, we lose the user. My major is information and communication engineering.

As a graduate student I did research related to NLP. Many problems and methods in NLP draw on information theory, so it has had a great influence on me. Armed with this basic knowledge, it becomes much easier to keep track of the new algorithms and models that are constantly emerging.

Recommendation systems use many traditional data mining and machine learning methods, so the classic machine learning algorithms are worth learning. Logistic regression, for example, is a simple classification algorithm, but it is more widely used in the recommendation field than any other. In his deep learning course, Andrew Ng starts with logistic regression and gradually moves on to multi-layer neural networks and the more complex RNN. How do you master these classic algorithms? The most direct way is to implement them yourself from scratch.

A recommendation system is not only a model but a whole data processing pipeline. Therefore, you should also master the data mining knowledge upstream of the model, including the basics of classification, clustering, and dimensionality reduction.
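As one possible reading of "implement it from scratch", the sketch below trains logistic regression with plain batch gradient descent on a toy dataset. The data, the learning rate, and the function names are assumptions made for illustration; a production system would use a mature library instead.

```python
import numpy as np


def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)            # predicted probabilities
        error = p - y                     # gradient of log loss w.r.t. the score
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b


def predict(X, w, b):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)


# Toy data (hypothetical): the label is 1 when the two features sum to more than 1.
rng = np.random.default_rng(42)
X = rng.random((200, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

w, b = train_logistic_regression(X, y)
accuracy = (predict(X, w, b) == y).mean()
print(f"training accuracy: {accuracy:.3f}")
```

Writing the update rule by hand makes the link to calculus explicit: each step moves the parameters against the gradient of the loss.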

Develop solid engineering skills

The previous section stressed implementing algorithms yourself in order to master them. But when actually developing a recommendation system, you must not reinvent the wheel unless it is necessary. A recommendation system is also a software system, so it has to be stable and efficient, and mature open source wheels are of course the first choice. Below is some common knowledge and a list of useful tools needed to implement a recommendation system.

The first is the Linux operating system. Because Windows dominates the PC market, many software engineers only know how to develop under Windows, which is a very common, serious, and easily overlooked shortcoming. If your personal computer is a Mac, things are much better: macOS is Unix underneath and a close relative of Linux, and the Mac terminal is largely similar to the Linux command line. If not, you must have your own Linux environment for daily practice, and renting an always-on cloud server is a good choice. There are two key points:

Use Linux;
Use the command line more and IDEs (Eclipse, VS, etc.) less.

Why is that? There are three reasons:

Almost all open source tools for recommendation systems are developed and tested on Linux first and only then ported to Windows (poorly tested, or not ported at all);
The keyboard is faster than the mouse. Programming on the command line means more keyboard and less mouse, which greatly improves efficiency once you are familiar with it. Linux commands are also very rich and operate on standard text, so once you have mastered them you can often do a lot of data processing without writing a program;
Linux is the standard server operating system at almost all Internet companies. If you cannot develop under Linux, you will struggle to find a job.

I am often asked which programming language is best for implementing recommendation systems. The standard official answer is: use the language you are good at. But I know this answer does not solve the questioner's problem. My actual advice is to learn one compiled language, C++ or Java, and one interpreted language, for which Python or R is recommended. Here is why:

These languages are the most common in open source projects for recommendation systems;
For fast data analysis and processing, model debugging, result visualization, system prototyping, and so on, Python and R are good choices, especially Python;
When Python becomes a performance bottleneck somewhere, that part is usually implemented in C++ and then called from Python;
Java has a clear advantage in building back-end services, and some big data open source projects are also implemented in Java.

Python is recommended if time is limited and you want to master just one language. From models to back-end services to the web, Python is arguably the first programming language of the age of AI.

A recommendation system is an online product. No matter how impressive the offline model or the visualization, it must ultimately run as an online service. This involves two kinds of work: 1. system prototyping; 2. serving algorithms as services. Both require the following:

Data storage. This includes storing models for online real-time computation and storing recommendation results computed offline. In addition to the traditional relational database MySQL, you also need to master non-relational databases such as the key-value store Redis and the column-oriented databases Cassandra and HBase, which are often used to store recommendation results or model parameters. Candidate items for recommendation may also be stored in MongoDB.
RPC and web. RPC frameworks, of which Thrift and Dubbo are among the most popular, are important because your algorithm modules need to be exposed as services that others can call across processes and servers. On top of RPC services, prototyping also requires a little basic web development knowledge. Python, PHP, and Java all have web frameworks that let you quickly build the most basic recommendation display.

Of course, the core is the implementation of the algorithms, mainly machine learning algorithms. Here is a detailed list of common machine learning and deep learning tools:

Spark MLlib: probably the most widely used machine learning tool, because Spark's popularity has carried along MLlib, which is not its core component. MLlib implements common linear models, tree models, and matrix factorization models. It provides Scala, Java, and Python interfaces and ships with many examples. When learning Spark MLlib, it is worth running the examples yourself and using the documentation and source code to learn its interfaces and how models are serialized and deserialized.
GraphLab/GraphChi: GraphChi is single-machine and open source. GraphLab is distributed but not open source. Recommendation system engineers are therefore advised to focus on GraphChi, which has Java and C++ versions. It implements common recommendation algorithms and performs well on a single machine. It must be admitted, though, that GraphChi and GraphLab are not widely used in industry.
Angel: Tencent's distributed machine learning platform, open-sourced in 2017 and developed in Java and Scala. It already has industrial-grade applications at Tencent at the scale of 1 billion dimensions. It finally fills the gap in distributed computing focused on traditional machine learning (as opposed to deep learning), and is worth learning. Because the development team is Chinese, the documentation is mainly in Chinese, and communicating with the development team while learning will help you progress quickly.
VW (Vowpal Wabbit): an open source distributed machine learning tool that originated at Yahoo. It supports single-machine use as well as distributed use with Hadoop. It also supports Windows, since its main developer later moved to Microsoft. Reading its source code is very helpful for understanding how logistic regression is trained. The first versions of the models trained by Weibo's recommendation team and advertising team both used VW, and its developers actively answered questions in the Yahoo Group. This is a way to learn and grow: beginners are advised to ask questions on mailing lists and in discussion groups, without worrying about whether a question sounds silly or about being teased.
XGBoost: a machine learning tool known as a favorite in Kaggle competitions. It is worth learning and using, especially for understanding boosting and tree models. There are many tutorials online, and the main developer, Tianqi Chen, is also Chinese, so it is easy to find someone to talk to when you run into a problem.
libxxx: here xxx is a wildcard covering the machine learning tools whose names start with lib, such as liblinear, libsvm, libfm, and libmf. These are single-machine tools, but they are sufficient for recommendation problems on many small and medium datasets. Some of the classification algorithms in scikit-learn wrap tools such as libsvm. In addition, libsvm is not only a machine learning tool: it also defines a widely used de facto format for machine learning training data, the libsvm format.
MXNet, TensorFlow, Caffe: deep learning is booming and has achieved striking results on recognition problems, which has indirectly pushed recommendation algorithms forward as well, so mastering deep learning tools is very necessary. TensorFlow is the main one to focus on: it offers not only deep learning models but also implementations of traditional machine learning models, and its Python interface keeps the barrier to entry low for anyone who knows Python. As with the other tools, start by running a few examples. Playing with something fun, such as restyling a photo or training an animal or face recognizer, gets you started quickly. It is also worth working systematically through Andrew Ng's online course, which covers the use of TensorFlow and has well-designed programming assignments.
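The libsvm data format mentioned in the list above is simple enough to read by hand: each line is a label followed by sparse index:value pairs, with zero-valued features omitted. The parser below is a minimal sketch for illustration (the function name and the sample line are made up here); real projects would normally rely on the loaders that ship with the tools themselves.

```python
def parse_libsvm_line(line):
    """Parse one '<label> <index>:<value> ...' line into (label, features).

    Features are returned as a dict mapping the integer feature index to its
    float value. This sketch does not handle extensions such as 'qid:' tokens.
    """
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Hypothetical sample: a positive example with three non-zero features.
sample = "1 3:0.5 10:1 42:2.25"
label, features = parse_libsvm_line(sample)
print(label, features)
```

Because only non-zero features are written, the format stays compact for the very sparse, high-dimensional data that is typical of recommendation problems.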

The ability to take responsibility for the end result

A recommendation system is ultimately responsible for the product's results. Measuring the effect of a recommendation system happens in two phases: offline and online.

Offline phase: for some models there are clearly defined metrics that verify the model's own hypothesis, such as accuracy, recall, and AUC. Good results at this stage only show that the model matches the expected hypothesis. They do not guarantee the final effect on the product, so the model still has to be verified online.
Online phase: besides some fairly general metrics such as user retention rate, time spent, and click-through rate, most metrics are tied to the positioning of the product itself. Short-video recommendation, for example, focuses on VV (video views), while news recommendation focuses on CTR. These metrics, which sit closer to business interests, are the ones that finally judge a recommendation system. Recommendation system engineers must take responsibility for them and not focus only on offline and technical results.

Understand what different product forms require of a recommendation system. Feeds, related recommendations, "Guess You Like", and other products have different technical requirements and different ways of assessing results behind them. Observe more, use more, and think more.

Finally, learning to understand the product itself in product language, and offering your technical capabilities as a service to the rest of the team, is a soft skill.
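The offline metrics named above can be computed in a few lines. The following is a minimal pure-Python sketch on made-up labels and scores; the 0.5 decision threshold and all names are assumptions for illustration. AUC is computed here through its pairwise definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical click labels and model scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```

The pairwise AUC loop is quadratic in the number of examples, which is fine for a demonstration but too slow for large evaluation sets, where a sort-based computation is the usual choice.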

Status quo in the field of recommendation systems

Collaborative filtering was proposed in the 1990s. Over the past 20 years, recommendation systems have adopted neighborhood-based recommendation, content-based recommendation, and machine-learning-based recommendation represented by matrix factorization. In recent years, the rise of deep learning has naturally brought significant improvements to recommendation systems as well. No one doubts the value of recommendation systems. To give a few examples, 80% of the movies watched on Netflix come through its recommendation system, and 60% of clicks on YouTube are driven by recommendations.

What is the status quo in the field of recommendation systems? Let us look at it from the technical side and the product side in turn. In terms of technology, recommendation systems rely on three types of technology: traditional recommendation techniques, deep learning, and reinforcement learning.

First, traditional recommendation techniques are still very effective. Building the first version of a recommendation system still relies on them, including user-based and item-based nearest neighbor methods, content-based recommendation with text as the main feature source, and traditional machine learning algorithms represented by matrix factorization.

Once an Internet product has accumulated a certain amount of user behavior data, using these traditional algorithms to build the first version of the recommendation system achieves good results and takes the product from zero to one. These traditional algorithms have accumulated plenty of practical experience and open source implementations. Because demand for recommendation systems is broader than ever and these technologies are mature enough, there is a trend for them to be offered as SaaS by specialized third-party companies, rather than small and medium-sized vertical companies building their own teams.
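As an illustration of the item-based nearest neighbor method, the sketch below builds an item-item cosine similarity matrix from a tiny user-by-item interaction matrix and scores unseen items for one user. The matrix and the function names are invented for the example; a real system would use sparse matrices and one of the similarity tools listed earlier.

```python
import numpy as np


def item_similarity(R):
    """Cosine similarity between the item columns of a user-by-item matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0            # items with no interactions: avoid 0/0
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)         # an item should not recommend itself
    return sim


def recommend(R, sim, user, top_n=2):
    """Rank unseen items by summed similarity to the items the user has seen."""
    scores = R[user] @ sim
    scores[R[user] > 0] = -np.inf      # exclude items already interacted with
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked[:top_n] if np.isfinite(scores[i])]


# Hypothetical interactions: rows are users, columns are items, 1 = interacted.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
], dtype=float)

sim = item_similarity(R)
recs = recommend(R, sim, user=0)
print(recs)
```

In this toy data, user 0 has interacted with items 0 and 1, so item 2, which co-occurs with both of them, ranks ahead of item 3.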

Deep learning has made great achievements on recognition problems, so it naturally attracted the attention of recommendation system engineers and has been incorporated into recommendation systems. For example, YouTube built its video recommendation system with DNNs, and Google used the Wide&Deep model in Google Play, predicting CTR by combining a shallow logistic regression model with a deep model, which worked better than either the shallow model or the deep model alone. The Wide&Deep model has also been open-sourced as part of TensorFlow, and this combination of deep and shallow models is now widely used. In 2014, Spotify experimented with RNNs for sequential recommendation, and RNNs were later used in Yahoo News's recommendation system. Among traditional recommendation algorithms there is a classic one called FM, a shallow model often used for CTR prediction. Recently it has been combined with deep learning in the proposed DeepFM model for CTR prediction.

AlphaGo, AlphaGo Master, and AlphaGo Zero each surpassed the last, and their prowess at Go brought reinforcement learning to public attention. Applying reinforcement learning to recommendation systems is quite natural: the users are a changing environment and the recommendation system is the agent. Through continuous interaction with users, the recommendation system gradually goes from bewildered to finding its bearings and catering to users' interests. There are already application cases in industry: Alibaba researcher Renji has publicly shared the results of applying reinforcement learning to search and recommendation at Taobao. Reinforcement learning is also used in many parts of recommendation systems in the relatively simple form of bandit algorithms, to solve the cold start of new users and new items, and to replace A/B testing as another framework for online experiments.

Besides the different technical focuses, recommendation systems also appear in different product forms. The earliest recommendation products always lived in a corner of the product, such as related recommendations. This form was only "icing on the cake": if the recommendation module occasionally came up empty, it was not a fatal problem. Now recommendation has evolved into the main form of Internet products: the information feed. It has gone from the earliest social network updates, to image-and-text feeds, to today's short video. The feed is a product form of the recommendation system. Compared with related recommendations, it is no longer icing on the cake but a sharp tool for harvesting attention.

Behind the evolution of recommendation product forms is the evolution of the Internet from PC to mobile. Search is king on the PC and recommendation is king on mobile, so recommendation naturally becomes more and more important. As the variety of wearable devices grows, more and more recommendation scenarios will emerge. Products and technologies develop in synergy with each other, and more interesting recommendation algorithms and product forms will appear in the future. It is never too late to become a recommendation system engineer.
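To give a feel for the bandit idea, here is a minimal epsilon-greedy sketch, which is only one of several bandit strategies and is not claimed to be the one used by any company mentioned here. The arms stand for candidate items, and the click-through rates in the simulation are hypothetical values chosen for illustration.

```python
import random


class EpsilonGreedyBandit:
    """Explore a random arm with probability epsilon, otherwise exploit the best."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms      # how many times each arm was shown
        self.values = [0.0] * n_arms    # running mean reward of each arm
        self.rng = random.Random(seed)

    def select_arm(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))   # explore
        return max(range(len(self.values)), key=self.values.__getitem__)  # exploit

    def update(self, arm, reward):
        """Incrementally update the mean reward of the chosen arm."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Simulated environment: hypothetical true click-through rates of three items.
true_ctr = [0.02, 0.05, 0.10]
bandit = EpsilonGreedyBandit(n_arms=3, epsilon=0.1, seed=7)
environment = random.Random(123)

for _ in range(20000):
    arm = bandit.select_arm()
    reward = 1.0 if environment.random() < true_ctr[arm] else 0.0
    bandit.update(arm, reward)

print("impressions per item:", bandit.counts)
print("estimated CTR per item:", [round(v, 3) for v in bandit.values])
```

A new item starts with no data, and the exploration step is what guarantees it still gets some impressions, which is why this family of algorithms suits cold start.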

About the author: Chen Kaijiang is the CTO of a technology company. He was previously a senior algorithm engineer at Sina Weibo, algorithm lead at Kaola FM, and co-founder of the personalized shopping guide apps "Wave" and "Browsing and Chatting". He has many years of experience with recommendation systems and rich practical experience in algorithms, architecture, and products.

Editor in charge: He Yongcan

This is an original article from Programmer magazine and may not be reproduced without permission.
