On November 13, the Cloud Brain Technology Machine Learning Training Camp officially opened. The camp is jointly organized by Kesci and Cloud Brain Technology and is taught by leading mentors from China and the United States, who guide students through practical machine learning problems and help cultivate high-potential talent for the artificial intelligence industry.

The K-Lab online data analysis collaboration platform provides full support for the camp. Participants can do all of their analysis work in the browser quickly and conveniently, including data processing, model building, code debugging, and report writing, in a "one-stop" learning workflow that covers online registration, online submission of AI algorithm entries, and evaluation and selection based on case studies and practical ability.

The content of the first guest-speaker session has now been released; a transcript of the talk follows.

Overview of the talk

Topic: Challenges of building a recommendation system for an e-commerce platform with 100 million users

Speaker: Zhang Benyu (Founder & CEO of Cloud Brain Technology)

Zhang Benyu has spent 18 years in artificial intelligence, having worked at Microsoft Research Asia, Google, and Facebook. He holds 150 US patents in AI and has published 45 papers in leading international journals and conferences, cited more than 6,000 times. A recent Innovation Works study, "How great are Chinese researchers in AI?", listed him among the top 10 Chinese AI scientists.

Main points:

  • Collaborative Filtering

  • Feature Engineering

  • Practical considerations for building recommendation systems


Feature engineering in machine learning

First, let's look at the five parts of machine learning.

First comes feature engineering. Second is algorithm definition and tuning: which algorithm to choose and which parameters to adjust. Third is data acquisition and cleaning, followed by algorithm implementation and optimization. The final "I" stands for integration with business production systems, so we call the five together FaDAI for short. Of these five links, feature engineering is the most important.



We will briefly introduce feature engineering and some common feature engineering methods.

To quote Andrew Ng: "Applying machine learning is really about feature engineering, which is very difficult, time-consuming, and requires expertise." The ideal situation is that we start with clean raw data, turn it into a learnable dataset, train a model with some algorithm, and solve a problem. That is the optimal state. In reality, our data comes from everywhere: databases, logs, semi-structured documents, and unstructured audio and images. The question is what features we can extract from them that a machine learning model can learn from to solve the problem.





Feature engineering by variable type

Now let's look at variable types. There are a few broad categories: categorical features, numerical features, and two special ones, time and space. We will introduce them one by one.

1. Discrete features

Some examples of discrete features: What is your operating system type? It could be desktop, tablet, or phone. What is your user_id? It could be 121545 or some other ID. This type of feature needs feature engineering the most, because its value space is very large and often produces sparse data, which is a huge challenge for the model in terms of both efficiency and accuracy.

The simplest feature engineering technique is one-hot encoding. For example, the platform dimension has three values: desktop, mobile, and tablet. We can convert it into three features: the first is 1 if the platform is desktop, the second is 1 if it is mobile, and the third is 1 if it is tablet, so the result is a very sparse structure. If you have 100,000 sites, you get 100,000 dimensions; a single example takes the value 1 on one of those dimensions and 0 on all the others.
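As a rough illustration, here is a minimal one-hot encoding sketch in Python using pandas; the `platform` column and its values are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: a single categorical "platform" column.
df = pd.DataFrame({"platform": ["desktop", "mobile", "tablet", "mobile"]})

# One-hot encoding: each distinct value becomes its own 0/1 column.
one_hot = pd.get_dummies(df["platform"], prefix="platform")
print(one_hot)
# Each row has exactly one 1; with 100,000 sites this would become a
# 100,000-column, mostly-zero (sparse) matrix.
```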



A common way to handle this is hash encoding. For example, there are 200+ countries; one-hot encoding would need 200+ columns, but hashing can map them into 100+ columns, and since the number of buckets is a tunable parameter it can be shrunk to 100, 50, or even 10. This comes at a cost: Brazil and Chile may land in the same column even though the two countries may behave differently, so they are forced to share the same position. That is a potential problem, but sparsity becomes controllable, and low-frequency values and previously unseen values can also be handled. The implicit assumption is that some features can share the same position, an assumption that also appears in deep learning. In practice this usually does not hurt the final results much, as long as the parameter space is large enough, that is, the model still has enough expressive power. This is a fairly common approach that several well-known open-source machine learning tools support.
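A hedged sketch of the same idea with scikit-learn's FeatureHasher; the country records and the choice of 10 hash buckets are hypothetical.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical country records; n_features is the number of hash buckets to keep.
countries = [{"country": "Brazil"}, {"country": "Chile"}, {"country": "China"}]

hasher = FeatureHasher(n_features=10, input_type="dict")
X = hasher.transform(countries)      # sparse matrix with 10 columns
print(X.toarray())                   # entries are signed (+1/-1) hashed counts
# Distinct values may collide into the same column (e.g. Brazil and Chile),
# which is the cost of shrinking 200+ one-hot columns down to 10.
```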



Another encoding is count encoding, which converts an ID into global statistics. For example, for ad ID 423654 we can directly record how many times it has been shown, how many times it has been clicked, and its CTR, and use those statistics as features. Instead of each ID getting its own weight, the IDs become values of a single floating-point feature that shares one weight. The assumption here is that the target has some linear relationship with the global statistic, or a linear relationship after some transformation.
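A small count-encoding sketch, assuming a hypothetical impression log with an `ad_id` column and a `clicked` flag:

```python
import pandas as pd

# Hypothetical ad impression log: one row per impression, with a click flag.
log = pd.DataFrame({
    "ad_id":   [423654, 423654, 423654, 98765, 98765],
    "clicked": [0, 1, 0, 1, 1],
})

# Global statistics per ad: number of views, number of clicks, and CTR.
stats = (log.groupby("ad_id")["clicked"]
            .agg(views="count", clicks="sum")
            .reset_index())
stats["ctr"] = stats["clicks"] / stats["views"]

# Replace the raw ID with its statistics (count encoding).
encoded = log.merge(stats, on="ad_id", how="left").drop(columns="ad_id")
print(encoded)
```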

2. Outliers

Outliers can also distort the global statistics we care about, so we may switch from absolute values to relative values, that is, the rank after sorting. For example, instead of using the raw CTR, we use the rank of the ad by CTR as the feature.

Finally, a common practice in neural networks is to turn categorical variables into embeddings. Say you have 100,000 different sites and you project each one onto a vector of 64 or 128 dimensions. Instead of a 100,000-dimensional one-hot input, each site is now represented by only 64 or 128 numbers. The reason this works is that the original 100,000-dimensional space is very sparse, while 64 or 128 dimensions give a denser representation that still has strong expressive power. The advantages are lower memory requirements and often higher accuracy.
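As an illustration only, here is what such an embedding lookup might look like with PyTorch's `nn.Embedding`; the 100,000 sites and 64 dimensions mirror the numbers in the talk, the rest is made up.

```python
import torch
import torch.nn as nn

# Hypothetical setup: 100,000 distinct site IDs projected into 64 dimensions.
n_sites, dim = 100_000, 64
site_embedding = nn.Embedding(num_embeddings=n_sites, embedding_dim=dim)

# A mini-batch of site IDs; each lookup yields a dense 64-dimensional vector.
site_ids = torch.tensor([12, 40321, 99999])
dense = site_embedding(site_ids)     # shape: (3, 64)
print(dense.shape)
# Unlike a fixed hash function, this table is learned jointly with the model.
```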

Some students asked about the difference between hashing and embedding. An embedding has to be learned: which point in the embedding space a value is projected to is learned from data. A hash is a projection through a predefined hash function, and the hash function itself does not need to be learned. The underlying logic is also different. Hash encoding means two values may share the same weight, for example Brazil and Chile landing in the same column and therefore having the same weight. An embedding is a distributed representation: Brazil is represented by its own values across 64 dimensions and Chile by different values across the same 64 dimensions, so every one of those 64 dimensions participates in representing every country, and every country is represented by all 64 dimensions. The two approaches differ in their basic idea.

3. Numerical variables

Moving on to numerical variables, there are two main types: floating point and fixed point (integers). Numerical variables can often be fed directly into a model, but feature engineering is still usually needed, because in practice their value ranges can be very spread out, which has a fairly large impact on the model.

First, missing data. The easiest thing to do with a missing value is to leave it blank or mark it as NaN, but in practice a blank is often treated as zero, which is not really the best representation. It is better to fill in the mean, median, or mode, or to generate a value from another model, though usually the mean or median is good enough. The second technique is rounding: ignoring changes in the low decimal places, because too many decimal places can just be noise. The measurements themselves are often not that precise, so much of the apparent precision comes from low-order noise, or we may simply want the feature to be more robust. For example, after multiplying by 10 and rounding, a value can to some extent be treated as a discrete, categorical variable taking values 1, 2, 3, ..., 10. Of course, treating it this way implicitly carries a constraint: 10 ranks above 9, and 9 above 8; there is an ordering relationship, and the question is whether that constraint actually holds in the real problem.
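A minimal sketch of mean/median imputation with pandas, on a made-up column:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with missing values.
s = pd.Series([3.2, np.nan, 5.1, 4.8, np.nan, 4.9])

filled_mean   = s.fillna(s.mean())    # usually good enough
filled_median = s.fillna(s.median())  # more robust to outliers
print(filled_mean.tolist())
print(filled_median.tolist())
```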

A further extension of rounding is binarization to 0 and 1: anything above 0 becomes 1, because often we only care about a qualitative property. Binning slices a continuous value into discrete bins. With equal-width binning, the bins have the same width, 1 to 2, 2 to 3, 3 to 4, and we record which bin a value falls into. The other way is equal-frequency binning, which places the cut points so that roughly the same number of values falls into each bucket.
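A short sketch of binarization, equal-width binning, and equal-frequency binning with pandas; the values and the choice of three bins are arbitrary:

```python
import pandas as pd

# Hypothetical numeric feature.
values = pd.Series([0.3, 1.7, 2.2, 2.9, 3.1, 3.8, 9.5])

binary      = (values > 0).astype(int)   # binarization: anything above 0 becomes 1
equal_width = pd.cut(values, bins=3)     # bins of equal width on the value axis
equal_freq  = pd.qcut(values, q=3)       # bins holding roughly equal counts
print(equal_width.value_counts())
print(equal_freq.value_counts())
```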

Sometimes the range of values is too large or too small. In that case some kind of nonlinear transformation, such as a log transform, is used to make the distribution smoother and make values at the extremes easier to tell apart. This is a very common way of introducing nonlinearity; it is simple but works quite well. You can also take the square root instead.
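For example, a log or square-root transform in NumPy on some made-up skewed values:

```python
import numpy as np

# Hypothetical skewed values spanning several orders of magnitude.
x = np.array([1, 10, 250, 4_000, 90_000], dtype=float)

log_x  = np.log1p(x)   # log(1 + x): safe at zero, compresses the large end
sqrt_x = np.sqrt(x)    # a milder alternative transformation
print(log_x)
print(sqrt_x)
```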

The final technique is normalization. There are two common ways: min-max normalization finds the minimum and maximum and scales values to between 0 and 1, and standardization subtracts the mean and divides by the standard deviation, which assumes a basic understanding of the data distribution. Another option is to normalize the whole vector to unit length, again to guard against outliers and for numerical stability.
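The three normalizations mentioned above, sketched directly in NumPy on a hypothetical column:

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 13.0, 40.0])   # hypothetical feature column

# Min-max scaling to [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation.
zscore = (x - x.mean()) / x.std()

# L2 normalization: scale the whole vector to unit length.
l2 = x / np.linalg.norm(x)
print(minmax, zscore, l2, sep="\n")
```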

Here is a method of feature generation: if the original features are x1 and x2, new features can be generated through their pairwise interactions, which also introduces nonlinearity. The FM model used in recommendation systems, which I'll come back to later, essentially uses this idea.
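One possible way to generate pairwise interaction features is scikit-learn's PolynomialFeatures; this is only a sketch of the general idea, not the mechanism FM itself uses:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical design matrix with two columns, x1 and x2.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# interaction_only=True keeps only products of distinct features (x1*x2),
# not squares; include_bias=False drops the constant column.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_new = poly.fit_transform(X)        # columns: x1, x2, x1*x2
print(X_new)
```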

Then there are time variables. Time is essentially a continuous value, but it needs some special handling, and there are a few quirks that are easy to overlook, so be careful. The first thing to watch is time zones: whether to use local time or a single common time zone depends on the problem, as does daylight saving time; it depends on the scenario. Although time is continuous, it is often segmented, sometimes into semantically meaningful segments such as morning, noon, and evening. The segments can even overlap: for example, 5 pm to 9 pm could be one bin and 8 pm to 11 pm another, so 8 pm to 9 pm belongs to two bins at the same time, which is also fine. The second kind of time feature captures trends: how long something took this time, how long it took last week, or the change in that relative time.
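A small, hypothetical example of extracting hour-of-day, overlapping day segments, and day-of-week features with pandas:

```python
import pandas as pd

# Hypothetical timestamps, already converted to one common time zone.
ts = pd.to_datetime(pd.Series(["2017-11-13 08:30",
                               "2017-11-13 17:45",
                               "2017-11-13 20:10"]))

hour = ts.dt.hour
# Semantic segments of the day; the last two deliberately overlap
# (20:00-21:00 falls into both bins), as described above.
is_morning = hour.between(5, 11).astype(int)
is_evening = hour.between(17, 21).astype(int)
is_night   = hour.between(20, 23).astype(int)
dow        = ts.dt.dayofweek        # day-of-week, a simple trend/periodicity feature
print(pd.DataFrame({"hour": hour, "morning": is_morning,
                    "evening": is_evening, "night": is_night, "dow": dow}))
```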

4. Processing of temporal and spatial features

In some scenarios we also care about special occasions, such as holidays or Singles' Day. In electricity consumption forecasting, for example, the Spring Festival can be a very strong feature, since electricity consumption in big cities drops sharply during the festival; likewise, special promotions around payday before the World Cup may be worth considering in practice. Time intervals matter too: for example, the time since the last ad click or the time between two clicks, because we assume the user's interests change over time.

Corresponding to time are spatial variables, which may be GPS coordinates, semantic addresses, zip codes, cities, provinces, or the distance from a particular location. Location often arrives as a continuous stream of readings, perhaps GPS coordinates every second, and may require anomaly detection because GPS is not always accurate or reliable. You can also enrich location information with external sources: the demographics of the area, income levels, and so on.
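For the "distance from a particular location" feature, one common choice is the haversine great-circle distance; the sketch below and the two example coordinates are illustrative only:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance between two GPS points, in kilometers."""
    r = 6371.0                                    # mean Earth radius
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi   = np.radians(lat2 - lat1)
    dlam   = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlam / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Hypothetical feature: distance from a user's coordinates to a reference point.
print(haversine_km(31.2304, 121.4737, 39.9042, 116.4074))
```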


Feature engineering for natural language processing



Now let's look at feature engineering for natural language processing. Text is essentially also a categorical variable, so there are traditional methods such as Bag of Words and TF-IDF, as well as newer embedding and topic models.

Bag of Words is a form of one-hot encoding. TF-IDF is a simple improvement on Bag of Words: instead of the feature value only recording whether a word occurs, it tries to reflect how important the word is to the meaning of the document. Term frequency (TF) captures the idea that the more often a word appears in a document, the more important it is likely to be. On the other hand, the fewer documents a word appears in, the more distinctive or representative it is, which is captured by the inverse document frequency (IDF). Multiplying the two is a common way to re-weight feature values in information retrieval.
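A minimal Bag of Words vs. TF-IDF sketch with scikit-learn, on a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up toy corpus of three documents.
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]

bow   = CountVectorizer().fit_transform(docs)    # Bag of Words counts
tfidf = TfidfVectorizer().fit_transform(docs)    # TF * IDF re-weighting
print(bow.toarray())
print(tfidf.toarray().round(2))
```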



Given the TF-IDF vectors of two documents, you can define a similarity between them, commonly the cosine similarity. Cosine can be seen as a normalized inner product: if the two feature vectors are L2-normalized, their cosine similarity is simply their inner product, or equivalently a function of the angle between the two vectors.
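Cosine similarity as a normalized inner product, sketched in NumPy with two hypothetical TF-IDF vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: the inner product of the two L2-normalized vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical TF-IDF vectors for two documents.
d1 = np.array([0.0, 0.4, 0.9, 0.1])
d2 = np.array([0.2, 0.3, 0.8, 0.0])
print(cosine(d1, d2))
```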

Textual similarity can also be defined as how easy it is to transform one text into another. Word2vec is an embedding method: we define a loss function and learn the embedding that minimizes it. Topic models are essentially a matrix decomposition that expresses high-dimensional data in a lower-dimensional space and describes the data more compactly; they are also used in recommendation systems.

Collaborative filtering and recommendation systems

Recommendation systems are a very broad area of machine learning, closely related to advertising systems. The difference lies in the business logic; at the algorithm level the two can learn from each other.

The essence of collaborative filtering is to use other users to recommend and filter items for a given user. If the items that users A and B have seen are very similar, A and B probably have similar tastes; so if B has seen some items that A has not, those items are good candidates to recommend to A. Items could be ads, movies, music, and so on.

In the example, green means "like" and red means "dislike", and we want to predict the first user's preference for the TV. Which users are most like him? We notice the second and third users, and we use their preferences to guess what he thinks of the TV; the conclusion is that he would like the third item.

Collaborative filtering is divided into three steps:

  • The user expresses his preferences for items.

  • An algorithm finds users who are similar to him.

  • Recommendations are made based on those similar users.

This is user-based recommendation, and we’ll see examples of item-based recommendations next.

First, we need a measure of similarity between users or between items. If an item (or user) is represented by a feature vector, its similarity to another can be measured with Euclidean distance or the Pearson correlation coefficient. Euclidean distance is one of the simplest measures, but in many cases it is also very useful.

If two vectors are two points in an n-dimensional space, the distance between the two points is the Euclidean distance. We then need to convert distance into similarity; sometimes smaller is better, sometimes bigger is better, so we apply a transformation that essentially maps the infinite interval of distances onto (0, 1]. The Pearson coefficient is essentially another description of how similar two vectors are. Cosine is also a variation based on the inner product; on a unit hypersphere it has a simple correspondence with Euclidean distance. Once we have such a distance, we can find similar neighbors in two ways: 1. take the K nearest neighbors; 2. take all neighbors whose distance is below (or similarity above) some threshold. Both methods are used in practice.
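A short sketch of turning Euclidean distance into a similarity in (0, 1] and computing the Pearson coefficient, with made-up rating vectors for two users:

```python
import numpy as np

# Hypothetical rating vectors for two users over the same five items.
u = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
v = np.array([4.0, 3.0, 5.0, 1.0, 1.0])

dist    = np.linalg.norm(u - v)       # Euclidean distance
sim     = 1.0 / (1.0 + dist)          # map [0, inf) distance into (0, 1] similarity
pearson = np.corrcoef(u, v)[0, 1]     # Pearson correlation coefficient
print(sim, pearson)
```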

Item-item filtering: with users A, B, C and items A, B, C, suppose we are considering whether to recommend item C to user C. We look at which items item C often appears together with and find that it is item A; since user C has interacted with item A, item C is recommended to him. User-based filtering works the other way: we find users similar to user A, and if a similar user has seen an item, say item D, that user A has not, item D is recommended to user A. The two directions can give different results, and which works better depends on the characteristics of the data.
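A toy item-item sketch, assuming a small hypothetical user-item rating matrix where 0 means "not rated"; cosine similarity between item columns drives the recommendation:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users A, B, C; columns: items A, B, C).
# 0 means "not rated".
R = np.array([[5.0, 1.0, 4.0],
              [4.0, 2.0, 5.0],
              [5.0, 0.0, 0.0]])

def item_similarity(M):
    """Cosine similarity between the item columns of a rating matrix."""
    norms = np.linalg.norm(M, axis=0)
    return (M.T @ M) / np.outer(norms, norms)

sim = item_similarity(R)
print(sim.round(2))
# Item C co-occurs most strongly with item A; user C liked item A,
# so item C becomes a candidate recommendation for user C.
```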





Either method has some drawbacks: 1. The complexity is O(n^2) and grows with the number of users and items. Whether we use item-item filtering or user-based filtering, the feature vectors themselves are very high-dimensional, so computing each similarity is expensive, and finding the neighbors of a single target is already O(n). 2. It is hard to make recommendations for new users (the cold-start problem).

This work was proposed by Steffen Rendle in 2010. He strengthened the model from another angle and achieved good results. The focus is on interactions between features, such as combining two features. Take an ad example: we care whether the user clicked on the ad (1 or 0), given some characteristics of the user, the country, the time of the click, and the type of the ad. This is a simplified data set encoded with one-hot encoding.

The simplest method is to one-hot encode all the features and not hash features such as dates. If you put this matrix back into the recommendation setting, for example user-movie recommendation, each row represents the relationship between a user and a movie: the features have been one-hot encoded and normalized, and Y is the rating (good or bad). Besides collaborative filtering, another approach to recommendation is to treat it as a regression problem, where X is these features and Y is the rating. The simplest such model is linear regression: give each feature a weight, add them up, and add a bias term. We then get a predicted value, and we want that prediction to be as close to the true y as possible.
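A minimal regression-as-recommendation sketch with scikit-learn's LinearRegression; the one-hot matrix X and the ratings y are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up one-hot feature matrix X (user/movie/context indicators) and ratings y.
X = np.array([[1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([5.0, 3.0, 1.0])

model = LinearRegression().fit(X, y)   # one weight per feature plus an intercept
print(model.coef_, model.intercept_)
print(model.predict(X))                # predictions we want close to the true y
```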

Using only the original features may not be expressive enough. For example, "the user is in the USA and today is Thanksgiving" is a very informative combination, so we may need to combine features and construct new ones. But the number of such combinations can be huge, on the order of n squared. Say there are 200 countries and 30 festivals; combine those with other features such as sites and the space explodes. Looking more closely at these combined features, they are not independent of each other: some parameters can be shared. Parameter sharing is a very important concept that also shows up in hash encoding, CNNs, and RNNs. For example, the combination "United States + Thanksgiving" is closely related to "China + Chinese New Year", so they can be described with the same latent factors.

The traditional technique for finding latent factors is matrix factorization: given a very large n-by-m matrix, we can approximately reconstruct it by multiplying an n-by-k matrix with a k-by-m matrix, which is SVD or LSI, different terms for the same idea. This idea carries over to FM, where the key step is to define w_ij as the inner product of v_i and v_j, with v_i and v_j being k-dimensional vectors; one benefit is that the number of pairwise parameters drops from O(n^2) to O(nk). So w_ij is not a free parameter but a constrained one. The FM model can then be written as an expression whose complexity is not O(n^2) but O(nk), where k is a hyperparameter that does not grow with the amount of data or the number of features. At first glance the pairwise term looks like it needs many more computations, but a lot of them are repeated, and with a simple algebraic rearrangement the whole prediction can be computed in O(nk).
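The rearrangement mentioned above is the standard identity from Rendle's paper (written out in the comment below); here is a hedged NumPy sketch of an FM prediction computed in O(nk), with made-up toy inputs:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machine prediction in O(nk), a sketch after Rendle (2010).

    x: feature vector (n,); w0: global bias; w: linear weights (n,);
    V: latent factor matrix (n, k), so that w_ij = <V[i], V[j]>.
    """
    linear = w0 + w @ x
    # Pairwise term via the identity
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    s  = V.T @ x                  # shape (k,)
    s2 = (V ** 2).T @ (x ** 2)    # shape (k,)
    return linear + 0.5 * np.sum(s ** 2 - s2)

# Made-up toy example: n = 4 one-hot features, k = 2 latent factors.
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0, 0.0])
print(fm_predict(x, w0=0.1, w=rng.normal(size=4), V=rng.normal(size=(4, 2))))
```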



To sum up its advantages: the FM model can be computed in linear time, it works with any real-valued feature vector, it can still estimate its parameters even on very large and sparse data, and it models second-order feature combinations.


K-Lab, the online data analysis collaboration platform launched by Kesci, comes with Python 3, Python 2, and R already integrated, along with 100+ common data analysis toolkits built in (including Seaborn). The official website also aggregates rich data resources; you can log in directly at kesci.com and try Seaborn or another toolkit for data analysis.