In the last article, we analyzed Facebook’s famous GBDT+LR paper. Although that paper was widely praised in the industry, GBDT is an old model by now. Today we’re going to look at one of the more widely used models in industry, developed in 2010 by Steffen Rendle. Although it appeared early, it is still very much alive and has spawned many variants. What we’re dissecting today is the most classic version: the original 2010 paper.

When it comes to algorithmic models for recommendation and advertising, it is almost impossible to get around FM. It is a very strong model: the theory is simple, the derivation is rigorous, the implementation is easy, and the results are solid. Even now it is still used at major companies, and at some smaller ones it is even the main model. You may have some doubts before reading, but I believe that by the time you finish you will see it in a new light.

Again, this is a long article and I hope you will be patient to the end.

Introduction to FM

FM stands for Factorization Machines, which translates literally as “factorization machine.” The translation isn’t particularly evocative, so it is rarely used; we generally just call it the FM model for short.

At the beginning of the paper, the authors list the various benefits of the FM model. For example, compared with SVM, FM still performs well on sparse features, and FM is very efficient: predictions can be computed in linear time. Unlike a nonlinear (kernelized) SVM, FM does not need to transform the problem into its dual form, and the model parameters can be trained directly without relying on support vectors.

Besides SVM, the FM model also has clear advantages over other factorization models such as SVD, PITF and FPMC. Those models tend to be tailored to a particular scenario or data set, with training and optimization schemes customized accordingly. FM achieves equally good results without such customization and can easily be applied to all kinds of prediction problems.

This opening section doesn’t carry much substance; it is basically a sales pitch for FM. The author is right, though: FM really is better than the models he lists.

To sum up, the advantages of FM model are as follows:

  • FM can estimate its parameters even under very sparse features, while the SVM model cannot
  • The time complexity of FM is $O(n)$, and the parameters of the primal problem can be optimized directly, without relying on support vectors or transforming the problem into its dual
  • FM is a general-purpose model that can be applied to any scenario with real-valued features, while the other models cannot

Having shown the advantages of the FM model up front and whetted our appetite, the paper then analyzes the model in depth, so that we can understand where these advantages come from and how it works.

Sparse scenario

In supervised learning, the most common task is to predict a target $T$ given a feature vector $x$. If $T$ is a real number, it’s a regression problem; if $T$ is a categorical label, it’s a classification problem.

These are machine learning basics that I’m sure you all know. For online ranking, though, what matters is the ordering, not the score itself. Generally speaking, we can treat impressions and clicks as two classes and train a classifier, or directly predict an item’s click-through rate with a regression model and sort by the predicted rate. These two approaches are essentially the same: both learn a score for the item itself. Another approach puts more emphasis on the ordering of items: the model learns the relative relationship between items.

For example, we can use vector $x_i$ to represent the feature vector of item $i$ and vector $x_j$ for item $j$, then combine the two as the input to a classifier. The classification result indicates whether item $i$ should be ranked before item $j$ or vice versa. In this way we obtain the ranking between items directly through multiple pairwise predictions rather than through a score, which also allows training with only positive samples. This method of directly learning which of two items should win is known in the industry as learning to rank.
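As a minimal sketch of the pairwise setup just described (the item vectors and helper name here are invented for illustration, not taken from the paper), a training sample can be built by concatenating two item feature vectors, with the label saying which one should rank higher:

import numpy as np

def make_pairwise_sample(x_i, x_j, i_wins):
    # Concatenate the two item feature vectors; the label says whether item i ranks above item j
    return np.concatenate([x_i, x_j]), int(i_wins)

# Hypothetical 4-dimensional item features
x_a = np.array([1.0, 0.0, 0.3, 0.0])
x_b = np.array([0.0, 1.0, 0.0, 0.7])

features, label = make_pairwise_sample(x_a, x_b, i_wins=True)
print(features.shape, label)  # (8,) 1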

However, no matter which approach we take, there is one problem we cannot avoid: sparse features. Take a very simple example: suppose we one-hot encode item categories. In an e-commerce setting there are often tens of thousands of categories, so the one-hot result is a vector tens of thousands of entries long in which only a single entry is 1 and all the others are 0. This is just one example; many other features can be just as sparse.

We use $m(x)$ to denote the number of non-zero elements in the feature vector $x$, and $\bar{m}_D$ to denote the average of $m(x)$ over the data set $D$. In sparse scenarios we have $\bar{m}_D \ll n$, where $n$ is the feature dimension.
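A quick sketch (with a made-up toy matrix) of how these two quantities are computed:

import numpy as np

# Hypothetical sparse sample matrix: 4 samples, 8-dimensional features
X = np.array([
    [1, 0, 0, 0, 0.5, 0, 0, 3],
    [0, 1, 0, 0, 0.0, 0, 0, 5],
    [0, 0, 1, 0, 0.2, 0, 1, 1],
    [1, 0, 0, 0, 0.0, 0, 0, 2],
])

m_x = (X != 0).sum(axis=1)   # m(x): number of non-zero entries per sample
m_bar_D = m_x.mean()         # average over the data set D
n = X.shape[1]
print(m_x, m_bar_D, n)       # in real data m_bar_D << n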

Real world examples

The paper gives an example from a real scenario: predicting a user’s rating for a movie. Let’s first look at the figure below, which shows the sample matrix.

It’s clear that the big block on the left holds the features, and Target y on the right is the value to be predicted: the rating the user gives the movie. There are five possible values [1, 2, 3, 4, 5], which makes this a multi-class classification problem.

Now let’s look at the features. There are five parts in total. The blue part is the one-hot encoding of the user, so the length of this block equals the number of users; the dimension representing the current user is 1 and all the others are 0.

The red part represents the movie, and the one-hot of the movie is the same logic as the one-hot of the user. The dimension representing the current movie is 1, and the others are 0.

After that comes the yellow part, which represents the user’s past rating behavior on movies. Its dimension is also the number of movies: every movie the user has rated gets a value greater than zero, and every unrated movie is zero. The value is 1 divided by the number of movies the user has rated. For example, in the first row, the first user has rated the first three movies, so each of those three entries gets a value of $\frac{1}{3}$.

The green part has only one dimension and represents the time at which the user rated the movie. It is computed by taking the earliest date in the records as the base (January 2009 in this example) and adding 1 for each month after that; May 2009, for example, converts to 5.

The last, brown part represents the previous movie the user rated. It is also a one-hot vector whose dimension equals the number of movies.

Suppose the number of users is U and the number of movies is M; then the dimensionality of the full feature vector is U + 3M + 1. Even a smaller movie-rating site has at least a million users, and once you add in the movies this becomes a huge number, yet only a handful of these dimensions are non-zero.
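To make the layout concrete, here is a rough sketch of how one such feature row could be assembled (the counts, IDs and helper name are invented for illustration, not taken from the paper):

import numpy as np

def build_row(user_id, movie_id, rated_movie_ids, months_since_base, last_movie_id,
              num_users, num_movies):
    user_onehot = np.zeros(num_users);   user_onehot[user_id] = 1.0        # blue block
    movie_onehot = np.zeros(num_movies); movie_onehot[movie_id] = 1.0      # red block
    rated = np.zeros(num_movies)                                           # yellow block
    if rated_movie_ids:
        rated[list(rated_movie_ids)] = 1.0 / len(rated_movie_ids)
    time_feat = np.array([float(months_since_base)])                       # green block
    last_onehot = np.zeros(num_movies)                                     # brown block
    if last_movie_id is not None:
        last_onehot[last_movie_id] = 1.0
    return np.concatenate([user_onehot, movie_onehot, rated, time_feat, last_onehot])

row = build_row(user_id=0, movie_id=1, rated_movie_ids=[0, 1, 2],
                months_since_base=5, last_movie_id=0, num_users=3, num_movies=4)
print(row.shape)  # U + 3M + 1 = 3 + 12 + 1 = 16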

With such a sparse feature matrix, ordinary models struggle to perform well. So why can the FM model cope with it? Let’s look at how FM works.

Principle of FM model

Before we introduce the equation of the FM model, let’s review the linear regression expression, which is actually very simple, just one line:


Y = W^T X = w_0 + \sum_{i=1}^n w_i x_i

That is, $W = (w_0, w_1, w_2, \cdots, w_n)$, where $n$ is the feature dimension. If we stack the $m$ samples into a matrix $X$ ($m$ being the number of samples), we can also write this as $Y = XW$; however you write it, it means the same thing.

The reason it’s called linear regression is simple: it is a plain linear equation containing only first-order terms. Sometimes, however, that isn’t good enough, especially in sparse scenarios where its expressive power is limited. When doing feature engineering we often combine two features into a new cross feature. Because these new combinations can capture information the individual features cannot, the model usually improves.

When we add such second-order terms to the features, we get something like this:


\hat{y}=w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^{n-1}\sum_{j=i+1}^n w_{ij}x_i x_j

$x_i$ and $x_j$ are two different features. For $n$-dimensional features there are $C_n^2 = \frac{n(n-1)}{2}$ such pairs, so we need the same number of weight parameters $w_{ij}$. Here lies the problem: as we said, features can be extremely sparse, so $n$ is very large, and the number of pairwise combinations is on the order of $n^2$; even for $n = 100{,}000$ that is already about 5 billion pairwise weights, and for $n$ in the millions it becomes astronomical. Think about how many training samples would be needed for a model with tens of billions of parameters to converge well. It’s unthinkable.

In such a situation, targeted optimization is essential, and this is exactly what FM is known for: it introduces feature crossing while keeping both the number of parameters and the computational complexity under control. It really is a neat piece of work.

FM solves this problem in a very simple way. Instead of assigning an independent parameter to each crossed feature pair, it defines a way to compute those parameters. The FM model introduces a new matrix $V$, a two-dimensional matrix of size $n \times k$. Here $k$ is a hyperparameter we choose, usually not very large, such as 16 or 32. For each feature dimension $i$ we thus get a vector $v_i$ of length $k$.

We can then use $v_i$ and $v_j$ to compute the $w_{ij}$ in the formula above:


w_{ij} = v_i^T v_j

In other words, we use the inner product of two vectors as the coefficient of a cross feature. Compared with the original $O(n^2)$ parameters, this reduces the parameter count to $n \times k$. Since $k$ is a constant, the number of parameters is $O(n)$, an order-of-magnitude reduction.
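A tiny sketch (with invented sizes) of this factorized coefficient and the parameter savings it brings:

import numpy as np

n, k = 100_000, 16                          # hypothetical feature dimension and latent size
rng = np.random.default_rng(0)
V = rng.normal(scale=0.01, size=(n, k))     # the n x k factor matrix

def w(i, j):
    # Coefficient of the cross feature x_i * x_j, reconstructed as an inner product
    return V[i] @ V[j]

print(w(3, 42))
print("explicit pairwise weights:", n * (n - 1) // 2)   # ~5 * 10^9 if stored one by one
print("FM parameters (n * k):    ", n * k)              # only 1.6 * 10^6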

With the V matrix, the formula can be rewritten as:


\hat{y} = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^{n-1}\sum_{j=i+1}^n v_i^T v_j x_i x_j

If we collect all the pairwise coefficients $w_{ij}$ into a matrix $W$, then $W = V \cdot V^T$, so the parameter matrix $V$ we introduced can be regarded as a factorization of $W$; this is where the name Factorization Machine comes from. In practice we don’t need a very large $k$: the feature matrix is usually very sparse, we may not have enough samples to train that many parameters, and limiting $k$ also improves the generalization ability of the FM model to some extent.

This is also good for training. For some sparse feature combinations, there may be no samples at all: for example, user A may never have interacted with movie B, so we could never train an explicit coefficient for that pair. By introducing $V$, we can still obtain a coefficient even though that combination never appears, because we have trained a vector for user A and a vector for movie B, and their dot product gives the coefficient of the cross feature.

There are two ways to think about this. One is what I just described: we can still obtain reasonable coefficients for sparse combinations. The other is that this is also a form of embedding, mapping an ID to a vector; for this reason FM is also used in industry to produce embeddings.
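To make the equation above concrete, here is a deliberately naive sketch that scores one sample by looping over all feature pairs (names and sizes are made up; this is the $O(kn^2)$ form that the next section optimizes):

import numpy as np

def fm_predict_naive(x, w0, w, V):
    # Linear part: w0 + sum_i w_i x_i
    y = w0 + w @ x
    n = len(x)
    # Pairwise part: sum over i < j of (v_i . v_j) * x_i * x_j  -- O(k * n^2)
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y

rng = np.random.default_rng(0)
n, k = 50, 8
x = rng.random(n) * (rng.random(n) < 0.1)   # a sparse toy sample
w0, w, V = 0.1, rng.normal(size=n), rng.normal(scale=0.1, size=(n, k))
print(fm_predict_naive(x, w0, w, V))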

Complexity optimization

Next comes another essential piece: the complexity optimization of the FM model. Looking at the formula above, it’s not hard to see that computing a prediction for one sample costs $O(kn^2)$.

We’ve said that $n$ is large, so an $n^2$ computation is clearly unacceptable, and we need to optimize it. Fortunately the optimization is simple; it follows directly from rearranging the formula.


\hat{y} = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^{n-1}\sum_{j=i+1}^n v_i^T v_j x_i x_j

This is the original formula. Its first two terms are already $O(n)$, so we can ignore them and focus on the last term, which we simplify through the following transformation:


\begin{aligned}
\sum_{i=1}^n\sum_{j=i+1}^n v_i^T v_j x_i x_j &= \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n v_i^T v_j x_i x_j - \frac{1}{2}\sum_{i=1}^n v_i^T v_i x_i x_i\\
&=\frac{1}{2}\left(\sum_{i=1}^n\sum_{j=1}^n\sum_{f=1}^k v_{i,f}v_{j,f}x_i x_j-\sum_{i=1}^n\sum_{f=1}^k v_{i,f}v_{i,f}x_i x_i\right)\\
&=\frac{1}{2}\sum_{f=1}^k\left(\left(\sum_{i=1}^n v_{i,f}x_i\right)\left(\sum_{j=1}^n v_{j,f}x_j\right)-\sum_{i=1}^n v_{i,f}^2 x_i^2\right)\\
&=\frac{1}{2}\sum_{f=1}^k\left(\left(\sum_{i=1}^n v_{i,f}x_i\right)^2 - \sum_{i=1}^n v_{i,f}^2 x_i^2\right)
\end{aligned}

A few words on the derivation. The first line uses the fact that the full symmetric double sum over $i$ and $j$ counts every pair twice plus the diagonal terms, so we take half of it and subtract half of the diagonal. The second line simply expands the inner product $v_i^T v_j$. Going from the second to the third line is also not hard: there are three summations, and we pull the innermost $\sum_f$ outside; since these are finite sums, changing their order does not affect the result. After factoring out the common terms we are left with two squared expressions. Each of them costs $O(n)$ to compute, and with the outer $O(k)$ loop the overall complexity is $O(kn)$.

So we have completed the optimization of FM model prediction.
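As a sanity check (on random toy data), the $O(kn)$ form can be compared against the naive double loop; the two should agree up to floating-point error:

import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 4
x = rng.random(n)
V = rng.normal(size=(n, k))

# Naive O(k * n^2) pairwise sum
naive = sum((V[i] @ V[j]) * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

# Optimized O(k * n) form: 1/2 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)
s = V.T @ x                                # shape (k,), the inner sums
optimized = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))

print(np.isclose(naive, optimized))  # True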

Model training

Now let’s look at model training. Substituting the transformed term back, the prediction formula becomes:


\hat{y}=w_0 + \sum_{i=1}^n w_i x_i + \frac{1}{2}\sum_{f=1}^k\left(\left(\sum_{i=1}^n v_{i,f}x_i\right)^2 - \sum_{i=1}^n v_{i,f}^2 x_i^2\right)

The FM model can also be optimized with gradient descent. Since we are going to use gradient descent, we need to write down the partial derivative of the prediction with respect to every parameter of the model.

We can split the parameters into three groups. The first is $w_0$, for which $\frac{\partial \hat{y}}{\partial w_0} = 1$. The second is the linear part $\sum_{i=1}^n w_i x_i$: the coefficient of each $w_i$ is $x_i$, and since the sample is fixed we treat its feature values as constants, so $\frac{\partial \hat{y}}{\partial w_i} = x_i$. The last part is $\frac{1}{2}\sum_{f=1}^k\left(\left(\sum_{i=1}^n v_{i,f}x_i\right)^2 - \sum_{i=1}^n v_{i,f}^2 x_i^2\right)$; for a given $v_{i,f}$, only the term with the same $f$ in the outer sum matters, so we differentiate inside the parentheses and obtain $\frac{\partial \hat{y}}{\partial v_{i,f}} = x_i\sum_{j=1}^n v_{j,f}x_j - v_{i,f}x_i^2$.

When we put these three together, we get:

\begin{equation} \frac{\partial \hat{y}}{\partial \theta}=\left\{ \begin{aligned} & 1, & \text{if } \theta \text{ is } w_0 \\ & x_i, & \text{if } \theta \text{ is } w_i \\ & x_i\sum_{j=1}^n v_{j,f}x_j - v_{i,f}x_i^2, & \text{if } \theta \text{ is } v_{i,f} \end{aligned} \right. \end{equation}

Note that $\sum_{j=1}^n v_{j,f}x_j$ does not depend on $i$, so it can be precomputed once; after that, the gradient of any individual parameter can be computed in $O(1)$. Since there are $O(kn)$ parameters in total, a full gradient-descent update of all parameters takes $O(kn)$ time.
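Here is a rough sketch (toy data, squared-error loss chosen for simplicity, all names invented) of one SGD step using these gradient formulas, with the inner sum precomputed as described:

import numpy as np

def fm_forward(x, w0, w, V):
    s = V.T @ x                                        # precomputed sums over j, shape (k,)
    quad = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))
    return w0 + w @ x + quad, s

def sgd_step(x, y, w0, w, V, lr=0.01):
    y_hat, s = fm_forward(x, w0, w, V)
    g = y_hat - y                                      # dL/dy_hat for a squared-error loss
    w0 -= lr * g * 1.0                                 # d y_hat / d w0 = 1
    w  -= lr * g * x                                   # d y_hat / d w_i = x_i
    V  -= lr * g * (np.outer(x, s) - V * (x ** 2)[:, None])  # x_i * s_f - v_if * x_i^2
    return w0, w, V

rng = np.random.default_rng(2)
n, k = 20, 4
x, y = rng.random(n), 3.0
w0, w, V = 0.0, np.zeros(n), rng.normal(scale=0.1, size=(n, k))
w0, w, V = sgd_step(x, y, w0, w, V)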

Let’s expand to d dimensions

The FM model we’ve introduced so far focuses on second-order (pairwise) feature crossing, but if we wish we can extend it to $d$-way crossing. The model can then capture interactions of any order $\tilde{d}$ with $1 \leq \tilde{d} \leq d$.

Following the same pattern as before, the equation of the FM model extended to order $d$ can be written as:


\hat{y}=w_0 + \sum_{i=1}^n w_i x_i + \sum_{l=2}^d\sum_{i_1=1}^n\cdots\sum_{i_l=i_{l-1}+1}^n\left(\prod_{j=1}^l x_{i_j}\right)\left(\sum_{f=1}^{k_l}\prod_{j=1}^l v_{i_j,f}^{(l)}\right)

The first two terms are easy to understand; let’s focus on the third, which contains all cross terms of order 2 through $d$. Taking $d=3$ as an example, this term contains the second-order and third-order crosses and looks like this:


\sum_{i=1}^{n-1}\sum_{j=i+1}^n x_i x_j\left(\sum_{t=1}^k v_{i,t}v_{j,t}\right)+\sum_{i=1}^{n-2}\sum_{j=i+1}^{n-1}\sum_{l=j+1}^n x_i x_j x_l\left(\sum_{t=1}^k v_{i,t}v_{j,t}v_{l,t}\right)

This follows the same idea as before, and it’s not hard to see that the direct computation costs $O(kn^d)$. For $d=2$ we reduced the complexity to $O(kn)$ through the transformation above, but for $d>2$ there is no comparably good optimization. Moreover, three-way feature crosses are usually too sparse to be meaningful, so in practice we almost always use $d=2$.

Code implementation

This extension is mainly something to be aware of; it isn’t used much in practice. At the end, the paper also compares FM against a few other classical models such as SVD and SVM. Those comparisons aren’t of much value today, since almost no one in recommendation uses those models directly anymore. Finally, here are the FM implementations I wrote in PyTorch and TensorFlow for your reference.

PyTorch

import torch
from torch import nn

ndim = len(feature_names)
k = 4

class FM(nn.Module):
    def __init__(self, dim, k):
        super(FM, self).__init__()
        self.dim = dim
        self.k = k
        # Linear part: w_0 + sum_i w_i x_i
        self.w = nn.Linear(self.dim, 1, bias=True)
        # Initialize the V matrix (dim x k) with small random values
        self.v = nn.Parameter(torch.rand(self.dim, self.k) / 100)

    def forward(self, x):
        linear = self.w(x)
        # Second-order interaction term, computed with the O(kn) formulation
        quadratic = 0.5 * torch.sum(
            torch.pow(torch.mm(x, self.v), 2) - torch.mm(torch.pow(x, 2), torch.pow(self.v, 2)),
            dim=1, keepdim=True)
        # A sigmoid turns this into a binary classifier; drop it for a regression model
        return torch.sigmoid(linear + quadratic)
    
    
fm = FM(ndim, k)
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(fm.parameters(), lr=0.005, weight_decay=0.001)
iteration = 0
epochs = 10

for epoch in range(epochs):
    fm.train()
    for X, y in data_iter:
        output = fm(X)
        l = loss_fn(output.squeeze(dim=1), y)
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
        iteration += 1        
        if iteration % 200 == 199:
            with torch.no_grad():
                fm.eval()
                output = fm(X_eva_tensor)
                l = loss_fn(output.squeeze(dim=1), y_eva_tensor)
                acc = ((torch.round(output).long() == y_eva_tensor.view(-1, 1).long()).sum().float().item()) / 1024  # 1024 = size of the evaluation set here
                print('Epoch: {}, iteration: {}, loss: {}, acc: {}'.format(epoch, iteration, l.item(), acc))
            fm.train()

TensorFlow

import tensorflow as tf
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np


class FactorizationMachine:

    def __init__(self, n_dim=1, k=4, learning_rate=0.05, epochs=8):
        self._learning_rate = learning_rate
        self._n_dim = n_dim
        self._k = k
        self._epochs = epochs
        self.sess = tf.Session()
        self.x_input = tf.placeholder(shape=[None, self._n_dim], dtype=tf.float32)
        self.y_input = tf.placeholder(shape=[None, 1], dtype=tf.float32)
        # Initialize w, V and the bias b
        self.w = tf.Variable(tf.truncated_normal(shape=[self._n_dim, 1], dtype=tf.float32))
        self.V = tf.Variable(tf.truncated_normal(shape=[self._n_dim, self._k], dtype=tf.float32))
        self.b = tf.Variable(tf.truncated_normal(shape=[1, 1]))
        self.linear = tf.add(self.b, tf.matmul(self.x_input, self.w))
        self.quadratic = 1/2 * tf.reduce_sum(tf.square(tf.matmul(self.x_input, self.V)) - tf.matmul(tf.square(self.x_input), tf.square(self.V)), axis=1, keepdims=True)
        self.y_out = self.linear + self.quadratic
        self.y_pred = tf.round(tf.sigmoid(self.y_out))
        self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.y_out, labels=self.y_input))
        self.train_op = tf.train.GradientDescentOptimizer(self._learning_rate).minimize(self.loss)
        self.accuracy = tf.reduce_mean(tf.cast(tf.equal(self.y_pred, self.y_input), tf.float32))
        init = tf.global_variables_initializer()
        self.sess.run(init)

    def train(self, X, Y, iterations=1000, batch_size=16, validation_size=0.1):
        x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=validation_size)
        for epoch in range(self._epochs):
            for i in range(iterations):
                rand_idx = np.random.choice(x_train.shape[0], size=batch_size)
                rand_x = x_train[rand_idx]
                rand_y = y_train[rand_idx]
                self.sess.run(self.train_op, feed_dict={self.x_input: rand_x, self.y_input: rand_y})
                if i % 100 == 99:
                    loss = self.sess.run(self.loss, feed_dict={self.x_input: x_test, self.y_input: y_test})
                    acc = self.sess.run(self.accuracy, feed_dict={self.x_input: x_test, self.y_input: y_test})
                    print('epoch = {}, iteration ={}, loss = {}, accuracy ={}'.format(epoch, i, loss, acc))

    def predict(self, x):
        return self.sess.run(self.y_pred, feed_dict={self.x_input: x})

Note: the TensorFlow code uses the 1.x API

That wraps up our analysis of the FM paper, along with the code implementations. But the story doesn’t end here: FM has several later variants, such as AFM, FFM, PFM and so on. I will introduce those papers one by one in follow-up articles.

That’s all for today’s article. I sincerely wish you all a fruitful day. If you enjoyed today’s content, please give it a like, a bookmark, and a share.

Original link; a follow would be appreciated.