2010 Fast Context-Aware Recommendations with Factorization Machines

Background

This paper studies context-aware recommendation. The task is to predict a score for a given combination of user and item features (together with context), which is essentially a regression problem.

Context features are explained as follows

The difference between such user/ item attributes and context is that attributes are attached only to either an item or user (e.g. a genre is attached to a movie) whereas context is attached to the whole rating event (e.g. the mood of a user when rating an item).

Context features include the time, place, and mood of the user accessing the recommendation system.

Problem formulation

The model input consists of user features $U = \{u_1, u_2, \ldots\}$, item features $I = \{i_1, i_2, \ldots\}$, and the aforementioned context features $C = \{c_1, c_2, \ldots\}$. An example is given below: the left side of the figure is the raw recommendation input, the middle is the encoded feature vector, and the right side is the prediction target.

In the above example, the prediction function can be written as $y: U \times I \times \mathcal{C}_3 \times \ldots \times \mathcal{C}_m \rightarrow \mathbb{R}$, where the context domains are indexed from 3 because the user and item domains occupy indices 1 and 2. The domains take the following values:


$$\begin{aligned} U &= \{\text{Alice, Bob, Charlie}\} \\ I &= \{\text{Titanic, Notting Hill, Star Wars, Star Trek}\} \\ \mathcal{C}_3 &= \{\text{Sad, Normal, Happy}\} \\ \mathcal{C}_4 &= \mathcal{P}(U) \end{aligned}$$

So the first row in the figure above reads: Alice rated Titanic 5 stars, having watched the movie with Charlie while she was happy.

The paper solves this problem with FM and adopts an ALS optimization algorithm. The FM approach is compared against the Multiverse Recommendation algorithm, so brief notes on the latter are given below.

Multiverse Recommendation

When solving the problem with Multiverse Recommendation, the prediction is expressed as follows:


$$\hat{y}(u, i, c_3, \ldots, c_m) := \sum_{f_1=1}^{k_1} \ldots \sum_{f_m=1}^{k_m} b_{f_1, \ldots, f_m}\, v_{u, f_1}^{(U)}\, v_{i, f_2}^{(I)} \prod_{l=3}^{m} v_{c_l, f_l}^{(C_l)}$$

The specific parameter dimensions are as follows


$$\mathcal{B} \in \mathbb{R}^{k_1 \times \ldots \times k_m}, \quad \mathbf{V}^{(U)} \in \mathbb{R}^{|U| \times k_1}, \quad \mathbf{V}^{(I)} \in \mathbb{R}^{|I| \times k_2}, \quad \mathbf{V}^{(C_l)} \in \mathbb{R}^{|C_l| \times k_l}$$
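For intuition, here is a minimal numpy sketch of this Tucker-style prediction; the toy sizes, variable names, and the choice $m = 4$ are assumptions for illustration, not from the paper.

```python
import numpy as np

# Assumed toy sizes: |U| = 3, |I| = 4, |C3| = 3, |C4| = 3, and ranks k1..k4 = 2.
k1, k2, k3, k4 = 2, 2, 2, 2
rng = np.random.default_rng(0)

B    = rng.normal(size=(k1, k2, k3, k4))   # core tensor B
V_U  = rng.normal(size=(3, k1))            # user factors V^(U)
V_I  = rng.normal(size=(4, k2))            # item factors V^(I)
V_C3 = rng.normal(size=(3, k3))            # mood factors V^(C3)
V_C4 = rng.normal(size=(3, k4))            # companion factors V^(C4)

def predict(u, i, c3, c4):
    # sum over f1..f4 of B[f1,f2,f3,f4] * v_u[f1] * v_i[f2] * v_c3[f3] * v_c4[f4]
    return np.einsum("abcd,a,b,c,d->", B, V_U[u], V_I[i], V_C3[c3], V_C4[c4])

print(predict(0, 0, 2, 2))   # e.g. Alice, Titanic, Happy, watched with Charlie
```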

Multiverse Recommendation has the following problems:

  1. Let $k_i = k$; then the computational complexity is $O(k^m)$, resulting in slow training and prediction.
  2. Multiverse Recommendation applies only to categorical context.
  3. In tag-recommendation tasks, it is generally better to factorize several lower-order interactions than a single m-ary relation (as the Tucker decomposition does), because under high sparsity a factorized pairwise relation can be estimated well, while a factorized m-ary relation is hard to estimate.

The paper defines three kinds of feature domains: categorical domains, categorical set domains, and real-valued domains.

  • Categorical domain: one-hot encoded, so there is exactly one non-zero element, e.g. (1, 0, 0).
  • Categorical set domain: the dimensions corresponding to the active attributes are non-zero, so there can be several non-zero elements, e.g. (0, 0.5, 0.5).
  • Real-valued domain: the feature value is used directly as a real number.
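As a small illustration (the feature layout is my assumption, not the paper's), here is how the Alice example above could be encoded under these conventions:

```python
# Hypothetical layout: [user one-hot | item one-hot | mood one-hot | watched-with set]
users = ["Alice", "Bob", "Charlie"]
items = ["Titanic", "Notting Hill", "Star Wars", "Star Trek"]
moods = ["Sad", "Normal", "Happy"]

def one_hot(value, domain):
    """Categorical domain: exactly one non-zero entry."""
    return [1.0 if v == value else 0.0 for v in domain]

def set_encoding(values, domain):
    """Categorical set domain: several non-zero entries, normalized to sum to 1."""
    weight = 1.0 / len(values) if values else 0.0
    return [weight if v in values else 0.0 for v in domain]

# Alice rated Titanic 5 stars while Happy, having watched it with Charlie.
x = (one_hot("Alice", users)
     + one_hot("Titanic", items)
     + one_hot("Happy", moods)
     + set_encoding({"Charlie"}, users))
y = 5.0
print(x)   # [1,0,0,  1,0,0,0,  0,0,1,  0,0,1.0]
```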

Factorization Machines

When FM is used, the prediction is expressed as follows:


$$\hat{y}(\mathbf{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \hat{w}_{i,j}\, x_i x_j$$

where $\hat{w}_{i,j}$ are the interaction parameters between feature pairs, defined as $\hat{w}_{i,j} := \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}$, with $w_0 \in \mathbb{R}$, $\mathbf{w} \in \mathbb{R}^n$, $\mathbf{V} \in \mathbb{R}^{n \times k}$.
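To make the notation concrete, here is a direct, quadratic-time sketch of this prediction; the function name is mine, and the parameters follow the formula above.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """FM prediction with the explicit pairwise sum: O(k * n^2)."""
    n = len(x)
    y = w0 + float(w @ x)
    for i in range(n):
        for j in range(i + 1, n):
            y += float(V[i] @ V[j]) * x[i] * x[j]   # w_hat_{i,j} = <v_i, v_j>
    return y
```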

So the model takes feature combinations into account; the reasoning behind this is explained in the passage below (taken from the references).

By observing large amounts of sample data, one finds that the correlation between certain features and the label increases once the features are combined. For example, "USA" with "Thanksgiving" and "China" with "Chinese New Year" have a positive impact on user clicks: users from "China" are likely to show heavy browsing and purchasing behavior around "Chinese New Year", but no special consumption behavior around "Thanksgiving". Such positive correlations between combined features and labels are widespread in practical problems, for instance "cosmetics" with "female", "ball-sports accessories" with "male", and "movie tickets" with a "movie" category preference. It is therefore very meaningful to introduce pairwise feature combinations.

The model can be evaluated in $O(k \cdot m(\mathbf{x}))$ time, where $m(\mathbf{x})$ is the number of non-zero features (the effective dimension of the feature representation), by rewriting the pairwise term as follows:


$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \frac{1}{2} \sum_{f=1}^{k}\left(\left(\sum_{i=1}^{n} v_{i,f} x_i\right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2} x_i^{2}\right)$$
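And a sketch of the rewritten form; for the same inputs it should agree with the naive version above up to floating-point error.

```python
import numpy as np

def fm_predict_linear(x, w0, w, V):
    """Equivalent prediction in O(k * n) using the squared-sum trick."""
    s  = V.T @ x                # shape (k,): sum_i v_{i,f} x_i
    s2 = (V ** 2).T @ (x ** 2)  # shape (k,): sum_i v_{i,f}^2 x_i^2
    return float(w0 + w @ x + 0.5 * np.sum(s ** 2 - s2))
```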

With the addition of regularization terms, the model optimization objective is as follows:


$$\mathrm{RLS\text{-}OPT} = \sum_{(\mathbf{x}, y) \in S}(\hat{y}(\mathbf{x}) - y)^{2} + \sum_{\theta \in \Theta} \lambda_{(\theta)}\, \theta^{2}$$
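As a quick sketch of this objective (using a single shared regularization constant instead of per-parameter $\lambda_{(\theta)}$, which is a simplification):

```python
import numpy as np

def rls_opt(X, y, w0, w, V, lam):
    """Regularized squared-error objective over a data set (X: (S, n), y: (S,))."""
    s  = X @ V                  # (S, k)
    s2 = (X ** 2) @ (V ** 2)    # (S, k)
    y_hat = w0 + X @ w + 0.5 * np.sum(s ** 2 - s2, axis=1)
    reg = lam * (w0 ** 2 + np.sum(w ** 2) + np.sum(V ** 2))
    return float(np.sum((y_hat - y) ** 2) + reg)
```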

ALS optimization algorithm

The above problem can be solved with SGD, but this paper discusses ALS (Alternating Least Squares). Using ALS avoids having to tune a learning rate. The derivation of ALS is somewhat involved, so it is only recorded here and some details remain unclear to me (this part is copied from the references).

LEMMA 1 (LINEARITY IN $\theta$). An FM is a linear function with respect to every single model parameter $\theta \in \Theta$ and thus can be re-expressed as:


$$\hat{y}(\mathbf{x} \mid \theta) = g_{(\theta)}(\mathbf{x}) + \theta\, h_{(\theta)}(\mathbf{x})$$

In other words, each parameter $\theta$ can be isolated: $h$ is its coefficient and $g$ collects all remaining terms.

$h$ is the partial derivative of $\hat{y}$ with respect to $\theta$:


$$h_{(\theta)}(\mathbf{x}) = \frac{\partial}{\partial \theta} \hat{y}(\mathbf{x} \mid \theta)$$
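Writing the derivative out for each kind of FM parameter (this step is implicit in the note; it follows directly from the model equation):

$$h_{(w_0)}(\mathbf{x}) = 1, \qquad h_{(w_l)}(\mathbf{x}) = x_l, \qquad h_{(v_{l,f})}(\mathbf{x}) = x_l \sum_{j \neq l} v_{j,f}\, x_j$$

and in each case $g_{(\theta)}(\mathbf{x}) = \hat{y}(\mathbf{x} \mid \theta) - \theta\, h_{(\theta)}(\mathbf{x})$.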

LEMMA 2 (OPTIMAL VALUE FOR $\theta$). The regularized least-squares solution for a single parameter $\theta$ of a linear model $\hat{y}(\mathbf{x} \mid \theta)$ is:


$$\theta = -\frac{\sum_{(\mathbf{x}, y) \in S}\left(g_{(\theta)}(\mathbf{x}) - y\right) h_{(\theta)}(\mathbf{x})}{\sum_{(\mathbf{x}, y) \in S} h_{(\theta)}^{2}(\mathbf{x}) + \lambda_{(\theta)}}$$

This follows by setting the partial derivative of the loss RLS-OPT with respect to $\theta$ to 0 and substituting the two equations from Lemma 1 during the simplification.
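Spelling out that step: substituting $\hat{y}(\mathbf{x} \mid \theta) = g_{(\theta)}(\mathbf{x}) + \theta\, h_{(\theta)}(\mathbf{x})$ into RLS-OPT and differentiating,

$$\frac{\partial}{\partial \theta}\, \mathrm{RLS\text{-}OPT} = \sum_{(\mathbf{x}, y) \in S} 2\left(g_{(\theta)}(\mathbf{x}) + \theta\, h_{(\theta)}(\mathbf{x}) - y\right) h_{(\theta)}(\mathbf{x}) + 2\lambda_{(\theta)}\, \theta = 0,$$

and solving for $\theta$ yields the closed form above.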

The algorithm proceeds as follows. The weights of the lower-order interactions are trained first, because they have more data behind them and can therefore be estimated more reliably (there are obviously more samples containing a single feature than samples containing a particular feature combination). For the 2-way interactions, training proceeds one factor dimension at a time. The training complexity is $O(|S| \cdot m \cdot k)$, where $|S|$ is the number of samples, $m$ is the feature length, and $k$ is the length of the latent embedding. The paper also accelerates ALS further, which is omitted here.
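A dense, unoptimized sketch of one ALS sweep, just to show how Lemma 2 is applied parameter by parameter; the single shared `lam` and the helper names are my assumptions, and the paper's sparsity and caching tricks are omitted.

```python
import numpy as np

def fm_predict_all(X, w0, w, V):
    """Vectorized FM predictions for all rows of X (shape (S, n))."""
    s, s2 = X @ V, (X ** 2) @ (V ** 2)
    return w0 + X @ w + 0.5 * np.sum(s ** 2 - s2, axis=1)

def als_sweep(X, y, w0, w, V, lam=0.1):
    """One pass over all parameters using the closed form of Lemma 2."""
    y_hat = fm_predict_all(X, w0, w, V)

    def solve(theta, h):
        # theta_new = - sum((g - y) * h) / (sum(h^2) + lambda),  g = y_hat - theta * h
        return -np.sum((y_hat - theta * h - y) * h) / (np.sum(h * h) + lam)

    # Global bias first: h = 1 for every sample.
    h = np.ones(len(y))
    new = solve(w0, h); y_hat += (new - w0) * h; w0 = new

    # Linear weights: h = x_l.
    for l in range(X.shape[1]):
        h = X[:, l]
        new = solve(w[l], h); y_hat += (new - w[l]) * h; w[l] = new

    # Pairwise factors, one dimension f at a time: h = x_l * (sum_j v_{j,f} x_j - v_{l,f} x_l).
    for f in range(V.shape[1]):
        q = X @ V[:, f]                       # cached per-sample sums for factor f
        for l in range(X.shape[1]):
            h = X[:, l] * (q - V[l, f] * X[:, l])
            new = solve(V[l, f], h)
            y_hat += (new - V[l, f]) * h
            q += (new - V[l, f]) * X[:, l]    # keep the cached sums consistent
            V[l, f] = new
    return w0, w, V
```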

The experiment

Comparison of ALS and SGD

The experimental results are as follows. The left figure shows that the error of SGD varies greatly with the learning rate, and ALS is better than SGD. The middle and right figures compare SGD and ALS at different learning rates, and ALS wins clearly.

Multiverse Recommendation and FM time comparison

The experimental results are shown below. For the Multiverse Recommendation algorithm, $m = 4$. FM's runtime grows linearly in $k$, while Multiverse Recommendation's runtime is $O(k^4)$, so the difference is substantial.

Comparison of prediction results

FM (no context) degenerates into a simple linear model. It can be seen that FM performs well, achieving the lowest RMSE and MAE. Because it takes context features into account, Multiverse Recommendation is better than FM (no context).

The experimental results under different context proportions are shown below. As the proportion of context features increases, the performance of FM and Multiverse Recommendation improves, while the performance of FM (no context) degrades because it cannot capture the context features.


Conclusion

The FM formulation is simple and elegant, and also quite interpretable. Overall, the paper focuses more on the mathematics than on engineering details, and the derivations require careful study to understand deeply.

References

  1. 【FM】Factorization Machines
  2. Deep into FFM principle and practice