Author: vivo Internet Server Team - Tang Shutao

Recommendation systems are now everywhere: open Douyin, Taobao, or the JD.com app and you will find recommendation modules in every corner, and behind them lies a whole stack of technologies.

This article takes classic collaborative filtering as its starting point and focuses on the matrix factorization algorithm widely used in industry, introducing the principles of the algorithm from both a theoretical and a practical perspective. It is meant to be easy to follow, and I hope it brings you some inspiration.

The author believes that the best way to thoroughly understand a paper is to reproduce it by hand; in the process you will run into all kinds of questions and theoretical details.

1. Background

1.1 Introduction

In the twenty-first century, information explodes and people are easily drowned in an ocean of knowledge. In this situation, search engines help us quickly find what we are looking for.

In e-commerce, goods are now abundant and dazzlingly varied, so consumers are easily overwhelmed by choice; in other words, users face the problem of information overload.

To solve this problem, recommendation engines came into being. For example, when we open the Taobao app, the JD app, or the Bilibili video app, every scenario has its own recommendation module.

Suppose a kindergarten child suddenly asks you: why did JD recommend this "Programmer's Cervical Spine Rehabilitation Guide" to you? You might answer: because I am a programmer by profession.

Then the child asks: why is "Spark Big Data Analysis" recommended in 6th place, while "Scala Programming" is in 2nd place? At this point you may not be able to answer.

To answer this question, we imagine the following scenario:

In JD's e-commerce system there are two roles, user and product, and we assume that users give a score between 0 and 5 to the products they buy. The higher the score, the more the user likes the product.

Based on this assumption, the question above becomes: how would the user rate the three books "Programmer's Cervical Spine Rehabilitation Guide", "Spark Big Data Analysis", and "Scala Programming" (books the user has not yet purchased)? The order of items on the page is therefore equivalent to predicting the user's ratings for these items and sorting by the predicted ratings.

To make it easier to predict users' ratings of items, we take all triples (User, Item, Rating), i.e. the rating Rating that user User gave to a purchased item Item, and organize them into the following matrix form:

[Table: user-item rating matrix]

The table contains \(m\) users and \(n\) items, and is defined as a rating matrix \({\bf{R}}_{m \times n}\), where the element \(r_{u,i}\) represents the \(u\)th user's rating of the \(i\)th item.

For example, in the table above, user user-1 purchased items item-1, item-3, item-4 and gave them ratings of 4, 2, and 5, respectively. The original problem is thus transformed into predicting the values in the blank cells.

1.2 Collaborative filtering

Collaborative filtering, in simple terms, recommends items of interest to a user by exploiting the preferences of a group with interests and experience similar to that user's. The mathematical language for "similar interests" is similarity (between people, or between things). According to the object of the similarity, collaborative filtering can therefore be divided into user-based collaborative filtering and item-based collaborative filtering.

Take the rating matrix as an example. Read in the row direction, each row is the vector representation of a user; for example, the vector of user-1 is [4, 0, 2, 5, 0, 0]. Read in the column direction, each column is the vector representation of an item; for example, the vector of item-1 is [4, 3, 0, 0, 5].

Based on the vector representation, there are various formulas for computing similarity, such as cosine similarity, Euclidean distance, and the Pearson correlation. Here we take cosine similarity as an example; it is the high-dimensional extension of the vector angle we learned in secondary school (which only involves 2 and 3 dimensions). The cosine similarity formula is easy to understand and use. Given two vectors \(\mathbf{A}=\{a_1, \cdots, a_n\}\) and \(\mathbf{B}=\{b_1, \cdots, b_n\}\), the cosine of the angle between them is defined as follows:

\(\cos(\theta)=\frac{\bf{A}\cdot \bf{B}}{|\bf{A}||\bf{B}|}=\frac{a_1 b_1 + \cdots + a_n b_n}{\sqrt{a_1^2+\cdots +a_n^2}\sqrt{b_1^2 + \cdots +b_n^2}}\)

For example, let us compute the cosine similarity of user-3 and user-4, whose vectors are [0, 2, 0, 3, 0, 4] and [0, 3, 3, 5, 4, 0]:

\(\cos(\theta)=\frac{0\times 0 + 2\times 3 + 0\times 3 + 3\times 5 + 0\times 4 + 4\times 0}{\sqrt{2^2+3^2+4^2}\,\sqrt{3^2+3^2+5^2+4^2}}=\frac{21}{\sqrt{29}\sqrt{59}}\approx 0.508\)
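
A quick numpy check of this calculation (a minimal sketch; the two vectors are simply the user-3 and user-4 rows from the rating matrix above):

import numpy as np

u3 = np.array([0, 2, 0, 3, 0, 4])   # user-3
u4 = np.array([0, 3, 3, 5, 4, 0])   # user-4
cos = u3.dot(u4) / (np.linalg.norm(u3) * np.linalg.norm(u4))
print(round(cos, 3))                # ≈ 0.508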

The closer the cosine of the angle between the vectors is to 1, the closer the two vectors are to parallel, i.e. the more similar they are; the closer it is to -1, the closer they are to pointing in opposite directions, i.e. the more dissimilar; and a value close to 0 means the vectors are nearly orthogonal and the two items are almost unrelated. Clearly this matches human intuition.

For example, we often see a "related recommendations" module in video apps; one of the principles behind it is similarity computation. Here is a concrete example.

We use "Blood Clan Part 1" to perform similarity search in the vector library (a database that stores vectors, the system can search the library with the similarity formula according to the input vector, and find the candidate vector of TopN), and find the previous Of the 7 films with high similarity, it is worth noting that the first is himself, with a similarity of 1.0, and the other three are the other three works of the same series of "Blood".

1.2.1 User-based Collaborative Filtering (UserCF)

User-based collaborative filtering proceeds in two steps:

  1. Find the TopN users most similar to user u.
  2. Collect the items rated by those TopN users into a candidate set, and use a similarity-weighted average to estimate user u's score for each candidate item.

[Figure: user-based collaborative filtering]

For example, the candidate items obtained from user u's similar users {u1, u3, u5, u9} are

\(\{i_1, i_2, i_3, i_4, i_5, i_6, i_7\}\)

We now predict user u's rating for item i1. Since this item appears in the purchase records of the two users {u1, u5}, user u's predicted rating for item i1 is:

\({r}_{u,i_1} = \frac{\text{sim}(u,u_1)\times {r}_{u_1,i_1}+\text{sim}(u,u_5)\times {r}_{u_5,i_1}}{\text{sim}(u,u_1)+\text{sim}(u,u_5)}\)

Where \(\text{sim}(u,u_1)\) represents the similarity between user u and user \(u_1\).

When recommending, sort all candidate items by user u's predicted scores and select the TopM candidates to recommend to user u.
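
As a minimal sketch of these two steps (the data structures sim and ratings are illustrative, not from any library), the weighted-average prediction looks like this:

def predict_usercf(item, sim, ratings):
    """Predict user u's score for `item`.
    sim:     dict {neighbor user: similarity to u} for the TopN similar users
    ratings: dict {(user, item): known rating}
    """
    num = sum(s * ratings[(v, item)] for v, s in sim.items() if (v, item) in ratings)
    den = sum(s for v, s in sim.items() if (v, item) in ratings)
    return num / den if den > 0 else None

# e.g. if only neighbors u1 and u5 rated i1, this reduces to the formula for r_{u,i1} above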

1.2.2 Item-based collaborative filtering (ItemCF)

[Figure: item-based collaborative filtering]

Item-based collaborative filtering is divided into two steps:

  1. For each item in the set of items purchased by user u, find its TopN most similar items.
  2. These similar items form a candidate set; use a similarity-weighted average to estimate user u's score for each candidate item.

For example, we predict user u's rating for item i3. Since item i3 is similar to items {i6, i1, i9}, user u's predicted rating for item i3 is:

\({r}_{u,i_3} = \frac{\text{sim}(i_6,i_3)\times {r}_{u,i_6}+\text{sim}(i_1,i_3)\times {r}_{u,i_1}+\text{sim}(i_9,i_3)\times {r}_{u,i_9}}{\text{sim}(i_6,i_3)+\text{sim}(i_1,i_3)+\text{sim}(i_9,i_3)}\)

Where \(\text{sim}(i_6,i_3)\) represents the similarity between item \(i_6\) and item \(i_3\); the other symbols are analogous.
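
A symmetric sketch for ItemCF (again with illustrative data structures): the weights are now item-item similarities, and the averaged ratings are the user's own historical ratings.

def predict_itemcf(user, item, sim_items, ratings):
    """Predict `user`'s score for `item`.
    sim_items: dict {similar item j: sim(j, item)}
    ratings:   dict {(user, item): known rating}
    """
    num = sum(s * ratings[(user, j)] for j, s in sim_items.items() if (user, j) in ratings)
    den = sum(s for j, s in sim_items.items() if (user, j) in ratings)
    return num / den if den > 0 else None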

[Figure: taxonomy of collaborative filtering algorithms]

1.2.3 Comparison of UserCF and ItemCF

We summarize ItemCF and UserCF as follows:

UserCF is mainly used to recommend items liked by users with the same interests and hobbies. Its results reflect the hotspots of the small group whose interests are similar to the user's; the recommendations are more social and reflect how popular an item is within the user's interest group. In practice, UserCF is typically used for news recommendation.

ItemCF recommends items similar to those the user liked before; its results focus on maintaining the user's historical interests, so the recommendations are more personalized and reflect the user's own taste. In practice, ItemCF is used by book and movie platforms such as Douban, Amazon, and Netflix.

In addition to user-based and item-based collaborative filtering, there is a class of model-based collaborative filtering algorithms, as shown in the figure above. User-based and item-based collaborative filtering can both be classified as neighborhood-based (K-Nearest Neighbor, KNN) algorithms: they essentially look for "TopN neighbors" and then use the neighbors and similarities to make predictions.

2. Matrix factorization

The classic collaborative filtering algorithms have some shortcomings, the most obvious being the sparsity problem. The rating matrix is a large sparse matrix, which often makes the dot product of two vectors equal to 0 when computing similarity (taking cosine similarity as an example). To see this more intuitively, consider the following example:

from sklearn.metrics.pairwise import cosine_similarity
 
a = [
  [  0,   0,   0,   3,   2,  0, 3.5,  0,  1 ],
  [  0,   1,   0,   0,   0,  0,   0,  0,  0 ],
  [  0,   0,   1,   0,   0,  0,   0,  0,  0 ],
  [4.1, 3.8, 4.6, 3.8, 4.4,  3,   4,  0, 3.6]
]
 
cosine_similarity(a)
 
# array([[1.        , 0.        , 0.        , 0.66209271],
#        [0.        , 1.        , 0.        , 0.34101639],
#        [0.        , 0.        , 1.        , 0.41280932],
#        [0.66209271, 0.34101639, 0.41280932, 1.        ]])

[Figure: item vectors item-1 to item-4 and their similarity matrix]

We extract the vectors of item-1 to item-4 from the rating matrix and use cosine similarity to compute the similarity between them.

From the similarity matrix we can see that the similarity between items item-1, item-2, and item-3 is 0, and that the item most similar to each of item-1, item-2, and item-3 is item-4, so in an ItemCF-based recommendation scenario item-4 would be recommended to the user.

However, the reason item-4 is the most similar item to item-1, item-2, and item-3 is that item-4 is a popular item purchased by many users, while the similarity between item-1, item-2, and item-3 is 0 because their feature vectors are very sparse and lack overlapping data for the similarity computation.

To sum up, the classic user/item-based collaborative filtering algorithms have an inherent defect: they cannot handle sparse scenarios. Matrix factorization was proposed to solve this problem.

2.1 Explicit feedback

We call the behavior of a user explicitly rating an item explicit feedback. Matrix factorization based on explicit feedback approximates the rating matrix \({\bf{R}}_{m \times n}\) by the product of two matrices \({\bf{X}}_{m \times k}\) and \({\bf{Y}}_{n \times k}\); its mathematical expression is as follows:

\({\bf{R}}_{m \times n} \approx {\bf{X}}_{m \times k}\left({\bf{Y}}_{n \times k}\right)^{\text T}\)

Here \(k \ll m, n\) is the number of latent factors. From the user's side, \(k=2\) could be understood as, say, two attributes such as age and gender. There is also a nice analogy with the triangular prism in physics: white light is decomposed into seven colors by the prism, and in the matrix factorization algorithm the decomposition plays a role similar to the prism, as shown in the following figure; for this reason matrix factorization is also known as the latent semantic model. Matrix factorization reduces the degrees of freedom of the system from \(\mathcal{O}(mn)\) to \(\mathcal{O}((m+n)k)\), thus achieving dimensionality reduction.

[Figure: prism analogy, matrix factorization decomposes the rating matrix as a prism decomposes white light]
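
A rough numpy illustration of this dimensionality reduction (the sizes here are made up for illustration):

import numpy as np

m, n, k = 1000, 500, 10      # users, items, latent dimension (illustrative values)
X = np.random.rand(m, k)     # user matrix
Y = np.random.rand(n, k)     # item matrix
R_approx = X.dot(Y.T)        # shape (m, n): approximates the rating matrix
print(m * n, (m + n) * k)    # 500000 entries to approximate vs. 15000 parameters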

To solve for the matrices \({\bf{X}}_{m \times k}\) and \({\bf{Y}}_{n \times k}\), we minimize a squared-error loss function so that the product of the two matrices is as close as possible to the rating matrix \({\bf{R}}_{m \times n}\), that is

$$ \min\limits_{{\bf{x}}^*,{\bf{y}}^*} L({\bf{X}},{\bf{Y}})=\min\limits_{{\bf{x}}^*,{\bf{y}}^*}\sum\limits_{r_{u,i} \text{ is known}}(r_{u,i}-{\bf{x}}_u^{\text T}{\bf{y}}_i)^2+\lambda \left( \sum\limits_{u}{\bf{x}}_u^{\text T}{\bf{x}}_u+\sum\limits_{i}{\bf{y}}_i^{\text T}{\bf{y}}_i\right) $$

where \(\lambda \left( \sum\limits_{u}{\bf{x}}_u^{\text T}{\bf{x}}_u+\sum\limits_{i}{\bf{y}}_i^{\text T}{\bf{y}}_i\right)\) is the penalty term, \(\lambda\) is the penalty/regularization coefficient, \(\mathbf{x}_u\) is the \(k\)-dimensional feature vector of the \(u\)th user, and \(\mathbf{y}_i\) is the \(k\)-dimensional feature vector of the \(i\)th item:

\({\bf{x}}_u = \begin{pmatrix} x_{u,1} \\ \vdots \\ x_{u,k} \\ \end{pmatrix} \qquad {\bf{y}}_i = \begin{pmatrix} y_{i,1} \\ \vdots \\ y_{i,k} \\ \end{pmatrix}\)

The feature vectors of all users make up the user matrix \({\bf{X}}_{m \times k}\), and the feature vectors of all items make up the item matrix \({\bf{Y}}_{n \times k}\):

\({\bf{X}}_{m \times k} = \begin{pmatrix} {\bf{x}}_1^{\text T} \\ \vdots \\ {\bf{x}}_m^{\text T} \\ \end{pmatrix} \qquad {\bf{Y}}_{n \times k} = \begin{pmatrix} {\bf{y}}_1^{\text T} \\ \vdots \\ {\bf{y}}_n^{\text T} \\ \end{pmatrix}\)

When training the model, we only need to learn the \(m \times k\) parameters in the user matrix and the \(n \times k\) parameters in the item matrix. Collaborative filtering is thus successfully transformed into an optimization problem.

2.2 Predicting ratings

Through model training (i.e. the process of solving for the model coefficients) we obtain the user matrix \({\bf{X}}_{m \times k}\) and the item matrix \({\bf{Y}}_{n \times k}\), and the ratings of all users for all items can then be predicted by computing \({\bf{X}}_{m \times k}\left({\bf{Y}}_{n \times k}\right)^{\text T}\).

Once all predicted ratings are available, we can recommend selectively for each user. Note that the estimated ratings obtained from the product of the user matrix and the item matrix are not exactly equal to the users' actual ratings, only approximately equal: the actual and estimated ratings are close, but there is some error.

2.3 Theoretical derivation

There are many derivations of the ALS matrix factorization online, but quite a few of them are not rigorous, and some even make mistakes when taking derivatives with respect to vectors. Some bloggers also misread the summation term of the loss function, for example writing

\(\sum\limits_{\color{red}{u=1}}^{\color{red} m}\sum\limits_{\color{red}{i=1}}^{\color{red} n}(r_{u,i}-{\bf{x}}_u^{\text T}{\bf{y}}_i)^2\)

But the rating matrix is sparse, and the summation does not traverse the full Cartesian product of users and items. The correct form is

\(\sum\limits_{\color{red}{(u,i) \text{ is known}}}(r_{u,i}-{\bf{x}}_u^{\text T}{\bf{y}}_i)^2\)

where \((u,i) \text{ is known}\) means the rating entries that are known.

In this section we give a detailed and correct derivation. First, it serves as a small mathematics exercise; second, a deeper understanding of the algorithm makes it easier to read the Spark ALS source code.

Describing \((u,i) \text{ is known}\) in mathematical language, the loss function of matrix factorization is defined as follows:

\(L({\bf{X}},{\bf{Y}})=\sum\limits_{\color{red}{(u,i) \in K}}(r_{u,i}-{\bf{x}}_u^{\text T}{\bf{y}}_i)^2+\lambda \left( \sum\limits_{u}{\bf{x}}_u^{\text T}{\bf{x}}_u+\sum\limits_{i}{\bf{y}}_i^{\text T}{\bf{y}}_i\right)\)

where \(K\) is the set of known \((u, i)\) pairs in the rating matrix. For example, the \(K\) corresponding to the following rating matrix is

\({\bf{R}}_{4 \times 4} = \begin{pmatrix} 0 & r_{1,2} & r_{1,3} & 0 \\ r_{2,1} & 0 & r_{2,3} & 0 \\ 0 & r_{3,2} & 0 & r_{3,4} \\ 0 & r_{4,2} & r_{4,3} & r_{4,4} \end{pmatrix} \\ \Rightarrow \color{red}{K = \{(1,2), (1,3), (2,1), (2,3), (3,2), (3,4), (4,2), (4,3), (4,4)\}}\)
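
With the convention used in this article that 0 marks an unknown rating, the set \(K\) can be read directly off the rating matrix, e.g. with numpy (a minimal sketch using the same sparsity pattern as the matrix above, with made-up values):

import numpy as np

R = np.array([[0, 1, 5, 0],
              [4, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 2, 4, 5]])              # illustrative ratings; 0 = unknown
K = list(zip(*np.nonzero(R)))             # known (u, i) pairs, 0-indexed
print(K)  # [(0, 1), (0, 2), (1, 0), (1, 2), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3)]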

There are two typical optimization methods for solving the above loss function, which are:

  • Alternating Least Squares ( ALS )
  • Stochastic Gradient Descent ( SGD )

Alternating least squares refers to fixing one of the variables, using least squares to solve the other variable, and doing this alternately until convergence or reaching the maximum number of iterations, which is also the origin of the word "alternation".

Stochastic gradient descent is one of the most commonly used methods in optimization; it computes the gradient and then updates the variables to be solved.

For its matrix factorization algorithm, Spark ultimately chose ALS as the only official implementation, because ALS is easy to parallelize and its tasks have no dependencies on each other.

Now let us derive the whole computation. In machine learning, differentiation is usually carried out at the level of whole vectors; partial derivatives are rarely taken component by component.

First, we fix the item matrix \({\bf{Y}}\) and treat it as a constant. Without loss of generality, define the set of items rated by user u as \(I_u\); take the partial derivative of the loss function with respect to the vector \(\mathbf{x}_u\), and set the derivative to 0 to obtain:

$$ \frac{\partial L}{\partial {\bf{x}}_u}=-2\sum\limits_{i \in I_u}(r_{u,i}-{\bf{x}}_u^{\text T}{\bf{y}}_i)\frac{\partial ({\bf{x}}_u^{\text T}{\bf{y}}_i)}{\partial {\bf{x}}_u}+2\lambda \frac{\partial ({\bf{x}}_u^{\text T}{\bf{x}}_u)}{\partial {\bf{x}}_u}=0, \quad u=1, \cdots, m \\ \begin{split} & \quad \Rightarrow \sum\limits_{i \in I_u}(r_{u,i}-{\bf{x}}_u^{\text T}{\bf{y}}_i){\bf{y}}_i^{\text T}=\lambda {\bf{x}}_u^{\text T} \\ & \quad \Rightarrow \sum\limits_{i \in I_u}r_{u,i}{\bf{y}}_i^{\text T}-\sum\limits_{i \in I_u}{\bf{x}}_u^{\text T}{\bf{y}}_i{\bf{y}}_i^{\text T}=\lambda {\bf{x}}_u^{\text T} \\ & \quad \Rightarrow \sum\limits_{i \in I_u}{\bf{x}}_u^{\text T}{\bf{y}}_i{\bf{y}}_i^{\text T}+\lambda {\bf{x}}_u^{\text T}=\sum\limits_{i \in I_u}r_{u,i}{\bf{y}}_i^{\text T} \end{split} $$

Since the vector \(\mathbf{x}_u\) does not depend on the summation index in \(\sum\limits_{i \in I_u}\), it can be moved outside the summation sign; and because \({\bf{x}}_u^{\text T}{\bf{y}}_i{\bf{y}}_i^{\text T}\) is a matrix product (which is not commutative), \(\mathbf{x}_u\) stays on the left:

$$ {\bf{x}}_u^{\text T}\sum\limits_{i \in I_u}{\bf{y}}_i{\bf{y}}_i^{\text T}+\lambda {\bf{x}}_u^{\text T}=\sum\limits_{i \in I_u}r_{u,i}{\bf{y}}_i^{\text T} $$

Taking the transpose of both sides of the equation, we have

$$ \begin{split} & \quad \Rightarrow \left({\bf{x}}_u^{\text T}\sum\limits_{i \in I_u}{\bf{y}}_i{\bf{y}}_i^{\text T}+\lambda {\bf{x}}_u^{\text T}\right)^{\text T}=\left(\sum\limits_{i \in I_u}r_{u,i}{\bf{y}}_i^{\text T}\right)^{\text T} \\ & \quad \Rightarrow \left(\sum\limits_{i \in I_u}{\bf{y}}_i{\bf{y}}_i^{\text T}\right){\bf{x}}_u+\lambda {\bf{x}}_u=\sum\limits_{i \in I_u}r_{u,i}{\bf{y}}_i \\ & \quad \Rightarrow \left(\sum\limits_{i \in I_u}{\bf{y}}_i{\bf{y}}_i^{\text T}+\lambda {\bf{I}}_k\right){\bf{x}}_u=\sum\limits_{i \in I_u}r_{u,i}{\bf{y}}_i \end{split} $$

To simplify \(\sum\limits_{i \in I_u}{\bf{y}}_i{\bf{y}}_i^{\text T}\) and \(\sum\limits_{i \in I_u}r_{u,i}{\bf{y}}_i\), we expand \(I_u\).

Suppose \(I_u=\{i_{c_1}, \cdots, i_{c_N}\}\), where \(N\) is the number of items rated by user \(u\) and \(i_{c_j}\) is the index of the \(j\)th such item. With the help of \(I_u\), we have

$$ \sum\limits_{i \in I_u}{\bf{y}}_i{\bf{y}}_i^{\text T}= \begin{pmatrix} {\bf{y}}_{c_1}, \cdots,{\bf{y}}_{c_N} \end{pmatrix} \begin{pmatrix} {\bf{y}}_{c_1}^{\text T} \\ \vdots \\ {\bf{y}}_{c_N}^{\text T} \end{pmatrix}={\bf{Y}}_{I_u}^{\text T}{\bf{Y}}_{I_u} \\ \sum\limits_{i \in I_u}r_{u,i}{\bf{y}}_i= \begin{pmatrix}{\bf{y}}_{c_1}, \cdots,{\bf{y}}_{c_N} \end{pmatrix} \begin{pmatrix} r_{u,c_1} \\ \vdots \\ r_{u,c_N} \end{pmatrix}={\bf{Y}}_{I_u}^{\text T}{\bf{R}}_{u,I_u}^{\text T} $$

Here \({\bf{Y}}_{I_u}\) is the submatrix formed by the \(N\) row vectors of the item matrix \({\bf{Y}}\) selected by the row indices \(I_u=\{i_{c_1}, \cdots, i_{c_N}\}\),

and \({\bf{R}}_{u,I_u}\) is the sub-row vector formed by the \(N\) elements of the \(u\)th row of the rating matrix \({\bf{R}}\) selected by the indices \(I_u=\{i_{c_1}, \cdots, i_{c_N}\}\).

Therefore, we have

\(\left({\bf{Y}}_{I_u}^{\text T}{\bf{Y}}_{I_u}+\lambda {\bf{I}}_k\right){\bf{x}}_u={\bf{Y}}_{I_u}^{\text T}{\bf{R}}_{u,I_u}^{\text T} \\ \quad \Rightarrow {\bf{x}}_u=\left({\bf{Y}}_{I_u}^{\text T}{\bf{Y}}_{I_u}+\lambda {\bf{I}}_k\right)^{-1}{\bf{Y}}_{I_u}^{\text T}{\bf{R}}_{u,I_u}^{\text T}\)
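
A quick numerical sanity check of this closed form (random data, purely illustrative): the solution of the regularized normal equations should make the per-user gradient vanish.

import numpy as np

np.random.seed(0)
k, lam = 3, 0.1
Y_Iu = np.random.rand(4, k)     # the 4 item vectors rated by user u (illustrative)
r_u = np.random.rand(4)         # the corresponding ratings
# closed form: x_u = (Y_Iu^T Y_Iu + lambda * I)^-1 Y_Iu^T r_u
x_u = np.linalg.solve(Y_Iu.T.dot(Y_Iu) + lam * np.eye(k), Y_Iu.T.dot(r_u))
# gradient of sum_i (r_i - x^T y_i)^2 + lambda * x^T x should be ~0 at x_u
grad = -2 * Y_Iu.T.dot(r_u - Y_Iu.dot(x_u)) + 2 * lam * x_u
print(np.allclose(grad, 0))     # True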

In online blogs, many authors state a conclusion similar to the following, which is not rigorous and mainly stems from a misunderstanding of the loss function:

\({\bf{x}}_u=\left({\bf{\color{red} Y}}^{\text T}{\bf{\color{red} Y}}+\lambda {\bf{I}}_k\right)^{-1}{\bf{\color{red} Y}}^{\text T}{\bf{\color{red} R}}_{u}^{\text T}\)

Similarly, we define the set of users who rated item \(i\) as \(U_i=\{u_{d_1}, \cdots, u_{d_M}\}\), where \(M\) is the number of users who rated item \(i\).

By symmetry, we have

\({\bf{y}}_i=\left({\bf{X}}_{U_i}^{\text T}{\bf{X}}_{U_i}+\lambda {\bf{I}}_k\right)^{-1}{\bf{X}}_{U_i}^{\text T}{\bf{R}}_{i,U_i}\)

Here \({\bf{X}}_{U_i}\) is the submatrix formed by the \(M\) row vectors of the user matrix \({\bf{X}}\) selected by the row indices \(U_i=\{u_{d_1}, \cdots, u_{d_M}\}\),

and \({\bf{R}}_{i,U_i}\) is the sub-column vector formed by the \(M\) elements of the \(i\)th column of the rating matrix \({\bf{R}}\) selected by the indices \(U_i=\{u_{d_1}, \cdots, u_{d_M}\}\).

In addition, \(\mathbf{I}_k\) is the \(k \times k\) identity matrix.

If the derivation above still feels abstract, here is a concrete example to walk through the intermediate steps.

\(\begin{pmatrix} 0 & r_{1,2} & r_{1,3} & 0 \\ r_{2,1} & 0 & r_{2,3} & 0 \\ 0 & r_{3 ,2} & 0 & r_{1,3} \\ 0 & r_{2,2} & r_{2,3} & r_{2,4} \\ \end{pmatrix} \approx \begin{pmatrix} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \\ x_{3,1} & x_{3,2} \\ x_{4,1 } & x_{4,2} \end{pmatrix} \begin{pmatrix} y_{1,1} & y_{1,2} \\ y_{2,1} & y_{2,2} \\ y_{ 3,1} & y_{3,2} \\ y_{4,1} & y_{4,2} \end{pmatrix}^{\text T} \\ \Rightarrow {\bf{R}} \approx {\bf{X}} {\bf{Y}}^{\text T}\)

Note that the loss function is a scalar; here we only expand the terms involving \(x_{1,1}\) and \(x_{1,2}\), as shown below:

\(L=\sum\limits_{\color{red}{(u,i) \text{ is known}}}(r_{u,i} - {\bf{x}}_u^{\text{T}}{\bf{y}}_i)^2 =(\color{blue}{r_{1,2}} - \color{red}{x_{1,1}}y_{2,1} - \color{red}{x_{1,2}}y_{2,2})^2 + (\color{blue}{r_{1,3}} - \color{red}{x_{1,1}}y_{3,1} - \color{red}{x_{1,2}}y_{3,2})^2 + \cdots\)

Taking the partial derivatives of the loss function with respect to \(x_{1,1}\) and \(x_{1,2}\) and setting them to 0, we get

\(\frac{\partial{L}}{\partial{\color{red}{x_{1,1}}}}=2(\color{blue}{r_{1,2}} - \color{red}{x_{1,1}}y_{2,1}-\color{red}{x_{1,2}}y_{2,2})(-y_{2,1}) + 2(\color{blue}{r_{1,3}} - \color{red}{x_{1,1}}y_{3,1}-\color{red}{x_{1,2}}y_{3,2})(-y_{3,1}) = 0 \\ \frac{\partial{L}}{\partial{\color{red}{x_{1,2}}}}=2(\color{blue}{r_{1,2}} - \color{red}{x_{1,1}}y_{2,1}-\color{red}{x_{1,2}}y_{2,2})(-y_{2,2}) + 2(\color{blue}{r_{1,3}} - \color{red}{x_{1,1}}y_{3,1}-\color{red}{x_{1,2}}y_{3,2})(-y_{3,2}) = 0\)

Written in matrix form, we can get

\(\begin{pmatrix} y_{2,1} & y_{3,1} \\ y_{2,2} & y_{3,2} \\ \end{pmatrix} \begin{pmatrix} y_{2,1} & y_{2,2} \\ y_{3,1} & y_{3,2} \\ \end{pmatrix} \begin{pmatrix} \color{red}{x_{1,1}} \\ \color{red}{x_{1,2}} \\ \end{pmatrix} = \begin{pmatrix} y_{2,1} & y_{3,1} \\ y_{2,2} & y_{3,2} \\ \end{pmatrix} \begin{pmatrix} \color{blue}{r_{1,2}} \\ \color{blue}{r_{1,3}} \\ \end{pmatrix}\)

This is exactly \(\left({\bf{Y}}_{I_1}^{\text T}{\bf{Y}}_{I_1}\right){\bf{x}}_1={\bf{Y}}_{I_1}^{\text T}{\bf{R}}_{1,I_1}^{\text T}\) with \(I_1=\{2,3\}\) (the regularization term is omitted in this small example), so the conclusion derived above is easy to verify.

To sum up, the entire ALS algorithm consists of just two steps involving two loops, as shown in the following figure:

[Figure: ALS pseudocode, alternately fixing Y to solve each x_u and fixing X to solve each y_i, until convergence]

The algorithm uses RMSE (root-mean-square error) to evaluate the error.

\(\text{rmse} = \sqrt{\frac{1}{|K|}\sum\limits_{(u,i) \in K}(r_{u,i}-{\bf{x}}_u^{\text T}{\bf{y}}_i)^2}\)

When the change in RMSE becomes very small, or the maximum number of iterations is reached, the convergence condition is satisfied and the iteration stops.

"Talk is cheap. Show me the code." As a small exercise, we give a Python implementation of the above pseudocode.

import numpy as np
from scipy.linalg import solve as linear_solve
 
# Rating matrix, 5 x 6
R = np.array([[4, 0, 2, 5, 0, 0], [3, 2, 1, 0, 0, 3], [0, 2, 0, 3, 0, 4], [0, 3, 3, 5, 4, 0], [5, 0, 3, 4, 0, 0]])
 
m = 5          # number of users
n = 6          # number of items
k = 3          # dimension of the latent vectors
_lambda = 0.01 # regularization coefficient
 
# Randomly initialize the user matrix and the item matrix
X = np.random.rand(m, k)
Y = np.random.rand(n, k)
 
# Items rated by each user (1-based ids), i.e. I_u
X_idx_dict = {1: [1, 3, 4], 2: [1, 2, 3, 6], 3: [2, 4, 6], 4: [2, 3, 4, 5], 5: [1, 3, 4]}
 
# Users who rated each item (1-based ids), i.e. U_i
Y_idx_dict = {1: [1, 2, 5], 2: [2, 3, 4], 3: [1, 2, 4, 5], 4: [1, 3, 4, 5], 5: [4], 6: [2, 3]}

# Run 10 iterations
for iter in range(10):
    # Fix Y and solve each user's vector x_u from the normal equations
    for u in range(1, m+1):
        Iu = np.array(X_idx_dict[u])
        YIu = Y[Iu-1]            # submatrix Y_{I_u}
        YIuT = YIu.T
        RuIu = R[u-1, Iu-1]      # sub-row vector R_{u,I_u}
        xu = linear_solve(YIuT.dot(YIu) + _lambda * np.eye(k), YIuT.dot(RuIu))
        X[u-1] = xu
 
    # Fix X and solve each item's vector y_i
    for i in range(1, n+1):
        Ui = np.array(Y_idx_dict[i])
        XUi = X[Ui-1]            # submatrix X_{U_i}
        XUiT = XUi.T
        RiUi = R.T[i-1, Ui-1]    # sub-column vector R_{i,U_i}
        yi = linear_solve(XUiT.dot(XUi) + _lambda * np.eye(k), XUiT.dot(RiUi))
        Y[i-1] = yi
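
To monitor convergence one can compute the RMSE over the known entries after each sweep; a minimal version using the variables defined in the sketch above:

# RMSE over the known ratings, using R, X, Y and X_idx_dict from the code above
import math

se, cnt = 0.0, 0
for u, items in X_idx_dict.items():
    for i in items:
        err = R[u-1, i-1] - X[u-1].dot(Y[i-1])
        se += err ** 2
        cnt += 1
print(math.sqrt(se / cnt))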

Finally, we print the user matrix, the item matrix, and the predicted rating matrix as follows. You can see that the predicted rating matrix is very close to the original rating matrix.

 # X
array([[1.30678487, 2.03300876, 3.70447639],
       [4.96150381, 1.03500693, 1.62261161],
       [6.37691007, 2.4290095 , 1.03465981],
       [0.41680155, 3.31805612, 3.24755801],
       [1.26803845, 3.57580564, 2.08450113]])
# Y
array([[ 0.24891282,  1.07434519,  0.40258993],
       [ 0.12832662,  0.17923216,  0.72376732],
       [-0.00149517,  0.77412863,  0.12191856],
       [ 0.12398438,  0.46163336,  1.05188691],
       [ 0.07668894,  0.61050204,  0.59753081],
       [ 0.53437855,  0.20862131,  0.08185176]])
 
# X.dot(Y.T): predicted rating matrix
array([[4.00081359, 3.2132548 , 2.02350084, 4.9972158 , 3.55491072, 1.42566466],
       [3.00018371, 1.99659282, 0.99163666, 2.79974661, 1.98192672, 3.00005934],
       [4.61343295, 2.00253692, 1.99697545, 3.00029418, 2.59019481, 3.99911584],
       [4.97591903, 2.99866546, 2.96391664, 4.99946603, 3.99816006, 1.18076534],
       [4.99647978, 2.31231627, 3.02037696, 4.0005876 , 3.5258348 , 1.59422188]])
 
# Original rating matrix
array([[4,          0,           2,         5,          0,          0],
       [3,          2,           1,         0,          0,          3],
       [0,          2,           0,         3,          0,          4],
       [0,          3,           3,         5,          4,          0],
       [5,          0,           3,         4,          0,          0]])

3. Spark ALS application

Spark's internal implementation is not exactly the algorithm listed above, but the core principle is the same: Spark implements a distributed version of the pseudocode above; for the specific algorithm, see Large-scale Parallel Collaborative Filtering for the Netflix Prize. Also, looking at Spark's official documentation, we notice that the penalty term used by Spark differs slightly from the one we used above:

\(\lambda \left( \sum\limits_{u}{\color{red}{n_u}}\,{\bf{x}}_u^{\text T}{\bf{x}}_u+\sum\limits_{i}{\color{red}{n_i}}\,{\bf{y}}_i^{\text T}{\bf{y}}_i\right)\)

where \(n_u\) and \(n_i\) denote the number of items rated by user \(u\) and the number of users who rated item \(i\), respectively, i.e.

\(\begin{cases} n_u = |I_u| \\ n_i = |U_i| \\ \end{cases}\)

This section uses two cases to illustrate how to use Spark ALS and how it is applied in real Internet-scale engineering scenarios.

3.1 Demo case

Taking the data given in Section 1 as an example, the triples (User, Item, Rating) are organized into als-demo-data.csv; the demo dataset involves 5 users and 6 items.

 userId,itemId,rating
1,1,4
1,3,2
1,4,5
2,1,3
2,2,2
2,3,1
2,6,3
3,2,2
3,4,3
3,6,4
4,2,3
4,3,3
4,4,5
4,5,4
5,1,5
5,3,3
5,4,4

Using Spark's ALS class is very simple: just feed the (User, Item, Rating) triples into the model for training.

 import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS
  
val spark = SparkSession.builder().appName("als-demo").master("local[*]").getOrCreate()
 
val rating = spark.read
  .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
  .csv("./data/als-demo-data.csv")
 
// Show the first 5 rating records
rating.show(5)
 
val als = new ALS()          
  .setMaxIter(10)             // number of ALS iterations
  .setRank(3)                 // dimension of the latent vectors
  .setRegParam(0.01)          // regularization coefficient
  .setUserCol("userId")       // user_id
  .setItemCol("itemId")       // item_id
  .setRatingCol("rating")     // 评分列
 
val model = als.fit(rating)   // train the model
 
// Print the user factor vectors and item factor vectors
model.userFactors.show(truncate = false)
model.itemFactors.show(truncate = false)
 
// Recommend 2 items for every user
model.recommendForAllUsers(2).show()

The output of the above code in the console is as follows:

 +------+------+------+
|userId|itemId|rating|
+------+------+------+
|     1|     1|     4|
|     1|     3|     2|
|     1|     4|     5|
|     2|     1|     3|
|     2|     2|     2|
+------+------+------+
only showing top 5 rows
 
+---+------------------------------------+
|id |features                            |
+---+------------------------------------+
|1  |[-0.17339179, 1.3144133, 0.04453602]|
|2  |[-0.3189066, 1.0291641, 0.12700711] |
|3  |[-0.6425665, 1.2283803, 0.26179287] |
|4  |[0.5160747, 0.81320006, -0.57953185]|
|5  |[0.645193, 0.26639006, 0.68648624]  |
+---+------------------------------------+
 
+---+-----------------------------------+
|id |features                           |
+---+-----------------------------------+
|1  |[2.609607, 3.2668495, 3.554771]    |
|2  |[0.85432494, 2.3137972, -1.1198239]|
|3  |[3.280517, 1.9563107, 0.51483333]  |
|4  |[3.7446978, 4.259611, 0.6640027]   |
|5  |[1.6036265, 2.5602736, -1.8897828] |
|6  |[-1.2651576, 2.4723763, 0.51556784]|
+---+-----------------------------------+
 
+------+--------------------------------+
|userId|recommendations                 |
+------+--------------------------------+
|1     |[[4, 4.9791617], [1, 3.9998217]]|   // pairs of item id and predicted rating
|2     |[[4, 3.273963], [6, 3.0134287]] |
|3     |[[6, 3.9849386], [1, 3.2667015]]|
|4     |[[4, 5.011649], [5, 4.004795]]  |
|5     |[[1, 4.994258], [4, 4.0065994]] |
+------+--------------------------------+

We use numpy to validate the Spark results and Excel to visualize the rating matrices.

 import numpy as np
 
X = np.array([[-0.17339179, 1.3144133, 0.04453602],
              [-0.3189066, 1.0291641, 0.12700711],
              [-0.6425665, 1.2283803, 0.26179287],
              [0.5160747, 0.81320006, -0.57953185],
              [0.645193, 0.26639006, 0.68648624]])
 
Y = np.array([[2.609607, 3.2668495, 3.554771],
              [0.85432494, 2.3137972, -1.1198239],
              [3.280517, 1.9563107, 0.51483333],
              [3.7446978, 4.259611, 0.6640027],
              [1.6036265, 2.5602736, -1.8897828],
              [-1.2651576, 2.4723763, 0.51556784]])
 
R_predict = X.dot(Y.T)
R_predict

The predicted rating matrix output is as follows:

 array([[3.99982136, 2.84328038, 2.02551472, 4.97916153, 3.0030386,  3.49205357],
       [2.98138452, 1.96660155, 1.03257371, 3.27396294, 1.88351875, 3.01342882],
       [3.26670123, 2.0001004 , 0.42992289, 3.00003605, 1.61982132, 3.98493822],
       [1.94325135, 2.97144913, 2.98550149, 5.011649  , 4.00479503, 1.05883274],
       [4.99425778, 0.39883335, 2.99113433, 4.00659955, 0.41937014, 0.19627587]])

From the rating matrices visualized in Excel, we can observe that the predicted rating matrix is very close to the original one. Taking user-3 as an example, the items recommended by Spark are item-6 and item-1, [[6, 3.9849386], [1, 3.2667015]], which matches the predicted rating matrix shown in Excel.

Note that, judging from the result of Spark's recommendForAllUsers(), Spark does not remove items the user has already purchased.

[Figure: predicted vs. original rating matrix visualized in Excel]
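
If that behavior is not desired, the already-rated items can be masked out before taking the Top-N. A minimal numpy sketch on the predicted matrix above (R_predict as computed in the previous snippet; R is the original demo rating matrix):

import numpy as np

R = np.array([[4, 0, 2, 5, 0, 0], [3, 2, 1, 0, 0, 3], [0, 2, 0, 3, 0, 4],
              [0, 3, 3, 5, 4, 0], [5, 0, 3, 4, 0, 0]])   # original demo ratings; 0 = unrated
masked = np.where(R > 0, -np.inf, R_predict)             # ignore already-purchased items
top2 = np.argsort(-masked, axis=1)[:, :2] + 1            # 1-based item ids
print(top2)  # e.g. user-3's row becomes items 1 and 5, since item-6 was already rated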

3.2 Engineering application

In Internet scenarios, the number of users \(m\) (tens of millions to hundreds of millions) and the number of items \(n\) (hundreds of thousands to millions) are both very large, and the app's event-tracking data is generally stored in HDFS. Taking a long-form video scenario as an example, the user event data is eventually aggregated into the user behavior table t_user_behavior.

The behavior table contains the user's imei and the item's content_id, but no explicit user rating. In practice, our solution is to combine the user's other behaviors, with weights, into a rating for the item, i.e.

rating = w1 * play_time (viewing time) + w2 * finsh_play_cnt (number of completed plays) + w3 * praise_cnt (number of likes) + w4 * share_cnt (number of shares) + other indicators suited to your business logic

where \(w_i\) is the weight of the corresponding indicator.


The following code block demonstrates the recommendation workflow for a large-scale user and item scenario in engineering practice.

 import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
 
// Load data from Hive and compute each user's rating for each item with the weighting formula
val rating_df = spark.sql("select imei, content_id, <weighted-rating formula> as rating from t_user_behavior group by imei, content_id")
 
// Convert imei and content_id to integer indices; Spark ALS requires userId and itemId to be numeric
// using org.apache.spark.ml.feature.StringIndexer
val imeiIndexer    = new StringIndexer().setInputCol("imei").setOutputCol("userId").fit(rating_df)
val contentIndexer = new StringIndexer().setInputCol("content_id").setOutputCol("itemId").fit(rating_df)
val ratings = contentIndexer.transform(imeiIndexer.transform(rating_df))
 
// Other code, same as in the demo above
val model = als.fit(ratings)
 
// Recommend 100 items for each user
val _userRecs = model.recommendForAllUsers(100)
 
// Convert userId and itemId back to the original imei and content_id
val imeiConverter    = new IndexToString().setInputCol("userId").setOutputCol("imei").setLabels(imeiIndexer.labels)
val contentConverter = new IndexToString().setInputCol("itemId").setOutputCol("content_id").setLabels(contentIndexer.labels)
val userRecs = imeiConverter.transform(_userRecs)
 
// Save offline for online serving
userRecs.foreachPartition {
  // contentConverter converts itemId back to content_id
  // Redis-saving logic goes here
}

It is worth noting that there is another approach to the engineering scenario above, namely implicit feedback. In real scenarios users rarely rate items explicitly, but a large number of user behaviors can indirectly reflect their preferences, such as purchase records, search keywords, adding items to the shopping cart, or playing the same song on repeat. We call these indirect behaviors implicit feedback, to distinguish them from the explicit feedback that corresponds to ratings. Hu Yifan et al. proposed a matrix factorization model for implicit feedback scenarios in the paper Collaborative Filtering for Implicit Feedback Datasets, and Spark also implements this model officially; we will introduce it in a future article.

4. Summary

Starting from recommendation scenarios, this article introduced the classic collaborative filtering algorithms, explained the matrix factorization algorithm (the only one Spark implements and maintains), derived in detail the theory of matrix factorization under explicit feedback, and gave a single-machine Python implementation so that readers can better understand the algorithm. Finally, we explained the use of Spark ALS through a demo and an engineering-practice example, giving readers who have never touched recommendation algorithms an intuitive understanding of matrix factorization from both theoretical and practical perspectives.

References:

  1. Wang Zhe. Deep Learning Recommender System.
  2. Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative filtering for implicit feedback datasets." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
  3. Zhou, Yunhong, et al. "Large-scale parallel collaborative filtering for the Netflix prize." International conference on algorithmic applications in management. Springer, Berlin, Heidelberg, 2008.
