Matrix Factorization


In matrix factorization, the matrix $R$ is modeled as the product of two lower-rank matrices:

\[R\approx PQ^T\]

where $P\in\mathbb{R}^{|U|\times k}$ and $Q\in\mathbb{R}^{|V|\times k}$, and $k$ is the number of factors. Two matrix factorization models are introduced in this section. An element of $R$ can be represented by

\[\hat r_{ij}=\mu+b_i+b_j+p_i^Tq_j\]

$\mu$ is the global mean, $b_i$ is the user bias, and $b_j$ is the item bias. $p_i$ is the user factor, the $i$-th row of $P$, and $q_j$ is the item factor, the $j$-th row of $Q$.

The squared-error loss function with L2 regularization for a single element is

\[\mathcal L=(\hat r_{ij}- r_{ij})^2+\lambda\left(b_i^2+b_j^2+\|p_i\|^2+\|q_j\|^2\right)\]

Since most elements of $R$ are missing, stochastic gradient descent is used to estimate the parameters \cite{funk2006netflix}. Given a training sample $\langle i,j,r_{ij}\rangle$ and a learning rate $\eta$, the parameters are updated by

\[\begin{split}\mu&\leftarrow \mu-\eta(\hat r_{ij}-r_{ij})\\ b_i&\leftarrow b_i-\eta\left((\hat r_{ij}-r_{ij})+\lambda b_i\right)\\ b_j&\leftarrow b_j-\eta\left((\hat r_{ij}-r_{ij})+\lambda b_j\right)\\ p_i&\leftarrow p_i-\eta\left((\hat r_{ij}-r_{ij})q_j+\lambda p_i\right)\\ q_j&\leftarrow q_j-\eta\left((\hat r_{ij}-r_{ij})p_i+\lambda q_j\right)\end{split}\]
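As a concrete illustration, one SGD step can be sketched in plain Python. The function name, the factor layout, and the default learning rate and regularization values are illustrative, and the global mean is held fixed for simplicity:

```python
def sgd_step(mu, b_u, b_v, P, Q, i, j, r_ij, eta=0.005, lam=0.02):
    """One stochastic gradient step on a single sample <i, j, r_ij>.

    mu is the global mean (held fixed here for simplicity); b_u and b_v
    are the user/item bias lists; P and Q are lists of factor vectors.
    eta (learning rate) and lam (L2 weight) are illustrative defaults.
    """
    pred = mu + b_u[i] + b_v[j] + sum(p * q for p, q in zip(P[i], Q[j]))
    e = pred - r_ij                       # prediction error (r_hat - r)
    b_u[i] -= eta * (e + lam * b_u[i])
    b_v[j] -= eta * (e + lam * b_v[j])
    p_old = list(P[i])                    # cache p_i before it is updated
    for f in range(len(P[i])):
        P[i][f] -= eta * (e * Q[j][f] + lam * P[i][f])
        Q[j][f] -= eta * (e * p_old[f] + lam * Q[j][f])
    return e
```

Repeated calls on observed samples drive the error toward zero; in practice one sweeps over all observed ratings in each epoch.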




In the ranking problem, predicting the exact score is not actually important. The BPR algorithm therefore optimizes the model with a pairwise loss that is a smooth analogue of AUC. For user \(u\) and items \(i\) and \(j\), the optimization objective is


\[\begin{split}\text{BPR-OPT}&=\ln p(\Theta|\hat r_{ui}>\hat r_{uj})\\
&=\ln p(\hat r_{ui}>\hat r_{uj} | \Theta)p(\Theta)\\
&=\ln \sigma(\hat r_{uij}) +\ln p(\Theta)\\
&=\ln \sigma(\hat r_{uij})-\lambda \|\Theta\|^2\end{split}\]

where \(\hat r_{uij}=\hat r_{ui}-\hat r_{uj}\) is produced by \(\hat r_{ui}=w_u^Th_i=\sum_fw_{uf}h_{if}\). The parameters of BPR are then updated as follows:

step 1: Sampling a triple \((u,i,j)\) from the training set, where \(i\) is a positive sample (an item that user \(u\) has interacted with) and \(j\) is a negative sample (an item that user \(u\) has not interacted with)

step 2: Updating parameters, \(\Theta\leftarrow\Theta+\alpha\left(\left(1-\sigma(\hat r_{uij})\right)\frac{\partial \hat r_{uij}}{\partial \Theta}-\lambda\Theta\right)\)

Repeat the two steps above until the algorithm converges. The gradient of \(\hat r_{uij}\) with respect to each parameter is


\[\frac{\partial \hat r_{uij}}{\partial \theta}=\begin{cases}
(h_{if}-h_{jf})&\text{if }\theta=w_{uf},\\
w_{uf}&\text{if }\theta=h_{if},\\
-w_{uf}&\text{if }\theta=h_{jf},\\
0&\text{otherwise}
\end{cases}\]
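The sampling and update steps above can be sketched in plain Python. The function name, the factor layout, and the hyperparameter defaults are illustrative:

```python
import math

def bpr_step(W, H, u, i, j, alpha=0.01, lam=0.01):
    """One BPR SGD step on a sampled triple (u, i, j): i positive, j negative.

    W holds the user factor vectors and H the item factor vectors
    (as lists of lists); alpha and lam are illustrative defaults.
    """
    r_uij = sum(w * (hi - hj) for w, hi, hj in zip(W[u], H[i], H[j]))
    g = 1.0 - 1.0 / (1.0 + math.exp(-r_uij))   # 1 - sigma(r_uij)
    for f in range(len(W[u])):
        w, hi, hj = W[u][f], H[i][f], H[j][f]  # cache pre-update values
        W[u][f] += alpha * (g * (hi - hj) - lam * w)
        H[i][f] += alpha * (g * w - lam * hi)
        H[j][f] += alpha * (-g * w - lam * hj)
    return r_uij
```

Repeated updates on sampled triples push \(\hat r_{ui}\) above \(\hat r_{uj}\), i.e. the positive item is ranked above the negative one for user \(u\).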


There are scenarios where the specific ratings of users cannot be obtained, but confidence-like information is available, such as game duration or viewing duration. Although such information does not directly reflect a user's preference, it indicates how likely the user is to like the item. In this scenario, a user-item record is expressed as a confidence level \(r_{ui}\) together with a 0-1 indicator \(p_{ui}\) of whether the user-item interaction exists; if the interaction does not exist, the confidence level is 0. ‘Weighted’ means that the loss of each record is weighted according to its confidence. The optimization objective is

\[\min_{X,Y}\sum_{u,i}c_{ui}\left(p_{ui}-x_u^Ty_i\right)^2+\lambda\left(\sum_u\|x_u\|^2+\sum_i\|y_i\|^2\right)\]


The weights \(c_{ui}=1+\alpha\log(1+r_{ui}/\epsilon)\) are computed from the confidence levels. Because interactions that did not occur also appear in the loss function, conventional stochastic gradient descent has performance problems; therefore, alternating least squares (ALS) is used to optimize the model. The training process is as follows:

step 1: Update each user’s vector, \(x_u=(Y^TC^uY+\lambda I)^{-1}Y^TC^up(u)\)

step 2: Update each item’s vector, \(y_i=(X^TC^iX+\lambda I)^{-1}X^TC^ip(i)\)

where \(Y\in\mathbb R^{m\times f}\) and \(X\in\mathbb R^{n\times f}\) are the matrices consisting of the \(m\) item vectors and the \(n\) user vectors respectively, and \(f\) is the length of each vector. \(C^u\in\mathbb R^{m\times m}\) is diagonal, with \(C^u_{i,i}=c_{ui}\) and zeros elsewhere; \(C^i\in\mathbb R^{n\times n}\) is defined similarly. Meanwhile, \(p(u)=[p_{u1},\dots,p_{um}]^T\) and \(p(i)=[p_{1i},\dots,p_{ni}]^T\). Because \(C^u\) and \(C^i\) are huge, the product is decomposed as \(Y^TC^uY=Y^TY+Y^T(C^u-I)Y\); since most entries of \(C^u-I\) are 0, only the non-zero part needs to be computed for each user, while \(Y^TY\) is computed once and shared.
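A rough sketch of the user-vector update in step 1, in plain Python (not an optimized implementation): the Gaussian-elimination solver, the helper names, and the parameter values \(\alpha\), \(\lambda\), \(\epsilon\) below are illustrative, and for clarity \(Y^TC^uY\) is formed directly rather than through the \(Y^TY+Y^T(C^u-I)Y\) decomposition:

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            t = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= t * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def update_user(Y, r_u, lam=0.1, alpha=40.0, eps=1.0):
    """Closed-form update x_u = (Y^T C^u Y + lam*I)^{-1} Y^T C^u p(u).

    Y: list of m item vectors of length f; r_u: raw confidence per item
    (0 when there is no interaction). lam, alpha, eps are illustrative.
    """
    m, f = len(Y), len(Y[0])
    c = [1.0 + alpha * math.log(1.0 + r / eps) for r in r_u]  # c_ui
    p = [1.0 if r > 0 else 0.0 for r in r_u]                  # p_ui
    A = [[sum(c[k] * Y[k][a] * Y[k][b] for k in range(m))
          + (lam if a == b else 0.0) for b in range(f)] for a in range(f)]
    rhs = [sum(c[k] * Y[k][a] * p[k] for k in range(m)) for a in range(f)]
    return solve(A, rhs)
```

The item-vector update in step 2 is symmetric, with the roles of \(X\) and \(Y\) exchanged.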