Skip to content

Gradient descent based matrix factorization

This post followed following tutorial: The author demonstrated how a deterministic algorithm’s implementation can be achieved with a nondeterministic way of programming.

Matrix factorization is to, find out two or more matrices such that when you multiply them you will get back the original matrix. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower dimensionality rectangular matrices. One obvious application is to predict ratings in collaborative filtering.

In a recommendation system such as Netflix or MovieLens, there is a group of users and a set of items (movies for the above two systems). Given that each user has rated some items in the system, we would like to predict how the users would rate the items that they have not yet rated, such that we can make recommendations to the users. In this case, all the information we have about the existing ratings can be represented in a matrix. Each cell of the matrix presents which rating a user has given a film. An empty cell means that the user has not yet rated the movie.

Hence, the task of predicting the missing ratings can be considered as filling in the blanks such that the values would be consistent with the existing ratings in the matrix.

The intuition behind using matrix factorization to solve this problem is that there should be some latent features that determine how a user rates an item. For example, two users would give high ratings to a certain movie if they both like the actors or actresses in the movie, or if the movie is an action movie, which is a genre preferred by both users.

Hence, if we can discover these latent features, we should be able to predict a rating with respect to a certain user and a certain item, because the features associated with the user should match with the features associated with the item.

Having discussed the intuition behind matrix factorization, we can now go on to work on mathematics. Firstly, we have a set U of users and a set D of items. Let R of size |U|×|D| be the matrix that contains all the ratings that the users have assigned to the items. Also, we assume that we would like to discover K latent features. Our task, then, is to find two matrics matrices P (of size |U|×|K|) and Q (of size |D|×|K|) such that their product approximates |R|.

R \approx P x Q^T = \hat{R}

In this way, each row of P would represent the strength of the associations between a user and the features. Similarly, each row of Q would represent the strength of the associations between an item and the features. To get the prediction of a rating of an item dj by ui, we can calculate the dot product of their vectors.

\hat{r}_{ij} = p^T_{i}q_{j} = \sum_{k=0}^{K} p_{ik} q_{kj}

Now, we have to find a way to obtain P and Q. One way to approach this problem is the first initialize the two matrices with some values, calculate how different their product is to M, and then try to minimize this difference iteratively. Such a method is called gradient descent, aiming at finding a local minimum of the difference.

The difference here, usually called the error between the estimated rating and the real rating, can be calculated by the following equation for each user-item pair:

e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = (r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj})^2

Here we consider the squared error because the estimated rating can be either higher or lower than the real rating.

To minimize the error, we have to know in which direction we have to modify the values of p_{ik} and q_{kj}. In other words, we need to know the gradient at the current values, and therefore we differentiate the above equation with respect to these two variables separately:

\frac{\partial}{\partial p_{ik}}e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij})(q_{kj}) = -2 e_{ij} q_{kj}

\frac{\partial}{\partial q_{ik}}e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij})(p_{ik}) = -2 e_{ij} p_{ik}

Having obtained the gradient, we can now formulate the update rules for both p_{ik} and q_{kj}:

p_{ik} = p_{ik} + \alpha \frac{\partial}{\partial p_{ik}} e_{ik}^2 = p_{ik} + 2\alpha e_{ij}q_{kj}

q_{kj} = q_{kj} + \alpha \frac{\partial}{\partial q_{kj}} e_{ij}^2 = q_{kj} + 2\alpha e_{ij}p{ik}

A question might have come to your mind by now: if we find two matrices P and Q such that P×Q approximates R, isn’t that our predictions of all the unseen ratings will be zeros? In fact, we are not really trying to come up with P and Q such that we can reproduce R exactly. Instead, we will only try to minimise the errors of the observed user-item pairs. In other words, if we let T be a set of tuples, each of which is in the form of (u_{i},d_{j},r_{ij}), such that T contains all the observed user-item pairs together with the associated ratings, we are only trying to minimise every e_{ij} for (u_{i},d_{j},r_{ij}) \in T. (In other words, T is our set of training data.) As for the rest of the unknowns, we will be able to determine their values once the associations between the users, items and features have been learnt.

Published inMachine Learning

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *