I was interested in learning how to put together a recommender system for fun and practice. Since the Alternating-Least-Squares with Weighted--Regularization (ALS-WR) algorithm seems to be a popular algorithm for recommender systems, I decided to give it a shot. It was developed for the Netflix Prize competition, which also involved a sparse matrix of reviewers by items being reviewed.
While searching for resources on the ALS-WR algorithm, I came across an excellent tutorial (whose link is now broken) that walks you through the theory and how to implement the algorithm using python on a small dataset of movie reviews. It even provided a link to download a Jupyter Notebook that you can run and see the algorithm in action. Having this notebook to toy around with was extremely helpful in familiarizing myself with the algorithm. However, as I compared the code in the notebook to the math in the blog post and in the original paper, it seemed like it wasn’t taking full advantage of the sparsity of the ratings matrix , which is a key feature of this type of problem. By slightly changing a couple lines in this code, I was able to dramatically reduce the computation time by taking advantage of the sparsity.
I won’t walk through all the details because the notebok already does that really well, but I’ll give enough background to explain the change I made and why it speeds up the computation.
We start with a matrix of size where each row represents one of the users and each column represents one of the movies. Most of the matrix contains 0’s since most users only review a small subset of the available movies. The dataset used in the tutorial contains only about 6% nonzero values. We want to generate a low-rank approximation for such that , where is size and is size , as shown below (image borrowed from the tutorial):
The columns of the resulting matrices and turn out to contain columns with latent features about the users and movies, respectively. The and matrices are calculated iteratively, by fixing one and solving for the other, then repeating while alternating which one is fixed. As a side note, in case you want to look at the paper, the notation is a little different. They use and instead of and , and and instead of and . I’ll stick with the tutorial notation in this post.
The equations for solving for and are quite similar, so let’s just look at the equation for . In each iteration, the column for each user in is generated with the following equation:
Here, is the identity matrix, is the number of movies rated by user , and is the set of all movies rated by user . That in and means we are selecting only the columns for movies rated by user , and the way that selection is made makes all the difference.
In the tutorial, the key lines to generate each look like this:
Ai = np.dot(Q, np.dot(np.diag(Ii), Q.T)) + lmbda * nui * E Vi = np.dot(Q, np.dot(np.diag(Ii), R[i].T)) P[:,i] = np.linalg.solve(Ai,Vi)
Notice that in the equation for , the way it removes columns for movies that weren’t reviewed by user is creating a matrix with the elements of along the diagonal, then doing a matrix multiplication between that and , which zeroes out columns of for movies user did not review. This matrix multiplication is an expensive operation that (naively) has a complexity of (although probably a bit better with the numpy implementation). A similar operation is done in the calculation. Even though this is not as expensive (complexity of ), that’s still an operation we’d like to avoid if possible.
On reading the equations and Matlab algorithm implementation in the original paper, I noticed that rather than zeroing out unwanted columns, they actually remove those columns by creating a submatrix of and a subvector of . This does 2 important things: First, it lets us remove that inner matrix multiplications. Second, it dramatically reduces the cost of the remaining matrix multiplications. Since we have a density of only about 6% in our matrix, the cost of both and should theoretically be reduced to about 6% of their original costs, since the complexities of those operations ( and ) are both linearly dependent on . Here’s the code that replaces the 3 lines shown above:
# Get array of nonzero indices in row Ii Ii_nonzero = np.nonzero(Ii) # Select subset of Q associated with movies reviewed by user i Q_Ii = Q[:, Ii_nonzero] # Select subset of row R_i associated with movies reviewed by user i R_Ii = R[i, Ii_nonzero] Ai = np.dot(Q_Ii, Q_Ii.T) + lmbda * nui * E Vi = np.dot(Q_Ii, R_Ii.T) P[:, i] = np.linalg.solve(Ai, Vi)
By making that replacement and a similar one for the equations for , a series of 15 iterations went from taking 15-16 minutes down to about 13 seconds: a ~70-fold speedup! Check out the notebook with my updates on GitHub, or clone the whole repo to run it yourself.
The moral of the story here is that sometimes things that don’t seem like a big deal at first glance can make huge changes in the performance of your algorithms. This exercise reinforced in my mind the value of spending a little extra time to make sure you understand the algorithm or tool you’re using. And more specifically, if you have a sparse dataset, make that sparsity work for you.