This is the third post in a project on collaborative filtering based on the MovieLens 100K dataset. The remainder of this post comes straight from a Jupyter Notebook file you can download here. You can also see it here on GitHub.


Now that we've established some simple baseline models and demonstrated that the Damped User + Movie Baseline model is the best of the few we tested, let's move on to some actual collaborative filtering models. Here, we'll explore user-based and item-based collaborative filtering.

Item-Based vs User-Based

Image found on www.selemmarafi.com

The idea of these methods is simply to predict unseen ratings by looking at how similar users rated a particular item, or by looking at how similar items were rated by a particular user. Both methods fall under the category of K-Nearest Neighbor (KNN) models, since ratings from the K most similar users or items are combined for the prediction.

Below, I've implemented a class called KNNRecommender that can accept a mode parameter of either 'user' or 'item'. Let's see how it compares to our best baseline!
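To make the idea concrete, here is a minimal sketch of a similarity-weighted KNN prediction for a single rating. It is not the notebook's KNNRecommender: the function name knn_predict, the dense matrix layout where 0 means "not rated", and the choice of cosine similarity (with unrated entries treated as zeros) are all assumptions made for illustration.

```python
import numpy as np


def knn_predict(R, user, item, k=2, mode="item"):
    """Predict R[user, item] from the k most similar neighbors.

    R is a dense (n_users, n_items) array where 0 means "not rated".
    Hypothetical helper for illustration, not the notebook's class.
    """
    if mode == "user":
        # User-based CF is item-based CF applied to the transposed matrix.
        R, user, item = R.T, item, user
    elif mode != "item":
        raise ValueError("mode must be 'user' or 'item'")

    # Columns of R are the neighbor candidates for column `item`.
    target = R[:, item]
    norms = np.linalg.norm(R, axis=0) * np.linalg.norm(target)
    sims = (R.T @ target) / np.where(norms == 0, 1.0, norms)  # cosine similarity

    # Only columns with a rating in row `user` can contribute.
    rated = np.flatnonzero(R[user] > 0)
    rated = rated[rated != item]
    if rated.size == 0:
        return float(R[R > 0].mean())  # fall back to the global mean

    # Keep the k most similar candidates and average their ratings.
    top = rated[np.argsort(sims[rated])[::-1][:k]]
    weights = np.clip(sims[top], 0.0, None)
    if weights.sum() == 0:
        return float(R[user, top].mean())
    return float(weights @ R[user, top] / weights.sum())


# Toy 3 users x 3 movies matrix; user 0 has not rated movie 2.
R = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 2, 5],
], dtype=float)

item_based = knn_predict(R, user=0, item=2, k=1, mode="item")
user_based = knn_predict(R, user=0, item=2, k=1, mode="user")
print(item_based, user_based)
```

The only difference between the two modes in this sketch is the transpose: in 'item' mode the neighbors are other movies the user has rated, and in 'user' mode they are other users who rated the movie.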

1. Import necessary modules and classes

In [1]:

2. Load the Data

Let's load and examine the ratings data. If you're following along (i.e. actually running these notebooks) you'll need to make sure to run the first one to download the data before running this one.

In [2]:
First 5:
       userId  movieId  rating            timestamp
214       259      255       4  1997-09-19 23:05:10
83965     259      286       4  1997-09-19 23:05:27
43027     259      298       4  1997-09-19 23:05:54
21396     259      185       4  1997-09-19 23:06:21
82655     259      173       4  1997-09-19 23:07:23
Last 5:
       userId  movieId  rating            timestamp
46773     729      689       4  1998-04-22 19:10:38
73008     729      313       3  1998-04-22 19:10:38
46574     729      328       3  1998-04-22 19:10:38
64312     729      748       4  1998-04-22 19:10:38
79208     729      272       4  1998-04-22 19:10:38
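
A table like the one above can be produced with a few lines of pandas. This is a sketch rather than the notebook's actual loading code: it assumes the ratings come as tab-separated rows of user id, movie id, rating, and Unix timestamp, and it reads three sample rows from an in-memory buffer instead of a file, since the file path is not given in this post.

```python
import io

import pandas as pd

# Sample rows in the assumed tab-separated layout:
# user id, movie id, rating, Unix timestamp (seconds).
# Replace this buffer with the path to the downloaded ratings file.
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

ratings = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["userId", "movieId", "rating", "timestamp"],
)

# Convert Unix seconds to datetimes and order the ratings chronologically.
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")
ratings = ratings.sort_values("timestamp")

print("First 5:")
print(ratings.head())
```

Sorting by timestamp keeps the original row index, which is why the index column in the output is not in ascending order.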

3. Write our KNNRecommender class

In [3]:

4. Determine Optimal $k$ Values

In [4]:
In [5]:
Out[5]:
   k  mode  i_fold
0  1  user       0
1  1  user       1
2  1  user       2
3  1  user       3
4  1  user       4
5  1  item       0
6  1  item       1
7  1  item       2
8  1  item       3
9  1  item       4
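
A grid of (k, mode, fold) combinations like the table above can be built with itertools.product. This is a sketch under the assumption that the grid is a plain Cartesian product; the list of k values is taken from the results log in this post, and the variable names are illustrative.

```python
from itertools import product

import pandas as pd

ks = [1, 2, 5, 10, 20, 50, 100, 200]  # k values that appear in the results log
modes = ["user", "item"]
n_folds = 5

# One row per (k, mode, fold) combination, with the fold index varying fastest.
grid = pd.DataFrame(
    list(product(ks, modes, range(n_folds))),
    columns=["k", "mode", "i_fold"],
)

print(grid.head(10))
print(len(grid), "combinations")
```

Each row can then be evaluated independently, which makes it straightforward to run the combinations in parallel or to resume an interrupted sweep.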
In [6]:
(100000, 4)
k=1, mode=item, i_fold=0: MAE=0.807   dt=3.55 seconds
k=1, mode=item, i_fold=1: MAE=0.812   dt=1.68 seconds
k=1, mode=item, i_fold=2: MAE=0.803   dt=1.62 seconds
k=1, mode=item, i_fold=3: MAE=0.806   dt=1.60 seconds
k=1, mode=item, i_fold=4: MAE=0.807   dt=1.81 seconds
k=1, mode=user, i_fold=0: MAE=0.827   dt=1.68 seconds
k=1, mode=user, i_fold=1: MAE=0.834   dt=1.66 seconds
k=1, mode=user, i_fold=2: MAE=0.835   dt=1.53 seconds
k=1, mode=user, i_fold=3: MAE=0.835   dt=1.26 seconds
k=1, mode=user, i_fold=4: MAE=0.833   dt=1.59 seconds
k=2, mode=item, i_fold=0: MAE=0.758   dt=1.90 seconds
k=2, mode=item, i_fold=1: MAE=0.768   dt=1.59 seconds
k=2, mode=item, i_fold=2: MAE=0.758   dt=1.52 seconds
k=2, mode=item, i_fold=3: MAE=0.759   dt=1.45 seconds
k=2, mode=item, i_fold=4: MAE=0.760   dt=1.50 seconds
k=2, mode=user, i_fold=0: MAE=0.785   dt=1.31 seconds
k=2, mode=user, i_fold=1: MAE=0.788   dt=1.32 seconds
k=2, mode=user, i_fold=2: MAE=0.788   dt=1.35 seconds
k=2, mode=user, i_fold=3: MAE=0.787   dt=1.33 seconds
k=2, mode=user, i_fold=4: MAE=0.788   dt=1.26 seconds
k=5, mode=item, i_fold=0: MAE=0.730   dt=1.41 seconds
k=5, mode=item, i_fold=1: MAE=0.741   dt=1.51 seconds
k=5, mode=item, i_fold=2: MAE=0.735   dt=1.65 seconds
k=5, mode=item, i_fold=3: MAE=0.731   dt=1.39 seconds
k=5, mode=item, i_fold=4: MAE=0.731   dt=1.67 seconds
k=5, mode=user, i_fold=0: MAE=0.753   dt=1.56 seconds
k=5, mode=user, i_fold=1: MAE=0.763   dt=1.64 seconds
k=5, mode=user, i_fold=2: MAE=0.756   dt=1.35 seconds
k=5, mode=user, i_fold=3: MAE=0.754   dt=1.31 seconds
k=5, mode=user, i_fold=4: MAE=0.755   dt=1.44 seconds
k=10, mode=item, i_fold=0: MAE=0.725   dt=1.59 seconds
k=10, mode=item, i_fold=1: MAE=0.734   dt=1.66 seconds
k=10, mode=item, i_fold=2: MAE=0.729   dt=1.76 seconds
k=10, mode=item, i_fold=3: MAE=0.723   dt=1.56 seconds
k=10, mode=item, i_fold=4: MAE=0.726   dt=1.44 seconds
k=10, mode=user, i_fold=0: MAE=0.745   dt=1.37 seconds
k=10, mode=user, i_fold=1: MAE=0.752   dt=1.38 seconds
k=10, mode=user, i_fold=2: MAE=0.746   dt=1.75 seconds
k=10, mode=user, i_fold=3: MAE=0.743   dt=1.32 seconds
k=10, mode=user, i_fold=4: MAE=0.745   dt=1.27 seconds
k=20, mode=item, i_fold=0: MAE=0.725   dt=1.49 seconds
k=20, mode=item, i_fold=1: MAE=0.735   dt=1.51 seconds
k=20, mode=item, i_fold=2: MAE=0.728   dt=1.52 seconds
k=20, mode=item, i_fold=3: MAE=0.723   dt=1.38 seconds
k=20, mode=item, i_fold=4: MAE=0.728   dt=1.71 seconds
k=20, mode=user, i_fold=0: MAE=0.741   dt=1.30 seconds
k=20, mode=user, i_fold=1: MAE=0.750   dt=1.31 seconds
k=20, mode=user, i_fold=2: MAE=0.743   dt=1.31 seconds
k=20, mode=user, i_fold=3: MAE=0.738   dt=1.30 seconds
k=20, mode=user, i_fold=4: MAE=0.742   dt=1.34 seconds
k=50, mode=item, i_fold=0: MAE=0.729   dt=1.48 seconds
k=50, mode=item, i_fold=1: MAE=0.740   dt=1.50 seconds
k=50, mode=item, i_fold=2: MAE=0.733   dt=1.48 seconds
k=50, mode=item, i_fold=3: MAE=0.728   dt=1.41 seconds
k=50, mode=item, i_fold=4: MAE=0.733   dt=1.43 seconds
k=50, mode=user, i_fold=0: MAE=0.742   dt=1.30 seconds
k=50, mode=user, i_fold=1: MAE=0.750   dt=1.34 seconds
k=50, mode=user, i_fold=2: MAE=0.743   dt=1.35 seconds
k=50, mode=user, i_fold=3: MAE=0.738   dt=1.34 seconds
k=50, mode=user, i_fold=4: MAE=0.743   dt=1.36 seconds
k=100, mode=item, i_fold=0: MAE=0.735   dt=1.50 seconds
k=100, mode=item, i_fold=1: MAE=0.746   dt=1.50 seconds
k=100, mode=item, i_fold=2: MAE=0.738   dt=1.79 seconds
k=100, mode=item, i_fold=3: MAE=0.733   dt=1.59 seconds
k=100, mode=item, i_fold=4: MAE=0.739   dt=1.51 seconds
k=100, mode=user, i_fold=0: MAE=0.745   dt=1.36 seconds
k=100, mode=user, i_fold=1: MAE=0.753   dt=1.31 seconds
k=100, mode=user, i_fold=2: MAE=0.746   dt=1.31 seconds
k=100, mode=user, i_fold=3: MAE=0.741   dt=1.46 seconds
k=100, mode=user, i_fold=4: MAE=0.747   dt=1.58 seconds
k=200, mode=item, i_fold=0: MAE=0.741   dt=2.06 seconds
k=200, mode=item, i_fold=1: MAE=0.751   dt=1.55 seconds
k=200, mode=item, i_fold=2: MAE=0.743   dt=1.55 seconds
k=200, mode=item, i_fold=3: MAE=0.739   dt=1.54 seconds
k=200, mode=item, i_fold=4: MAE=0.744   dt=1.73 seconds
k=200, mode=user, i_fold=0: MAE=0.748   dt=1.43 seconds
k=200, mode=user, i_fold=1: MAE=0.757   dt=1.43 seconds
k=200, mode=user, i_fold=2: MAE=0.749   dt=1.39 seconds
k=200, mode=user, i_fold=3: MAE=0.744   dt=1.34 seconds
k=200, mode=user, i_fold=4: MAE=0.751   dt=1.35 seconds
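
Log lines like those above could come from a cross-validation loop of roughly the following shape, where the mean absolute error is $\mathrm{MAE} = \frac{1}{N}\sum_{i}|r_i - \hat{r}_i|$ over the $N$ held-out ratings. This is only a sketch: the helpers mae and kfold_indices are hypothetical, the ratings are random toy values, and a trivial "predict the training mean" model stands in for the recommender.

```python
import time

import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def kfold_indices(n_rows, n_folds=5, seed=0):
    """Shuffle row indices and split them into n_folds test sets."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_rows), n_folds)


# Toy data: 100 random ratings between 1 and 5.
rng = np.random.default_rng(42)
ratings = rng.integers(1, 6, size=100).astype(float)

folds = kfold_indices(len(ratings))
fold_maes = []
for i_fold, test_idx in enumerate(folds):
    t0 = time.time()
    train_mask = np.ones(len(ratings), dtype=bool)
    train_mask[test_idx] = False

    # Placeholder model: predict the training-set mean for every test rating.
    pred = np.full(len(test_idx), ratings[train_mask].mean())

    fold_mae = mae(ratings[test_idx], pred)
    fold_maes.append(fold_mae)
    dt = time.time() - t0
    print(f"i_fold={i_fold}: MAE={fold_mae:.3f}   dt={dt:.2f} seconds")
```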
In [7]:
i_fold=0: MAE=0.755
i_fold=1: MAE=0.765
i_fold=2: MAE=0.757
i_fold=3: MAE=0.752
i_fold=4: MAE=0.759
In [8]:

Here we can see that item-based collaborative filtering outperforms user-based collaborative filtering for every $k$ we tested. This is consistent with the earlier result that the Item average baseline performed better than the User average baseline. Note that the balance between users and movies depends on the dataset: MovieLens 100K has 943 users and 1,682 movies, whereas larger datasets like the MovieLens 20M Dataset have far more users than movies, so the comparison may come out differently there.
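
One way to check how ratings are spread across users and movies is to count them directly. The sketch below assumes a ratings table with userId and movieId columns like the one loaded earlier; the function name rating_density and the toy data are illustrative only.

```python
import pandas as pd


def rating_density(ratings):
    """Summarize how many ratings each user and each movie has."""
    per_user = ratings.groupby("userId").size()
    per_movie = ratings.groupby("movieId").size()
    return {
        "n_users": int(per_user.size),
        "n_movies": int(per_movie.size),
        "mean_ratings_per_user": float(per_user.mean()),
        "mean_ratings_per_movie": float(per_movie.mean()),
    }


# Toy data: 2 users and 4 movies, so each user has more ratings
# on average than each movie does.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 1, 2, 2],
    "movieId": [10, 20, 30, 40, 10, 20],
    "rating": [5, 4, 3, 4, 2, 5],
})

summary = rating_density(toy)
print(summary)
```

Because every rating belongs to exactly one user and one movie, the two means differ only through the number of distinct users and movies.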

We also see that the best Item-based CF model occurs around $k=10$ while the best User-based CF model occurs around $k=20$. We'll keep these in mind when comparing models later.
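
The best $k$ for each mode can be read off by averaging MAE across the five folds. The sketch below does this for $k = 10, 20, 50$ using the per-fold MAE values copied from the log above; the variable names are illustrative, and the other $k$ values are left out only to keep the example short.

```python
import pandas as pd

# Per-fold MAE values copied from the log for three k values per mode.
fold_mae = {
    ("item", 10): [0.725, 0.734, 0.729, 0.723, 0.726],
    ("item", 20): [0.725, 0.735, 0.728, 0.723, 0.728],
    ("item", 50): [0.729, 0.740, 0.733, 0.728, 0.733],
    ("user", 10): [0.745, 0.752, 0.746, 0.743, 0.745],
    ("user", 20): [0.741, 0.750, 0.743, 0.738, 0.742],
    ("user", 50): [0.742, 0.750, 0.743, 0.738, 0.743],
}

# Flatten into one row per (mode, k, fold).
rows = [
    {"mode": mode, "k": k, "i_fold": i_fold, "mae": value}
    for (mode, k), values in fold_mae.items()
    for i_fold, value in enumerate(values)
]
results = pd.DataFrame(rows)

# Average over folds, then pick the k with the lowest mean MAE per mode.
mean_mae = results.groupby(["mode", "k"])["mae"].mean()
best_k = {mode: int(mean_mae[mode].idxmin()) for mode in ["item", "user"]}

print(mean_mae.round(4))
print(best_k)
```

The margins are small (the mean MAE for item-based CF at $k=10$ and $k=20$ differs by less than 0.001), so "around" is the right word for both optima.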

Next, we'll look at matrix factorization methods like Alternating Least Squares, so check out the next post/notebook!

