This is the second post of a project on collaborative filtering based on the MovieLens 100K dataset. The remainder of this post is straight out of a Jupyter Notebook file you can download here. You can also see it here on GitHub.


Baseline models are important for two key reasons:

  1. They give us a starting point against which to compare all future models, and
  2. Smart baselines/averages may be needed to fill in missing data for more complicated models.

Here, we'll explore a few typical baseline models for recommender systems and see which ones do the best for our dataset.


1. Load the Data

Let's load and examine the ratings data. If you're following along (i.e., actually running these notebooks), you'll need to run the first notebook to download the data before running this one.

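The loading code is collapsed in the original post; here is a minimal sketch of what that cell might look like, assuming the first notebook left the raw MovieLens file at data/ml-100k/u.data (the path, and the sort by timestamp, are my assumptions based on the output below).

import pandas as pd

# Sketch only: the file path and exact column handling are assumptions.
ratings = pd.read_csv('data/ml-100k/u.data', sep='\t',
                      names=['userId', 'movieId', 'rating', 'timestamp'])
# The raw timestamps are Unix seconds; convert them and sort so the frame
# is in time order, which matches the head/tail shown below.
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings = ratings.sort_values('timestamp')
print('First 5:')
print(ratings.head())
print('Last 5:')
print(ratings.tail())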
First 5:
        userId  movieId  rating           timestamp
214        259      255       4 1997-09-19 23:05:10
83965      259      286       4 1997-09-19 23:05:27
43027      259      298       4 1997-09-19 23:05:54
21396      259      185       4 1997-09-19 23:06:21
82655      259      173       4 1997-09-19 23:07:23

Last 5:
        userId  movieId  rating           timestamp
46773      729      689       4 1998-04-22 19:10:38
73008      729      313       3 1998-04-22 19:10:38
46574      729      328       3 1998-04-22 19:10:38
64312      729      748       4 1998-04-22 19:10:38
79208      729      272       4 1998-04-22 19:10:38

2. Test Baseline Models

With all that framework setup out of the way, let's evaluate a few baseline models in increasing order of expected accuracy.

2.1 Simple Average Model

The first model we'll test is about the simplest one possible: we'll just average all of the training set ratings and use that average as the prediction for every test set example.

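The model code is collapsed in the original; below is a minimal sketch of a class with this behavior (the class name is mine, not necessarily the post's), written against the fit(train_df).predict(test_df) interface the cross-validation code in section 2.4 expects.

import numpy as np

class SimpleAverageModel:
    """Predict the global mean of the training ratings for every example."""
    def __init__(self, rating_col='rating'):
        self.rating_col = rating_col

    def fit(self, train_df):
        # Store the overall mean of the training ratings
        self.mean_ = train_df[self.rating_col].mean()
        return self

    def predict(self, test_df):
        # Every test example gets the same prediction
        return np.full(len(test_df), self.mean_)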

2.2 Average by ID Model

We can probably do a little better by using the user or item (movie) average. Here we'll set up a baseline model class that allows you to pass either a list of userIds or movieIds as X. The prediction for a given ID will just be the average of ratings from that ID, or the overall average if that ID wasn't seen in the training set.

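Again the implementation is collapsed. The post describes passing a list of userIds or movieIds as X; the sketch below instead takes the full DataFrame plus an id_col argument (a simplification of mine) so it plugs into the same fit(train_df).predict(test_df) interface used later.

class AverageByIdModel:
    """Predict each ID's training-set mean rating; fall back to the global mean."""
    def __init__(self, id_col, rating_col='rating'):
        self.id_col = id_col          # 'userId' or 'movieId'
        self.rating_col = rating_col

    def fit(self, train_df):
        self.overall_mean_ = train_df[self.rating_col].mean()
        self.means_ = train_df.groupby(self.id_col)[self.rating_col].mean()
        return self

    def predict(self, test_df):
        # map() yields NaN for IDs unseen in training; fill those with the overall mean
        preds = test_df[self.id_col].map(self.means_)
        return preds.fillna(self.overall_mean_).values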

2.3 Damped Baseline with User + Movie Data

This baseline model takes into account the average ratings of both the user and the movie, as well as a damping factor that pulls the baseline prediction back toward the overall mean. The damping factor has been shown empirically to improve performance.

This model follows equation 2.1 from a collaborative filtering paper from GroupLens, the same group that published the MovieLens data. That equation defines the baseline rating for user $u$ and item $i$ as

$$b_{u,i} = \mu + b_u + b_i$$

where

$$b_u = \frac{1}{|I_u| + \beta_u}\sum_{i \in I_u} (r_{u,i} - \mu)$$

and

$$b_i = \frac{1}{|U_i| + \beta_i}\sum_{u \in U_i} (r_{u,i} - b_u - \mu).$$

(See equations 2.4 and 2.5 in the paper.) Here, $\beta_u$ and $\beta_i$ are damping factors, for which the paper reports that 25 is a good value for this dataset. For now we'll just keep the two equal ($\beta = \beta_u = \beta_i$). Here's a summary of the meanings of all the variables:

| Variable | Meaning |
|----------|---------|
| $b_{u,i}$ | Baseline rating for user $u$ on item (movie) $i$ |
| $\mu$ | The mean of all ratings |
| $b_u$ | The deviation from $\mu$ associated with user $u$ |
| $b_i$ | The deviation from $\mu + b_u$ associated with item $i$ |
| $I_u$ | The set of all items rated by user $u$ |
| $\mid I_u \mid$ | The number of items rated by user $u$ |
| $\beta_u$ | Damping factor for the users ($=\beta$) |
| $r_{u,i}$ | Observed rating for user $u$ on item $i$ |
| $U_i$ | The set of all users who rated item $i$ |
| $\mid U_i \mid$ | The number of users who rated item $i$ |
| $\beta_i$ | Damping factor for the items ($=\beta$) |
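The implementation is collapsed in the post; here is a minimal sketch that follows the equations above directly (the class and parameter names are my own). Unseen users or movies get a deviation of 0, i.e. they fall back toward $\mu$.

class DampedUserMovieBaselineModel:
    """Baseline b_{u,i} = mu + b_u + b_i with a shared damping factor beta."""
    def __init__(self, damping_factor=0, rating_col='rating'):
        self.beta = damping_factor
        self.rating_col = rating_col

    def fit(self, train_df):
        r = train_df[self.rating_col]
        self.mu_ = r.mean()
        # b_u: damped mean deviation of each user's ratings from mu
        user_dev = (r - self.mu_).groupby(train_df['userId'])
        self.b_u_ = user_dev.sum() / (user_dev.count() + self.beta)
        # b_i: damped mean deviation from (mu + b_u) for each movie
        b_u_per_rating = train_df['userId'].map(self.b_u_)
        item_dev = (r - self.mu_ - b_u_per_rating).groupby(train_df['movieId'])
        self.b_i_ = item_dev.sum() / (item_dev.count() + self.beta)
        return self

    def predict(self, test_df):
        b_u = test_df['userId'].map(self.b_u_).fillna(0)   # unseen user -> no deviation
        b_i = test_df['movieId'].map(self.b_i_).fillna(0)  # unseen movie -> no deviation
        return (self.mu_ + b_u + b_i).values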

2.4 Cross-Validation Framework

Because the ratings distributions look relatively unchanged over time, we will use a time-independent cross-validation framework to determine the best baseline model moving forward. Below we define get_xval_errs_and_res(): pass in a DataFrame and a baseline model object, and it returns the Mean Absolute Error (MAE) from each of the 5 (or n_splits) folds, along with the out-of-fold residual for every rating.

In [6]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

def get_xval_errs_and_res(df, model, n_splits=5, random_state=0, rating_col='rating'):
    """Return the per-fold MAEs and the out-of-fold residual for every rating."""
    kf = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    errs = []
    residuals = np.zeros(len(df))
    for train_inds, test_inds in kf.split(df):
        train_df, test_df = df.iloc[train_inds], df.iloc[test_inds]
        # Fit on the training folds, predict the held-out fold
        pred = model.fit(train_df).predict(test_df)
        residuals[test_inds] = pred - test_df[rating_col]
        errs.append(mean_absolute_error(test_df[rating_col], pred))
    return errs, residuals
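The evaluation cell is also collapsed; a sketch of what it likely does, using the names from the sketches above (the ratings DataFrame and the model class names are assumptions), is:

import pandas as pd

# Run each baseline through the cross-validation helper and collect
# the per-fold MAEs and the per-rating out-of-fold residuals.
models = {
    'Average': SimpleAverageModel(),
    'Item Average': AverageByIdModel(id_col='movieId'),
    'User Average': AverageByIdModel(id_col='userId'),
    'Combined 0': DampedUserMovieBaselineModel(damping_factor=0),
    'Combined 10': DampedUserMovieBaselineModel(damping_factor=10),
    'Combined 25': DampedUserMovieBaselineModel(damping_factor=25),
    'Combined 50': DampedUserMovieBaselineModel(damping_factor=50),
}
errs, residuals = {}, {}
for name, model in models.items():
    errs[name], residuals[name] = get_xval_errs_and_res(ratings, model)
errs_df = pd.DataFrame(errs)            # one row per fold (first table below)
residuals_df = pd.DataFrame(residuals)  # one row per rating (tail shown below)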
Per-fold MAE:
    Average  Item Average  User Average  Combined 0  Combined 10  Combined 25  Combined 50
0  0.938915      0.811093      0.831037    0.755218     0.755246     0.761931     0.773639
1  0.953741      0.825894      0.842430    0.763540     0.764793     0.772484     0.785081
2  0.945107      0.814342      0.837044    0.757141     0.757230     0.764379     0.776727
3  0.937957      0.813231      0.829171    0.752095     0.751890     0.759048     0.771349
4  0.947786      0.820604      0.835589    0.759378     0.758553     0.765714     0.778648

Out-of-fold residuals (last 5 rows):
        Average  Item Average  User Average  Combined 0  Combined 10  Combined 25  Combined 50
99995 -0.469975     -0.818182     -1.312500   -1.685529    -1.309622    -1.064009    -0.879723
99996  0.529288      1.257840     -0.250000    0.470216     0.745825     0.889503     0.957703
99997  0.529288      0.434211     -0.250000   -0.315062    -0.012279     0.166903     0.286514
99998 -0.469762     -0.850202     -1.222222   -1.576204    -1.296998    -1.111291    -0.969507
99999 -0.469975      0.222222     -1.312500   -0.585790    -0.313551    -0.185635    -0.140248
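The plotting cell is collapsed as well; here is a rough sketch of the kind of figure described below, assuming seaborn (the exact plot in the post may differ).

import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(10, 8))
# Per-fold MAE for each baseline
sns.boxplot(data=errs_df, ax=ax0)
ax0.set_ylabel('Cross-validated MAE')
# Density of the out-of-fold residuals for each baseline
for name in residuals_df.columns:
    sns.kdeplot(residuals_df[name], ax=ax1, label=name)
ax1.set_xlabel('Residual (predicted - actual)')
ax1.legend()
plt.show()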

The MAE plots above show that the combined model with a damping factor of 0 or 10 performs the best, followed by the item average, then the user average. It makes sense that taking into account deviations from the mean due to both user and item would perform the best: there is simply more data being taken into account for each baseline prediction. The same idea explains why the item average performs better than the user average: there are more items than users in this dataset, so averaging over items takes into account more data per baseline prediction than averaging over users. The residual plots underneath the MAE plot illustrate that taking into account more data pulls the density of the residuals closer to 0.

Before moving on to collaborative filtering models, we'll want to choose which model to use as a baseline. The Combined 0 and Combined 10 models performed essentially identically, but we'll choose the Combined 10 model: a higher damping factor amounts to stronger regularization, which should guard against overfitting better than a damping factor of 0.

Check out the next post/notebook to see collaborative filtering models building on these baselines!

