Baseline models are important for two key reasons:

  1. Baseline models give us a starting point against which to compare all future models, and
  2. Smart baselines/averages may be needed to fill in missing data for more complicated models

Here, we'll explore a few typical baseline models for recommender systems and see which ones do the best for our dataset.

In [1]:

1. Load the Data

Let's load and examine the ratings data. From here on until the very end of this project, we want to keep a holdout set that we won't touch except to evaluate a few models we choose based on cross-validation and other metrics. Here we'll split the data into reviews through 2013 for training and years 2014 and on as our holdout set. We do this instead of splitting randomly so that we don't get data leakage from the future during model development. We'll save these files as feather files for fast future loading.

In [2]:
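The code in this cell is collapsed in the published notebook. A minimal sketch of the split might look like the following; the raw ratings path and the holdout filename are assumptions, and the real cell evidently caches the result and reloads the feather file on later runs, per the message below.

```python
import pandas as pd

# Hypothetical raw data path -- adjust to wherever ratings.csv lives.
ratings = pd.read_csv('../data/ratings.csv')
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
for part in ('year', 'month', 'day', 'hour', 'minute'):
    ratings[part] = getattr(ratings['timestamp'].dt, part)

# Time-based split: everything through 2013 for training/CV,
# 2014 and later held out until the very end of the project.
train = ratings[ratings['year'] <= 2013].reset_index(drop=True)
holdout = ratings[ratings['year'] >= 2014].reset_index(drop=True)

train.to_feather('../preprocessed/ratings-through-2013.feather')
holdout.to_feather('../preprocessed/ratings-2014-onward.feather')  # filename assumed
train.tail()
```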
Loading ../preprocessed/ratings-through-2013.feather
| | userId | movieId | rating | timestamp | year | month | day | hour | minute |
|---|---|---|---|---|---|---|---|---|---|
| 19153905 | 120609 | 81562 | 3.0 | 2013-12-31 23:57:27 | 2013 | 12 | 31 | 23 | 57 |
| 19153906 | 120609 | 356 | 2.5 | 2013-12-31 23:57:36 | 2013 | 12 | 31 | 23 | 57 |
| 19153907 | 120609 | 74458 | 4.0 | 2013-12-31 23:57:47 | 2013 | 12 | 31 | 23 | 57 |
| 19153908 | 44501 | 70533 | 4.5 | 2013-12-31 23:58:07 | 2013 | 12 | 31 | 23 | 58 |
| 19153909 | 44501 | 96821 | 4.0 | 2013-12-31 23:58:34 | 2013 | 12 | 31 | 23 | 58 |

2. Establish Cross-Validation Technique

Before getting into model development, it's important to establish the performance of some baseline models to compare to. Simple cross-validation doesn't take into account the time-dependence of the reviews. Instead, we will validate the baseline models using a kind of rolling cross-validation that only predicts future data from past data. This process can be visualized as follows (image linked from StackOverflow):

Rolling Cross-Validation Schematic

In this case, we'll use 2-year intervals for each test set. Scikit-learn has a class called TimeSeriesSplit that is close to the splitting scheme we want, but it only supports splitting by number of rows, not by a fixed time period. We'll write our own class with a similar interface.

In [3]:
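A sketch of what such a splitter could look like; the class name TimeRangeSplit and its arguments are my own, not necessarily what's in the hidden cell.

```python
import numpy as np

class TimeRangeSplit:
    """Rolling splitter: each fold trains on all rows before a cutoff year
    and tests on the following `interval` years. Mimics the spirit of
    sklearn's TimeSeriesSplit, but splits on a fixed time period rather
    than on row counts."""

    def __init__(self, first_test_year, last_test_year, interval=2):
        self.first_test_year = first_test_year
        self.last_test_year = last_test_year
        self.interval = interval

    def split(self, df, year_col='year'):
        years = df[year_col].values
        for start in range(self.first_test_year,
                           self.last_test_year + 1, self.interval):
            # Train on everything strictly before the test window.
            train_idx = np.where(years < start)[0]
            test_idx = np.where((years >= start)
                                & (years < start + self.interval))[0]
            if len(train_idx) and len(test_idx):
                yield train_idx, test_idx
```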

Now that we have our cross-validation splitting framework set up, let's build a harness to run the validation itself. For each time period we test, we'll compute the Mean Absolute Error (MAE), a commonly used error metric for recommender systems. The harness's validate() method will return the list of years and errors to make it easy to plot the error over time.

In [4]:
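Roughly, the validation harness could be structured like this. Class and argument names are illustrative, and it assumes each model exposes fit(X, y) and predict(X).

```python
from sklearn.metrics import mean_absolute_error

class BaselineValidator:
    """Run a model through each rolling time split and record the MAE."""

    def __init__(self, ratings, splitter):
        self.ratings = ratings
        self.splitter = splitter

    def validate(self, model, feature_cols=('userId', 'movieId')):
        years, errors = [], []
        for train_idx, test_idx in self.splitter.split(self.ratings):
            train = self.ratings.iloc[train_idx]
            test = self.ratings.iloc[test_idx]
            model.fit(train[list(feature_cols)], train['rating'])
            preds = model.predict(test[list(feature_cols)])
            # Record the first year of the test window and its MAE.
            years.append(test['year'].min())
            errors.append(mean_absolute_error(test['rating'], preds))
        return years, errors
```

For single-ID baselines below, feature_cols can be just ('userId',) or ('movieId',).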

3. Testing Baseline Models

With all that framework setup out of the way, let's evaluate a few baseline models in increasing order of expected accuracy.

3.1 Simple Average Model

The first model we'll test is about the simplest one possible: we'll just average all the training set ratings and use that average as the prediction for every test set example.

In [5]:
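A sketch of such a model, matching the fit/predict interface assumed in the sketches above:

```python
import numpy as np

class SimpleAverageModel:
    """Predict the global training-set mean rating for every example."""

    def fit(self, X, y):
        self.mean_ = np.asarray(y).mean()
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)
```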

3.2 Average by ID Model

We can probably do a little better by using the user or item (movie) average. Here we'll set up a baseline model class that allows you to pass either a list of userIds or movieIds as X. The prediction for a given ID will just be the average of ratings from that ID, or the overall average if that ID wasn't seen in the training set.

In [6]:
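One possible implementation (again a sketch; the hidden cell may differ):

```python
import numpy as np
import pandas as pd

class AverageByIdModel:
    """Predict each ID's mean training rating, falling back to the global
    mean for IDs not seen in training. X is a single column of userIds
    or movieIds."""

    def fit(self, X, y):
        ids = np.asarray(X).ravel()
        y = pd.Series(np.asarray(y))
        self.global_mean_ = y.mean()
        # Mean rating per ID, indexed by the ID value.
        self.means_ = y.groupby(ids).mean()
        return self

    def predict(self, X):
        ids = pd.Series(np.asarray(X).ravel())
        return ids.map(self.means_).fillna(self.global_mean_).values
```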

3.3 Damped Baseline with User + Movie Data

This baseline model takes into account the average ratings of both the user and the movie, as well as a damping factor that pulls the baseline prediction back toward the overall mean. The damping factor has been shown empirically to improve performance.

This model follows equation 2.1 from a collaborative filtering paper from GroupLens, the same group that published the MovieLens data. That equation defines the baseline rating for user $u$ and item $i$ as

$$b_{u,i} = \mu + b_u + b_i$$

where

$$b_u = \frac{1}{|I_u| + \beta_u}\sum_{i \in I_u} (r_{u,i} - \mu)$$

and

$$b_i = \frac{1}{|U_i| + \beta_i}\sum_{u \in U_i} (r_{u,i} - b_u - \mu).$$

(See equations 2.4 and 2.5.) Here, $\beta_u$ and $\beta_i$ are damping factors, for which the paper reports that 25 is a good value for this dataset. For now we'll keep the two equal ($\beta=\beta_u=\beta_i$). Here's a summary of all the variables:

| Variable | Meaning |
|---|---|
| $b_{u,i}$ | Baseline rating for user $u$ on item (movie) $i$ |
| $\mu$ | The mean of all ratings |
| $b_u$ | The deviation from $\mu$ associated with user $u$ |
| $b_i$ | The deviation from $\mu+b_u$ associated with item $i$ |
| $I_u$ | The set of all items rated by user $u$ |
| $\mid I_u \mid$ | The number of items rated by user $u$ |
| $\beta_u$ | Damping factor for the users ($=\beta$) |
| $r_{u,i}$ | Observed rating for user $u$ on item $i$ |
| $U_i$ | The set of all users who rated item $i$ |
| $\mid U_i \mid$ | The number of users who rated item $i$ |
| $\beta_i$ | Damping factor for the items ($=\beta$) |
In [7]:
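A sketch implementing the damped baseline equations above; the class name and organization are assumptions, but the math follows $b_u$ and $b_i$ as defined.

```python
import numpy as np
import pandas as pd

class DampedUserMovieBaselineModel:
    """b_{u,i} = mu + b_u + b_i with a single damping factor beta.
    X is a DataFrame with 'userId' and 'movieId' columns."""

    def __init__(self, damping_factor=25):
        self.beta = damping_factor

    def fit(self, X, y):
        X = X[['userId', 'movieId']].copy()
        X['rating'] = np.asarray(y)
        self.mu_ = X['rating'].mean()

        # b_u: damped sum of each user's deviations from mu.
        user_counts = X.groupby('userId').size()
        user_dev_sum = (X['rating'] - self.mu_).groupby(X['userId']).sum()
        self.b_u_ = user_dev_sum / (user_counts + self.beta)

        # b_i: damped sum of each movie's deviations from mu + b_u.
        X['b_u'] = X['userId'].map(self.b_u_)
        item_counts = X.groupby('movieId').size()
        item_dev_sum = (X['rating'] - self.mu_ - X['b_u']).groupby(X['movieId']).sum()
        self.b_i_ = item_dev_sum / (item_counts + self.beta)
        return self

    def predict(self, X):
        # Unseen users/movies contribute zero deviation.
        b_u = X['userId'].map(self.b_u_).fillna(0.0)
        b_i = X['movieId'].map(self.b_i_).fillna(0.0)
        return (self.mu_ + b_u + b_i).values
```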

4. Actually Evaluate the Baseline Models

OK, so now we have our validation framework set up, and we've defined some baseline models to compare. Now let's actually run them! For the Damped Baseline model, we'll try $\beta=$0, 25, and 50 to bracket the recommended value of 25.

In [8]:
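The driver loop might look roughly like this, reusing the sketched classes above and a `validator` instance built on the training data (names are illustrative):

```python
# Run each baseline through the rolling validation and collect results.
experiments = [
    ('Simple Average Model', SimpleAverageModel(), ('userId', 'movieId')),
    ('User Average Model', AverageByIdModel(), ('userId',)),
    ('Movie Average Model', AverageByIdModel(), ('movieId',)),
    ('Damped Baseline with beta=0', DampedUserMovieBaselineModel(0), ('userId', 'movieId')),
    ('Damped Baseline with beta=25', DampedUserMovieBaselineModel(25), ('userId', 'movieId')),
    ('Damped Baseline with beta=50', DampedUserMovieBaselineModel(50), ('userId', 'movieId')),
]

results = {}
for name, model, cols in experiments:
    print(f'Running {name}...', end='')
    results[name] = validator.validate(model, feature_cols=cols)
    print('Done!')
```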
Running Simple Average Model...Done!
Running User Average Model...Done!
Running Movie Average Model...Done!
Running Damped Baseline with beta=0...Done!
Running Damped Baseline with beta=25...Done!
Running Damped Baseline with beta=50...Done!
CPU times: user 2min 6s, sys: 1min 31s, total: 3min 38s
Wall time: 3min 54s

Good, that didn't take too long: just about four minutes on my laptop. Now let's see how these models perform over time:

In [9]:
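A minimal plotting sketch, assuming `results` holds the (years, errors) pairs returned by validate():

```python
import matplotlib.pyplot as plt

# One MAE-vs-time curve per model.
fig, ax = plt.subplots(figsize=(9, 5))
for name, (years, errors) in results.items():
    ax.plot(years, errors, marker='o', label=name)
ax.set_xlabel('First year of test window')
ax.set_ylabel('Mean Absolute Error')
ax.set_title('Rolling-validation MAE of baseline models')
ax.legend()
plt.show()
```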

The results here aren't too surprising. The Simple Average model performed the worst, with the User Average model performing just slightly better. The biggest performance jump comes from using Movie Averages, simply because each movie typically has more ratings associated with it than each user. We do a little better still by moving to the Damped Baseline model, and slightly better again by raising $\beta$ to 25. Increasing it beyond that appears to have negligible benefit.

Let's compare the distributions of the prediction residuals (prediction - actual) for three of these models: User Average, Movie Average, and the Damped Baseline with $\beta=25$:

In [10]:
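One way to assemble the residuals, assuming `train_fold` and `test_fold` come from the final rolling split (a sketch; the hidden cell may compute them differently):

```python
import pandas as pd

def split_residuals(model, cols, train_df, test_df):
    """Fit on train_df and return prediction - actual on test_df."""
    model.fit(train_df[list(cols)], train_df['rating'])
    return model.predict(test_df[list(cols)]) - test_df['rating'].values

resid_df = pd.DataFrame({
    'User Average': split_residuals(
        AverageByIdModel(), ('userId',), train_fold, test_fold),
    'Movie Average': split_residuals(
        AverageByIdModel(), ('movieId',), train_fold, test_fold),
    'Damped Baseline (beta=25)': split_residuals(
        DampedUserMovieBaselineModel(25), ('userId', 'movieId'),
        train_fold, test_fold),
})
```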

Now that we have the residuals in a dataframe, let's plot those distributions and see how they compare:

In [11]:
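A plotting sketch using overlaid step histograms (the actual cell may use a different style):

```python
import matplotlib.pyplot as plt

# Overlay the residual distributions for the three models.
fig, ax = plt.subplots(figsize=(9, 5))
for col in resid_df.columns:
    ax.hist(resid_df[col], bins=100, histtype='step',
            density=True, label=col)
ax.set_xlabel('Residual (prediction - actual)')
ax.set_ylabel('Density')
ax.legend()
plt.show()
```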

The user model has a lot of regular peaks, likely because users have very few associated ratings compared to movies, so many of the user averages land exactly on half (x.5) or quarter (x.25) values. That gets smoothed out considerably for movies, which typically have many more ratings, and smoothed even further for the combined model, thanks to the extra nuance from combining user and movie averages along with the damping factor.

That's all for now! Next we'll do some actual collaborative filtering.

