I'm beginning a project on the MovieLens dataset to learn about collaborative filtering algorithms. This is Part 1 of this project, where I do an initial exploratory data analysis to see what the data looks like. The remainder of this post is straight out of a Jupyter Notebook file you can download here. You can also see it here on GitHub.

I have most of the code collapsed in the boxes that say "Expand Code" so we can skip right to the visualizations, but all of the code is here if you expand the code blocks. Thanks to Jake Vanderplas for the amazing Pelican plugin that makes this possible.

Let's load and examine the ratings data. I sort the data by `timestamp` and break the `timestamp` column out into `year`, `month`, `day`, `hour`, and `minute` columns just for good measure. I save that preprocessed data to the feather file format for faster loading later on.
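Since the full code is collapsed, here's a minimal sketch of what that preprocessing looks like, using a tiny made-up DataFrame in place of the real `ratings.csv` (the columns `userId`, `movieId`, `rating`, and `timestamp` follow the MovieLens file layout; the feather write is commented out since it requires `pyarrow`):

```python
import pandas as pd

# Tiny synthetic stand-in for ratings.csv. In MovieLens, timestamp is
# seconds since the Unix epoch.
ratings_df = pd.DataFrame({
    'userId':    [1, 1, 2],
    'movieId':   [10, 20, 10],
    'rating':    [4.0, 3.5, 5.0],
    'timestamp': [1045526400, 851990400, 1045526460],
})

# Sort chronologically, then break the timestamp out into parts.
ratings_df = ratings_df.sort_values('timestamp').reset_index(drop=True)
dt = pd.to_datetime(ratings_df['timestamp'], unit='s')
for part in ['year', 'month', 'day', 'hour', 'minute']:
    ratings_df[part] = getattr(dt.dt, part)

# Feather round-trips much faster than re-parsing the CSV:
# ratings_df.to_feather('ratings.feather')
print(ratings_df[['timestamp', 'year', 'month', 'day']].head())
```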

This part of the dataset is quite simple in principle: just user-movie-rating triplets at different points in time. Let's ask some questions about the data to get a feel for what we're working with:

# 1. What is the distribution of the ratings?

I'd like to see what ratings are common and uncommon, so let's just plot counts of each rating:
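The counting itself is a one-liner in pandas. A sketch on a hypothetical handful of ratings:

```python
import pandas as pd

# Hypothetical small sample; the real plot uses the full ratings column.
ratings = pd.Series([4.0, 3.0, 4.0, 5.0, 3.5, 4.0, 2.0, 3.0])

# Count each rating value, keeping the 0.5-step scale in order.
counts = ratings.value_counts().sort_index()
print(counts)

# counts.plot.bar() renders this as the bar chart of rating frequencies.
```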

That's odd...why are the half-star scores less popular? Do people prefer whole numbers? Let's look at how this distribution changes over time to see if this behavior is consistent.

# 2. How does the ratings distribution change over time?

I think a heatmap with time on the x-axis and rating on the y-axis would be a good way to visualize this. To do this with seaborn we first need to pivot the data into a table that resembles this structure:
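That pivot can be built with a groupby and an unstack. A sketch on made-up data, assuming the `year` column from the preprocessing step:

```python
import pandas as pd

# Illustrative ratings with the year already extracted.
df = pd.DataFrame({
    'year':   [2002, 2002, 2004, 2004, 2004],
    'rating': [4.0, 3.0, 3.5, 4.0, 4.0],
})

# Rows = rating, columns = year, values = counts — the 2-D table
# that sns.heatmap expects.
pivot = df.groupby(['rating', 'year']).size().unstack(fill_value=0)
print(pivot)

# sns.heatmap(pivot) then draws it with rating on the y-axis and time
# on the x-axis.
```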

And now we're set to visualize this:

Ahh, this explains the rating distribution. It looks like half-stars weren't allowed until part-way through 2003. After 2003 the distribution looks pretty smooth and consistent. This leads me to my next question:

# 3. How do the ratings distributions compare before and after half scores are allowed?

If we find the timestamp of the very first half-star rating, we can look at the distributions before and after that timestamp. Here, we can see it was a 3.5-star rating on February 18th, 2003, for Catch Me If You Can (go to https://movielens.org/movies/5989 to see the title of `movieId` 5989).

```
# The data is already sorted by timestamp, so the first row with a
# half-star score marks the moment half-star ratings became possible.
switch_timestamp = ratings_df[ratings_df['rating'].isin([0.5, 1.5, 2.5, 3.5, 4.5])].iloc[0]['timestamp']
ratings_df[ratings_df['timestamp'] == switch_timestamp]
```

# 4. How many ratings were submitted per year?

The number of ratings has not been constant over the years. Setting aside the measly 4 ratings from 1995 (all submitted within the same minute — perhaps test ratings at the very beginning of the project to make sure the system worked?), there were large variations from 1996 to 2005 (possibly due to funding or grad-student availability?), followed by a steady decrease after that (maybe due to the rise of Netflix as a good movie recommender?).
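The per-year counts behind that plot come from a simple groupby. A sketch on synthetic data, assuming the `year` column from the preprocessing step:

```python
import pandas as pd

# Synthetic stand-in: one row per rating, year already extracted.
ratings_df = pd.DataFrame({'year': [1996, 1996, 2000, 2005, 2005, 2005]})

# Ratings submitted per year; per_year.plot.bar() gives the yearly chart.
per_year = ratings_df.groupby('year').size()
print(per_year)
```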

So the number of ratings has not been constant, but the distribution over time visualized in the heatmap above seems to show some consistency (except for the change in 2003). This leads to another question about changes over time:

# 5. How consistent are the average ratings over time?

The average ratings were fairly consistent, hovering around 3.5. The lack of large changes over time will simplify modeling a little bit.
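The yearly averages are another one-line groupby. A sketch on made-up data:

```python
import pandas as pd

# Illustrative ratings with the year extracted.
ratings_df = pd.DataFrame({
    'year':   [1996, 1996, 2005, 2005],
    'rating': [3.0, 4.0, 3.5, 3.5],
})

# Mean rating per year — plotted as a line to check for drift over time.
mean_by_year = ratings_df.groupby('year')['rating'].mean()
print(mean_by_year)
```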

One last thing I'm curious about (for now) when it comes to changes over time:

# 6. How quickly do the movie and user bases grow over time?

I'll assume that a user has joined on her first rating, and that she remains a user from then on.
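Under that assumption, the growth curve is the cumulative count of first appearances. A sketch on toy data (the same recipe with `movieId` gives the movie-base curve):

```python
import pandas as pd

# Toy ratings: each user "joins" at the timestamp of their first rating.
ratings_df = pd.DataFrame({
    'userId':    [1, 1, 2, 3, 3],
    'timestamp': [100, 500, 200, 300, 900],
})

# First appearance of each user, in chronological order.
first_seen = ratings_df.groupby('userId')['timestamp'].min().sort_values()

# Cumulative count of users seen so far = growth curve of the user base.
growth = pd.Series(range(1, len(first_seen) + 1), index=first_seen.values)
print(growth)
```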

The number of new users in the dataset grows fairly linearly over time. I was expecting more variation given the fluctuations in the number of ratings per year, but this is good to know. It's interesting that after 2008 there appears to be a sudden increase in the number of movies rated. I'm not sure whether there was a sharp increase in the number of movies produced per year, or whether a larger fraction of movies was made available to rate in the MovieLens system, but this might be something to keep in mind.

# 7. How sparse is the user/movies matrix we'll be dealing with?

Sparsity is a very common challenge to overcome in many collaborative filtering applications. By sparsity, I mean that if we create a matrix $R$ with dimensions $n_{users} \times n_{movies}$ where each element $r_{ij}$ is a single rating by user $i$ of movie $j$, this matrix will be very empty because most users have only rated a few of the 25,000+ movies available. Let's see how bad it is.
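Quantifying "how bad" just means comparing the number of observed ratings to the size of $R$. A sketch on a toy example:

```python
import pandas as pd

# Toy example: 3 users, 4 movies, 5 observed ratings.
ratings_df = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 30, 40],
})

n_users = ratings_df['userId'].nunique()
n_movies = ratings_df['movieId'].nunique()

# Fraction of the n_users x n_movies matrix R that is actually filled in.
density = len(ratings_df) / (n_users * n_movies)
print(f'{density:.1%} of R is filled; {1 - density:.1%} is empty')
```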

Here's one way of looking at it: let's sort the ~140,000 users by decreasing number of movies each one rated, and plot the number of movies they rated. We can do the same for the number of users who rated each of the ~25,000 movies:
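`value_counts` returns exactly this sorted series. A sketch on toy data (swapping in `movieId` gives the users-per-movie version):

```python
import pandas as pd

# Toy ratings: user 1 rated three movies, user 2 two, user 3 one.
ratings_df = pd.DataFrame({'userId': [1, 1, 1, 2, 2, 3]})

# Number of movies rated per user, already sorted in decreasing order —
# plotting .values gives the long-tail curve.
movies_per_user = ratings_df['userId'].value_counts()
print(movies_per_user)
```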

Another way to visualize these distributions is with probability density functions (PDFs) which are just normalized histograms:

With such long-tailed distributions, it helps to take the logarithm of the x-axis to make sense of it. Often a log-transform of a long-tailed distribution reveals a normal-ish shape, suggesting the original distribution is log-normal. Let's see what a $\log_{10}$ transform reveals here:
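One quick way to check this numerically is to look at the skewness of the log-transformed counts: a log-normal distribution would have skewness near zero after the transform. A sketch on hypothetical counts:

```python
import numpy as np
import pandas as pd

# Hypothetical long-tailed counts: number of ratings per user.
ratings_per_user = pd.Series([1, 2, 2, 3, 5, 8, 20, 50, 200, 1000])

# Log-transform; log-normal data would look symmetric on this scale.
log_counts = np.log10(ratings_per_user)

# Positive skewness means the transformed data is still right-skewed,
# i.e. the original distribution is heavier-tailed than log-normal.
print(log_counts.skew())
```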

Even after taking a log-transform, these distributions are quite skewed to the right, so they're not even log-normal distributions.
