MovieLens Project Part 2

Baseline models are important for two key reasons:

  1. Baseline models give us a starting point against which to compare all future models, and
  2. Smart baselines/averages may be needed to fill in missing data for more complicated models

Here, we'll explore a few typical baseline models for recommender systems and …
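As a taste of what such a baseline looks like, here's a minimal sketch: predict a user's mean rating, falling back to the item mean and then the global mean for unseen users or items. The column names and data are illustrative, not necessarily those used in the post:

```python
import pandas as pd

# Toy ratings frame with columns userId, movieId, rating (illustrative only)
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5],
})

global_mean = ratings["rating"].mean()
user_means = ratings.groupby("userId")["rating"].mean()
item_means = ratings.groupby("movieId")["rating"].mean()

def baseline_predict(user, movie):
    """Predict the user's mean rating, falling back to the item mean,
    then the global mean, when the user or item is unseen."""
    if user in user_means.index:
        return user_means[user]
    if movie in item_means.index:
        return item_means[movie]
    return global_mean

print(baseline_predict(1, 30))   # user 1's mean rating: 3.5
print(baseline_predict(99, 20))  # unseen user -> item 20's mean: 3.75
```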

Read more →

MovieLens Project Part 1

I'm beginning a project on the MovieLens dataset to learn about collaborative filtering algorithms. This is Part 1, where I do an initial exploratory data analysis to see what the data looks like. The remainder of this post is straight out of a Jupyter Notebook file you …

Read more →

Analyzing Larger-than-Memory Data on your Laptop

If you want to run some analysis on a dataset that's just a little too big to load into memory on your laptop, but you don't want to leave the comfort of using Pandas dataframes in a Jupyter notebook, then Dask may be just your thing. Dask is an amazing …
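For a flavor of the workflow, here's a minimal sketch using Dask's dataframe API; the file name and column names are made up for illustration:

```python
import dask.dataframe as dd

# Lazily read a CSV that may be larger than RAM; Dask splits it into
# partitions and only materializes what each computation needs.
df = dd.read_csv("ratings-*.csv")

# Build up a computation with the familiar pandas API...
mean_by_user = df.groupby("userId")["rating"].mean()

# ...and only pull the (small) final result into memory at the end.
result = mean_by_user.compute()
print(result.head())
```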

Read more →

Taking Advantage of Sparsity in the ALS-WR Algorithm

The ALS-WR algorithm works well for recommender systems built on a sparse users-by-items ratings matrix, which arises when most people review only a small subset of the many possible items (businesses, movies, etc.). By tweaking the code from a great tutorial to take advantage of this sparsity, I was able to dramatically reduce the computation time.
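The core idea is to store and iterate over only the observed ratings. Here's a minimal sketch using SciPy's CSR format with toy data, not the tutorial's actual code:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy observed ratings as (user, item, rating) triplets -- illustrative only.
users   = np.array([0, 0, 1, 2, 2])
items   = np.array([1, 3, 0, 1, 2])
ratings = np.array([4.0, 3.0, 5.0, 2.0, 4.5])

R = csr_matrix((ratings, (users, items)), shape=(3, 4))

# In the per-user step of ALS, you only need the items that user actually
# rated; CSR storage hands you exactly that slice without ever touching
# the zeros that dominate the full matrix.
u = 0
rated_items = R[u].indices    # columns with observed ratings for user u
observed    = R[u].data       # the corresponding rating values
print(rated_items, observed)  # [1 3] [4. 3.]
```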

Read more →

Dealing with Grid Data in Python

In my PhD research, I do a lot of analysis of 2D and 3D grid data output by simulations I run. In my analyses, it's very helpful to restructure these data into a more usable format. A few key lines of Python code do the trick.
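As a rough sketch of the kind of restructuring I mean (the file name, column layout, and ordering are assumptions for illustration):

```python
import numpy as np

# Suppose the simulation writes one row per grid point: x, y, value.
data = np.loadtxt("field_output.dat")   # shape (nx*ny, 3)

x, y, values = data[:, 0], data[:, 1], data[:, 2]
nx, ny = np.unique(x).size, np.unique(y).size

# Restructure the flat column of values into a 2D array so that
# grid[i, j] is the field value at (x_i, y_j). This assumes the file
# varies y fastest; pass order="F" to reshape if x varies fastest.
grid = values.reshape(nx, ny)
```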

Read more →

Interactive D3 Map of Baby Name Popularity

[Interactive map embedded here: a name dropdown and a year slider starting at 1910]

Using and Understanding this Map

To use the map above, select a name from the dropdown list (you should be able to type a name if you don't want to scroll), then drag the slider to move in time between the years 1910 and 2014 …

Read more →

Parameter Sweep Bash Script

In my polymer field theory research, my studies often involve running many simulations in which I pick one or more input parameters and vary them over a range of values, then compare the results of the separate simulations to see how those variables affect the system I'm simulating. This kind of study is called a “parameter sweep”, and it is also “embarrassingly parallel”, because the processor(s) for each individual job don’t need to communicate with the processor(s) of any other job. Manually creating input files for every job is very tedious, so I wrote a bash script to help me out.
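The post itself uses bash, but the idea translates directly. Here's a hypothetical Python sketch that writes one input file per parameter combination; the parameter names and template are made up:

```python
import itertools
from pathlib import Path

# Illustrative input-file template and parameter grid.
template = "chi = {chi}\nN = {N}\n"
chis = [0.1, 0.2, 0.3]
Ns   = [50, 100]

# One run directory per point in the sweep, each with its own input file,
# ready to be submitted as independent ("embarrassingly parallel") jobs.
for chi, N in itertools.product(chis, Ns):
    run_dir = Path(f"run_chi{chi}_N{N}")
    run_dir.mkdir(exist_ok=True)
    (run_dir / "input.txt").write_text(template.format(chi=chi, N=N))
```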

Read more →