Programming Assignment 5: Building a Movie Recommendation System

Team Details:

When submitting, fill your team details in this cell. Note that this is a markdown cell.

Student 1 Full Name:
Student 1 Student ID:
Student 1 Email Address:

Student 2 Full Name:
Student 2 Student ID:
Student 2 Email Address:

Student 3 Full Name:
Student 3 Student ID:
Student 3 Email Address:

Assignment Details

In this assignment, we will explore some of the fundamental concepts in building a recommendation system for movies. This assignment material is inspired by code from a number of resources, including those from Hanspeter Pfister, YHat, Sudeep Das, Jonny, and others. At a high level, you will be doing the following tasks:

1. Exploratory Analysis/Sparsity

Before building the actual recommender, we will explore the movie dataset and see how ratings are distributed across different dimensions and how sparse the overall dataset is. These factors (long tail/sparsity) cause a lot of issues when building a recommender.

2. Nearest Neighbor based Recommender System

We will start with a simple nearest neighbor based recommender system. Given a movie, we will try to find the top-$k$ most similar movies. We will just provide the list of movies without predicting the ratings.

3. Item based Collaborative Filtering

In this task, we will also try to predict the ratings. We will use item based collaborative filtering, which computes similarities between movies and uses them for recommendation.

4. User based Collaborative Filtering

In this task, we will design a user based collaborative filtering recommender that is very similar to the one you designed in Task 3.

5. Latent Factor Models

In this task, we will explore a couple of advanced models involving latent factors.

In [1]:
########### Do not change anything below

%matplotlib inline 

#Array processing
import numpy as np

#Data analysis, wrangling and common exploratory operations
import pandas as pd
from pandas import Series, DataFrame
from IPython.display import display

#For visualization. Matplotlib for basic viz and seaborn for more stylish figures + statistical figures not in MPL.
import matplotlib.pyplot as plt
import seaborn as sns


import scipy as sp
#SVD for Sparse matrices
from scipy.sparse.linalg import svds

from sklearn.metrics.pairwise import euclidean_distances

try:
   import cPickle as pickle
except:
   import pickle

from collections import defaultdict, Counter
import operator

from matplotlib import rcParams
rcParams['figure.figsize'] = (10, 6)

########################If needed you can import additional packages for helping you, although I would discourage it
########################Put all your imports in the space below. If you use some strange package, 
##########################provide clear information as to how we can install it.

#######################End imports###################################

Part 1: Exploratory Analysis/Sparsity

Dataset: We will be using the MovieLens 100K dataset. It is a fairly small dataset with 100K ratings from 1000 users on 1700 movies. You can download the dataset from http://grouplens.org/datasets/movielens/ . Some basic details about the dataset can be found in the README text file at http://files.grouplens.org/datasets/movielens/ml-100k-README.txt . You will need to understand the structure before doing the experiments. Download and unzip the file in the same folder as the IPython notebook. The rest of the code assumes that there is a folder called "ml-100k" in the current directory.
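If you prefer to fetch the dataset programmatically, the snippet below is a minimal, optional sketch (it assumes the standard GroupLens zip URL; verify the link if the file has moved):

#Optional helper (not part of the graded tasks): download and unzip the MovieLens 100K dataset
#Assumes the standard GroupLens URL - adjust if the file has moved
import os
import urllib
import zipfile

if not os.path.isdir("ml-100k"):
    urllib.urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", "ml-100k.zip")
    with zipfile.ZipFile("ml-100k.zip") as zf:
        zf.extractall(".")   #creates the ml-100k folder in the current directory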

In [2]:
#####Do not change anything below

#Load the user data
users_df = pd.read_csv('ml-100k/u.user', sep='|', names=['UserId', 'Age', 'Gender', 'Occupation', 'ZipCode'])
#Load the movies data: we will only use movie id and title for this assignment
movies_df = pd.read_csv('ml-100k/u.item', sep='|', names=['MovieId', 'Title'], usecols=range(2))
#Load the ratings data: ignore the timestamps
ratings_df = pd.read_csv('ml-100k/u.data', sep='\t', names=['UserId', 'MovieId', 'Rating'],usecols=range(3))

#Working on three different data frames is a pain
# Let us create a single dataset by "joining" these three data frames
movie_ratings_df = pd.merge(movies_df, ratings_df)
movielens_df = pd.merge(movie_ratings_df, users_df)

movielens_df.head()
Out[2]:
MovieId Title UserId Rating Age Gender Occupation ZipCode
0 1 Toy Story (1995) 308 4 60 M retired 95076
1 4 Get Shorty (1995) 308 5 60 M retired 95076
2 5 Copycat (1995) 308 4 60 M retired 95076
3 7 Twelve Monkeys (1995) 308 4 60 M retired 95076
4 8 Babe (1995) 308 5 60 M retired 95076
In [13]:
#Task t1a: Print the NAME of the top-10 movies with most ratings
print movielens_df.groupby('Title').size().order(ascending=False)[:10]
Title
Star Wars (1977)                 583
Contact (1997)                   509
Fargo (1996)                     508
Return of the Jedi (1983)        507
Liar Liar (1997)                 485
English Patient, The (1996)      481
Scream (1996)                    478
Toy Story (1995)                 452
Air Force One (1997)             431
Independence Day (ID4) (1996)    429
dtype: int64
In [14]:
#Task t1b: Let us now analyze the rating behavior of the 1000 users in the dataset
#  Create a histogram based on the number of ratings per user with 50 bins. 
# Title="Count of Ratings per User", XLabel="Ratings per user", YLabel="#Users"

#t1b_user_rating_count is a groupby object that counts the number of ratings for each user.
t1b_user_rating_count = movielens_df.groupby('UserId')['Rating'].count()
t1b_user_rating_count.hist(bins=50)
plt.xlabel("Ratings per user")
plt.ylabel("#Users")
plt.title("Count of Ratings per User")

####The figure below shows that most users leave relatively few ratings while a few outliers leave a lot of ratings
Out[14]:
<matplotlib.text.Text at 0x7f2b9f405910>
In [15]:
#Task t1c: Let us now analyze the rating behavior of the 1000 users in the dataset
#  Create a histogram based on the number of ratings per movie with 50 bins. 
# Title="Count of Ratings per Movie", XLabel="Ratings per Movie", YLabel="#Movies"

#t1c_user_rating_count is a groupby object that counts the number of ratings for each movie.
t1c_user_rating_count = movielens_df.groupby('MovieId')['Rating'].count()
t1c_user_rating_count.hist(bins=50)
plt.xlabel("Ratings per Movie")
plt.ylabel("#Movies")
plt.title("Count of Ratings per Movie")

####The figure below shows that most movies receive relatively few ratings while a few popular ones get a lot of ratings
Out[15]:
<matplotlib.text.Text at 0x7f2b9ef3e610>
In [16]:
#Task t1d: Let us now analyze the rating distribution
#  Create a histogram based on the ratings received for each movie with 5 bins. 
# Title="Ratings Histogram", XLabel="Rating Provided", YLabel="#Ratings"

movielens_df['Rating'].hist(bins=5)
plt.xlabel("Rating Provided")
plt.ylabel("#Ratings")
plt.title("Ratings Histogram")

####The figure below shows that a large fraction of the ratings (at least 35K of them) give a score of 4 or higher.
#The code below shows that the average rating is 3.5
# This is problematic because each user has a different rating scale
# For some users 1 is a bad movie while for some others 3 is bad
# So a good model must take into account the baseline rating behavior of users and movies

print "Average rating of ALL movies is", round(movie_ratings_df['Rating'].mean(), 2)
Average rating of ALL movies is 3.53
In [17]:
#Task t1e: Let us now study the baseline rating behavior in more detail
# For each user compute his/her average rating
#  Create a histogram based on the average ratings with 5 bins. 
# Title="Histogram of Average Ratings of Users", XLabel="Average Rating", YLabel="#Users"

#t1e_avg_ratings is a groupby object with average rating for each user
t1e_avg_ratings = movielens_df.groupby('UserId')['Rating'].mean()

t1e_avg_ratings.hist(bins=5)
plt.title("Histogram of Average Ratings of Users")
plt.xlabel("Average Rating")
plt.ylabel("#Users")


####The figure below shows that the average rating varies considerably from user to user
# What does this imply?
Out[17]:
<matplotlib.text.Text at 0x7f2b9eb54e90>
In [18]:
#Task t1f: Let us now study the baseline rating behavior in more detail
# For each movie compute its average rating
#  Create a histogram based on the average ratings with 5 bins. 
# Title="Histogram of Average Ratings of Movies", XLabel="Average Rating", YLabel="#Movies"

#t1f_avg_ratings is a groupby object with the average rating for each movie
t1f_avg_ratings = movielens_df.groupby('MovieId')['Rating'].mean()

t1f_avg_ratings.hist(bins=5)
plt.title("Histogram of Average Ratings of Movies")
plt.xlabel("Average Rating")
plt.ylabel("#Movies")


####The figure below shows that the average rating varies considerably from movie to movie
# What does this imply?
Out[18]:
<matplotlib.text.Text at 0x7f2b9eb7e610>

Common Support: The concept of common support is a key idea in recommender systems. Given two items (movies in our case), the common support is the number of reviewers who rated both items. It is useful in both K-nearest neighbor and collaborative filtering based recommenders. Specifically, if the common support is 0 for a pair of movies, then it is quite hard to find their similarity!
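As a tiny illustration (with made-up user ids, not from the dataset): if movie A was rated by users {1, 5, 9, 12} and movie B by users {5, 12, 40}, their common support is 2.

#Tiny illustration of common support with hypothetical user ids
raters_of_movie_a = set([1, 5, 9, 12])
raters_of_movie_b = set([5, 12, 40])
print "Common support:", len(raters_of_movie_a & raters_of_movie_b)   #prints 2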

In [38]:
#Task t1g: Let us now analyze the common support for movielens dataset
# Here is the high level idea
# We are going to create an array and populate it with the common support for all pair of movies
# We then are going to plot a histogram and see its distribution.

#This task might take quite some time - 1-2 hours for typical machines !

t1g_all_movies = movies_df['MovieId'].unique()
t1g_allpair_commonsupport = []

for i, movie1 in enumerate(t1g_all_movies):
    for j, movie2 in enumerate(t1g_all_movies):
        #Do not count a pair twice
        if i < j: 
            userids_who_rated_movie1 = movielens_df[movielens_df.MovieId==movie1].UserId.unique()
            userids_who_rated_movie2 = movielens_df[movielens_df.MovieId==movie2].UserId.unique()
            num_common_users = len(  set(userids_who_rated_movie1).intersection(userids_who_rated_movie2)   ) 
            t1g_allpair_commonsupport.append(num_common_users)

            
print "Average common support is ", round(np.mean(t1g_allpair_commonsupport), 2)
plt.hist(t1g_allpair_commonsupport)

            



#### What do the average common support and its distribution imply?
Average common support is  7.11
Out[38]:
(array([  1.37138500e+06,   3.26780000e+04,   7.02500000e+03,
          1.92200000e+03,   5.25000000e+02,   1.33000000e+02,
          4.20000000e+01,   9.00000000e+00,   1.00000000e+00,
          1.00000000e+00]),
 array([   0.,   48.,   96.,  144.,  192.,  240.,  288.,  336.,  384.,
         432.,  480.]),
 <a list of 10 Patch objects>)
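If the pairwise loop above is too slow on your machine, one optional way to speed it up (a sketch, not required) is to precompute the set of raters for each movie once instead of filtering movielens_df inside the inner loop:

#Optional speed-up sketch: precompute the raters of each movie once
movie_to_raters = {m: set(movielens_df[movielens_df.MovieId == m].UserId.unique())
                   for m in t1g_all_movies}
fast_commonsupport = []
for i, movie1 in enumerate(t1g_all_movies):
    for j, movie2 in enumerate(t1g_all_movies):
        if i < j:
            fast_commonsupport.append(len(movie_to_raters[movie1] & movie_to_raters[movie2]))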
In [19]:
#Task t1h: Let us now consider how sparse the matrix is
t1h_sparsity = 100000./(1682*943)
print "Sparsity of the dataset is ", t1h_sparsity

#This dataset is actually less sparse than typical datasets for recommender systems,
# which often have a sparsity much lower than 1%
# As discussed in class, sparsity poses a huge problem in terms of designing efficient and correct algorithms
Sparsity of the dataset is  0.0630466936422
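The same number can also be derived from the data frames instead of the hard-coded constants (a quick sanity check):

#Sanity check: compute the sparsity from the data frames themselves
t1h_sparsity_check = float(len(ratings_df)) / (len(movies_df) * len(users_df))
print "Sparsity of the dataset (recomputed) is ", t1h_sparsity_check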
In [21]:
#Task t1i: Compute the average rating for each movie grouped by gender
#  In other words, for each movie, compute the average rating given by men and women
# Hint: use a pivot table

t1i_movielens_mean_ratings = movielens_df.pivot_table('Rating', index ='Title', columns='Gender',aggfunc='mean')
display(t1i_movielens_mean_ratings[:10])
Gender F M
Title
'Til There Was You (1997) 2.200000 2.500000
1-900 (1994) 1.000000 3.000000
101 Dalmatians (1996) 3.116279 2.772727
12 Angry Men (1957) 4.269231 4.363636
187 (1997) 3.500000 2.870968
2 Days in the Valley (1996) 3.235294 3.223684
20,000 Leagues Under the Sea (1954) 3.214286 3.568966
2001: A Space Odyssey (1968) 3.491228 4.103960
3 Ninjas: High Noon At Mega Mountain (1998) 1.000000 1.000000
39 Steps, The (1935) 4.000000 4.060000
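As a quick illustration of what this pivot table enables (optional, not graded), the sketch below ranks movies by the gap between female and male average ratings; older pandas sorts a Series with order, newer versions call it sort_values.

#Optional: movies where female and male average ratings differ the most
t1i_gender_diff = t1i_movielens_mean_ratings['F'] - t1i_movielens_mean_ratings['M']
display(t1i_gender_diff.order(ascending=False)[:5])   #use sort_values in newer pandas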

Part 2: Nearest Neighbor based Recommender System

Let us now build a simple global recommendation system based on the nearest neighbor idea.

In [6]:
#Task t2a: 
# Create a dictionary where key is Movie Name and value is id
#       You can either use the movies_df or read and parse the u.item file yourself
movie_name_to_id_dictionary = {}

#Write code to populate the movie names to this array
all_movie_names = []

f = open("ml-100k/u.item", "r")
lines = f.readlines()
for line in lines:
    data = line.split("|")
    movie_name, movie_id = data[1], int(data[0])
    movie_name_to_id_dictionary[movie_name] = movie_id
    all_movie_names.append(movie_name)
f.close()
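Since the task allows using movies_df instead of parsing u.item by hand, an equivalent construction (a sketch) is:

#Equivalent construction using movies_df instead of parsing u.item manually
movie_name_to_id_dictionary = dict(zip(movies_df['Title'], movies_df['MovieId']))
all_movie_names = list(movies_df['Title'])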
In [9]:
#Task t2b: Write a function that takes two inputs: 
#  movie_id: id of the movie and common_users: a set of user ids
# and returns the list of rows corresponding to their movie ratings 

def get_movie_reviews(movie_id, common_users):
    #Get a boolean mask over movielens_df for the ratings provided by users in common_users for movie movie_id
    # Hint: use the isin operator of Pandas
    mask = (movielens_df['UserId'].isin(common_users)) & (movielens_df['MovieId'] == movie_id)
    
    #Create a subset of data where the mask is True i.e. only collect data from users who satisfy the condition above
    # Then sort them based on userid
    movie_ratings = movielens_df[mask].sort('UserId')
    
    #Do not change below
    #Return the unique set of ratings provided
    movie_ratings = movie_ratings[movie_ratings['UserId'].duplicated()==False]
    return movie_ratings
    
In [10]:
#Do not change below

#Here are some sample test cases for evaluating t2b
print "get_movie_reviews(1, set([1]))"
display( get_movie_reviews(1, set([1])) )

print "get_movie_reviews(1, set(range(1, 10)))"
display( get_movie_reviews(1, set(range(1, 10))) )

print "get_movie_reviews(100, set(range(1, 10)))"
display( get_movie_reviews(100, set(range(1, 10))) )

print "get_movie_reviews(784, set(range(1, 784)))"
display( get_movie_reviews(784, set(range(1, 784))) )
get_movie_reviews(1, set([1]))
MovieId Title UserId Rating Age Gender Occupation ZipCode
20103 1 Toy Story (1995) 1 5 24 M technician 85711
get_movie_reviews(1, set(range(1, 10)))
MovieId Title UserId Rating Age Gender Occupation ZipCode
20103 1 Toy Story (1995) 1 5 24 M technician 85711
15002 1 Toy Story (1995) 2 4 53 F other 94043
820 1 Toy Story (1995) 5 4 33 F other 15213
35865 1 Toy Story (1995) 6 4 42 M executive 98101
get_movie_reviews(100, set(range(1, 10)))
MovieId Title UserId Rating Age Gender Occupation ZipCode
20202 100 Fargo (1996) 1 5 24 M technician 85711
15009 100 Fargo (1996) 2 5 53 F other 94043
843 100 Fargo (1996) 5 5 33 F other 15213
35894 100 Fargo (1996) 6 5 42 M executive 98101
74830 100 Fargo (1996) 7 5 57 M administrator 91344
get_movie_reviews(784, set(range(1, 784)))
MovieId Title UserId Rating Age Gender Occupation ZipCode
18103 784 Beyond Bedlam (1993) 13 1 47 M educator 29206
67420 784 Beyond Bedlam (1993) 405 1 22 F healthcare 10019
In [6]:
#Task t2c: Let us now calculate the similarity between two movies
# based on the set of users who rated both movies
#Using euclidean distance is a bad idea - but simplifies the code

def calculate_similarity(movie_name_1, movie_name_2, min_common_users=0):
    
    movie1 = movie_name_to_id_dictionary[movie_name_1]
    movie2 = movie_name_to_id_dictionary[movie_name_2]
    
    #This is the set of UNIQUE user ids  who reviewed  movie1
    users_who_rated_movie1 = movielens_df[movielens_df['MovieId'] == movie1]['UserId'].unique()
    
    #This is the set of UNIQUE user ids  who reviewed  movie2
    users_who_rated_movie2 = movielens_df[movielens_df['MovieId'] == movie2]['UserId'].unique()
    
    #Compute the common users who rated both movies: 
    # hint convert both to set and do the intersection
    common_users = set(users_who_rated_movie1).intersection(users_who_rated_movie2)
    
    #Using the code you wrote in t2a, get the reviews for the movies and common users
    movie1_reviews = get_movie_reviews(movie1, common_users)
    movie2_reviews = get_movie_reviews(movie2, common_users)
    
    #Now you have the data frame for both movies
    # Use the euclidean_distances function from sklearn (imported already)
    # to compute the distance between their rating values
    distance = euclidean_distances(movie1_reviews['Rating'], movie2_reviews['Rating'])

    if len(common_users) < min_common_users:
        return [[float('inf')]]
    return distance
In [7]:
#Do not change below
print calculate_similarity("Toy Story (1995)", "GoldenEye (1995)")
print calculate_similarity("GoldenEye (1995)", "Tomorrow Never Dies (1997)")
print calculate_similarity("Batman Forever (1995)", "Batman & Robin (1997)")
[[ 13.60147051]]
[[ 6.244998]]
[[ 4.12310563]]
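The comment in t2c notes that Euclidean distance is a crude choice. A common alternative is Pearson correlation over the common ratings, which is less sensitive to each user's rating scale. The sketch below is optional and reuses the same helpers; it is not the required implementation:

#Optional sketch: Pearson-correlation based similarity between two movies
#Reuses movie_name_to_id_dictionary and get_movie_reviews from above
def calculate_similarity_pearson(movie_name_1, movie_name_2, min_common_users=0):
    movie1 = movie_name_to_id_dictionary[movie_name_1]
    movie2 = movie_name_to_id_dictionary[movie_name_2]
    users1 = movielens_df[movielens_df['MovieId'] == movie1]['UserId'].unique()
    users2 = movielens_df[movielens_df['MovieId'] == movie2]['UserId'].unique()
    common_users = set(users1).intersection(users2)
    #Correlation needs at least 2 common users to be defined
    if len(common_users) < max(min_common_users, 2):
        return 0.0
    ratings1 = get_movie_reviews(movie1, common_users)['Rating'].values
    ratings2 = get_movie_reviews(movie2, common_users)['Rating'].values
    corr = np.corrcoef(ratings1, ratings2)[0][1]
    #Constant rating vectors give nan - treat that as "no information"
    return 0.0 if np.isnan(corr) else corr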
In [8]:
#Task t2d: Given a movie, find the top-k most similar movies
# that have the lowest euclidean distance 

#Here is the high level logic:
#  for each movie in all_movie_names (Except the input movie name)
#    compute its similarity and store it in an array
#   return the k movies with the smallest distances
# remember to pass min_common_users to calculate_similarity
def get_top_k_similar_movies(input_movie_name, k=5, min_common_users=0):
    movie_similarity = [] 
    for movie_name in all_movie_names:
        if input_movie_name == movie_name:
            continue
        similarity = calculate_similarity(input_movie_name, movie_name, min_common_users)
        similarity = similarity[0][0]
        movie_similarity.append( (similarity, movie_name) )
    return sorted(movie_similarity)[:k]

        
In [63]:
#print get_top_k_similar_movies("Toy Story (1995)", 10)
print "\nMovies similar to GoldenEye [25]", get_top_k_similar_movies("GoldenEye (1995)", 10, 25)
print "\nMovies similar to GoldenEye [50]", get_top_k_similar_movies("GoldenEye (1995)", 10, 50)
print "\nMovies similar to GoldenEye [100]", get_top_k_similar_movies("GoldenEye (1995)", 10, 100)
print "\n\n"

print "\nMovies similar to Usual Suspects [25]", get_top_k_similar_movies("Usual Suspects, The (1995)", 10, 25)
print "\nMovies similar to Usual Suspects [50]", get_top_k_similar_movies("Usual Suspects, The (1995)", 10, 50)
print "\nMovies similar to Usual Suspects [100]", get_top_k_similar_movies("Usual Suspects, The (1995)", 10, 100)
print "\n\n"

print "\nMovies similar to Batman Forever [25]", get_top_k_similar_movies("Batman Forever (1995)", 10, 25)
print "\nMovies similar to Batman Forever [50]", get_top_k_similar_movies("Batman Forever (1995)", 10, 50)
print "\nMovies similar to Batman Forever [100]", get_top_k_similar_movies("Batman Forever (1995)", 10, 100)
print "\n\n"

print "\nMovies similar to Shawshank Redemption [25]", get_top_k_similar_movies("Shawshank Redemption, The (1994)", 10, 25)
print "\nMovies similar to Shawshank Redemption [50]", get_top_k_similar_movies("Shawshank Redemption, The (1994)", 10, 50)
print "\nMovies similar to Shawshank Redemption [100]", get_top_k_similar_movies("Shawshank Redemption, The (1994)", 10, 100)
print "\n\n"
 
Movies similar to GoldenEye [25] [(4.2426406871192848, 'Program, The (1993)'), (4.6904157598234297, 'Up Close and Personal (1996)'), (4.7958315233127191, 'Quick and the Dead, The (1995)'), (4.8989794855663558, 'Murder at 1600 (1997)'), (5.0990195135927845, 'Down Periscope (1996)'), (5.0990195135927845, "My Best Friend's Wedding (1997)"), (5.0990195135927845, 'Nick of Time (1995)'), (5.2915026221291814, 'Devil in a Blue Dress (1995)'), (5.2915026221291814, 'Donnie Brasco (1997)'), (5.2915026221291814, 'Michael (1996)')]

Movies similar to GoldenEye [50] [(7.2801098892805181, 'Star Trek: Generations (1994)'), (7.5498344352707498, 'Basic Instinct (1992)'), (7.6811457478686078, 'Jumanji (1995)'), (7.6811457478686078, 'Phenomenon (1996)'), (7.9372539331937721, 'Courage Under Fire (1996)'), (8.0, 'Outbreak (1995)'), (8.1240384046359608, 'Firm, The (1993)'), (8.2462112512353212, 'Abyss, The (1989)'), (8.2462112512353212, 'Cliffhanger (1993)'), (8.2462112512353212, 'Maverick (1994)')]

Movies similar to GoldenEye [100] [(9.8488578017961039, 'Mission: Impossible (1996)'), (10.770329614269007, 'True Lies (1994)'), (11.532562594670797, 'Speed (1994)'), (12.165525060596439, 'Independence Day (ID4) (1996)'), (12.206555615733702, 'Top Gun (1986)'), (12.369316876852981, 'Rock, The (1996)'), (12.569805089976535, 'Batman (1989)'), (13.114877048604001, 'Star Trek: The Wrath of Khan (1982)'), (13.601470508735444, 'Toy Story (1995)'), (13.74772708486752, 'Blade Runner (1982)')]




Movies similar to Usual Suspects [25] [(5.196152422706632, 'Candidate, The (1972)'), (5.196152422706632, 'To Catch a Thief (1955)'), (5.4772255750516612, 'Cat on a Hot Tin Roof (1958)'), (5.4772255750516612, 'Roman Holiday (1953)'), (5.5677643628300215, "Sophie's Choice (1982)"), (5.5677643628300215, 'Wallace & Gromit: The Best of Aardman Animation (1996)'), (5.6568542494923806, 'His Girl Friday (1940)'), (5.7445626465380286, 'Die xue shuang xiong (Killer, The) (1989)'), (5.7445626465380286, 'Some Folks Call It a Sling Blade (1993)'), (5.9160797830996161, 'Breaking the Waves (1996)')]

Movies similar to Usual Suspects [50] [(6.5574385243020004, 'Close Shave, A (1995)'), (7.2801098892805181, 'Apt Pupil (1998)'), (7.2801098892805181, 'Philadelphia Story, The (1940)'), (7.6811457478686078, 'Good Will Hunting (1997)'), (7.810249675906654, 'Wrong Trousers, The (1993)'), (8.1240384046359608, 'Manchurian Candidate, The (1962)'), (8.1240384046359608, 'Six Degrees of Separation (1993)'), (8.3066238629180749, 'Maltese Falcon, The (1941)'), (8.6023252670426267, 'Sling Blade (1996)'), (8.7749643873921226, 'Secrets & Lies (1996)')]

Movies similar to Usual Suspects [100] [(9.3273790530888157, 'L.A. Confidential (1997)'), (9.4868329805051381, 'Rear Window (1954)'), (9.9498743710661994, 'Vertigo (1958)'), (10.04987562112089, 'North by Northwest (1959)'), (10.440306508910551, 'To Kill a Mockingbird (1962)'), (10.63014581273465, 'Glory (1989)'), (11.661903789690601, 'Godfather: Part II, The (1974)'), (11.704699910719626, 'Titanic (1997)'), (11.789826122551595, 'M*A*S*H (1970)'), (11.832159566199232, 'Casablanca (1942)')]




Movies similar to Batman Forever [25] [(4.1231056256176606, 'Batman & Robin (1997)'), (4.6904157598234297, 'Down Periscope (1996)'), (4.6904157598234297, 'Junior (1994)'), (4.7958315233127191, 'Jack (1996)'), (5.0, 'Murder at 1600 (1997)'), (5.0990195135927845, 'Chain Reaction (1996)'), (5.196152422706632, "City Slickers II: The Legend of Curly's Gold (1994)"), (5.3851648071345037, 'Drop Zone (1994)'), (5.5677643628300215, 'Angels in the Outfield (1994)'), (5.6568542494923806, 'Casper (1995)')]

Movies similar to Batman Forever [50] [(7.3484692283495345, 'Addams Family Values (1993)'), (7.9372539331937721, 'Net, The (1995)'), (8.1853527718724504, 'First Knight (1995)'), (8.3666002653407556, 'Dragonheart (1996)'), (8.9442719099991592, 'Eraser (1996)'), (9.1104335791442992, 'Broken Arrow (1996)'), (9.1651513899116797, 'Outbreak (1995)'), (9.1651513899116797, 'Star Trek: Generations (1994)'), (9.1651513899116797, 'Young Guns (1988)'), (9.2195444572928871, 'Interview with the Vampire (1994)')]

Movies similar to Batman Forever [100] [(15.491933384829668, 'Batman (1989)'), (16.30950643030009, 'Jurassic Park (1993)'), (17.146428199482248, 'Indiana Jones and the Last Crusade (1989)'), (18.055470085267789, 'Fugitive, The (1993)'), (20.46948949045872, 'Return of the Jedi (1983)'), (21.283796653792763, 'Empire Strikes Back, The (1980)'), (22.06807649071391, 'Raiders of the Lost Ark (1981)'), (22.693611435820433, 'Star Wars (1977)'), (inf, "'Til There Was You (1997)"), (inf, '1-900 (1994)')]




Movies similar to Shawshank Redemption [25] [(4.0, 'Some Folks Call It a Sling Blade (1993)'), (5.4772255750516612, 'Roman Holiday (1953)'), (5.5677643628300215, 'To Catch a Thief (1955)'), (5.6568542494923806, 'Looking for Richard (1996)'), (5.9160797830996161, 'As Good As It Gets (1997)'), (5.9160797830996161, 'Thin Blue Line, The (1988)'), (6.2449979983983983, 'Amistad (1997)'), (6.324555320336759, 'Candidate, The (1972)'), (6.4031242374328485, 'M (1931)'), (6.4031242374328485, 'Manon of the Spring (Manon des sources) (1986)')]

Movies similar to Shawshank Redemption [50] [(6.7823299831252681, 'Close Shave, A (1995)'), (7.3484692283495345, 'Good Will Hunting (1997)'), (7.6157731058639087, 'High Noon (1952)'), (8.1240384046359608, 'Treasure of the Sierra Madre, The (1948)'), (8.2462112512353212, 'Mr. Smith Goes to Washington (1939)'), (8.3666002653407556, 'Hamlet (1996)'), (8.426149773176359, 'Manchurian Candidate, The (1962)'), (8.8317608663278477, 'Secrets & Lies (1996)'), (8.8881944173155887, '12 Angry Men (1957)'), (8.8881944173155887, 'Cinema Paradiso (1988)')]

Movies similar to Shawshank Redemption [100] [(9.2736184954957039, 'Glory (1989)'), (9.9498743710661994, 'North by Northwest (1959)'), (10.344080432788601, 'Rear Window (1954)'), (10.440306508910551, 'Cool Hand Luke (1967)'), (10.723805294763608, 'Searching for Bobby Fischer (1993)'), (10.770329614269007, 'To Kill a Mockingbird (1962)'), (10.816653826391969, 'Vertigo (1958)'), (10.908712114635714, 'Bridge on the River Kwai, The (1957)'), (11.090536506409418, 'Right Stuff, The (1983)'), (11.269427669584644, "It's a Wonderful Life (1946)")]



Task 3: Item based Collaborative Filtering

In this task, let us try to perform item based collaborative filtering. In spirit, it is very similar to what you already did for Task 2. With some minor changes, you can easily build an item based collaborative filtering recommender.

In [3]:
#Do not change below
# By default euclidean distance can give arbitrarily large values
# Let us "normalize" it by limiting its value between 0 and 1 and slightly changing the interpretation
# 0 means that the preferences are very different
# 1 means that preferences are identical
# For tasks 3 and 4, remember to use this function

#Vec1 and vec2 are vectors
def euclidean_distance_normed(vec1, vec2):
    if len(vec1) == 0:
        return 0.0
    euc_distance = euclidean_distances(vec1, vec2)[0][0]
    return 1.0 / (1.0 + euc_distance)
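A couple of quick sanity checks on this normalization (these rely on the same 1-D call pattern already used in t2c): identical rating vectors map to 1.0, and larger distances shrink toward 0.

#Optional sanity checks for euclidean_distance_normed
print euclidean_distance_normed(np.array([5, 3, 4]), np.array([5, 3, 4]))   #identical ratings -> 1.0
print euclidean_distance_normed(np.array([5, 5, 5]), np.array([1, 1, 1]))   #distance ~6.93 -> ~0.13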
In [7]:
#Task t3a:
# In this task, you want to compute the similarity between two items,
#  which in this case are movies (compared via their ratings)
# You can use code from 2c except that you must now call euclidean_distance_normed
#  when computing the distance

def calculate_similarity_normed(movie_name_1, movie_name_2, min_common_users=0):
    movie1 = movie_name_to_id_dictionary[movie_name_1]
    movie2 = movie_name_to_id_dictionary[movie_name_2]

    #This is the set of UNIQUE user ids  who reviewed  movie1
    users_who_rated_movie1 = movielens_df[movielens_df['MovieId'] == movie1]['UserId'].unique()
    
    #This is the set of UNIQUE user ids  who reviewed  movie2
    users_who_rated_movie2 = movielens_df[movielens_df['MovieId'] == movie2]['UserId'].unique()
     
    #Compute the common users who rated both movies: 
    # hint convert both to set and do the intersection
    common_users = set(users_who_rated_movie1).intersection(users_who_rated_movie2)

    #Using the code you wrote in t2a, get the reviews for the movies and common users
    movie1_reviews = get_movie_reviews(movie1, common_users)
    movie2_reviews = get_movie_reviews(movie2, common_users)
    
    #Do not change below
    
    #Now you have the data frame for both movies
    # Use the euclidean_distance_normed function defined above
    # to compute the normalized distance between their rating values
    distance = euclidean_distance_normed(movie1_reviews['Rating'].values, movie2_reviews['Rating'].values)

    if len(common_users) < min_common_users:
        return 0.0
    return distance
In [11]:
#Do not change below
print calculate_similarity_normed("Toy Story (1995)", "GoldenEye (1995)")
print calculate_similarity_normed("GoldenEye (1995)", "Tomorrow Never Dies (1997)")
print calculate_similarity_normed("Batman Forever (1995)", "Batman & Robin (1997)")
0.0684862527649
0.138026263116
0.195194101601
In [12]:
#Do not change below

#We are now going to create an item-item similarity database
# Since our data is "small", we will use a non-traditional approach of using nested hashes
# In real-life, using something like databases or other data structures is far more preferable

#Here is the high level structure
#{
#    movie_name_1:  
#    { 
#        movie_name_2: similarity_between_movie_1_and_2, 
#        movie_name_3: similarity_between_movie_1_and_3, 
#        ....
#        movie_name_n: similarity_between_movie_1_and_n
#    },
#    movie_name_2:
#    {
#        movie_name_1: similarity_between_movie_2_and_1, 
#        movie_name_3: similarity_between_movie_2_and_3, 
#        ....
#        movie_name_n: similarity_between_movie_2_and_n
#    },
#    ....
#    movie_name_n:
#    {
#        movie_name_1: similarity_between_movie_n_and_1, 
#        movie_name_2: similarity_between_movie_n_and_2, 
#        ....
#        movie_name_n-1: similarity_between_movie_n_and_n-1
#    },
#}    
    

#Here is how to use this data structure:

#To get similarity between movies
#   data[movie1][movie2]
#To get similarity between one movie and all others
#   data[movie1]
In [17]:
#DO not change below
#This hash stores the movie-to-movie similarity
# as described above
movie_similarity_hash = defaultdict(dict)
In [18]:
#Item based filtering is expensive as you need to compute similarity of all pairs of items
# for this dataset it is 1682*1682 ~ 28 lakh pairs or 2.8 million
# running all of them might take hours, possibly close to a day
# instead let us run on a smaller dataset
# specifically, let us only focus on the top-250 movies based on ratings
# which is more manageable


#Task t3b: 
# Get the top-k movie names with most ratings
#Hint: use Counter class

def top_k_movie_names(k):
    movie_ratings_counter = Counter()
    
    with open("ml-100k/u.data","r") as f:
        for line in f:
            user_id, movie_id, rating, timestamp = line.split("\t")
            movie_name = all_movie_names[ int(movie_id) - 1]
            movie_ratings_counter[movie_name] += 1
    
    return movie_ratings_counter.most_common(k)
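An equivalent way to get the same counts with pandas (consistent with what t1a did) is sketched below:

#Optional: pandas version of top_k_movie_names; produces equivalent (title, count) pairs
def top_k_movie_names_pandas(k):
    counts = movielens_df.groupby('Title').size().order(ascending=False)[:k]   #sort_values in newer pandas
    return list(counts.iteritems())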
In [19]:
#Do not change below
print "Top-10", top_k_movie_names(10), "\n"
print "Top-25", top_k_movie_names(25), "\n"
Top-10 [('Star Wars (1977)', 583), ('Contact (1997)', 509), ('Fargo (1996)', 508), ('Return of the Jedi (1983)', 507), ('Liar Liar (1997)', 485), ('English Patient, The (1996)', 481), ('Scream (1996)', 478), ('Toy Story (1995)', 452), ('Air Force One (1997)', 431), ('Independence Day (ID4) (1996)', 429)] 

Top-25 [('Star Wars (1977)', 583), ('Contact (1997)', 509), ('Fargo (1996)', 508), ('Return of the Jedi (1983)', 507), ('Liar Liar (1997)', 485), ('English Patient, The (1996)', 481), ('Scream (1996)', 478), ('Toy Story (1995)', 452), ('Air Force One (1997)', 431), ('Independence Day (ID4) (1996)', 429), ('Raiders of the Lost Ark (1981)', 420), ('Godfather, The (1972)', 413), ('Pulp Fiction (1994)', 394), ('Twelve Monkeys (1995)', 392), ('Silence of the Lambs, The (1991)', 390), ('Jerry Maguire (1996)', 384), ('Chasing Amy (1997)', 379), ('Rock, The (1996)', 378), ('Empire Strikes Back, The (1980)', 367), ('Star Trek: First Contact (1996)', 365), ('Back to the Future (1985)', 350), ('Titanic (1997)', 350), ('Mission: Impossible (1996)', 344), ('Fugitive, The (1993)', 336), ('Indiana Jones and the Last Crusade (1989)', 331)] 

In [20]:
#Do not change below
top_250_movie_names = [item[0] for item in top_k_movie_names(250)]
In [21]:
#Task t3c:
#Use the following logic
#  for each movie in movie_names:
#    for all other movies in movie_names:
#      compute similarity between  two movies using calculate_similarity_normed
#      remember to pass min_common_users to that function
#  note that movie_similarity_hash is a defaultdict 
#  so similarity between movie1 and movie2 can be set as movie_similarity_hash[movie1][movie2]
#  btw, similarity in our case is commutative. 
#   i.e. similarity(movie1, movie2) = similarity(movie2, movie1)
#   so do not call the function twice !
# movie_names is an array that lists the movies for which you have to compute pairwise similarity
def compute_movie_to_movie_similarity(movie_names, min_common_users=0):
    for movie_name_1 in movie_names:
        for movie_name_2 in movie_names:
            if movie_name_1 == movie_name_2:
                continue
            similarity = calculate_similarity_normed(movie_name_1, movie_name_2, min_common_users)
            movie_similarity_hash[movie_name_1][movie_name_2] = similarity
            movie_similarity_hash[movie_name_2][movie_name_1] = similarity    
            
In [24]:
#Do not change below

#Let us first test if your code above is correct by testing against a small subset of data
movie_similarity_hash = defaultdict(dict)
# let use the top-10 movies
compute_movie_to_movie_similarity(top_250_movie_names[:10], min_common_users=0)

#Get the similarity lists for a couple of movies
display(movie_similarity_hash["Toy Story (1995)"])
display(movie_similarity_hash['Return of the Jedi (1983)'])

print movie_similarity_hash["Toy Story (1995)"]["Independence Day (ID4) (1996)"]
{'Air Force One (1997)': 0.05697049581409102,
 'Contact (1997)': 0.046577726053728993,
 'English Patient, The (1996)': 0.047393923678019305,
 'Fargo (1996)': 0.039703378843999403,
 'Independence Day (ID4) (1996)': 0.040932023773211611,
 'Liar Liar (1997)': 0.045702558661657698,
 'Return of the Jedi (1983)': 0.046005690800297719,
 'Scream (1996)': 0.048137743418679969,
 'Star Wars (1977)': 0.040475114406879495}
{'Air Force One (1997)': 0.047007646177349489,
 'Contact (1997)': 0.045210588948186241,
 'English Patient, The (1996)': 0.04318065266563334,
 'Fargo (1996)': 0.038852101069895194,
 'Independence Day (ID4) (1996)': 0.038974721601113524,
 'Liar Liar (1997)': 0.037548797615781393,
 'Scream (1996)': 0.041895006229997207,
 'Star Wars (1977)': 0.052554803633703538,
 'Toy Story (1995)': 0.046005690800297719}
0.0409320237732
In [25]:
#Do not change below
#Let us now test against top-250 most popular movies
#This might take 10-20 mins to run!
movie_similarity_hash = defaultdict(dict)
compute_movie_to_movie_similarity(top_250_movie_names, min_common_users=25)
In [47]:
#Do not change below
#Do this if you want to persist the data 

# Let us persist the movie-movie similarity data structure 
# that way you don't need to re-run the whole thing
#pickle is a serialization library in Python
# To persist/serialize, use the following line
#pickle.dump(movie_similarity_hash, open("movie_similarity.pickle", "wb"))
# To deserialize, uncomment the following line 
#movie_similarity_hash = pickle.load( open( "movie_similarity.pickle", "rb" ) )

In [48]:
for movie_name in top_250_movie_names[:10]:
    print "Top-10 most similar movies for ", movie_name, " :", 
    print sorted(movie_similarity_hash[movie_name].items(), key=operator.itemgetter(1), reverse=True)[:10]
    print "\n"
Top-10 most similar movies for  Star Wars (1977)  : [('Fly Away Home (1996)', 0.12613198362288319), ("Ulee's Gold (1997)", 0.084626326089585924), ('Good Will Hunting (1997)', 0.082402672566341831), ('Apt Pupil (1998)', 0.080920442903129275), ('Gattaca (1997)', 0.080069658725771431), ('Wag the Dog (1997)', 0.07979214086871815), ('Sling Blade (1996)', 0.079245895810613809), ('African Queen, The (1951)', 0.07871102875529136), ('Rainmaker, The (1997)', 0.07844773813482285), ('Secrets & Lies (1996)', 0.075254704849630052)]


Top-10 most similar movies for  Contact (1997)  : [('Philadelphia (1993)', 0.10230216299200159), ('Sling Blade (1996)', 0.094882133490771736), ('Fried Green Tomatoes (1991)', 0.093051003668180476), ('Room with a View, A (1986)', 0.092610094432975923), ("Singin' in the Rain (1952)", 0.092175602102042759), ('Shine (1996)', 0.092175602102042759), ('Maltese Falcon, The (1941)', 0.092175602102042759), ('Bound (1996)', 0.091747370480532636), ('Sneakers (1992)', 0.091747370480532636), ('Mask, The (1994)', 0.089695015344041368)]


Top-10 most similar movies for  Fargo (1996)  : [('Manchurian Candidate, The (1962)', 0.099449199236264399), ("Ulee's Gold (1997)", 0.08815170219611887), ('Maltese Falcon, The (1941)', 0.087410245452875457), ('Good Will Hunting (1997)', 0.085983444756559363), ('Sling Blade (1996)', 0.084297269155557394), ('Rainmaker, The (1997)', 0.081799777282574593), ('African Queen, The (1951)', 0.081210303141612275), ('Wag the Dog (1997)', 0.080920442903129275), ('Rear Window (1954)', 0.07844773813482285), ('Vertigo (1958)', 0.07844773813482285)]


Top-10 most similar movies for  Return of the Jedi (1983)  : [('Fly Away Home (1996)', 0.12849622184722817), ("Ulee's Gold (1997)", 0.10676232268609791), ('Gattaca (1997)', 0.093952725662966904), ('Wag the Dog (1997)', 0.091747370480532636), ('Rainmaker, The (1997)', 0.090094108300614636), ('Seven Years in Tibet (1997)', 0.089301349778500683), ('Shine (1996)', 0.08704668331836253), ('Apt Pupil (1998)', 0.086687761389570364), ('Kiss the Girls (1997)', 0.084626326089585924), ('Sling Blade (1996)', 0.083019512538737683)]


Top-10 most similar movies for  Liar Liar (1997)  : [('Mask, The (1994)', 0.096331396777550107), ('While You Were Sleeping (1995)', 0.091325248684348978), ('Ghost and the Darkness, The (1996)', 0.090498756211208911), ('GoldenEye (1995)', 0.090094108300614636), ('Maverick (1994)', 0.08815170219611887), ('Con Air (1997)', 0.083972136564709449), ('River Wild, The (1994)', 0.083333333333333329), ('Cape Fear (1991)', 0.083019512538737683), ('Home Alone (1990)', 0.082709315626306693), ('Pretty Woman (1990)', 0.081503394203052748)]


Top-10 most similar movies for  English Patient, The (1996)  : [('Fried Green Tomatoes (1991)', 0.094413879633246586), ('Maverick (1994)', 0.092175602102042759), ('Room with a View, A (1986)', 0.090498756211208911), ('Cinderella (1950)', 0.090094108300614636), ('Right Stuff, The (1983)', 0.088529810866542852), ('Firm, The (1993)', 0.08815170219611887), ('Sneakers (1992)', 0.087778549957133301), ('Mask, The (1994)', 0.086333380578904162), ('Cape Fear (1991)', 0.084626326089585924), ("Monty Python's Life of Brian (1979)", 0.084626326089585924)]


Top-10 most similar movies for  Scream (1996)  : [('Maverick (1994)', 0.10113069765789216), ("Singin' in the Rain (1952)", 0.094413879633246586), ('Cinderella (1950)', 0.088529810866542852), ('Room with a View, A (1986)', 0.088529810866542852), ('Bound (1996)', 0.084297269155557394), ('While You Were Sleeping (1995)', 0.083972136564709449), ('Philadelphia (1993)', 0.082709315626306693), ('Shine (1996)', 0.082099515221765715), ('Clueless (1995)', 0.081799777282574593), ('Crimson Tide (1995)', 0.080920442903129275)]


Top-10 most similar movies for  Toy Story (1995)  : [("Ulee's Gold (1997)", 0.11696132920126338), ('Kiss the Girls (1997)', 0.10815240673485554), ('Rainmaker, The (1997)', 0.1060878539025194), ('Gattaca (1997)', 0.10290397182775128), ('Wag the Dog (1997)', 0.10230216299200159), ('Shine (1996)', 0.10171118008217982), ('Seven Years in Tibet (1997)', 0.098907726574930466), ('Cool Hand Luke (1967)', 0.0973366881823024), ('Peacemaker, The (1997)', 0.095840694682461411), ('Manchurian Candidate, The (1962)', 0.093051003668180476)]


Top-10 most similar movies for  Air Force One (1997)  : [('Cinderella (1950)', 0.11519216806670013), ('While You Were Sleeping (1995)', 0.11189119247086728), ('Firm, The (1993)', 0.11034777731716484), ('Sneakers (1992)', 0.10960059084055324), ('Maverick (1994)', 0.10230216299200159), ('Crimson Tide (1995)', 0.10230216299200159), ('Bound (1996)', 0.10171118008217982), ('GoldenEye (1995)', 0.10113069765789216), ('Hunt for Red October, The (1990)', 0.10056040392403998), ('Mask, The (1994)', 0.10000000000000001)]


Top-10 most similar movies for  Independence Day (ID4) (1996)  : [('Rainmaker, The (1997)', 0.11034777731716484), ('Kiss the Girls (1997)', 0.10745035092526581), ('Gattaca (1997)', 0.10477782979607683), ("Ulee's Gold (1997)", 0.10171118008217982), ('Peacemaker, The (1997)', 0.098907726574930466), ('Seven Years in Tibet (1997)', 0.093952725662966904), ('Wag the Dog (1997)', 0.091747370480532636), ('Cop Land (1997)', 0.089301349778500683), ('G.I. Jane (1997)', 0.087778549957133301), ('Maverick (1994)', 0.086333380578904162)]


In [74]:
#Task t3d

#Before doing t3d, please complete t4a so that user_rating_hash is available
# this will make the code below easier

#In this task, we are going to predict the rating of a user u for a movie m using item based collaborative filtering
#Here is the high level logic:
# for each item i rated by this user:
#    s = similarity between i and input movie m 
#    if similarity between i and m is 0, ignore this item 
#    compute weighted rating for m based on i as rating for i * s
# compute the predicted rating as sum of all weighted ratings / sum of all similarities

def predict_rating_for_movie_icf(movie_similarity_hash, input_user_id, input_movie_name, movies_considered):
    total_weighted_rating = 0.0
    total_similarity= 0.0
        
    #Hint: user_rating_hash is a nested hash where user id is the key and
    #  the value is a dictionary of all their ratings,
    #  keyed as moviename: rating

    #if this user has already rated the movie, return that rating
    if input_movie_name in user_rating_hash[input_user_id]:
        return user_rating_hash[input_user_id][input_movie_name]
    
    #For each movie the user has rated
    for movie_name in user_rating_hash[input_user_id].keys():

        #if the user rated some movie but it is not in the subset of movies for which
        # we computed pairwise similarity (such as the top-250), then do not consider it either
        # for this task, that subset is given in movies_considered
        if movie_name not in movies_considered:
            continue
        
        #look up the similarity between the two movies
        #don't recompute it - use the hash
        similarity = movie_similarity_hash[input_movie_name][movie_name]
        
        #Reject item if similarity is 0
        if similarity <= 0.0:
            continue
                     
        #Compute weighted rating
        weighted_rating = similarity * user_rating_hash[input_user_id][movie_name]
        
        #update total_weighted_rating and total_similarity
        total_weighted_rating += weighted_rating
        total_similarity += similarity
            
    #Do not change below
    if total_similarity == 0.0:
        return 0.0
    
    return total_weighted_rating / total_similarity
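In other words, the predicted rating of user $u$ for movie $m$ is the similarity-weighted average $\hat{r}_{u,m} = \frac{\sum_{i} s(m,i)\, r_{u,i}}{\sum_{i} s(m,i)}$, where the sum runs over the movies $i$ that $u$ has rated, restricted to the considered subset and to nonzero similarities.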
In [75]:
#Let us compute the rating for the first 5 users for the top-20 movies that they have not seen
for user_id in range(1, 5+1):
    print user_id, [ (movie_name, 
                        round(predict_rating_for_movie_icf(movie_similarity_hash, user_id, movie_name, top_250_movie_names),2))
                       for movie_name in top_250_movie_names[:20] 
                        if movie_name not in user_rating_hash[user_id]]
           
#print movie_name, predict_rating_for_movie_icf(movie_similarity_hash, 1, 'Liar Liar (1997)', min_common_users=25)
1 [('Liar Liar (1997)', 3.83), ('English Patient, The (1996)', 3.87), ('Scream (1996)', 3.87), ('Air Force One (1997)', 3.87)]
2 [('Return of the Jedi (1983)', 3.91), ('Independence Day (ID4) (1996)', 3.85), ('Raiders of the Lost Ark (1981)', 3.92), ('Pulp Fiction (1994)', 3.93), ('Twelve Monkeys (1995)', 3.87), ('Silence of the Lambs, The (1991)', 3.91), ('Chasing Amy (1997)', 3.89), ('Rock, The (1996)', 3.86), ('Empire Strikes Back, The (1980)', 3.91), ('Star Trek: First Contact (1996)', 3.86)]
3 [('Star Wars (1977)', 2.95), ('Fargo (1996)', 2.92), ('English Patient, The (1996)', 2.92), ('Toy Story (1995)', 2.92), ('Independence Day (ID4) (1996)', 2.89), ('Raiders of the Lost Ark (1981)', 2.92), ('Godfather, The (1972)', 2.93), ('Pulp Fiction (1994)', 2.95), ('Twelve Monkeys (1995)', 2.94), ('Silence of the Lambs, The (1991)', 2.92), ('Jerry Maguire (1996)', 2.9), ('Rock, The (1996)', 2.89), ('Empire Strikes Back, The (1980)', 2.95), ('Star Trek: First Contact (1996)', 2.9)]
4 [('Fargo (1996)', 4.21), ('Return of the Jedi (1983)', 4.26), ('English Patient, The (1996)', 4.17), ('Toy Story (1995)', 4.23), ('Independence Day (ID4) (1996)', 4.18), ('Raiders of the Lost Ark (1981)', 4.27), ('Godfather, The (1972)', 4.24), ('Pulp Fiction (1994)', 4.25), ('Twelve Monkeys (1995)', 4.24), ('Silence of the Lambs, The (1991)', 4.27), ('Jerry Maguire (1996)', 4.23), ('Chasing Amy (1997)', 4.19), ('Rock, The (1996)', 4.23), ('Empire Strikes Back, The (1980)', 4.29), ('Star Trek: First Contact (1996)', 4.22)]
5 [('Contact (1997)', 3.35), ('Liar Liar (1997)', 3.25), ('English Patient, The (1996)', 3.31), ('Scream (1996)', 3.3), ('Air Force One (1997)', 3.29), ('Godfather, The (1972)', 3.36), ('Pulp Fiction (1994)', 3.34), ('Twelve Monkeys (1995)', 3.35), ('Jerry Maguire (1996)', 3.32), ('Chasing Amy (1997)', 3.36), ('Rock, The (1996)', 3.32)]
In [77]:
#Task t3e: 
#Here is the pseudocode for recommending movies
# for each movie this user has not rated in movies_considered:
#           predict the rating for this movie and this user using t3d
#  return the top-k movies
def recommend_movies_icf(input_user_id, movies_considered, movie_similarity_hash,
                             user_rating_hash, k=10, min_common_movies=5):
    predicted_ratings = []
    
    #Your code here
    for movie_name in movies_considered:
        if movie_name in user_rating_hash[input_user_id]:
            continue
        
        #Predict the rating for input_user_id and movie_name using t3d
        predicted_rating = predict_rating_for_movie_icf(movie_similarity_hash, 
                                        input_user_id, movie_name, movies_considered)
        #append rating and movie name to predicted_ratings
        predicted_ratings.append( (predicted_rating, movie_name) )
    
    return sorted(predicted_ratings, reverse=True)[:k]
In [81]:
#Do not change below:

#Let us predict top-5 movies for first 10 users
for user_id in range(1,11):
    print user_id, recommend_movies_icf(user_id, top_250_movie_names, movie_similarity_hash, 
                               user_rating_hash, k=10, min_common_movies=5)
 1 [(4.1925848292261962, 'Fly Away Home (1996)'), (4.0983740953356511, "Ulee's Gold (1997)"), (4.0853921793090748, 'Seven Years in Tibet (1997)'), (3.9897313336931295, 'Maltese Falcon, The (1941)'), (3.9879158866803466, 'Manchurian Candidate, The (1962)'), (3.9876557080885249, 'Cop Land (1997)'), (3.9782574914172879, 'Wag the Dog (1997)'), (3.978195449823783, "Singin' in the Rain (1952)"), (3.9776616970615382, 'Secrets & Lies (1996)'), (3.9561068860479827, 'Mother (1996)')]
2 [(4.0808708045508908, 'Cool Hand Luke (1967)'), (4.0758760619946868, 'Maltese Falcon, The (1941)'), (4.0533051054926501, 'Patton (1970)'), (4.0083596465934832, 'Reservoir Dogs (1992)'), (3.9881299279283864, 'Wag the Dog (1997)'), (3.9839921546749677, 'Bound (1996)'), (3.9765105208616371, 'Father of the Bride Part II (1995)'), (3.9734726233103057, 'Abyss, The (1989)'), (3.9617904510489481, 'Rumble in the Bronx (1995)'), (3.9401087977258875, 'Manchurian Candidate, The (1962)')]
3 [(4.0, 'Fly Away Home (1996)'), (3.2010326559372495, 'Chinatown (1974)'), (3.157701069290638, 'Beavis and Butt-head Do America (1996)'), (3.143952685392172, 'Ed Wood (1994)'), (3.1433803387730994, 'Clerks (1994)'), (3.1237868346046724, 'Austin Powers: International Man of Mystery (1997)'), (3.0798009033435956, 'Annie Hall (1977)'), (3.0759377820139293, 'Room with a View, A (1986)'), (3.0697683194380638, 'Postino, Il (1994)'), (3.0661680108458311, 'Mighty Aphrodite (1995)')]
4 [(5.0, 'Fly Away Home (1996)'), (4.4632038663842257, 'Searching for Bobby Fischer (1993)'), (4.4196965100181682, 'Mighty Aphrodite (1995)'), (4.4191100601766378, 'To Kill a Mockingbird (1962)'), (4.4169766165480482, 'North by Northwest (1959)'), (4.4094677480337543, 'Emma (1996)'), (4.4042857143634695, 'Vertigo (1958)'), (4.4034520554502796, 'Postino, Il (1994)'), (4.3967716982383376, 'Secrets & Lies (1996)'), (4.3856413995509103, 'Big Night (1996)')]
5 [(3.7183255738053607, 'Fly Away Home (1996)'), (3.6633254211881798, "Ulee's Gold (1997)"), (3.5372799055697985, 'Cop Land (1997)'), (3.5149683822988829, 'Seven Years in Tibet (1997)'), (3.469581917447591, 'Bound (1996)'), (3.4617867390614938, 'Wag the Dog (1997)'), (3.4527021275095611, 'Rainmaker, The (1997)'), (3.4496495884364671, 'Sling Blade (1996)'), (3.4400533520469319, "Singin' in the Rain (1952)"), (3.4399531973912678, 'Secrets & Lies (1996)')]
6 [(3.6450757118770802, 'Titanic (1997)'), (3.6419674474846175, 'Swingers (1996)'), (3.6346263653247584, 'Apt Pupil (1998)'), (3.6261382018050177, 'Professional, The (1994)'), (3.6238256630349728, 'Nightmare Before Christmas, The (1993)'), (3.6176103367235144, 'Philadelphia (1993)'), (3.6163393236633965, 'Manchurian Candidate, The (1962)'), (3.6158801676523433, 'Everyone Says I Love You (1996)'), (3.6148660366491452, 'Con Air (1997)'), (3.6125458805115862, 'Mystery Science Theater 3000: The Movie (1996)')]
7 [(4.5769327294745636, 'Fly Away Home (1996)'), (4.3569138485911942, 'Boogie Nights (1997)'), (4.3430811631356416, "Ulee's Gold (1997)"), (4.3387217336002255, 'Everyone Says I Love You (1996)'), (4.3370856474197863, 'L.A. Confidential (1997)'), (4.334079880695576, 'Good Will Hunting (1997)'), (4.3302758538211039, 'Lone Star (1996)'), (4.3231589530937846, 'Titanic (1997)'), (4.3159277940669369, 'Bound (1996)'), (4.3140087951876245, 'Big Night (1996)')]
8 [(4.5099993281415864, 'Fly Away Home (1996)'), (4.2346862647980528, 'Big Night (1996)'), (4.232651237329506, 'Secrets & Lies (1996)'), (4.2214676248329042, 'Maltese Falcon, The (1941)'), (4.2194500071542702, 'Manchurian Candidate, The (1962)'), (4.2112900704044289, 'Philadelphia (1993)'), (4.2088630225331416, 'Sling Blade (1996)'), (4.2047351130720045, "Singin' in the Rain (1952)"), (4.1991235849187269, "Ulee's Gold (1997)"), (4.1940617742206738, 'Remains of the Day, The (1993)')]
9 [(5.0, 'Fly Away Home (1996)'), (4.3881344831760405, 'Sneakers (1992)'), (4.384739673791211, 'Cinderella (1950)'), (4.3728633261524967, 'Crow, The (1994)'), (4.3601856063557216, 'Good, The Bad and The Ugly, The (1966)'), (4.3583657116832946, 'GoldenEye (1995)'), (4.3537695463082429, 'Die Hard: With a Vengeance (1995)'), (4.3487405697301451, 'Nightmare Before Christmas, The (1993)'), (4.3485936712624973, "Singin' in the Rain (1952)"), (4.3455946019288847, 'Star Trek VI: The Undiscovered Country (1991)')]
10 [(4.3656740682168014, "Ulee's Gold (1997)"), (4.3633159999804878, 'Good Will Hunting (1997)'), (4.360223994431875, 'Titanic (1997)'), (4.352806826677643, 'Tomorrow Never Dies (1997)'), (4.3500732774417745, 'Apt Pupil (1998)'), (4.3494915136704577, 'Professional, The (1994)'), (4.3491740447223943, 'Chasing Amy (1997)'), (4.34581077138371, 'Good, The Bad and The Ugly, The (1966)'), (4.3448056765651897, "Schindler's List (1993)"), (4.3436282240868342, 'Postino, Il (1994)')]

Task 4: User based Collaborative Filtering

In this task, let us try to perform user based collaborative filtering.

In [32]:
#In order to simplify the coding, let us create a nested hash structure to store the user-rating data
# It will look as follows:
#{
#    u1: {movie_name_1:rating1, movie_name_2:rating2, ....}, 
#    ....
#    un: {movie_name_1:rating1, movie_name_2:rating2, ....}, 
#}

#Of course, we will only store the movies that the user rated
#Use the all_movie_names to convert movie id to movie name
# remember that Movielens uses 1 based indexing
# so the name of movieid i is in all_movie_names[i-1]
In [33]:
#Task t4a
#Create the data structure as discussed above
# here is the logic:
# for each line in file ml-100k/u.data:
#   set user_rating_hash[user][movie] = rating
# read the instructions above again!

def compute_user_rating_hash():
    user_rating_hash = defaultdict(dict)
    
    #Your code below
    with open("ml-100k/u.data","r") as f:
        for line in f:
            user_id, movie_id, rating, timestamp = line.split("\t")
            user_rating_hash[int(user_id)][all_movie_names[int(movie_id)-1]] = int(rating)    
    
    return user_rating_hash
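An equivalent construction straight from the merged data frame (a sketch; it should produce the same nested structure):

#Optional: build the same user -> (movie name -> rating) structure from movielens_df
def compute_user_rating_hash_pandas():
    rating_hash = defaultdict(dict)
    for row in movielens_df[['UserId', 'Title', 'Rating']].itertuples(index=False):
        user_id, title, rating = row
        rating_hash[int(user_id)][title] = int(rating)
    return rating_hash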
In [34]:
#Do not change below
user_rating_hash = compute_user_rating_hash()
In [35]:
#Do not change below
#How many users are there?
print len(user_rating_hash.keys())
#How many movies did each of the first 20 users rate?
print [len(user_rating_hash[i].keys()) for i in range(1,20+1)] 
#print the ratings of user 4
display(user_rating_hash[4])
943
[271, 61, 53, 24, 175, 208, 401, 59, 22, 184, 181, 51, 632, 98, 103, 140, 28, 277, 20, 48]
{'Air Force One (1997)': 5,
 'Assignment, The (1997)': 5,
 'Blues Brothers 2000 (1998)': 5,
 'Client, The (1994)': 3,
 'Conspiracy Theory (1997)': 3,
 'Contact (1997)': 5,
 'Cop Land (1997)': 5,
 'Desperate Measures (1998)': 5,
 'Event Horizon (1997)': 4,
 'In & Out (1997)': 5,
 'Incognito (1997)': 5,
 'Indiana Jones and the Last Crusade (1989)': 3,
 'Liar Liar (1997)': 5,
 'Lost Highway (1997)': 5,
 'Mimic (1997)': 3,
 "One Flew Over the Cuckoo's Nest (1975)": 4,
 'Scream (1996)': 4,
 'Seven (Se7en) (1995)': 4,
 'Spawn (1997)': 2,
 'Star Wars (1977)': 5,
 'Starship Troopers (1997)': 4,
 "Ulee's Gold (1997)": 5,
 'Wedding Singer, The (1998)': 5,
 'Wonderland (1997)': 5}
In [36]:
#Task t4b:
#We need to modify our logic for computing distance
#Here is the high level pseudocode:
# movie1 = movie names rated by user 1
# movie2 = movie names rated by user 2
# common movies = set intersection of movie1 and movie2
# if number of common movies is less than min_common_movies, return 0.0 [not 0]
# otherwise create a vector with the ratings for the common movies only
# compute euclidean distance between the vectors
# return 1 / (1 + euclidean distance)

def compute_user_user_similarity(user_rating_hash, user_id_1, user_id_2, min_common_movies=0):
    #Get list of movie names rated by user 1. hint use keys function [see above for usage]
    movies_rated_user1 = user_rating_hash[user_id_1].keys()
    movies_rated_user2 = user_rating_hash[user_id_2].keys()
    
    #compute common movies
    common_movies = set(movies_rated_user1).intersection( set(movies_rated_user2) )
    if len(common_movies) < min_common_movies:
        return 0.0
    
    common_movies = sorted(list(common_movies))
    
    #vector1 is the set of ratings for user1 for movies in common_movies
    vector1 = [user_rating_hash[user_id_1][movie] for movie in common_movies]
    #vector2 is the set of ratings for user2 for movies in common_movies
    vector2 = [user_rating_hash[user_id_2][movie] for movie in common_movies]
    
    #Compute distance and return 1.0/(1.0+distance)
    distance = euclidean_distances(vector1, vector2)[0][0]
    return 1.0 / ( 1.0 + distance)
In [37]:
#Testing code
print [round(compute_user_user_similarity(user_rating_hash, 1, i),2) for i in range(1, 10+1)]
print [round(compute_user_user_similarity(user_rating_hash, 784, i),2) for i in range(1, 10+1)]
[1.0, 0.16, 0.16, 0.19, 0.08, 0.07, 0.06, 0.16, 0.31, 0.1]
[0.19, 0.19, 0.11, 0.33, 1.0, 0.19, 0.18, 0.5, 0.21, 0.29]
In [38]:
#Task t4c
#This function finds the k-most similar users 
#Here is the high level logic:
#  for each user in all_user_ids other than the input user id:
#     find similarity between this user and input_user_id and store as (similarity, other userid)
#     sort based on similarity
#  return top-k
# remember to pass min_common_movies
def top_k_most_similar_users(user_rating_hash, input_user_id, all_user_ids, k=10, min_common_movies=0):
    user_similarity = []
    
    for user_id in all_user_ids:
        if user_id == input_user_id:
            continue
        similarity = compute_user_user_similarity(user_rating_hash, input_user_id, user_id, min_common_movies)
        user_similarity.append( ( similarity, user_id) )
    return sorted(user_similarity, reverse=True)[:k]
In [39]:
#Do not change below
all_user_ids = range(1, 943+1)
print top_k_most_similar_users(user_rating_hash, 1, all_user_ids, 10, 5)
print top_k_most_similar_users(user_rating_hash, 1, all_user_ids, 10, 10)
print top_k_most_similar_users(user_rating_hash, 812, all_user_ids, 10, 5)
print top_k_most_similar_users(user_rating_hash, 812, all_user_ids, 10, 20)
[(0.41421356237309509, 876), (0.36602540378443865, 105), (0.33333333333333331, 895), (0.33333333333333331, 282), (0.33333333333333331, 107), (0.3090169943749474, 842), (0.3090169943749474, 696), (0.3090169943749474, 520), (0.3090169943749474, 516), (0.3090169943749474, 433)]
[(0.3090169943749474, 516), (0.3090169943749474, 433), (0.3090169943749474, 359), (0.25, 800), (0.25, 691), (0.2402530733520421, 564), (0.2402530733520421, 549), (0.2402530733520421, 46), (0.23166247903553999, 941), (0.22400923773979589, 252)]
[(0.5, 816), (0.5, 768), (0.41421356237309509, 555), (0.41421356237309509, 4), (0.36602540378443865, 534), (0.36602540378443865, 314), (0.36602540378443865, 127), (0.36602540378443865, 38), (0.33333333333333331, 826), (0.33333333333333331, 727)]
[(0.21712927295533244, 451), (0.16396078054371141, 782), (0.13652705949581431, 721), (0.095840694682461411, 181), (0.0, 943), (0.0, 942), (0.0, 941), (0.0, 940), (0.0, 939), (0.0, 938)]
In [40]:
#Task t4d
#In this task, we are going to predict the rating of a user for a movie using user based collaborative filtering
#Here is the high level logic:
# for each user u in all_user_ids:
#    s = similarity between u and input_user_id [remember to pass min_common_movies]
#    if the similarity is 0.0, ignore u
#    if u has not rated this movie, ignore u as well
#    suppose u has rated this movie with a value of r
#    compute a "weighted rating" for u as r*s
# compute the predicted rating as sum of all weighted ratings / sum of all similarities

def predict_rating_for_movie_ucf(user_rating_hash, input_user_id, movie_name, all_user_ids, min_common_movies=5):
    total_weighted_rating = 0.0
    total_similarity= 0.0
    
    for user_id in all_user_ids:
        if user_id == input_user_id:
            continue
        
        #compute similarity between users
        similarity = compute_user_user_similarity(user_rating_hash, user_id, input_user_id, min_common_movies)
        
        #Reject user if similarity is 0
        if similarity <= 0.0:
            continue
            
        #reject user if (s)he has not rated the movie
        if movie_name not in user_rating_hash[user_id]:
            continue
            
        #Compute weighted rating
        weighted_rating = similarity * user_rating_hash[user_id][movie_name]
        
        #update total_weighted_rating and total_similarity
        total_weighted_rating += weighted_rating
        total_similarity += similarity
            
    #Do not change below
    if total_similarity == 0.0:
        return 0.0
    
    return total_weighted_rating / total_similarity
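For reference, the prediction above is a similarity-weighted average over the other users who have rated the movie. Writing $s(u,v)$ for the user-user similarity computed earlier and $r_{v,m}$ for user $v$'s rating of movie $m$ (our notation), the function computes

$$\hat{r}_{u,m} \;=\; \frac{\sum_{v \neq u,\, v \text{ rated } m} s(u,v)\, r_{v,m}}{\sum_{v \neq u,\, v \text{ rated } m} s(u,v)}$$

and falls back to 0.0 when no neighbor qualifies (i.e. the denominator is zero).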
In [41]:
#Do not change below
all_user_ids = range(1, 943+1)
for user_id in range(1, 5+1):
    print "user_id = ", user_id
    print [ round(predict_rating_for_movie_ucf(user_rating_hash, user_id, all_movie_names[i], all_user_ids, min_common_movies=5),1)
          for i in range(1, 10+1)]
    print [ round(predict_rating_for_movie_ucf(user_rating_hash, user_id, all_movie_names[i], all_user_ids, min_common_movies=10),1)
          for i in range(1, 10+1)]
    print "\n"
user_id =  1
[3.2, 3.1, 3.6, 3.3, 3.8, 3.8, 4.0, 3.9, 3.9, 3.9]
[3.2, 3.1, 3.6, 3.3, 3.5, 3.8, 4.0, 3.9, 3.9, 3.8]


user_id =  2
[3.2, 3.1, 3.6, 3.3, 3.7, 3.8, 4.0, 3.9, 3.8, 3.9]
[3.2, 3.1, 3.6, 3.3, 3.5, 3.8, 4.1, 3.9, 3.8, 3.8]


user_id =  3
[3.2, 3.0, 3.6, 3.3, 3.4, 3.8, 3.9, 3.9, 3.8, 3.9]
[3.0, 2.8, 3.5, 3.1, 3.2, 3.8, 4.0, 3.9, 3.7, 3.8]


user_id =  4
[3.3, 3.0, 3.6, 3.4, 3.4, 3.8, 3.9, 3.9, 3.7, 3.9]
[3.2, 2.9, 3.4, 3.0, 2.8, 3.8, 4.0, 3.8, 3.5, 3.9]


user_id =  5
[3.2, 3.0, 3.5, 3.3, 3.6, 3.8, 4.0, 3.8, 3.8, 3.8]
[3.2, 3.0, 3.5, 3.3, 3.5, 3.8, 4.0, 3.8, 3.8, 3.8]


In [42]:
#Task t4e: 
#Here is the pseudocode for recommending movies
# for each movie this user has not rated:
#     predict the rating for this movie and this user using t4d
#     (t4d already iterates over all other users internally)
# return the top-k movies by predicted rating
def recommend_movies_ucf(user_rating_hash, all_user_ids, input_user_id, k=10, min_common_movies=5):
    predicted_ratings = []
    
    #Your code here
    for movie_name in all_movie_names:
        if movie_name in user_rating_hash[input_user_id]:
            continue
        
        #Predict the rating for input_user_id and movie_name using t4d
        predicted_rating = predict_rating_for_movie_ucf(user_rating_hash, input_user_id, movie_name, 
                                                        all_user_ids, min_common_movies)
        #append rating and movie name to predicted_ratings
        predicted_ratings.append( (predicted_rating, movie_name) )
    
    return sorted(predicted_ratings, reverse=True)[:k]
In [43]:
#Do not change below
all_user_ids = range(1, 943+1)

for user_id in range(1, 5):
    print recommend_movies_ucf(user_rating_hash, all_user_ids, user_id, k=10, min_common_movies=5)
[(5.0000000000000009, 'Saint of Fort Washington, The (1993)'), (5.0, 'They Made Me a Criminal (1939)'), (5.0, "Someone Else's America (1995)"), (5.0, 'Santa with Muscles (1996)'), (5.0, 'Prefontaine (1997)'), (5.0, 'Marlene Dietrich: Shadow and Light (1996) '), (5.0, 'Little City (1998)'), (5.0, 'Great Day in Harlem, A (1994)'), (5.0, 'Entertaining Angels: The Dorothy Day Story (1996)'), (5.0, 'Aiqing wansui (1994)')]
[(5.0000000000000009, 'Prefontaine (1997)'), (5.0, 'They Made Me a Criminal (1939)'), (5.0, 'Star Kid (1997)'), (5.0, "Someone Else's America (1995)"), (5.0, 'Santa with Muscles (1996)'), (5.0, 'Saint of Fort Washington, The (1993)'), (5.0, 'Marlene Dietrich: Shadow and Light (1996) '), (5.0, 'Great Day in Harlem, A (1994)'), (5.0, 'Entertaining Angels: The Dorothy Day Story (1996)'), (5.0, 'Aiqing wansui (1994)')]
[(5.0, 'Tough and Deadly (1995)'), (5.0, 'Star Kid (1997)'), (5.0, 'Santa with Muscles (1996)'), (5.0, 'Saint of Fort Washington, The (1993)'), (5.0, 'Marlene Dietrich: Shadow and Light (1996) '), (5.0, 'Great Day in Harlem, A (1994)'), (5.0, 'Entertaining Angels: The Dorothy Day Story (1996)'), (5.0, 'Aiqing wansui (1994)'), (4.9999999999999991, 'Prefontaine (1997)'), (4.6888446196178126, 'Anna (1996)')]
[(5.0000000000000009, 'Prefontaine (1997)'), (5.0, 'Star Kid (1997)'), (5.0, "Someone Else's America (1995)"), (5.0, 'Santa with Muscles (1996)'), (5.0, 'Saint of Fort Washington, The (1993)'), (5.0, 'Marlene Dietrich: Shadow and Light (1996) '), (5.0, 'Little City (1998)'), (5.0, 'Great Day in Harlem, A (1994)'), (4.6385142554001959, 'Leading Man, The (1996)'), (4.6325248550982909, 'Crossfire (1947)')]

Task 5: Latent Factor Models

In this task, let us build the simplest SVD-based latent factor model.
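The idea in one line: approximate the user-movie ratings matrix $R$ by a truncated SVD with $K$ latent factors,

$$R \;\approx\; U_K \, \Sigma_K \, V_K^{T}$$

where, roughly, each row of $U_K$ can be read as a user's factor vector and each column of $V_K^{T}$ as a movie's factor vector; the entries of the reconstructed low-rank matrix then serve as predicted ratings. The tasks below build exactly these pieces.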

In [70]:
number_of_users = 943
number_of_movies = 1682

ratings_matrix = sp.sparse.lil_matrix((number_of_users, number_of_movies))
In [75]:
#Task t5a: This task requires a different data structure and hence differs from the prior tasks.
# Here is the high level idea:
#  - Create a sparse matrix of type lil_matrix
#  - Populate it with the data from the ratings file
#  - If user id i gave movie id j a rating of r, set matrix[i-1, j-1] to r
#    i.e. rows = users, columns = movies
# Hint: Python reads each field from the file as a string,
# so convert the ids and the rating to integers before inserting them.
with open("ml-100k/u.data","r") as f:
    for line in f:
        user_id, movie_id, rating, timestamp = line.split("\t")
        ratings_matrix[ int(user_id)-1, int(movie_id)-1] = int(rating)
In [79]:
print "Matrix shape is", ratings_matrix.shape
print "Number of non zero values", ratings_matrix.nnz
print "Number of non zero values", ratings_matrix.nonzero()
Matrix shape is (943, 1682)
Number of non zero values 158695
Non-zero indices (array([  0,   0,   0, ..., 942, 942, 942], dtype=int32), array([   0,    1,    2, ..., 1187, 1227, 1329], dtype=int32))
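A small practical aside (not required by the assignment): lil_matrix is designed for incremental construction rather than arithmetic, so it is typically worth converting to a compressed format before running SVD, for example:

#Optional: CSC/CSR formats are faster than LIL for the matrix-vector products performed inside svds
ratings_matrix_csc = ratings_matrix.tocsc()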
In [80]:
#Task t5b:
# Perform SVD on the ratings matrix
# Hint: use the svds function imported above, not svd
# K is the number of latent factors to keep
def perform_svd(ratings_matrix, K=100):
    #svds returns U (num_users x K), S (a 1-D array of K singular values) and V (K x num_movies)
    U, S, V = svds(ratings_matrix, k=K)
    return U, S, V
In [81]:
#Task t5c:
# Note that S is an array and not a matrix
# Create a diagonal matrix whose diagonal is populated from S
# For example, if S = [2, 4, 8]
# then output must be
#  2 0 0
#  0 4 0
#  0 0 8

def construct_diagonal_matrix(S):
    #np.diag turns the 1-D array of singular values into a square diagonal matrix
    return np.diag(S)
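As a quick illustrative check (on a hypothetical 3-value array, not part of the assignment data), the function above should behave as described:

print construct_diagonal_matrix(np.array([2, 4, 8]))
#[[2 0 0]
# [0 4 0]
# [0 0 8]]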
In [83]:
#Task t5d: 
# We are now going to reconstruct the ratings matrix from the SVD that we computed
# new matrix = U S V
# Its shape matches the original ratings matrix; it is just a rank-K approximation,
# since we kept only the top-K factors
def reconstruct_low_rank_matrix(U, S, V):
    #(num_users x K) . (K x K) . (K x num_movies) = (num_users x num_movies)
    return np.dot(np.dot(U, S), V)
In [84]:
#Task t5e:
# Using the reconstructed matrix, predict the rating of a user for a movie
# Note: this assumes the matrix from t5d is stored in a global variable named
# reconstructed_ratings (a name chosen here, not given by the assignment)
def predict_rating_svd(user_id, movie_id):
    #user_id and movie_id are 1-indexed (as in u.data); the matrix is 0-indexed
    return reconstructed_ratings[user_id - 1, movie_id - 1]
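To tie Task 5 together, here is a minimal end-to-end sketch under the assumptions made above: the helper functions are filled in as shown, and the reconstructed matrix is kept in a global named reconstructed_ratings (a name chosen here, not mandated by the assignment).

#Minimal sketch: run the SVD pipeline end to end and spot-check a few predictions.
#Assumes ratings_matrix, perform_svd, construct_diagonal_matrix,
#reconstruct_low_rank_matrix and predict_rating_svd are defined as above.
U, S, V = perform_svd(ratings_matrix, K=100)
Sigma = construct_diagonal_matrix(S)
reconstructed_ratings = reconstruct_low_rank_matrix(U, Sigma, V)

print "Reconstructed matrix shape:", reconstructed_ratings.shape   #expected (943, 1682)
#Predicted ratings of user 1 for movie ids 1..5 (ids are 1-indexed, as in u.data)
print [round(predict_rating_svd(1, movie_id), 2) for movie_id in range(1, 5+1)]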