When submitting, fill your team details in this cell. Note that this is a markdown cell.
Student 1 Full Name:
Student 1 Student ID:
Student 1 Email Address:
Student 2 Full Name:
Student 2 Student ID:
Student 2 Email Address:
Student 3 Full Name:
Student 3 Student ID:
Student 3 Email Address:
In this assignment, we will explore some of the fundamental concepts in building a recommendation systems for movies. This assignment material is inspired by code from a number of resources including those from Hanspeter Pfister, YHat, Sudeep Das, Jonny and others. At a high level, you will be doing the following tasks:
Before building the actual recommender, we will explore the movie dataset and see how ratings are distributed across different dimensions and how the sparse the overall dataset is. These factors (long tail/sparsity) cause lot of issues when building a recommender.
We will start with a simple nearest neighbor based recommender system. Given a movie, we will try to find the top-$k$ most similar movies. We will just provide the list of movies without predicting the ratings.
In this task, we will try to predict the ratings also. We will use item based collaborative filtering that computes similarity between movies and use them for recommendation.
In this task, we will design a user based collaborative filtering that is very similar to the recommender you designed in Task 3.
In this task, we will explore couple of advanced models involving latent factors.
########### Do not change anything below
%matplotlib inline
#Array processing
import numpy as np
#Data analysis, wrangling and common exploratory operations
import pandas as pd
from pandas import Series, DataFrame
from IPython.display import display
#For visualization. Matplotlib for basic viz and seaborn for more stylish figures + statistical figures not in MPL.
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
#SVD for Sparse matrices
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import euclidean_distances
try:
import cPickle as pickle
except:
import pickle
from collections import defaultdict, Counter
import operator
from matplotlib import rcParams
rcParams['figure.figsize'] = (10, 6)
########################If needed you can import additional packages for helping you, although I would discourage it
########################Put all your imports in the space below. If you use some strange package,
##########################provide clear information as to how we can install it.
#######################End imports###################################
Dataset: We will be using the MovieLens 100K dataset. It is a fairly small dataset with 100K ratings from 1000 users on 1700 movies. You can download the dataset from http://grouplens.org/datasets/movielens/ . Some basic details about the dataset can be found in the README text file at http://files.grouplens.org/datasets/movielens/ml-100k-README.txt . You will need to understand the structure before doing the experiments. Download and unzip the file in the same folder as the IPython notebook. The rest of the code assumes that there is a folder called "ml-100k" in the current directory.
#####Do not change anything below
#Load the user data
users_df = pd.read_csv('ml-100k/u.user', sep='|', names=['UserId', 'Age', 'Gender', 'Occupation', 'ZipCode'])
#Load the movies data: we will only use movie id and title for this assignment
movies_df = pd.read_csv('ml-100k/u.item', sep='|', names=['MovieId', 'Title'], usecols=range(2))
#Load the ratings data: ignore the timestamps
ratings_df = pd.read_csv('ml-100k/u.data', sep='\t', names=['UserId', 'MovieId', 'Rating'],usecols=range(3))
#Working on three different data frames is a pain
# Let us create a single dataset by "joining" these three data frames
movie_ratings_df = pd.merge(movies_df, ratings_df)
movielens_df = pd.merge(movie_ratings_df, users_df)
movielens_df.head()
#Task t1a: Print the NAME of the top-10 movies with most ratings
print movielens_df.groupby('Title').size().order(ascending=False)[:10]
#Task t1b: Let us now analyze the rating behavior of the 1000 users in the dataset
# Create a histogram based on the number of ratings per user with 50 bins.
# Title="Count of Ratings per User", XLabel="Ratings per user", YLabel="#Users"
#t1b_user_rating_count is a groupby object that counts the number of ratings for each user.
t1b_user_rating_count = movielens_df.groupby('UserId')['Rating'].count()
t1b_user_rating_count.hist(bins=50)
plt.xlabel("Ratings per user")
plt.ylabel("#Users")
plt.title("Count of Ratings per User")
####The figure below shows that most users leave less than 25 ratings while few outliers leave a lot of ratings
#Task t1c: Let us now analyze the rating behavior of the 1000 users in the dataset
# Create a histogram based on the number of ratings per movie with 50 bins.
# Title="Count of Ratings per Movie", XLabel="Ratings per Movie", YLabel="#Movies"
#t1c_user_rating_count is a groupby object that counts the number of ratings for each movie.
t1c_user_rating_count = movielens_df.groupby('MovieId')['Rating'].count()
t1c_user_rating_count.hist(bins=50)
plt.xlabel("Ratings per Movie")
plt.ylabel("#Movies")
plt.title("Count of Ratings per Movie")
####The figure below shows that most movies receive less than 25 ratings while few popular get a lot of ratings
#Task t1d: Let us now analyze the rating distribution
# Create a histogram based on the ratings received for each movie with 5 bins.
# Title="Ratings Histogram", XLabel="Rating Provided", YLabel="#Ratings"
movielens_df['Rating'].hist(bins=5)
plt.xlabel("Rating Provided")
plt.ylabel("#Ratings")
plt.title("Ratings Histogram")
####The figure below shows that most movies at least 35K of ratings provided a score of 4 or higher.
#The code below shows that the average rating is 3.5
# This is problematic because each user has a different rating scale
# For some users 1 is a bad movie while for some others 3 is bad
# So a good model must take into the account the baseline rating behavior of users and movies
print "Average rating of ALL movies is", round(movie_ratings_df['Rating'].mean(), 2)
#Task t1e: Let us now study the baseline rating behavior in more detail
# For each user compute his/her average rating
# Create a histogram based on the average ratings with 5 bins.
# Title="Histogram of Average Ratings of Users", XLabel="Average Rating", YLabel="#Users"
#t1e_avg_ratings is a groupby object with average rating for each user
t1e_avg_ratings = movielens_df.groupby('UserId')['Rating'].mean()
t1e_avg_ratings.hist(bins=5)
plt.title("Histogram of Average Ratings of Users")
plt.xlabel("Average Rating")
plt.ylabel("#Users")
####The figure below shows that while the average rating of users vary
# What does this imply?
#Task t1f: Let us now study the baseline rating behavior in more detail
# For each movie compute its average rating
# Create a histogram based on the average ratings with 5 bins.
# Title="Histogram of Average Ratings of Movies", XLabel="Average Rating", YLabel="#Movies"
#t1e_avg_ratings is a groupby object with average rating for each user
t1f_avg_ratings = movielens_df.groupby('MovieId')['Rating'].mean()
t1f_avg_ratings.hist(bins=5)
plt.title("Histogram of Average Ratings of Movies")
plt.xlabel("Average Rating")
plt.ylabel("#Movies")
####The figure below shows that while the average rating of movies vary
# What does this imply?
Common Support: The concept of common support is a key idea is recommender systems. Given two items (movies in our case), the common support is the number of reviewers who rated both items. It is useful in both K-nearest neighbor and collaborative filtering based recommenders. Specifically, if the common support is 0 for a pair of movies, then it is quite hard to find their similarity!
#Task t1g: Let us now analyze the common support for movielens dataset
# Here is the high level idea
# We are going to create an array and populate it with the common support for all pair of movies
# We then are going to plot a histogram and see its distribution.
#This task might take quite some time - 1-2 hours for typical machines !
t1g_all_movies = movies_df['MovieId'].unique()
t1g_allpair_commonsupport = []
for i, movie1 in enumerate(t1g_all_movies):
for j, movie2 in enumerate(t1g_all_movies):
#Do not count a pair twice
if i < j:
userids_who_rated_movie1 = movielens_df[movielens_df.MovieId==movie1].UserId.unique()
userids_who_rated_movie2 = movielens_df[movielens_df.MovieId==movie2].UserId.unique()
num_common_users = len( set(userids_who_rated_movie1).intersection(userids_who_rated_movie2) )
t1g_allpair_commonsupport.append(num_common_users)
print "Average common support is ", round(np.mean(t1g_allpair_commonsupport), 2)
plt.hist(t1g_allpair_commonsupport)
#### What the average common support and its distribution imply?
#Task t1h: Let us now consider how sparse the matrix is
t1h_sparsity = 100000./(1682*943)
print "Sparsity of the dataset is ", t1h_sparsity
#This graph is actually less sparse than typical datasets for recommender systems
# which often have a sparsity much lower than 1%
# As discussed in the class, the sparsity imposes huge problem in terms of designing efficient and correct algorithms
#Task t1i: Compute the average rating for each movie grouped by gender
# In other words, for each movie, compute the average rating given by men and women
# Hint: use a pivot table
t1i_movielens_mean_ratings = movielens_df.pivot_table('Rating', index ='Title', columns='Gender',aggfunc='mean')
display(t1i_movielens_mean_ratings[:10])
Let us now build a simple global recommendation system based on the nearest neighbor idea.
#Task t2a:
# Create a dictionary where key is Movie Name and value is id
# You can either use the movies_df or read and parse the u.item file yourself
movie_name_to_id_dictionary = {}
#Write code to populate the movie names to this array
all_movie_names = []
f = open("ml-100k/u.item", "r")
lines = f.readlines()
for line in lines:
data = line.split("|")
movie_name, movie_id = data[1], int(data[0])
movie_name_to_id_dictionary[movie_name] = movie_id
all_movie_names.append(movie_name)
f.close()
#Task t2b: Write a function that takes two inputs:
# movie_id: id of the movie and common_users: a set of user ids
# and returns the list of rows corresponding to their movie ratings
def get_movie_reviews(movie_id, common_users):
#Get a boolean vector for themovielens_dfns_dfs provided by users in common_users for movie movie_id
# Hint: use the isin operator of Pandas
mask = (movielens_df['UserId'].isin(common_users)) & (movielens_df['MovieId'] == movie_id)
#Create a subset of data where the mask is True i.e. only collect data from users who satisfy the condition above
# Then sort them based on userid
movie_ratings = movielens_df[mask].sort('UserId')
#Do not change below
#Return the unique set of ratings provided
movie_ratings = movie_ratings[movie_ratings['UserId'].duplicated()==False]
return movie_ratings
#Do not change below
#Here are some sample test cases for evaluating t2b
print "get_movie_reviews(1, set([1]))"
display( get_movie_reviews(1, set([1])) )
print "get_movie_reviews(1, set(range(1, 10)))"
display( get_movie_reviews(1, set(range(1, 10))) )
print "get_movie_reviews(100, set(range(1, 10)))"
display( get_movie_reviews(100, set(range(1, 10))) )
print "get_movie_reviews(784, set(range(1, 784)))"
display( get_movie_reviews(784, set(range(1, 784))) )
#Task t2c: Let us now calculate the similarity between two movies
# based on the set of users who rated both movies
#Using euclidean distance is a bad idea - but simplifies the code
def calculate_similarity(movie_name_1, movie_name_2, min_common_users=0):
movie1 = movie_name_to_id_dictionary[movie_name_1]
movie2 = movie_name_to_id_dictionary[movie_name_2]
#This is the set of UNIQUE user ids who reviewed movie1
users_who_rated_movie1 = movielens_df[movielens_df['MovieId'] == movie1]['UserId'].unique()
#This is the set of UNIQUE user ids who reviewed movie2
users_who_rated_movie2 = movielens_df[movielens_df['MovieId'] == movie2]['UserId'].unique()
#Compute the common users who rated both movies:
# hint convert both to set and do the intersection
common_users = set(users_who_rated_movie1).intersection(users_who_rated_movie2)
#Using the code you wrote in t2a, get the reviews for the movies and common users
movie1_reviews = get_movie_reviews(movie1, common_users)
movie2_reviews = get_movie_reviews(movie2, common_users)
#Now you have the data frame for both movies
# Use the euclidean_distances function from sklean (imported already)
# to compute the distance between their rating values
distance = euclidean_distances(movie1_reviews['Rating'], movie2_reviews['Rating'])
if len(common_users) < min_common_users:
return [[float('inf')]]
return distance
#Do not change below
print calculate_similarity("Toy Story (1995)", "GoldenEye (1995)")
print calculate_similarity("GoldenEye (1995)", "Tomorrow Never Dies (1997)")
print calculate_similarity("Batman Forever (1995)", "Batman & Robin (1997)")
#Task t2d: Given a movie, find the top-k most similar movies
# that have the lowest euclidean distance
#Here is the high level logic:
# for each movie in all_movie_names (Except the input movie name)
# compute its similarity and store it in an array
# return the k movies with the smallest distances
# remember to pass min_common_users to calculate_similarity
def get_top_k_similar_movies(input_movie_name, k=5, min_common_users=0):
movie_similarity = []
for movie_name in all_movie_names:
if input_movie_name == movie_name:
continue
similarity = calculate_similarity(input_movie_name, movie_name, min_common_users)
similarity = similarity[0][0]
movie_similarity.append( (similarity, movie_name) )
return sorted(movie_similarity)[:k]
#print get_top_k_similar_movies("Toy Story (1995)", 10)
print "\nMovies similar to GoldenEye [25]", get_top_k_similar_movies("GoldenEye (1995)", 10, 25)
print "\nMovies similar to GoldenEye [50]", get_top_k_similar_movies("GoldenEye (1995)", 10, 50)
print "\nMovies similar to GoldenEye [100]", get_top_k_similar_movies("GoldenEye (1995)", 10, 100)
print "\n\n"
print "\nMovies similar to Usual Suspects [25]", get_top_k_similar_movies("Usual Suspects, The (1995)", 10, 25)
print "\nMovies similar to Usual Suspects [50]", get_top_k_similar_movies("Usual Suspects, The (1995)", 10, 50)
print "\nMovies similar to Usual Suspects [100]", get_top_k_similar_movies("Usual Suspects, The (1995)", 10, 100)
print "\n\n"
print "\nMovies similar to Batman Forever [25]", get_top_k_similar_movies("Batman Forever (1995)", 10, 25)
print "\nMovies similar to Batman Forever [50]", get_top_k_similar_movies("Batman Forever (1995)", 10, 50)
print "\nMovies similar to Batman Forever [100]", get_top_k_similar_movies("Batman Forever (1995)", 10, 100)
print "\n\n"
print "\nMovies similar to Shawshank Redemption [25]", get_top_k_similar_movies("Shawshank Redemption, The (1994)", 10, 25)
print "\nMovies similar to Shawshank Redemption [50]", get_top_k_similar_movies("Shawshank Redemption, The (1994)", 10, 50)
print "\nMovies similar to Shawshank Redemption [100]", get_top_k_similar_movies("Shawshank Redemption, The (1994)", 10, 100)
print "\n\n"
In this task, let us try to perform item based collaborative filtering. In spirit, it is very similar to what you already did for Task 2. With some minor changes, you can easily build an item based collaborative filtering recommenders.
#Do not change below
# By default euclidean distance can be give arbitrary values
# Let us "normalize" it by limit its value between 0 and 1 and slightly change the interpretation
# 0 means that the preferences are very different
# 1 means that preferences are identical
# For tasks 3 and 4, remember to use this function
#Vec1 and vec2 are vectors
def euclidean_distance_normed(vec1, vec2):
if len(vec1) == 0:
return 0.0
euc_distance = euclidean_distances(vec1, vec2)[0][0]
return 1.0 / (1.0 + euc_distance)
#Task t3a:
# In this task, you want to compute the similarity between two items
# which in this case means ratings_df
# You can use code from 2c except that you must now call euclidean_distance_normed
# when computing the distance
def calculate_similarity_normed(movie_name_1, movie_name_2, min_common_users=0):
movie1 = movie_name_to_id_dictionary[movie_name_1]
movie2 = movie_name_to_id_dictionary[movie_name_2]
#This is the set of UNIQUE user ids who reviewed movie1
users_who_rated_movie1 = movielens_df[movielens_df['MovieId'] == movie1]['UserId'].unique()
#This is the set of UNIQUE user ids who reviewed movie2
users_who_rated_movie2 = movielens_df[movielens_df['MovieId'] == movie2]['UserId'].unique()
#Compute the common users who rated both movies:
# hint convert both to set and do the intersection
common_users = set(users_who_rated_movie1).intersection(users_who_rated_movie2)
#Using the code you wrote in t2a, get the reviews for the movies and common users
movie1_reviews = get_movie_reviews(movie1, common_users)
movie2_reviews = get_movie_reviews(movie2, common_users)
#Do not change below
#Now you have the data frame for both movies
# Use the euclidean_distances function from sklean (imported already)
# to compute the distance between their rating values
distance = euclidean_distance_normed(movie1_reviews['Rating'].values, movie2_reviews['Rating'].values)
if len(common_users) < min_common_users:
return 0.0
return distance
#Do not change below
print calculate_similarity_normed("Toy Story (1995)", "GoldenEye (1995)")
print calculate_similarity_normed("GoldenEye (1995)", "Tomorrow Never Dies (1997)")
print calculate_similarity_normed("Batman Forever (1995)", "Batman & Robin (1997)")
#Do not change below
#We are now going to create item-item similarity database
# Since our data is "small", we will use a non-traditional approach of using nested hashes
# In real-life, using something like databases or other data structures is far more preferable
#Here is the high level structure
#{
# movie_name_1:
# {
# movie_name_2: similarity_between_movie_1_and_2,
# movie_name_3: similarity_between_movie_1_and_3,
# ....
# movie_name_n: similarity_between_movie_1_and_n
# },
# movie_name_2:
# {
# movie_name_1: similarity_between_movie_2_and_1,
# movie_name_3: similarity_between_movie_2_and_3,
# ....
# movie_name_n: similarity_between_movie_2_and_n
# },
# ....
# movie_name_n:
# {
# movie_name_1: similarity_between_movie_n_and_1,
# movie_name_2: similarity_between_movie_n_and_2,
# ....
# movie_name_n-1: similarity_between_movie_n_and_n-1
# },
#}
#Here is how to use this data structuere:
#To get similarity between movies
# data[movie1][movie2]
#To get similarity between one movie and all others
# data[movie1]
#DO not change below
#This hash stores the movie to movie
# as described above
movie_similarity_hash = defaultdict(dict)
#Item based filtering is expensive as you need to compute similarity of all pairs of items
# for this dataset it is 1682*1682 ~ 28 lakh pairs or 2.8 million
# running all of them might take hours and close to a day
# instead let us run on a smaller dataset
# specifically, let us only focus on the top-250 movies based on ratings
# which is more manageable
#Task t3b:
# Get the top-k movie names with most ratings
#Hint: use Counter class
def top_k_movie_names(k):
movie_ratings_counter = Counter()
with open("ml-100k/u.data","r") as f:
for line in f:
user_id, movie_id, rating, timestamp = line.split("\t")
movie_name = all_movie_names[ int(movie_id) - 1]
movie_ratings_counter[movie_name] += 1
return movie_ratings_counter.most_common(k)
#Do not change below
print "Top-10", top_k_movie_names(10), "\n"
print "Top-25", top_k_movie_names(25), "\n"
#Do not change below
top_250_movie_names = [item[0] for item in top_k_movie_names(250)]
#Task t3c:
#Use the following logic
# for each movie in movie_names:
# for all other movies in movie_names:
# compute similarity between two movies using calculate_similarity_normed
# remember to pass min_common_users to that function
# note that movie_similarity_hash is a defaultdict
# so similarity between movie1 and movie2 can be set as movie_similarity_hash[movie1][movie2]
# btw, similarity in our case is commutative.
# i.e. similarity(movie1, movie2) = similarity(movie2, movie1)
# so do not call the function twice !
# movie_names is an array that lists the movies for which you have to compute pairwise similarity
def compute_movie_to_movie_similarity(movie_names, min_common_users=0):
for movie_name_1 in movie_names:
for movie_name_2 in movie_names:
if movie_name_1 == movie_name_2:
continue
similarity = calculate_similarity_normed(movie_name_1, movie_name_2, min_common_users)
movie_similarity_hash[movie_name_1][movie_name_2] = similarity
movie_similarity_hash[movie_name_2][movie_name_1] = similarity
#Do not change below
#Let us first test if your code above is correct by testing against a small subset of data
movie_similarity_hash = defaultdict(dict)
# let use the top-10 movies
compute_movie_to_movie_similarity(top_250_movie_names[:10], min_common_users=0)
#Get similarity with
display(movie_similarity_hash["Toy Story (1995)"])
display(movie_similarity_hash['Return of the Jedi (1983)'])
print movie_similarity_hash["Toy Story (1995)"]["Independence Day (ID4) (1996)"]
#Do not change below
#Let us now test against top-250 most popular movies
#This might take 10-20 mins to run!
movie_similarity_hash = defaultdict(dict)
compute_movie_to_movie_similarity(top_250_movie_names, min_common_users=25)
#Do not change below
#Do this if you want to persist the data
# Let us persist the movie-movie similarity data structure
# that way you dont need to re-run the whole thing
#pickle is a serialization library in Python
# To persist/serialize, use the following line
#pickle.dump(movie_similarity_hash, open("movie_similarity.pickle", "wb"))
# To deserialize, uncomment the following line
#movie_similarity_hash = pickle.load( open( "movie_similarity.pickle", "rb" ) )
for movie_name in top_250_movie_names[:10]:
print "Top-10 most similar movies for ", movie_name, " :",
print sorted(movie_similarity_hash[movie_name].items(), key=operator.itemgetter(1), reverse=True)[:10]
print "\n"
#Task t3d
#Before doing t3d, please complete t4a so that user_rating_hash is available
# this will make the code below easier
#In this task, we are going to predict the rating of a user u for a movie m using item based collaborative filtering
#Here is the high level logic:
# for each item i rated by this user:
# s = similarity between i and input movie m
# if similarity between i and m is 0, ignore this item
# compute weighted rating for m based on i as rating for i * s
# compute the predicted rating as sum of all weighted ratings / sum of all similarities
def predict_rating_for_movie_icf(movie_similarity_hash, input_user_id, input_movie_name, movies_considered):
total_weighted_rating = 0.0
total_similarity= 0.0
#Hint: movie_similarity_hash is a nested hash where user id is key and
# all their rating as a dictionary
# this dictionary is ordered as moviename: rating
#if this user has already rated the movie, return that rating
if input_movie_name in user_rating_hash[input_user_id]:
return user_rating_hash[input_user_id][input_movie_name]
#For each movie the user has rated
for movie_name in user_rating_hash[input_user_id].keys():
#if user rated some movie, but it is not in the subset of movies that we computed pairwise similarity
# such as top-250, then do not consider it either
# for this task, the input is in movies_considered
if movie_name not in movies_considered:
continue
#compute similarity between movies
#dont recompute = use the hash
similarity = movie_similarity_hash[input_movie_name][movie_name]
#Reject item if similarity is 0
if similarity <= 0.0:
continue
#Compute weighted rating
weighted_rating = similarity * user_rating_hash[input_user_id][movie_name]
#update total_weighted_rating and total_similarity
total_weighted_rating += weighted_rating
total_similarity += similarity
#Do not change below
if total_similarity == 0.0:
return 0.0
return total_weighted_rating / total_similarity
#Let us compute the rating for first 5 users for the top-20 movies that they have not seen
for user_id in range(1, 5+1):
print user_id, [ (movie_name,
round(predict_rating_for_movie_icf(movie_similarity_hash, user_id, movie_name, top_250_movie_names),2))
for movie_name in top_250_movie_names[:20]
if movie_name not in user_rating_hash[user_id]]
#print movie_name, predict_rating_for_movie_icf(movie_similarity_hash, 1, 'Liar Liar (1997)', min_common_users=25)
#Task t3e:
#Here is the pseudocode for recommending movies
# for each movie this user has not rated in movies_considered:
# predict rating for this movie and this user using t4d
# return the top-k movies
def recommend_movies_icf(input_user_id, movies_considered, movie_similarity_hash,
user_rating_hash, k=10, min_common_movies=5):
predicted_ratings = []
#Your code here
for movie_name in movies_considered:
if movie_name in user_rating_hash[input_user_id]:
continue
#Predict the rating for input_user_id and movie_name using t3d
predicted_rating = predict_rating_for_movie_icf(movie_similarity_hash,
input_user_id, movie_name, movies_considered)
#append rating and movie name to predicted_ratings
predicted_ratings.append( (predicted_rating, movie_name) )
return sorted(predicted_ratings, reverse=True)[:k]
#Do not change below:
#Let us predict top-5 movies for first 10 users
for user_id in range(1,11):
print user_id, recommend_movies_icf(user_id, top_250_movie_names, movie_similarity_hash,
user_rating_hash, k=10, min_common_movies=5)
In this task, let us try to perform user based collaborative filtering.
#In order to simplify the coding, let us create a nested hash structure to store the user-rating data
# It will look as follows:
#{
# u1: {movie_name_1:rating1, movie_name_2:rating2, ....},
# ....
# un: {movie_name_1:rating1, movie_name_2:rating2, ....},
#}
#Of course, we will only store the movies that the user rated
#Use the all_movie_names to convert movie id to movie name
# remember that Movielens uses 1 based indexing
# so the name of movieid i is in all_movie_names[i-1]
#Task t4a
#Create the data structure as discussed above
# here is the logic:
# for each line in file ml-100k/u.data:
# set user_rating_hash[user][movie] = rating
# read the instructions above again!
def compute_user_rating_hash():
user_rating_hash = defaultdict(dict)
#Your code below
with open("ml-100k/u.data","r") as f:
for line in f:
user_id, movie_id, rating, timestamp = line.split("\t")
user_rating_hash[int(user_id)][all_movie_names[int(movie_id)-1]] = int(rating)
return user_rating_hash
#Do not change below
user_rating_hash = compute_user_rating_hash()
#Do not change below
#How many users are there?
print len(user_rating_hash.keys())
#How many movies did each of the first 20 users rated?
print [len(user_rating_hash[i].keys()) for i in range(1,20+1)]
#print the ratings of user 4
display(user_rating_hash[4])
#Task t4b:
#We need to modify our logic for computing distance
#Here is the high level pseudocode:
# movie1 = movie names rated by user 1
# movie2 = movie names rated by user 2
# common movies = set intersection of movie1 and movie2
# if number of common movies is less than min_common_movies, return 0.0 [not 0]
# other wise create a vector with rating for common movies only
# compute euclidean distance between the vectors
# return 1 / (1+euclidean distace)
def compute_user_user_similarity(user_rating_hash, user_id_1, user_id_2, min_common_movies=0):
#Get list of movie names rated by user 1. hint use keys function [see above for usage]
movies_rated_user1 = user_rating_hash[user_id_1].keys()
movies_rated_user2 = user_rating_hash[user_id_2].keys()
#compute common movies
common_movies = set(movies_rated_user1).intersection( set(movies_rated_user2) )
if len(common_movies) < min_common_movies:
return 0.0
common_movies = sorted(list(common_movies))
#vector1 is the set of ratings for user1 for movies in common_movies
vector1 = [user_rating_hash[user_id_1][movie] for movie in common_movies]
#vector2 is the set of ratings for user2 for movies in common_movies
vector2 = [user_rating_hash[user_id_2][movie] for movie in common_movies]
#Compute distance and return 1.0/(1.0+distance)
distance = euclidean_distances(vector1, vector2)[0][0]
return 1.0 / ( 1.0 + distance)
#Testing code
print [round(compute_user_user_similarity(user_rating_hash, 1, i),2) for i in range(1, 10+1)]
print [round(compute_user_user_similarity(user_rating_hash, 784, i),2) for i in range(1, 10+1)]
#Task t4c
#This function finds the k-most similar users
#Here is the high level logic:
# for each user in all_user_ids other than the input user id:
# find similarity between this user and input_user_id and store as (similarity, other userid)
# sort based on similarity
# return top-k
# remember to pass min_common_movies
def top_k_most_similar_users(user_rating_hash, input_user_id, all_user_ids, k=10, min_common_movies=0):
user_similarity = []
for user_id in all_user_ids:
if user_id == input_user_id:
continue
similarity = compute_user_user_similarity(user_rating_hash, input_user_id, user_id, min_common_movies)
user_similarity.append( ( similarity, user_id) )
return sorted(user_similarity, reverse=True)[:k]
#Do not change below
all_user_ids = range(1, 943+1)
print top_k_most_similar_users(user_rating_hash, 1, all_user_ids, 10, 5)
print top_k_most_similar_users(user_rating_hash, 1, all_user_ids, 10, 10)
print top_k_most_similar_users(user_rating_hash, 812, all_user_ids, 10, 5)
print top_k_most_similar_users(user_rating_hash, 812, all_user_ids, 10, 20)
#Task t4d
#In this task, we are going to predict the rating of a user for a movie using user based collaborative filtering
#Here is the high level logic:
# for each user u in all_user_ids:
# s= similarity between u and input_user_id [remember to pass min_common_movies]
# if similairty is 0.0 ignore u
# if u has not rated this movie, ignore again
# suppose u has rated this movie with a value of r
# i am now going to give a "weighted rating" as r*s
# compute the predicted rating as sum of all weighted ratings / sum of all similarities
def predict_rating_for_movie_ucf(user_rating_hash, input_user_id, movie_name, all_user_ids, min_common_movies=5):
total_weighted_rating = 0.0
total_similarity= 0.0
for user_id in all_user_ids:
if user_id == input_user_id:
continue
#compute similarity between users
similarity = compute_user_user_similarity(user_rating_hash, user_id, input_user_id, min_common_movies)
#Reject user if similarity is 0
if similarity <= 0.0:
continue
#reject user if (s)he has not rated the movie
if movie_name not in user_rating_hash[user_id]:
continue
#Compute weighted rating
weighted_rating = similarity * user_rating_hash[user_id][movie_name]
#update total_weighted_rating and total_similarity
total_weighted_rating += weighted_rating
total_similarity += similarity
#Do not change below
if total_similarity == 0.0:
return 0.0
return total_weighted_rating / total_similarity
#Do not change below
all_user_ids = range(1, 943+1)
for user_id in range(1, 5+1):
print "user_id = ", user_id
print [ round(predict_rating_for_movie_ucf(user_rating_hash, user_id, all_movie_names[i], all_user_ids, min_common_movies=5),1)
for i in range(1, 10+1)]
print [ round(predict_rating_for_movie_ucf(user_rating_hash, user_id, all_movie_names[i], all_user_ids, min_common_movies=10),1)
for i in range(1, 10+1)]
print "\n"
#Task t4e:
#Here is the pseudocode for recommending movies
# for each movie this user has not rated:
# for all other users:
# predict rating for this movie and this user using t4d
# return the top-k movies
def recommend_movies_ucf(user_rating_hash, all_user_ids, input_user_id, k=10, min_common_movies=5):
predicted_ratings = []
#Your code here
for movie_name in all_movie_names:
if movie_name in user_rating_hash[input_user_id]:
continue
#Predict the rating for input_user_id and movie_name using t4d
predicted_rating = predict_rating_for_movie_ucf(user_rating_hash, input_user_id, movie_name,
all_user_ids, min_common_movies)
#append rating and movie name to predicted_ratings
predicted_ratings.append( (predicted_rating, movie_name) )
return sorted(predicted_ratings, reverse=True)[:k]
#Do not change below
all_user_ids = range(1, 943+1)
for user_id in range(1, 5):
print recommend_movies_ucf(user_rating_hash, all_user_ids, user_id, k=10, min_common_movies=5)
In this task, let us try to find the simplest SVD based latent factor model.
number_of_users = 943
number_of_movies = 1682
ratings_matrix = sp.sparse.lil_matrix((number_of_users, number_of_movies))
#Task t5a: This task requires a different data structure and hence different from prior tasks.
# Here is the high level idea:
# - Create a sparse matrix of type lil_matrix
# - populate it with the data from ratings file
# - if user id i gave movie id j with a rating r, set matrix[i-1,j-1] to r
# ie rows=users, col=movies
# Hint: If you are reading it from file, note that Python treats the data from file as a string
# so you might want to convert them to integer before inserting them.
with open("ml-100k/u.data","r") as f:
for line in f:
user_id, movie_id, rating, timestamp = line.split("\t")
ratings_matrix[ int(user_id)-1, int(movie_id)-1] = int(rating)
print "Matrix shape is", ratings_matrix.shape
print "Number of non zero values", ratings_matrix.nnz
print "Number of non zero values", ratings_matrix.nonzero()
#Task t5b:
# Perform SVD on the ratings matrix
# Hint use the svds function imported above and not svd
# K is the number of factors to have
def perform_svd(ratings_matrix, K=100):
U, S, V = None, None, None
return U, S, V
#Task t5c:
# Note that S is is an array and not a matrix
# Create a diagonal matrix where the diagonal matrix is populated from S
# For eg, if S = 2,4, 8
# then output must be
# 2 0 0
# 0 4 0
# 0 0 8
def construct_diagonal_matrix(S):
return None
#Task t5d:
# We are now going to reconstruct the matrix from the SVD that we have got
# new matrix = U S V
# But its shape will be different as we only used the top-k factors
def reconstruct_low_rank_matrix(U, S, V):
return None
#Task t5e:
# Using the reconstructed matrix, predict the rating of user to a movie
def predict_rating_svd(user_id, movie_id):
return 0.0