When submitting, fill in your team details in this cell. Note that this is a markdown cell.
Student 1 Full Name: Saravanan Thirumuruganathan
Student 1 Full ID: 1000xxxxxx
Student 1 Full Email Address: yyyy@zzzz
Student 2 Full Name: Satishkumar Masilamani
Student 2 Full ID: 1000xxxxxx
Student 2 Full Email Address: yyyy@zzzz
Student 3 Full Name:
Student 3 Full ID:
Student 3 Full Email Address:
At a high level, you will build and evaluate different classifiers for recognizing handwritten digits from the MNIST dataset, and also build and evaluate various regression models for predicting house prices in Boston. At a more granular level, you will be doing the following:
In the first set of tasks, you will evaluate a number of popular classifiers for the task of recognizing handwritten digits from the MNIST dataset. Specifically, we will focus on distinguishing between 7 and 9, which are known to be a hard pair. We will not use any sophisticated ideas from Computer Vision/Image Processing; instead, we will run the classifiers directly over the raw pixel data. The idea is to show that, a lot of the time, you can simply run a set of off-the-shelf classifiers and still get great results. While I will be giving some basic classifier code, you will have some opportunity to improve it by tuning the parameters.
In the second set of tasks, we will do multi-class classification, where the idea is to classify an image as one of the ten digits (0-9). We will start with some basic classifiers that are intrinsically multi-class. Then we will learn how to convert binary classifiers to multi-class classifiers, and how scikit-learn makes this very easy.
In the first two sets of tasks, we will focus narrowly on accuracy: what fraction of our predictions were correct. However, there are a number of other popular evaluation metrics. You will learn how (and when) to use these evaluation metrics.
This is an advanced topic where you will learn how to tune your classifier and find optimal parameters. We will explore two powerful techniques: grid search and randomized parameter search. Tuning is a very compute-intensive task, so you will also explore how to leverage the parallelization capabilities of the IPython kernel to get results sooner.
In the final set of tasks, we will use regression to predict Boston house prices. We will explore Ordinary Least Squares as well as the regression variants of popular classifiers such as decision trees and SVM.
%matplotlib inline
#Array processing
import numpy as np
#Data analysis, wrangling and common exploratory operations
import pandas as pd
from pandas import Series, DataFrame
#For visualization. Matplotlib for basic viz and seaborn for more stylish figures + statistical figures not in MPL.
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.display import Image
from sklearn.datasets import fetch_mldata, load_boston
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.svm import SVC, LinearSVC , SVR
from sklearn.linear_model import Perceptron, LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.cross_validation import KFold, train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV
########You have to install the python package pydot for generating some graph figures
######## If you are using pip, the command is pip install pydot
########You will also need to download Graphviz from http://www.graphviz.org/Download.php
########If you are using Windows, make sure that Graphviz/dot are in path and can be used by pydot package.
########If you get any pydot error, see url for one possible solution
######## http://stackoverflow.com/questions/15951748/pydot-and-graphviz-error-couldnt-import-dot-parser-loading-of-dot-files-will
######## If you are using Anaconda, pydot can be installed as conda install pydot
######## If that does not work check url:
######## http://stackoverflow.com/questions/27482170/installing-pydot-and-graphviz-packages-in-anaconda-environment
import pydot, StringIO
########################If needed, you can import additional packages to help you, although I would discourage it.
########################Put all your imports in the space below. If you use some unusual package,
########################provide clear information as to how we can install it.
#######################End imports###################################
In the first set of tasks, you will evaluate a number of popular classifiers for the task of recognizing handwritten digits from the MNIST dataset. Specifically, we will focus on distinguishing between 7 and 9, which are known to be a hard pair. We will not use any sophisticated ideas from Computer Vision/Image Processing; instead, we will run the classifiers directly over the raw pixel data.
####################Do not change anything below
#Load MNIST data. fetch_mldata will download the dataset and put it in a folder called mldata.
#Some things to be aware of:
# The folder mldata will be created in the folder in which you started the notebook
# So to make your life easy, always start the IPython notebook from the same folder.
# Else the following code will re-download the MNIST data every time it runs.
mnist = fetch_mldata("MNIST original")
#The data is organized as follows:
# Each row corresponds to an image
# Each image has 28*28 pixels, which are then linearized into a vector of size 784 (i.e., 28*28)
# mnist.data gives the image information while mnist.target gives the number in the image
print "#Images = %d and #Pixel per image = %s" % (mnist.data.shape[0], mnist.data.shape[1])
#Print first row of the dataset
img = mnist.data[0]
print "First image shows %d" % (mnist.target[0])
print "The corresponding matrix version of image is \n" , img
print "The image in grey shape is "
plt.imshow(img.reshape(28, 28), cmap="Greys")
#First 60K images are for training and last 10K are for testing
all_train_data = mnist.data[:60000]
all_test_data = mnist.data[60000:]
all_train_labels = mnist.target[:60000]
all_test_labels = mnist.target[60000:]
#For the first task, we will be doing binary classification and focus on one pair of
# digits: 7 and 9, which are known to be hard to distinguish
#Get all the seven images
sevens_data = mnist.data[mnist.target==7]
#Get all the nine images
nines_data = mnist.data[mnist.target==9]
#Merge them to create a new dataset
binary_class_data = np.vstack([sevens_data, nines_data])
binary_class_labels = np.hstack([np.repeat(7, sevens_data.shape[0]), np.repeat(9, nines_data.shape[0])])
#In order to make the experiments repeatable, we will seed the random number generator to a known value
# That way the results of the experiments will always be the same
np.random.seed(1234)
#randomly shuffle the data
binary_class_data, binary_class_labels = shuffle(binary_class_data, binary_class_labels)
print "Shape of data and labels are :" , binary_class_data.shape, binary_class_labels.shape
#There are approximately 14K images of 7 and 9.
#Let us take the first 5000 as training data and the remaining as test data
orig_binary_class_training_data = binary_class_data[:5000]
binary_class_training_labels = binary_class_labels[:5000]
orig_binary_class_testing_data = binary_class_data[5000:]
binary_class_testing_labels = binary_class_labels[5000:]
#The images are in grey scale where each pixel value is between 0 and 255.
# Now let us normalize them so that the values are between 0 and 1.
# This is the only modification we will make to the images.
binary_class_training_data = orig_binary_class_training_data / 255.0
binary_class_testing_data = orig_binary_class_testing_data / 255.0
scaled_training_data = all_train_data / 255.0
scaled_testing_data = all_test_data / 255.0
print binary_class_training_data[0,:]
###########Make sure that you remember the variable names and their meaning
#binary_class_training_data, binary_class_training_labels: Normalized images of 7 and 9 and the correct labels for training
#binary_class_testing_data, binary_class_testing_labels : Normalized images of 7 and 9 and correct labels for testing
#orig_binary_class_training_data, orig_binary_class_testing_data: Unnormalized images of 7 and 9
#all_train_data, all_test_data: unnormalized images of all digits
#all_train_labels, all_test_labels: labels for all digits
#scaled_training_data, scaled_testing_data: Normalized version of all_train_data, all_test_data for all digits
All classifiers in scikit-learn follow a common pattern (create the model, fit it on training data, predict on test data, evaluate the predictions) that makes life much easier. Follow these steps for all the tasks below.
We will start with one of the simplest classifiers. In the cell below, I have given the code for k-NN with k=1. This should give an idea about how to create, train and evaluate a classifier.
# Do not change anything in this cell.
# The following are some utility functions to help you visualize k-NN
#Code courtesy AMPLab
#This function displays one or more images in a grid manner.
def show_img_with_neighbors(imgs, n=1):
    fig = plt.figure()
    for i in xrange(0, n):
        #Subplot indices are 1-based, hence i + 1
        fig.add_subplot(1, n, i + 1, xticklabels=[], yticklabels=[])
        if n == 1:
            img = imgs
        else:
            img = imgs[i]
        plt.imshow(img.reshape(28, 28), cmap="Greys")
#This function shows some images for which k-NN made a mistake
# For each missed image, it will also show the k most similar training images so that you get an idea of why it failed.
def show_erroring_images_for_model(errors_in_model, num_img_to_print, model, n_neighbors):
    for errorImgIndex in errors_in_model[:num_img_to_print]:
        error_image = binary_class_testing_data[errorImgIndex]
        not_needed, result = model.kneighbors(error_image, n_neighbors=n_neighbors)
        show_img_with_neighbors(error_image)
        show_img_with_neighbors(binary_class_training_data[result[0],:], len(result[0]))
# Do not change anything in this cell.
#The code below creates a K-NN classifier with k=1.
#Clearly observe how I do it step by step.
#Step 1: Create a classifier with appropriate parameters
knn_model_k1 = KNeighborsClassifier(n_neighbors=1, algorithm='brute')
#Step 2: Fit it with training data
knn_model_k1.fit(binary_class_training_data, binary_class_training_labels)
#Print the model so that you know all parameters
print(knn_model_k1)
#Step 3: Make predictions based on testing data
predictions_knn_model_k1 = knn_model_k1.predict(binary_class_testing_data)
#Step 4: Evaluate the data
print "Accuracy of K-NN with k=1 is", metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k1)
#Let us now see the first five images that were predicted incorrectly to see what the issue is
errors_knn_model_k1 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k1[i] != binary_class_testing_labels[i]]
show_erroring_images_for_model(errors_knn_model_k1, 5, knn_model_k1, 1)
#task t1a
#Using the above code as model, create a KNN classifier with k=3
#Change the following line as appropriate
knn_model_k3 = KNeighborsClassifier(n_neighbors=3, algorithm='brute')
knn_model_k3.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_knn_model_k3 = knn_model_k3.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k3)
#Let us now see the first five images that were predicted incorrectly to see what the issue is
errors_knn_model_k3 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k3[i] != binary_class_testing_labels[i]]
show_erroring_images_for_model(errors_knn_model_k3, 5, knn_model_k3, 3)
#task t1b
#Using the above code as model, create a KNN classifier with k=5
#Change the following line as appropriate
knn_model_k5 = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
knn_model_k5.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_knn_model_k5 = knn_model_k5.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k5)
#Let us now see the first five images that were predicted incorrectly to see what the issue is
errors_knn_model_k5 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k5[i] != binary_class_testing_labels[i]]
show_erroring_images_for_model(errors_knn_model_k5, 5, knn_model_k5, 5)
#task t1c
#Now let us evaluate KNN for different values of K and find the best K
#WARNING: This code will take 20-40 minutes to run. So make sure your code is correct
k_vals = xrange(1, 20)
#Initialize to 0
accuracy_vals = [0 for _ in k_vals]
for k in k_vals:
    #Create a KNN with number of neighbors = k - #Change the following line as appropriate
    knn_model_for_knn_k = KNeighborsClassifier(n_neighbors=k, algorithm='brute')
    knn_model_for_knn_k.fit(binary_class_training_data, binary_class_training_labels)
    #Make the prediction - #Change the following line as appropriate
    predictions_for_knn_k = knn_model_for_knn_k.predict(binary_class_testing_data)
    accuracy_vals[k-1] = metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_k)
#Now you have two arrays: k_vals, which contains the different values of k, and accuracy_vals,
# which contains the corresponding accuracy of the model with that k
#Set opt_k to the k that gives the best accuracy.
# Hint: use the np.argmax command.
# Note: np.argmax returns a 0-based index into accuracy_vals while k itself starts at 1;
# indexing back into k_vals (as done below) takes care of this off-by-one offset.
opt_k = k_vals[np.argmax(accuracy_vals)] ##Change the following line as appropriate
print "Best k for k-nn is:",opt_k
#task t1d
#Train the model with optimal k that we have seen so far
#Change the following line as appropriate
knn_model_opt_k = KNeighborsClassifier(n_neighbors=opt_k, algorithm='brute')
knn_model_opt_k.fit(binary_class_training_data, binary_class_training_labels)
predictions_for_knn_opt_k = knn_model_opt_k.predict(binary_class_testing_data)
print "Accuracy for best k is ", metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_opt_k)
#(Harder)task t1e
##############################Read the parameter values in the url
# http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
#Try different combinations of the parameters so that you can beat the accuracy of knn_model_opt_k
#Change the following line as appropriate
#######Many different ways to do this - a number of teams got it right.
#######If interested - please ask in Piazza post for how others did it.
knn_model_opt_k_student = None
knn_model_opt_k_student.fit(binary_class_training_data, binary_class_training_labels)
predictions_for_knn_opt_k_student = knn_model_opt_k_student.predict(binary_class_testing_data)
print "Accuracy for best k - variant by student is ", metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_opt_k_student)
In the next set of tasks, you will use Decision trees (see url http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier ) for classification.
###Do not make any change below
def plot_dtree(model, fileName):
    #You would have to install the Python package pydot
    #You would also have to install Graphviz for your system - see http://www.graphviz.org/Download.php
    #If you get any pydot error, see url
    # http://stackoverflow.com/questions/15951748/pydot-and-graphviz-error-couldnt-import-dot-parser-loading-of-dot-files-will
    dot_tree_data = StringIO.StringIO()
    tree.export_graphviz(model, out_file=dot_tree_data)
    dtree_graph = pydot.graph_from_dot_data(dot_tree_data.getvalue())
    dtree_graph.write_png(fileName)
#task t2a
#Create a CART decision tree with DEFAULT values
#Remember to set the random state to 1234
#Change the following line as appropriate
cart_model_default = DecisionTreeClassifier(random_state = 1234)
cart_model_default.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_default)
fileName = 'dtree_default.png'
plot_dtree(cart_model_default, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_default = cart_model_default.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_default)
#task t2b
#Create a CART decision tree with splitting criterion as entropy
#Remember to set the random state to 1234
#Change the following line as appropriate
cart_model_entropy = DecisionTreeClassifier(random_state = 1234, criterion = "entropy")
cart_model_entropy.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_entropy)
fileName = 'dtree_entropy.png'
plot_dtree(cart_model_entropy, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_entropy = cart_model_entropy.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_entropy)
#task t2c
#Create a CART decision tree with splitting criterion as entropy and min_samples_leaf as 100
#Remember to set the random state to 1234
#Change the following line as appropriate
cart_model_entropy_limit_leaves = DecisionTreeClassifier(random_state = 1234, criterion = "entropy", min_samples_leaf = 100)
cart_model_entropy_limit_leaves.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_entropy_limit_leaves)
fileName = 'dtree_entropy_limit_leaves.png'
plot_dtree(cart_model_entropy_limit_leaves, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_entropy_limit_leaves = cart_model_entropy_limit_leaves.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_entropy_limit_leaves)
#(Harder) task t2d
#Create a CART decision tree that beats the three models above
# You might want to consult the url
# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
#Change the following line as appropriate
#######Many different ways to do this - a number of teams got it right.
#######If interested - please ask in Piazza post for how others did it.
cart_model_student = None #Create the model with parameters that beats the models above
cart_model_student.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_student)
fileName = 'dtree_model_student.png'
plot_dtree(cart_model_student, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_student = None #change this line to make predictions
print metrics.accuracy_score(binary_class_testing_labels, predictions_dtree_student)
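For reference, here is a minimal sketch of the kind of tuning task t2d asks for; the parameter values and variable names are illustrative assumptions, not the official answer, and may or may not beat the three models above.
#Hypothetical example: entropy criterion with depth and leaf-size regularization
cart_model_example = DecisionTreeClassifier(random_state=1234, criterion="entropy", max_depth=10, min_samples_leaf=5)
cart_model_example.fit(binary_class_training_data, binary_class_training_labels)
predictions_dtree_example = cart_model_example.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels, predictions_dtree_example)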
In this task, you will create a set of Naive Bayes classifiers and evaluate them. You might want to use the following url http://scikit-learn.org/stable/modules/naive_bayes.html
#task t3a
#Create a Gaussian NB
#Change the following line as appropriate
nb_gaussian_model = GaussianNB() #create the model
nb_gaussian_model.fit(binary_class_training_data, binary_class_training_labels)
print(nb_gaussian_model)
#Change the following line as appropriate
predictions_gaussian_nb = nb_gaussian_model.predict(binary_class_testing_data) #make the predictions
print metrics.accuracy_score(binary_class_testing_labels,predictions_gaussian_nb)
#task t3b
#Now create multinomial NB
#Change the following line as appropriate
nb_multinomial_model = MultinomialNB() #create the model
nb_multinomial_model.fit(binary_class_training_data, binary_class_training_labels)
print(nb_multinomial_model)
#Change the following line as appropriate
predictions_multinomial_nb = nb_multinomial_model.predict(binary_class_testing_data) #make the predictions
print metrics.accuracy_score(binary_class_testing_labels,predictions_multinomial_nb)
#task t3c
#Now create a Bernoulli (binomial) NB
#Change the following line as appropriate
nb_binomial_model = BernoulliNB() #create the model
nb_binomial_model.fit(binary_class_training_data, binary_class_training_labels)
print(nb_binomial_model)
#Change the following line as appropriate
predictions_binomial_nb = nb_binomial_model.predict(binary_class_testing_data) #make the predictions
print metrics.accuracy_score(binary_class_testing_labels,predictions_binomial_nb)
Let us test SVM on this dataset. You might want to read http://scikit-learn.org/stable/modules/svm.html for help. We will focus on SVC and LinearSVC.
#task t4a
#Create a SVM using SVC class. Remember to set random state to 1234.
#Change the following line as appropriate
svc_svm_model = SVC(random_state = 1234) #Change this line
svc_svm_model.fit(binary_class_training_data, binary_class_training_labels)
print(svc_svm_model)
#Change the following line as appropriate
predictions_svm_svc = svc_svm_model.predict(binary_class_testing_data) #Change this line
print metrics.accuracy_score(binary_class_testing_labels,predictions_svm_svc)
#task t4b
#Now create a linear SVM model using LinearSVC class. Remember to set random state to 1234.
#Change the following line as appropriate
linear_svc_svm_model = LinearSVC(random_state=1234)
linear_svc_svm_model.fit(binary_class_training_data, binary_class_training_labels)
print(linear_svc_svm_model)
#Change the following line as appropriate
predictions_linear_svm_svc = linear_svc_svm_model.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_linear_svm_svc)
#(Harder) task t4c
#####Either using SVC or LinearSVC, try tweaking the parameters so that you beat the two models above
#######Many different ways to do this - a number of teams got it right.
#######If interested - please ask in Piazza post for how others did it.
#Change the following line as appropriate
svc_svm_model_student = None
svc_svm_model_student.fit(binary_class_training_data, binary_class_training_labels)
print(svc_svm_model_student)
#Change the following line as appropriate
predictions_svm_svc_student = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_svm_svc_student)
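One possible sketch for task t4c, with hand-picked (illustrative, unverified) hyperparameters; treat it as a starting point rather than the official solution.
#Hypothetical example: RBF kernel with manually chosen C and gamma
svc_svm_model_example = SVC(random_state=1234, kernel='rbf', C=10, gamma=0.01)
svc_svm_model_example.fit(binary_class_training_data, binary_class_training_labels)
predictions_svm_svc_example = svc_svm_model_example.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels, predictions_svm_svc_example)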
Logistic regression is a simple classifier that adapts a linear regression model for classification by passing its output through the logistic (sigmoid) function. You can read the details at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
#task t5a
#Create a model with default parameters. Remember to set random state to 1234
#Change the following line as appropriate
lr_model_default = LogisticRegression(random_state = 1234)
lr_model_default.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_lr_model_default = lr_model_default.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_lr_model_default)
#(Harder) task t5b
#Now try to beat the model above by tweaking the parameters. Remember to set random state to 1234.
#######Many different ways to do this - a number of teams got it right.
#######If interested - please ask in Piazza post for how others did it.
#Change the following line as appropriate
lr_model_default_student = None
lr_model_default_student.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_lr_model_default_student = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_lr_model_default_student)
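One possible sketch for task t5b; the C value below is an illustrative assumption and is not guaranteed to beat the default model.
#Hypothetical example: weaker regularization via a larger C
lr_model_example = LogisticRegression(random_state=1234, C=10.0)
lr_model_example.fit(binary_class_training_data, binary_class_training_labels)
predictions_lr_example = lr_model_example.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels, predictions_lr_example)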
Perceptron is a simple model that can be used for linearly separable data. See details at url http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html .
#task t6a
#Create a perceptron model with default parameters and random state = 1234
#Change the following line as appropriate
perceptron_model_default = Perceptron(random_state = 1234)
perceptron_model_default.fit(binary_class_training_data, binary_class_training_labels)
print(perceptron_model_default)
#Change the following line as appropriate
predictions_perceptron_default = perceptron_model_default.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_perceptron_default)
Random Forest is a very popular ensemble method. See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html for details.
#task t7a
#Create a random forest classifier with Default parameters
#Change the following line as appropriate
rf_model_default = RandomForestClassifier(random_state = 1234)
rf_model_default.fit(binary_class_training_data, binary_class_training_labels)
print(rf_model_default)
#Change the following line as appropriate
predictions_rf_default = rf_model_default.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_rf_default)
#(Harder) task t7b
#Create a random forest classifier that can beat the above default model.
#######Many different ways to do this - a number of teams got it right.
#######If interested - please ask in Piazza post for how others did it.
#Change the following line as appropriate
rf_model_default_student = None
rf_model_default_student.fit(binary_class_training_data, binary_class_training_labels)
print(rf_model_default_student)
#Change the following line as appropriate
predictions_rf_default_student = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_rf_default_student)
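One possible sketch for task t7b; growing more trees than the default often helps, but the value below is an illustrative assumption, not the official answer.
#Hypothetical example: more trees than the default
rf_model_example = RandomForestClassifier(random_state=1234, n_estimators=100)
rf_model_example.fit(binary_class_training_data, binary_class_training_labels)
predictions_rf_example = rf_model_example.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels, predictions_rf_example)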
So far, we have been focusing on binary classification problems. Now let us consider the corresponding multi-class classification problem where, given an image, we have to predict which of the digits 0 to 9 it shows.
We will evaluate four classifiers: KNN, Decision Trees, Random Forest, and SVM. The first three are intrinsically multi-class, so no changes are needed. SVM, however, is by default a binary classifier. We can use one of two wrappers, OneVsRestClassifier or OneVsOneClassifier, to turn it into a multi-class classifier. See http://scikit-learn.org/stable/modules/multiclass.html for further details.
Make sure that you remember the variable names and their meaning:
#task t8a
#Create a KNN Classifier for K=5 and train it on **scaled** training data and test it on scaled testing data
#Change the following line as appropriate
mc_knn_model_k = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
mc_knn_model_k.fit(scaled_training_data, all_train_labels)
#Change the following line as appropriate
predictions_mc_knn_model = mc_knn_model_k.predict(scaled_testing_data)
print metrics.accuracy_score(all_test_labels, predictions_mc_knn_model)
#task t8b
#Create a Decision tree with DEFAULT parameters and train it on **scaled** training data and test it on scaled testing data
#Remember to set random state to 1234
#Change the following line as appropriate
mc_cart_model_default = DecisionTreeClassifier(random_state = 1234)
mc_cart_model_default.fit(scaled_training_data, all_train_labels)
print(mc_cart_model_default)
fileName = 'mc_dtree_default.png'
plot_dtree(mc_cart_model_default, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_mc_dtree_default = mc_cart_model_default.predict(scaled_testing_data)
print metrics.accuracy_score(all_test_labels, predictions_mc_dtree_default)
#(Harder) task t8c
#Using the multi-class decision tree above, make some changes so that you can beat it.
#######Many different ways to do this - a number of teams got it right.
#######If interested - please ask in Piazza post for how others did it.
# Remember to set random state to 1234
#Change the following line as appropriate
mc_cart_model_student = None
mc_cart_model_student.fit(scaled_training_data, all_train_labels)
print(mc_cart_model_student)
fileName = 'mc_dtree_student.png'
plot_dtree(mc_cart_model_student, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_mc_dtree_student = None
print metrics.accuracy_score(all_test_labels, predictions_mc_dtree_student)
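One possible sketch for task t8c; the settings below are illustrative assumptions, not the official solution, and may or may not beat the default multi-class tree.
#Hypothetical example: entropy criterion with a minimum leaf size
mc_cart_model_example = DecisionTreeClassifier(random_state=1234, criterion="entropy", min_samples_leaf=5)
mc_cart_model_example.fit(scaled_training_data, all_train_labels)
predictions_mc_dtree_example = mc_cart_model_example.predict(scaled_testing_data)
print metrics.accuracy_score(all_test_labels, predictions_mc_dtree_example)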
#task t8d
#Create a multi class classifier based on random forest with default parameters.
#Change the following line as appropriate
mc_rf_model_default = RandomForestClassifier(random_state =1234)
mc_rf_model_default.fit(all_train_data, all_train_labels)
print(mc_rf_model_default)
#Change the following line as appropriate
predictions_mc_rf_default = mc_rf_model_default.predict(all_test_data)
print metrics.accuracy_score(all_test_labels,predictions_mc_rf_default)
#(Harder) task t8e
#Tune the random forest classifier so that it beats the default model above
#######Many different ways to do this - a number of teams got it right.
#######If interested - please ask in Piazza post for how others did it.
#Change the following line as appropriate
mc_rf_model_student = None
mc_rf_model_student.fit(all_train_data, all_train_labels)
print(mc_rf_model_student)
#Change the following line as appropriate
predictions_mc_rf_student = None
print metrics.accuracy_score(all_test_labels,predictions_mc_rf_student)
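One possible sketch for task t8e, again with an illustrative (unverified) number of trees; this can take a while to train on the full 60K-image training set.
#Hypothetical example: more trees than the default
mc_rf_model_example = RandomForestClassifier(random_state=1234, n_estimators=100)
mc_rf_model_example.fit(all_train_data, all_train_labels)
predictions_mc_rf_example = mc_rf_model_example.predict(all_test_data)
print metrics.accuracy_score(all_test_labels, predictions_mc_rf_example)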
#task t8f
#Create an SVM-based OneVsRestClassifier. Set random state to 1234
#Change the following line as appropriate
mc_ovr_linear_svc_svm_model = OneVsRestClassifier(LinearSVC(random_state =1234))
mc_ovr_linear_svc_svm_model.fit(scaled_training_data, all_train_labels)
print(mc_ovr_linear_svc_svm_model)
#Change the following line as appropriate
predictions_mc_ovr_linear_svm_svc = mc_ovr_linear_svc_svm_model.predict(scaled_testing_data)
print metrics.accuracy_score(all_test_labels,predictions_mc_ovr_linear_svm_svc)
#(Harder) task t8g
#Tune the model above so that it beats the default classifier. Set random state to 1234
#######Many different ways to do this - a number of teams got it right.
#######If interested - please ask in Piazza post for how others did it.
#Change the following line as appropriate
mc_ovr_linear_svc_svm_model_student = None
mc_ovr_linear_svc_svm_model_student.fit(scaled_training_data, all_train_labels)
print(mc_ovr_linear_svc_svm_model_student)
#Change the following line as appropriate
predictions_mc_ovr_linear_svm_svc_student = None
print metrics.accuracy_score(all_test_labels,predictions_mc_ovr_linear_svm_svc_student)
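One possible sketch for task t8g; the C value passed to LinearSVC is an illustrative assumption and is not guaranteed to beat the default one-vs-rest model.
#Hypothetical example: a smaller C for stronger regularization inside the one-vs-rest wrapper
mc_ovr_example = OneVsRestClassifier(LinearSVC(random_state=1234, C=0.01))
mc_ovr_example.fit(scaled_training_data, all_train_labels)
predictions_mc_ovr_example = mc_ovr_example.predict(scaled_testing_data)
print metrics.accuracy_score(all_test_labels, predictions_mc_ovr_example)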
#task t8h
#Create an SVM-based OneVsOneClassifier. Set random state to 1234
#Change the following line as appropriate
mc_ovo_linear_svc_svm_model = OneVsOneClassifier(LinearSVC(random_state =1234))
mc_ovo_linear_svc_svm_model.fit(scaled_training_data, all_train_labels)
print(mc_ovo_linear_svc_svm_model)
#Change the following line as appropriate
predictions_mc_ovo_linear_svm_svc = mc_ovo_linear_svc_svm_model.predict(scaled_testing_data)
print metrics.accuracy_score(all_test_labels,predictions_mc_ovo_linear_svm_svc)
#(Harder) task t8i
#Tune the model so that it beats the classifier above. Set random state to 1234
#######Many different ways to do this - a number of teams got it right.
#######If interested - please ask in Piazza post for how others did it.
#Change the following line as appropriate
mc_ovo_linear_svc_svm_model_student = None
mc_ovo_linear_svc_svm_model_student.fit(scaled_training_data, all_train_labels)
print(mc_ovo_linear_svc_svm_model_student)
#Change the following line as appropriate
predictions_mc_ovo_linear_svm_svc_student = None
print metrics.accuracy_score(all_test_labels,predictions_mc_ovo_linear_svm_svc_student)
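One possible sketch for task t8i, mirroring the one-vs-rest tweak above with an illustrative C value (not the official answer).
#Hypothetical example: the same kind of C tweak inside the one-vs-one wrapper
mc_ovo_example = OneVsOneClassifier(LinearSVC(random_state=1234, C=0.1))
mc_ovo_example.fit(scaled_training_data, all_train_labels)
predictions_mc_ovo_example = mc_ovo_example.predict(scaled_testing_data)
print metrics.accuracy_score(all_test_labels, predictions_mc_ovo_example)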
Let us evaluate different metrics for the classification models (both multi-class and binary) that we created so far. You may want to check http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics for additional details.
#task t9a
#assign the best multi class classification model that you found above.
# For example, depending on which decision tree model is best,
# set best_dtree_model to either mc_cart_model_default or mc_cart_model_student
# Similarly do for other models as well.
# Remember to assign the MODEL variable
#Change the following lines as appropriate
best_knn_model_mc = mc_knn_model_k
best_dtree_model_mc = mc_cart_model_default
best_rf_model_mc = mc_rf_model_default
best_svm_ovr_model_mc = mc_ovr_linear_svc_svm_model
best_svm_ovo_model_mc = mc_ovo_linear_svc_svm_model
#Assign the best BINARY classification models that you found above
best_knn_model_bc = knn_model_opt_k
best_dtree_model_bc = cart_model_entropy
best_rf_model_bc = rf_model_default
best_svm_model_bc = svc_svm_model
######Note - for tasks 9c and 9d you will use multi class models (ie. those variables ending in _mc)
############# For the remaining you will use binary class models (ie. those variables ending in _bc)
#task t9b
#Create predictions of these 5 models on test dataset
#Change the following lines as appropriate
predictions_best_knn_model_mc = predictions_mc_knn_model
predictions_best_dtree_model_mc = predictions_mc_dtree_default
predictions_best_rf_model_mc = predictions_mc_rf_default
predictions_best_svm_ovr_model_mc = predictions_mc_ovr_linear_svm_svc
predictions_best_svm_ovo_model_mc = predictions_mc_ovo_linear_svm_svc
#Create predictions of these 4 models on test dataset
#Change the following lines as appropriate
predictions_best_knn_model_bc = predictions_for_knn_opt_k
predictions_best_dtree_model_bc = predictions_dtree_entropy
predictions_best_rf_model_bc = predictions_rf_default
predictions_best_svm_model_bc = predictions_svm_svc
#task t9c
#Print the classification report for each of the models above
# Write code here
print "Classification report for : knn_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_knn_model_mc)
print "Classification report for : dtree_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_dtree_model_mc)
print "Classification report for : rf_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_rf_model_mc)
print "Classification report for : svm_ovr_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_svm_ovr_model_mc)
print "Classification report for : svm_ovo_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_svm_ovo_model_mc)
print "Classification report for : knn_model_bc"
print metrics.classification_report(binary_class_testing_labels, predictions_best_knn_model_bc)
print "Classification report for : dtree_model_bc"
print metrics.classification_report(binary_class_testing_labels, predictions_best_dtree_model_bc)
print "Classification report for : rf_model_bc"
print metrics.classification_report(binary_class_testing_labels, predictions_best_rf_model_bc)
print "Classification report for : svm_model_bc"
print metrics.classification_report(binary_class_testing_labels, predictions_best_svm_model_bc)
#task t9d
#Print the confusion matrix for each of the models above
# Write code here
print "Confusionmatrix for : knn_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_knn_model_mc)
print "Confusionmatrix for : dtree_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_dtree_model_mc)
print "Confusionmatrix for : rf_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_rf_model_mc)
print "Confusionmatrix for : svm_ovr_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_svm_ovr_model_mc)
print "Confusionmatrix for : svm_ovo_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_svm_ovo_model_mc)
print "Confusionmatrix for : knn_model_bc"
print metrics.confusion_matrix(binary_class_testing_labels, predictions_best_knn_model_bc)
print "Confusionmatrix for : dtree_model_bc"
print metrics.confusion_matrix(binary_class_testing_labels, predictions_best_dtree_model_bc)
print "Confusionmatrix for : rf_model_bc"
print metrics.confusion_matrix(binary_class_testing_labels, predictions_best_rf_model_bc)
print "Confusionmatrix for : svm_model_bc"
print metrics.confusion_matrix(binary_class_testing_labels, predictions_best_svm_model_bc)
#(Harder) task t9e
#Each of the models above has some probabilistic interpretation,
# so sklearn allows you to get the probability values as part of classification.
# Using this information, you can compute the ROC curve.
#Print the ROC curve for each of the models above
# Write code here
predictions_knn = [prob[1] for prob in best_knn_model_bc.predict_proba(binary_class_testing_data)]
knn_fpr,knn_tpr,knn_threshold=metrics.roc_curve(binary_class_testing_labels,predictions_knn,pos_label=9)
predictions_dtree = [prob[1] for prob in best_dtree_model_bc.predict_proba(binary_class_testing_data)]
dtree_fpr,dtree_tpr,dtree_threshold=metrics.roc_curve(binary_class_testing_labels,predictions_dtree,pos_label=9)
predictions_rf = [prob[1] for prob in best_rf_model_bc.predict_proba(binary_class_testing_data)]
rf_fpr,rf_tpr,rf_threshold=metrics.roc_curve(binary_class_testing_labels,predictions_rf,pos_label=9)
svm_fpr, svm_tpr, svm_threshold = metrics.roc_curve(binary_class_testing_labels,predictions_best_svm_model_bc,pos_label=9)
#print best_rf_model_bc.classes_
print "K-NN Model:"
print "False-positive rate:", knn_fpr
print "True-positive rate: ", knn_tpr
print "Thresholds: ", knn_threshold
print "\nD-Tree Model:"
print "False-positive rate:", dtree_fpr
print "True-positive rate: ", dtree_tpr
print "Thresholds: ", dtree_threshold
print "\nRf Model:"
print "False-positive rate:", rf_fpr
print "True-positive rate: ", rf_tpr
print "Thresholds: ", rf_threshold
print "\nSVM Model:"
print "False-positive rate:", svm_fpr
print "True-positive rate: ", svm_tpr
print "Thresholds: ", svm_threshold
#(Harder) task t9f
#For each model, DRAW the ROC curve
# See url below for details
#http://nbviewer.ipython.org/github/datadave/GADS9-NYC-Spring2014-Lectures/blob/master/lessons/lesson09_decision_trees_random_forests/sklearn_decision_trees.ipynb
# Write code here
fig_knn = plt.figure()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC for KNN Model')
plt.plot(knn_fpr, knn_tpr)
fig_dtree = plt.figure()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC for DTree Model')
plt.plot(dtree_fpr, dtree_tpr)
fig_rf = plt.figure()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC for RF Model')
plt.plot(rf_fpr, rf_tpr)
fig_svm = plt.figure()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC for SVM Model')
plt.plot(svm_fpr, svm_tpr)
#(Harder) task t9g
#Print the AUC value for each of the models above
# Write code here
print "AUC Value for KNN ", metrics.auc(knn_fpr, knn_tpr)
print "AUC Value for DTree ", metrics.auc(dtree_fpr, dtree_tpr)
print "AUC Value for RF ", metrics.auc(rf_fpr, rf_tpr)
print "AUC Value for SVM ", metrics.auc(svm_fpr, svm_tpr)
#(Harder) task t9h
#Print the precision recall curve for each of the models above
# Write code here
#print the curve based on http://scikit-learn.org/stable/auto_examples/plot_precision_recall.html
predictions_knn = [prob[1] for prob in best_knn_model_bc.predict_proba(binary_class_testing_data)]
knn_precision, knn_recall, knn_threshold = metrics.precision_recall_curve(
binary_class_testing_labels, predictions_knn, pos_label=9)
predictions_dtree = [prob[1] for prob in best_dtree_model_bc.predict_proba(binary_class_testing_data)]
dtree_precision,dtree_recall,dtree_threshold = metrics.precision_recall_curve(
binary_class_testing_labels, predictions_dtree,pos_label=9)
predictions_rf = [prob[1] for prob in best_rf_model_bc.predict_proba(binary_class_testing_data)]
rf_precision,rf_recall,rf_threshold = metrics.precision_recall_curve(
binary_class_testing_labels, predictions_rf,pos_label=9)
svm_precision,svm_recall,svm_threshold = metrics.precision_recall_curve(
binary_class_testing_labels,predictions_best_svm_model_bc,pos_label=9)
fig_knn = plt.figure()
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.title('Precision-Recall Curve for KNN Model')
plt.xlim([0.0, 1.0])
plt.plot(knn_precision, knn_recall)
fig_dtree = plt.figure()
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.title('Precision-Recall Curve for DTree Model')
plt.xlim([0.0, 1.0])
plt.plot(dtree_precision, dtree_recall)
fig_rf = plt.figure()
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.title('Precision-Recall Curve for RF Model')
plt.xlim([0.0, 1.0])
plt.plot(rf_precision, rf_recall)
fig_svm = plt.figure()
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.title('Precision-Recall Curve for SVM Model')
plt.xlim([0.0, 1.0])
plt.plot(svm_precision, svm_recall)
So far in this assignment, you manually tweaked each model until it became better. For complex models, this is often cumbersome. A common trick is Grid Search, where you exhaustively test various parameter combinations and pick the best set of parameter values. This is a VERY computationally intensive process and hence it will require some parallelization.
In this assignment, you will learn how to tune two models (a Random Forest and an SVC SVM) for the MNIST dataset and then parallelize the search so as to get results faster. You might want to take a look at http://scikit-learn.org/stable/modules/grid_search.html for additional details.
One thing to note is that GridSearchCV uses cross validation for comparing models, so you have to send it the ENTIRE dataset (i.e. mnist.data and mnist.target) rather than a pre-made train/test split. The following cell creates two variables, all_scaled_data and all_scaled_target, that you can pass to GridSearchCV. In order to get the results in reasonable time, set the cv parameter of GridSearchCV to 3. Also remember to set the verbose parameter to 2 to get some details about what happens internally.
###Do not make any change below
all_scaled_data = binary_class_data / 255.0
all_scaled_target = binary_class_labels
#(Harder) task t10a
#Tuning SVC SVM model for MNIST
tuned_parameters = [{'kernel' : ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},
{'kernel' : ['poly'], 'degree' : [5, 9], 'C' : [1, 10]}]
#Create a SVC SVM classifier object and tune it using GridSearchCV function.
#Change the following line as appropriate
#replace this line with gridsearchcv
svm = GridSearchCV(SVC(random_state=1234), param_grid=tuned_parameters, cv=3, verbose=2)
svm.fit(all_scaled_data,all_scaled_target)
predictions_svm = svm.predict(all_scaled_data)
#print the details of the best model and its accuracy
print "Best model:", svm
print "Best params learned via GridSearch", svm.best_estimator_
print "Accuracy of best model", svm.best_score_
print "Accuracy of learned model", metrics.accuracy_score(all_scaled_target,predictions_svm)
#(Harder) task t10b
#Tuning Random Forest for MNIST
tuned_parameters = [{'max_features': ['sqrt', 'log2'], 'n_estimators': [1000, 1500]}]
#replace this line with gridsearchcv. Remember to pass
# the following parameters to the constructor of RandomForestClassifier: min_samples_split=1, compute_importances=False, n_jobs=-1
#Change the following line as appropriate
rf = GridSearchCV(RandomForestClassifier(min_samples_split=1,compute_importances=False,n_jobs=-1, random_state=1234), tuned_parameters,cv=3,verbose=2)
rf.fit(all_scaled_data,all_scaled_target)
predictions_rf = rf.predict(all_scaled_data)
#print the details of the best model and its accuracy
print "Best model:", rf
print "Best params learned via GridSearch", rf.best_estimator_
print "Accuracy of best mode", rf.best_score_
print "Accuracy of learned model", metrics.accuracy_score(all_scaled_target,predictions_rf)
#(Harder) task t10c
#Parallelization
#Now re-run the tuning of SVC SVM model. But parallelize it via n_jobs option.
tuned_parameters = [{'kernel' : ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},
{'kernel' : ['poly'], 'degree' : [5, 9], 'C' : [1, 10]}]
#Change the following line as appropriate
#replace this line with gridsearchcv
svm_parallel = GridSearchCV(SVC(random_state=1234), tuned_parameters,cv=3,verbose=2,n_jobs=8)
svm_parallel.fit(all_scaled_data,all_scaled_target)
predictions_svm_parallel = svm_parallel.predict(all_scaled_data)
#print the details of the best model and its accuracy
print "Best model:", svm_parallel
print "Best params learned via GridSearch", svm_parallel.best_estimator_
print "Accuracy of best mode", svm_parallel.best_score_
print "Accuracy of learned model", metrics.accuracy_score(all_scaled_target, predictions_svm_paralell)
We will now implement some regression routines for predicting the house prices in Boston.
#Do not make any changes in this cell
boston = load_boston()
print boston.data.shape
print boston.feature_names
print np.max(boston.target), np.min(boston.target), np.mean(boston.target)
print boston.DESCR
#Do not make any changes in this cell.
print boston.data[0]
print np.max(boston.data), np.min(boston.data), np.mean(boston.data)
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=1234)
#Scale the data - important for regression. Learn what this function does
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)
X_train = scalerX.transform(X_train)
y_train = scalery.transform(y_train)
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test)
print np.max(X_train), np.min(X_train), np.mean(X_train), np.max(y_train), np.min(y_train), np.mean(y_train)
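If StandardScaler is new to you, here is a tiny standalone illustration (the toy array below is made up purely for this example): fit learns the per-column mean and standard deviation, and transform subtracts the mean and divides by the standard deviation.
#Toy illustration of StandardScaler (not part of the assignment data)
toy = np.array([[1.0], [2.0], [3.0]])
toy_scaler = StandardScaler().fit(toy)
print toy_scaler.mean_              # learned column mean: [ 2.]
print toy_scaler.transform(toy)     # standardized values, roughly [[-1.22], [0.], [1.22]]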
#task t11a
#Create 13 scatter plots such that the variables (CRIM to LSTAT) are on the X axis and MEDV is on the Y axis.
# Organize the plots in a grid with 3 rows of 4 images each and 1 image in the last row
#Write code here
plt.figure()
fig_t11a, axes_t11a = plt.subplots(4, 4, figsize=(15, 20))
t11_img_index = 0
for i in range(boston.feature_names.size):
    t11a_row, t11a_col = i // 4, i % 4
    axes_t11a[t11a_row][t11a_col].scatter(boston.data[:, i], boston.target)
    axes_t11a[t11a_row][t11a_col].set_title("Scatter Plot for " + boston.feature_names[i] + ' and MEDV')
    axes_t11a[t11a_row][t11a_col].set_xlabel(boston.feature_names[i])
    axes_t11a[t11a_row][t11a_col].set_ylabel('MEDV')
#Do not make any change here
#To make your life easy, I have created a function that
# (a) takes a regressor object, (b) trains it, (c) makes some predictions, and (d) evaluates the predictions
def train_and_evaluate(clf, X_train, y_train):
    clf.fit(X_train, y_train)
    print "Coefficient of determination on training set:", clf.score(X_train, y_train)
    cv = KFold(X_train.shape[0], 5, shuffle=True, random_state=1234)
    scores = cross_val_score(clf, X_train, y_train, cv=cv)
    print "Average coefficient of determination using 5-fold cross-validation:", np.mean(scores)

def plot_regression_fit(actual, predicted):
    plt.figure()
    plt.scatter(actual, predicted)
    plt.plot([0, 50], [0, 50], '--k')
    plt.axis('tight')
    plt.xlabel('True price ($1000s)')
    plt.ylabel('Predicted price ($1000s)')
#task t11b
#create a regressor object based on LinearRegression
# See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
#Change the following line as appropriate
clf_ols = LinearRegression() #change this line
train_and_evaluate(clf_ols,X_train,y_train)
clf_ols_predicted = clf_ols.predict(X_test)
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_ols_predicted))
#task t11c
#See http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
#Create a regression based on Support Vector Regressor. Set the kernel to linear
#Change the following line as appropriate
clf_svr= SVR(kernel='linear')
train_and_evaluate(clf_svr,X_train,y_train)
clf_svr_predicted = clf_svr.predict(X_test)
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_predicted))
#task t11d
#Create a regression based on Support Vector Regressor. Set the kernel to polynomial
#Change the following line as appropriate
clf_svr_poly= SVR(kernel='poly')
train_and_evaluate(clf_svr_poly,X_train,y_train)
clf_svr_poly_predicted = clf_svr_poly.predict(X_test)
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_poly_predicted))
#task t11e
#Create a regression based on Support Vector Regressor. Set the kernel to rbf
#Change the following line as appropriate
clf_svr_rbf= SVR(kernel='rbf')
train_and_evaluate(clf_svr_rbf,X_train,y_train)
clf_svr_rbf_predicted = clf_svr_rbf.predict(X_test)
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_rbf_predicted))
#task t11f
#See http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
#Create regression tree
#Change the following line as appropriate
clf_cart = DecisionTreeRegressor()
train_and_evaluate(clf_cart,X_train,y_train)
clf_cart_predicted = clf_cart.predict(X_test)
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_cart_predicted))
#task t11g
#See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
#Create a random forest regressor with 10 estimators and random state as 1234
#Change the following line as appropriate
clf_rf= RandomForestRegressor(n_estimators=10, random_state=1234)
train_and_evaluate(clf_rf,X_train,y_train)
#The following line prints the most important features
print sorted(zip(clf_rf.feature_importances_, boston.feature_names), reverse=True)
clf_rf_predicted = clf_rf.predict(X_test)
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_rf_predicted))