Programming Assignment 2: Classification of MNIST Handwritten Digits and Regression of House Prices on the Boston Dataset

Team Details:

When submitting, fill in your team details in this cell. Note that this is a markdown cell.

Student 1 Full Name: Student 1 Full ID: Student 1 Full Email Address:

Student 2 Full Name: Student 2 Full ID: Student 2 Full Email Address:

Student 3 Full Name: Student 3 Full ID: Student 3 Full Email Address:

Assignment Details

At a high level, you will build and evaluate different classifiers for recognizing handwritten digits from the MNIST dataset, and also build and evaluate various regression models for predicting house prices in Boston. At a more granular level, you will be doing the following:

1. Binary Classification of MNIST Dataset

In the first set of tasks, you will evaluate a number of popular classifiers for the task of recognizing handwritten digits from the MNIST dataset. Specifically, we will focus on distinguishing between 7 and 9, which are known to be a hard pair. We will not use any sophisticated ideas from Computer Vision/Image Processing; we will run classifiers directly over the data. The idea is to show that, a lot of the time, you can simply run a set of classifiers and still get great results. While I will be giving some basic classifier code, you will have some opportunity to improve it by tuning the parameters.

2. Multi-Class Classification of MNIST Dataset

In the second set of tasks, we will do multi-class classification, where the idea is to classify an image as one of the ten digits (0-9). We will start with some basic classifiers that are intrinsically multi-class. Then we will learn how to convert binary classifiers into multi-class classifiers and how scikit-learn makes this very easy.

3. Exploration of Different Evaluation Metrics

In the first two sets of tasks, we will narrowly focus on accuracy: what fraction of our predictions were correct. However, there are a number of other popular evaluation metrics. You will learn how (and when) to use them.

4. Parameter Tuning through Grid Search/Cross Validation and Parallelization

This is an advanced topic where you will learn how to tune your classifier and find optimal parameters. We will explore two powerful techniques: grid search and cross validation. This is a very compute-intensive task, so you will also explore how to leverage the parallelization capabilities of the IPython kernel to get results sooner.

5. Evaluation of Various Regression Models for Boston Houses Dataset

In the final set of tasks, we will use regression to predict Boston house prices. We will explore Ordinary Least Squares as well as the regression variants of popular classifiers such as decision trees and SVMs.

In [1]:
%matplotlib inline 

#Array processing
import numpy as np

#Data analysis, wrangling and common exploratory operations
import pandas as pd
from pandas import Series, DataFrame

#For visualization. Matplotlib for basic viz and seaborn for more stylish figures + statistical figures not in MPL.
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.display import Image

from sklearn.datasets import fetch_mldata, load_boston                                                                       
from sklearn.utils import shuffle                                                                                            
from sklearn.neighbors import KNeighborsClassifier                                                                           
from sklearn import metrics                                                                                                  
from sklearn import tree                                                                                                     
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor                                                       
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB                                                       
from sklearn.svm import SVC, LinearSVC , SVR                                                                                 
from sklearn.linear_model import Perceptron, LogisticRegression, LinearRegression                                            
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor                                                    
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier                                                       
from sklearn.cross_validation import KFold, train_test_split, cross_val_score                                                
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV

########You have to install the python package pydot for generating some graph figures
########   If you are using pip, the command is pip install pydot
########You will also need to download Graphviz from http://www.graphviz.org/Download.php
########If you are using Windows, make sure that Graphviz/dot are in path and can be used by pydot package.
########If you get any pydot error, see url for one possible solution
######## http://stackoverflow.com/questions/15951748/pydot-and-graphviz-error-couldnt-import-dot-parser-loading-of-dot-files-will
######## If you are using Anaconda, pydot can be installed as conda install pydot
######## If that does not work check url:
########   http://stackoverflow.com/questions/27482170/installing-pydot-and-graphviz-packages-in-anaconda-environment

import pydot, StringIO 


########################If needed you can import additional packages for helping you, although I would discourage it
########################Put all your imports in the space below. If you use some strange package, 
##########################provide clear information as to how we can install it.

#######################End imports###################################

Part 1: Binary Classification of MNIST Dataset

In the first set of tasks, you will evaluate a number of popular classifiers for the task of recognizing handwritten digits from the MNIST dataset. Specifically, we will focus on distinguishing between 7 and 9, which are known to be a hard pair. We will not use any sophisticated ideas from Computer Vision/Image Processing; we will use classifiers directly over the data.

In []:
####################Do not change anything below
#Load MNIST data. fetch_mldata will download the dataset and put it in a folder called mldata. 
#Some things to be aware of:
#   The folder mldata will be created in the folder in which you started the notebook
#   So to make your life easy, always start the IPython notebook from the same folder.
#   Otherwise, the following code will re-download the MNIST data each time.
mnist = fetch_mldata("MNIST original")                      
#The data is organized as follows:
#  Each row corresponds to an image
#  Each image has 28*28 pixels, which are linearized into a vector of size 784 (i.e. 28*28)
# mnist.data gives the image information while mnist.target gives the number in the image
print "#Images = %d and #Pixel per image = %s" % (mnist.data.shape[0], mnist.data.shape[1])

#Print first row of the dataset 
img = mnist.data[0]                                                                                                          
print "First image shows %d" % (mnist.target[0])
print "The corresponding matrix version of image is \n" , img
print "The image in grey shape is "
plt.imshow(img.reshape(28, 28), cmap="Greys")                                                                                
                                                                                                                             
#First 60K images are for training and last 10K are for testing
all_train_data = mnist.data[:60000]                                                                                          
all_test_data = mnist.data[60000:]                                                                                           
all_train_labels = mnist.target[:60000]                                                                                      
all_test_labels = mnist.target[60000:]                                                                                       
                                                              
                                                                                                                             
#For the first task, we will be doing binary classification and focus on one pair of
#  digits: 7 and 9, which are known to be hard to distinguish
#Get all the seven images
sevens_data = mnist.data[mnist.target==7]      
#Get all the nine images
nines_data = mnist.data[mnist.target==9]       
#Merge them to create a new dataset
binary_class_data = np.vstack([sevens_data, nines_data])    
binary_class_labels = np.hstack([np.repeat(7, sevens_data.shape[0]), np.repeat(9, nines_data.shape[0])])    
 
#In order to make the experiments repeatable, we will seed the random number generator to a known value
# That way the results of the experiments will always be same
np.random.seed(1234)                        
#randomly shuffle the data
binary_class_data, binary_class_labels = shuffle(binary_class_data, binary_class_labels)  
print "Shape of data and labels are :" , binary_class_data.shape, binary_class_labels.shape   

#There are approximately 14K images of 7 and 9. 
#Let us take the first 5000 as training and remaining as test data                                          
orig_binary_class_training_data = binary_class_data[:5000]                                                  
binary_class_training_labels = binary_class_labels[:5000]                                                   
orig_binary_class_testing_data = binary_class_data[5000:]                                                   
binary_class_testing_labels = binary_class_labels[5000:] 

#The images are in grey scale where each pixel value is between 0 and 255
# Now let us normalize them so that the values are between 0 and 1. 
# This will be the only modification we will make to the image
binary_class_training_data = orig_binary_class_training_data / 255.0                                        
binary_class_testing_data = orig_binary_class_testing_data / 255.0                                          
scaled_training_data = all_train_data / 255.0                                                                                
scaled_testing_data = all_test_data / 255.0  

print binary_class_training_data[0,:]                                                                       
     
###########Make sure that you remember the variable names and their meaning
#binary_class_training_data, binary_class_training_labels: Normalized images of 7 and 9 and the correct labels for training
#binary_class_testing_data, binary_class_testing_labels : Normalized images of 7 and 9 and correct labels for testing
#orig_binary_class_training_data, orig_binary_class_testing_data: Unnormalized images of 7 and 9
#all_train_data, all_test_data: unnormalized images of all digits 
#all_train_labels, all_test_labels: labels for all digits
#scaled_training_data, scaled_testing_data: Normalized version of all_train_data, all_test_data for all digits

Binary Classification in scikit-learn

All classifiers in scikit-learn follow a common pattern that makes life much easier. Follow these steps for all the tasks below.

  1. Instantiate the classifier with appropriate parameters
  2. Train/fit the classifier with training data and correct labels
  3. Test the classifier with unseen data
  4. Evaluate the performance of classifier

K-Nearest Neighbor

We will start with one of the simplest classifiers. In the cell below, I have given the code for k-NN with k=1. This should give you an idea of how to create, train, and evaluate a classifier.

In [2]:
# Do not change anything in this cell.
# The following are some utility functions to help you visualize k-NN


#Code courtesy AMPLab

#This function displays one or more images in a grid manner.
def show_img_with_neighbors(imgs, n=1):                       
  fig = plt.figure()                                          
  for i in xrange(0, n):                                      
      fig.add_subplot(1, n, i + 1, xticklabels=[], yticklabels=[])  #subplot indices are 1-based
      if n == 1:                                              
          img = imgs                                          
      else:                                                   
          img = imgs[i]                                       
      plt.imshow(img.reshape(28, 28), cmap="Greys")           

#This function shows some images for which k-NN made a mistake
# For each missed image, it also shows the k most similar training images so that you get an idea of why it failed. 
def show_erroring_images_for_model(errors_in_model, num_img_to_print, model, n_neighbors): 
  for errorImgIndex in errors_in_model[:num_img_to_print]:                             
      error_image = binary_class_testing_data[errorImgIndex]                           
      not_needed, result = model.kneighbors(error_image, n_neighbors=n_neighbors)      
      show_img_with_neighbors(error_image)                                             
      show_img_with_neighbors(binary_class_training_data[result[0],:], len(result[0])) 
In []:
# Do not change anything in this cell.
#The code below creates a K-NN classifier with k=1.
#Clearly observe how I do it step by step.

#Step 1: Create a classifier with appropriate parameters
knn_model_k1 = KNeighborsClassifier(n_neighbors=1, algorithm='brute')
#Step 2: Fit it with training data
knn_model_k1.fit(binary_class_training_data, binary_class_training_labels)
#Print the model so that you know all parameters
print(knn_model_k1)                             
#Step 3: Make predictions based on testing data
predictions_knn_model_k1 = knn_model_k1.predict(binary_class_testing_data)
#Step 4: Evaluate the data
print "Accuracy of K-NN with k=1 is", metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k1)  

#Let us now look at the first five images that were predicted incorrectly and see what the issue is
errors_knn_model_k1 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k1[i] != binary_class_testing_labels[i]]
show_erroring_images_for_model(errors_knn_model_k1, 5, knn_model_k1, 1)                                                      
In []:
#task t1a
#Using the above code as model, create a KNN classifier with k=3
#Change the following line as appropriate
knn_model_k3 = None
knn_model_k3.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_knn_model_k3 = None                                                 
print metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k3)

#Let us now look at the first five images that were predicted incorrectly and see what the issue is
errors_knn_model_k3 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k3[i] != binary_class_testing_labels[i]]
show_erroring_images_for_model(errors_knn_model_k3, 5, knn_model_k3, 3)                                                      
In []:
#task t1b
#Using the above code as model, create a KNN classifier with k=5
#Change the following line as appropriate
knn_model_k5 = None
knn_model_k5.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_knn_model_k5 = None                                                 
print metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k5)

#Let us now look at the first five images that were predicted incorrectly and see what the issue is
errors_knn_model_k5 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k5[i] != binary_class_testing_labels[i]]
show_erroring_images_for_model(errors_knn_model_k5, 5, knn_model_k5, 5)                                                      
In []:
#task t1c
#Now let us evaluate k-NN for different values of k (1 to 19) and find the best k
#WARNING: This code will take 20-40 minutes to run. So make sure your code is correct
k_vals = xrange(1, 20)
#Initialize to 0
accuracy_vals = [0 for _ in k_vals] 
for k in k_vals:
    #Create a KNN with number of neighbors = k  - #Change the following line as appropriate
    knn_model_for_knn_k = None
    knn_model_for_knn_k.fit(binary_class_training_data, binary_class_training_labels)
    #Make the prediction - #Change the following line as appropriate
    predictions_for_knn_k = None
    accuracy_vals[k-1] = metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_k)
    
#Now you have two arrays k_vals which have different values of k and accuracy_vals which have the corresponding accuracy 
# of a model with that k
#Set opt_k to the k that gives the best accuracy. 
# Hint: use the np.argmax command.
# Also don't forget to add 1 to the argmax result to correct the off-by-one error 
#     (as k varies from 1 to 19 while accuracy_vals is indexed from 0)
opt_k = None ##Change the following line as appropriate
In []:
#task t1d
#Train the model with optimal k that we have seen so far
#Change the following line as appropriate
knn_model_opt_k = None
knn_model_opt_k.fit(binary_class_training_data, binary_class_training_labels)
predictions_for_knn_opt_k = knn_model_opt_k.predict(binary_class_testing_data)
print "Accuracy for best k is ", metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_opt_k)
In []:
#(Harder)task t1e
##############################Read the parameter values in the url
# http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
#Try different combinations of the parameters so that you can beat the accuracy of knn_model_opt_k
#Change the following line as appropriate
knn_model_opt_k_student = None
knn_model_opt_k_student.fit(binary_class_training_data, binary_class_training_labels)
predictions_for_knn_opt_k_student = knn_model_opt_k_student.predict(binary_class_testing_data)
print "Accuracy for best k - variant by student is ", metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_opt_k_student)

Decision Trees

In the next set of tasks, you will use Decision trees (see url http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier ) for classification.
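
For reference, the cell below is a minimal, illustrative sketch of the same fit/predict pattern with DecisionTreeClassifier; the parameter values are plain defaults, not the tuned answers to the tasks that follow.

In []:
#Illustrative sketch only -- a default CART tree, seeded for repeatability
example_dtree = DecisionTreeClassifier(random_state=1234)
example_dtree.fit(binary_class_training_data, binary_class_training_labels)
print metrics.accuracy_score(binary_class_testing_labels,
                             example_dtree.predict(binary_class_testing_data))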

In []:
###Do not make any change below
def plot_dtree(model,fileName):                                                                                              
    #You would have to install a Python package pydot                                                                        
    #You would also have to install graphviz for your system - see http://www.graphviz.org/Download.php
    #If you get any pydot error, see url
    # http://stackoverflow.com/questions/15951748/pydot-and-graphviz-error-couldnt-import-dot-parser-loading-of-dot-files-will
    dot_tree_data = StringIO.StringIO()                                                                                      
    tree.export_graphviz(model , out_file = dot_tree_data)                                                                   
    dtree_graph = pydot.graph_from_dot_data(dot_tree_data.getvalue())                                                        
    dtree_graph.write_png(fileName)                   
In []:
#task t2a
#Create a CART decision tree with DEFAULT values
#Remember to set the random state to 1234
#Change the following line as appropriate
cart_model_default = None 
cart_model_default.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_default) 
fileName = 'dtree_default.png'
plot_dtree(cart_model_default, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_default = None 
print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_default) 
In []:
#task t2b
#Create a CART decision tree with splitting criterion as entropy
#Remember to set the random state to 1234
#Change the following line as appropriate
cart_model_entropy = None 
cart_model_entropy.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_entropy) 
fileName = 'dtree_entropy.png'
plot_dtree(cart_model_entropy, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_entropy = None 
print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_entropy) 
In []:
#task t2c
#Create a CART decision tree with splitting criterion as entropy and min_samples_leaf as 100
#Remember to set the random state to 1234
#Change the following line as appropriate
cart_model_entropy_limit_leaves = None 
cart_model_entropy_limit_leaves.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_entropy_limit_leaves) 
fileName = 'dtree_entropy_limit_leaves.png'
plot_dtree(cart_model_entropy_limit_leaves, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_entropy_limit_leaves = None 
print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_entropy_limit_leaves) 
In []:
#(Harder) task t2d
#Create a CART decision tree that beats the three models above
# You might want to consult the url
# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
#Change the following line as appropriate
cart_model_student = None #Create the model with parameters that beats the models above
cart_model_student.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_student)
fileName = 'dtree_model_student.png'
plot_dtree(cart_model_student, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_student = None #change this line to make predictions
print metrics.accuracy_score(binary_class_testing_labels, predictions_dtree_student) 

Naive Bayes

In this task, you will create a set of Naive Bayes classifiers and evaluate them. You might want to use the following url http://scikit-learn.org/stable/modules/naive_bayes.html
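
All three variants share the same fit/predict interface; they differ only in the feature model they assume. Below is a minimal, illustrative sketch with GaussianNB; the Bernoulli and multinomial variants are constructed the same way.

In []:
#Illustrative sketch only -- GaussianNB assumes continuous features;
#MultinomialNB expects counts/frequencies; BernoulliNB binarizes features
example_nb = GaussianNB()
example_nb.fit(binary_class_training_data, binary_class_training_labels)
print metrics.accuracy_score(binary_class_testing_labels,
                             example_nb.predict(binary_class_testing_data))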

In []:
#task t3a
#Create a Gaussian NB
#Change the following line as appropriate
nb_gaussian_model = None #create the model
nb_gaussian_model.fit(binary_class_training_data, binary_class_training_labels)
print(nb_gaussian_model) 
#Change the following line as appropriate
predictions_gaussian_nb = None #make the predictions
print metrics.accuracy_score(binary_class_testing_labels,predictions_gaussian_nb) 
In []:
#task t3b
#Now create multinomial NB
#Change the following line as appropriate
nb_multinomial_model = None #create the model
nb_multinomial_model.fit(binary_class_training_data, binary_class_training_labels)
print(nb_multinomial_model) 
#Change the following line as appropriate
predictions_multinomial_nb = None #make the predictions
print metrics.accuracy_score(binary_class_testing_labels,predictions_multinomial_nb)
In []:
#task t3c
#Now create a Bernoulli NB
#Change the following line as appropriate
nb_binomial_model = None #create the model
nb_binomial_model.fit(binary_class_training_data, binary_class_training_labels)
print(nb_binomial_model) 
#Change the following line as appropriate
predictions_binomial_nb = None #make the predictions
print metrics.accuracy_score(binary_class_testing_labels,predictions_binomial_nb)

SVM

Let us test SVM on this dataset. You might want to read http://scikit-learn.org/stable/modules/svm.html for help. We will focus on SVC and LinearSVC.
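
Note that SVC is a kernelized SVM (rbf kernel by default) while LinearSVC is a faster, linear-only implementation. The cell below is an illustrative sketch of constructing both; the parameter values are examples, not the tuned answers to the tasks.

In []:
#Illustrative sketch only -- C controls regularization strength in both models
example_svc = SVC(kernel='rbf', C=1.0, random_state=1234)
example_linear_svc = LinearSVC(C=1.0, random_state=1234)
#Both then follow the usual pattern, e.g.:
#  example_linear_svc.fit(binary_class_training_data, binary_class_training_labels)
#  example_linear_svc.predict(binary_class_testing_data)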

In []:
#task t4a
#Create a SVM using SVC class. Remember to set random state to 1234.
#Change the following line as appropriate
svc_svm_model = None #Change this line
svc_svm_model.fit(binary_class_training_data, binary_class_training_labels)                         
print(svc_svm_model)                    
#Change the following line as appropriate
predictions_svm_svc = None #Change this line
print metrics.accuracy_score(binary_class_testing_labels,predictions_svm_svc)
In []:
#task t4b
#Now create a linear SVM model using LinearSVC class. Remember to set random state to 1234.
#Change the following line as appropriate
linear_svc_svm_model = None 
linear_svc_svm_model.fit(binary_class_training_data, binary_class_training_labels)            
print(linear_svc_svm_model)       
#Change the following line as appropriate
predictions_linear_svm_svc = None                                        
print metrics.accuracy_score(binary_class_testing_labels,predictions_linear_svm_svc)  
In []:
#(Harder) task t4c
#####Either using SVC or LinearSVC, try tweaking the parameters so that you beat the two models above
#Change the following line as appropriate
svc_svm_model_student = None 
svc_svm_model_student.fit(binary_class_training_data, binary_class_training_labels)                         
print(svc_svm_model_student)   
#Change the following line as appropriate
predictions_svm_svc_student = None 
print metrics.accuracy_score(binary_class_testing_labels,predictions_svm_svc_student) 

Logistic Regression

Logistic regression is a simple classifier that adapts a linear regression model for classification. You can read the details at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
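
As an illustrative sketch (C is the inverse regularization strength, so smaller values mean stronger regularization):

In []:
#Illustrative sketch only
example_lr = LogisticRegression(C=1.0, random_state=1234)
example_lr.fit(binary_class_training_data, binary_class_training_labels)
print metrics.accuracy_score(binary_class_testing_labels,
                             example_lr.predict(binary_class_testing_data))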

In []:
#task t5a
#Create a model with default parameters. Remember to set random state to 1234
#Change the following line as appropriate
lr_model_default = None            
lr_model_default.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_lr_model_default = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_lr_model_default)
In []:
#(Harder) task t5b
#Now try to beat the model above by tweaking the parameters. Remember to set random state to 1234.
#Change the following line as appropriate
lr_model_default_student = None            
lr_model_default_student.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_lr_model_default_student = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_lr_model_default_student)

Perceptron

Perceptron is a simple model that can be used for linearly separable data. See details at url http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html .
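
As an illustrative sketch (in this version of scikit-learn, n_iter sets the number of passes over the training data):

In []:
#Illustrative sketch only
example_perceptron = Perceptron(n_iter=5, random_state=1234)
example_perceptron.fit(binary_class_training_data, binary_class_training_labels)
print metrics.accuracy_score(binary_class_testing_labels,
                             example_perceptron.predict(binary_class_testing_data))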

In []:
#task t6a
#Create a perceptron model with default parameters and random state = 1234
#Change the following line as appropriate
perceptron_model_default = None
perceptron_model_default.fit(binary_class_training_data, binary_class_training_labels)       
print(perceptron_model_default)                 
#Change the following line as appropriate
predictions_perceptron_default = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_perceptron_default)    

Random Forests

Random Forests is a very popular ensemble method. See url http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html for details
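
As an illustrative sketch (n_estimators is the number of trees; n_jobs=-1 uses all available cores):

In []:
#Illustrative sketch only
example_rf = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=1234)
example_rf.fit(binary_class_training_data, binary_class_training_labels)
print metrics.accuracy_score(binary_class_testing_labels,
                             example_rf.predict(binary_class_testing_data))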

In []:
#task t7a
#Create a random forest classifier with Default parameters
#Change the following line as appropriate
rf_model_default = None                                                                   
rf_model_default.fit(binary_class_training_data, binary_class_training_labels)                                                                    
print(rf_model_default)            
#Change the following line as appropriate
predictions_rf_default = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_rf_default) 
In []:
#(Harder) task t7b
#Create a random forest classifier that can beat the above default model.
#Change the following line as appropriate
rf_model_default_student = None                                                                   
rf_model_default_student.fit(binary_class_training_data, binary_class_training_labels)                                                                    
print(rf_model_default_student)        
#Change the following line as appropriate
predictions_rf_default_student = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_rf_default_student) 

Part 2: Multi-Class Classification

So far, we have been focusing on binary classification problems. Now let us consider the corresponding multi-class classification problem where, given an image, we have to predict which of the ten digits (0-9) it shows.

We will evaluate 4 classifiers - KNN, Decision Trees, Random Forest and SVM. The first three handle multiple classes with no changes. SVM, however, is by default a binary classifier. We can use one of two wrappers, OneVsRestClassifier or OneVsOneClassifier, to turn it into a multi-class classifier; a sketch of the wrapper pattern follows the variable list below. See http://scikit-learn.org/stable/modules/multiclass.html for further details.

Make sure that you remember the variable names and their meaning:

  1. all_train_data, all_test_data: unnormalized images of all digits
  2. all_train_labels, all_test_labels: labels for all digits
  3. scaled_training_data, scaled_testing_data: Normalized version of all_train_data, all_test_data for all digits
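
As an illustrative sketch of the wrapper pattern (the wrapped estimator and its parameters are examples, not tuned answers): OneVsRestClassifier trains one binary classifier per class, while OneVsOneClassifier trains one per pair of classes.

In []:
#Illustrative sketch only
example_ovr = OneVsRestClassifier(LinearSVC(random_state=1234))
example_ovo = OneVsOneClassifier(LinearSVC(random_state=1234))
#Both then follow the usual fit/predict pattern, e.g.:
#  example_ovr.fit(scaled_training_data, all_train_labels)
#  example_ovr.predict(scaled_testing_data)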
In []:
#task t8a
#Create a KNN Classifier for K=5 and train it on **scaled** training data and test it on scaled testing data
#Change the following line as appropriate
mc_knn_model_k = None 
mc_knn_model_k.fit(scaled_training_data, all_train_labels)
#Change the following line as appropriate
predictions_mc_knn_model = None 
print metrics.accuracy_score(all_test_labels, predictions_mc_knn_model)
In []:
#task t8b
#Create a Decision tree with DEFAULT parameters and train it on **scaled** training data and test it on scaled testing data
#Remember to set random state to 1234
#Change the following line as appropriate
mc_cart_model_default = None 
mc_cart_model_default.fit(scaled_training_data, all_train_labels)
print(mc_cart_model_default)
fileName = 'mc_dtree_default.png' 
plot_dtree(mc_cart_model_default, fileName) 
Image(filename=fileName)
#Change the following line as appropriate
predictions_mc_dtree_default = None 
print metrics.accuracy_score(all_test_labels, predictions_mc_dtree_default)
In []:
#(Harder) task t8c
#Using the multi-class decision tree above, make some changes so that you can beat it.
# Remember to set random state to 1234
#Change the following line as appropriate
mc_cart_model_student = None
mc_cart_model_student.fit(scaled_training_data, all_train_labels)
print(mc_cart_model_student)
fileName = 'mc_dtree_student.png' 
plot_dtree(mc_cart_model_student, fileName) 
Image(filename=fileName)
#Change the following line as appropriate
predictions_mc_dtree_student = None 
print metrics.accuracy_score(all_test_labels, predictions_mc_dtree_student)
In []:
#task t8d
#Create a multi class classifier based on random forest with default parameters.
#Change the following line as appropriate
mc_rf_model_default = None   
mc_rf_model_default.fit(all_train_data, all_train_labels)
print(mc_rf_model_default)      
#Change the following line as appropriate
predictions_mc_rf_default = None
print metrics.accuracy_score(all_test_labels,predictions_mc_rf_default)
In []:
#(Harder) task t8e
#Tune the random forest classifier so that it beats the default model above
#Change the following line as appropriate
mc_rf_model_student = None   
mc_rf_model_student.fit(all_train_data, all_train_labels)
print(mc_rf_model_student)                               
#Change the following line as appropriate
predictions_mc_rf_student = None 
print metrics.accuracy_score(all_test_labels,predictions_mc_rf_student)
In []:
#task t8f
#Create an SVM-based OneVsRestClassifier. Set random state to 1234
#Change the following line as appropriate
mc_ovr_linear_svc_svm_model = None 
mc_ovr_linear_svc_svm_model.fit(scaled_training_data, all_train_labels)
print(mc_ovr_linear_svc_svm_model)   
#Change the following line as appropriate
predictions_mc_ovr_linear_svm_svc = None
print metrics.accuracy_score(all_test_labels,predictions_mc_ovr_linear_svm_svc) 
In []:
#(Harder) task t8g
#Tune the model above so that it beats the default classifier.  Set random state to 1234
#Change the following line as appropriate
mc_ovr_linear_svc_svm_model_student = None 
mc_ovr_linear_svc_svm_model_student.fit(scaled_training_data, all_train_labels)
print(mc_ovr_linear_svc_svm_model_student)   
#Change the following line as appropriate
predictions_mc_ovr_linear_svm_svc_student = None 
print metrics.accuracy_score(all_test_labels,predictions_mc_ovr_linear_svm_svc_student) 
In []:
#task t8h
#Create an SVM-based OneVsOneClassifier. Set random state to 1234
#Change the following line as appropriate
mc_ovo_linear_svc_svm_model = None
mc_ovo_linear_svc_svm_model.fit(scaled_training_data, all_train_labels)
print(mc_ovo_linear_svc_svm_model)   
#Change the following line as appropriate
predictions_mc_ovo_linear_svm_svc = None
print metrics.accuracy_score(all_test_labels,predictions_mc_ovo_linear_svm_svc)   
In []:
#(Harder) task t8i
#Tune the model so that it beats the classifier above. Set random state to 1234
#Change the following line as appropriate
mc_ovo_linear_svc_svm_model_student = None
mc_ovo_linear_svc_svm_model_student.fit(scaled_training_data, all_train_labels)
print(mc_ovo_linear_svc_svm_model_student)   
#Change the following line as appropriate
predictions_mc_ovo_linear_svm_svc_student = None
print metrics.accuracy_score(all_test_labels,predictions_mc_ovo_linear_svm_svc_student)

Part 3: Exploration of Different Evaluation Metrics for Multi-Class Classification

Let us evaluate different metrics for the multi class classification models that we created so far. You may want to check the url http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics for additional details.
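
As an illustrative sketch: once you have a model's predictions (for example, predictions_mc_knn_model from task t8a), the classification report and confusion matrix are one call each.

In []:
#Illustrative sketch only -- assumes predictions_mc_knn_model was computed in task t8a
print metrics.classification_report(all_test_labels, predictions_mc_knn_model)
print metrics.confusion_matrix(all_test_labels, predictions_mc_knn_model)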

In []:
#task t9a
#Assign the best multi-class classification model that you found above.
# For example, depending on which decision tree model is best, 
#  set best_dtree_model_mc to either mc_cart_model_default or mc_cart_model_student
# Do the same for the other models as well. 
# Remember to assign the MODEL variable

#Change the following lines as appropriate
best_knn_model_mc = None 
best_dtree_model_mc = None 
best_rf_model_mc = None    
best_svm_ovr_model_mc = None 
best_svm_ovo_model_mc = None 


#Assign the best BINARY classification models that you found above
best_knn_model_bc = None
best_dtree_model_bc = None
best_rf_model_bc = None
best_svm_model_bc = None


######Note - for tasks 9c and 9d you will use the multi-class models (i.e. those variables ending in _mc)
############# For the remaining tasks you will use the binary class models (i.e. those variables ending in _bc)
In []:
#task t9b
#Create predictions of these 5 models on test dataset
#Change the following lines as appropriate
predictions_best_knn_model_mc = None
predictions_best_dtree_model_mc = None
predictions_best_rf_model_mc = None
predictions_best_svm_ovr_model_mc = None
predictions_best_svm_ovo_model_mc = None

#Create predictions of these 4 models on test dataset
#Change the following lines as appropriate
predictions_best_knn_model_bc = None
predictions_best_dtree_model_bc = None
predictions_best_rf_model_bc = None
predictions_best_svm_model_bc = None
In []:
#task t9c
#Print the classification report for each of the models above
#  Write code here
In []:
#task t9d
#Print the confusion matrix for each of the models above
#  Write code here
In []:
#(Harder) task t9e
#Each of the models above has some probabilistic interpretation,
# so sklearn allows you to get probability values as part of classification.
# Using this information, you can compute the ROC curve.
#Print the ROC curve values for each of the models above
#  Write code here
In []:
#(Harder) task t9f
#For each model, DRAW the ROC curve
# See url below for details
#http://nbviewer.ipython.org/github/datadave/GADS9-NYC-Spring2014-Lectures/blob/master/lessons/lesson09_decision_trees_random_forests/sklearn_decision_trees.ipynb
#  Write code here
In []:
#(Harder) task t9g
#Print the AUC value for each of the models above
#  Write code here
In []:
#(Harder) task t9h
#Print the precision recall curve for each of the models above
#  Write code here

Plot the curve following the example at http://scikit-learn.org/stable/auto_examples/plot_precision_recall.html

Part 4: Parameter Tuning through Grid Search/Cross Validation and Parallelization

So far in this assignment, you have manually tweaked each model until it became better. For complex models, this is often cumbersome. A common trick is Grid Search, where you exhaustively test various parameter combinations and pick the best set of parameter values. This is a VERY computationally intensive process, and hence it will require some parallelization.

In this assignment, you will learn how to tune two models (a RandomForest and an SVC SVM) for the MNIST dataset and then parallelize the search so as to get results faster. You might want to take a look at http://scikit-learn.org/stable/modules/grid_search.html for additional details.

One thing to note is that GridSearchCV uses cross validation for comparing models, so you have to pass it an entire dataset (data and labels) rather than a pre-made train/test split. The following cell creates two variables, all_scaled_data and all_scaled_target, that you can pass to GridSearchCV. In order to get results in reasonable time, set the cv parameter of GridSearchCV to 3. Also remember to set the verbose parameter to 2 to get some details about what happens internally.

In []:
###Do not make any change below
all_scaled_data = binary_class_data / 255.0
all_scaled_target = binary_class_labels
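
As an illustrative sketch, a GridSearchCV call over these two variables has the following shape; the grid here is a toy example, not the grids used in the tasks below.

In []:
#Illustrative sketch only -- toy grid; cv=3 and verbose=2 as suggested above
example_grid = GridSearchCV(SVC(random_state=1234),
                            param_grid=[{'kernel': ['linear'], 'C': [1, 10]}],
                            cv=3, verbose=2)
example_grid.fit(all_scaled_data, all_scaled_target)
print example_grid.best_params_, example_grid.best_score_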
In []:
#(Harder) task t10a
#Tuning SVC SVM model for MNIST
tuned_parameters = [{'kernel' : ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},
                    {'kernel' : ['poly'], 'degree' : [5, 9], 'C' : [1, 10]}] 
#Create a SVC SVM classifier object and tune it using the GridSearchCV function.
#Change the following line as appropriate
svm = None #replace this line with gridsearchcv
#print the details of the best model and its accuracy
In []:
#(Harder) task t10b
#Tuning Random Forest for MNIST
tuned_parameters = [{'max_features': ['sqrt', 'log2'], 'n_estimators': [1000, 1500]}] 
#Replace this line with gridsearchcv. Remember to pass 
# the following parameters to the constructor of RandomForestClassifier: min_samples_split=1, compute_importances=False, n_jobs=-1
#Change the following line as appropriate
rf = None 

#print the details of the best model and its accuracy
In []:
#(Harder) task t10c
#Parallelization
#Now re-run the tuning of the SVC SVM model, but parallelize it via the n_jobs option.
tuned_parameters = [{'kernel' : ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},
                    {'kernel' : ['poly'], 'degree' : [5, 9], 'C' : [1, 10]}]
#Change the following line as appropriate
svm_parallel = None #replace this line with gridsearchcv
#print the details of the best model and its accuracy

Part 5: Evaluation of Various Regression Models for Boston Houses Dataset

We will now implement some regression routines for predicting the house prices in Boston.

In []:
#Do not make any changes in this cell
boston = load_boston()  
print boston.data.shape 
print boston.feature_names 
print np.max(boston.target), np.min(boston.target), np.mean(boston.target)  
print boston.DESCR     
In []:
#Do not make any changes in this cell.
print boston.data[0]   
print np.max(boston.data), np.min(boston.data), np.mean(boston.data) 

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=1234)

#Scale the data - important for regression. Learn what this function does
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)
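#StandardScaler learns the per-feature mean and standard deviation from the
# training split only (avoiding test-set leakage); transform() then maps each
# value x to (x - mean) / std, giving zero mean and unit variance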

X_train = scalerX.transform(X_train)  
y_train = scalery.transform(y_train)  
X_test = scalerX.transform(X_test)    
y_test = scalery.transform(y_test)

print np.max(X_train), np.min(X_train), np.mean(X_train), np.max(y_train), np.min(y_train), np.mean(y_train) 
In []:
#task t11a
#Create 13 scatter plots with each variable (CRIM to LSTAT) on the X axis and MEDV on the Y axis.
# Arrange the plots in a grid: 3 rows of 4 plots each, plus 1 plot in the last row

#Write code here
In []:
#Do not make any change here
#To make your life easy, I have created a function that 
# (a) takes a regressor object, (b) trains it, (c) scores it on the training data, and (d) evaluates it via 5-fold cross validation
def train_and_evaluate(clf, X_train, y_train): 
    clf.fit(X_train, y_train)   
    print "Coefficient of determination on training set:",clf.score(X_train, y_train)  
    cv = KFold(X_train.shape[0], 5, shuffle=True, random_state=1234)   
    scores = cross_val_score(clf, X_train, y_train, cv=cv)   
    print "Average coefficient of determination using 5-fold crossvalidation:",np.mean(scores)  
    
def plot_regression_fit(actual, predicted):
    plt.scatter(actual, predicted)
    plt.plot([0, 50], [0, 50], '--k')
    plt.axis('tight')
    plt.xlabel('True price ($1000s)')
    plt.ylabel('Predicted price ($1000s)') 
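
As an illustrative sketch, every regression task below follows the same shape with a different regressor object; the regressor used here is an example, not the answer to any particular task.

In []:
#Illustrative sketch only
example_reg = DecisionTreeRegressor(random_state=1234)
train_and_evaluate(example_reg, X_train, y_train)
plot_regression_fit(scalery.inverse_transform(y_test),
                    scalery.inverse_transform(example_reg.predict(X_test)))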
In []:
#task t11b
#create a regressor object based on LinearRegression
# See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
#Change the following line as appropriate
clf_ols = None #change this line
train_and_evaluate(clf_ols,X_train,y_train) 
clf_ols_predicted = clf_ols.predict(X_test)   
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_ols_predicted))
In []:
#task t11c
#See http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
#Create a regression based on Support Vector Regressor. Set the kernel to linear
#Change the following line as appropriate
clf_svr= None  
train_and_evaluate(clf_svr,X_train,y_train) 
clf_svr_predicted = clf_svr.predict(X_test) 
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_predicted))   

#task t11d
#Create a regression based on Support Vector Regressor. Set the kernel to polynomial
#Change the following line as appropriate
clf_svr_poly= None     
train_and_evaluate(clf_svr_poly,X_train,y_train) 
clf_svr_poly_predicted = clf_svr_poly.predict(X_test)      
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_poly_predicted)) 

#task t11e
#Create a regression based on Support Vector Regressor. Set the kernel to rbf
#Change the following line as appropriate
clf_svr_rbf= None 
train_and_evaluate(clf_svr_rbf,X_train,y_train)
clf_svr_rbf_predicted = clf_svr_rbf.predict(X_test)    
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_rbf_predicted))
In []:
#task t11f
#See http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
#Create regression tree
#Change the following line as appropriate
clf_cart = None   
train_and_evaluate(clf_cart,X_train,y_train) 
clf_cart_predicted = clf_cart.predict(X_test)  
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_cart_predicted))  
In []:
#task t11g
#See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
#Create a random forest regressor with 10 estimators and random state as 1234
#Change the following line as appropriate
clf_rf= None
train_and_evaluate(clf_rf,X_train,y_train)  
#The following line prints the features sorted by importance
#  (note: plain sorted() keeps each (importance, name) pair together)
print sorted(zip(clf_rf.feature_importances_, boston.feature_names))
clf_rf_predicted = clf_rf.predict(X_test)      
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_rf_predicted))