Programming Assignment 2: Classification of MNIST Handwritten Digits and Regression of House Prices through Boston Dataset

Team Details:

When submitting, fill your team details in this cell. Note that this is a markdown cell.

Student 1 Full Name: Saravanan Thirumuruganathan
Student 1 Full ID: 1000xxxxxx
Student 1 Full Email Address: yyyy@zzzz

Student 2 Full Name: Satishkumar Masilamani
Student 2 Full ID: 1000xxxxxx
Student 2 Full Email Address: yyyy@zzzz

Student 3 Full Name:
Student 3 Full ID:
Student 3 Full Email Address:

Assignment Details

At a high level, you will be building and evaluating different classifiers for recognizing handwritten digits of MNIST dataset and also build an evaluate various regression models for predicting house prices in Boston. At a more granular level, you will be doing the following:

1. Binary Classification of MNIST Dataset

In the first set of tasks, you will evaluate a number of popular classifiers for the task of recognizing handwritten digits from MNIST dataset. Specifically, we will focus on distinguishing between 7 and 9 which are known to be a hard pair. We will not use any sophisticated ideas from Computer Vision/Image Processing and use classifiers directly over the data. The idea is to show that lot of times, you can simply run a set of classifiers and still get great results. While I will be giving some basic classifier code, you will have some opportunity to improve them by tuning the parameters.

2. Multi-Class Classification of MNIST Dataset

In the second set of tasks, we will do multi-class classification where the idea is to classify the image to one of the ten digits (0-9). We will start with some basic classifiers that are intrinsically multi-class. Then we will learn about how to convert binary classifiers to multi-class classifiers and how scikit-learn makes it very easy.

3. Exploration of Different Evaluation Metrics

In the first two set of tasks, we will narrowly focus on accuracy - what fraction of our predictions - were correct. However, there are a number of popular evaluation metrics. You will learn how (and when) to use these evaluation metrics.

4. Parameter Tuning through Grid Search/Cross Validation and Parallelization

This is an advanced topic where you will learn how to tune your classifier and find optimal parameters. We will explore two powerful techniques of grid search and parameter search. This is a very compute intensive task - so you will also explore how to leverage parallelization capabilities of IPython kernel to get results sooner.

5. Evaluation of Various Regression Models for Boston Houses Dataset

In the final set of tasks, we will use regression to predict Boston house prices. We will explore both Ordinary Least Squares and also explore other regression variant of popular classifiers such as decision trees and SVM.

In [1]:
%matplotlib inline 

#Array processing
import numpy as np

#Data analysis, wrangling and common exploratory operations
import pandas as pd
from pandas import Series, DataFrame

#For visualization. Matplotlib for basic viz and seaborn for more stylish figures + statistical figures not in MPL.
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.display import Image

from sklearn.datasets import fetch_mldata, load_boston                                                                       
from sklearn.utils import shuffle                                                                                            
from sklearn.neighbors import KNeighborsClassifier                                                                           
from sklearn import metrics                                                                                                  
from sklearn import tree                                                                                                     
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor                                                       
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB                                                       
from sklearn.svm import SVC, LinearSVC , SVR                                                                                 
from sklearn.linear_model import Perceptron, LogisticRegression, LinearRegression                                            
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor                                                    
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier                                                       
from sklearn.cross_validation import KFold, train_test_split, cross_val_score                                                
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV

########You have to install the python package pydot for generating some graph figures
########   If you are using pip, the command is pip install pydot
########You will also need to download Graphviz from http://www.graphviz.org/Download.php
########If you are using Windows, make sure that Graphviz/dot are in path and can be used by pydot package.
########If you get any pydot error, see url for one possible solution
######## http://stackoverflow.com/questions/15951748/pydot-and-graphviz-error-couldnt-import-dot-parser-loading-of-dot-files-will
######## If you are using Anaconda, pydot can be installed as conda install pydot
######## If that does not work check url:
########   http://stackoverflow.com/questions/27482170/installing-pydot-and-graphviz-packages-in-anaconda-environment

import pydot, StringIO 


########################If needed you can import additional packages for helping you, although I would discourage it
########################Put all your imports in the space below. If you use some strange package, 
##########################provide clear information as to how we can install it.

#######################End imports###################################

Part 1: Binary Classification of MNIST Dataset

In the first set of tasks, you will evaluate a number of popular classifiers for the task of recognizing handwritten digits from MNIST dataset. Specifically, we will focus on distinguishing between 7 and 9 which are known to be a hard pair. We will not use any sophisticated ideas from Computer Vision/Image Processing and use classifiers directly over the data.

In [2]:
####################Do not change anything below
#Load MNIST data. fetch_mldata will download the dataset and put it in a folder called mldata. 
#Some things to be aware of:
#   The folder mldata will be created in the folder in which you started the notebook
#   So to make your life easy, always start IPython notebook from same folder.
#   Else the following code will keep downloading MNIST data
mnist = fetch_mldata("MNIST original")                      
#The data is organized as follows:
#  Each row corresponds to an image
#  Each image has 28*28 pixels which is then linearized to a vector of size 784 (ie. 28*28)
# mnist.data gives the image information while mnist.target gives the number in the image
print "#Images = %d and #Pixel per image = %s" % (mnist.data.shape[0], mnist.data.shape[1])

#Print first row of the dataset 
img = mnist.data[0]                                                                                                          
print "First image shows %d" % (mnist.target[0])
print "The corresponding matrix version of image is \n" , img
print "The image in grey shape is "
plt.imshow(img.reshape(28, 28), cmap="Greys")                                                                                
                                                                                                                             
#First 60K images are for training and last 10K are for testing
all_train_data = mnist.data[:60000]                                                                                          
all_test_data = mnist.data[60000:]                                                                                           
all_train_labels = mnist.target[:60000]                                                                                      
all_test_labels = mnist.target[60000:]                                                                                       
                                                              
                                                                                                                             
#For the first task, we will be doing binary classification and focus  on two pairs of 
#  numbers: 7 and 9 which are known to be hard to distinguish
#Get all the seven images
sevens_data = mnist.data[mnist.target==7]      
#Get all the none images
nines_data = mnist.data[mnist.target==9]       
#Merge them to create a new dataset
binary_class_data = np.vstack([sevens_data, nines_data])    
binary_class_labels = np.hstack([np.repeat(7, sevens_data.shape[0]), np.repeat(9, nines_data.shape[0])])    
 
#In order to make the experiments repeatable, we will seed the random number generator to a known value
# That way the results of the experiments will always be same
np.random.seed(1234)                        
#randomly shuffle the data
binary_class_data, binary_class_labels = shuffle(binary_class_data, binary_class_labels)  
print "Shape of data and labels are :" , binary_class_data.shape, binary_class_labels.shape   

#There are approximately 14K images of 7 and 9. 
#Let us take the first 5000 as training and remaining as test data                                          
orig_binary_class_training_data = binary_class_data[:5000]                                                  
binary_class_training_labels = binary_class_labels[:5000]                                                   
orig_binary_class_testing_data = binary_class_data[5000:]                                                   
binary_class_testing_labels = binary_class_labels[5000:] 

#The images are in grey scale where each number is between 0 to 255
# Now let us normalize them so that the values are between 0 and 1. 
# This will be the only modification we will make to the image
binary_class_training_data = orig_binary_class_training_data / 255.0                                        
binary_class_testing_data = orig_binary_class_testing_data / 255.0                                          
scaled_training_data = all_train_data / 255.0                                                                                
scaled_testing_data = all_test_data / 255.0  

print binary_class_training_data[0,:]                                                                       
     
###########Make sure that you remember the variable names and their meaning
#binary_class_training_data, binary_class_training_labels: Normalized images of 7 and 9 and the correct labels for training
#binary_class_testing_data, binary_class_testing_labels : Normalized images of 7 and 9 and correct labels for testing
#orig_binary_class_training_data, orig_binary_class_testing_data: Unnormalized images of 7 and 9
#all_train_data, all_test_data: un normalized images of all digits 
#all_train_labels, all_test_labels: labels for all digits
#scaled_training_data, scaled_testing_data: Normalized version of all_train_data, all_test_data for all digits
#Images = 70000 and #Pixel per image = 784
First image shows 0
The corresponding matrix version of image is 
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0  51 159 253 159  50   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0  48 238 252 252 252 237   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0  54 227 253 252 239 233 252  57   6   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  10  60 224 252 253 252 202  84 252
 253 122   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0 163 252 252 252 253 252 252  96 189 253 167   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0  51 238 253 253 190 114 253 228
  47  79 255 168   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0  48 238 252 252 179  12  75 121  21   0   0 253 243  50   0   0   0
   0   0   0   0   0   0   0   0   0   0  38 165 253 233 208  84   0   0
   0   0   0   0 253 252 165   0   0   0   0   0   0   0   0   0   0   0
   0   7 178 252 240  71  19  28   0   0   0   0   0   0 253 252 195   0
   0   0   0   0   0   0   0   0   0   0   0  57 252 252  63   0   0   0
   0   0   0   0   0   0 253 252 195   0   0   0   0   0   0   0   0   0
   0   0   0 198 253 190   0   0   0   0   0   0   0   0   0   0 255 253
 196   0   0   0   0   0   0   0   0   0   0   0  76 246 252 112   0   0
   0   0   0   0   0   0   0   0 253 252 148   0   0   0   0   0   0   0
   0   0   0   0  85 252 230  25   0   0   0   0   0   0   0   0   7 135
 253 186  12   0   0   0   0   0   0   0   0   0   0   0  85 252 223   0
   0   0   0   0   0   0   0   7 131 252 225  71   0   0   0   0   0   0
   0   0   0   0   0   0  85 252 145   0   0   0   0   0   0   0  48 165
 252 173   0   0   0   0   0   0   0   0   0   0   0   0   0   0  86 253
 225   0   0   0   0   0   0 114 238 253 162   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0  85 252 249 146  48  29  85 178 225 253
 223 167  56   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  85 252 252 252 229 215 252 252 252 196 130   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0  28 199 252 252 253 252 252 233
 145   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0  25 128 252 253 252 141  37   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]
The image in grey shape is 
Shape of data and labels are : (14251, 784) (14251,)
[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.04313725  0.34117647  0.54117647  0.54117647  0.54117647  0.74901961
  0.50196078  0.0745098   0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.05098039  0.34901961  0.79607843  0.89411765  0.80784314  0.68235294
  0.36078431  0.36078431  0.62352941  0.9254902   0.17254902  0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.17647059  0.6745098   0.98823529  0.74117647  0.1254902   0.          0.
  0.          0.          0.09411765  0.98823529  0.67058824  0.05098039
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.24313725  0.9254902   0.89019608  0.25882353  0.01568627  0.          0.
  0.          0.          0.          0.09411765  0.98823529  0.98823529
  0.17647059  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.41568627  0.95294118  0.90588235  0.18823529  0.          0.          0.
  0.          0.          0.          0.          0.29803922  0.98823529
  0.98823529  0.17647059  0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.40784314  0.99215686  0.86666667  0.16470588  0.          0.          0.
  0.          0.          0.          0.24313725  0.74901961  1.
  0.99215686  0.80392157  0.08235294  0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.25882353  0.9254902   0.98823529  0.          0.          0.          0.
  0.          0.          0.03529412  0.18431373  0.84313725  0.98823529
  0.97647059  0.80784314  0.21960784  0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.05098039  0.79607843  0.98823529  0.53333333  0.          0.          0.
  0.          0.          0.1254902   0.30588235  0.98823529  0.98823529
  0.98823529  0.45098039  0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.18431373  0.98823529  0.69019608  0.01568627  0.          0.
  0.          0.          0.33333333  0.89803922  0.85882353  0.51764706
  0.98823529  0.82352941  0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.10196078  0.86666667  0.62745098  0.          0.
  0.29019608  0.57254902  0.90196078  0.98039216  0.29411765  0.07058824
  0.68235294  0.98823529  0.33333333  0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.65882353  0.97647059
  0.99215686  1.          0.97647059  0.78039216  0.45490196  0.04313725
  0.          0.22352941  0.99215686  0.75294118  0.04313725  0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.4627451   0.80784314  0.36078431  0.29411765  0.          0.          0.
  0.01960784  0.69803922  0.94117647  0.14509804  0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.42352941  0.98823529  0.59215686  0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.91372549  0.92156863  0.17254902  0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.61960784  0.99215686  0.6         0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.03529412  0.91372549  0.94901961  0.16470588  0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.36470588  0.98823529  0.52941176  0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.48627451  0.98823529  0.1254902   0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.10196078  0.87843137  0.98823529  0.29019608  0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.14901961  0.70196078  0.37254902  0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.        ]

Binary Classification in scikit-learn

All classifiers in scikit-learn follow a common pattern that makes life much easier. Follow these steps for all the tasks below.

  1. Instantiate the classifier with appropriate parameters
  2. Train/fit the classifier with training data and correct labels
  3. Test the classifier with unseen data
  4. Evaluate the performance of classifier

K-Nearest Neighbor

We will start with one of the simplest classifiers. In the cell below, I have given the code for k-NN with k=1. This should give an idea about how to create, train and evaluate a classifier.

In [3]:
# Do not change anything in this cell.
# The following are some utility functions to help you visualize k-NN


#Code courtesy AMPLab

#This function displays one or more images in a grid manner.
def show_img_with_neighbors(imgs, n=1):                       
  fig = plt.figure()                                          
  for i in xrange(0, n):                                      
      fig.add_subplot(1, n, i, xticklabels=[], yticklabels=[])
      if n == 1:                                              
          img = imgs                                          
      else:                                                   
          img = imgs[i]                                       
      plt.imshow(img.reshape(28, 28), cmap="Greys")           

#This function shows some images for which k-NN made a mistake
# For each of the missed image, it will also show k most similar images so that you will get an idea of why it failed. 
def show_erroring_images_for_model(errors_in_model, num_img_to_print, model, n_neighbors): 
  for errorImgIndex in errors_in_model[:num_img_to_print]:                             
      error_image = binary_class_testing_data[errorImgIndex]                           
      not_needed, result = model.kneighbors(error_image, n_neighbors=n_neighbors)      
      show_img_with_neighbors(error_image)                                             
      show_img_with_neighbors(binary_class_training_data[result[0],:], len(result[0])) 
In [4]:
# Do not change anything in this cell.
#The code below creates a K-NN classifier with k=1.
#Clearly observe how I do it step by step.

#Step 1: Create a classifier with appropriate parameters
knn_model_k1 = KNeighborsClassifier(n_neighbors=1, algorithm='brute')
#Step 2: Fit it with training data
knn_model_k1.fit(binary_class_training_data, binary_class_training_labels)
#Print the model so that you know all parameters
print(knn_model_k1)                             
#Step 3: Make predictions based on testing data
predictions_knn_model_k1 = knn_model_k1.predict(binary_class_testing_data)
#Step 4: Evaluate the data
print "Accuracy of K-NN with k=1 is", metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k1)  

#Let us now see the first five images that were predicted incorrectly see what the issue is                                  
errors_knn_model_k1 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k1[i] != binary_class_testing_labels[i]]
show_erroring_images_for_model(errors_knn_model_k1, 5, knn_model_k1, 1)                                                      
KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=1, p=2, weights='uniform')
Accuracy of K-NN with k=1 is 0.984109825965
/usr/local/lib/python2.7/dist-packages/matplotlib/axes/_subplots.py:69: MatplotlibDeprecationWarning: The use of 0 (which ends up being the _last_ sub-plot) is deprecated in 1.4 and will raise an error in 1.5
  mplDeprecation)
In [5]:
#task t1a
#Using the above code as model, create a KNN classifier with k=3
#Change the following line as appropriate
knn_model_k3 = KNeighborsClassifier(n_neighbors=3, algorithm='brute')
knn_model_k3.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_knn_model_k3 = knn_model_k3.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k3)

#Let us now see the first five images that were predicted incorrectly see what the issue is                                  
errors_knn_model_k3 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k3[i] != binary_class_testing_labels[i]]
show_erroring_images_for_model(errors_knn_model_k3, 5, knn_model_k3, 3)                                                      
0.985515079451
In [6]:
#task t1b
#Using the above code as model, create a KNN classifier with k=5
#Change the following line as appropriate
knn_model_k5 = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
knn_model_k5.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_knn_model_k5 = knn_model_k5.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k5)

#Let us now see the first five images that were predicted incorrectly see what the issue is                                  
errors_knn_model_k5 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k5[i] != binary_class_testing_labels[i]]
show_erroring_images_for_model(errors_knn_model_k5, 5, knn_model_k5, 5)                                                      
0.984434115231
In [34]:
#task t1c
#Now let us evaluate KNN for different values of K and find the best K
#WARNING: This code will take 20-40 minutes to run. So make sure your code is correct
k_vals = xrange(1, 20)
#Initialize to 0
accuracy_vals = [0 for _ in k_vals] 
for k in k_vals:
    #Create a KNN with number of neighbors = k  - #Change the following line as appropriate
    knn_model_for_knn_k = KNeighborsClassifier(n_neighbors=k, algorithm='brute')
    knn_model_for_knn_k.fit(binary_class_training_data, binary_class_training_labels)
    #Make the prediction - #Change the following line as appropriate
    predictions_for_knn_k = knn_model_for_knn_k.predict(binary_class_testing_data)
    accuracy_vals[k-1] = metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_k)
    
#Now you have two arrays k_vals which have different values of k and accuracy_vals which have the corresponding accuracy 
# of a model with that k
#Set opt_k to k that gives best accuracy. 
# Hint: use np.argmax command.
# Also dont forget to add 1 to opt_k to correct the off by one error 
#     (as k varied from 1 to 20 while accuracy_vals[k] started from 0)
opt_k = k_vals[np.argmax(accuracy_vals)] ##Change the following line as appropriate
print "Best k for k-nn is:",opt_k
Best k for k-nn is: 3
In [35]:
#task t1d
#Train the model with optimal k that we have seen so far
#Change the following line as appropriate
knn_model_opt_k = KNeighborsClassifier(n_neighbors=opt_k, algorithm='brute')
knn_model_opt_k.fit(binary_class_training_data, binary_class_training_labels)
predictions_for_knn_opt_k = knn_model_opt_k.predict(binary_class_testing_data)
print "Accuracy for best k is ", metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_opt_k)
Accuracy for best k is  0.985515079451
In [ ]:
#(Harder)task t1e
##############################Read the parameter values in the url
# http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
#Try different combinations of the parameters so that you can beat the accuracy of knn_model_opt_k
#Change the following line as appropriate

#######Many different ways to do this - a number of teams got it right. 
#######If interested - please ask in Piazza post for how others did it.
knn_model_opt_k_student = None
knn_model_opt_k_student.fit(binary_class_training_data, binary_class_training_labels)
predictions_for_knn_opt_k_student = knn_model_opt_k_student.predict(binary_class_testing_data)
print "Accuracy for best k - variant by student is ", metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_opt_k_student)

Decision Trees

In the next set of tasks, you will use Decision trees (see url http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier ) for classification.

In [3]:
###Do not make any change below
def plot_dtree(model,fileName):                                                                                              
    #You would have to install a Python package pydot                                                                        
    #You would also have to install graphviz for your system - see http://www.graphviz.org/Download..php                     
    #If you get any pydot error, see url
    # http://stackoverflow.com/questions/15951748/pydot-and-graphviz-error-couldnt-import-dot-parser-loading-of-dot-files-will
    dot_tree_data = StringIO.StringIO()                                                                                      
    tree.export_graphviz(model , out_file = dot_tree_data)                                                                   
    dtree_graph = pydot.graph_from_dot_data(dot_tree_data.getvalue())                                                        
    dtree_graph.write_png(fileName)                   
In [4]:
#task t2a
#Create a CART decision tree with DEFAULT values
#Remember to set the random state to 1234
#Change the following line as appropriate
cart_model_default = DecisionTreeClassifier(random_state = 1234) 
cart_model_default.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_default) 
fileName = 'dtree_default.png'
plot_dtree(cart_model_default, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_default = cart_model_default.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_default) 
DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=1234, splitter='best')
0.967138687709
In [5]:
#task t2b
#Create a CART decision tree with splitting criterion as entropy
#Remember to set the random state to 1234
#Change the following line as appropriate
cart_model_entropy = DecisionTreeClassifier(random_state = 1234, criterion = "entropy")  
cart_model_entropy.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_entropy) 
fileName = 'dtree_entropy.png'
plot_dtree(cart_model_entropy, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_entropy = cart_model_entropy.predict(binary_class_testing_data) 
print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_entropy) 
DecisionTreeClassifier(compute_importances=None, criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=1234, splitter='best')
0.969408712572
In [6]:
#task t2c
#Create a CART decision tree with splitting criterion as entropy and min_samples_leaf as 100
#Remember to set the random state to 1234
#Change the following line as appropriate
cart_model_entropy_limit_leaves = DecisionTreeClassifier(random_state = 1234, criterion = "entropy", min_samples_leaf = 100) 
cart_model_entropy_limit_leaves.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_entropy_limit_leaves) 
fileName = 'dtree_entropy_limit_leaves.png'
plot_dtree(cart_model_entropy_limit_leaves, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_entropy_limit_leaves = cart_model_entropy_limit_leaves.predict(binary_class_testing_data) 
print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_entropy_limit_leaves) 
DecisionTreeClassifier(compute_importances=None, criterion='entropy',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=100, min_samples_split=2,
            random_state=1234, splitter='best')
0.926926818722
In [ ]:
#(Harder) task t2d
#Create a CART decision tree that beats the three models above
# You might want to consult the url
# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
#Change the following line as appropriate

#######Many different ways to do this - a number of teams got it right. 
#######If interested - please ask in Piazza post for how others did it.


cart_model_student = None #Create the model with parameters that beats the models above
cart_model_student.fit(binary_class_training_data, binary_class_training_labels)
print(cart_model_student)
fileName = 'dtree_model_student.png'
plot_dtree(cart_model_student, fileName)
Image(filename=fileName)
#Change the following line as appropriate
predictions_dtree_student = None #change this line to make predictions
print metrics.accuracy_score(binary_class_testing_labels, predictions_dtree_student) 

Naive Bayes

In this task, you will create a set of Naive Bayes classifiers and evaluate them. You might want to use the following url http://scikit-learn.org/stable/modules/naive_bayes.html

In [7]:
#task t3a
#Create a Gaussian NB
#Change the following line as appropriate
nb_gaussian_model = GaussianNB() #create the model
nb_gaussian_model.fit(binary_class_training_data, binary_class_training_labels)
print(nb_gaussian_model) 
#Change the following line as appropriate
predictions_gaussian_nb = nb_gaussian_model.predict(binary_class_testing_data) #make the predictions
print metrics.accuracy_score(binary_class_testing_labels,predictions_gaussian_nb) 
GaussianNB()
0.658955788563
In [8]:
#task t3b
#Now create multinomial NB
#Change the following line as appropriate
nb_multinomial_model = MultinomialNB() #create the model
nb_multinomial_model.fit(binary_class_training_data, binary_class_training_labels)
print(nb_multinomial_model) 
#Change the following line as appropriate
predictions_multinomial_nb = nb_multinomial_model.predict(binary_class_testing_data) #make the predictions
print metrics.accuracy_score(binary_class_testing_labels,predictions_multinomial_nb)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
0.915900983677
In [9]:
#task t3c
#Now create a binomial NB
#Change the following line as appropriate
nb_binomial_model = BernoulliNB() #create the model
nb_binomial_model.fit(binary_class_training_data, binary_class_training_labels)
print(nb_binomial_model) 
#Change the following line as appropriate
predictions_binomial_nb = nb_binomial_model.predict(binary_class_testing_data) #make the predictions
print metrics.accuracy_score(binary_class_testing_labels,predictions_binomial_nb)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
0.920981515512

SVM

Let us test SVM on this dataset. You might want to read the http://scikit-learn.org/stable/modules/svm.html for help. We will focus on SVC and LinearSVC.

In [10]:
#task t4a
#Create a SVM using SVC class. Remember to set random state to 1234.
#Change the following line as appropriate
svc_svm_model = svc_svm_model = SVC(random_state = 1234) #Change this line
svc_svm_model.fit(binary_class_training_data, binary_class_training_labels)                         
print(svc_svm_model)                    
#Change the following line as appropriate
predictions_svm_svc = svc_svm_model.predict(binary_class_testing_data) #Chane this line                                                       
print metrics.accuracy_score(binary_class_testing_labels,predictions_svm_svc)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=1234,
  shrinking=True, tol=0.001, verbose=False)
0.953950924224
In [11]:
#task t4b
#Now create a linear SVM model using LinearSVC class. Remember to set random state to 1234.
#Change the following line as appropriate
linear_svc_svm_model = LinearSVC(random_state=1234) 
linear_svc_svm_model.fit(binary_class_training_data, binary_class_training_labels)            
print(linear_svc_svm_model)       
#Change the following line as appropriate
predictions_linear_svm_svc = linear_svc_svm_model.predict(binary_class_testing_data)                                        
print metrics.accuracy_score(binary_class_testing_labels,predictions_linear_svm_svc)  
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=1234, tol=0.0001, verbose=0)
0.949735163766
In [ ]:
#(Harder) task t4c
#####Either using SVC or LinearSVC, try tweaking the parameters so that you beat the two models above

#######Many different ways to do this - a number of teams got it right. 
#######If interested - please ask in Piazza post for how others did it.


#Change the following line as appropriate
svc_svm_model_student = None 
svc_svm_model_student.fit(binary_class_training_data, binary_class_training_labels)                         
print(svc_svm_model_student)   
#Change the following line as appropriate
predictions_svm_svc_student = None 
print metrics.accuracy_score(binary_class_testing_labels,predictions_svm_svc_student) 

Logistic Regression

Logistic regression is a simple classifier that converts a regression model into a classification one. You can read the details at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [12]:
#task t5a
#Create a model with default parameters. Remember to set random state to 1234
#Change the following line as appropriate
lr_model_default = LogisticRegression(random_state = 1234)      
lr_model_default.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_lr_model_default = lr_model_default.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_lr_model_default)
0.957301913307
In [ ]:
#(Harder) task t5b
#Now try to beat the model above by tweaking the parameters. Remember to set random state to 1234.

#######Many different ways to do this - a number of teams got it right. 
#######If interested - please ask in Piazza post for how others did it.


#Change the following line as appropriate
lr_model_default_student = None            
lr_model_default_student.fit(binary_class_training_data, binary_class_training_labels)
#Change the following line as appropriate
predictions_lr_model_default_student = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_lr_model_default_student)

Perceptron

Perceptron is a simple model that can be used for linearly separable data. See details at url http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html .

In [13]:
#task t6a
#Create a perceptron model with default parameters and random state = 1234
#Change the following line as appropriate
perceptron_model_default = Perceptron(random_state = 1234)
perceptron_model_default.fit(binary_class_training_data, binary_class_training_labels)       
print(perceptron_model_default)                 
#Change the following line as appropriate
predictions_perceptron_default = perceptron_model_default.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_perceptron_default)    
Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      n_iter=5, n_jobs=1, penalty=None, random_state=1234, shuffle=False,
      verbose=0, warm_start=False)
0.933520700465

Random Forests

Random Forests is a very popular ensemble method. See url http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html for details

In [14]:
#task t7a
#Create a random forest classifier with Default parameters
#Change the following line as appropriate
rf_model_default = RandomForestClassifier(random_state = 1234)                                                                    
rf_model_default.fit(binary_class_training_data, binary_class_training_labels)                                                                    
print(rf_model_default)            
#Change the following line as appropriate
predictions_rf_default = rf_model_default.predict(binary_class_testing_data)
print metrics.accuracy_score(binary_class_testing_labels,predictions_rf_default) 
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=1234, verbose=0)
0.973948762296
In [ ]:
#(Harder) task t7b
#Create a random forest classifier that can beat the above default model.

#######Many different ways to do this - a number of teams got it right. 
#######If interested - please ask in Piazza post for how others did it.


#Change the following line as appropriate
rf_model_default_student = None                                                                   
rf_model_default_student.fit(binary_class_training_data, binary_class_training_labels)                                                                    
print(rf_model_default_student)        
#Change the following line as appropriate
predictions_rf_default_student = None
print metrics.accuracy_score(binary_class_testing_labels,predictions_rf_default_student) 

Part 2: Multi Class Classification

So far, we have been focussing on binary classification problems. Now let us consider the corresponding multi class classification problem where given an image, we have to predict whether it is any of the digits 0 to 9.

We will evaluate 4 classifiers - KNN, Decision Trees, Random Forest and SVM. We can see that for the first three, no changes are needed to make them multi class. SVM however, is by default, a binary classifier. We can use two options : OneVsRestClassifier or OneVsOneClassifier to make it into Multi Class classifier. See http://scikit-learn.org/stable/modules/multiclass.html for further details.

Make sure that you remember the variable names and their meaning:

  1. all_train_data, all_test_data: un normalized images of all digits
  2. all_train_labels, all_test_labels: labels for all digits
  3. scaled_training_data, scaled_testing_data: Normalized version of all_train_data, all_test_data for all digits
In [15]:
#task t8a
#Create a KNN Classifier for K=5 and train it on **scaled** training data and test it on scaled testing data
#Change the following line as appropriate
mc_knn_model_k = KNeighborsClassifier(n_neighbors=5, algorithm='brute') 
mc_knn_model_k.fit(scaled_training_data, all_train_labels)
#Change the following line as appropriate
predictions_mc_knn_model = mc_knn_model_k.predict(scaled_testing_data) 
print metrics.accuracy_score(all_test_labels, predictions_mc_knn_model)
0.9688
In [28]:
#task t8b
#Create a Decision tree with DEFAULT parameters and train it on **scaled** training data and test it on scaled testing data
#Remember to set random state to 1234
#Change the following line as appropriate
mc_cart_model_default = DecisionTreeClassifier(random_state = 1234) 
mc_cart_model_default.fit(scaled_training_data, all_train_labels)
print(mc_cart_model_default)
fileName = 'mc_dtree_default.png' 
plot_dtree(mc_cart_model_default, fileName) 
Image(filename=fileName)
#Change the following line as appropriate
predictions_mc_dtree_default = mc_cart_model_default.predict(scaled_testing_data) 
print metrics.accuracy_score(all_test_labels, predictions_mc_dtree_default)
DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=1234, splitter='best')
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.0718425 to fit

0.8787
In [ ]:
#(Harder) task t8c
#Using the multi classs decision tree above make some changes so that you can beat it.

#######Many different ways to do this - a number of teams got it right. 
#######If interested - please ask in Piazza post for how others did it.


# Remember to set random state to 1234
#Change the following line as appropriate
mc_cart_model_student = None
mc_cart_model_student.fit(scaled_training_data, all_train_labels)
print(mc_cart_model_student)
fileName = 'mc_dtree_student.png' 
plot_dtree(mc_cart_model_student, fileName) 
Image(filename=fileName)
#Change the following line as appropriate
predictions_mc_dtree_student = None 
print metrics.accuracy_score(all_test_labels, predictions_mc_dtree_student)
In [17]:
#task t8d
#Create a multi class classifier based on random forest with default parameters.
#Change the following line as appropriate
mc_rf_model_default = RandomForestClassifier(random_state =1234)   
mc_rf_model_default.fit(all_train_data, all_train_labels)
print(mc_rf_model_default)      
#Change the following line as appropriate
predictions_mc_rf_default = mc_rf_model_default.predict(all_test_data)
print metrics.accuracy_score(all_test_labels,predictions_mc_rf_default)
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=1234, verbose=0)
0.9487
In [ ]:
#(Harder) task t8e
#Tune the random forest classifier so that it beats the default model above

#######Many different ways to do this - a number of teams got it right. 
#######If interested - please ask in Piazza post for how others did it.


#Change the following line as appropriate
mc_rf_model_student = None   
mc_rf_model_student.fit(all_train_data, all_train_labels)
print(mc_rf_model_student)                               
#Change the following line as appropriate
predictions_mc_rf_student = None 
print metrics.accuracy_score(all_test_labels,predictions_mc_rf_student)
In [18]:
#task t8f
#Create a SVM based OneVsRestClassifier Set random state to 1234
#Change the following line as appropriate
mc_ovr_linear_svc_svm_model = OneVsRestClassifier(LinearSVC(random_state =1234)) 
mc_ovr_linear_svc_svm_model.fit(scaled_training_data, all_train_labels)
print(mc_ovr_linear_svc_svm_model)   
#Change the following line as appropriate
predictions_mc_ovr_linear_svm_svc = mc_ovr_linear_svc_svm_model.predict(scaled_testing_data)
print metrics.accuracy_score(all_test_labels,predictions_mc_ovr_linear_svm_svc) 
OneVsRestClassifier(estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=1234, tol=0.0001, verbose=0),
          n_jobs=1)
0.9182
In [ ]:
#(Harder) task t8g
#Tune the model above so that it beats the default classifier.  Set random state to 1234

#######Many different ways to do this - a number of teams got it right. 
#######If interested - please ask in Piazza post for how others did it.


#Change the following line as appropriate
mc_ovr_linear_svc_svm_model_student = None 
mc_ovr_linear_svc_svm_model_student.fit(scaled_training_data, all_train_labels)
print(mc_ovr_linear_svc_svm_model_student)   
#Change the following line as appropriate
predictions_mc_ovr_linear_svm_svc_student = None 
print metrics.accuracy_score(all_test_labels,predictions_mc_ovr_linear_svm_svc_student) 
In [20]:
#task t8h
#Create a SVM based OneVsOneClassifier. Set random state to 1234
#Change the following line as appropriate
mc_ovo_linear_svc_svm_model = OneVsOneClassifier(LinearSVC(random_state =1234))
mc_ovo_linear_svc_svm_model.fit(scaled_training_data, all_train_labels)
print(mc_ovo_linear_svc_svm_model)   
#Change the following line as appropriate
predictions_mc_ovo_linear_svm_svc = mc_ovo_linear_svc_svm_model.predict(scaled_testing_data)
print metrics.accuracy_score(all_test_labels,predictions_mc_ovo_linear_svm_svc)   
OneVsOneClassifier(estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=1234, tol=0.0001, verbose=0),
          n_jobs=1)
0.9372
In [ ]:
#(Harder) task t8i
#Tune the model so that it beats the classifier above. Set random state to 1234

#######Many different ways to do this - a number of teams got it right. 
#######If interested - please ask in Piazza post for how others did it.


#Change the following line as appropriate
mc_ovo_linear_svc_svm_model_student = None
mc_ovo_linear_svc_svm_model_student.fit(scaled_training_data, all_train_labels)
print(mc_ovo_linear_svc_svm_model_student)   
#Change the following line as appropriate
predictions_mc_ovo_linear_svm_svc_student = None
print metrics.accuracy_score(all_test_labels,predictions_mc_ovo_linear_svm_svc_student)

Part 3: Exploration of Different Evaluation Metrics for Multi Class Classification

Let us evaluate different metrics for the multi class classification models that we created so far. You may want to check the url http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics for additional details.

In [51]:
#task t9a
#assign the best multi class classification model that you found above.
# For eg, depending on which decision tree model is best 
#  set best_dtree_model to either mc_cart_model_default or mc_cart_model_student
# Similarly do for other models as well. 
# Remember to assign the MODEL variable

#Change the following lines as appropriate
best_knn_model_mc = mc_knn_model_k  
best_dtree_model_mc = mc_cart_model_default
best_rf_model_mc = mc_rf_model_default
best_svm_ovr_model_mc = mc_ovr_linear_svc_svm_model 
best_svm_ovo_model_mc = mc_ovo_linear_svc_svm_model 


#Assign the best BINARY classification models that you found above
best_knn_model_bc = knn_model_opt_k
best_dtree_model_bc = cart_model_entropy
best_rf_model_bc = rf_model_default
best_svm_model_bc = svc_svm_model


######Note - for tasks 9c and 9d you will use multi class models (ie. those variables ending in _mc)
############# For the remaining you will use binary class models (ie. those variables ending in _bc)
In [43]:
#task t9b
#Create predictions of these 5 models on test dataset
#Change the following lines as appropriate
predictions_best_knn_model_mc = predictions_mc_knn_model
predictions_best_dtree_model_mc = predictions_mc_dtree_default
predictions_best_rf_model_mc = predictions_mc_rf_default
predictions_best_svm_ovr_model_mc = predictions_mc_ovr_linear_svm_svc
predictions_best_svm_ovo_model_mc = predictions_mc_ovo_linear_svm_svc

#Create predictions of these 4 models on test dataset
#Change the following lines as appropriate
predictions_best_knn_model_bc = predictions_for_knn_opt_k
predictions_best_dtree_model_bc = predictions_dtree_entropy
predictions_best_rf_model_bc = predictions_dtree_entropy
predictions_best_svm_model_bc = predictions_svm_svc
In [44]:
#task t9c
#Print the classification report for each of the models above
#  Write code here

print "Classification report for : knn_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_knn_model_mc)
print "Classification report for : dtree_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_dtree_model_mc)
print "Classification report for : rf_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_rf_model_mc)
print "Classification report for : svm_ovr_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_svm_ovr_model_mc)
print "Classification report for : svm_ovo_model_mc"
print metrics.classification_report(all_test_labels, predictions_best_svm_ovo_model_mc)
print "Classification report for : knn_model_bc"
print metrics.classification_report(binary_class_testing_labels, predictions_best_knn_model_bc)
print "Classification report for : dtree_model_bc"
print metrics.classification_report(binary_class_testing_labels, predictions_best_dtree_model_bc)
print "Classification report for : rf_model_bc"
print metrics.classification_report(binary_class_testing_labels, predictions_best_rf_model_bc)
print "Classification report for : svm_model_bc"
print metrics.classification_report(binary_class_testing_labels, predictions_best_svm_model_bc)
Classification report for : knn_model_mc
             precision    recall  f1-score   support

        0.0       0.96      0.99      0.98       980
        1.0       0.95      1.00      0.98      1135
        2.0       0.98      0.96      0.97      1032
        3.0       0.96      0.97      0.97      1010
        4.0       0.98      0.96      0.97       982
        5.0       0.97      0.97      0.97       892
        6.0       0.98      0.99      0.98       958
        7.0       0.96      0.96      0.96      1028
        8.0       0.99      0.94      0.96       974
        9.0       0.96      0.95      0.95      1009

avg / total       0.97      0.97      0.97     10000

Classification report for : dtree_model_mc
             precision    recall  f1-score   support

        0.0       0.91      0.93      0.92       980
        1.0       0.95      0.96      0.96      1135
        2.0       0.87      0.86      0.87      1032
        3.0       0.84      0.85      0.84      1010
        4.0       0.87      0.88      0.88       982
        5.0       0.84      0.85      0.84       892
        6.0       0.90      0.88      0.89       958
        7.0       0.91      0.90      0.91      1028
        8.0       0.82      0.81      0.81       974
        9.0       0.85      0.85      0.85      1009

avg / total       0.88      0.88      0.88     10000

Classification report for : rf_model_mc
             precision    recall  f1-score   support

        0.0       0.96      0.99      0.98       980
        1.0       0.98      0.99      0.98      1135
        2.0       0.93      0.95      0.94      1032
        3.0       0.92      0.93      0.93      1010
        4.0       0.94      0.96      0.95       982
        5.0       0.94      0.93      0.93       892
        6.0       0.97      0.96      0.96       958
        7.0       0.96      0.94      0.95      1028
        8.0       0.94      0.92      0.93       974
        9.0       0.94      0.92      0.93      1009

avg / total       0.95      0.95      0.95     10000

Classification report for : svm_ovr_model_mc
             precision    recall  f1-score   support

        0.0       0.95      0.98      0.96       980
        1.0       0.96      0.98      0.97      1135
        2.0       0.93      0.89      0.91      1032
        3.0       0.90      0.91      0.90      1010
        4.0       0.92      0.93      0.92       982
        5.0       0.89      0.86      0.87       892
        6.0       0.94      0.95      0.94       958
        7.0       0.92      0.92      0.92      1028
        8.0       0.87      0.87      0.87       974
        9.0       0.90      0.89      0.89      1009

avg / total       0.92      0.92      0.92     10000

Classification report for : svm_ovo_model_mc
             precision    recall  f1-score   support

        0.0       0.97      0.97      0.97       980
        1.0       0.98      0.99      0.98      1135
        2.0       0.93      0.92      0.93      1032
        3.0       0.91      0.93      0.92      1010
        4.0       0.94      0.96      0.95       982
        5.0       0.91      0.89      0.90       892
        6.0       0.95      0.95      0.95       958
        7.0       0.95      0.93      0.94      1028
        8.0       0.89      0.91      0.90       974
        9.0       0.93      0.92      0.92      1009

avg / total       0.94      0.94      0.94     10000

Classification report for : knn_model_bc
             precision    recall  f1-score   support

          7       0.98      0.99      0.99      4746
          9       0.99      0.98      0.99      4505

avg / total       0.99      0.99      0.99      9251

Classification report for : dtree_model_bc
             precision    recall  f1-score   support

          7       0.97      0.97      0.97      4746
          9       0.96      0.97      0.97      4505

avg / total       0.97      0.97      0.97      9251

Classification report for : rf_model_bc
             precision    recall  f1-score   support

          7       0.97      0.97      0.97      4746
          9       0.96      0.97      0.97      4505

avg / total       0.97      0.97      0.97      9251

Classification report for : svm_model_bc
             precision    recall  f1-score   support

          7       0.96      0.95      0.95      4746
          9       0.95      0.96      0.95      4505

avg / total       0.95      0.95      0.95      9251

In [46]:
#task t9d
#Print the confusion matrix for each of the models above
#  Write code here
print "Confusionmatrix for : knn_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_knn_model_mc)
print "Confusionmatrix for : dtree_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_dtree_model_mc)
print "Confusionmatrix for : rf_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_rf_model_mc)
print "Confusionmatrix for : svm_ovr_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_svm_ovr_model_mc)
print "Confusionmatrix for : svm_ovo_model_mc"
print metrics.confusion_matrix(all_test_labels, predictions_best_svm_ovo_model_mc)
print "Confusionmatrix for : knn_model_bc"
print metrics.confusion_matrix(binary_class_testing_labels, predictions_best_knn_model_bc)
print "Confusionmatrix for : dtree_model_bc"
print metrics.confusion_matrix(binary_class_testing_labels, predictions_best_dtree_model_bc)
print "Confusionmatrix for : rf_model_bc"
print metrics.confusion_matrix(binary_class_testing_labels, predictions_best_rf_model_bc)
print "Confusionmatrix for : svm_model_bc"
print metrics.confusion_matrix(binary_class_testing_labels, predictions_best_svm_model_bc)
Confusionmatrix for : knn_model_mc
[[ 974    1    1    0    0    1    2    1    0    0]
 [   0 1133    2    0    0    0    0    0    0    0]
 [  11    8  991    2    1    0    1   15    3    0]
 [   0    3    3  976    1   13    1    6    3    4]
 [   3    7    0    0  944    0    4    2    1   21]
 [   5    0    0   12    2  862    4    1    2    4]
 [   5    3    0    0    3    2  945    0    0    0]
 [   0   22    4    0    3    0    0  988    0   11]
 [   8    3    5   13    6   12    5    5  913    4]
 [   5    7    3    9    7    3    1   10    2  962]]
Confusionmatrix for : dtree_model_mc
[[ 912    0    9    8    6   11   11    4   14    5]
 [   3 1089   10    4    0    5    8    1   13    2]
 [  10    9  891   27   15   10    9   29   24    8]
 [   7    3   34  861    6   47    4    8   22   18]
 [   6    4    8    6  865    6   13    9   20   45]
 [  13    7    2   44    6  754   21    6   23   16]
 [  20    4   13    7   19   23  846    0   22    4]
 [   2   11   25   16    9    5    3  925   10   22]
 [  11    9   27   36   19   27   17   15  785   28]
 [  13    6    8   19   46   11    5   19   23  859]]
Confusionmatrix for : rf_model_mc
[[ 969    0    2    0    0    7    1    1    0    0]
 [   1 1119    5    3    0    2    2    1    2    0]
 [   8    4  977    8    6    1    8   13    7    0]
 [   2    0   22  944    0   16    0   11   11    4]
 [   2    1    2    2  939    0    6    2    8   20]
 [   6    1    2   33    1  829    7    1   10    2]
 [   6    3    3    1   12    8  921    0    4    0]
 [   0    7   18    4    6    3    1  965    6   18]
 [   6    3   10   19    7    9    7    5  897   11]
 [   5    5    5   14   29    7    1    8    8  927]]
Confusionmatrix for : svm_ovr_model_mc
[[ 962    0    2    1    1    4    5    3    1    1]
 [   0 1112    3    2    0    1    5    1   11    0]
 [  11   11  914   18   10    4   13   12   36    3]
 [   4    0   19  918    2   22    5   12   20    8]
 [   1    4    6    3  913    0    9    3    5   38]
 [   9    2    0   40   12  767   17    7   30    8]
 [   7    4    7    2    5   21  909    1    2    0]
 [   2    8   23    5    7    1    1  948    4   29]
 [  11   13    8   21   14   31    8   13  843   12]
 [   7    8    2   15   31   12    0   26   12  896]]
Confusionmatrix for : svm_ovo_model_mc
[[ 953    0    6    1    1   11    4    1    2    1]
 [   0 1120    3    3    0    1    2    1    5    0]
 [   5    5  947   14    7    4    9   11   26    4]
 [   2    0    7  942    3   18    3    8   25    2]
 [   1    1    8    0  938    1    7    2    3   21]
 [   8    3    4   34    4  796   12    0   26    5]
 [   8    1   11    1    6   17  910    1    3    0]
 [   0    3   20   11    8    2    0  958    3   23]
 [   6    4    7   20    8   22    7    5  884   11]
 [   3    6    1    8   26    6    1   20   14  924]]
Confusionmatrix for : knn_model_bc
[[4687   59]
 [  75 4430]]
Confusionmatrix for : dtree_model_bc
[[4584  162]
 [ 121 4384]]
Confusionmatrix for : rf_model_bc
[[4584  162]
 [ 121 4384]]
Confusionmatrix for : svm_model_bc
[[4510  236]
 [ 190 4315]]
In [73]:
#(Harder) task t9e
#Each of the model above has some probabilistic interpretation
# So sklearn allows you to get the probability values as part of classification
# Using this information, you can print roc_curve
#Print the roc curve for each of the models above
#  Write code here

predictions_knn = [prob[1] for prob in best_knn_model_bc.predict_proba(binary_class_testing_data)]
knn_fpr,knn_tpr,knn_threshold=metrics.roc_curve(binary_class_testing_labels,predictions_knn,pos_label=9)

predictions_dtree = [prob[1] for prob in best_dtree_model_bc.predict_proba(binary_class_testing_data)]
dtree_fpr,dtree_tpr,dtree_threshold=metrics.roc_curve(binary_class_testing_labels,predictions_dtree,pos_label=9)

predictions_rf = [prob[1] for prob in best_rf_model_bc.predict_proba(binary_class_testing_data)]
rf_fpr,rf_tpr,rf_threshold=metrics.roc_curve(binary_class_testing_labels,predictions_rf,pos_label=9)

svm_fpr, svm_tpr, svm_thresholds = metrics.roc_curve(binary_class_testing_labels,predictions_best_svm_model_bc,pos_label=9)


#print best_rf_model_bc.classes_
print "K-NN Model:"
print "False-positive rate:", knn_fpr
print "True-positive rate: ", knn_tpr
print "Thresholds:         ", knn_threshold

print "\nD-Tree Model:"
print "False-positive rate:", dtree_fpr
print "True-positive rate: ", dtree_tpr
print "Thresholds:         ", dtree_threshold

print "\nRf Model:"
print "False-positive rate:", rf_fpr
print "True-positive rate: ", rf_tpr
print "Thresholds:         ", rf_threshold

print "\nSVM Model:"
print "False-positive rate:", svm_fpr
print "True-positive rate: ", svm_tpr
print "Thresholds:         ", svm_threshold
K-NN Model:
False-positive rate: [ 0.          0.00337126  0.01243152  0.03855879  1.        ]
True-positive rate:  [ 0.          0.94028857  0.98335183  0.99556049  1.        ]
Thresholds:          [ 2.          1.          0.66666667  0.33333333  0.        ]

D-Tree Model:
False-positive rate: [ 0.          0.03413401  1.        ]
True-positive rate:  [ 0.          0.97314095  1.        ]
Thresholds:          [ 2.  1.  0.]

Rf Model:
False-positive rate: [ 0.          0.00105352  0.00147493  0.00442478  0.01285293  0.02359882
  0.04003371  0.06552887  0.11083017  0.18099452  0.34618626  1.        ]
True-positive rate:  [ 0.          0.56071032  0.79200888  0.89278579  0.94250832  0.97136515
  0.98446171  0.99134295  0.99489456  0.99711432  0.99933407  1.        ]
Thresholds:          [ 2.   1.   0.9  0.8  0.7  0.6  0.5  0.4  0.3  0.2  0.1  0. ]

SVM Model:
False-positive rate: [ 0.          0.04972609  1.        ]
True-positive rate:  [ 0.          0.95782464  1.        ]
Thresholds:          [10  9  7]
In [75]:
#(Harder) task t9f
#For each model, DRAW the ROC curve
# See url below for details
#http://nbviewer.ipython.org/github/datadave/GADS9-NYC-Spring2014-Lectures/blob/master/lessons/lesson09_decision_trees_random_forests/sklearn_decision_trees.ipynb
#  Write code here

fig_knn = plt.figure()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC for KNN Model')
plt.plot(knn_fpr, knn_tpr)

fig_dtree = plt.figure()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC for DTree Model')
plt.plot(dtree_fpr, dtree_tpr)


fig_rf = plt.figure()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC for RF Model')
plt.plot(rf_fpr, rf_tpr)


fig_svm = plt.figure()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC for SVM Model')
plt.plot(svm_fpr, svm_tpr)
Out[75]:
[<matplotlib.lines.Line2D at 0x7f17eba06290>]
In [76]:
#(Harder) task t9g
#Print the AUC value for each of the models above
#  Write code here

print "AUC Value for KNN ", metrics.auc(knn_fpr, knn_tpr)
print "AUC Value for DTree ", metrics.auc(dtree_fpr, dtree_tpr)
print "AUC Value for RF ", metrics.auc(rf_fpr, rf_tpr)
print "AUC Value for SVM ", metrics.auc(svm_fpr, svm_tpr)
AUC Value for KNN  0.995458153206
AUC Value for DTree  0.969503473455
AUC Value for RF  0.995709173634
AUC Value for SVM  0.954049277083
In [83]:
#(Harder) task t9h
#Print the precision recall curve for each of the models above
#  Write code here

#print the curve based on http://scikit-learn.org/stable/auto_examples/plot_precision_recall.html
    
predictions_knn = [prob[1] for prob in best_knn_model_bc.predict_proba(binary_class_testing_data)]    
knn_precision, knn_recall, knn_threshold =  metrics.precision_recall_curve(
                    binary_class_testing_labels, predictions_knn, pos_label=9)
    
predictions_dtree = [prob[1] for prob in best_dtree_model_bc.predict_proba(binary_class_testing_data)]
dtree_precision,dtree_recall,dtree_threshold =  metrics.precision_recall_curve(
                    binary_class_testing_labels, predictions_dtree,pos_label=9)

predictions_rf = [prob[1] for prob in best_rf_model_bc.predict_proba(binary_class_testing_data)]
rf_precision,rf_recall,rf_threshold = metrics.precision_recall_curve(
                    binary_class_testing_labels, predictions_rf,pos_label=9)
    
svm_precision,svm_recall,svm_threshold = metrics.precision_recall_curve(
                    binary_class_testing_labels,predictions_best_svm_model_bc,pos_label=9)

fig_knn = plt.figure()
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.title('Precision-Recall Curve for KNN Model')
plt.xlim([0.0, 1.0])
plt.plot(knn_precision, knn_recall)

fig_dtree = plt.figure()
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.title('Precision-Recall Curve for DTree Model')
plt.xlim([0.0, 1.0])
plt.plot(dtree_precision, dtree_recall)


fig_rf = plt.figure()
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.title('Precision-Recall Curve for RF Model')
plt.xlim([0.0, 1.0])
plt.plot(rf_precision, rf_recall)


fig_svm = plt.figure()
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.title('Precision-Recall Curve for SVM Model')
plt.xlim([0.0, 1.0])
plt.plot(svm_precision, svm_recall)
Out[83]:
[<matplotlib.lines.Line2D at 0x7f17d3487710>]

Part 4: Parameter Tuning through Grid Search/Cross Validation and Parallelization

So far in this assignment, you manually tweaked the model till it became better. For complex models, this is often cumbersome. A common trick people use is called Grid Search where you exhaustively test various parameter combinations and pick the best set of parameter values. This is a VERY computationally intensive process and hence it will require some parallelization.

In this assignment, you will learn how to tune two models (a RandmForest and SVC SVM) for MNIST dataset and then parallelize it so as to get results faster. You might want to take a look at the url http://scikit-learn.org/stable/modules/grid_search.html for additional details.

One thing you might want to note is that the GridSearchCV uses cross validation for comparing models. So you have to send the ENTIRE MNIST dataset - i.e. mnist.data and mnist.target . The following cell creates two variables all_scaled_data and all_scaled_target that you can pass to GridSearchCV. In order to get the results in reasonable time, set the cv parameter of GridSearchCV to 3. Also remember to set the verbose parameter to 2 to get some details about what happens internally.

In [22]:
###Do not make any change below
all_scaled_data = binary_class_data / 255.0
all_scaled_target = binary_class_labels
In [23]:
#(Harder) task t10a
#Tuning SVC SVM model for MNIST
tuned_parameters = [{'kernel' : ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},
                    {'kernel' : ['poly'], 'degree' : [5, 9], 'C' : [1, 10]}] 
#Create a SVC SVM classifier object and tune it using GridSearchCV  function.
#Change the following line as appropriate
#replace this line with gridsearchcv
svm = GridSearchCV(SVC(random_state=1234), param_grid=tuned_parameters, cv=3, verbose=2)
svm.fit(all_scaled_data,all_scaled_target)
predictions_svm = svm.predict(all_scaled_data)

#print the details of the best model and its accuracy
print "Best model:", svm
print "Best params learned via GridSearch", svm.best_estimator_
print "Accuracy of best model", svm.best_score_
print "Accuracy of learned model",  metrics.accuracy_score(all_scaled_target,predictions_svm)
Fitting 3 folds for each of 13 candidates, totalling 39 fits
[CV] kernel=rbf, C=10, gamma=0.1 .....................................
[CV] ............................ kernel=rbf, C=10, gamma=0.1 - 1.4min
[CV] kernel=rbf, C=10, gamma=0.1 .....................................
[CV] ............................ kernel=rbf, C=10, gamma=0.1 - 1.4min
[CV] kernel=rbf, C=10, gamma=0.1 .....................................
[CV] ............................ kernel=rbf, C=10, gamma=0.1 - 1.3min
[CV] kernel=rbf, C=10, gamma=0.01 ....................................
[CV] ........................... kernel=rbf, C=10, gamma=0.01 -  10.9s
[CV] kernel=rbf, C=10, gamma=0.01 ....................................
[CV] ........................... kernel=rbf, C=10, gamma=0.01 -  10.8s
[CV] kernel=rbf, C=10, gamma=0.01 ....................................
[CV] ........................... kernel=rbf, C=10, gamma=0.01 -  10.6s
[CV] kernel=rbf, C=10, gamma=0.001 ...................................
[CV] .......................... kernel=rbf, C=10, gamma=0.001 -  16.7s
[CV] kernel=rbf, C=10, gamma=0.001 ...................................
[CV] .......................... kernel=rbf, C=10, gamma=0.001 -  15.8s
[CV] kernel=rbf, C=10, gamma=0.001 ...................................
[CV] .......................... kernel=rbf, C=10, gamma=0.001 -  16.2s
[CV] kernel=rbf, C=100, gamma=0.1 ....................................
[CV] ........................... kernel=rbf, C=100, gamma=0.1 - 1.4min
[CV] kernel=rbf, C=100, gamma=0.1 ....................................
[CV] ........................... kernel=rbf, C=100, gamma=0.1 - 1.4min
[CV] kernel=rbf, C=100, gamma=0.1 ....................................
[CV] ........................... kernel=rbf, C=100, gamma=0.1 - 1.3min
[CV] kernel=rbf, C=100, gamma=0.01 ...................................
[CV] .......................... kernel=rbf, C=100, gamma=0.01 -  10.9s
[CV] kernel=rbf, C=100, gamma=0.01 ...................................
[CV] .......................... kernel=rbf, C=100, gamma=0.01 -  10.3s
[CV] kernel=rbf, C=100, gamma=0.01 ...................................
[CV] .......................... kernel=rbf, C=100, gamma=0.01 -  10.6s
[CV] kernel=rbf, C=100, gamma=0.001 ..................................
[CV] ......................... kernel=rbf, C=100, gamma=0.001 -  12.2s
[CV] kernel=rbf, C=100, gamma=0.001 ..................................
[CV] ......................... kernel=rbf, C=100, gamma=0.001 -  12.1s
[CV] kernel=rbf, C=100, gamma=0.001 ..................................
[CV] ......................... kernel=rbf, C=100, gamma=0.001 -  11.5s
[CV] kernel=rbf, C=1000, gamma=0.1 ...................................
[CV] .......................... kernel=rbf, C=1000, gamma=0.1 - 1.4min
[CV] kernel=rbf, C=1000, gamma=0.1 ...................................
[CV] .......................... kernel=rbf, C=1000, gamma=0.1 - 1.4min
[CV] kernel=rbf, C=1000, gamma=0.1 ...................................
[CV] .......................... kernel=rbf, C=1000, gamma=0.1 - 1.3min
[CV] kernel=rbf, C=1000, gamma=0.01 ..................................
[CV] ......................... kernel=rbf, C=1000, gamma=0.01 -  11.0s
[CV] kernel=rbf, C=1000, gamma=0.01 ..................................
[CV] ......................... kernel=rbf, C=1000, gamma=0.01 -  10.6s
[CV] kernel=rbf, C=1000, gamma=0.01 ..................................
[CV] ......................... kernel=rbf, C=1000, gamma=0.01 -  10.7s
[CV] kernel=rbf, C=1000, gamma=0.001 .................................
[CV] ........................ kernel=rbf, C=1000, gamma=0.001 -  11.5s
[CV] kernel=rbf, C=1000, gamma=0.001 .................................
[CV] ........................ kernel=rbf, C=1000, gamma=0.001 -  11.5s
[CV] kernel=rbf, C=1000, gamma=0.001 .................................
[CV] ........................ kernel=rbf, C=1000, gamma=0.001 -  11.1s
[CV] kernel=poly, C=1, degree=5 ......................................
[CV] ............................. kernel=poly, C=1, degree=5 - 1.7min
[CV] kernel=poly, C=1, degree=5 ......................................
[CV] ............................. kernel=poly, C=1, degree=5 - 1.7min
[CV] kernel=poly, C=1, degree=5 ......................................
[CV] ............................. kernel=poly, C=1, degree=5 - 1.8min
[CV] kernel=poly, C=1, degree=9 ......................................
[CV] ............................. kernel=poly, C=1, degree=9 - 1.8min
[CV] kernel=poly, C=1, degree=9 ......................................
[CV] ............................. kernel=poly, C=1, degree=9 - 1.8min
[CV] kernel=poly, C=1, degree=9 ......................................
[CV] ............................. kernel=poly, C=1, degree=9 - 1.8min
[CV] kernel=poly, C=10, degree=5 .....................................
[CV] ............................ kernel=poly, C=10, degree=5 - 1.8min
[CV] kernel=poly, C=10, degree=5 .....................................
[CV] ............................ kernel=poly, C=10, degree=5 - 1.8min
[CV] kernel=poly, C=10, degree=5 .....................................
[CV] ............................ kernel=poly, C=10, degree=5 - 1.8min
[CV] kernel=poly, C=10, degree=9 .....................................
[CV] ............................ kernel=poly, C=10, degree=9 - 1.9min
[CV] kernel=poly, C=10, degree=9 .....................................
[CV] ............................ kernel=poly, C=10, degree=9 - 1.9min
[CV] kernel=poly, C=10, degree=9 .....................................
[CV] ............................ kernel=poly, C=10, degree=9 - 1.8min
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:  1.4min
[Parallel(n_jobs=1)]: Done  39 out of  39 | elapsed: 37.5min finished
Best model: GridSearchCV(cv=3,
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=1234,
  shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid=[{'kernel': ['rbf'], 'C': [10, 100, 1000], 'gamma': [0.1, 0.01, 0.001]}, {'kernel': ['poly'], 'C': [1, 10], 'degree': [5, 9]}],
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=2)
Best params learned via GridSearch SVC(C=100, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.01,
  kernel='rbf', max_iter=-1, probability=False, random_state=1234,
  shrinking=True, tol=0.001, verbose=False)
Accuracy of best model 0.993053119079
Accuracy of learned model 1.0
In [24]:
#(Harder) task t10b
#Tuning Random Forest for MNIST
tuned_parameters = [{'max_features': ['sqrt', 'log2'], 'n_estimators': [1000, 1500]}] 
#replace this line with gridsearchcv. Remember to pass 
# the following parameters to constructor or RandomForestClassifier min_samples_split=1, compute_importances=False, n_jobs=-1
#Change the following line as appropriate
rf = GridSearchCV(RandomForestClassifier(min_samples_split=1,compute_importances=False,n_jobs=-1, random_state=1234), tuned_parameters,cv=3,verbose=2)
rf.fit(all_scaled_data,all_scaled_target)
predictions_rf = rf.predict(all_scaled_data)

#print the details of the best model and its accuracy
print "Best model:", rf
print "Best params learned via GridSearch", rf.best_estimator_
print "Accuracy of best mode", rf.best_score_
print "Accuracy of learned model",  metrics.accuracy_score(all_scaled_target,predictions_rf)
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] max_features=sqrt, n_estimators=1000 ............................
[CV] ................... max_features=sqrt, n_estimators=1000 -   9.2s
/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/forest.py:776: DeprecationWarning: Setting compute_importances is no longer required as version 0.14. Variable importances are now computed on the fly when accessing the feature_importances_ attribute. This parameter will be removed in 0.16.
  DeprecationWarning)
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    9.2s
[CV] max_features=sqrt, n_estimators=1000 ............................
[CV] ................... max_features=sqrt, n_estimators=1000 -  10.6s
[CV] max_features=sqrt, n_estimators=1000 ............................
[CV] ................... max_features=sqrt, n_estimators=1000 -   9.4s
[CV] max_features=sqrt, n_estimators=1500 ............................
[CV] ................... max_features=sqrt, n_estimators=1500 -  15.5s
[CV] max_features=sqrt, n_estimators=1500 ............................
[CV] ................... max_features=sqrt, n_estimators=1500 -  14.5s
[CV] max_features=sqrt, n_estimators=1500 ............................
[CV] ................... max_features=sqrt, n_estimators=1500 -  14.6s
[CV] max_features=log2, n_estimators=1000 ............................
[CV] ................... max_features=log2, n_estimators=1000 -   5.4s
[CV] max_features=log2, n_estimators=1000 ............................
[CV] ................... max_features=log2, n_estimators=1000 -   4.2s
[CV] max_features=log2, n_estimators=1000 ............................
[CV] ................... max_features=log2, n_estimators=1000 -   4.1s
[CV] max_features=log2, n_estimators=1500 ............................
[CV] ................... max_features=log2, n_estimators=1500 -   6.1s
[CV] max_features=log2, n_estimators=1500 ............................
[CV] ................... max_features=log2, n_estimators=1500 -   6.1s
[CV] max_features=log2, n_estimators=1500 ............................
[CV] ................... max_features=log2, n_estimators=1500 -   7.9s
Best model: GridSearchCV(cv=3,
       estimator=RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=1, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=1234, verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid=[{'max_features': ['sqrt', 'log2'], 'n_estimators': [1000, 1500]}],
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=2)
Best params learned via GridSearch RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='sqrt',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=1, n_estimators=1500, n_jobs=-1,
            oob_score=False, random_state=1234, verbose=0)
Accuracy of best mode 0.987860501017
Accuracy of learned model 1.0
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:  1.8min finished
In [26]:
#(Harder) task t10c
#Parallelization
#Now re-run the tuning of SVC SVM model. But parallelize it via n_jobs option.
tuned_parameters = [{'kernel' : ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},
                    {'kernel' : ['poly'], 'degree' : [5, 9], 'C' : [1, 10]}]
#Change the following line as appropriate
#replace this line with gridsearchcv
svm_parallel = GridSearchCV(SVC(random_state=1234), tuned_parameters,cv=3,verbose=2,n_jobs=8)
svm_parallel.fit(all_scaled_data,all_scaled_target)
predictions_svm_paralell = svm_parallel.predict(all_scaled_data)

#print the details of the best model and its accuracy
print "Best model:", svm_parallel
print "Best params learned via GridSearch", svm_parallel.best_estimator_
print "Accuracy of best mode", svm_parallel.best_score_
print "Accuracy of learned model",  metrics.accuracy_score(all_scaled_target, predictions_svm_paralell)
Fitting 3 folds for each of 13 candidates, totalling 39 fits
[Parallel(n_jobs=8)]: Done   1 jobs       | elapsed:  5.0min
[Parallel(n_jobs=8)]: Done  39 out of  39 | elapsed: 17.1min finished
 Best model: GridSearchCV(cv=3,
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=1234,
  shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=8,
       param_grid=[{'kernel': ['rbf'], 'C': [10, 100, 1000], 'gamma': [0.1, 0.01, 0.001]}, {'kernel': ['poly'], 'C': [1, 10], 'degree': [5, 9]}],
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=2)
Best params learned via GridSearch SVC(C=100, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.01,
  kernel='rbf', max_iter=-1, probability=False, random_state=1234,
  shrinking=True, tol=0.001, verbose=False)
Accuracy of best mode 0.993053119079
Accuracy of learned model 1.0
[CV] kernel=rbf, C=10, gamma=0.1 .....................................
[CV] kernel=rbf, C=10, gamma=0.1 .....................................
[CV] kernel=rbf, C=10, gamma=0.1 .....................................
[CV] kernel=rbf, C=10, gamma=0.01 ....................................
[CV] kernel=rbf, C=10, gamma=0.01 ....................................
[CV] kernel=rbf, C=10, gamma=0.01 ....................................
[CV] kernel=rbf, C=10, gamma=0.001 ...................................
[CV] kernel=rbf, C=10, gamma=0.001 ...................................
[CV] ............................ kernel=rbf, C=10, gamma=0.1 - 5.0min[CV] ............................ kernel=rbf, C=10, gamma=0.1 - 5.0min[CV] ............................ kernel=rbf, C=10, gamma=0.1 - 4.9min[CV] ........................... kernel=rbf, C=10, gamma=0.01 -  39.1s[CV] ........................... kernel=rbf, C=10, gamma=0.01 -  38.5s[CV] ........................... kernel=rbf, C=10, gamma=0.01 -  37.2s[CV] .......................... kernel=rbf, C=10, gamma=0.001 -  57.4s[CV] .......................... kernel=rbf, C=10, gamma=0.001 -  55.2s







[CV] kernel=rbf, C=1000, gamma=0.01 ..................................[CV] kernel=rbf, C=1000, gamma=0.01 ..................................[CV] kernel=rbf, C=1000, gamma=0.1 ...................................[CV] kernel=rbf, C=100, gamma=0.1 ....................................[CV] kernel=rbf, C=100, gamma=0.1 ....................................[CV] kernel=rbf, C=10, gamma=0.001 ...................................[CV] kernel=rbf, C=100, gamma=0.01 ...................................[CV] kernel=rbf, C=100, gamma=0.1 ....................................







[CV] ......................... kernel=rbf, C=1000, gamma=0.01 -  39.3s[CV] ......................... kernel=rbf, C=1000, gamma=0.01 -  38.5s[CV] .......................... kernel=rbf, C=1000, gamma=0.1 - 5.0min[CV] ........................... kernel=rbf, C=100, gamma=0.1 - 4.9min[CV] ........................... kernel=rbf, C=100, gamma=0.1 - 5.0min[CV] .......................... kernel=rbf, C=10, gamma=0.001 -  53.9s[CV] .......................... kernel=rbf, C=100, gamma=0.01 -  38.4s[CV] ........................... kernel=rbf, C=100, gamma=0.1 - 5.0min







[CV] kernel=rbf, C=1000, gamma=0.001 .................................[CV] kernel=rbf, C=1000, gamma=0.001 .................................[CV] kernel=poly, C=10, degree=5 .....................................[CV] kernel=rbf, C=1000, gamma=0.01 ..................................[CV] kernel=rbf, C=1000, gamma=0.001 .................................[CV] kernel=rbf, C=100, gamma=0.01 ...................................[CV] kernel=rbf, C=100, gamma=0.01 ...................................[CV] kernel=poly, C=1, degree=5 ......................................







[CV] ........................ kernel=rbf, C=1000, gamma=0.001 -  45.3s[CV] ........................ kernel=rbf, C=1000, gamma=0.001 -  40.9s[CV] ............................ kernel=poly, C=10, degree=5 - 5.6min[CV] ......................... kernel=rbf, C=1000, gamma=0.01 -  40.7s[CV] ........................ kernel=rbf, C=1000, gamma=0.001 -  43.3s[CV] .......................... kernel=rbf, C=100, gamma=0.01 -  38.9s[CV] .......................... kernel=rbf, C=100, gamma=0.01 -  41.7s[CV] ............................. kernel=poly, C=1, degree=5 - 6.4min







[CV] kernel=poly, C=1, degree=9 ......................................[CV] kernel=poly, C=1, degree=5 ......................................[CV] kernel=poly, C=10, degree=5 .....................................[CV] kernel=poly, C=1, degree=5 ......................................[CV] kernel=poly, C=1, degree=9 ......................................[CV] kernel=rbf, C=100, gamma=0.001 ..................................[CV] kernel=rbf, C=100, gamma=0.001 ..................................






[CV] ............................. kernel=poly, C=1, degree=9 - 6.5min[CV] ............................. kernel=poly, C=1, degree=5 - 6.4min[CV] ............................ kernel=poly, C=10, degree=5 - 4.3min[CV] ............................. kernel=poly, C=1, degree=5 - 6.5min[CV] ............................. kernel=poly, C=1, degree=9 - 6.5min[CV] ......................... kernel=rbf, C=100, gamma=0.001 -  46.0s[CV] ......................... kernel=rbf, C=100, gamma=0.001 -  42.6s






[CV] kernel=poly, C=10, degree=9 .....................................[CV] kernel=poly, C=10, degree=9 .....................................[CV] kernel=rbf, C=1000, gamma=0.1 ...................................[CV] kernel=poly, C=10, degree=9 .....................................[CV] kernel=rbf, C=100, gamma=0.001 ..................................




[CV] ............................ kernel=poly, C=10, degree=9 - 4.0min[CV] ............................ kernel=poly, C=10, degree=9 - 4.3min[CV] .......................... kernel=rbf, C=1000, gamma=0.1 - 4.9min[CV] ............................ kernel=poly, C=10, degree=9 - 4.3min[CV] ......................... kernel=rbf, C=100, gamma=0.001 -  40.8s




[CV] kernel=rbf, C=1000, gamma=0.1 ...................................[CV] kernel=poly, C=1, degree=9 ......................................

[CV] .......................... kernel=rbf, C=1000, gamma=0.1 - 5.0min[CV] ............................. kernel=poly, C=1, degree=9 - 6.2min

[CV] kernel=poly, C=10, degree=5 .....................................
[CV] ............................ kernel=poly, C=10, degree=5 - 6.1min

Part 5: Evaluation of Various Regression Models for Boston Houses Dataset

We will now implement some regression routines for predicting the house prices in Boston.

In [84]:
#Do not make any changes in this cell
boston = load_boston()  
print boston.data.shape 
print boston.feature_names 
print np.max(boston.target), np.min(boston.target), np.mean(boston.target)  
print boston.DESCR     
(506, 13)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
50.0 5.0 22.5328063241
Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

In [85]:
#Do not make any changes in this cell.
print boston.data[0]   
print np.max(boston.data), np.min(boston.data), np.mean(boston.data) 

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=1234)

#Scale the data - important for regression. Learn what this function does
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)

X_train = scalerX.transform(X_train)  
y_train = scalery.transform(y_train)  
X_test = scalerX.transform(X_test)    
y_test = scalery.transform(y_test)

print np.max(X_train), np.min(X_train), np.mean(X_train), np.max(y_train), np.min(y_train), np.mean(y_train) 
[  6.32000000e-03   1.80000000e+01   2.31000000e+00   0.00000000e+00
   5.38000000e-01   6.57500000e+00   6.52000000e+01   4.09000000e+00
   1.00000000e+00   2.96000000e+02   1.53000000e+01   3.96900000e+02
   4.98000000e+00]
711.0 0.0 70.0724468258
10.3350722657 -3.98458419342 1.75652740442e-15 3.12177191399 -1.81569984046 3.02308749713e-16
In [93]:
#task t11a
#Create 13 scatter plots such that variables (CRIM to LSTAT) are in X axis and MEDV in y-axis.
# Organize the images such that the images are in 3 rows of 4 images each and 1 in last row

#Write code here
plt.figure()
fig_t11a,axes_t11a = plt.subplots(4, 4, figsize=(15,20))

t11_img_index = 0
for i in range(boston.feature_names.size):
        t11a_row, t11a_col = i // 4, i % 4
        axes_t11a[t11a_row][t11a_col].scatter(boston.data[:,i],boston.target)
        axes_t11a[t11a_row][t11a_col].set_title("Scatter Plot for " + boston.feature_names[i] + ' and MEDV')
        axes_t11a[t11a_row][t11a_col].set_xlabel(boston.feature_names[i])
        axes_t11a[t11a_row][t11a_col].set_ylabel('MEDV')
    
<matplotlib.figure.Figure at 0x7f17f8355d10>
In [100]:
#Do not make any change here
#To make your life easy, I have created a function that 
# (a) takes a regressor object,(b) trains it (c) makes some prediction (d) evaluates the prediction
def train_and_evaluate(clf, X_train, y_train): 
    clf.fit(X_train, y_train)   
    print "Coefficient of determination on training set:",clf.score(X_train, y_train)  
    cv = KFold(X_train.shape[0], 5, shuffle=True, random_state=1234)   
    scores = cross_val_score(clf, X_train, y_train, cv=cv)   
    print "Average coefficient of determination using 5-fold crossvalidation:",np.mean(scores)  
    
def plot_regression_fit(actual, predicted):
    plt.figure()
    plt.scatter(actual, predicted)
    plt.plot([0, 50], [0, 50], '--k')
    plt.axis('tight')
    plt.xlabel('True price ($1000s)')
    plt.ylabel('Predicted price ($1000s)') 
In [101]:
#task t11b
#create a regressor object based on LinearRegression
# See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
#Change the following line as appropriate
clf_ols = LinearRegression() #change this line
train_and_evaluate(clf_ols,X_train,y_train) 
clf_ols_predicted = clf_ols.predict(X_test)   
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_ols_predicted))
Coefficient of determination on training set: 0.730208069559
Average coefficient of determination using 5-fold crossvalidation: 0.697329670705
In [102]:
#task t11c
#See http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
#Create a regression based on Support Vector Regressor. Set the kernel to linear
#Change the following line as appropriate
clf_svr= SVR(kernel='linear')  
train_and_evaluate(clf_svr,X_train,y_train) 
clf_svr_predicted = clf_svr.predict(X_test) 
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_predicted))   

#task t11d
#Create a regression based on Support Vector Regressor. Set the kernel to polynomial
#Change the following line as appropriate
clf_svr_poly= SVR(kernel='poly')   
train_and_evaluate(clf_svr_poly,X_train,y_train) 
clf_svr_poly_predicted = clf_svr_poly.predict(X_test)      
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_poly_predicted)) 

#task t11e
#Create a regression based on Support Vector Regressor. Set the kernel to rbf
#Change the following line as appropriate
clf_svr_rbf= SVR(kernel='rbf')
train_and_evaluate(clf_svr_rbf,X_train,y_train)
clf_svr_rbf_predicted = clf_svr_rbf.predict(X_test)    
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_rbf_predicted))
Coefficient of determination on training set: 0.698960724444
Average coefficient of determination using 5-fold crossvalidation: 0.677821597121
Coefficient of determination on training set: 0.879810899174
Average coefficient of determination using 5-fold crossvalidation: 0.67285376301
Coefficient of determination on training set: 0.869497333473
Average coefficient of determination using 5-fold crossvalidation: 0.792544663494
In [103]:
#task t11f
#See http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
#Create regression tree
#Change the following line as appropriate
clf_cart = DecisionTreeRegressor()  
train_and_evaluate(clf_cart,X_train,y_train) 
clf_cart_predicted = clf_cart.predict(X_test)  
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_cart_predicted))  
Coefficient of determination on training set: 1.0
Average coefficient of determination using 5-fold crossvalidation: 0.722565488718
In [104]:
#task t11g
#See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
#Create a random forest regressor with 10 estimators and random state as 1234
#Change the following line as appropriate
clf_rf= RandomForestRegressor(n_estimators=10, random_state=1234)
train_and_evaluate(clf_rf,X_train,y_train)  
#The following line prints the most important features
print np.sort(zip(clf_rf.feature_importances_,boston.feature_names),axis=0)  
clf_rf_predicted = clf_rf.predict(X_test)      
plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_rf_predicted)) 
Coefficient of determination on training set: 0.973800976896
Average coefficient of determination using 5-fold crossvalidation: 0.8348507043
[['0.00102271671416' 'AGE']
 ['0.00170914978892' 'B']
 ['0.00224924400046' 'CHAS']
 ['0.00673636957636' 'CRIM']
 ['0.0108814568913' 'DIS']
 ['0.0123642647619' 'INDUS']
 ['0.0189779732305' 'LSTAT']
 ['0.0202834249095' 'NOX']
 ['0.0344749225758' 'PTRATIO']
 ['0.0378219977807' 'RAD']
 ['0.0560709681504' 'RM']
 ['0.364506357036' 'TAX']
 ['0.432901154584' 'ZN']]
In [ ]: