{
"metadata": {
"name": "",
"signature": "sha256:b5fc7b652e0052df22df54bf8af7e4a53ad6bc64d0b335cacb315601171369b8"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Programming Assignment 2: Classification of MNIST Handwritten Digits and Regression of House Prices through Boston Dataset\n",
"\n",
"###Team Details:\n",
"\n",
"When submitting, fill your team details in this cell. Note that this is a markdown cell.\n",
"\n",
"**Student 1 Full Name:**\n",
"**Student 1 Full ID:**\n",
"**Student 1 Full Email Address:**\n",
"\n",
"**Student 2 Full Name:**\n",
"**Student 2 Full ID:**\n",
"**Student 2 Full Email Address:**\n",
"\n",
"**Student 3 Full Name:**\n",
"**Student 3 Full ID:**\n",
"**Student 3 Full Email Address:**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Assignment Details\n",
"\n",
"At a high level, you will be building and evaluating different classifiers for recognizing handwritten digits of MNIST dataset and also build an evaluate various regression models for predicting house prices in Boston. At a more granular level, you will be doing the following:\n",
"\n",
"###1. Binary Classification of MNIST Dataset\n",
"\n",
"In the first set of tasks, you will evaluate a number of popular classifiers for the task of recognizing handwritten digits from MNIST dataset. Specifically, we will focus on distinguishing between 7 and 9 which are known to be a hard pair. We will not use any sophisticated ideas from Computer Vision/Image Processing and use classifiers directly over the data. The idea is to show that lot of times, you can simply run a set of classifiers and still get great results. While I will be giving some basic classifier code, you will have some opportunity to improve them by tuning the parameters. \n",
"\n",
"###2. Multi-Class Classification of MNIST Dataset\n",
"\n",
"In the second set of tasks, we will do multi-class classification where the idea is to classify the image to one of the ten digits (0-9). We will start with some basic classifiers that are intrinsically multi-class. Then we will learn about how to convert binary classifiers to multi-class classifiers and how scikit-learn makes it very easy.\n",
"\n",
"###3. Exploration of Different Evaluation Metrics\n",
"In the first two set of tasks, we will narrowly focus on accuracy - what fraction of our predictions - were correct. However, there are a number of popular evaluation metrics. You will learn how (and when) to use these evaluation metrics. \n",
"\n",
"###4. Parameter Tuning through Grid Search/Cross Validation and Parallelization\n",
"This is an advanced topic where you will learn how to tune your classifier and find optimal parameters. We will explore two powerful techniques of grid search and parameter search. This is a very compute intensive task - so you will also explore how to leverage parallelization capabilities of IPython kernel to get results sooner. \n",
"\n",
"###5. Evaluation of Various Regression Models for Boston Houses Dataset\n",
"In the final set of tasks, we will use regression to predict Boston house prices. We will explore both Ordinary Least Squares and also explore other regression variant of popular classifiers such as decision trees and SVM. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%matplotlib inline \n",
"\n",
"#Array processing\n",
"import numpy as np\n",
"\n",
"#Data analysis, wrangling and common exploratory operations\n",
"import pandas as pd\n",
"from pandas import Series, DataFrame\n",
"\n",
"#For visualization. Matplotlib for basic viz and seaborn for more stylish figures + statistical figures not in MPL.\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from IPython.core.display import Image\n",
"\n",
"from sklearn.datasets import fetch_mldata, load_boston \n",
"from sklearn.utils import shuffle \n",
"from sklearn.neighbors import KNeighborsClassifier \n",
"from sklearn import metrics \n",
"from sklearn import tree \n",
"from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor \n",
"from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB \n",
"from sklearn.svm import SVC, LinearSVC , SVR \n",
"from sklearn.linear_model import Perceptron, LogisticRegression, LinearRegression \n",
"from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor \n",
"from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier \n",
"from sklearn.cross_validation import KFold, train_test_split, cross_val_score \n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.grid_search import GridSearchCV\n",
"\n",
"########You have to install the python package pydot for generating some graph figures\n",
"######## If you are using pip, the command is pip install pydot\n",
"########You will also need to download Graphviz from http://www.graphviz.org/Download.php\n",
"########If you are using Windows, make sure that Graphviz/dot are in path and can be used by pydot package.\n",
"########If you get any pydot error, see url for one possible solution\n",
"######## http://stackoverflow.com/questions/15951748/pydot-and-graphviz-error-couldnt-import-dot-parser-loading-of-dot-files-will\n",
"######## If you are using Anaconda, pydot can be installed as conda install pydot\n",
"######## If that does not work check url:\n",
"######## http://stackoverflow.com/questions/27482170/installing-pydot-and-graphviz-packages-in-anaconda-environment\n",
"\n",
"import pydot, StringIO \n",
"\n",
"\n",
"########################If needed you can import additional packages for helping you, although I would discourage it\n",
"########################Put all your imports in the space below. If you use some strange package, \n",
"##########################provide clear information as to how we can install it.\n",
"\n",
"#######################End imports###################################\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Part 1: Binary Classification of MNIST Dataset\n",
"\n",
"In the first set of tasks, you will evaluate a number of popular classifiers for the task of recognizing handwritten digits from MNIST dataset. Specifically, we will focus on distinguishing between 7 and 9 which are known to be a hard pair. We will not use any sophisticated ideas from Computer Vision/Image Processing and use classifiers directly over the data."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"####################Do not change anything below\n",
"#Load MNIST data. fetch_mldata will download the dataset and put it in a folder called mldata. \n",
"#Some things to be aware of:\n",
"# The folder mldata will be created in the folder in which you started the notebook\n",
"# So to make your life easy, always start IPython notebook from same folder.\n",
"# Else the following code will keep downloading MNIST data\n",
"mnist = fetch_mldata(\"MNIST original\") \n",
"#The data is organized as follows:\n",
"# Each row corresponds to an image\n",
"# Each image has 28*28 pixels which is then linearized to a vector of size 784 (ie. 28*28)\n",
"# mnist.data gives the image information while mnist.target gives the number in the image\n",
"print \"#Images = %d and #Pixel per image = %s\" % (mnist.data.shape[0], mnist.data.shape[1])\n",
"\n",
"#Print first row of the dataset \n",
"img = mnist.data[0] \n",
"print \"First image shows %d\" % (mnist.target[0])\n",
"print \"The corresponding matrix version of image is \\n\" , img\n",
"print \"The image in grey shape is \"\n",
"plt.imshow(img.reshape(28, 28), cmap=\"Greys\") \n",
" \n",
"#First 60K images are for training and last 10K are for testing\n",
"all_train_data = mnist.data[:60000] \n",
"all_test_data = mnist.data[60000:] \n",
"all_train_labels = mnist.target[:60000] \n",
"all_test_labels = mnist.target[60000:] \n",
" \n",
" \n",
"#For the first task, we will be doing binary classification and focus on two pairs of \n",
"# numbers: 7 and 9 which are known to be hard to distinguish\n",
"#Get all the seven images\n",
"sevens_data = mnist.data[mnist.target==7] \n",
"#Get all the none images\n",
"nines_data = mnist.data[mnist.target==9] \n",
"#Merge them to create a new dataset\n",
"binary_class_data = np.vstack([sevens_data, nines_data]) \n",
"binary_class_labels = np.hstack([np.repeat(7, sevens_data.shape[0]), np.repeat(9, nines_data.shape[0])]) \n",
" \n",
"#In order to make the experiments repeatable, we will seed the random number generator to a known value\n",
"# That way the results of the experiments will always be same\n",
"np.random.seed(1234) \n",
"#randomly shuffle the data\n",
"binary_class_data, binary_class_labels = shuffle(binary_class_data, binary_class_labels) \n",
"print \"Shape of data and labels are :\" , binary_class_data.shape, binary_class_labels.shape \n",
"\n",
"#There are approximately 14K images of 7 and 9. \n",
"#Let us take the first 5000 as training and remaining as test data \n",
"orig_binary_class_training_data = binary_class_data[:5000] \n",
"binary_class_training_labels = binary_class_labels[:5000] \n",
"orig_binary_class_testing_data = binary_class_data[5000:] \n",
"binary_class_testing_labels = binary_class_labels[5000:] \n",
"\n",
"#The images are in grey scale where each number is between 0 to 255\n",
"# Now let us normalize them so that the values are between 0 and 1. \n",
"# This will be the only modification we will make to the image\n",
"binary_class_training_data = orig_binary_class_training_data / 255.0 \n",
"binary_class_testing_data = orig_binary_class_testing_data / 255.0 \n",
"scaled_training_data = all_train_data / 255.0 \n",
"scaled_testing_data = all_test_data / 255.0 \n",
"\n",
"print binary_class_training_data[0,:] \n",
" \n",
"###########Make sure that you remember the variable names and their meaning\n",
"#binary_class_training_data, binary_class_training_labels: Normalized images of 7 and 9 and the correct labels for training\n",
"#binary_class_testing_data, binary_class_testing_labels : Normalized images of 7 and 9 and correct labels for testing\n",
"#orig_binary_class_training_data, orig_binary_class_testing_data: Unnormalized images of 7 and 9\n",
"#all_train_data, all_test_data: un normalized images of all digits \n",
"#all_train_labels, all_test_labels: labels for all digits\n",
"#scaled_training_data, scaled_testing_data: Normalized version of all_train_data, all_test_data for all digits\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Binary Classification in scikit-learn\n",
"\n",
"All classifiers in scikit-learn follow a common pattern that makes life much easier. \n",
"Follow these steps for all the tasks below.\n",
"\n",
"1. Instantiate the classifier with appropriate parameters\n",
"2. Train/fit the classifier with training data and correct labels\n",
"3. Test the classifier with unseen data\n",
"4. Evaluate the performance of classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##K-Nearest Neighbor\n",
"\n",
"We will start with one of the simplest classifiers. In the cell below, I have given the code for k-NN with k=1.\n",
"This should give an idea about how to create, train and evaluate a classifier."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Do not change anything in this cell.\n",
"# The following are some utility functions to help you visualize k-NN\n",
"\n",
"\n",
"#Code courtesy AMPLab\n",
"\n",
"#This function displays one or more images in a grid manner.\n",
"def show_img_with_neighbors(imgs, n=1): \n",
" fig = plt.figure() \n",
" for i in xrange(0, n): \n",
" fig.add_subplot(1, n, i, xticklabels=[], yticklabels=[])\n",
" if n == 1: \n",
" img = imgs \n",
" else: \n",
" img = imgs[i] \n",
" plt.imshow(img.reshape(28, 28), cmap=\"Greys\") \n",
"\n",
"#This function shows some images for which k-NN made a mistake\n",
"# For each of the missed image, it will also show k most similar images so that you will get an idea of why it failed. \n",
"def show_erroring_images_for_model(errors_in_model, num_img_to_print, model, n_neighbors): \n",
" for errorImgIndex in errors_in_model[:num_img_to_print]: \n",
" error_image = binary_class_testing_data[errorImgIndex] \n",
" not_needed, result = model.kneighbors(error_image, n_neighbors=n_neighbors) \n",
" show_img_with_neighbors(error_image) \n",
" show_img_with_neighbors(binary_class_training_data[result[0],:], len(result[0])) "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Do not change anything in this cell.\n",
"#The code below creates a K-NN classifier with k=1.\n",
"#Clearly observe how I do it step by step.\n",
"\n",
"#Step 1: Create a classifier with appropriate parameters\n",
"knn_model_k1 = KNeighborsClassifier(n_neighbors=1, algorithm='brute')\n",
"#Step 2: Fit it with training data\n",
"knn_model_k1.fit(binary_class_training_data, binary_class_training_labels)\n",
"#Print the model so that you know all parameters\n",
"print(knn_model_k1) \n",
"#Step 3: Make predictions based on testing data\n",
"predictions_knn_model_k1 = knn_model_k1.predict(binary_class_testing_data)\n",
"#Step 4: Evaluate the data\n",
"print \"Accuracy of K-NN with k=1 is\", metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k1) \n",
"\n",
"#Let us now see the first five images that were predicted incorrectly see what the issue is \n",
"errors_knn_model_k1 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k1[i] != binary_class_testing_labels[i]]\n",
"show_erroring_images_for_model(errors_knn_model_k1, 5, knn_model_k1, 1) \n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t1a\n",
"#Using the above code as model, create a KNN classifier with k=3\n",
"#Change the following line as appropriate\n",
"knn_model_k3 = None\n",
"knn_model_k3.fit(binary_class_training_data, binary_class_training_labels)\n",
"#Change the following line as appropriate\n",
"predictions_knn_model_k3 = None \n",
"print metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k3)\n",
"\n",
"#Let us now see the first five images that were predicted incorrectly see what the issue is \n",
"errors_knn_model_k3 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k3[i] != binary_class_testing_labels[i]]\n",
"show_erroring_images_for_model(errors_knn_model_k3, 5, knn_model_k3, 3) \n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t1b\n",
"#Using the above code as model, create a KNN classifier with k=5\n",
"#Change the following line as appropriate\n",
"knn_model_k5 = None\n",
"knn_model_k5.fit(binary_class_training_data, binary_class_training_labels)\n",
"#Change the following line as appropriate\n",
"predictions_knn_model_k5 = None \n",
"print metrics.accuracy_score(binary_class_testing_labels, predictions_knn_model_k5)\n",
"\n",
"#Let us now see the first five images that were predicted incorrectly see what the issue is \n",
"errors_knn_model_k5 = [i for i in xrange(0, len(binary_class_testing_data)) if predictions_knn_model_k5[i] != binary_class_testing_labels[i]]\n",
"show_erroring_images_for_model(errors_knn_model_k5, 5, knn_model_k5, 5) \n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t1c\n",
"#Now let us evaluate KNN for different values of K and find the best K\n",
"#WARNING: This code will take 20-40 minutes to run. So make sure your code is correct\n",
"k_vals = xrange(1, 20)\n",
"#Initialize to 0\n",
"accuracy_vals = [0 for _ in k_vals] \n",
"for k in k_vals:\n",
" #Create a KNN with number of neighbors = k - #Change the following line as appropriate\n",
" knn_model_for_knn_k = None\n",
" knn_model_for_knn_k.fit(binary_class_training_data, binary_class_training_labels)\n",
" #Make the prediction - #Change the following line as appropriate\n",
" predictions_for_knn_k = None\n",
" accuracy_vals[k-1] = metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_k)\n",
" \n",
"#Now you have two arrays k_vals which have different values of k and accuracy_vals which have the corresponding accuracy \n",
"# of a model with that k\n",
"#Set opt_k to k that gives best accuracy. \n",
"# Hint: use np.argmax command.\n",
"# Also dont forget to add 1 to opt_k to correct the off by one error \n",
"# (as k varied from 1 to 20 while accuracy_vals[k] started from 0)\n",
"opt_k = None ##Change the following line as appropriate"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t1d\n",
"#Train the model with optimal k that we have seen so far\n",
"#Change the following line as appropriate\n",
"knn_model_opt_k = None\n",
"knn_model_opt_k.fit(binary_class_training_data, binary_class_training_labels)\n",
"predictions_for_knn_opt_k = knn_model_opt_k.predict(binary_class_testing_data)\n",
"print \"Accuracy for best k is \", metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_opt_k)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder)task t1e\n",
"##############################Read the parameter values in the url\n",
"# http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier\n",
"#Try different combinations of the parameters so that you can beat the accuracy of knn_model_opt_k\n",
"#Change the following line as appropriate\n",
"knn_model_opt_k_student = None\n",
"knn_model_opt_k_student.fit(binary_class_training_data, binary_class_training_labels)\n",
"predictions_for_knn_opt_k_student = knn_model_opt_k_student.predict(binary_class_testing_data)\n",
"print \"Accuracy for best k - variant by student is \", metrics.accuracy_score(binary_class_testing_labels, predictions_for_knn_opt_k_student)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Decision Trees\n",
"\n",
"In the next set of tasks, you will use Decision trees (see url\n",
"http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier\n",
") for classification. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"###Do not make any change below\n",
"def plot_dtree(model,fileName): \n",
" #You would have to install a Python package pydot \n",
" #You would also have to install graphviz for your system - see http://www.graphviz.org/Download..php \n",
" #If you get any pydot error, see url\n",
" # http://stackoverflow.com/questions/15951748/pydot-and-graphviz-error-couldnt-import-dot-parser-loading-of-dot-files-will\n",
" dot_tree_data = StringIO.StringIO() \n",
" tree.export_graphviz(model , out_file = dot_tree_data) \n",
" dtree_graph = pydot.graph_from_dot_data(dot_tree_data.getvalue()) \n",
" dtree_graph.write_png(fileName) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t2a\n",
"#Create a CART decision tree with DEFAULT values\n",
"#Remember to set the random state to 1234\n",
"#Change the following line as appropriate\n",
"cart_model_default = None \n",
"cart_model_default.fit(binary_class_training_data, binary_class_training_labels)\n",
"print(cart_model_default) \n",
"fileName = 'dtree_default.png'\n",
"plot_dtree(cart_model_default, fileName)\n",
"Image(filename=fileName)\n",
"#Change the following line as appropriate\n",
"predictions_dtree_default = None \n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_default) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t2b\n",
"#Create a CART decision tree with splitting criterion as entropy\n",
"#Remember to set the random state to 1234\n",
"#Change the following line as appropriate\n",
"cart_model_entropy = None \n",
"cart_model_entropy.fit(binary_class_training_data, binary_class_training_labels)\n",
"print(cart_model_entropy) \n",
"fileName = 'dtree_entropy.png'\n",
"plot_dtree(cart_model_entropy, fileName)\n",
"Image(filename=fileName)\n",
"#Change the following line as appropriate\n",
"predictions_dtree_entropy = None \n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_entropy) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t2c\n",
"#Create a CART decision tree with splitting criterion as entropy and min_samples_leaf as 100\n",
"#Remember to set the random state to 1234\n",
"#Change the following line as appropriate\n",
"cart_model_entropy_limit_leaves = None \n",
"cart_model_entropy_limit_leaves.fit(binary_class_training_data, binary_class_training_labels)\n",
"print(cart_model_entropy_limit_leaves) \n",
"fileName = 'dtree_entropy_limit_leaves.png'\n",
"plot_dtree(cart_model_entropy_limit_leaves, fileName)\n",
"Image(filename=fileName)\n",
"#Change the following line as appropriate\n",
"predictions_dtree_entropy_limit_leaves = None \n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_dtree_entropy_limit_leaves) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t2d\n",
"#Create a CART decision tree that beats the three models above\n",
"# You might want to consult the url\n",
"# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier\n",
"#Change the following line as appropriate\n",
"cart_model_student = None #Create the model with parameters that beats the models above\n",
"cart_model_student.fit(binary_class_training_data, binary_class_training_labels)\n",
"print(cart_model_student)\n",
"fileName = 'dtree_model_student.png'\n",
"plot_dtree(cart_model_student, fileName)\n",
"Image(filename=fileName)\n",
"#Change the following line as appropriate\n",
"predictions_dtree_student = None #change this line to make predictions\n",
"print metrics.accuracy_score(binary_class_testing_labels, predictions_dtree_student) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Naive Bayes\n",
"\n",
"In this task, you will create a set of Naive Bayes classifiers and evaluate them. You might want to use the following url\n",
"http://scikit-learn.org/stable/modules/naive_bayes.html "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t3a\n",
"#Create a Gaussian NB\n",
"#Change the following line as appropriate\n",
"nb_gaussian_model = None #create the model\n",
"nb_gaussian_model.fit(binary_class_training_data, binary_class_training_labels)\n",
"print(nb_gaussian_model) \n",
"#Change the following line as appropriate\n",
"predictions_gaussian_nb = None #make the predictions\n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_gaussian_nb) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t3b\n",
"#Now create multinomial NB\n",
"#Change the following line as appropriate\n",
"nb_multinomial_model = None #create the model\n",
"nb_multinomial_model.fit(binary_class_training_data, binary_class_training_labels)\n",
"print(nb_multinomial_model) \n",
"#Change the following line as appropriate\n",
"predictions_multinomial_nb = None #make the predictions\n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_multinomial_nb)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t3c\n",
"#Now create a binomial NB\n",
"#Change the following line as appropriate\n",
"nb_binomial_model = None #create the model\n",
"nb_binomial_model.fit(binary_class_training_data, binary_class_training_labels)\n",
"print(nb_binomial_model) \n",
"#Change the following line as appropriate\n",
"predictions_binomial_nb = None #make the predictions\n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_binomial_nb)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##SVM\n",
"\n",
"Let us test SVM on this dataset. You might want to read the http://scikit-learn.org/stable/modules/svm.html for help.\n",
"We will focus on SVC and LinearSVC."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t4a\n",
"#Create a SVM using SVC class. Remember to set random state to 1234.\n",
"#Change the following line as appropriate\n",
"svc_svm_model = None #Change this line\n",
"svc_svm_model.fit(binary_class_training_data, binary_class_training_labels) \n",
"print(svc_svm_model) \n",
"#Change the following line as appropriate\n",
"predictions_svm_svc = None #Chane this line \n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_svm_svc)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t4b\n",
"#Now create a linear SVM model using LinearSVC class. Remember to set random state to 1234.\n",
"#Change the following line as appropriate\n",
"linear_svc_svm_model = None \n",
"linear_svc_svm_model.fit(binary_class_training_data, binary_class_training_labels) \n",
"print(linear_svc_svm_model) \n",
"#Change the following line as appropriate\n",
"predictions_linear_svm_svc = None \n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_linear_svm_svc) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t4c\n",
"#####Either using SVC or LinearSVC, try tweaking the parameters so that you beat the two models above\n",
"#Change the following line as appropriate\n",
"svc_svm_model_student = None \n",
"svc_svm_model_student.fit(binary_class_training_data, binary_class_training_labels) \n",
"print(svc_svm_model_student) \n",
"#Change the following line as appropriate\n",
"predictions_svm_svc_student = None \n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_svm_svc_student) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Logistic Regression\n",
"\n",
"Logistic regression is a simple classifier that converts a regression model into a classification one.\n",
"You can read the details at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html \n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t5a\n",
"#Create a model with default parameters. Remember to set random state to 1234\n",
"#Change the following line as appropriate\n",
"lr_model_default = None \n",
"lr_model_default.fit(binary_class_training_data, binary_class_training_labels)\n",
"#Change the following line as appropriate\n",
"predictions_lr_model_default = None\n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_lr_model_default)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t5b\n",
"#Now try to beat the model above by tweaking the parameters. Remember to set random state to 1234.\n",
"#Change the following line as appropriate\n",
"lr_model_default_student = None \n",
"lr_model_default_student.fit(binary_class_training_data, binary_class_training_labels)\n",
"#Change the following line as appropriate\n",
"predictions_lr_model_default_student = None\n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_lr_model_default_student)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Perceptron\n",
"\n",
"Perceptron is a simple model that can be used for linearly separable data. See details at url\n",
"http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html .\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t6a\n",
"#Create a perceptron model with default parameters and random state = 1234\n",
"#Change the following line as appropriate\n",
"perceptron_model_default = None\n",
"perceptron_model_default.fit(binary_class_training_data, binary_class_training_labels) \n",
"print(perceptron_model_default) \n",
"#Change the following line as appropriate\n",
"predictions_perceptron_default = None\n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_perceptron_default) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Random Forests\n",
"\n",
"Random Forests is a very popular ensemble method. See url \n",
"http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n",
"for details"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t7a\n",
"#Create a random forest classifier with Default parameters\n",
"#Change the following line as appropriate\n",
"rf_model_default = None \n",
"rf_model_default.fit(binary_class_training_data, binary_class_training_labels) \n",
"print(rf_model_default) \n",
"#Change the following line as appropriate\n",
"predictions_rf_default = None\n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_rf_default) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t7b\n",
"#Create a random forest classifier that can beat the above default model.\n",
"#Change the following line as appropriate\n",
"rf_model_default_student = None \n",
"rf_model_default_student.fit(binary_class_training_data, binary_class_training_labels) \n",
"print(rf_model_default_student) \n",
"#Change the following line as appropriate\n",
"predictions_rf_default_student = None\n",
"print metrics.accuracy_score(binary_class_testing_labels,predictions_rf_default_student) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Part 2: Multi Class Classification\n",
"\n",
"So far, we have been focussing on binary classification problems.\n",
"Now let us consider the corresponding multi class classification problem \n",
"where given an image, we have to predict whether it is any of the digits 0 to 9. \n",
"\n",
"We will evaluate 4 classifiers - KNN, Decision Trees, Random Forest and SVM.\n",
"We can see that for the first three, no changes are needed to make them multi class. \n",
"SVM however, is by default, a binary classifier.\n",
"We can use two options : OneVsRestClassifier or OneVsOneClassifier to make it into Multi Class classifier. \n",
"See http://scikit-learn.org/stable/modules/multiclass.html for further details.\n",
"\n",
"Make sure that you remember the variable names and their meaning:\n",
"\n",
"1. all_train_data, all_test_data: un normalized images of all digits \n",
"2. all_train_labels, all_test_labels: labels for all digits\n",
"3. scaled_training_data, scaled_testing_data: Normalized version of all_train_data, all_test_data for all digits"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t8a\n",
"#Create a KNN Classifier for K=5 and train it on **scaled** training data and test it on scaled testing data\n",
"#Change the following line as appropriate\n",
"mc_knn_model_k = None \n",
"mc_knn_model_k.fit(scaled_training_data, all_train_labels)\n",
"#Change the following line as appropriate\n",
"predictions_mc_knn_model = None \n",
"print metrics.accuracy_score(all_test_labels, predictions_mc_knn_model)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t8b\n",
"#Create a Decision tree with DEFAULT parameters and train it on **scaled** training data and test it on scaled testing data\n",
"#Remember to set random state to 1234\n",
"#Change the following line as appropriate\n",
"mc_cart_model_default = None \n",
"mc_cart_model_default.fit(scaled_training_data, all_train_labels)\n",
"print(mc_cart_model_default)\n",
"fileName = 'mc_dtree_default.png' \n",
"plot_dtree(mc_cart_model_default, fileName) \n",
"Image(filename=fileName)\n",
"#Change the following line as appropriate\n",
"predictions_mc_dtree_default = None \n",
"print metrics.accuracy_score(all_test_labels, predictions_mc_dtree_default)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t8c\n",
"#Using the multi classs decision tree above make some changes so that you can beat it.\n",
"# Remember to set random state to 1234\n",
"#Change the following line as appropriate\n",
"mc_cart_model_student = None\n",
"mc_cart_model_student.fit(scaled_training_data, all_train_labels)\n",
"print(mc_cart_model_student)\n",
"fileName = 'mc_dtree_student.png' \n",
"plot_dtree(mc_cart_model_student, fileName) \n",
"Image(filename=fileName)\n",
"#Change the following line as appropriate\n",
"predictions_mc_dtree_student = None \n",
"print metrics.accuracy_score(all_test_labels, predictions_mc_dtree_student)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t8d\n",
"#Create a multi class classifier based on random forest with default parameters.\n",
"#Change the following line as appropriate\n",
"mc_rf_model_default = None \n",
"mc_rf_model_default.fit(all_train_data, all_train_labels)\n",
"print(mc_rf_model_default) \n",
"#Change the following line as appropriate\n",
"predictions_mc_rf_default = None\n",
"print metrics.accuracy_score(all_test_labels,predictions_mc_rf_default)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t8e\n",
"#Tune the random forest classifier so that it beats the default model above\n",
"#Change the following line as appropriate\n",
"mc_rf_model_student = None \n",
"mc_rf_model_student.fit(all_train_data, all_train_labels)\n",
"print(mc_rf_model_student) \n",
"#Change the following line as appropriate\n",
"predictions_mc_rf_student = None \n",
"print metrics.accuracy_score(all_test_labels,predictions_mc_rf_student)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t8f\n",
"#Create a SVM based OneVsRestClassifier Set random state to 1234\n",
"#Change the following line as appropriate\n",
"mc_ovr_linear_svc_svm_model = None \n",
"mc_ovr_linear_svc_svm_model.fit(scaled_training_data, all_train_labels)\n",
"print(mc_ovr_linear_svc_svm_model) \n",
"#Change the following line as appropriate\n",
"predictions_mc_ovr_linear_svm_svc = None\n",
"print metrics.accuracy_score(all_test_labels,predictions_mc_ovr_linear_svm_svc) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t8g\n",
"#Tune the model above so that it beats the default classifier. Set random state to 1234\n",
"#Change the following line as appropriate\n",
"mc_ovr_linear_svc_svm_model_student = None \n",
"mc_ovr_linear_svc_svm_model_student.fit(scaled_training_data, all_train_labels)\n",
"print(mc_ovr_linear_svc_svm_model_student) \n",
"#Change the following line as appropriate\n",
"predictions_mc_ovr_linear_svm_svc_student = None \n",
"print metrics.accuracy_score(all_test_labels,predictions_mc_ovr_linear_svm_svc_student) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t8h\n",
"#Create a SVM based OneVsOneClassifier. Set random state to 1234\n",
"#Change the following line as appropriate\n",
"mc_ovo_linear_svc_svm_model = None\n",
"mc_ovo_linear_svc_svm_model.fit(scaled_training_data, all_train_labels)\n",
"print(mc_ovo_linear_svc_svm_model) \n",
"#Change the following line as appropriate\n",
"predictions_mc_ovo_linear_svm_svc = None\n",
"print metrics.accuracy_score(all_test_labels,predictions_mc_ovo_linear_svm_svc) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t8i\n",
"#Tune the model so that it beats the classifier above. Set random state to 1234\n",
"#Change the following line as appropriate\n",
"mc_ovo_linear_svc_svm_model_student = None\n",
"mc_ovo_linear_svc_svm_model_student.fit(scaled_training_data, all_train_labels)\n",
"print(mc_ovo_linear_svc_svm_model_student) \n",
"#Change the following line as appropriate\n",
"predictions_mc_ovo_linear_svm_svc_student = None\n",
"print metrics.accuracy_score(all_test_labels,predictions_mc_ovo_linear_svm_svc_student)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Part 3: Exploration of Different Evaluation Metrics for Multi Class Classification\n",
"\n",
"Let us evaluate different metrics for the multi class classification models that we created so far. \n",
"You may want to check the url http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics for additional details."
]
},
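{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for the metrics named above before applying them to your own models, here is a minimal sketch on tiny hand-made label lists. The labels and scores are invented for illustration only and have nothing to do with MNIST; the calls used are `classification_report`, `confusion_matrix`, `roc_curve` and `auc` from `sklearn.metrics`.\n",
"\n",
"```python\n",
"# Sketch only: the label and score lists below are made up for illustration.\n",
"from sklearn import metrics\n",
"\n",
"# A toy 3-class problem: true labels and the labels some model predicted\n",
"y_true = [0, 1, 2, 2, 1, 0, 2, 1]\n",
"y_pred = [0, 1, 2, 1, 1, 0, 2, 2]\n",
"\n",
"# Per-class precision, recall and F1\n",
"print(metrics.classification_report(y_true, y_pred))\n",
"\n",
"# Rows are true classes, columns are predicted classes\n",
"cm = metrics.confusion_matrix(y_true, y_pred)\n",
"print(cm)\n",
"\n",
"# ROC analysis needs a binary problem and a score per sample\n",
"# (for example a predicted probability of the positive class)\n",
"b_true = [0, 0, 1, 1]\n",
"b_score = [0.1, 0.4, 0.35, 0.8]\n",
"fpr, tpr, thresholds = metrics.roc_curve(b_true, b_score)\n",
"roc_auc = metrics.auc(fpr, tpr)\n",
"print(roc_auc)\n",
"```\n",
"\n",
"The classification report and confusion matrix work directly on multi class predictions, whereas the ROC curve is computed from scores on a two-class problem."
]
},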
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t9a\n",
"#assign the best multi class classification model that you found above.\n",
"# For eg, depending on which decision tree model is best \n",
"# set best_dtree_model to either mc_cart_model_default or mc_cart_model_student\n",
"# Similarly do for other models as well. \n",
"# Remember to assign the MODEL variable\n",
"\n",
"#Change the following lines as appropriate\n",
"best_knn_model_mc = None \n",
"best_dtree_model_mc = None \n",
"best_rf_model_mc = None \n",
"best_svm_ovr_model_mc = None \n",
"best_svm_ovo_model_mc = None \n",
"\n",
"\n",
"#Assign the best BINARY classification models that you found above\n",
"best_knn_model_bc = None\n",
"best_dtree_model_bc = None\n",
"best_rf_model_bc = None\n",
"best_svm_model_bc = None\n",
"\n",
"\n",
"######Note - for tasks 9c and 9d you will use multi class models (ie. those variables ending in _mc)\n",
"############# For the remaining you will use binary class models (ie. those variables ending in _bc)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t9b\n",
"#Create predictions of these 5 models on test dataset\n",
"#Change the following lines as appropriate\n",
"predictions_best_knn_model_mc = None\n",
"predictions_best_dtree_model_mc = None\n",
"predictions_best_rf_model_mc = None\n",
"predictions_best_svm_ovr_model_mc = None\n",
"predictions_best_svm_ovo_model_mc = None\n",
"\n",
"#Create predictions of these 4 models on test dataset\n",
"#Change the following lines as appropriate\n",
"predictions_best_knn_model_bc = None\n",
"predictions_best_dtree_model_bc = None\n",
"predictions_best_rf_model_bc = None\n",
"predictions_best_svm_model_bc = None"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t9c\n",
"#Print the classification report for each of the models above\n",
"# Write code here"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t9d\n",
"#Print the confusion matrix for each of the models above\n",
"# Write code here"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t9e\n",
"#Each of the model above has some probabilistic interpretation\n",
"# So sklearn allows you to get the probability values as part of classification\n",
"# Using this information, you can print roc_curve\n",
"#Print the roc curve for each of the models above\n",
"# Write code here"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t9f\n",
"#For each model, DRAW the ROC curve\n",
"# See url below for details\n",
"#http://nbviewer.ipython.org/github/datadave/GADS9-NYC-Spring2014-Lectures/blob/master/lessons/lesson09_decision_trees_random_forests/sklearn_decision_trees.ipynb\n",
"# Write code here"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t9g\n",
"#Print the AUC value for each of the models above\n",
"# Write code here"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t9h\n",
"#Print the precision recall curve for each of the models above\n",
"# Write code here\n",
"\n",
"print the curve based on http://scikit-learn.org/stable/auto_examples/plot_precision_recall.html"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Part 4: Parameter Tuning through Grid Search/Cross Validation and Parallelization\n",
"\n",
"So far in this assignment, you manually tweaked the model till it became better.\n",
"For complex models, this is often cumbersome.\n",
"A common trick people use is called Grid Search where you exhaustively test various parameter combinations\n",
"and pick the best set of parameter values.\n",
"This is a VERY computationally intensive process and hence it will require some parallelization.\n",
"\n",
"In this assignment, you will learn how to tune two models (a RandmForest and SVC SVM) for MNIST dataset\n",
"and then parallelize it so as to get results faster.\n",
"You might want to take a look at the url\n",
"http://scikit-learn.org/stable/modules/grid_search.html\n",
"for additional details.\n",
"\n",
"One thing you might want to note is that the GridSearchCV uses cross validation for comparing models.\n",
"So you have to send the ENTIRE MNIST dataset - i.e. mnist.data and mnist.target . \n",
"The following cell creates two variables all_scaled_data and all_scaled_target that you can pass to GridSearchCV.\n",
"In order to get the results in reasonable time, set the **cv** parameter of GridSearchCV to 3.\n",
"Also remember to set the **verbose** parameter to 2 to get some details about what happens internally."
]
},
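{
"cell_type": "markdown",
"metadata": {},
"source": [
"The grid search procedure described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the first 500 samples of scikit-learn's built-in 8x8 digits set rather than MNIST, a deliberately tiny parameter grid chosen for this example, and a `try`/`except` import because `GridSearchCV` lives in `sklearn.model_selection` in recent releases and in `sklearn.grid_search` in older ones.\n",
"\n",
"```python\n",
"# Sketch only: a tiny grid search on a small subset of the 8x8 digits set.\n",
"try:\n",
"    from sklearn.model_selection import GridSearchCV  # recent scikit-learn\n",
"except ImportError:\n",
"    from sklearn.grid_search import GridSearchCV  # older scikit-learn\n",
"from sklearn.datasets import load_digits\n",
"from sklearn.svm import SVC\n",
"\n",
"digits = load_digits()\n",
"X = digits.data[:500] / 16.0  # pixel values in this set range from 0 to 16\n",
"y = digits.target[:500]\n",
"\n",
"# Assumption: a small illustrative grid (2 x 2 = 4 combinations)\n",
"param_grid = [{'kernel': ['rbf'], 'gamma': [0.1, 0.01], 'C': [1, 10]}]\n",
"\n",
"# cv=3 runs 3-fold cross validation for every combination;\n",
"# verbose=2 logs each fit; n_jobs=-1 would spread the fits over all cores\n",
"search = GridSearchCV(SVC(), param_grid, cv=3, verbose=2, n_jobs=1)\n",
"search.fit(X, y)\n",
"\n",
"print(search.best_params_)\n",
"print(search.best_score_)\n",
"```\n",
"\n",
"With 4 parameter combinations and 3 folds the search fits 12 models plus one final refit, which is why larger grids quickly become expensive."
]
},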
{
"cell_type": "code",
"collapsed": false,
"input": [
"###Do not make any change below\n",
"all_scaled_data = binary_class_data / 255.0\n",
"all_scaled_target = binary_class_labels"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t10a\n",
"#Tuning SVC SVM model for MNIST\n",
"tuned_parameters = [{'kernel' : ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},\n",
" {'kernel' : ['poly'], 'degree' : [5, 9], 'C' : [1, 10]}] \n",
"#Create a SVC SVM classifier object and tune it using GridSearchCV function.\n",
"#Change the following line as appropriate\n",
"svm = None #replace this line with gridsearchcv\n",
"#print the details of the best model and its accuracy"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t10b\n",
"#Tuning Random Forest for MNIST\n",
"tuned_parameters = [{'max_features': ['sqrt', 'log2'], 'n_estimators': [1000, 1500]}] \n",
"#replace this line with gridsearchcv. Remember to pass \n",
"# the following parameters to constructor or RandomForestClassifier min_samples_split=1, compute_importances=False, n_jobs=-1\n",
"#Change the following line as appropriate\n",
"rf = None \n",
"\n",
"#print the details of the best model and its accuracy"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#(Harder) task t10c\n",
"#Parallelization\n",
"#Now re-run the tuning of SVC SVM model. But parallelize it via n_jobs option.\n",
"tuned_parameters = [{'kernel' : ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},\n",
" {'kernel' : ['poly'], 'degree' : [5, 9], 'C' : [1, 10]}]\n",
"#Change the following line as appropriate\n",
"svm_parallel = None #replace this line with gridsearchcv\n",
"#print the details of the best model and its accuracy"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Part 5: Evaluation of Various Regression Models for Boston Houses Dataset\n",
"\n",
"We will now implement some regression routines for predicting the house prices in Boston."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Do not make any changes in this cell\n",
"boston = load_boston() \n",
"print boston.data.shape \n",
"print boston.feature_names \n",
"print np.max(boston.target), np.min(boston.target), np.mean(boston.target) \n",
"print boston.DESCR \n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Do not make any changes in this cell.\n",
"print boston.data[0] \n",
"print np.max(boston.data), np.min(boston.data), np.mean(boston.data) \n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=1234)\n",
"\n",
"#Scale the data - important for regression. Learn what this function does\n",
"scalerX = StandardScaler().fit(X_train)\n",
"scalery = StandardScaler().fit(y_train)\n",
"\n",
"X_train = scalerX.transform(X_train) \n",
"y_train = scalery.transform(y_train) \n",
"X_test = scalerX.transform(X_test) \n",
"y_test = scalery.transform(y_test)\n",
"\n",
"print np.max(X_train), np.min(X_train), np.mean(X_train), np.max(y_train), np.min(y_train), np.mean(y_train) "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t11a\n",
"#Create 13 scatter plots such that variables (CRIM to LSTAT) are in X axis and MEDV in y-axis.\n",
"# Organize the images such that the images are in 3 rows of 4 images each and 1 in last row\n",
"\n",
"#Write code here"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Do not make any change here\n",
"#To make your life easy, I have created a function that \n",
"# (a) takes a regressor object,(b) trains it (c) makes some prediction (d) evaluates the prediction\n",
"def train_and_evaluate(clf, X_train, y_train): \n",
" clf.fit(X_train, y_train) \n",
" print \"Coefficient of determination on training set:\",clf.score(X_train, y_train) \n",
" cv = KFold(X_train.shape[0], 5, shuffle=True, random_state=1234) \n",
" scores = cross_val_score(clf, X_train, y_train, cv=cv) \n",
" print \"Average coefficient of determination using 5-fold crossvalidation:\",np.mean(scores) \n",
" \n",
"def plot_regression_fit(actual, predicted):\n",
" plt.scatter(actual, predicted)\n",
" plt.plot([0, 50], [0, 50], '--k')\n",
" plt.axis('tight')\n",
" plt.xlabel('True price ($1000s)')\n",
" plt.ylabel('Predicted price ($1000s)') "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t11b\n",
"#create a regressor object based on LinearRegression\n",
"# See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\n",
"#Change the following line as appropriate\n",
"clf_ols = None #change this line\n",
"train_and_evaluate(clf_ols,X_train,y_train) \n",
"clf_ols_predicted = clf_ols.predict(X_test) \n",
"plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_ols_predicted))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t11c\n",
"#See http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html\n",
"#Create a regression based on Support Vector Regressor. Set the kernel to linear\n",
"#Change the following line as appropriate\n",
"clf_svr= None \n",
"train_and_evaluate(clf_svr,X_train,y_train) \n",
"clf_svr_predicted = clf_svr.predict(X_test) \n",
"plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_predicted)) \n",
"\n",
"#task t11d\n",
"#Create a regression based on Support Vector Regressor. Set the kernel to polynomial\n",
"#Change the following line as appropriate\n",
"clf_svr_poly= None \n",
"train_and_evaluate(clf_svr_poly,X_train,y_train) \n",
"clf_svr_poly_predicted = clf_svr_poly.predict(X_test) \n",
"plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_poly_predicted)) \n",
"\n",
"#task t11e\n",
"#Create a regression based on Support Vector Regressor. Set the kernel to rbf\n",
"#Change the following line as appropriate\n",
"clf_svr_rbf= None \n",
"train_and_evaluate(clf_svr_rbf,X_train,y_train)\n",
"clf_svr_rbf_predicted = clf_svr_rbf.predict(X_test) \n",
"plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_svr_rbf_predicted))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t11f\n",
"#See http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html\n",
"#Create regression tree\n",
"#Change the following line as appropriate\n",
"clf_cart = None \n",
"train_and_evaluate(clf_cart,X_train,y_train) \n",
"clf_cart_predicted = clf_cart.predict(X_test) \n",
"plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_cart_predicted)) \n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#task t11g\n",
"#See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor\n",
"#Create a random forest regressor with 10 estimators and random state as 1234\n",
"#Change the following line as appropriate\n",
"clf_rf= None\n",
"train_and_evaluate(clf_rf,X_train,y_train) \n",
"#The following line prints the most important features\n",
"print np.sort(zip(clf_rf.feature_importances_,boston.feature_names),axis=0) \n",
"clf_rf_predicted = clf_rf.predict(X_test) \n",
"plot_regression_fit(scalery.inverse_transform(y_test), scalery.inverse_transform(clf_rf_predicted)) "
],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}