Programming Assignment 3: Clustering and Dimensionality Reduction

Team Details:

When submitting, fill your team details in this cell. Note that this is a markdown cell.

========================Remember to remove your Facebook Credentials before submitting

Student 1 Full Name:
Student 1 Student ID:
Student 1 Email Address:

Student 2 Full Name:
Student 2 Student ID:
Student 2 Email Address:

Student 3 Full Name:
Student 3 Student ID:
Student 3 Email Address:

========================Remember to remove your Facebook Credentials before submitting

Assignment Details

In this assignment, you will explore the fascinating concepts of clustering and dimensionality reduction. Since the aim is to get some insights into how and when to use clustering, we will mostly stick to small synthetic datasets. However, for fun, at the end of this assignment, you will write some code to collect your friends information from Facebook and use it for clustering. I got some very interesting results when I run them against my friends' accounts. At a more granular level, you will be doing the following tasks:

1. Evaluation of $k$-Means over Diverse Datasets

In the first task, you will explore how $k$-Means perform on datasets with diverse structure. You will also explore different ways of customizing $k$-Means and learn about how to find good initialization and good value of $k$.

2. Evaluation of Hierarchical Clustering over Diverse Datasets

In this task, you will explore hierarchical clustering over different datasets. You will also evaluate different ways to merge clusters and good way to find the cut-off point for breaking the dendrogram.

3. Comparison of Clustering Evaluation Metrics

In the class, we mostly focused on SSE measure for evaluating how good a cluster is. There are many other statistical measures, and you will test them in this task.

4. Clustering your Facebook Friends

In this task, you will use the scraping knowledge you gained from Programming Assignment 1 to scrape details about your Facebook friends. You will then convert them to an appropriate vector format before clustering them.

5. Dimensionality Reduction

Dimensionality reduction is a key data pre-processing technique. You will perform PCA, a popular dimensionality reduction technique, over few images to get an intuition. Then you will apply it to MNIST data from Programming Assignment 2 to see how it performs.

In [184]:
####################Do not change anything below
%matplotlib inline 

#Array processing
import numpy as np

#Data analysis, wrangling and common exploratory operations
import pandas as pd
from pandas import Series, DataFrame

#For visualization. Matplotlib for basic viz and seaborn for more stylish figures + statistical figures not in MPL.
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.display import Image
from IPython.display import display

import sklearn.datasets as datasets
from sklearn.utils import shuffle

from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn import metrics

#A different implementation of k-means
import scipy as sp
import scipy.cluster.vq
import scipy.spatial.distance
from scipy.spatial.distance import cdist, pdist

from sklearn.datasets import fetch_mldata, fetch_olivetti_faces
from sklearn.utils import shuffle 
from sklearn.svm import LinearSVC
from sklearn.cross_validation import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.decomposition import PCA

import time
from collections import defaultdict

#################Might require installation of new packages
#Install selenium. Upgrade it if you have previously installed as recent version has some important bug fixes
import selenium.webdriver as webdriver
from selenium.webdriver.common.by import By

import json
from urlparse import parse_qs, urlsplit, urlunsplit

########################If needed you can import additional packages for helping you, although I would discourage it
########################Put all your imports in the space below. If you use some strange package, 
##########################provide clear information as to how we can install it.

#######################End imports###################################
In [185]:
#Set the colormap to jet. Most of the time this is good enough
# See http://matplotlib.org/examples/color/colormaps_reference.html for details
plt.set_cmap('jet')
<matplotlib.figure.Figure at 0x7fe9ce1ee350>

Part 1. Evaluation of $k$-Means over Diverse Datasets

In the first task, you will explore how $k$-Means perform on datasets with diverse structure. You will also explore different ways of customizing $k$-Means and learn about how to find good initialization and good value of $k$.

In [186]:
#task t1a

#Write the code for the following function: 
#  Note that it is invoked from t1b and later tasks - so you might want to partially solve them to get the data
#  that is then used in this task
#The input arguments are:
#     original_data : This is a 2D original data . 
#          original_data[:, 0] gives the first dimension and original_data[:, 1] gives the second dimension
#     original_cluster_assignments: In general, you do not have the "correct" cluster assignments.
#          Since we used synthetic data, we can get it anyway. This variable gives the correct values
#     kmeanspp_cluster_assignments: This is the cluster assignment that we got from calling k-means 
#           with k-means++ as the initialization algorithm
#     kmeans_random_cluster_assignments: This is the cluster assignment that we got from calling k-means 
#           with random initialization

#The code must do the following:
#   Create a 1x3 subplot where you plot original_data in each of them
#   In the first sub-figure, you have to plot the cluster assignment from original_cluster_assignments
#   In the second sub-figure, you have to plot the cluster assignment from kmeanspp_cluster_assignments
#   In the third sub-figure, you have to plot the cluster assignment from kmeans_random_cluster_assignments
# Hint:
#   1. The scatter function has an argument called c that accepts a color sequence
#       Since all three figures plot the same data, think about how you can use the c argument
#       and the cluster assignments to show how the clustering worked
#   2. This function will be called for different datasets. So ensure that you create a new figure object
#        So that the images dont get super-imposed
def part1_plot_clustering(original_data, original_cluster_assignments, 
                              kmeanspp_cluster_assignments, kmeans_random_cluster_assignments):
    plt.figure()
    fig,axes = plt.subplots(1, 3, figsize=(15,4))
    
    #Change below: call scatter plot function on axes[0]
    axes[0].scatter()
    axes[0].set_title('Original')
    
    #Change below: call scatter plot function on axes[1]
    axes[1].scatter()
    axes[1].set_title('Init=K-Means++')
    
    #Change below: call scatter plot function on axes[2]
    axes[1].scatter()
    axes[2].set_title('Init=Random')
In [187]:
#task t1b
#Scikit-learn has a number of functions to generate synthetic data.
# For the first assignment, you will be using make_blobs function
# See url http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html for details
# Create a dataset with 200 2-D points with 4 cluster with a standard deviation of 1.0
# Remember to create it with a random state of 1234
# Remember to do it for all future tasks even if I forget to mention it :)

#Change below to Create a dataset with 200 2-D points with 4 cluster with a standard deviation of 1.0
t1b_data, t1b_ground_truth = None
#Change below: Call Kmeans with 4 clusters, with k-means++ initialization heuristic and random state of 1234
t1b_kmeanspp = None
#Change below: Print the centroids
print None
#Change below: Find the cluster assignments for the data
t1b_kmeanspp_cluster_assignments = None
#Change below: Call Kmeans with 4 clusters, with random initialization heuristic and random state of 1234
t1b_kmeans_random = None
#Change below: Find the cluster assignments for the data
t1b_kmeans_random_cluster_assignments = None
part1_plot_clustering(t1b_data, t1b_ground_truth, t1b_kmeanspp_cluster_assignments, t1b_kmeans_random_cluster_assignments)
None

<matplotlib.figure.Figure at 0x7fe9ce101810>
In [188]:
#task t1c
# Create a dataset (make_blobs) with 200 2-D points with 4 cluster with a standard deviation of 5.0
# Remember to create it with a random state of 1234

#Change below to Create a dataset with 200 2-D points with 4 cluster with a standard deviation of 5.0
t1c_data, t1c_ground_truth = None
#Change below: Call Kmeans with 4 clusters, with k-means++ initialization heuristic and random state of 1234
t1c_kmeanspp = None
#Change below: Find the cluster assignments for the data
t1c_kmeanspp_cluster_assignments = None
#Change below: Call Kmeans with 4 clusters, with random initialization heuristic and random state of 1234
t1c_kmeans_random = None
#Change below: Find the cluster assignments for the data
t1c_kmeans_random_cluster_assignments = None
part1_plot_clustering(t1c_data, t1c_ground_truth, t1c_kmeanspp_cluster_assignments, t1c_kmeans_random_cluster_assignments)
<matplotlib.figure.Figure at 0x7fe9df851410>
In [189]:
#task t1d
# Create a dataset (make_blobs) with 200 2-D points with 10 clusters and with a standard deviation of 1.0
# Remember to create it with a random state of 1234

t1d_data, t1d_ground_truth = None
t1d_kmeanspp = None
t1d_kmeanspp_cluster_assignments = None
t1d_kmeans_random = None
t1d_kmeans_random_cluster_assignments = None
part1_plot_clustering(t1d_data, t1d_ground_truth, t1d_kmeanspp_cluster_assignments, t1d_kmeans_random_cluster_assignments)

#How does the result look?
<matplotlib.figure.Figure at 0x7fe9dda12350>
In [190]:
#task t1e
# Create a dataset (make_blobs) with 200 2-D points with 10 clusters and with a standard deviation of 5.0
# Remember to create it with a random state of 1234
# Then call K-Means with k=10

t1e_data, t1e_ground_truth = None
t1e_kmeanspp = None
t1e_kmeanspp_cluster_assignments = None
t1e_kmeans_random = None
t1e_kmeans_random_cluster_assignments = None
part1_plot_clustering(t1e_data, t1e_ground_truth, t1e_kmeanspp_cluster_assignments, t1e_kmeans_random_cluster_assignments)

#How does the result look?
<matplotlib.figure.Figure at 0x7fe9df4bfad0>
In [191]:
#task t1f
# For this assignment, you will be using make_circles function
# See url http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html for details
# Create a dataset with 200 2-D points 
# Remember to create it with a random state of 1234
#In the code, call the K-Means function with k=2

t1f_data, t1f_ground_truth = None
t1f_kmeanspp = None
t1f_kmeanspp_cluster_assignments = None
t1f_kmeans_random = None
t1f_kmeans_random_cluster_assignments = None
part1_plot_clustering(t1f_data, t1f_ground_truth, t1f_kmeanspp_cluster_assignments, t1f_kmeans_random_cluster_assignments)
<matplotlib.figure.Figure at 0x7fe9df75f810>
In [192]:
#task t1g
# For this assignment, you will be using make_moons function
# See url http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html for details
# Create a dataset with 200 2-D points 
# Remember to create it with a random state of 1234
#In the code, call the K-Means function with k=2

t1g_data, t1g_ground_truth = None
t1g_kmeanspp = None
t1g_kmeanspp_cluster_assignments = None
t1g_kmeans_random = None
t1g_kmeans_random_cluster_assignments = None
part1_plot_clustering(t1g_data, t1g_ground_truth, t1g_kmeanspp_cluster_assignments, t1g_kmeans_random_cluster_assignments)
<matplotlib.figure.Figure at 0x7fe9dd91e950>

$k$-Means and Image Compression via Vector Quantization

Here is a cool application of $k$-Means for image compression (we will discuss a more advanced method in dimensionality reduction later). The idea is incredibly simple. Suppose we have an image that has lot of colors that we wish to compress. One simple way is to reduce the number of colors. One of the compression techniques used by JPG is called run length encoding (RLE). Suppose you have a text like AAAAAAAABB, RLE compresses it to 8A2B. Intuitively, lesser colours will have longer runs and hence better compression.

Suppose we run $k$-Means on the image with a reasonably small $k$ where the data points are the pixels. For example, suppose we decide to use only 64 colours. Then we run $k$-Means with $k=64$. We then get the set of centroids which now correspond to "mean" (or representative) colors. We now replace the color of each pixel with the color provided by its centroid. This way, we quantize an image to have exactly $k$ colors.

Here is a crude computation of the savings achieved. Each pixel in the original image is associated with 24 bits (8 bits each for RGB). So if the image is $128 \times 128$, we need $128 \times 128 \times 24$ bits. Suppose we compressed it into 16 colors only. We now need to store two information: a dictionary of the 16 colors (which need $16 \times 24$ bits) and each pixel now can be represented using just 4 bits ($\lg 16 = 4$) and hence needs $128 \times 128 \times 4$. Overall we compressed the image by a factor of 6!!

In the following task, you will do a crude demonstration.

In [193]:
#Do not change anything below
#China is one of the sample figures in sklearn

#Code courtesy: Sklearn
#china is a 3-D array where the first two dimensions are co-ordinates of pixels (row and column)
# and third coordinate is a tuple with 3 values for RGB value
china = datasets.load_sample_image("china.jpg")
china = np.array(china, dtype=np.float64) / 255
china_w, china_h, china_d = tuple(china.shape)
print "Width=%s, Height=%s, Depth=%s" % (china_w, china_h, china_d)
#Convert it to a 2D array for analysis
china_image_array = np.reshape(china, (china_w * china_h, china_d))
print "In 2-D the shape is ", china_image_array.shape

def recreate_image(codebook, labels, w, h):
    """Recreate the (compressed) image from the code book & labels"""
    d = codebook.shape[1]
    image = np.zeros((w, h, d))
    label_idx = 0
    for i in range(w):
        for j in range(h):
            image[i][j] = codebook[labels[label_idx]]
            label_idx += 1
    return image


plt.figure()
ax = plt.axes([0, 0, 1, 1])
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)
Width=427, Height=640, Depth=3
In 2-D the shape is  (273280, 3)

Out[193]:
<matplotlib.image.AxesImage at 0x7fe9cda51410>
In [194]:
#t1h:

#This task could run for a long time on typical machines
#Let us run K-means with different values of k
# Then using the new centroids, compress the image and display it. 

t1h_start_time = time.time()
plt.figure()
fig,axes = plt.subplots(2, 2, figsize=(15,6))

#the 2d is for convenience
t1h_k_values = [[16, 32], [64,128]]
for i in range(2):
    for j in range(2):
        print "Handling k =", t1h_k_values[i][j]
        #Change below: call Kmeans with k=t1h_k_values [i][j] and random state = 1234
        t1h_kmeans_obj = None
        #Change below: fit the object with china image array variable
        t1h_kmeans_fit = None
        axes[i][j].imshow(recreate_image(t1h_kmeans_fit.cluster_centers_, t1h_kmeans_fit.labels_, china_w, china_h))
        axes[i][j].set_title('Compression with' + str(t1h_k_values[i][j]) + " colors")
        
        axes[i][j].grid(False)
        axes[i][j].get_xaxis().set_ticks([])
        axes[i][j].get_yaxis().set_ticks([])

print "Clustering over entire data took %s seconds" % (time.time() - t1h_start_time)
Handling k = 16
Handling k = 32
Handling k = 64
Handling k = 128
Clustering over entire data took 536.04073596 seconds

<matplotlib.figure.Figure at 0x7fe9cd916690>