CSE 5334: Data Mining

Section 002, Spring 2015

TuTh 2:00PM - 3:20PM, PKH 321

Programming Assignments
  1. Exploratory Data Analysis using Python, Pandas, Matplotlib and Seaborn.
  2. Classification and Regression Algorithms using Scikit-learn
  3. Clustering using Scikit-learn
  4. Search Engine Basics
  5. Recommender Systems
  6. Capstone Project: Putting it all together

Late days: Each student has 5 late days to use at his or her discretion for the problem sets and programming assignments. One cannot use more than 2 days for one assignment or problem set without prior approval from the instructors. Please reserve your late days for legitimate emergencies. Each late day constitutes a 24-hour extension; you cannot split late days into smaller increments. If two partners on a programming exercise want to take a late day, they must contribute two days from their allotment; either one from each of them, or, with permission, two days from the person with extra late days.

Late penalties: Once a student runs out of late days, any late submissions are penalized at a rate of 50% per day. No assignment may be handed in more than 2 days late.

PA I: Exploratory Analysis of FEC's 2012 Presidential Election Contribution Dataset

In this assignment, you will be doing three different things. First, you will scrape content from three popular websites (Wikipedia, Walmart and Facebook). Next, you will use Pandas to perform a guided exploration over FEC data. Finally, you will use some basic 1-D and 2-D plots to visualize the data.

This is a LONG assignment. So please form a team and start early!

Deadline: 26-Feb, 2015
Dataset: [Link] (Zip file)
Dataset Description: [Link]
Notebook: [IPynb] [HTML]
Solution: [IPynb] [HTML]
Assignment Survey: [Link]
Note that this is an anonymous form. You will be required to login to ensure that each student submits only once. But the responses are anonymous.

PA II: Classification of MNIST Handwritten Digits and Regression of House Prices through Boston Dataset

At a high level, you will be building and evaluating different classifiers for recognizing handwritten digits of MNIST dataset and also build and evaluate various regression models for predicting house prices in Boston. At a more granular level, you will be doing the following:

1. Binary Classification of MNIST Dataset
In the first set of tasks, you will evaluate a number of popular classifiers for the task of recognizing handwritten digits from MNIST dataset. Specifically, we will focus on distinguishing between 7 and 9 which are known to be a hard pair. We will not use any sophisticated ideas from Computer Vision/Image Processing and use classifiers directly over the data. The idea is to show that lot of times, you can simply run a set of classifiers and still get great results. While I will be giving some basic classifier code, you will have some opportunity to improve them by tuning the parameters.
2. Multi-Class Classification of MNIST Dataset
In the second set of tasks, we will do multi-class classification where the idea is to classify the image to one of the ten digits (0-9). We will start with some basic classifiers that are intrinsically multi-class. Then we will learn about how to convert binary classifiers to multi-class classifiers and how scikit-learn makes it very easy.
3. Exploration of Different Evaluation Metrics
In the first two set of tasks, we will narrowly focus on accuracy - what fraction of our predictions were correct. However, there are a number of popular evaluation metrics. You will learn how (and when) to use these evaluation metrics.
4. Parameter Tuning through Grid Search/Cross Validation and Parallelization
This is an advanced topic where you will learn how to tune your classifier and find optimal parameters. We will explore two powerful techniques of grid search and parameter search. This is a very compute intensive task - so you will also explore how to leverage parallelization capabilities of IPython kernel to get results sooner.
5. Evaluation of Various Regression Models for Boston Houses Dataset
In the final set of tasks, we will use regression to predict Boston house prices. We will explore both Ordinary Least Squares and also explore other regression variants of popular classifiers such as decision trees and SVM.

This is a relatively easy assignment. But some of the classifiers might take a lot of time to run. So please form a team and start early!

Deadline: 15-Mar, 2015
Dataset: [Link] (download the dataset in Matlab format). The notebook also has some helper functions that can automatically download the data for you.
Notebook: [IPynb] [HTML]
Solution: [IPynb] [HTML]
Assignment Survey: [Link]
Note that this is an anonymous form. You will be required to login to ensure that each student submits only once. But the responses are anonymous.

PA III: Clustering and Dimensionality Reduction

In this assignment, you will explore the fascinating concepts of clustering and dimensionality reduction. Since the aim is to get some insights into how and when to use clustering, we will mostly stick to small synthetic datasets. However, for fun, at the end of this assignment, you will write some code to collect your friends information from Facebook and use it for clustering. I got some very interesting results when I run them against my friends' accounts. At a more granular level, you will be doing the following tasks:

1. Evaluation of k-Means over Diverse Datasets

In the first task, you will explore how k-Means perform on datasets with diverse structure. You will also explore different ways of customizing $k$-Means and learn about how to find good initialization and good value of k.

2. Evaluation of Hierarchical Clustering over Diverse Datasets

In this task, you will explore hierarchical clustering over different datasets. You will also evaluate different ways to merge clusters and good way to find the cut-off point for breaking the dendrogram.

3. Comparison of Clustering Evaluation Metrics

In the class, we mostly focused on SSE measure for evaluating how good a cluster is. There are many other statistical measures, and you will test them in this task.

4. Clustering your Facebook Friends

In this task, you will use the scraping knowledge you gained from Programming Assignment 1 to scrape details about your Facebook friends. You will then convert them to an appropriate vector format before clustering them.

5. Dimensionality Reduction

Dimensionality reduction is a key data pre-processing technique. You will perform PCA, a popular dimensionality reduction technique, over few images to get an intuition. Then you will apply it to MNIST data from Programming Assignment 2 to see how it performs.

Deadline: 5-Apr, 2015
Notebook: [IPynb] [HTML]
Solution: [IPynb] [HTML]

PA IV: Search Engines and Inbox Ranking

In this project, you will do two sub-projects that are loosely connected via text processing (and data mining). In the first project, you will build a toy search engine and implement some important modules for crawling, text-processing, answering and ranking keyword queries via vector space model, zone scoring and pagerank. In the second project, you will learn more about ranking in the context of mail box. We will build a toy version of Google's priority inbox using some limited set of features. At a more granular level, you will be doing the following tasks:

Sub Project 1: Search Engine

1. Building a Crawler for UTA:

In the first task, you will build a crawler for UTA using Scrapy module. I have given some starter code that you will have to flesh out.

2. Text Processing and Answering Queries using Vector Space Model and Zone Scoring:

In the second task, you will process the web pages from UTA using NLTK. You will implement the standard text processing techniques (such as stemming, lemmatization, normalization etc). You will then implement two simple ranking algorithms: one using vector space model (tf-idf) and another using zone scoring.

3. PageRank Computation:

In the third task, you will compute PageRank over the entire UTA corpus. Naively constructing an adjacency matrix might require gigabytes of RAM. For eg, an adjacency matrix for 50000 nodes will require close to 20 GB. We will instead explore the usage of sparse matrices that require substantially less memory. Specifically, you will investigate how to use two popular techniques - dictionary of keys and compressed sparse row. You will then investigate how this approach provides superior memory and time reductions.


Sub Project 2: Inbox Ranking

In this task, you will learn about how to do text mining. Specifically, you will download and parse your GMail data. You will be doing two tasks:

4. Priority Categorization of Emails:

We will now use GMail data to build a classifier to determine whether an email is important or not. For this, we will use Google's Priority Inbox as the ground truth. In other words, GMail internally tags the emails with a label called "Important" and we will consider them as important too. Of course, the other option is manually tag each email as important or not which might be tedious. You will download GMail data using Google's Takeout feature and evaluate different ways to construct features (such as Binary, Count and TF-IDF) and try different classifiers such as Naive Bayes and SVM.

5. Ranking Emails by Priority:

Often, simply categorizing email as important or not is not sufficient. In this task, you will build a ranking function that computes for each email a score based on how important it is. By sorting the emails based on the score, you can identify the top-k emails. At a high level, this is how many sites such as Facebook, LinkedIn etc operate. We will use some simple features such as sender, subject, date, body and threads and design a function inspired by GMail's Priority Inbox ranking function (Read the first 2 pages of http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36955.pdf if you are interested in how GMail does it).

Deadline: 26-Apr
Starter Code for Search Engine: [ZIP]

This is a challenging assignment. So start early!!

PA V: Building a Simple Movie Recommendation System

In this assignment, we will explore some of the fundamental concepts in building a recommendation system for movies. This assignment material is inspired by code from a number of resources including those from Hanspeter Pfister, YHat, Sudeep Das, Jonny and others. At a high level, you will be doing the following tasks:

1. Exploratory Analysis/Sparsity

Before building the actual recommender, we will explore the movie dataset and see how ratings are distributed across different dimensions and how the sparse the overall dataset is. These factors (long tail/sparsity) cause lot of issues when building a recommender.

2. Nearest Neighbor based Recommender System

We will start with a simple nearest neighbor based recommender system. Given a movie, we will try to find the top-k most similar movies. We will just provide the list of movies without predicting the ratings.

3. Item based Collaborative Filtering

In this task, we will try to predict the ratings also. We will use item based collaborative filtering that computes similarity between movies and use them for recommendation.

4. User based Collaborative Filtering

In this task, we will design a user based collaborative filtering that is very similar to the recommender you designed in Task 3.

5. Latent Factor Models

In this task, we will explore SVD based latent factor model for recommendation.

Deadline: 15-May, 2015
Notebook: [IPynb] [HTML]
Solution: [IPynb] [HTML]

Final Project

Deadline: 10-May, 2015

The objective of this project is to show your mastery of data mining by performing some non-trivial, challenging and fun project that will use one or more major concepts discussed in the class. I will create a Piazza post with more suggestions and additional details. At a high level, your proposed project has to fall into one of the three categories:

  1. Do some analysis on a known dataset (benchmark datasets, KDD Cup, Kaggle, data.gov or from state government of Texas etc).
  2. Learn and apply some interesting concept not discussed in class or explore in-depth some topic discussed in the class. Examples for the former might include Topic Models, Social Network Analysis, Anomaly Detection, Mining in Computational Finance, Mining in Election polls, Bayesian Statistics, data analytics in NoSQL databases etc. Examples for the latter could include Deep Learning/Neural Networks, Sampling, Computational Advertising, SVM, Feature Selection, Recommender Systems, Ensembles, challenges in High dimensional data (images, text, biological etc), Model testing on Web etc.
  3. Do a non-trivial mining project using some popular open source tool, package or resource. Examples might include Social Media APIs (Twitter, Facebook, Yelp, NYTimes), Visualization Software (D3, Tableau software), Large Scale processing (MapReduce, Apache Spark, AWS/AzureML/Google Predict/Mahout), Graph processing (GraphLab, Giraph, Pregel), Deep Learning (Torch, Caffe), Online learning (Vowpal Wabbit).

If you cannot choose a project, I would suggest the Click Prediction project. In this project, you will predict the click through rate of ads given a query, the ads (link) information, and user information. This was originally proposed in 2012 KDD Cup Track 2. Additional details of the data and task can be found here. The original dataset is divided into 3 parts: training, testing, and maps from feature id to features. The training set has 150M instances, and the testing data has 20M instances. Drs. Carlos Guestrin and Emily Fox from University of Washington have subsampled and simplified this dataset by joining the training and testing data with the feature maps. After spending couple of hours, I noticed that (linear) models that we have covered in class work quite well! Doing this project has additional advantages too. While we covered the basics of advertising in the class, you will be starting as an amateur in the domain and focus a lot on tools and techniques than domain knowledge. Also, this is a cool enough project to put in portfolio and getting some knowledge in this field wouldn't hurt. There are lot of discussion in the internet and there is an entire journal issue for papers that describe what the top-performing teams did. Data can be found here.