CSE 5334: Data Mining

Section 002, Spring 2015

TuTh 2:00PM - 3:20PM, PKH 321

Suggested Reading
Topic | Reading Material
Visualization | Slides are sufficient in general. You must also know how to interpret statistical graphs, from basic plots (bar charts, histograms) to more complex ones (box plots).
Data Collection and Munging | Slides are sufficient.
Exploratory Data Analysis | [DMA] Chapters 2 and 3
Classification Basics | [IDM] Chapter 4.1, 4.2 [Book Link]
Nearest Neighbor | [MMDS] Chapter 12.4; [IIR] Chapter 14.3; [URL1], [URL2]; Section 8 of Top-10 algorithms in data mining
Decision Trees | [IDM] Chapter 4.3 [Book Link]; Sections 1 and 10 of Top-10 algorithms in data mining
Linear Regression | [ISLR] Chapter 3.1, 3.2
Regression Trees | [ISLR] Chapter 8.1
Probability Basics | Bayes Theorem: [URL]; Independence: [URL]; Conditional Independence: [URL]; [Optional] [URL], [URL]
Bayesian Classifiers, Naive Bayes | Chapter 3 from [URL]; Section 9 of Top-10 algorithms in data mining; [IIR] Chapter 13.1-13.4 and [URL]
SVMs and Kernels | [MMDS] Chapter 12.3; Section 2 in Top-10 algorithms in data mining; Kernel Trick: [URL]
Model Evaluation | [IDM] Chapter 4.4 - 4.6 [Book Link]
Clustering: Basics, K-Means and Hierarchical | [IDM] Chapter 8.1, 8.2, 8.3 and 8.5 [Book Link]; Section 2 in Top-10 algorithms in data mining
Frequent Itemsets and Association Rule Mining | [MMDS] Chapter 6.1, 6.2, 6.4.1-6.4.4
Search Engines (Keyword Queries) | [IIR] Chapter 1, 2.1, 2.2, 6.1-6.3
Search Engines (Link Analysis) | [MMDS] Chapter 5.1-5.2
Search Engines (Computational Advertising) | [MMDS] Chapter 8
Recommender Systems | [MMDS] Chapter 9
LSH | [MMDS] Chapter 3.1-3.4, 3.8
Neural Networks | Chapters 1 and 2 from the book Neural Networks and Deep Learning
MapReduce | [MMDS] Chapter 2.1-2.2

Practice Questions


Installation: In this class, we will use Python for all assignments. Please refer to this url for simple instructions on installing Python and the Scientific Python stack. I strongly suggest using the Anaconda distribution. If you want to install the packages manually, use the instructions here. Remember to install the other packages as well, such as BeautifulSoup, Pattern, Seaborn and MrJob; the instructions are in the first url.

IPython Notebook Basics: Please spend some time exploring IPython Notebook. This is a fast, 10-minute video introducing the basics. For now, it is perfectly fine if you do not understand what the Python code being executed means. Once you are done, please read this tutorial, which should take around 30 minutes. We will spend a lot of time in IPython Notebooks, so an initial investment now will save you a lot of time later.

[Optional] This is a great post from Philip Guo about his impressions of IPython Notebook. This is an article from Nature (a prestigious science journal) about how IPython Notebook is becoming popular outside computer science.

Python Basics: Here is a GREAT IPython Notebook that covers the most important ideas in Python. For most of you, it should not take more than 1-2 hours.

[Optional] This IPython Notebook provides a non-technical introduction to the entire Scientific Python stack.

Verification: It is important to have a common baseline so that the code we use produces the same output for everyone. If you have installed all packages correctly, your version numbers should be at least as high as those given below. If not, install or upgrade the necessary package and restart the IPython Notebook/console. (Courtesy: Harvard, CS 109)

# IPython is what you are using now to run the notebook
import IPython
print("IPython version:      %6.6s (need at least 1.0)" % IPython.__version__)

# Numpy is a library for working with arrays
import numpy as np
print("Numpy version:        %6.6s (need at least 1.7.1)" % np.__version__)

# SciPy implements many different numerical algorithms
import scipy as sp
print("SciPy version:        %6.6s (need at least 0.12.0)" % sp.__version__)

# Pandas makes working with data tables easier
import pandas as pd
print("Pandas version:       %6.6s (need at least 0.11.0)" % pd.__version__)

# Matplotlib is the module for plotting
import matplotlib
print("Matplotlib version:   %6.6s (need at least 1.2.1)" % matplotlib.__version__)

# SciKit Learn implements several machine learning algorithms
import sklearn
print("Scikit-Learn version: %6.6s (need at least 0.13.1)" % sklearn.__version__)

# Requests is a library for getting data from the Web
import requests
print("requests version:     %6.6s (need at least 1.2.3)" % requests.__version__)

# BeautifulSoup is a library to parse HTML and XML documents
import bs4
print("BeautifulSoup version:%6.6s (need at least 4.0)" % bs4.__version__)

# MrJob is a library to run map reduce jobs on Amazon's computers
import mrjob
print("Mr Job version:       %6.6s (need at least 0.4)" % mrjob.__version__)

# Pattern has lots of tools for working with data from the internet
import pattern
print("Pattern version:      %6.6s (need at least 2.6)" % pattern.__version__)

# Seaborn is a nice library for visualizations
import seaborn
print("Seaborn version:      %6.6s (need at least 0.3.1)" % seaborn.__version__)
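
The checklist above prints each version so you can compare it by eye. As a minimal sketch that is not part of the original course material, the comparison could also be automated. The helper names `version_tuple`, `meets_minimum` and `check_package` are hypothetical, and the parsing assumes dotted numeric versions, ignoring suffixes such as `rc1`. Comparing integer tuples rather than raw strings matters because, as strings, "1.10.0" sorts before "1.7.1".

```python
import importlib
import re


def version_tuple(version):
    """Turn '1.10.0' or '0.13.1rc2' into a tuple of leading integers."""
    parts = []
    for piece in str(version).split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.0' and '1.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(module_name, required):
    """Import a module by name and report whether its version is sufficient."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # A missing or broken install can fail with more than ImportError.
        return "%s: NOT INSTALLED (need at least %s)" % (module_name, required)
    installed = getattr(module, "__version__", "0")
    status = "OK" if meets_minimum(installed, required) else "TOO OLD"
    return "%s: %s (need at least %s) %s" % (module_name, installed, required, status)


# Minimum versions copied from the checklist above.
REQUIRED = [("IPython", "1.0"), ("numpy", "1.7.1"), ("scipy", "0.12.0"),
            ("pandas", "0.11.0"), ("matplotlib", "1.2.1"), ("sklearn", "0.13.1"),
            ("requests", "1.2.3"), ("bs4", "4.0"), ("mrjob", "0.4"),
            ("pattern", "2.6"), ("seaborn", "0.3.1")]

for name, minimum in REQUIRED:
    print(check_package(name, minimum))
```

Any line marked NOT INSTALLED or TOO OLD points to a package to install or upgrade before restarting the notebook.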

Pandas is one of the most important packages that you will learn to use in this class. You can find a wide variety of tutorials here. Specifically, focus on Lessons 1 to 7 in Lessons for New pandas Users. Each of these lessons should not take more than 5-10 minutes of your time.

[Optional] You might also want to check out Chapters 1-5 in pandas Cookbook
[Optional] If you have some more time to spare, here is a great series of IPython Notebooks for analyzing data using Python. Focus on the first two notebooks, Introduction to Pandas and Data Wrangling with Pandas. Dr. Christopher Fonnesbeck's lecture is also on YouTube in case you prefer a live demonstration.

A number of software packages can be used to generate cool visualizations. Here are high-level pointers to each of them.

  • Matplotlib:

    This is the most important and basic visualization library in Python. Most of the other libraries either extend it or are inspired by it. It is a bit verbose, but gets the job done. There are many tutorials available. You can use either this or this link for a refresher.

    [Optional] In addition, the Matplotlib site itself provides a gallery of examples with code.

  • Seaborn:

    Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. I find that Seaborn needs less code and produces more stylish charts. The Seaborn site also has a lot of good resources.

  • Pandas Visualization:

    Pandas has some fairly sophisticated visualization functions. Often, you will have used Pandas to compute complex objects (aggregations, groupby, pivot, crosstab, etc.) and will want to visualize them. Converting these objects to the format expected by Matplotlib or Seaborn can be a pain. In such cases, you might be better off using the plot function of Pandas directly. It generates decent visualizations and also supports extensive customization. Here are two great resources: Resource 1 and Resource 2.

  • [Optional] D3/Bokeh:

    D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. A lot of the visualizations you see on popular US media sites are done in D3. It is extremely powerful but takes quite some time to get used to. Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. It is very similar to D3.js.

  • [Optional] Tableau Software:

    Tableau Software is a family of interactive data visualization products focused on business intelligence. The software is absolutely fantastic and you can create a lot of cool visualizations in a matter of minutes. The license is quite expensive, but they provide a free 1-year license for students. See further details here. I would encourage you to check it out. It makes exploratory analysis and visualization dramatically easier.
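
To illustrate the Pandas Visualization point in the list above, here is a minimal sketch of plotting an aggregated object directly, without first converting it for Matplotlib or Seaborn. The `sales` data, its column names and the output filename are made up for illustration. The `Agg` backend line is only there so the sketch runs without a display; in an IPython Notebook you would likely drop it and view the chart inline.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import pandas as pd

# Made-up sales records, purely for illustration.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "West", "South", "West"],
    "amount": [120, 80, 60, 200, 40, 100],
})

# Aggregate with groupby, then plot the resulting Series directly.
totals = sales.groupby("region")["amount"].sum()
ax = totals.plot(kind="bar", title="Total amount by region")

# The returned object is a regular Matplotlib Axes, so it can be customized further.
ax.set_ylabel("amount")
ax.figure.savefig("totals_by_region.png")
```

The same `plot` call is available on DataFrames, so the result of a pivot or crosstab can be charted in the same way.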

There are a number of tools for scraping in Python. A comprehensive listing is not feasible. Here is a list of the most important ones.