When submitting, fill in your team details in this cell. Note that this is a markdown cell.
Student 1 Name and ID: Saravanan Thirumuruganathan, 1000XXXXXX
Student 2 Name and ID: Satishkumar Masilamani, 1000XXXXXX
Student 3 Name and ID:
In this assignment, you will learn more about three key steps in a data analytics process.
In the first set of tasks, you will put the various web scraping techniques we learned to use by scraping three popular websites.
In the second and third sets of tasks, you will conduct a guided exploration of the 2012 FEC presidential election dataset. You will learn and use some of the most common exploration/aggregation/descriptive operations. This should also help you learn most of the key functionality in Pandas.
In the third set of tasks (done in conjunction with #2), you will also learn how to use visualization libraries to identify patterns in the data that will help your further analysis. You will also explore the most popular chart types and see how different libraries and styles can make your visualizations more attractive.
Here is the set of packages that you will use extensively in this assignment. Do NOT import any new packages without confirming with the instructor. The objective is to make you identify and use the various functions in the packages imported below. Besides, to my knowledge, the libraries below offer a concise way to achieve the homework tasks anyway.
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
#Array processing
import numpy as np
#Data analysis, wrangling and common exploratory operations
import pandas as pd
from pandas import Series, DataFrame
#A sane way to get web data
import requests
#Packages for web scraping. No need to use both - feel free to use either one.
from pattern import web
from BeautifulSoup import BeautifulSoup
#For visualization. Matplotlib for basic viz and seaborn for more stylish figures + statistical figures not in MPL.
import matplotlib.pyplot as plt
import seaborn as sns
#For some of the date operations
import datetime
pd.set_option('display.notebook_repr_html',True)
from IPython.display import display
In the first series of tasks, we will seek to scrape information from diverse websites. To make your job easier, you can assume that you do not need to do any complex webdriver work. For each task, you will implement a function that accepts a URL and produces output in the format that the question asks for.
NOTE: Make sure that you follow the input and output requirements. The assignment will be evaluated and graded automatically by a script. Any error will result in you getting a score of 0 for that task.
In the first task, we will seek to parse the contents of a table in a Wikipedia page. You will implement a function that accepts two parameters: the URL and the 1-based index of the table to parse. For example, if the URL is http://en.wikipedia.org/wiki/List_of_Test_cricket_records and the table index is 1, then you have to parse the first table ("Team wins, losses and draws"). Your code must produce as output a Pandas data frame whose columns have the same names as the table's columns ('Team', 'First Test match', 'Matches', 'Won', 'Lost', 'Tied', 'Drawn' and '% Won' in this example).
Do NOT worry about the data format of the Pandas data frame. Simply treat each of them as a string.
#Input:
#  url: URL of a Wikipedia page
#  table_id: 1-based index of the table to scrape
#Output:
#  df is a Pandas data frame that contains a tabular representation of the table
#  The columns are named the same as the table columns
#  Each row of df corresponds to a row in the table
def scraping_wikipedia_table(url, table_id):
    website_html = requests.get(url).text
    dom = web.Element(website_html)
    tbls = [t for t in dom.by_tag('table')]
    tbl = tbls[table_id - 1]  #table_id is 1-based; correct the off-by-one
    headerRow = tbl.by_tag("tr")[0]  #simple trick - the first row always gives the header
    #The column titles could be inside th tags or td tags.
    # Check which is the case and use the right filtering
    if len(headerRow.by_tag("th")) > 0:
        heading = [web.plaintext(th.content) for th in headerRow.by_tag('th')]
    else:
        heading = [web.plaintext(td.content) for td in headerRow.by_tag('td')]
    #DataFrame has many constructors. On a whim I am choosing the list of dicts
    arrayOfDict = []
    for row in tbl.by_tag("tr")[1:-1]:  #ignore the last row about the updated date
        thContents = [web.plaintext(th.html) for th in row.by_tag("th")]
        tdContents = [web.plaintext(td.html) for td in row.by_tag("td")]
        contents = thContents + tdContents  #big hack - assumes th, if present, is always the first column
        #Convert contents to a dictionary with the appropriate headings
        rowAsDict = {heading[i]: contents[i] for i in range(len(contents))}
        arrayOfDict.append(rowAsDict)
    df = DataFrame(arrayOfDict, columns=heading)
    return df

display(scraping_wikipedia_table("http://en.wikipedia.org/wiki/List_of_Test_cricket_records", 1))
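If you opted for BeautifulSoup instead of Pattern, the same approach carries over. Here is a minimal, hypothetical sketch using the BeautifulSoup 3 API imported above (findAll plus the findAll(text=True) idiom for extracting cell text); it assumes the same simple table structure as the Pattern version and is not the required solution:
#Hypothetical BeautifulSoup 3 equivalent of the Pattern-based solution above
def scraping_wikipedia_table_bs(url, table_id):
    soup = BeautifulSoup(requests.get(url).text)
    tbl = soup.findAll('table')[table_id - 1]      #table_id is 1-based here too
    rows = tbl.findAll('tr')
    #extract the visible text of a cell (works for both th and td)
    getText = lambda cell: ''.join(cell.findAll(text=True)).strip()
    heading = [getText(c) for c in rows[0].findAll(['th', 'td'])]
    arrayOfDict = []
    for row in rows[1:-1]:                         #ignore the trailing "updated date" row
        contents = [getText(c) for c in row.findAll(['th', 'td'])]
        arrayOfDict.append({heading[i]: contents[i] for i in range(len(contents))})
    return DataFrame(arrayOfDict, columns=heading)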
In the second task, we will seek to scrape search results from Walmart. Specifically, we will focus on the Movies department. Your job is to implement a function that accepts a valid search result URL (such as http://www.walmart.com/search/?query=marvel&cat_id=4096_530598 ), scrapes the 10-15 product listings on that page only, and outputs a Pandas data frame with the following nine fields (with the EXACT names as below): 'Product Title', 'Sale Price', 'Number of Ratings', 'Free Shipping' (boolean), 'Free Store Pickup' (boolean), 'Starring', 'Running', 'Format' and 'Rating'. Missing values should be recorded as NaN.
Make sure that you observe the following input and output requirements:
#Input:
# url: URL of a Walmart results page for a search query in Movies department
#Output:
# df is a Pandas data frame that contains a tabular representation of the results
# The df must have 9 columns that must have the same name and data type as described above
# Each row of df corresponds to a movie in the results table
#Helper: return the plain text of the first element matching the given CSS selector, or NaN if there is no match
def getValuesSafely(domObj, domSearchString):
    results = domObj(domSearchString)
    if len(results) > 0:
        return web.plaintext(results[0].content)
    else:
        return np.nan
def scraping_walmart_movies(url):
    website_html = requests.get(url).text
    dom = web.Element(website_html)
    heading = ['Product Title', 'Sale Price', 'Number of Ratings', 'Free Shipping', 'Free Store Pickup',
               'Starring', 'Running', 'Format', 'Rating']
    arrayOfDict = []
    movies = dom("div.js-tile.tile-landscape")
    for movie in movies:
        productTitle = getValuesSafely(movie, 'h4 > a')
        salesPrice = getValuesSafely(movie, 'div.item-price-container > span.price.price-display')
        numRatings = getValuesSafely(movie, 'span.stars-reviews')
        freeShipping = "shipping on orders over $50" in movie  #Horrible hack
        freeStorePickup = "store pickup" in movie  #Horrible hack
        #The dt/dd pairs hold the remaining details; the dd values appear in the same order as the dt labels
        dl = movie('dl.media-details.dl-horizontal')
        dt = dl[0].by_tag("dt")
        dd = dl[0].by_tag("dd")
        tempOrder = ["Director:", "Starring:", "Running:", "Format:", "Release:", "Rating:"]
        tempDict = {key: np.nan for key in tempOrder}
        index = 0
        for key in tempOrder:
            if key in movie:
                tempDict[key] = web.plaintext(dd[index].content)
                index = index + 1
        movieAsDict = {'Product Title': productTitle, 'Sale Price': salesPrice, 'Number of Ratings': numRatings,
                       'Free Shipping': freeShipping, 'Free Store Pickup': freeStorePickup,
                       'Starring': tempDict["Starring:"], 'Running': tempDict["Running:"],
                       'Format': tempDict["Format:"], 'Rating': tempDict["Rating:"]}
        arrayOfDict.append(movieAsDict)
    df = DataFrame(arrayOfDict, columns=heading)
    return df

display(scraping_walmart_movies('http://www.walmart.com/search/?query=marvel&cat_id=4096_530598'))
In the third task, we will scrape some specific sub-pages of Facebook profiles. To avoid using sophisticated tools like Selenium, for this task you can assume that the input to the function is the DOM of the relevant page. You can then use it to parse the relevant contents.
#Input: dom - DOM of the books page corresponding to an FB account's profile. E.g., DOM of https://www.facebook.com/zuck/books
#Output: An array (Python list) of the books listed in the profile page.
#   Note that this function requires a list as output, not a Pandas data frame
def scraping_facebook_books(dom):
    return [li("a._gx7")[0].title for li in dom("ul.uiList > li")]
#Input: dom - DOM of the groups page corresponding to an FB account's profile. E.g., DOM of https://www.facebook.com/zuck/groups
#Output: A Pandas data frame with one row per group.
#   It must have three columns - 'Group Name', 'Number of Members', 'Group Description'
#   Note that all the information about a group is in the same page (e.g. https://www.facebook.com/zuck/groups)
#   You can collect all data from the same page even if some of it is incomplete (such as the group description)
#   Ensure that the column names are as given above
def scraping_facebook_groups(dom):
    arrayOfDict = []
    for li in dom("ul.uiList > li"):
        groupAsDict = {}
        groupAsDict["Group Name"] = web.plaintext(li("div.mbs.fwb")[0].content)
        groupAsDict["Number of Members"] = web.plaintext(li("div.mbs.fcg")[0].content)
        groupAsDict["Group Description"] = web.plaintext(li("div.mtm > span")[0].content)
        arrayOfDict.append(groupAsDict)
    df = DataFrame(arrayOfDict, columns=["Group Name", "Number of Members", "Group Description"])
    return df
#Input: dom - DOM of the music page corresponding to an FB account's profile. E.g., DOM of https://www.facebook.com/zuck/music
#Output: A Pandas data frame with one row per musician/band.
#   It must have four columns -
#   'Name', 'Type' (e.g. Musician/Band or Bands&Musicians), 'Verified' (boolean: True/False) and 'Profile Url'
#   Note that all the information about a musician/band is in the same page (e.g. https://www.facebook.com/zuck/music)
#   Ensure that the column names are as given above
def scraping_facebook_music(dom):
    arrayOfDict = []
    for li in dom("ul.uiList > li"):
        musicAsDict = {}
        musicAsDict["Name"] = li("a._gx7")[0].title
        musicAsDict["Type"] = web.plaintext(li("div.fsm.fwn.fcg")[0].content)
        musicAsDict["Verified"] = "Verified Page" in li
        musicAsDict["Profile Url"] = li("a._gx7")[0].href
        arrayOfDict.append(musicAsDict)
    df = DataFrame(arrayOfDict, columns=["Name", "Type", "Verified", "Profile Url"])
    return df
#Input: dom - DOM of the movies page corresponding to an FB account's profile. E.g., DOM of https://www.facebook.com/zuck/movies
#Output: A Pandas data frame with one row per movie.
#   It must have the following columns -
#   'Name', 'Type' (e.g. Movie), 'Verified', 'Profile Url' - as before
#   Ensure that the column names are as given above
def scraping_facebook_movies(dom):
    arrayOfDict = []
    for li in dom("ul.uiList > li"):
        movieAsDict = {}
        movieAsDict["Name"] = li("a._gx7")[0].title
        movieAsDict["Type"] = web.plaintext(li("div.fsm.fwn.fcg")[0].content)
        movieAsDict["Verified"] = "Verified Page" in li
        movieAsDict["Profile Url"] = li("a._gx7")[0].href
        arrayOfDict.append(movieAsDict)
    df = DataFrame(arrayOfDict, columns=["Name", "Type", "Verified", "Profile Url"])
    return df
books_html = open("zuckBooks.html").read()
#Uncomment one of these lines based on whether you used Pattern or BeautifulSoup
dom = web.Element(books_html)
#dom = BeautifulSoup(books_html)
print "Books in Zuck's FB:\n", scraping_facebook_books(dom)
group_html = open("zuckGroups.html").read()
dom = web.Element(group_html)
#dom = BeautifulSoup(group_html)
print "\n\nGroups in Zuck's FB:\n"
display(scraping_facebook_groups(dom))
music_html = open("zuckMusic.html").read()
dom = web.Element(music_html)
#dom = BeautifulSoup(music_html)
print "\n\nMusic in Zuck's FB:\n"
display(scraping_facebook_music(dom))
movies_html = open("zuckMovies.html").read()
dom = web.Element(movies_html)
#dom = BeautifulSoup(movies_html)
print "\n\nMovies in Zuck's FB:\n"
display(scraping_facebook_movies(dom))
In this assignment, you will use the FEC dataset. The US Federal Election Commission publishes data on contributions to political campaigns, including contributor names, occupations and employers, addresses, and contribution amounts. Specifically, we will be using the FEC data from the 2012 election between Barack Obama and Mitt Romney. This is widely considered a landmark election, as both sides spent an unprecedented amount of 1 billion dollars each (i.e. more than 6000 crores in INR each). If you are interested, you can download the entire list of contributor details at the FEC site. It is relatively large (150 MB compressed, 1 GB uncompressed). For our experiments, we will use a smaller subset of the data collected (and cleaned) by Wes McKinney (the creator of Pandas). For the download link, see the Assignments page. It is small by most standards (around 1.4 million records, 20 MB compressed, 160 MB uncompressed) but large enough to give a taste of why data mining is compute intensive. Hopefully, this will also give you an appreciation of the power of Pandas/NumPy - you can do really cool stuff with 2-3 lines of code that run in seconds.
Dataset Description: You can find the details (such as meaning of the column names) of the dataset in the FEC website.
While a knowledge of US Elections or FEC Campaign finance rules is not necessary for solving this assignment, you can check out the following links if you are curious.
How to Become the US President: A Step-by-Step Guide
US Election guide: how does the election work?
If you are fascinated with Campaign finance reform (as I am), here are some good links:
Super-PACs and Dark Money: ProPublica’s Guide to the New World of Campaign Finance
Visualization is a key component of exploration. You will produce a number of 1-D and 2-D charts to get some intuition about the data. You can choose either Matplotlib or Seaborn for plotting. The default figures generated by Matplotlib might look a bit ugly, so you might want to try Seaborn to get better figures. The defaults in Seaborn are much saner and sometimes make your life a lot easier. Seaborn has a number of styles - feel free to experiment with them and choose the one you like. We have earmarked 10 points for the aesthetics of your visualizations.
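For instance, a single call near the top of the notebook changes the style of every subsequent figure. A quick sketch (the style and palette names below are standard Seaborn options):
#Try out a few built-in Seaborn styles before settling on one
sns.set_style('whitegrid')   #others include 'darkgrid', 'white', 'dark' and 'ticks'
sns.set_palette('deep')      #one of Seaborn's built-in color palettes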
Almost all the tasks below can be solved purely using Pandas. Most analytic tasks should not require more than 2-3 lines of code (visualization, on the other hand, is a different matter - you might need 5-10 lines per chart unless you use Seaborn). Here is a NON-COMPREHENSIVE list of functions that you might want to know to solve this assignment: agg, apply, argmax, argmin, count, crosstab, cumsum, cut, describe, groupby, head, idxmax, idxmin, info, isin, map, max, min, pivot_table, size, sum, transform, unique, value_counts. A few of them are demonstrated right below.
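To illustrate, here is a hypothetical warm-up on a toy data frame (the variable toy and its contents are made up for this example); the real tasks below follow the same patterns:
#A quick warm-up on made-up data - not part of the graded tasks
toy = DataFrame({'state': ['TX', 'TX', 'CA', 'CA', 'NY'],
                 'amt': [10.0, 250.0, 35.0, 900.0, 60.0]})
print toy['state'].value_counts()                        #frequency of each state
print toy.groupby('state')['amt'].sum()                  #total amount per state
print toy.groupby('state')['amt'].agg(['min', 'max'])    #several aggregates at once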
You can find a set of exploratory analytics tasks below. Ensure that you clearly follow the instructions. The assignment will be graded automatically - so failure to follow them might cause some issues. Also, do NOT rename the functions or variables that the instructor has set. Of course, feel free to create new variables/functions that will help your code.
The FEC data contains information about all presidential candidates. As the sitting president, Barack Obama was the only candidate from the Democratic party. In the Republican party, there was a process called a primary (see the links above) in which a number of candidates competed to be the nominee. Mitt Romney won the Republican primary and competed with Barack Obama in the election, which Obama won.
The Python code below reads the FEC dataset into a Pandas data frame named fec_all. If your machine has less than 2 GB of RAM, then change the function argument low_memory to True. Once the frame is loaded, we remove all negative contributions (cases where the campaign refunded an amount to a contributor for some reason). Finally, we create a new data frame called fec that contains the contributions to Barack Obama and Mitt Romney alone.
For this code to work, the file 'fec_2012_contribution_subset.csv' must be in the same folder as the notebook.
To reduce my typing, I might refer to Obama as BO and Romney as MR in the text below.
#read the csv file into a Pandas data frame
fec_all = pd.read_csv('fec_2012_contribution_subset.csv', low_memory=False)
#We will be doing party wise analysis later. So, we want to associate each candidate with their party
parties = {'Bachmann, Michelle': 'Republican',
'Cain, Herman': 'Republican',
'Gingrich, Newt': 'Republican',
'Huntsman, Jon': 'Republican',
'Johnson, Gary Earl': 'Republican',
'McCotter, Thaddeus G': 'Republican',
'Obama, Barack': 'Democrat',
'Paul, Ron': 'Republican',
'Pawlenty, Timothy': 'Republican',
'Perry, Rick': 'Republican',
"Roemer, Charles E. 'Buddy' III": 'Republican',
'Romney, Mitt': 'Republican',
'Santorum, Rick': 'Republican'}
#create a new column called party that sets the value to the party of the candidate
# The way this line works is as follows:
# 1. fec_all.cand_nm gives a vector (or Series in Pandas terminology)
# 2. For each row, the code looks up the candidate name to the dictionary parties
# 3. If the name of the candidate (cand_nm) is in parties, it returns the value (i.e. Republican or Democrat)
# 4. This whole thing is done for each row and you get another vector as output
# 5. Finally, you create a new column in fec_all called 'party' and assign the vector you got to it.
# 6. All in a single line :)
fec_all['party'] = fec_all.cand_nm.map(parties)
#ignore the refunds
# Get the subset of dataset where contribution amount is positive
fec_all = fec_all[fec_all.contb_receipt_amt > 0]
#fec_all contains details about all presidential candidates.
#fec contains the details about contributions to Barack Obama and Mitt Romney only
#for the rest of the tasks, unless explicitly specified, work on the fec data frame.
fec = fec_all[fec_all.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
Let us start with some easy stuff. As of now, you do not know anything about the dataset. So the first task will be to get some basic sense.
#Task 1a: print the details of the data frame.
# Basically, this should contain information such as the number of rows/columns, the column names, the #non-null values for each column, etc.
# Hint: there is a Pandas function to do this
#replace None with your code
print "Task 1a: Details of FEC data frame are:"
fec.info()   #info() prints directly; putting it inside a print statement would also print a spurious None
#Task 1b: finding the number of rows and columns in the data frame.
#Hint: find the property of the data frame that gives the number of rows and columns
#replace None with your code
t1b_num_rows = fec.shape[0]
t1b_num_cols = fec.shape[1]
print "\n\nTask 1b: #Rows=%s, #Columns=%s" % (t1b_num_rows, t1b_num_cols)
#Task 1c: The only numeric data is 'contb_receipt_amt' which is the amount of contribution.
# Print the descriptive details (min, max, quantiles etc) for 'contb_receipt_amt'
#Hint: as above there is a direct pandas command for it.
#replace None with your code
print "\n\nTask 1c: Descriptive details of contb_receipt_amt is \n", fec.contb_receipt_amt.describe()
#Task 1d: Let us now print the number of unique values for few columns
#Hint: Look for a Pandas function to do this.
t1d_num_uniq_cand_id = fec.cand_id.nunique()
t1d_num_uniq_cand_nm = fec.cand_nm.nunique()
t1d_num_uniq_contbr_city = fec.contbr_city.nunique()
t1d_num_uniq_contbr_st = fec.contbr_st.nunique()
print "\n\nTask 1d: #Uniq cand_id = ", t1d_num_uniq_cand_id
print "Task 1d: #Uniq cand_num = ", t1d_num_uniq_cand_nm
print "Task 1d: #Uniq contbr_city = ", t1d_num_uniq_contbr_city
print "Task 1d: #Uniq contbr_st = ", t1d_num_uniq_contbr_st
In this task, we will perform some very high-level filtering. Pandas has a convenient and powerful syntax for filtering (for example, see above for how I filtered out the negative contributions and the candidates other than Obama and Romney).
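As a quick reminder of the idiom (a hypothetical example - each condition yields a boolean vector, conditions are combined with & or |, and each condition must be parenthesized):
#Hypothetical example: total of all Obama contributions above $100 from California
mask = (fec.cand_nm == 'Obama, Barack') & (fec.contbr_st == 'CA') & (fec.contb_receipt_amt > 100)
print fec[mask].contb_receipt_amt.sum()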
#Task 2a: Let us find out how much in contributions Obama and Romney each received in this dataset
# Remember that this is not the complete amount, as it excludes other sources like PACs, Super PACs and
# spending by party committees.
#Hint: use cand_nm field
t2a_tot_amt_obama = fec.contb_receipt_amt[fec.cand_nm == 'Obama, Barack'].sum()
t2a_tot_amt_romney = fec.contb_receipt_amt[fec.cand_nm == 'Romney, Mitt'].sum()
print "Task 2a: Total Contribution for Obama is %s and for Romney is %s" % (t2a_tot_amt_obama, t2a_tot_amt_romney)
#Task 2b: How much contribution did folks from California, New York and Texas make in total (i.e. to both Obama and Romney)?
#use contbr_st field
t2b_tot_amt_CA = fec.contb_receipt_amt[fec.contbr_st == 'CA'].sum()
t2b_tot_amt_NY = fec.contb_receipt_amt[fec.contbr_st == 'NY'].sum()
t2b_tot_amt_TX = fec.contb_receipt_amt[fec.contbr_st == 'TX'].sum()
print "\n\nTask 2b: Total contributions from California is %s, New York is %s and Texas is %s" % (t2b_tot_amt_CA, t2b_tot_amt_NY, t2b_tot_amt_TX)
#Task 2c: Let us now use multiple filtering criteria
# How much money did folks from Texas give to BO and MR?
# How much money did folks from UT Arlington give to BO and MR?
t2c_tot_contr_tx_bo = fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contbr_st == 'TX')].sum()
t2c_tot_contr_tx_mr = fec.contb_receipt_amt[(fec.cand_nm == 'Romney, Mitt') & (fec.contbr_st == 'TX')].sum()
t2c_tot_contr_uta_bo = fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contbr_employer == 'UT ARLINGTON')
& (fec.contbr_st == 'TX')].sum()
t2c_tot_contr_uta_mr = fec.contb_receipt_amt[(fec.cand_nm == 'Romney, Mitt') & (fec.contbr_employer == 'UT ARLINGTON')
& (fec.contbr_st == 'TX')].sum()
print "\n\nTask 2c: From TX, BO got %s and MR got %s dollars" % (t2c_tot_contr_tx_bo, t2c_tot_contr_tx_mr)
print "Task 2c: From UTA, BO got %s and MR got %s dollars" % (t2c_tot_contr_uta_bo, t2c_tot_contr_uta_mr)
#Task 2d: How much did Engineers from Google give to BO and MR?
# This task is a bit tricky as there are many variations: e.g., SOFTWARE ENGINEER vs ENGINEER and GOOGLE INC. vs GOOGLE
t2d_tot_engr_goog_bo = fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contbr_employer.str.contains("GOOGLE") == True)
& (fec.contbr_occupation.str.contains("ENGINEER") == True)].sum()
t2d_tot_engr_goog_mr = fec.contb_receipt_amt[(fec.cand_nm == 'Romney, Mitt') & (fec.contbr_employer.str.contains("GOOGLE") == True)
& (fec.contbr_occupation.str.contains("ENGINEER") == True)].sum()
print "\n\nTask 2d: From Google Engineers, BO got %s and MR got %s dollars" % (t2d_tot_engr_goog_bo, t2d_tot_engr_goog_mr)
In this task, we will perform some very high-level aggregation. Pandas has some convenient functions for aggregation (do NOT write a for loop - Pandas has very efficient, vectorized code for this).
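As a sketch of the groupby idiom the solutions below rely on, the following two spellings are equivalent ways to compute the total contribution per state:
#Two equivalent spellings of "total contribution per state"
print fec.groupby('contbr_st').contb_receipt_amt.sum()      #group the frame, then pick a column
print fec.contb_receipt_amt.groupby(fec.contbr_st).sum()    #group one column by another column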
#Task 3a: For each state, print the total contribution they made to both candidates.
t3a_state_contr_both = fec.contb_receipt_amt.groupby(fec.contbr_st).sum()
print "Task 3a: Total contribution made to both candidates by each state are:\n", t3a_state_contr_both
#Task 3b: Now let us limit ourselves to TX. For each city in TX, print the total contribution made to both candidates
t3b_tx_city_contr_both = fec.contb_receipt_amt[fec.contbr_st == 'TX'].groupby(fec.contbr_city).sum()
print "\n\nTask 3b: Total contribution made to both candidates by each city in TX are:\n", t3b_tx_city_contr_both
#Task 3c: Now let us zoom into Arlington, TX. For each zipcode in Arlington, print the total contribution made to both candidates
t3c_arlington_contr_both = fec.contb_receipt_amt[(fec.contbr_st == 'TX') & (fec.contbr_city == 'ARLINGTON')].groupby(fec.contbr_zip).sum()
print "\n\nTask 3c: Total contribution made to both candidates by each zipcode in Arlington are:\n", t3c_arlington_contr_both
In this task, you will try to combine aggregation with filtering and then rank the aggregated results. Pandas is often quite clever and might already sort the data for you.
Hint: Pandas has ready-made functions for all of the following.
#Task 4a: Print the number of contributors to Obama in each state.
t4a_num_contr_obama_per_state = fec.contbr_nm[fec.cand_nm == 'Obama, Barack'].groupby(fec.contbr_st).count()
print "Number of contributions to Obama in each state is \n", t4a_num_contr_obama_per_state
#Task 4b: Print the top-10 states (based on number of contributors) that contributed to Obama.
# print both state name and number of contributors
t4b_top10_obama_contr_states = fec[fec.cand_nm == 'Obama, Barack'].contbr_st.value_counts()[:10]
print "\n\nTop-10 states with most contributors to Obama are :\n", t4b_top10_obama_contr_states
#Task 4c: Print the top-20 occupations that contributed overall (to both BO and MR)
t4c_top20_contr_occupation = fec.contbr_occupation.value_counts()[:20]
print "\n\nTop-20 Occupations with most contributors are :\n", t4c_top20_contr_occupation
#Task 4d: Print the top-10 Employers that contributed overall (to both BO and MR)
t4d_top10_contr_employer_all = fec.contbr_employer.value_counts()[:10]
print "\n\nTop-10 Employers with most contributors are :\n", t4d_top10_contr_employer_all
Let us take a break from analytics to do some visualization. Notice that the visualization tasks are NOT ordered by complexity, so some of them might be more challenging than others.
#Task 5a: Draw a "horizontal" bar chart with one bar each for Obama and Romney with the value corresponding to total amount they raised.
# Remember to make the bar chart into a horizontal bar chart
#########################begin code for Task 5a
plt.figure()
figObj5a = fec.contb_receipt_amt.groupby(fec.cand_nm).sum().plot(kind='barh', title='Task 5a',color=['blue','red'])
figObj5a.set_xlabel('Total Amount Raised')
figObj5a.set_ylabel('Candidate Names')
#########################end code for Task 5a
#Task 5b: Draw the "horizontal" bar chart of total NUMBER of contributions made per Candidate.
# ie Candidate name vs number of contributions for that candidate
#########################begin code for Task 5b
plt.figure()
figObj5b = fec.contbr_nm.groupby(fec.cand_nm).agg('count').plot(kind='barh', title='Task 5b',color=['blue','red'])
figObj5b.set_xlabel('#Contributors')
figObj5b.set_ylabel('Candidate Names')
#########################end code for Task 5b
#Task 5c: Draw the "horizontal" bar chart of average contributions made per Candidate.
# ie Candidate Name vs avg contribution
#########################begin code for Task 5c
plt.figure()
figObj5c = fec.contb_receipt_amt.groupby(fec.cand_nm).mean().plot(kind='barh',title='Task 5c',color=['blue','red'])
figObj5c.set_xlabel('Average Value of Contributions')
figObj5c.set_ylabel('Candidate Names')
#########################end code for Task 5c
#Task 5d: Draw a "horizontal" bar chart that lists the top-10 states based on the TOTAL contribution they made to both candidates
# For each state, draw two lines - one in blue for Obama and one in red for Romney
# Display the proportion of the total contribution that came from the state.
# For eg, if Obama made 1 billion and TX gave 100 million of it, the proportion is 10%
# Remember to make the bar chart into a horizontal bar chart
#########################begin code for Task 5d
plt.figure()
t5dTotalContrib = fec.groupby(['cand_nm', 'contbr_st']).contb_receipt_amt.sum().unstack(0).fillna(0)
t5dTotalContrib['Total'] = t5dTotalContrib['Obama, Barack'] + t5dTotalContrib['Romney, Mitt']
t5dPercentContrib = t5dTotalContrib / t5dTotalContrib.sum(axis=0)
t5dTotalContrib['Obama Pct'], t5dTotalContrib['Romney Pct'] = t5dPercentContrib['Obama, Barack'], t5dPercentContrib['Romney, Mitt']
t5dTop10States = t5dTotalContrib.sort('Total', ascending=False)[:10]
figObj5d = t5dTop10States[['Obama Pct', 'Romney Pct']].plot(kind='barh', title='Task 5d', color=['blue','red'])
figObj5d.set_xlabel('Percent Contributions')
figObj5d.set_ylabel('States')
#########################end code for Task 5d
#Task 5e: Now repeat the same process based on Occupation (again top-10)
#########################begin code for Task 5e
plt.figure()
t5eTotalContrib = fec.groupby(['cand_nm', 'contbr_occupation']).contb_receipt_amt.sum().unstack(0).fillna(0)
t5eTotalContrib['Total'] = t5eTotalContrib['Obama, Barack'] + t5eTotalContrib['Romney, Mitt']
t5ePercentContrib = t5eTotalContrib / t5eTotalContrib.sum(axis=0)
t5eTotalContrib['Obama Pct'], t5eTotalContrib['Romney Pct'] = t5ePercentContrib['Obama, Barack'], t5ePercentContrib['Romney, Mitt']
t5eTop10Occupation = t5eTotalContrib.sort('Total', ascending=False)[:10]
figObj5e = t5eTop10Occupation[['Obama Pct', 'Romney Pct']].plot(kind='barh', title='Task 5e', color=['blue','red'])
figObj5e.set_xlabel('Percent Contributions')
figObj5e.set_ylabel('Occupations')
#########################end code for Task 5e
#Task 5f: Now repeat the same process based on Employers (again top-10)
#########################begin code for Task 5f
plt.figure()
t5fTotalContrib = fec.groupby(['cand_nm', 'contbr_employer']).contb_receipt_amt.sum().unstack(0).fillna(0)
t5fTotalContrib['Total'] = t5fTotalContrib['Obama, Barack'] + t5fTotalContrib['Romney, Mitt']
t5fPercentContrib = t5fTotalContrib / t5fTotalContrib.sum(axis=0)
t5fTotalContrib['Obama Pct'], t5fTotalContrib['Romney Pct'] = t5fPercentContrib['Obama, Barack'], t5fPercentContrib['Romney, Mitt']
t5fTop10Employers = t5fTotalContrib.sort('Total', ascending=False)[:10]
figObj5f = t5fTop10Employers[['Obama Pct', 'Romney Pct']].plot(kind='barh', title='Task 5f', color=['blue','red'])
figObj5f.set_xlabel('Percent Contributions')
figObj5f.set_ylabel('Employers')
#########################end code for Task 5f
#Task 5g: Draw a bar chart of the total NUMBER of contributions made per state.
# X-axis: state, Y-axis: number of contributions from that state
#########################begin code for Task 5g
plt.figure(figsize=(16, 8))
figObj5g = fec.contbr_nm.groupby(fec.contbr_st).count().plot(kind = 'bar',title='Task 5g')
figObj5g.set_xlabel('State Name')
figObj5g.set_ylabel('Number of Contributors')
#########################end code for Task 5g
#Task 5h: Draw a histogram of the actual contributions made to Obama. Use 50 bins
#X-axis: contribution amount bin, Y-axis: frequency
#########################begin code for Task 5h
#####Note: hist, by default, tries to create bins that are neatly arranged from min to max.
# However, the contribution amounts are very skewed.
# If you run the command fec.contb_receipt_amt.describe() you can see that 75% of them are below 250 but the max value is 2014490.51
# Hence the default hist will not work here, as most people gave small amounts
# For e.g., only 10 contributions to Obama were above $5000 (you can verify with the line below)
# len(fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contb_receipt_amt > 5000)])
#There are many ways to fix this - including cut and qcut of pandas.
# I am going for a simple solution - which is to ignore the top-10 contributions (above 5000)
# and plot only the rest
plt.figure()
figObj5h = fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contb_receipt_amt <= 5000)].hist(bins = 50)
figObj5h.set_xlabel('Contribution Bins')
figObj5h.set_ylabel('Frequency')
figObj5h.set_title('Task 5h')
#########################end code for Task 5h
#Task 5i: Draw a histogram of the actual contributions made to Romney. Use 50 bins
#X-axis: contribution amount bin, Y-axis: frequency
#########################begin code for Task 5i
plt.figure()
figObj5i = fec.contb_receipt_amt[(fec.cand_nm == 'Romney, Mitt') & (fec.contb_receipt_amt <= 5000)].hist(bins = 50)
figObj5i.set_xlabel('Contribution Bins')
figObj5i.set_ylabel('Frequency')
figObj5i.set_title('Task 5i')
#########################end code for Task 5i
#Harder
#Task 5j: Draw a line chart showing how the campaign contribution of Obama and Romney varied every quarter
#Use blue for Obama and red for Romney
#This is a bit tricky because, you have to convert contribution date to quarter.
#You can either do it on your own or see if Pandas has some function
#########################begin code for Task 5j
#Ignore the warning
plt.figure()
t5jContbDateTime = pd.to_datetime(fec.contb_receipt_dt)
fec.loc[:,'contb_receipt_dt_quarter'] = map(lambda dateTime: dateTime.quarter, t5jContbDateTime)
fec.loc[:,'contb_receipt_dt_year'] = map(lambda dateTime: dateTime.year, t5jContbDateTime)
figObj5j = fec['contb_receipt_amt'][fec.cand_nm=='Obama, Barack'].groupby(
[fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).sum().plot(color='Blue')
fec['contb_receipt_amt'][fec.cand_nm=='Romney, Mitt'].groupby(
[fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).sum().plot(color='Red')
figObj5j.set_title('Task 5j')
figObj5j.set_xlabel('Quarters/Years')
figObj5j.set_ylabel('Per Quarter Contributions')
figObj5j.legend (['Obama', 'Romney'])
#########################end code for Task 5j
#Harder
#Task 5k: Draw a line chart showing the CUMULATIVE campaign contribution of Obama and Romney varied every quarter
# In other words, if Obama made X, Y, Z in first, second and third quarters
# then plot X for first quarter, X+Y for second quarter and X+Y+Z for third quarter.
#Use blue for Obama and red for Romney
#This is a bit tricky because, you have to convert contribution date to quarter.
#You can either do it on your own or see if Pandas has some function
#########################begin code for Task 5k
plt.figure()
figObj5k = fec['contb_receipt_amt'][fec.cand_nm=='Obama, Barack'].groupby(
[fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).sum().cumsum().plot(color='Blue')
fec['contb_receipt_amt'][fec.cand_nm=='Romney, Mitt'].groupby(
[fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).sum().cumsum().plot(color='Red')
figObj5k.set_title('Task 5k')
figObj5k.set_xlabel('Quarters/Years')
figObj5k.set_ylabel('Cumulative Contributions')
figObj5k.legend (['Obama', 'Romney'])
#########################end code for Task 5k
#Tasks 5l and 5m
#Repeat 5j and 5k but do it for NUMBER of contributors
#In other words, 5l plots the number of contributors for Obama and Romney, quarter over quarter
#5m plots cumulative number of contributors quarter over quarter.
#########################begin code for Task 5l
plt.figure()
figObj5l = fec['contb_receipt_amt'][fec.cand_nm=='Obama, Barack'].groupby(
[fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).count().plot(color='Blue')
fec['contb_receipt_amt'][fec.cand_nm=='Romney, Mitt'].groupby(
[fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).count().plot(color='Red')
figObj5l.set_title('Task 5l')
figObj5l.set_xlabel('Quarters/Years')
figObj5l.set_ylabel('#Contributions')
figObj5l.legend (('Obama', 'Romney'))
#########################end code for Task 5l
#########################begin code for Task 5m
plt.figure()
figObj5m = fec['contb_receipt_amt'][fec.cand_nm=='Obama, Barack'].groupby(
[fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).count().cumsum().plot(color='Blue')
fec['contb_receipt_amt'][fec.cand_nm=='Romney, Mitt'].groupby(
[fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).count().cumsum().plot(color='Red')
figObj5m.set_title('Task 5m')
figObj5m.set_xlabel('Quarters/Years')
figObj5m.set_ylabel('Cumulative #Contributions')
figObj5m.legend (('Obama', 'Romney'))
#########################end code for Task 5m
Recall the IMDB dataset that we discussed in class, where we were able to draw a lot more interesting plots. The key challenge here is that most of the attributes have too many values (i.e. the domain cardinality is very large). There are a few ways to work around this issue. We already did something above (i.e. focusing only on the top-k). The other option is discretization, where we create buckets and place the contributions into them. Discretization in Pandas is achieved via the cut function.
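Before applying it to fec, here is a minimal sketch of cut on toy data (mirroring the example in Task 6a below; the toy variables are made up for this illustration):
#Discretize a made-up series and count how many values fall in each bin
toyAmt = Series([2, 6, 16, 18, 120])
toyLabels = pd.cut(toyAmt, [0, 1, 10, 100, 1000])
print toyAmt.groupby(toyLabels).count()   #2 fall in (1, 10], 2 in (10, 100] and 1 in (100, 1000]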
#The following set of tasks is a bit tricky:
# you need multiple commands to achieve them. Specifically, look at cut, groupby and unstack
#Task 6a: Discretize the contributions of Obama and Romney based on the bins given below.
# For example, if Obama got contributions such as 2, 6, 16, 18, 120, then he has
# 0 contribution in (0,1], 2 contributions in (1,10], 2 contributions in (10, 100] and 1 contribution in (100, 1000]
bins = np.array([0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
labels = pd.cut(fec.contb_receipt_amt, bins) # set the variable labels to the output of pd.cut
grouped = fec.groupby(['cand_nm', labels]) #Group the data based on labels and candidate names
#Replace None below in the print statement with grouped.size().unstack(0).
# If your code for labels and grouped is correct, this should print the number of people in each bin per candidate
print "Task 6a:\n", grouped.size().unstack(0)
#Task 6b: In Task 6a, we calculated the COUNT (i.e. the number of contributors in each bin)
# This by itself is not sufficient.
# In this task, let us compute the TOTAL AMOUNT in each bucket.
# Specifically, compute for each candidate the total amount of contributions that each got in each bucket.
# Continuing the example above, Obama's total contribution in each bucket is
# 0 in (0,1], 8 in (1,10], 34 in (10, 100] and 120 in (100, 1000]
#This could be done in 1 line from the variable grouped above
t6b_bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
print "\n\nTask 6b:\n", t6b_bucket_sums
#Task 6c: Even this does not fully specify the disparity in the funding scenario.
# To see this let us now compute the PROPORTION of total contribution in each bucket
# This is called normalization and is a common operation.
# Typically normalization is done over the same candidate.
# But for the point I am trying to make, let us normalize within bucket.
# For example, if obama made X in a bucket and Y in another bucket,
# then normalization will give X/(X+Y) for Obama and Y/(X+Y) for Romney
# This is also quite easy in Pandas and can be done in 1 line
# The key idea is to realize that sum function takes an argument called axis
# that does different things for axis=0 and axis=1 (figure out which does what)
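# (Recall that sum(axis=0) goes down each column - i.e. over all buckets for one candidate -
#  while sum(axis=1) goes across each row - i.e. over both candidates within one bucket.)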
t6c_normed_bucket_sums = t6b_bucket_sums.div(t6b_bucket_sums.sum(axis=1), axis=0)
print "\n\nTask 6c:\n", t6c_normed_bucket_sums
#Once you have done this, uncomment the following line to get a horizontal stacked bar chart
t6c_normed_bucket_sums[:-2].plot(kind='barh', stacked=True, color=['blue','red'])
#Task 6d: Let us go back and try to do the other analysis
# Let us now try to see what PROPORTION of each candidate's amount came from each bucket.
# This is the classical case of normalization.
# Continuing the example above, Obama made 0+8+34+120=162
# We can normalize by dividing each bucket by the total
# For example, 0/162, 8/162, 34/162 and 120/162.
# If you finished t6c, then this is just a matter of playing with the axis values.
t6d_normed_bucket_sums = t6b_bucket_sums.div(t6b_bucket_sums.sum(axis=0),axis=1)
print "\n\nTask 6d:\n", t6d_normed_bucket_sums
#Once you have done this, uncomment the following line to get a horizontal stacked bar chart
t6d_normed_bucket_sums.plot(kind='barh', stacked=True,color=['blue','red'])
Using the tool of discretization, we were able to analyze how much people contributed in each bucket. While it showed some sense of imbalance, it did not show it clearly. Recently, the concept of income inequality has become a rallying cry among liberals. Let us see if there is any inequality in the contributions (spoiler alert: yes!). We may find something interesting not by analyzing the total amount, but by looking at the amount contributed by the top-X% of contributors.
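The core computation is straightforward; here is a hypothetical sketch on a toy series (the actual solution below additionally aggregates the amounts per unique contributor before sorting):
#Fraction of the total given by the top 20% of contributions in a made-up series
toyContrib = Series([1000.0, 500.0, 50.0, 30.0, 20.0]).order(ascending=False)
topN = int(0.20 * len(toyContrib))                        #top 20% = 1 contribution here
print toyContrib.iloc[:topN].sum() / toyContrib.sum()     #1000/1600 = 0.625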
#Task 7a: Write two functions: one for Obama and one for Romney that does the following:
# Given a value of N (N can be 1, 2, 5, 10 etc), sort the contributions made to the candidate in decreasing order
# Then find how much contribution the top-N% made
# Then compute the fraction to the overall campaign collection
# For example, if Obama collected 1 billion dollars and the top-1% gave 100 million then their contribution is 10%
def t7a_helper_contributions_by_top_N_pct(candidate_name, N):
    #Aggregate the amounts per unique contributor (identified by name, occupation, employer, zip, city and state)
    # so that a person with many small contributions is counted once, then sort in decreasing order
    pivotTblUniqContr = pd.pivot_table(fec[fec.cand_nm == candidate_name],
            values='contb_receipt_amt',
            index=['contbr_nm', 'contbr_occupation', 'contbr_employer', 'contbr_zip', 'contbr_city', 'contbr_st'],
            aggfunc=np.sum).order(ascending=False)
    totalContrib = pivotTblUniqContr.sum()
    numContrib = pivotTblUniqContr.count()
    numTopN = int(N / 100.0 * numContrib)   #how many contributors constitute the top-N%
    topNContrib = pivotTblUniqContr.iloc[:numTopN].sum()
    return topNContrib / totalContrib

def t7a_contributions_by_top_N_pct_obama(N):
    return t7a_helper_contributions_by_top_N_pct('Obama, Barack', N)

def t7a_contributions_by_top_N_pct_romney(N):
    return t7a_helper_contributions_by_top_N_pct('Romney, Mitt', N)

for N in [1, 2, 5, 10, 20]:
    print "N=%s, Obama proportion=%s and Romney proportion = %s" % (N,
        t7a_contributions_by_top_N_pct_obama(N), t7a_contributions_by_top_N_pct_romney(N))
#Task 7b: Now let us see who these people in 1% are
# Compute the top-1% based on total contribution (ie. money they gave to Obama + Romney)
# Now let us see some information about them.
# For each of the folks in 1%, print the following:
# name, state, occupation, employer, total amount they gave to Obama, total amount they gave to Romney
contributors_and_contrib = fec.pivot_table(values= 'contb_receipt_amt',
index= ['contbr_nm', 'contbr_occupation', 'contbr_employer', 'contbr_city', 'contbr_st'],
columns='cand_nm', aggfunc=np.sum, fill_value=0)
num1Percenters = int(0.01 * len(contributors_and_contrib))
contributors_and_contrib["Total"] = contributors_and_contrib["Obama, Barack"] + contributors_and_contrib["Romney, Mitt"]
contributors_ordered_by_contrib = contributors_and_contrib.sort(["Total"], ascending = False)
t7b_1pcters = contributors_ordered_by_contrib.iloc[:num1Percenters]
display(t7b_1pcters)
It has been argued that political polarization has dramatically increased in the USA over the last 10-15 years. While there are many ways to analyze the dataset for polarization, let us use some simple measures.
#Task 8a: One way to measure polarity is to see how different some distributions are.
# For both Obama and Romney, print the top-10 states based on their total AMOUNT of contributions
# Do you see a lot of common states?
t8a_top_10_states_obama = fec.contb_receipt_amt[fec.cand_nm == 'Obama, Barack'].groupby(fec.contbr_st).sum().order(ascending=False)[:10]
t8a_top_10_states_romney = fec.contb_receipt_amt[fec.cand_nm == 'Romney, Mitt'].groupby(fec.contbr_st).sum().order(ascending=False)[:10]
print "\n\nT8a: Top 10 states for Obama:\n",t8a_top_10_states_obama
print "\n\nT8a: Top 10 states for Romney:\n",t8a_top_10_states_romney
#Task 8b: For both Obama and Romney, print the top-10 occupations based on their total AMOUNT of contributions
# Do you see a lot of common occupations?
t8b_top_10_occu_obama = fec['contb_receipt_amt'][fec.cand_nm == 'Obama, Barack'].groupby(fec.contbr_occupation).sum().order(ascending=False)[:10]
t8b_top_10_occu_romney = fec['contb_receipt_amt'][fec.cand_nm == 'Romney, Mitt'].groupby(fec.contbr_occupation).sum().order(ascending=False)[:10]
print "\n\nT8b: Top 10 Occupation for Obama:\n", t8b_top_10_occu_obama
print "\n\nT8b: Top 10 Occupation for Romney:\n", t8b_top_10_occu_romney
#Task 8c: For both Obama and Romney, print the top-10 employers based on their total AMOUNT of contributions
# Do you see a lot of common employers?
t8c_top_10_emp_obama = fec['contb_receipt_amt'][fec.cand_nm == 'Obama, Barack'].groupby(fec.contbr_employer).sum().order(ascending=False)[:10]
t8c_top_10_emp_romney = fec['contb_receipt_amt'][fec.cand_nm == 'Romney, Mitt'].groupby(fec.contbr_employer).sum().order(ascending=False)[:10]
print "\n\nT8b: Top 10 Employers for Obama:\n", t8c_top_10_emp_obama
print "\n\nT8b: Top 10 Employers for Romney:\n", t8c_top_10_emp_romney
#Harder
#Task 8d: Here is another way to compute polarization
# Find the top-1000 contributors based on their TOTAL contribution (to both Obama and Romney)
# For each of the top-1000 folks count the number of people who donated to both, to Obama only and to Romney only
top1000Contributors = fec.pivot_table(values='contb_receipt_amt',
index=['contbr_nm','contbr_employer','contbr_st','cand_nm'],
aggfunc = np.sum).order(ascending = False).head(1000).unstack().fillna(0)
t8d_top_1000_both = len(top1000Contributors[ (top1000Contributors["Obama, Barack"] > 0) & (top1000Contributors["Romney, Mitt"] > 0)])
t8d_top_1000_BO_only = len(top1000Contributors[ (top1000Contributors["Obama, Barack"] > 0) & (top1000Contributors["Romney, Mitt"] == 0)])
t8d_top_1000_MR_only = len(top1000Contributors[ (top1000Contributors["Obama, Barack"] == 0) & (top1000Contributors["Romney, Mitt"] > 0)])
print "\n\nT8c:Both, Obama only, Romney only = ", t8d_top_1000_both, t8d_top_1000_BO_only, t8d_top_1000_MR_only
#Harder:
#Task 8e: Here is yet another way
# For each state, compute what fraction of the amount went to Obama and what fraction went to Romney
# If there is no polarization, then both will get more or less equal amounts.
# If there is polarization, then the amounts will be skewed.
# Let us use a very crude measure to compute polarization
# If X is what Obama got from a state and Y is what Romney got, compute the value of max(X,Y) / (min(X,Y) + 1)
# For each state compute this value and sort the results in descending order.
# Do you see any pattern?
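# For example, a state that gave Obama $900,000 and Romney $100,000 has polarity 900000/(100000 + 1),
#  i.e. roughly 9, while a state whose money was split evenly has polarity close to 1.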
t8e = fec.pivot_table(values = 'contb_receipt_amt', index = ['contbr_st'], columns = ['cand_nm'], aggfunc = np.sum, fill_value=0)
t8e["polarity"] = t8e.apply(
lambda row: max(row["Obama, Barack"], row["Romney, Mitt"]) / (min(row["Obama, Barack"], row["Romney, Mitt"])+1),
axis=1)
t8e_state_contr_ranked_by_polarity = t8e.sort(["polarity"], ascending=False)
print "\n\nt8e:States ordered by polarity \n", t8e_state_contr_ranked_by_polarity
#Harder:
#Task 8f: Repeat the above analysis for occupation
# However, instead of taking all occupations, let us only take the top-50 occupations based on TOTAL contributions made
# For each occupation compute this value, sort the results in descending order and display them
# Do you see any pattern?
t8f = fec.pivot_table(values = 'contb_receipt_amt', index = ['contbr_occupation'], columns = ['cand_nm'], aggfunc = np.sum, fill_value=0)
t8f['Total'] = t8f['Obama, Barack'] + t8f['Romney, Mitt']
t8f = t8f.sort(['Total'], ascending=False)[:50]
t8f["polarity"] = t8f.apply(
lambda row: max(row["Obama, Barack"], row["Romney, Mitt"]) / (min(row["Obama, Barack"], row["Romney, Mitt"])+1),
axis=1)
t8f_occu_contr_ranked_by_polarity = t8f.sort(["polarity"], ascending=False)
print "\n\nt8f: Occupation ordered by polarity:\n", t8f_occu_contr_ranked_by_polarity
#Harder:
#Task 8g: A known variable of polarization is based on where a person lives.
# At the risk of too much generalization, liberals dominate cities while conservatives dominate rural areas
# Let us see if this holds in Texas.
# Texas is a known solid red (i.e. conservative) state.
# For each city in Texas, compute the polarity and order them by polarity.
# Do you see any pattern?
t8g = fec[fec.contbr_st == 'TX'].pivot_table(values = 'contb_receipt_amt',
index = ['contbr_city'], columns=['cand_nm'], aggfunc=np.sum, fill_value=0)
t8g['Total'] = t8g['Obama, Barack'] + t8g['Romney, Mitt']
t8g["polarity"] = t8g.apply(
lambda row: max(row["Obama, Barack"], row["Romney, Mitt"]) / (min(row["Obama, Barack"], row["Romney, Mitt"])+1),
axis=1)
t8g_tx_city_contr_ranked_by_polarity = t8g.sort(['polarity'], ascending=False)
print "\n\nt8f: Texas cities ordered by polarity:\n", t8g_tx_city_contr_ranked_by_polarity