Programming Assignment I: Exploratory Analysis over 2012 FEC Presidential Election Dataset

Team Details

When submitting, fill your team details in this cell. Note that this is a markdown cell.

Student 1 Name and ID: Saravanan Thirumuruganathan, 1000XXXXXX

Student 2 Name and ID: Satishkumar Masilamani, 1000XXXXXX

Student 3 Name and ID:

Assignment Details

In this assignment, you will be learn more about three key steps in a data analytics process.

1. Data Collection

In the first set of tasks, you will put to use the various techniques we learnt in web scraping and use it to scrape three popular websites.

2. Exploratory Analysis

In the second and third set of tasks, you will conduct a guided exploration over 2012 FEC Presidential Election Dataset. You will learn and use some of the most common exploration/aggregation/descriptive operations. This should also help you learn most of the key functionalities in Pandas.

3. Visualization

In the third set of tasks (that is done in conjunction with #2), you will also learn how to use visualization libraries to identify patterns in data that will help in your further data analysis. You will also explore most popular chart types and how to use different libraries and styles to make your visualizations more attractive.

Python Packages

Here are the set of packages that you will use extensively in this assignment. Do NOT import any new packages without confirming with the instructor. The objective is to make you identify and use the various functions in the packages imported below. Besides, to my knowledge, the libraries below do offer a concise way to achieve the homework tasks anyway.

In [21]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

#Array processing
import numpy as np
#Data analysis, wrangling and common exploratory operations
import pandas as pd
from pandas import Series, DataFrame

#A sane way to get web data
import requests

#Packages for web scraping. No need to use both. Feel free to use one of them.
from pattern import web
from BeautifulSoup import BeautifulSoup

#For visualization. Matplotlib for basic viz and seaborn for more stylish figures + statistical figures not in MPL.
import matplotlib.pyplot as plt
import seaborn as sns

#For some of the date operations
import datetime


pd.set_option('display.notebook_repr_html',True)
from IPython.display import display

Part 1: Web Scraping

In the first series of tasks, we will seek to scrape information from diverse websites. To make your job easier, you can assume that you do not need to do any complex webdriver stuff. For each task, you will implement a function that accepts an url and produces output in the format that the question asks.

NOTE: Make sure that you follow the input and output requirements. The assignment will be evaluated and graded automatically by a script. Any error will result in you getting a score of 0 for that task.

(5 points) Website 1: Wikipedia

In the first task, we will seek to parse contents from a table in a Wikipedia page. You will implement a function that accepts two parameters. The first is the url and the second is the name of the table. For example, if the url was http://en.wikipedia.org/wiki/List_of_Test_cricket_records and the table name was "Team wins, losses and draws", then you have to parse the first table. Your code must produce as an output a Pandas data frame where the columns have same name as the table ('Team', 'First Test match', 'Matches', 'Won', 'Lost', 'Tied', 'Drawn' and '% Won') in this example.

Do NOT worry about the data format of the Pandas data frame. Simply treat each of them as a string.

In [22]:
#Input:
#    url: URL of a wikipedia page
#    table_id: Index of the table to scrape
#Output:
#    df is a Pandas data frame that contains a tabular representation of the table
#    The columns are named the same as the table columns
#    Each row of df corresponds to a row in the table
def scraping_wikipedia_table(url, table_id):
    website_html = requests.get(url).text
    dom = web.Element(website_html)
    tbls = [t for t in dom.by_tag('table')]
    tbl = tbls[table_id-1]  #correct the off by one error
    headerRow = tbl.by_tag("tr")[0] #simple trick - first row always gives the header
    #the title could be inside a th tag or td tag
    # use the in operator to determine which is the case and use the right filtering
    if len(headerRow.by_tag("th")) > 0:
        heading = [web.plaintext(th.content) for th in headerRow.by_tag('th')]
    else:
        heading = [web.plaintext(td.content) for td in headerRow.by_tag('td')]

    #DataFrame has many constructors. On a whim I am choosing the list of dicts
    arrayOfDict = [] 
    for row in tbl.by_tag("tr")[1:-1] : #ignore the last row about updated date 
        thContents = [web.plaintext(td.html) for td in row.by_tag("th")]
        tdContents = [web.plaintext(td.html) for td in row.by_tag("td")]
        contents = thContents + tdContents #big time hack - assumes th, if present is always the first column
        #Convert contents to a dictionary with appropriate headings
        rowAsDict = {heading[i] : contents[i] for i in range(len(contents))}
        arrayOfDict.append(rowAsDict)
    df = DataFrame(arrayOfDict,columns=heading)
    return df

print display(scraping_wikipedia_table("http://en.wikipedia.org/wiki/List_of_Test_cricket_records", 1))
Team First Test match Matches Won Lost Tied Drawn % Won
0 Australia 01877-03-15-000015 March 1877 773 362 205 2 204 46.83
1 Bangladesh 02000-11-10-000010 November 2000 88 7 70 0 11 7.95
2 England 01877-03-15-000015 March 1877 952 339 275 0 338 35.60
3 India 01932-06-25-000025 June 1932 487 122 156 1 208 25.05
4 New Zealand 01930-01-10-000010 January 1930 399 80 160 0 159 20.05
5 Pakistan 01952-10-16-000016 October 1952 387 121 110 0 156 31.26
6 South Africa 01889-03-12-000012 March 1889 390 144 129 0 117 36.92
7 Sri Lanka 01982-02-17-000017 February 1982 235 71 84 0 80 30.21
8 West Indies 01928-06-23-000023 June 1928 503 163 170 1 169 32.40
9 Zimbabwe 01992-10-18-000018 October 1992 97 11 60 0 26 11.34
10 ICC World XI 02005-10-14-000014 October 2005 1 0 1 0 0 0.00
None

(5 points) Website 2: Walmart

In the second task, we will seek to scrape the results of Walmart. Specifically, we will focus on Movies department. Your job is to implement a function that accepts a valid search result url (such as http://www.walmart.com/search/?query=marvel&cat_id=4096_530598 ), scrapes the 10-15 product listings in that url only and outputs a Pandas data frame with the following fields (with the EXACT names and data types as below):

  • Product Title: String. Title of the product (such as Samsung Red WB1100F Smart Camera with 16.2 Megapixels and 35x Optical Zoom)
  • Sale Price: Float. Price of the product (such as 249.99). This is the price that is highlighted in blue typically. Ignore the dollar sign and return the value only
  • Number of Ratings: Integer. The number of ratings that users have provided for the product. This is the number associated with the star.
  • Free Shipping: Boolean. Is True if the product has free shipping above $50 and false otherwise.
  • Free Store Pickup: Boolean. Is True if the product has free store pickup and false otherwise.
  • Starring: String. Contains the name of starring actors. Convert it to a simple string such as "Chris Pratt Zoe Saldana Vin Diesel" is okay. No need to parse it further.
  • Running: Integer. The running time of the movie. Only the integer value is required (ie. NNNN minutes then return only NNNN).
  • Format: String. Values such as Widescreen. Beware, some entries might not have this value.
  • Rating: String. MPAA rating.

Make sure that you observe the following:

  • If a data type is specified, then your data should be in that format. For eg, if a field is int, then ensure that it has an integer value.
  • Your code should not crash if some product did not have a valid value such as price. Instead you must fill it with NA (see Pandas tutorial for what NA is).
In [23]:
#Input:
#    url: URL of a Walmart results page for a search query in Movies department
#Output:
#    df is a Pandas data frame that contains a tabular representation of the results
#    The df must have 9 columns that must have the same name and data type as described above
#    Each row of df corresponds to a movie in the results table

def getValuesSafely(domObj, domSearchString):
    results = domObj(domSearchString)
    if len(results) > 0:
        return web.plaintext(results[0].content)
    else:
        return np.nan
    
def scraping_walmart_movies(url):
    website_html = requests.get(url).text
    dom = web.Element(website_html)
    heading = ['Product Title', 'Sale Price', 'Number of Ratings', 'Free Shipping', 'Free Store Pickup', 
                'Starring', 'Running', 'Format', 'Rating']

    arrayOfDict = []

    movies = dom("div.js-tile.tile-landscape")
    for movie in movies:
        productTitle = getValuesSafely(movie, 'h4 > a')
        salesPrice = getValuesSafely(movie, 'div.item-price-container > span.price.price-display')
        numRatings = getValuesSafely(movie,'span.stars-reviews')
        freeShipping = "shipping on orders over $50" in movie #Horrible hack
        freeStorePickup = "store pickup" in movie #Horrible Hack
        dl = movie('dl.media-details.dl-horizontal')
        dt = dl[0].by_tag("dt")
        dd = dl[0].by_tag("dd")

        tempOrder = ["Director:", "Starring:", "Running:", "Format:", "Release:", "Rating:"]
        tempDict = {key : np.nan for key in tempOrder} 
        index = 0
        for key in tempOrder:
            if key in movie:
                tempDict[key] = web.plaintext(dd[index].content)
                index = index + 1
                
        movieAsDict = {'Product Title': productTitle, 'Sale Price': salesPrice, 'Number of Ratings': numRatings, 
                       'Free Shipping':freeShipping,  'Free Store Pickup':freeStorePickup, 
                       'Starring': tempDict["Starring:"], 'Running': tempDict["Running:"], 
                       'Format': tempDict["Format:"], 'Rating': tempDict["Rating:"]}
        arrayOfDict.append(movieAsDict)
        

    df = DataFrame(arrayOfDict,columns=heading)
    return df 

print display(scraping_walmart_movies('http://www.walmart.com/search/?query=marvel&cat_id=4096_530598'))
Product Title Sale Price Number of Ratings Free Shipping Free Store Pickup Starring Running Format Rating
0 Marvel Heroes: Collection $33.98 (1)\nratings True True * Rebecca Romijn\n* Hugh Jackman\n* Chris Evan... 856 minutes NaN PG-13
1 Marvel's The Avengers (Widescreen) $19.96 (68)\nratings True True * Robert Downey Jr.\n* Chris Evans\n* Mark Ruf... 145 minutes Widescreen PG-13
2 Marvel: Guardians Of The Galaxy (Widescreen) $19.96 (44)\nratings True True * Chris Pratt\n* Zoe Saldana\n* Vin Diesel\n* ... 121 minutes Widescreen PG-13
3 Marvel Knights: Wolverine Versus Sabretooth - ... $13.47 NaN False False * Jeph Loeb\n* Simone Bianchi 44 minutes Widescreen Not Rated
4 Marvel's The Avengers (DVD + Blu-ray) (Widescr... $29.96 (5)\nratings True True * Robert Downey Jr.\n* Chris Evans\n* Mark Ruf... 145 minutes Widescreen PG-13
5 Marvel Complete Giftset (Widescreen) $26.96 NaN True True * Justin Gross\n* Grey Delisle\n* Michael Mass... 640 minutes Widescreen PG-13
6 Elektra / Fantastic Four / Daredevil (Director... $13.99 NaN True True * Jessica Alba\n* Chris Evans\n* Jennifer Garn... NaN NaN Not Rated
7 The Punisher (Extended Cut) (Widescreen) $7.00 (4)\nratings True True * John Travolta\n* Thomas Jane\n* Ben Foster\n... 140 minutes Widescreen R
8 Spider-Man 2 (Widescreen) $12.96 (3)\nratings True True * Tobey Maguire\n* Kirsten Dunst\n* James Fran... 128 minutes Widescreen PG-13
9 Superheroes Collection: The Incredible Hulk Re... $13.47 NaN True True * Bill Bixby\n* Jack Colvin\n* Lou Ferrigno\n*... 247 minutes Widescreen Not Rated
10 Marvel: Iron Man & Hulk - Heroes United (Wides... $19.96 (1)\nratings True True * Adrian Pasdar\n* Fred Tatasciore\n* David Kaye 72 minutes Widescreen PG
11 Captain America: The Winter Soldier (Widescreen) $19.96 (19)\nratings True True * Chris Evans\n* Scarlett Johansson\n* Anthony... 136 minutes Widescreen PG-13
12 Iron Man 3 (DVD + Digital Copy) (Widescreen) $19.96 (26)\nratings True True * Robert Downey Jr.\n* Gwyneth Paltrow\n* Don ... 135 minutes Widescreen PG-13
13 Thor: The Dark World (Widescreen) $19.96 (22)\nratings True True * Chris Hemsworth\n* Natalie Portman\n* Tom Hi... 112 minutes Widescreen PG-13
14 Spider-Man (2-Disc) (Special Edition) (Widescr... $12.96 (11)\nratings True True * Tobey Maguire\n* Willem Dafoe\n* Kirsten Dun... 121 minutes Widescreen PG-13
15 Spider-Man / Spider-Man 2 / Spider-Man 3 (Wide... $25.46 NaN False True * Tobey Maguire\n* Kirsten Dunst\n* James Fran... 523 minutes Widescreen PG-13
16 DC Showcase: Superman / Shazam!: The Return Of... $9.07 (10)\nratings True True NaN 63 minutes NaN PG-13
17 The Next Avengers: Heroes Of Tomorrow (Widescr... $5.00 (9)\nratings True True * Noah Crawford\n* Nicole Oliver\n* Shawn MacD... 78 minutes Widescreen PG
18 Ultimate Avengers Movie Collection (Widescreen) $11.17 (5)\nratings True True NaN 222 minutes Widescreen PG-13
19 Ultimate Avengers: The Movie (Widescreen) $5.39 (7)\nratings True True * Justin Gross\n* Grey DeLisle\n* Michael Mass... 71 minutes Widescreen PG-13
None

(2.5 * 4 = 10 points) Website 3: Facebook

In the third task, we will scrape some specific sub-pages of Facebook profiles. In order to avoid using sophisticated tools like Selenium, for this task you can assume that the input to the function is the DOM of the relevant page. You can then use it to parse relevant contents.

In [24]:
#Input: dom - DOM of the books page corresponding to an FB account's profile. Eg, DOM of https://www.facebook.com/zuck/books
#Output: An array (Python list) of books listed in the profile page. 
#    Note that this function requires a list as an output not a Pandas data frame
def scraping_facebook_books(dom):
      return [li("a._gx7")[0].title for li in dom("ul.uiList > li")]
    
#Input: dom - DOM of the groups page corresponding to an FB account's profile. Eg, DOM of https://www.facebook.com/zuck/groups
#Output: A Pandas data frame with one row per group. 
#    It must have three columns - 'Group Name', 'Number of Members', 'Group Description'
#    Note that all information of a group is in the same page (eg. https://www.facebook.com/zuck/groups)
#    You can collect all data from same page even if they are incomplete (such as group description)
#    Ensure that the column names as given above
def scraping_facebook_groups(dom):
    arrayOfDict = []
    for li in dom("ul.uiList > li"):
        groupAsDict = {}
        groupAsDict["Group Name"] = web.plaintext(li("div.mbs.fwb")[0].content)
        groupAsDict["Number of Members"] = web.plaintext(li("div.mbs.fcg")[0].content)
        groupAsDict["Group Description"] = web.plaintext(li("div.mtm > span")[0].content)
        arrayOfDict.append(groupAsDict)
    df = DataFrame(arrayOfDict,columns=["Group Name", "Number of Members", "Group Description"])
    return df

#Input: dom - DOM of the music page corresponding to an FB account's profile. Eg, DOM of https://www.facebook.com/zuck/music
#Output: A Pandas data frame with one row per group. 
#    It must have four columns 
#    'Name', 'Type' (eg. Musician/Band or Bands&Musicians) and 'Verified' (boolean: True/False), 'Profile Url'
#    Note that all information of a group is in the same page (eg. https://www.facebook.com/zuck/music)
#    Ensure that the column names as given above
def scraping_facebook_music(dom):
    arrayOfDict = []
    for li in dom("ul.uiList > li"):
        musicAsDict = {}
        musicAsDict["Name"] = li("a._gx7")[0].title
        musicAsDict["Type"] = web.plaintext(li("div.fsm.fwn.fcg")[0].content)
        musicAsDict["Verified"] = "Verified Page" in li
        musicAsDict["Profile Url"] =  li("a._gx7")[0].href
        arrayOfDict.append(musicAsDict)
    df = DataFrame(arrayOfDict,columns=["Name", "Type", "Verified", "Profile Url"])
    return df

#Input: dom - DOM of the music page corresponding to an FB account's profile. Eg, DOM of https://www.facebook.com/zuck/movies
#Output: A Pandas data frame with one row per group. 
#    It must have following columns - 
#        'Name', 'Type' (eg. Movie), 'Verified', 'Profile Url' - as before
#    Ensure that the column names as given above
def scraping_facebook_movies(dom):
    arrayOfDict = []
    for li in dom("ul.uiList > li"):
        movieAsDict = {}
        movieAsDict["Name"] = li("a._gx7")[0].title
        movieAsDict["Type"] = web.plaintext(li("div.fsm.fwn.fcg")[0].content)
        movieAsDict["Verified"] = "Verified Page" in li
        movieAsDict["Profile Url"] =  li("a._gx7")[0].href
        arrayOfDict.append(movieAsDict)
    df = DataFrame(arrayOfDict,columns=["Name", "Type", "Verified", "Profile Url"])
    return df

books_html = open("zuckBooks.html").read()
#Uncomment one of these lines based on whether you used Pattern or BeautifulSoup
dom = web.Element(books_html)
#dom = BeautifulSoup(books_html)
print "Books in Zuck's FB:\n", scraping_facebook_books(dom)

group_html = open("zuckGroups.html").read()
dom = web.Element(group_html)
#dom = BeautifulSoup(group_html)
print "\n\nGroups in Zuck's FB:\n"
display(scraping_facebook_groups(dom))


music_html = open("zuckMusic.html").read()
dom = web.Element(music_html)
#dom = BeautifulSoup(music_html)
print "\n\nMusic in Zuck's FB:\n"
display(scraping_facebook_music(dom))


movies_html = open("zuckMovies.html").read()
dom = web.Element(movies_html)
#dom = BeautifulSoup(movies_html)
print "\n\nMovies in Zuck's FB:\n"
display(scraping_facebook_movies(dom))
Books in Zuck's FB:
[u'Gang Leader for a Day: A Rogue Sociologist Takes to the Streets', u'The Better Angels of Our Nature', u'American Lion: Andrew Jackson in the White House', u"The House of Rothschild: Volume 1: Money's Prophets: 1798-1848", u'Open: An Autobiography', u'The Information: A History, a Theory, a Flood', u"Surely You're Joking, Mr. Feynman!", u'The Idea Factory: Bell Labs and the Great Age of American Innovation', u'Decoded', u'Einstein: His Life and Universe', u'Physics of the Impossible', u"Ender's Shadow", u'The Giving Tree', u"Ender's Game"]


Groups in Zuck's FB:


Group Name Number of Members Group Description
0 Pulis 30 members This is a group for our favorite Hungarian bre...


Music in Zuck's FB:


Name Type Verified Profile Url
0 Imagine Dragons Musician/Band True https://www.facebook.com/ImagineDragons?ref=pr...
1 Beyoncé Musician/Band True https://www.facebook.com/beyonce?ref=profile
2 Rihanna Artist True https://www.facebook.com/rihanna?ref=profile
3 The All-American Rejects Musician/Band True https://www.facebook.com/AllAmericanRejects?re...
4 LCD Soundsystem Musician/Band True https://www.facebook.com/lcdsoundsystem?ref=pr...
5 Daft Punk Musician/Band True https://www.facebook.com/daftpunk?ref=profile
6 Linkin Park Musician/Band True https://www.facebook.com/linkinPark?ref=profile
7 Ingrid Michaelson Musician/Band True https://www.facebook.com/ingridmichaelson?ref=...
8 Radiohead Musician/Band True https://www.facebook.com/radiohead?ref=profile
9 Shakira Musician/Band True https://www.facebook.com/shakira?ref=profile
10 Nirvana Musician/Band True https://www.facebook.com/Nirvana?ref=profile
11 JAY Z Musician/Band True https://www.facebook.com/JayZ?ref=profile
12 The Killers Musician/Band True https://www.facebook.com/Thekillers?ref=profile
13 Green Day Musician/Band True https://www.facebook.com/GreenDay?ref=profile
14 Taylor Swift Musician/Band True https://www.facebook.com/TaylorSwift?ref=profile
15 Lady Gaga Musician/Band True https://www.facebook.com/ladygaga?ref=profile
16 Katy Perry Musician/Band True https://www.facebook.com/katyperry?ref=profile
17 U2 Musician/Band True https://www.facebook.com/u2?ref=profile
18 John Mayer Musician/Band True https://www.facebook.com/johnmayer?ref=profile


Movies in Zuck's FB:


Name Type Verified Profile Url
0 Hero Movie True https://www.facebook.com/HeroMovie?ref=profile
1 Django Unchained Movie True https://www.facebook.com/unchainedmovie?ref=pr...
2 MoneyBall Movie True https://www.facebook.com/MoneyballMovie?ref=pr...
3 Gladiator Movie True https://www.facebook.com/gladiatormovie?ref=pr...
4 The Matrix Movie True https://www.facebook.com/TheMatrixMovie?ref=pr...
5 Star Wars Movie True https://www.facebook.com/StarWars?ref=profile
6 Batman: The Dark Knight Movie True https://www.facebook.com/darkknight?ref=profile
7 Waiting for "Superman" Movie True https://www.facebook.com/waitingforsuperman?re...
8 Iron Man Movie True https://www.facebook.com/ironman?ref=profile
9 Fearless Movie False https://www.facebook.com/pages/Fearless/112430...
10 The Godfather Movie True https://www.facebook.com/thegodfather?ref=profile
11 Disney Pixar Company True https://www.facebook.com/DisneyPixar?ref=profile

Part 2 and 3: Exploratory Analysis and Visualization

Data Set Description

In this assignment, you will use the FEC dataset. The US Federal Election Commission publishes data on contributions to political campaigns. This includes contributor names, occupation and employer, address, and contribution amount. Specifically, we will be using the FEC data from 2012 election between Barack Obama and Mitt Romney. This is widely considered as a landmark election as both sides spent an unprecedented amount of 1 Billion dollars each (i.e. more than 6000 Crores in INR each). If you are interested, you can download the entire list of contributor details at the FEC site. It is relatively large (150 MB compressed, 1 GB uncompressed). For our experiments, we will use a smaller subset of the data collected (and cleaned) by Wes McKinney (the creator of Pandas). For the download link, see the Assignments page. It is small by most standards (around 1.4 million records, 20 MB compressed, 160 MB uncompressed) but large enough to give a taste of why data mining is compute intensive. Hopefully, this will also give you an appreciation as to the awesomeness of Pandas/Numpy - you can do really cool stuff with 2-3 lines of code that runs in seconds.

Dataset Description: You can find the details (such as meaning of the column names) of the dataset in the FEC website.

While a knowledge of US Elections or FEC Campaign finance rules is not necessary for solving this assignment, you can check out the following links if you are curious.

How to Become the US President: A Step-by-Step Guide

US Election guide: how does the election work?

If you are fascinated with Campaign finance reform (as I am), here are some good links:

Super-PACs and Dark Money: ProPublica’s Guide to the New World of Campaign Finance

40 charts that explain money in politics

Visualization

Visualization is a key component of exploration. You will perform a number of 1-D and 2-D charts to get some intuition about the data. You can choose to use either Matplotlib or Seaborn for plotting. The default figures generated from Matplotlib might look a bit ugly. So you might want to try Seaborn to get better figures. The defaults in Seaborn are much saner and sometimes makes your life lot easier. Seaborn has number of styles - so feel free to experiment with them and choose the one you like. We have earmarked 10 points for the aesthetics of your visualizations.

Pandas

Almost all the tasks below could be purely solved using Pandas. Most analytic tasks should not require more than 2-3 lines of code (visualization on the other hand is a different matter - you might need 5-10 lines for each chart unless you use Seaborn). Here is a NON COMPREHENSIVE list of functions that you might want to know to solve this assignment : agg, apply, argmax, argmin, count, crosstab, cumsum, cut, describe, groupby, head, idxmax, idxmin, info, isin, map, max, min, pivot_table, size, sum, transform, unique, value_counts .

Exploration Tasks

You can find a set of exploratory analytics tasks below. Ensure that you clearly follow the instructions. The assignment will be graded automatically - so failure to follow might cause some issues. Also do NOT rename the functions or variables that the instructor has set. Of course, feel free to create new variables/functions that will hep your code.

Reading and Filtering Dataset

The FEC data contains information about all presidential candidates. As a sitting president, Barack Obama was the only candidate from the Democratic party. In the Republican party, there was a process called Primary (see links above) where number of candidates competed to be the nominee. Mitt Romney won the Republican primary and competed with Barack Obama in the elections, which Obama won.

The Python code below reads the FEC dataset into a Pandas data frame with the name fec_all. If your machine has less than 2 GB of RAM, then change the function argument low_memory to True. Once the frame is loaded, we remove all negative contributions (where the campaign refunded amount to a contributor for some reason). Finally, we create a new data frame called fec that contains the contributions to Barack Obama and Mitt Romney alone.

For this code to work, the file 'fec_2012_contribution_subset.csv' must be in the same folder as the notebook.

To reduce my typing, I might refer to Obama as BO and Romney as MR in the text below.

In [25]:
#read the csv file into a Pandas data frame
fec_all = pd.read_csv('fec_2012_contribution_subset.csv', low_memory=False)

#We will be doing party wise analysis later. So, we want to associate each candidate with their party
parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}

#create a new column called party that sets the value to the party of the candidate
# The way this line works is as follows:
#  1. fec_all.cand_nm gives a vector (or Series in Pandas terminology)
#  2. For each row, the code looks up the candidate name to the dictionary parties
#  3. If the name of the candidate (cand_nm) is in parties, it returns the value (i.e. Republican or Democrat)
#  4. This whole thing is done for each row and you get another vector as output
#  5. Finally, you create a new column in fec_all called 'party' and assign the vector you got to it.
#  6. All in a single line :)
fec_all['party'] = fec_all.cand_nm.map(parties)

#ignore the refunds
# Get the subset of dataset where contribution amount is positive
fec_all = fec_all[fec_all.contb_receipt_amt > 0]

#fec_all contains details about all presidential candidates. 
#fec contains the details about contributions to Barack Obama and Mitt Romney only
#for the rest of the tasks, unless explicitly specified, work on the fec data frame.
fec = fec_all[fec_all.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]

Task 1: Descriptive Statistics

Let us start with some easy stuff. As of now, you do not know anything about the dataset. So the first task will be to get some basic sense.

In [26]:
#Task 1a: print the details of the data frame. 
# Basically, this should contain information such as number of rows,columns, name of columns, #non-null values for each column etc
# Hint: there is a Pandas function to do this
#replace None with your code
print "Task 1a: Details of FEC data frame are: \n", fec.info() 

#Task 1b: finding the number of rows and columns in the data frame.
#Hint: find the property of the data frame that gives the number of rows and columns
#replace None with your code
t1b_num_rows = fec.shape[0]
t1b_num_cols = fec.shape[1]
print "\n\nTask 1b: #Rows=%s, #Columns=%s" % (t1b_num_rows, t1b_num_cols) 

#Task 1c: The only numeric data is 'contb_receipt_amt' which is the amount of contribution. 
# Print the descriptive details (min, max, quantiles etc) for 'contb_receipt_amt'
#Hint: as above there is a direct pandas command for it.
#replace None with your code
print "\n\nTask 1c: Descriptive details of contb_receipt_amt is \n", fec.contb_receipt_amt.describe()

#Task 1d: Let us now print the number of unique values for few columns
#Hint: Look for a Pandas function to do this.
t1d_num_uniq_cand_id = fec.cand_id.nunique()
t1d_num_uniq_cand_nm = fec.cand_nm.nunique()
t1d_num_uniq_contbr_city = fec.contbr_city.nunique()
t1d_num_uniq_contbr_st = fec.contbr_st.nunique()
print "\n\nTask 1d: #Uniq cand_id = ", t1d_num_uniq_cand_id
print "Task 1d: #Uniq cand_num = ", t1d_num_uniq_cand_nm
print "Task 1d: #Uniq contbr_city = ", t1d_num_uniq_contbr_city
print "Task 1d: #Uniq contbr_st = ", t1d_num_uniq_contbr_st
Task 1a: Details of FEC data frame are: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 694282 entries, 411 to 701385
Data columns (total 17 columns):
cmte_id              694282 non-null object
cand_id              694282 non-null object
cand_nm              694282 non-null object
contbr_nm            694282 non-null object
contbr_city          694275 non-null object
contbr_st            694278 non-null object
contbr_zip           694234 non-null object
contbr_employer      693607 non-null object
contbr_occupation    693524 non-null object
contb_receipt_amt    694282 non-null float64
contb_receipt_dt     694282 non-null object
receipt_desc         2345 non-null object
memo_cd              87387 non-null object
memo_text            90672 non-null object
form_tp              694282 non-null object
file_num             694282 non-null int64
party                694282 non-null object
dtypes: float64(1), int64(1), object(15)
memory usage: 95.3+ MB
None


Task 1b: #Rows=694282, #Columns=17


Task 1c: Descriptive details of contb_receipt_amt is 
count     694282.000000
mean         322.942745
std         4482.130250
min            0.010000
25%           35.000000
50%          100.000000
75%          250.000000
max      2014490.510000
Name: contb_receipt_amt, dtype: float64


Task 1d: #Uniq cand_id =  2
Task 1d: #Uniq cand_num =  2
Task 1d: #Uniq contbr_city =  11856
Task 1d: #Uniq contbr_st =  67

Task 2: Basic Filtering

In this task, we will perform some very high level filtering. Pandas has a convenient and powerful syntax for filtering (for eg, see above for how I filtered negative contributions and non-Obama, Romney candidates).

In [27]:
#Task 2a: Let us find out how much contributions did Obama and Romney made in this dataset
# Remember, that this is not the complete amount as it excludes other sources like PACs, Super PACs and 
#  spending by party committes.
#Hint: use cand_nm field
t2a_tot_amt_obama = fec.contb_receipt_amt[fec.cand_nm == 'Obama, Barack'].sum()
t2a_tot_amt_romney = fec.contb_receipt_amt[fec.cand_nm == 'Romney, Mitt'].sum()
print "Task 2a: Total Contribution for Obama is %s and for Romney is %s" % (t2a_tot_amt_obama, t2a_tot_amt_romney)

#Task 2b: How much contribution did folks from California, New York and Texas make totally (i.e. to both Obama and Romney).
#use contbr_st field
t2b_tot_amt_CA = fec.contb_receipt_amt[fec.contbr_st == 'CA'].sum()
t2b_tot_amt_NY = fec.contb_receipt_amt[fec.contbr_st == 'NY'].sum()
t2b_tot_amt_TX = fec.contb_receipt_amt[fec.contbr_st == 'TX'].sum()
print "\n\nTask 2b: Total contributions from California is %s, New York is %s and Texas is %s" % (t2b_tot_amt_CA, t2b_tot_amt_NY, t2b_tot_amt_TX)

#Task 2c: Let us now use multiple filtering criteria
# How much money did folks from Texas made to BO and MR?
# How much money did folks from UT Arlington made to BO and MR?
t2c_tot_contr_tx_bo = fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contbr_st == 'TX')].sum()
t2c_tot_contr_tx_mr = fec.contb_receipt_amt[(fec.cand_nm == 'Romney, Mitt') & (fec.contbr_st == 'TX')].sum()
t2c_tot_contr_uta_bo = fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contbr_employer == 'UT ARLINGTON') 
                                             & (fec.contbr_st == 'TX')].sum()
t2c_tot_contr_uta_mr = fec.contb_receipt_amt[(fec.cand_nm == 'Romney, Mitt') & (fec.contbr_employer == 'UT ARLINGTON') 
                                             & (fec.contbr_st == 'TX')].sum()

print "\n\nTask 2c: From TX, BO got %s and MR got %s dollars" % (t2c_tot_contr_tx_bo, t2c_tot_contr_tx_mr)
print "Task 2c: From UTA, BO got %s and MR got %s dollars" % (t2c_tot_contr_uta_bo, t2c_tot_contr_uta_mr)

#Task 2d: How much did Engineers from Google gave to BO and MR.
# This task is a bit tricky as there are many variations: eg, SOFTWARE ENGINEER vs ENGINEER and GOOGLE INC. vs GOOGLE
t2d_tot_engr_goog_bo = fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contbr_employer.str.contains("GOOGLE") == True) 
                                             & (fec.contbr_occupation.str.contains("ENGINEER") == True)].sum()
t2d_tot_engr_goog_mr = fec.contb_receipt_amt[(fec.cand_nm == 'Romney, Mitt') & (fec.contbr_employer.str.contains("GOOGLE") == True) 
                                             & (fec.contbr_occupation.str.contains("ENGINEER") == True)].sum()

print "\n\nTask 2d: From Google Engineers, BO got %s and MR got %s dollars" % (t2d_tot_engr_goog_bo, t2d_tot_engr_goog_mr)
Task 2a: Total Contribution for Obama is 135877427.24 and for Romney is 88335907.53


Task 2b: Total contributions from California is 35062620.84, New York is 24836131.14 and Texas is 12792822.13


Task 2c: From TX, BO got 6570832.45 and MR got 6221989.68 dollars
Task 2c: From UTA, BO got 750.0 and MR got 0 dollars


Task 2d: From Google Engineers, BO got 87212.4 and MR got 2850.0 dollars

Task 3: Basic Aggregation

In this task, we will perform some very high level aggregation. Pandas has some convenient functions for aggregation (do NOT write a for loop - Pandas has some very efficient, vectorized code).

In [28]:
#Task 3a: For each state, print the total contribution they made to both candidates. 
t3a_state_contr_both = fec.contb_receipt_amt.groupby(fec.contbr_st).sum()
print "Task 3a: Total contribution made to both candidates by each state are:\n", t3a_state_contr_both

#Task 3b: Now let us limit ourselves to TX. For each city in TX, print the total contribution made to both candidates
t3b_tx_city_contr_both = fec.contb_receipt_amt[fec.contbr_st == 'TX'].groupby(fec.contbr_city).sum()
print "\n\nTask 3b: Total contribution made to both candidates by each city in TX are:\n", t3b_tx_city_contr_both

#Task 3c: Now let us zoom into  Arlington, TX. For each zipcode in Arlington, print the total contribution made to both candidates
t3c_arlington_contr_both = fec.contb_receipt_amt[(fec.contbr_st == 'TX') & (fec.contbr_city == 'ARLINGTON')].groupby(fec.contbr_zip).sum()
print "\n\nTask 3c: Total contribution made to both candidates by each zipcode in Arlington are:\n", t3c_arlington_contr_both
Task 3a: Total contribution made to both candidates by each state are:
contbr_st
AA              56540.00
AB               2048.00
AE              48653.75
AK             368044.39
AL            1070426.99
AP              38785.50
AR             464803.28
AS               2955.00
AZ            3394913.21
CA           35062620.84
CO            3639143.61
CT            5567766.71
DC            5398676.30
DE             419381.14
FF              99030.00
...
SC            1033475.80
SD             253088.76
TN            2636233.06
TX           12792822.13
UK               2500.00
UT            4237151.85
VA            7725743.04
VI              84212.00
VT            1041740.03
WA            5592454.72
WI            1400471.78
WV             295879.59
WY             446642.58
XX             400250.00
ZZ               5963.00
Name: contb_receipt_amt, Length: 67, dtype: float64


Task 3b: Total contribution made to both candidates by each city in TX are:
contbr_city
ABERNATHY          210.00
ABILENE          45748.00
ADDISON          12244.25
ADKINS             600.00
ALAMO              152.00
ALAMO HEIGHTS     2500.00
ALBA                69.00
ALBANY            6500.00
ALEDO             5880.00
ALICE             2760.00
ALLEN            21066.00
ALLISON            474.00
ALPINE            5166.00
ALTO               120.00
ALTON              125.00
...
WIMBERLEY      19060
WINDCREST        525
WINDTHORST       400
WINFIELD        1000
WINNIE           230
WINNSBORO        375
WOODLANDS       5112
WOODVILLE        175
WOODWAY         2965
WORTHAM          250
WYLIE           8405
YANTIS           340
YOAKUM           100
YORKTOWN         100
ZAPATA           300
Name: contb_receipt_amt, Length: 729, dtype: float64


Task 3c: Total contribution made to both candidates by each zipcode in Arlington are:
contbr_zip
75237            85
76001          1130
760015368      5175
760015696       150
760017554       300
76002           904
760023037       550
760024202       215
760025010      2474
760025461        35
760041185      1000
76006         10050
760062026       400
760062725      1000
760062778       365
...
760172730      125
760172754     1000
760172768       80
760173004      135
760173541      636
760174554      200
760176032      210
760176250      145
760177925       25
760177973      650
76018          105
760181913      170
760182501      100
760184920      100
760185108       35
Name: contb_receipt_amt, Length: 121, dtype: float64

Task 4: Aggregation+Filtering+Ranking

In this task, you will try to combine aggregation with filtering and then rank the results based on the results. Pandas is often quite clever and might sort the data for you already.

Hint: Pandas has ready made functions for all the following.

In [29]:
#Task 4a: Print the number of contributors to Obama in each state.
t4a_num_contr_obama_per_state = fec.contbr_nm[fec.cand_nm == 'Obama, Barack'].groupby(fec.contbr_st).count()
print "Number of contributions to Obama in each state is \n", t4a_num_contr_obama_per_state

#Task 4b: Print the top-10 states (based on number of contributors) that contributed to Obama.
# print both state name and number of contributors
t4b_top10_obama_contr_states = fec[fec.cand_nm == 'Obama, Barack'].contbr_st.value_counts()[:10]
print "\n\nTop-10 states with most contributors to Obama are :\n", t4b_top10_obama_contr_states

#Task 4c: Print the top-20 occupations that contributed overall (to both BO and MR)
t4c_top20_contr_occupation = fec.contbr_occupation.value_counts()[:20]
print "\n\nTop-20 Occupations with most contributors are :\n", t4c_top20_contr_occupation

#Task 4d: Print the top-10 Employers that contributed overall (to both BO and MR)
t4d_top10_contr_employer_all = fec.contbr_employer.value_counts()[:10]
print "\n\nTop-10 Employers with most contributors are :\n", t4d_top10_contr_employer_all
Number of contributions to Obama in each state is 
contbr_st
AA               74
AB                4
AE              395
AK             2036
AL             3854
AP              158
AR             1747
AS               31
AZ            10509
CA           100182
CO            12289
CT             9977
DC            11491
DE             1782
FL            29797
...
QU               1
RI            2038
SC            4228
SD             713
TN            6534
TX           32292
UT            2790
VA           21451
VI             415
VT            3563
WA           20783
WI            8050
WV            1330
WY            1055
ZZ              15
Name: contbr_nm, Length: 64, dtype: int64


Top-10 states with most contributors to Obama are :
CA    100182
NY     50383
IL     33240
TX     32292
FL     29797
MA     24864
MD     22552
VA     21451
WA     20783
PA     19280
dtype: int64


Top-20 Occupations with most contributors are :
RETIRED                                   177473
ATTORNEY                                   30133
INFORMATION REQUESTED                      24747
HOMEMAKER                                  19626
PHYSICIAN                                  17206
INFORMATION REQUESTED PER BEST EFFORTS     12545
PROFESSOR                                  11804
TEACHER                                    11512
CONSULTANT                                 10061
NOT EMPLOYED                                9696
LAWYER                                      7438
ENGINEER                                    6107
PRESIDENT                                   4864
MANAGER                                     4745
WRITER                                      4439
SELF-EMPLOYED                               3763
SALES                                       3697
EXECUTIVE                                   3609
OWNER                                       3408
EDUCATOR                                    3360
dtype: int64


Top-10 Employers with most contributors are :
RETIRED                                   173287
SELF-EMPLOYED                              79638
NOT EMPLOYED                               45727
INFORMATION REQUESTED                      25524
INFORMATION REQUESTED PER BEST EFFORTS     13104
HOMEMAKER                                  12081
OBAMA FOR AMERICA                           1741
STUDENT                                     1417
SELF                                        1228
DISABLED                                    1096
dtype: int64

Task 5: Basic Visualization

Let us take a break from analytics to do some Visualization. Notice that the visualization tasks are NOT ordered based on their complexity. So some of them might be more challenging than others.

In [30]:
#Task 5a: Draw a "horizontal" bar chart with one bar each for Obama and Romney with the value corresponding to total amount they raised.
# Remember to make the bar chart into a horizontal bar chart
#########################begin code for Task 5a
plt.figure()
figObj5a = fec.contb_receipt_amt.groupby(fec.cand_nm).sum().plot(kind='barh', title='Task 5a',color=['blue','red'])
figObj5a.set_xlabel('Total Amount Raised')
figObj5a.set_ylabel('Candidate Names')
#########################end code for Task 5a


#Task 5b: Draw the "horizontal" bar chart of total NUMBER of contributions made per Candidate.
# ie Candidate name vs number of contributions for that candidate
#########################begin code for Task 5b
plt.figure()
figObj5b = fec.contbr_nm.groupby(fec.cand_nm).agg('count').plot(kind='barh', title='Task 5b',color=['blue','red'])
figObj5b.set_xlabel('#Contributors')
figObj5b.set_ylabel('Candidate Names')
#########################end code for Task 5b

#Task 5c: Draw the "horizontal" bar chart of average contributions made per Candidate.
# ie Candidate Name vs avg contribution
#########################begin code for Task 5c
plt.figure()
figObj5c = fec.contb_receipt_amt.groupby(fec.cand_nm).mean().plot(kind='barh',title='Task 5c',color=['blue','red'])
figObj5c.set_xlabel('Average Value of Contributions')
figObj5c.set_ylabel('Candidate Names') 
#########################end code for Task 5c


#Task 5d: Draw a "horizontal" bar chart that lists the top-10 states based on the TOTAL contribution they made to both candidates
# For each state, draw two lines - one in blue for Obama and one in red for Romney
# Display the proportion of the total contribution that came from the state.
# For eg, if Obama made 1 billion and TX gave 100 million of it, the proportion is 10% 
# Remember to make the bar chart into a horizontal bar chart
#########################begin code for Task 5d
plt.figure()
t5dTotalContrib = fec.groupby(['cand_nm', 'contbr_st']).contb_receipt_amt.sum().unstack(0).fillna(0)
t5dTotalContrib['Total'] = t5dTotalContrib['Obama, Barack'] + t5dTotalContrib['Romney, Mitt']
t5dPercentContrib = t5dTotalContrib / t5dTotalContrib.sum(axis=0)
t5dTotalContrib['Obama Pct'],  t5dTotalContrib['Romney Pct'] = t5dPercentContrib['Obama, Barack'], t5dPercentContrib['Romney, Mitt']
t5dTop10States = t5dTotalContrib.sort('Total', ascending=False)[:10]
figObj5d = t5dTop10States[['Obama Pct', 'Romney Pct']].plot(kind='barh', title='Task 5d', color=['blue','red'])
figObj5d.set_xlabel('Percent Contributions')
figObj5d.set_ylabel('States') 
#########################end code for Task 5d


#Task 5e: Now repeat the same process based on Occupation (again top-10)
#########################begin code for Task 5e
plt.figure()
t5eTotalContrib = fec.groupby(['cand_nm', 'contbr_occupation']).contb_receipt_amt.sum().unstack(0).fillna(0)
t5eTotalContrib['Total'] = t5eTotalContrib['Obama, Barack'] + t5eTotalContrib['Romney, Mitt']
t5ePercentContrib = t5eTotalContrib / t5eTotalContrib.sum(axis=0)
t5eTotalContrib['Obama Pct'],  t5eTotalContrib['Romney Pct'] = t5ePercentContrib['Obama, Barack'], t5ePercentContrib['Romney, Mitt']
t5eTop10Occuation = t5eTotalContrib.sort('Total', ascending=False)[:10]
figObj5e = t5eTop10Occuation[['Obama Pct', 'Romney Pct']].plot(kind='barh', title='Task 5e', color=['blue','red'])
figObj5e.set_xlabel('Percent Contributions')
figObj5e.set_ylabel('Occupations') 
#########################end code for Task 5e


#Task 5f: Now repeat the same process based on Employers (again top-10)
#########################begin code for Task 5f
plt.figure()
t5fTotalContrib = fec.groupby(['cand_nm', 'contbr_employer']).contb_receipt_amt.sum().unstack(0).fillna(0)
t5fTotalContrib['Total'] = t5fTotalContrib['Obama, Barack'] + t5fTotalContrib['Romney, Mitt']
t5fPercentContrib = t5fTotalContrib / t5fTotalContrib.sum(axis=0)
t5fTotalContrib['Obama Pct'],  t5fTotalContrib['Romney Pct'] = t5fPercentContrib['Obama, Barack'], t5fPercentContrib['Romney, Mitt']
t5fTop10Employers = t5fTotalContrib.sort('Total', ascending=False)[:10]
figObj5f = t5fTop10Employers[['Obama Pct', 'Romney Pct']].plot(kind='barh', title='Task 5f', color=['blue','red'])
figObj5f.set_xlabel('Percent Contributions')
figObj5f.set_ylabel('Employers')
#########################end code for Task 5f


#Task 5g: Draw the histogram of total NUMBER of contributions made per each state.
# X-axis : state, Y-axis : number of contribution from that state
#########################begin code for Task 5g
plt.figure(figsize=(16, 8)) 
figObj5g = fec.contbr_nm.groupby(fec.contbr_st).count().plot(kind = 'bar',title='Task 5g')
figObj5g.set_xlabel('State Name')
figObj5g.set_ylabel('Number of Contributors')
#########################end code for Task 5g


#Task 5h: Draw a histogram of actual contributions made for Obama. Set bin size as 50
#X-axis: contribution amount bin, Y-axis: frequency
#########################begin code for Task 5h

#####Note: Hist by default, tries to create bins that are neatly arranged from min to max. 
# However, the contribution amounts are very skewed. 
# If you run the command fec.contb_receipt_amt.describe() you can see that 75% of them are below 250 but max value is 2014490.51
# Hence default hist will not work here as most people gave small amount
# For eg, only 10 contributions to Obama was above $5000 (you can verify by the line below)
# len(fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contb_receipt_amt > 5000)])
#There are many ways to fix it - including cut and qcut of pandas.
# I am going for a simple solution - which is to ignore the top-10 contributions (above 5000)
# and plt only the rest
plt.figure()
figObj5h = fec.contb_receipt_amt[(fec.cand_nm == 'Obama, Barack') & (fec.contb_receipt_amt <= 5000)].hist(bins = 50)
figObj5h.set_xlabel('Contribution Bins')
figObj5h.set_ylabel('Frequency')
figObj5h.set_title('Task 5h')
#########################end code for Task 5h


#Task 5i: Draw a histogram of actual contributions made for Romney. Set bin size as 50
#X-axis: contribution amount bin, Y-axis: frequency
#########################begin code for Task 5i
plt.figure()
figObj5i = fec.contb_receipt_amt[(fec.cand_nm == 'Romney, Mitt') & (fec.contb_receipt_amt <= 5000)].hist(bins = 50)
figObj5i.set_xlabel('Contribution Bins')
figObj5i.set_ylabel('Frequency')
figObj5i.set_title('Task 5i')

#########################end code for Task 5i


#Harder
#Task 5j: Draw a line chart showing how the campaign contribution of Obama and Romney varied every quarter
#Use blue for Obama and red for Romney
#This is a bit tricky because, you have to convert contribution date to quarter.
#You can either do it on your own or see if Pandas has some function
#########################begin code for Task 5j

#Ignore the warning 
plt.figure()
t5jContbDateTime = pd.to_datetime(fec.contb_receipt_dt)
fec.loc[:,'contb_receipt_dt_quarter'] = map(lambda dateTime: dateTime.quarter, t5jContbDateTime)
fec.loc[:,'contb_receipt_dt_year'] = map(lambda dateTime: dateTime.year, t5jContbDateTime)
figObj5j = fec['contb_receipt_amt'][fec.cand_nm=='Obama, Barack'].groupby(
                [fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).sum().plot(color='Blue')
fec['contb_receipt_amt'][fec.cand_nm=='Romney, Mitt'].groupby(
                [fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).sum().plot(color='Red')
figObj5j.set_title('Task 5j')
figObj5j.set_xlabel('Quarters/Years')
figObj5j.set_ylabel('Per Quarter Contributions')
figObj5j.legend (['Obama', 'Romney'])
#########################end code for Task 5j


#Harder
#Task 5k: Draw a line chart showing the CUMULATIVE campaign contribution of Obama and Romney varied every quarter
# In other words, if Obama made X, Y, Z in first, second and third quarters
#  then plot X for first quarter, X+Y for second quarter and X+Y+Z for third quarter.
#Use blue for Obama and red for Romney
#This is a bit tricky because, you have to convert contribution date to quarter.
#You can either do it on your own or see if Pandas has some function
#########################begin code for Task 5k
plt.figure()
figObj5k = fec['contb_receipt_amt'][fec.cand_nm=='Obama, Barack'].groupby(
                [fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).sum().cumsum().plot(color='Blue')
fec['contb_receipt_amt'][fec.cand_nm=='Romney, Mitt'].groupby(
                [fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).sum().cumsum().plot(color='Red')
figObj5k.set_title('Task 5j')
figObj5k.set_xlabel('Quarters/Years')
figObj5k.set_ylabel('Per Quarter Contributions')
figObj5k.legend (['Obama', 'Romney'])
#########################end code for Task 5k

#Tasks 5l and 5m
#Repeat 5j and 5k but do it for NUMBER of contributors
#In other words, 5l plots the number of contributors for Obama and Romney, quarter over quarter
#5m plots cumulative number of contributors quarter over quarter.

#########################begin code for Task 5l
plt.figure()
figObj5l = fec['contb_receipt_amt'][fec.cand_nm=='Obama, Barack'].groupby(
                        [fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).count().plot(color='Blue')
fec['contb_receipt_amt'][fec.cand_nm=='Romney, Mitt'].groupby(
                        [fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).count().plot(color='Red')
figObj5l.set_title('Tak 5l')
figObj5l.set_xlabel('Quarters/Years')
figObj5l.set_ylabel('#Contributions')
figObj5l.legend (('Obama', 'Romney'))
#########################end code for Task 5l


#########################begin code for Task 5m
plt.figure()
figObj5m = fec['contb_receipt_amt'][fec.cand_nm=='Obama, Barack'].groupby(
                        [fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).count().cumsum().plot(color='Blue')
fec['contb_receipt_amt'][fec.cand_nm=='Romney, Mitt'].groupby(
                        [fec.contb_receipt_dt_year,fec.contb_receipt_dt_quarter]).count().cumsum().plot(color='Red')
figObj5m.set_title('Tak 5m')
figObj5m.set_xlabel('Quarters/Years')
figObj5m.set_ylabel('#Contributions')
figObj5m.legend (('Obama', 'Romney'))
#########################end code for Task 5m
/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py:245: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = np.nan
/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py:415: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

Out[30]:
<matplotlib.legend.Legend at 0x7fa6502e7bd0>
<matplotlib.figure.Figure at 0x7fa65f4b8790>
<matplotlib.figure.Figure at 0x7fa65e4fab90>
<matplotlib.figure.Figure at 0x7fa61942d550>

Task 6: Discretization

Recall the IMDB dataset that we discussed in the class where we were able to draw lot more interesting plots. The key challenge here is that most of the attributes have too many values (ie. the domain cardinality is very large). There are few ways to work around this issue. We already did something above (ie focussing only top-k). The other option is discretization where we create buckets and put contributions based on the buckets. Discretization in Pandas is acheived by cut function.

In [31]:
#The following set of tasks are a bit tricky: 
#    you need to use multiple commands to achieve. Specifically look at cut, groupby and unstack

#Task 6a: Discretize the contributions of Obama and Romney based on the bins given below.
# For example, if Obama got contributions such as 2, 6, 16, 18, 120, then he has 
#  0 contribution in (0,1], 2 contributions in (1,10], 2 contributions in (10, 100] and 1 contribution in (100, 1000]
bins = np.array([0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
labels = pd.cut(fec.contb_receipt_amt, bins) # set the variable labels to the output of pd.cut
grouped = fec.groupby(['cand_nm', labels]) #Group the data based on labels and candidate names
#Replace None below in the print statement with grouped.size().unstack(0) . 
#    If your code for labels and grouped is correct, this should print number of people in each bin per candidate
print "Task 6a:\n", grouped.size().unstack(0)

#Task 6b: In Task 6a, we calculated the COUNT (i.e. the number of contributors in each bin)
# This by itself is not sufficient.
# In this task, let us compute the TOTAL AMOUNT in each bucket.
# Specifically, compute for each candidate the total amount of contributions that each got in each bucket.
# Continuing the example above, Obama's total contribution in each bucket is
#  0 in (0,1], 8 in (1,10], 34 in (10, 100] and 120 in (100, 1000]
#This could be done in 1 line from the variable grouped above
t6b_bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
print "\n\nTask 6b:\n", t6b_bucket_sums


#Task 6c: Even this does not fully specify the disparity in the funding scenario.
# To see this let us now compute the PROPORTION of total contribution in each bucket
# This is called normalization and is a common operation. 
# Typically normalization is done over the same candidate.
# But for the point I am trying to make, let us normalize within bucket.
# For example, if obama made X in a bucket and Y in another bucket, 
#    then normalization will give X/(X+Y) for Obama and Y/(X+Y) for Romney
# This is also quite easy in Pandas and can be done in 1 line
# The key idea is to realize that sum function takes an argument called axis
#  that does different things for axis=0 and axis=1 (figure out which does what)
t6c_normed_bucket_sums = t6b_bucket_sums.div(t6b_bucket_sums.sum(axis=1), axis=0)
print "\n\nTask 6c:\n", t6c_normed_bucket_sums

#Once you have done this , uncomment the following line to a horizontal stacked bar chart
t6c_normed_bucket_sums[:-2].plot(kind='barh', stacked=True, color=['blue','red'])


#Task 6d: Let us go back and try to do the other analysis
# Let us now try to see what PROPORTION of each candidate's amount came from each bucket.
#  This is the classical case of normalization.
#  Continuing the example above, Obama has made 0+8+34+120=162
#  We can normalize it by diving each bucket by the total
#   For example, 0/162, 8/162, 34/162 and 120/162.
#  If you finished t6c, then this just a matter of playing with axis values.
t6d_normed_bucket_sums = t6b_bucket_sums.div(t6b_bucket_sums.sum(axis=0),axis=1)
print "\n\nTask 6d:\n", t6d_normed_bucket_sums


#Once you have done this , uncomment the following line to a horizontal stacked bar chart
t6d_normed_bucket_sums.plot(kind='barh', stacked=True,color=['blue','red'])
Task 6a:
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                         493            77
(1, 10]                      40070          3681
(10, 100]                   372280         31853
(100, 1000]                 153991         43357
(1000, 10000]                22284         26186
(10000, 100000]                  2             1
(100000, 1000000]                3           NaN
(1000000, 10000000]              4           NaN


Task 6b:
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                      318.24         77.00
(1, 10]                  337267.62      29819.66
(10, 100]              20288981.41    1987783.76
(100, 1000]            54798531.46   22363381.69
(1000, 10000]          51753705.67   63942145.42
(10000, 100000]           59100.00      12700.00
(100000, 1000000]       1490683.08           NaN
(1000000, 10000000]     7148839.76           NaN


Task 6c:
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                    0.805182      0.194818
(1, 10]                   0.918767      0.081233
(10, 100]                 0.910769      0.089231
(100, 1000]               0.710176      0.289824
(1000, 10000]             0.447326      0.552674
(10000, 100000]           0.823120      0.176880
(100000, 1000000]         1.000000           NaN
(1000000, 10000000]       1.000000           NaN


Task 6d:
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                    0.000002      0.000001
(1, 10]                   0.002482      0.000338
(10, 100]                 0.149318      0.022503
(100, 1000]               0.403294      0.253163
(1000, 10000]             0.380885      0.723852
(10000, 100000]           0.000435      0.000144
(100000, 1000000]         0.010971           NaN
(1000000, 10000000]       0.052612           NaN

Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa65eb48090>

Task 7: Big Money in Politics

Using the tool of discretization, we were able to analyze how much people contributed in each bucket. While it showed some sense of imbalance, it did not show it clearly. Recently, the concept of income inequality is becoming a rallying cry among liberals. Let us see if there is any inequality in the contributions (spoiler alert: yes!). We can see if we find anything interesting, not by analyzing, total amount but by the amount contributed by top-X%.

In [32]:
#Task 7a: Write two functions: one for Obama and one for Romney that does the following:
# Given a value of N (N can be 1, 2, 5, 10 etc), sort the contributions made to the candidate in decreasing order
#  Then find how much contribution the top-N% made
#  Then compute the fraction to the overall campaign collection
# For example, if Obama collected 1 billion dollars and the top-1% gave 100 million then their contribution is 10%

def t7a_helper_contributions_by_top_N_pct(candidate_name, N):
    pivotTblUniqContr = pd.tools.pivot.pivot_table(fec[fec.cand_nm == candidate_name], 
                        values='contb_receipt_amt', 
                        index=['contbr_nm', 'contbr_occupation', 'contbr_employer', 'contbr_zip', 'contbr_city','contbr_st'], 
                        aggfunc=np.sum).order(ascending=False)
    totalContrib = pivotTblUniqContr.sum()
    numContrib = pivotTblUniqContr.count()
    percent = int(N/100.0 *  numContrib)
    topNContrib = pivotTblUniqContr.iloc[:percent].sum()
    
    return topNContrib/totalContrib


def t7a_contributions_by_top_N_pct_obama(N):
    return t7a_helper_contributions_by_top_N_pct('Obama, Barack', N)
    
def t7a_contributions_by_top_N_pct_romney(N):
    return t7a_helper_contributions_by_top_N_pct('Romney, Mitt', N)


for N in [1, 2, 5, 10, 20]:
    print "N=%s, Obama proportion=%s and Romney proportion = %s" % (N, 
                t7a_contributions_by_top_N_pct_obama(N), t7a_contributions_by_top_N_pct_romney(N))

    

#Task 7b: Now let us see who these people in 1% are
# Compute the top-1% based on total contribution (ie. money they gave to Obama + Romney)
# Now let us see some information about them.
# For each of the folks in 1%, print the following:
#    name, state, occupation, employer, total amount they gave to Obama, total amount they gave to Romney
contributors_and_contrib = fec.pivot_table(values= 'contb_receipt_amt', 
                        index= ['contbr_nm', 'contbr_occupation', 'contbr_employer', 'contbr_city', 'contbr_st'],
                        columns='cand_nm', aggfunc=np.sum, fill_value=0)
num1Percenters = int(0.01 * len(contributors_and_contrib))
contributors_and_contrib["Total"] = contributors_and_contrib["Obama, Barack"] + contributors_and_contrib["Romney, Mitt"]
contributors_ordered_by_contrib = contributors_and_contrib.sort(["Total"], ascending = False)
t7b_1pcters = contributors_ordered_by_contrib.iloc[:num1Percenters]

display(contributors_ordered_by_contrib)
N=1, Obama proportion=0.150775678905 and Romney proportion = 0.0463774533432
N=2, Obama proportion=0.233644483744 and Romney proportion = 0.078658416654
N=5, Obama proportion=0.424594528848 and Romney proportion = 0.138822888029
N=10, Obama proportion=0.5666857722 and Romney proportion = 0.2390652498
N=20, Obama proportion=0.709150146108 and Romney proportion = 0.439549973343

cand_nm Obama, Barack Romney, Mitt Total
contbr_nm contbr_occupation contbr_employer contbr_city contbr_st
OBAMA VICTORY FUND 2012 - UNITEMIZED ZOOLOGY EDUCATION ] CHICAGO IL 8187796.84 0 8187796.84
WASHINGTON DC 451726.00 0 451726.00
MURPHY, CYNTHIA C. PUBLIC RELATIONS MURPHY GROUP LITTLE ROCK AR 35800.00 0 35800.00
DAVIS, STEPHEN JAMES ATTORNEY BANNEKER PARTNERS SAN FRANCISCO CA 30800.00 0 30800.00
BOULIND, JEANNETTE NOT EMPLOYED NOT EMPLOYED PHILADELPHIA PA 20000.00 0 20000.00
HIEMSTRA, SHERRON RETIRED RETIRED WASHINGTON DC 20000.00 0 20000.00
PEROT, FRANCIS KINCAID CONSULTANT SELF-EMPLOYED WARREN VT 20000.00 0 20000.00
KLEIN, PATRICIA RETIRED RETIRED ZIONSVILLE PA 17750.00 0 17750.00
MEDORE, MARK INVESTOR SELF-EMPLOYED OAKLAND PARK FL 15195.00 0 15195.00
ABRAMS, EDWIN RETIRED RETIRED VERO BEACH FL 15000.00 0 15000.00
ANDERSON, SALLY ADMINISTRATOR ROSWELL ARTIST IN RESIDENCE FOUNDATION ROSWELL NM 15000.00 0 15000.00
TRAUB, JANET E. NOT EMPLOYED NOT EMPLOYED SAN FRANCISCO CA 15000.00 0 15000.00
ROLFE, RONALD LAWYER SELF-EMPLOYED NEW YORK NY 15000.00 0 15000.00
BISHOP, ROBERT C. RETIRED RETIRED REDWOOD CITY CA 15000.00 0 15000.00
STEINBRING, YVONNE RETIRED RETIRED GREEN VALLEY AZ 15000.00 0 15000.00
COHN, NANCY VOLUNTEER VOLUNTEER TALLAHASSEE FL 15000.00 0 15000.00
ANSTROM, DECKER RETIRED RETIRED WASHINGTON DC 15000.00 0 15000.00
DENNE, CONSTANCE AYERS RETIRED RETIRED GREENPORT NY 15000.00 0 15000.00
LANDIS, HOWARD KEL FINANCIAL SERVICES PLEXUS CAPITAL RALEIGH NC 15000.00 0 15000.00
WEST, RUTH ANN INVESTOR SELF-EMPLOYED DURANGO CO 15000.00 0 15000.00
FRANKLIN, VIRGINIA RETIRED RETIRED TULSA OK 14000.00 0 14000.00
CARY, LAURA RETIRED RETIRED DENVER CO 13000.00 0 13000.00
WEISSLER, STUART C. RETIRED RETIRED COCONUT CREEK FL 13000.00 0 13000.00
NORPAC ZOOLOGY EDUCATION ] ENGLEWOOD CLIFFS NJ 0.00 12700 12700.00
BRANDT, RICK MR. EXECUTIVE BRANDT CONSOLUDATED INC TAMPA FL 0.00 12500 12500.00
BROCKMAN, DOROTHY K. MRS. INVESTMENTS SELF-EMPLOYED HOUSTON TX 0.00 12500 12500.00
OKUNO, DALE MR. ANGEL INVESTOR OKUNO ASSOCIATES INC. PASADENA CA 0.00 12500 12500.00
FRYE, MEGAN MS. COMMODITY BROKER WATER STREET ADVISORY PEORIA IL 0.00 12500 12500.00
GREEN, JEFFREY MR. OWNER GREEN CHEVROLET PEORIA IL 0.00 12500 12500.00
BRINCAT, JEFF MR. PRESIDENT CONSUMER FINANCIAL SERVICES LAKE FOREST IL 0.00 12500 12500.00
... ... ... ... ... ... ... ...
GREENBAUM MD, SCOTT DR. PHYSICIAN SELF-EMPLOYED KINGS POINT NY 0.00 3 3.00
BERKOWITZ, PAUL NOT EMPLOYED NOT EMPLOYED FOREST GROVE OR 3.00 0 3.00
SNYDER, KAREN A. MRS. HOMEMAKER HOMEMAKER MCLEAN VA 0.00 3 3.00
GRAVES, CRYSTAL ADMINISTRATION/INSTRUCTOR INTER-CITY SERVICES BERKELEY CA 3.00 0 3.00
ENGLANDER, EVELYN RETIRED RETIRED ESCONDIDO CA 3.00 0 3.00
ARD, COURTNEY SPRINKLER FITTER FIRE PROTECTION SPECI LOCAL 281 SPRINKLER FITTERS UN EMPL.3Y CHICAGO IL 3.00 0 3.00
MANTONE, MARC L. MR. SALES MERRILL LYNCH TOMS RIVER NJ 0.00 3 3.00
LARKIN, KATHLEEN UNDERGRADUATE STUDENT PENN STATE UNIVERSITY WEST ISLIP NY 3.00 0 3.00
PRESSLEY, NORM MR. FARMER SELF-EMPLOYED SAN DIEGO CA 0.00 3 3.00
KUNESH, JASON LEAD UI/UX DEVELOPER OBAMA FOR AMERICA CHICAGO IL 3.00 0 3.00
RICHO, JOHN H ZOOLOGY EDUCATION RETIRED LOS ANGELES CA 3.00 0 3.00
SCHNEIDER, THOMAS DIRECTOR COPS TRUST ROCHESTER HILLS MI 3.00 0 3.00
SCOTT, MATTHEW STUDENT GWU WEST ORANGE NJ 3.00 0 3.00
GREYF, JANE MS. ATTORNEY GOODWIN PROCTER NEW YORK NY 0.00 3 3.00
ODLE, TERESA EDITOR SELF-EMPLOYED ALBUQUERQUE NM 3.00 0 3.00
MCCARTHY, JAMES PROGRAMMER SELF BUTTE MT 3.00 0 3.00
GALLAGHER, PATRICK 3RD GENERATION FAMILY BUSINESS OWNER GALLAGHER ASPHALT CORPORATION CHICAGO IL 0.00 3 3.00
KERBER, JOHN BERNARD NOT EMPLOYED NOT EMPLOYED EDEN PRAIRIE MN 3.00 0 3.00
GREEN, PETER FILM AND TELEVISION PRODUCER NOT EMPLOYED SHERMAN OAKS CA 3.00 0 3.00
PLEAS, OCTAVIA RETIRED MSW RETIRED OVERLAND PARK KS 3.00 0 3.00
WHEELER, LINDA L PHYSICIAN CASS REGIONAL MED. CENTER HARRISONVILLE MO 2.50 0 2.50
BROWN, SHERESE MONIQUE MONIQUE DATA MANAGER SEIU HYATTSVILLE MD 2.00 0 2.00
GOULT, RODERICK CONSULTANT THE VICTORIA GROUP, INC. SALEM NH 2.00 0 2.00
JOYCE, LOUIS FIELD ORGANIZER ORGANIZING FOR AMERICA - VIRGINIA LEESBURG VA 2.00 0 2.00
WINTERSTEINER, PETER P. SCIENTIST ARCON CORP WABAN MA 2.00 0 2.00
FOSS, MARNA NOT EMPLOYED NOT EMPLOYED APEX NC 2.00 0 2.00
TRAN, KHIEM INFORMATION REQUESTED PER BEST EFFORTS INFORMATION REQUESTED PER BEST EFFORTS QUINCY MA 0.00 2 2.00
MICHAL, JUNE B LEGAL SECRETARY RETIRED PUYALLUP WA 1.00 0 1.00
SHUBIN, ALAN CONSULTANT TOWERS WATSON CHARLOTTE NC 1.00 0 1.00
WILSON, LAURA WRITER OBAMA FOR AMERICA CHICAGO IL 0.08 0 0.08

289862 rows × 3 columns

Task 8: Political Polarization in USA

It has been argued that political polarization has dramatically increased in USA in the last 10-15 years. While there are many ways to analyze the dataset for polarization, let us use some simple measures.

In [33]:
#Task 8a: One way to measure polarity is to see how different some distribution are. 
# For both Obama and Romney, print the top-10 states based on their total AMOUNT of contribution
# Do you see lot of common states? 
t8a_top_10_states_obama = fec.contb_receipt_amt[fec.cand_nm == 'Obama, Barack'].groupby(fec.contbr_st).sum().order(ascending=False)[:10]
t8a_top_10_states_romney = fec.contb_receipt_amt[fec.cand_nm == 'Romney, Mitt'].groupby(fec.contbr_st).sum().order(ascending=False)[:10]

print "\n\nT8a: Top 10 states for Obama:\n",t8a_top_10_states_obama
print "\n\nT8a: Top 10 states for Romney:\n",t8a_top_10_states_romney

#Task 8b: For both Obama and Romney, print the top-10 occupation based on their total AMOUNT of contribution
# Do you see lot of common occupation? 
t8b_top_10_occu_obama = fec['contb_receipt_amt'][fec.cand_nm == 'Obama, Barack'].groupby(fec.contbr_occupation).sum().order(ascending=False)[:10]
t8b_top_10_occu_romney = fec['contb_receipt_amt'][fec.cand_nm == 'Romney, Mitt'].groupby(fec.contbr_occupation).sum().order(ascending=False)[:10]

print "\n\nT8b: Top 10 Occupation for Obama:\n", t8b_top_10_occu_obama
print "\n\nT8b: Top 10 Occupation for Romney:\n", t8b_top_10_occu_romney



#Task 8c: For both Obama and Romney, print the top-10 employers based on their total AMOUNT of contribution
# Do you see lot of common employers? 
t8c_top_10_emp_obama = fec['contb_receipt_amt'][fec.cand_nm == 'Obama, Barack'].groupby(fec.contbr_employer).sum().order(ascending=False)[:10]
t8c_top_10_emp_romney = fec['contb_receipt_amt'][fec.cand_nm == 'Romney, Mitt'].groupby(fec.contbr_employer).sum().order(ascending=False)[:10]

print "\n\nT8b: Top 10 Employers for Obama:\n", t8c_top_10_emp_obama
print "\n\nT8b: Top 10 Employers for Romney:\n", t8c_top_10_emp_romney


#Harder
#Task 8d: Here is another way to compute polarization
# Find the top-1000 contributors based on their TOTAL contribution (to both Obama and Romney)
# For each of the top-1000 folks count the number of people who donated to both, to Obama only and to Romney only
top1000Contributors = fec.pivot_table(values='contb_receipt_amt',
                                      index=['contbr_nm','contbr_employer','contbr_st','cand_nm'],
                                      aggfunc = np.sum).order(ascending = False).head(1000).unstack().fillna(0)
t8d_top_1000_both = len(top1000Contributors[ (top1000Contributors["Obama, Barack"] > 0) & (top1000Contributors["Romney, Mitt"] > 0)])
t8d_top_1000_BO_only = len(top1000Contributors[ (top1000Contributors["Obama, Barack"] > 0) & (top1000Contributors["Romney, Mitt"] == 0)])
t8d_top_1000_MR_only = len(top1000Contributors[ (top1000Contributors["Obama, Barack"] == 0) & (top1000Contributors["Romney, Mitt"] > 0)])

print "\n\nT8c:Both, Obama only, Romney only = ", t8d_top_1000_both, t8d_top_1000_BO_only, t8d_top_1000_MR_only

#Harder:
#Task 8e: Here is yet another way
# For each state, compute what fraction of amount went to Obama and what fraction went to Romney
# If there is no polarization, then both will get more or less equal amount. 
# If there is polarization, then the amount will skewed.
# Let us use a very crude measure to compute polarization
#    If X is what Obama got from a state and Y is what Romney got, compute the value of max(X,Y)/ (min(X,Y) + 1)
# For each state compute this value and sort the results in a descending order. 
# Do you see any pattern?
t8e = fec.pivot_table(values = 'contb_receipt_amt', index = ['contbr_st'], columns = ['cand_nm'],  aggfunc = np.sum, fill_value=0)
t8e["polarity"] = t8e.apply(
                        lambda row: max(row["Obama, Barack"], row["Romney, Mitt"])  / (min(row["Obama, Barack"], row["Romney, Mitt"])+1), 
                        axis=1)
t8e_state_contr_ranked_by_polarity = t8e.sort(["polarity"], ascending=False)
print "\n\nt8e:States ordered by polarity \n", t8e_state_contr_ranked_by_polarity


#Harder:
#Task 8f: Repeat the above analysis for occupation
# However, instead of taking all occupations, let us only take the top-50 occupations based on TOTAL contributions made
# For each occupation compute this value and sort the results in a descending order and displ
# Do you see any pattern?
t8f = fec.pivot_table(values = 'contb_receipt_amt', index = ['contbr_occupation'], columns = ['cand_nm'],  aggfunc = np.sum, fill_value=0)
t8f['Total'] = t8f['Obama, Barack'] + t8f['Romney, Mitt']
t8f = t8f.sort(['Total'], ascending=False)[:50]
t8f["polarity"] = t8f.apply(
                        lambda row: max(row["Obama, Barack"], row["Romney, Mitt"])  / (min(row["Obama, Barack"], row["Romney, Mitt"])+1), 
                        axis=1)
t8f_occu_contr_ranked_by_polarity = t8f.sort(["polarity"], ascending=False)
print "\n\nt8f: Occupation ordered by polarity:\n", t8f_occu_contr_ranked_by_polarity


#Harder:
#Task 8g: A known variable of polarization is based on where a person lives.
# At the risk of too much generalization, liberals dominate cities while conservations dominate rural areas
# Let us see if this holds in Texas.
# Texas is a known solid red (i.e. conservative) state.
# For each city in Texas, compute the polarity and order them by polarity.
# Do you see any pattern?
t8g = fec[fec.contbr_st == 'TX'].pivot_table(values = 'contb_receipt_amt', 
                                            index = ['contbr_city'], columns=['cand_nm'], aggfunc=np.sum, fill_value=0)
t8g['Total'] = t8g['Obama, Barack'] + t8g['Romney, Mitt']	
t8g["polarity"] = t8g.apply(
                        lambda row: max(row["Obama, Barack"], row["Romney, Mitt"])  / (min(row["Obama, Barack"], row["Romney, Mitt"])+1), 
                        axis=1)
t8g_tx_city_contr_ranked_by_polarity = t8g.sort(['polarity'], ascending=False)
print "\n\nt8f: Texas cities ordered by polarity:\n", t8g_tx_city_contr_ranked_by_polarity


T8a: Top 10 states for Obama:
contbr_st
CA           23824984.24
IL           16443895.84
NY           14651918.51
FL            7318178.58
MA            6649015.25
TX            6570832.45
MD            4832663.93
DC            4373538.80
VA            4259977.19
WA            4250933.16
Name: contb_receipt_amt, dtype: float64


T8a: Top 10 states for Romney:
contbr_st
CA           11237636.60
NY           10184212.63
FL            8338458.81
TX            6221989.68
MA            4710542.30
UT            3717300.48
IL            3628571.53
CT            3499475.45
VA            3465765.85
NJ            3333320.20
Name: contb_receipt_amt, dtype: float64


T8b: Top 10 Occupation for Obama:
contbr_occupation
RETIRED                  25305116.38
ATTORNEY                 11141982.97
INFORMATION REQUESTED     4866973.96
HOMEMAKER                 4248875.80
PHYSICIAN                 3735124.94
LAWYER                    3160478.87
CONSULTANT                2459912.71
PROFESSOR                 2165071.08
CEO                       2073284.79
PRESIDENT                 1878509.95
Name: contb_receipt_amt, dtype: float64


T8b: Top 10 Occupation for Romney:
contbr_occupation
RETIRED                                   11508473.59
INFORMATION REQUESTED PER BEST EFFORTS    11396894.84
HOMEMAKER                                  8147446.22
ATTORNEY                                   5364718.82
PRESIDENT                                  2491244.89
EXECUTIVE                                  2300947.03
C.E.O.                                     1968386.11
INVESTOR                                   1537595.12
CONSULTANT                                 1424894.01
PHYSICIAN                                  1368023.96
Name: contb_receipt_amt, dtype: float64


T8b: Top 10 Employers for Obama:
contbr_employer
RETIRED                  22694358.85
SELF-EMPLOYED            17080985.96
NOT EMPLOYED              8586308.70
INFORMATION REQUESTED     5053480.37
HOMEMAKER                 2605408.54
SELF                      1076531.20
SELF EMPLOYED              469290.00
STUDENT                    318831.45
VOLUNTEER                  257104.00
MICROSOFT                  215585.36
Name: contb_receipt_amt, dtype: float64


T8b: Top 10 Employers for Romney:
contbr_employer
INFORMATION REQUESTED PER BEST EFFORTS    12059527.24
RETIRED                                   11506225.71
HOMEMAKER                                  8147196.22
SELF-EMPLOYED                              7409860.98
STUDENT                                     496490.94
CREDIT SUISSE                               281150.00
MORGAN STANLEY                              267266.00
GOLDMAN SACH & CO.                          238250.00
BARCLAYS CAPITAL                            162750.00
H.I.G. CAPITAL                              139500.00
Name: contb_receipt_amt, dtype: float64


T8c:Both, Obama only, Romney only =  0 747 253


t8e:States ordered by polarity 
cand_nm    Obama, Barack  Romney, Mitt       polarity
contbr_st                                            
XX                  0.00        400250  400250.000000
FF                  0.00         99030   99030.000000
ZZ               5963.00             0    5963.000000
AS               2955.00             0    2955.000000
UK                  0.00          2500    2500.000000
AB               2048.00             0    2048.000000
ON               1955.00             0    1955.000000
FM                600.00             0     600.000000
QU                500.00             0     500.000000
AA              56405.00           135     414.742647
VI              80712.00          3500      23.053985
AP              37130.50          1655      22.421800
VT             986510.59         55229      17.861861
ME            1167760.12        117152       9.967821
AE              42973.75          5680       7.564469
PR             216019.00         29125       7.416707
UT             519851.37       3717300       7.150684
HI             795212.64        111763       7.115105
MN            1744387.14        293656       5.940220
NM             906162.36        168011       5.393438
IL           16443895.84       3628572       4.531780
DC            4373538.80       1025137       4.266293
WI            1130155.46        270316       4.180852
DE             336669.14         82712       4.070329
ID             197538.06        787158       3.984822
AR             359247.28        105556       3.403349
AK             281840.15         86204       3.269418
WA            4250933.16       1341522       3.168737
MP               3855.00          1250       3.081535
GU              11581.50          3850       3.007401
...                  ...           ...            ...
MT             300225.24        161629       1.857485
NC            2357067.63       1273604       1.850705
LA             548013.54        991237       1.808779
CT            2068291.26       3499475       1.691963
MS             195197.17        330183       1.691527
IN             883691.81        542086       1.630166
SC             630732.94        402743       1.566089
ND              58999.90         39392       1.497725
NH             616994.85        424839       1.452299
NY           14651918.51      10184213       1.438689
CO            2132429.49       1506714       1.415284
OK             594342.52        839484       1.412456
MA            6649015.25       4710542       1.411518
NE             251311.97        178600       1.407114
GA            2786399.49       1995726       1.396183
KS             448038.57        326634       1.371680
TN            1119315.02       1516918       1.355219
WV             169154.47        126725       1.334805
WY             194046.74        252596       1.301721
AZ            1506476.98       1888436       1.253544
VA            4259977.19       3465766       1.229159
FL            7318178.58       8338459       1.139417
NV             710693.67        630049       1.127996
KY             714954.32        666903       1.072050
TX            6570832.45       6221990       1.056066
MI            2570307.25       2448110       1.049915
OH            1822728.83       1901561       1.043249
NJ            3203257.93       3333320       1.040603
MO            1320780.71       1371333       1.038274
AL             543123.48        527304       1.029999

[67 rows x 3 columns]


t8f: Occupation ordered by polarity:
cand_nm                                 Obama, Barack  Romney, Mitt  \
contbr_occupation                                                     
INFORMATION REQUESTED PER BEST EFFORTS           0.00   11396894.84   
INFORMATION REQUESTED                      4866973.96          0.00   
NOT EMPLOYED                               1709188.20          0.00   
PRESIDENT & C.E.O.                               0.00     574161.28   
C.P.A.                                          50.00     412107.11   
C.E.O.                                        1690.00    1968386.11   
LAWYER                                     3160478.87       7705.20   
SOFTWARE ENGINEER                           396985.65      19589.54   
PSYCHOLOGIST                                427299.92      21341.00   
PROFESSOR                                  2165071.08     161362.12   
WRITER                                     1084188.88      96613.12   
EDUCATOR                                    436600.89      40735.00   
ARTIST                                      763125.00      84920.12   
TEACHER                                    1250969.15     141338.39   
PRIVATE EQUITY                               60374.00     418240.00   
EXECUTIVE DIRECTOR                          348180.94      54498.00   
ARCHITECT                                   483859.89      82980.00   
CEO                                        2073284.79     355910.92   
BANKER                                      224084.40    1009186.24   
INVESTMENT BANKER                           164011.70     661035.00   
INVESTMENTS                                 160480.00     634922.00   
FINANCE                                     296031.40    1143592.25   
ENGINEER                                    951525.55     326049.24   
INVESTMENT MANAGER                          136840.00     382826.00   
INVESTMENT MANAGEMENT                       138224.00     382332.12   
PHYSICIAN                                  3735124.94    1368023.96   
DIRECTOR                                    471741.73     186255.12   
REAL ESTATE DEVELOPER                       139454.00     317940.00   
RETIRED                                   25305116.38   11508473.59   
ATTORNEY                                  11141982.97    5364718.82   
FINANCIAL ADVISOR                           169075.50     344951.00   
REAL ESTATE                                 528902.09    1054801.00   
HOMEMAKER                                  4248875.80    8147446.22   
BUSINESS EXECUTIVE                          155780.00     278301.48   
INVESTOR                                    884133.00    1537595.12   
CONSULTANT                                 2459912.71    1424894.01   
EXECUTIVE                                  1355161.05    2300947.03   
SELF-EMPLOYED                               672393.40    1092538.44   
CHAIRMAN                                    496547.00     805131.75   
VICE PRESIDENT                              325647.15     525540.89   
MANAGING DIRECTOR                           329688.25     519665.12   
PRINCIPAL                                   250005.44     162740.00   
BUSINESS OWNER                              449979.30     617389.86   
PRESIDENT                                  1878509.95    2491244.89   
MANAGER                                     762883.22     577924.94   
SALES                                       392886.91     501946.86   
STUDENT                                     628099.75     496590.94   
PARTNER                                     395759.50     488100.11   
OWNER                                      1001567.36     875186.24   
ACCOUNTANT                                  249660.24     262555.62   

cand_nm                                       Total         polarity  
contbr_occupation                                                     
INFORMATION REQUESTED PER BEST EFFORTS  11396894.84  11396894.840000  
INFORMATION REQUESTED                    4866973.96   4866973.960000  
NOT EMPLOYED                             1709188.20   1709188.200000  
PRESIDENT & C.E.O.                        574161.28    574161.280000  
C.P.A.                                    412157.11      8080.531569  
C.E.O.                                   1970076.11      1164.036730  
LAWYER                                   3168184.07       410.121574  
SOFTWARE ENGINEER                         416575.19        20.264150  
PSYCHOLOGIST                              448640.92        20.021550  
PROFESSOR                                2326433.20        13.417385  
WRITER                                   1180802.00        11.221847  
EDUCATOR                                  477335.89        10.717814  
ARTIST                                    848045.12         8.986280  
TEACHER                                  1392307.54         8.850818  
PRIVATE EQUITY                            478614.00         6.927371  
EXECUTIVE DIRECTOR                        402678.94         6.388758  
ARCHITECT                                 566839.89         5.830972  
CEO                                      2429195.71         5.825275  
BANKER                                   1233270.64         4.503579  
INVESTMENT BANKER                         825046.70         4.030389  
INVESTMENTS                               795402.00         3.956369  
FINANCE                                  1439623.65         3.863064  
ENGINEER                                 1277574.79         2.918340  
INVESTMENT MANAGER                        519666.00         2.797597  
INVESTMENT MANAGEMENT                     520556.12         2.766013  
PHYSICIAN                                5103148.90         2.730305  
DIRECTOR                                  657996.85         2.532758  
REAL ESTATE DEVELOPER                     457394.00         2.279875  
RETIRED                                 36813589.97         2.198825  
ATTORNEY                                16506701.79         2.076899  
FINANCIAL ADVISOR                         514026.50         2.040207  
REAL ESTATE                              1583703.09         1.994318  
HOMEMAKER                               12396322.02         1.917553  
BUSINESS EXECUTIVE                        434081.48         1.786492  
INVESTOR                                 2421728.12         1.739097  
CONSULTANT                               3884806.72         1.726382  
EXECUTIVE                                3656108.08         1.697913  
SELF-EMPLOYED                            1764931.84         1.624848  
CHAIRMAN                                 1301678.75         1.621458  
VICE PRESIDENT                            851188.04         1.613830  
MANAGING DIRECTOR                         849353.37         1.576227  
PRINCIPAL                                 412745.44         1.536217  
BUSINESS OWNER                           1067369.16         1.372038  
PRESIDENT                                4369754.84         1.326181  
MANAGER                                  1340808.16         1.320036  
SALES                                     894833.77         1.277583  
STUDENT                                  1124690.69         1.264821  
PARTNER                                   883859.61         1.233322  
OWNER                                    1876753.60         1.144404  
ACCOUNTANT                                512215.86         1.051648  


t8f: Texas cities ordered by polarity:
cand_nm           Obama, Barack  Romney, Mitt       Total      polarity
contbr_city                                                            
STAFFORD               29982.25          0.00    29982.25  29982.250000
WHEELER                11622.00          0.00    11622.00  11622.000000
DAINGERFIELD           10125.00          0.00    10125.00  10125.000000
LUMBERTON              10000.00          0.00    10000.00  10000.000000
FARMERS BRANCH          9225.00          0.00     9225.00   9225.000000
DUNCANVILLE             9089.20          0.00     9089.20   9089.200000
FLORESVILLE             8898.50          0.00     8898.50   8898.500000
TROUP                   8686.00          0.00     8686.00   8686.000000
BUDA                    7056.00          0.00     7056.00   7056.000000
DEER PARK               6693.00          0.00     6693.00   6693.000000
ALBANY                     0.00       6500.00     6500.00   6500.000000
ROSHARON                5830.00          0.00     5830.00   5830.000000
HOCKLEY                 5696.00          0.00     5696.00   5696.000000
FREDERICKSBRG           5607.00          0.00     5607.00   5607.000000
NAPLES                  5300.00          0.00     5300.00   5300.000000
UVALDE                  5080.00          0.00     5080.00   5080.000000
GILMER                     0.00       5055.00     5055.00   5055.000000
STERLING CITY              0.00       5000.00     5000.00   5000.000000
MOORE                   5000.00          0.00     5000.00   5000.000000
DANGERFIELD             5000.00          0.00     5000.00   5000.000000
GAIL                       0.00       5000.00     5000.00   5000.000000
FULSHEAR                   0.00       5000.00     5000.00   5000.000000
GRAPELAND                  0.00       5000.00     5000.00   5000.000000
JONESBORO                  0.00       5000.00     5000.00   5000.000000
HENRIETTA               3980.00          0.00     3980.00   3980.000000
CARTHAGE                3910.00          0.00     3910.00   3910.000000
FRESNO                  3905.00          0.00     3905.00   3905.000000
ELGIN                   3780.00          0.00     3780.00   3780.000000
LLANO                   3550.00          0.00     3550.00   3550.000000
HIGHLAND VILLAGE        3231.00          0.00     3231.00   3231.000000
...                         ...           ...         ...           ...
BAYTOWN                10847.00       8600.00    19447.00      1.261132
HUNTSVILLE              2750.00       2200.00     4950.00      1.249432
SUNRISE BEACH            200.00        250.00      450.00      1.243781
SHENANDOAH               160.00        200.00      360.00      1.242236
COMFORT                  205.00        250.00      455.00      1.213592
CORSICANA               2200.00       2650.00     4850.00      1.203998
SPRING BRANCH           2960.00       3555.00     6515.00      1.200608
KATY                   28411.30      33605.12    62016.42      1.182767
BASTROP                 5923.75       5050.00    10973.75      1.172788
BELTON                  2713.00       2340.12     5053.12      1.158847
FAIRVIEW                3499.00       3030.00     6529.00      1.154404
MANSFIELD               7785.00       8850.00    16635.00      1.136656
PLANO                  92154.00      81357.12   173511.12      1.132696
SAN ANTONIO           373400.76     422340.48   795741.24      1.131062
GRAPEVINE              13723.00      12188.00    25911.00      1.125851
DALLAS               1068911.89    1186395.96  2255307.85      1.109909
WESTLAKE                7500.00       8255.00    15755.00      1.100520
PROSPER                 2631.00       2895.36     5526.36      1.100061
DENTON                 20477.00      21916.36    42393.36      1.070239
GRANBURY                3114.00       2925.00     6039.00      1.064252
HARLINGEN               7374.00       7750.00    15124.00      1.050847
BOERNE                 15968.09      16657.00    32625.09      1.043078
HEMPSTEAD               1395.00       1350.00     2745.00      1.032568
KINGWOOD               18729.75      18176.00    36905.75      1.030409
SUGAR LAND             57750.56      58951.65   116702.21      1.020780
MELISSA                  750.00        735.00     1485.00      1.019022
MCALLEN                10397.00      10436.00    20833.00      1.003655
COLLEGE STA             1000.00       1000.00     2000.00      0.999001
CHINA SPRING            1000.00       1000.00     2000.00      0.999001
HORIZON CITY             250.00        250.00      500.00      0.996016

[729 rows x 4 columns]

In [33]: