A Machine Learning Project focused on the 2020 Presidential Election
Introduction
During the pandemic, it has become glaringly obvious that COVID-19 has affected us all, for better or worse. However, many Americans hold vastly different opinions on the issue; COVID-19 has become a political one. So, I decided to dig deeper into this relationship, using techniques from data science to build a machine learning algorithm that makes predictions based on this information. An algorithm with high accuracy will signal a likely relationship between a county's demographics and COVID-19 statistics and its political affiliation.
Abstract
In this notebook, I will create a k-NN (k-nearest neighbors) classifier whose features are COVID-19 information about a certain county and demographic information about that county. The classifier will return a prediction for who won the 2020 Presidential election in that county. The techniques I will use in this project include, but are not limited to: markdown, importing libraries and .csv files, table manipulation, data cleaning and filtering, defining functions, statistical distribution analysis, for loops, standardization, and basic machine learning. The notebook is embedded on this site and styled with raw HTML and CSS.
Note: I will use first-person-plural pronouns, such as “we” and “our”. In doing so, I refer only to myself and the reader.
Definitions:
Classifiers: A classifier in machine learning is an algorithm that automatically orders or categorizes data into one or more of a set of “classes.” One of the most common examples is an email classifier that scans emails to filter them by class label: Spam or Not Spam.
Per Capita: Per capita means per person. It is a Latin term that translates to “by the head.” It’s commonly used in statistics, economics, and business to report an average per person. For example, a county of 50,000 residents with 5,000 cumulative cases has 5,000/50,000 = 0.1 cases per capita, or 10 cases per 100 residents.
Note: For further clarification of any terms in this project, or for information on any basic topic in data science, please refer to Computational and Inferential Thinking: The Foundations of Data Science, by Ani Adhikari, and John DeNero, with contributions from David Wagner and Henry Milner.
Set-up
Here, we import all of the libraries necessary to complete this project.
In [1]:
from datascience import *
import numpy as np
from math import *
import math as math
import scipy.stats
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import pandas as pd
from IPython.display import *
Next, we will load 3 .csv files with the information needed to build the classifier (and drop unnecessary data), acquired from The New York Times; Politico and The New York Times; and the Census Bureau, respectively. We will then import a .csv file from the EPA to validate population data from our other tables. This validation will be done by randomly selecting fips numbers and seeing whether the data from our two sources on population look plausible; however, we should expect to see a small difference, because the data are from two different years (2015 and 2017).
In [2]:
covid_county = Table().read_table('us-counties.csv').drop(0,6,7,8,9)
covid_county.show(5)
| county | state | fips | cases | deaths |
|---|---|---|---|---|
| Autauga | Alabama | 1001 | 7150 | 111 |
| Baldwin | Alabama | 1003 | 21661 | 311 |
| Barbour | Alabama | 1005 | 2337 | 59 |
| Bibb | Alabama | 1007 | 2665 | 64 |
| Blount | Alabama | 1009 | 6887 | 139 |
… (3242 rows omitted)
In [3]:
elect_county = Table().read_table('county-elections.csv').select(1,7,8).relabel(0,'fips')
elect_county.show(5)
| fips | per_gop | per_dem |
|---|---|---|
| 1001 | 0.714368 | 0.270184 |
| 1003 | 0.761714 | 0.22409 |
| 1005 | 0.534512 | 0.457882 |
| 1007 | 0.784263 | 0.206983 |
| 1009 | 0.895716 | 0.0956938 |
… (3147 rows omitted)
In [4]:
complete = Table().read_table('county_complete.csv').drop(3,4,5,6,7,8,9,10).select(0,1,2,3,'median_age_2019','white_2019','hs_grad_2019','bachelors_2019','median_household_income_2019','poverty_2019','unemployment_rate_2019').drop(1,2)
complete.show(5)
| fips | pop2017 | median_age_2019 | white_2019 | hs_grad_2019 | bachelors_2019 | median_household_income_2019 | poverty_2019 | unemployment_rate_2019 |
|---|---|---|---|---|---|---|---|---|
| 1001 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5 |
| 1003 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4 |
| 1005 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4 |
| 1007 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7 |
| 1009 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1 |
… (3137 rows omitted)
In [5]:
pop_county = Table().read_table('county-population.csv').select(7,5)
pop_county.show(5)
| fips | 2015 POPULATION |
|---|---|
| 1001 | 55,347 |
| 1003 | 203,709 |
| 1005 | 26,489 |
| 1007 | 22,583 |
| 1009 | 57,673 |
We can see below that the two population columns are very similar, so it is fair to say that the data are likely accurate.
In [6]:
pop_county.join('fips', complete.select(0,1)).sample(10).show(10)
| fips | 2015 POPULATION | pop2017 |
|---|---|---|
| 42093 | 18,557 | 18272 |
| 12039 | 46,036 | 46071 |
| 6103 | 63,308 | 63926 |
| 13155 | 9,245 | 9410 |
| 51600 | 24,013 | 24097 |
| 21121 | 31,730 | 31227 |
| 40005 | 13,793 | 13887 |
| 8015 | 18,658 | 19638 |
| 48485 | 131,705 | 132000 |
| 48469 | 92,382 | 92084 |
… (3229 rows omitted)
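Beyond eyeballing a random sample, we could quantify the agreement. Below is a minimal sketch of such a check; it assumes the '2015 POPULATION' values are read in as strings with comma separators, as displayed above.

# Optional, more systematic check (a sketch): compute each county's
# percent difference between the 2015 and 2017 population figures.
joined = pop_county.join('fips', complete.select(0, 1))
pop_2015 = joined.apply(lambda s: int(str(s).replace(',', '')), '2015 POPULATION')
pct_diff = abs(joined.column('pop2017') - pop_2015) / pop_2015 * 100
print('Median percent difference:', round(np.median(pct_diff), 2))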
Now, we’ll clean up the data from these tables and join them into one table, which can be used to build the classifier.
In [7]:
county = covid_county
county = county.where('cases', are.above_or_equal_to(0)).where('deaths', are.above_or_equal_to(0)).where('fips', are.below(60000))
counties_1 = county.join('fips', elect_county)
counties = counties_1.join('fips',complete)
counties.show(5)
| fips | county | state | cases | deaths | per_gop | per_dem | pop2017 | median_age_2019 | white_2019 | hs_grad_2019 | bachelors_2019 | median_household_income_2019 | poverty_2019 | unemployment_rate_2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga | Alabama | 7150 | 111 | 0.714368 | 0.270184 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5 |
| 1003 | Baldwin | Alabama | 21661 | 311 | 0.761714 | 0.22409 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4 |
| 1005 | Barbour | Alabama | 2337 | 59 | 0.534512 | 0.457882 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4 |
| 1007 | Bibb | Alabama | 2665 | 64 | 0.784263 | 0.206983 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7 |
| 1009 | Blount | Alabama | 6887 | 139 | 0.895716 | 0.0956938 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1 |
… (3102 rows omitted)
Next, we will relabel the columns of the counties table.
In [8]:
counties = counties.relabel('pop2017','pop').relabel('median_age_2019', 'median age').relabel('white_2019','white').relabel('hs_grad_2019','hs grad').relabel('bachelors_2019','bachelors')
counties = counties.relabel('median_household_income_2019','median household income').relabel('poverty_2019','poverty').relabel('unemployment_rate_2019','unemployment')
counties.show(5)
| fips | county | state | cases | deaths | per_gop | per_dem | pop | median age | white | hs grad | bachelors | median household income | poverty | unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga | Alabama | 7150 | 111 | 0.714368 | 0.270184 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5 |
| 1003 | Baldwin | Alabama | 21661 | 311 | 0.761714 | 0.22409 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4 |
| 1005 | Barbour | Alabama | 2337 | 59 | 0.534512 | 0.457882 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4 |
| 1007 | Bibb | Alabama | 2665 | 64 | 0.784263 | 0.206983 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7 |
| 1009 | Blount | Alabama | 6887 | 139 | 0.895716 | 0.0956938 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1 |
… (3102 rows omitted)
In this cell, we will convert the per_gop and per_dem proportions into percentages, stored in new trump and biden columns; the fix_votes function does the rounding. We will also convert the raw cases and deaths counts into per-capita percentages and reorder the columns.
In [9]:
def fix_votes(x):
"""Return proportion in rouned percentage form"""
return round(x*100,3)
counties['trump'] = counties.apply(fix_votes,'per_gop')
counties['biden'] = counties.apply(fix_votes,'per_dem')
data_1 = counties.drop('poverty', 'per_gop', 'per_dem')
data = data_1.select(0,1,2,5,3,4,12,13,10,11,6,7,8,9).relabel('unemployment', 'unemployed')
data = data.with_columns('cases1',data.column('cases')/data.column('pop'),'deaths1',data.column('deaths')/data.column('pop')).drop('cases', 'deaths')
data = data.relabel('cases1','cases').relabel( 'deaths1', 'deaths')
c = data.apply(fix_votes, 'cases')
d = data.apply(fix_votes, 'deaths')
data = data.drop(12,13).with_columns('cases', c, 'deaths', d)
data = data.select(0,1,2,3,'cases', 'deaths', 4,5,6,7,8,9,10,11)
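As a quick sanity check on the per-capita conversion, we can recompute Autauga County's case percentage from the raw numbers in the earlier tables:

# Sanity check: Autauga County had 7,150 cases and a 2017 population of
# 55,504, so its per-capita case percentage should be about 12.882,
# matching the cases value shown in the table below.
round(7150 / 55504 * 100, 3)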
The Data
In the data table below, the following columns are represented:
Note: All Alaskan counties have been excluded, as Alaskan election results are not reported by county.
fips: An identification number given to each county by the Federal Government
county: The county name
state: The state that county is in, or the District of Columbia
pop: The population of the county (as of 2017)
cases: The number of cumulative COVID-19 cases in that county per capita, expressed as a percentage of the population (as of June 2nd, 2021)
deaths: The number of cumulative COVID-19 deaths in that county per capita, expressed as a percentage of the population (as of June 2nd, 2021)
trump: The percent of votes that were cast for Donald J. Trump in that county in the 2020 Presidential Election
biden: The percent of votes that were cast for Joseph R. Biden in that county in the 2020 Presidential Election
median household income: The median household income of that county (as of 2019)
unemployed: The unemployment rate in that county (as of 2019)
median age: The median age of that county (as of 2019)
white: The percent of residents of that county that are white (as of 2019)
hs grad: The percent of residents of that county that have graduated high school (as of 2019)
bachelors: The percent of residents of that county that have graduated college with a bachelor’s degree (as of 2019)
In [10]:
data.show(5)
| fips | county | state | pop | cases | deaths | trump | biden | median household income | unemployed | median age | white | hs grad | bachelors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga | Alabama | 55504 | 12.882 | 0.2 | 71.437 | 27.018 | 58731 | 3.5 | 38.2 | 76.8 | 88.5 | 26.6 |
| 1003 | Baldwin | Alabama | 212628 | 10.187 | 0.146 | 76.171 | 22.409 | 58320 | 4 | 43 | 86.2 | 90.8 | 31.9 |
| 1005 | Barbour | Alabama | 25270 | 9.248 | 0.233 | 53.451 | 45.788 | 32525 | 9.4 | 40.4 | 46.8 | 73.2 | 11.6 |
| 1007 | Bibb | Alabama | 22668 | 11.757 | 0.282 | 78.426 | 20.698 | 47542 | 7 | 40.9 | 76.8 | 79.1 | 10.4 |
| 1009 | Blount | Alabama | 58013 | 11.871 | 0.24 | 89.572 | 9.569 | 49358 | 3.1 | 40.7 | 95.5 | 80.5 | 13.1 |
… (3101 rows omitted)
We will now modify the data table to make our job easier. The first step is to identify which candidate won each county. To do this, we define the vote function below and create a votes table containing only the two vote-percentage columns. Then, we save the resulting winner array, along with the fips column for later use.
In [11]:
def vote(x,y):
"""Return plurality winner in Presidential Election"""
if x>y:
return 'Trump'
elif y>x:
return 'Biden'
else:
return 'Tie'
votes = data.select('trump', 'biden')
winner = votes.apply(vote,0,1)
fips = data.column('fips')
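As a quick check, we can apply vote to Autauga County's percentages from the data table above:

# Quick check using Autauga County's vote percentages:
vote(71.437, 27.018)   # returns 'Trump', matching that county's actual winner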
We now create two tables: one with the categorical data and one with the numerical data. We also drop each candidate's vote percentage; since we have the winner array, those columns are no longer needed.
In [12]:
cat = data.select(0,1,2).with_column('winner', winner)
num = data.select(3,4,5,8,9,10,11,12,13)
Now, because the units in each column differ, we want to convert the data into standard units (z-scores) so that every column carries the same weight in our classifier. We will do this by defining the s_u and standardize functions.
In [13]:
def s_u(array):
"""Return array in standard units"""
return (array-np.mean(array))/np.std(array)
def standardize(table):
"""Return a table in standard units"""
t = Table()
for column in np.array(table.labels):
col = s_u(table.column(column))
t = t.with_column(column,col)
return t
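As a quick check of s_u on a toy array: the values 1, 2, and 3 have mean 2 and standard deviation of about 0.816, so we expect roughly -1.22, 0, and 1.22.

# Quick check of s_u on a toy array:
s_u(make_array(1, 2, 3))   # returns approximately array([-1.225, 0., 1.225])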
After standardization, each value represents how many standard deviations it lies above or below the mean of its column. We can now call the standardize function on the num table and add back the fips column so we can join all the data into one table.
In [14]:
stan_num = standardize(num)
stan_num_fips = stan_num.with_column('fips', fips)
full = cat.join('fips', stan_num_fips)
full.show(5)
| fips | county | state | winner | pop | cases | deaths | median household income | unemployed | median age | white | hs grad | bachelors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga | Alabama | Trump | -0.141733 | 0.909967 | -0.0287858 | 0.383516 | -0.550973 | -0.608552 | -0.399307 | 0.251456 | 0.487353 |
| 1003 | Baldwin | Alabama | Trump | 0.338787 | -0.0038243 | -0.510629 | 0.354379 | -0.354464 | 0.283253 | 0.176396 | 0.618838 | 1.04212 |
| 1005 | Barbour | Alabama | Trump | -0.234195 | -0.32221 | 0.265674 | -1.47431 | 1.76784 | -0.199808 | -2.23666 | -2.19243 | -1.08275 |
| 1007 | Bibb | Alabama | Trump | -0.242152 | 0.528514 | 0.702903 | -0.409707 | 0.824593 | -0.106912 | -0.399307 | -1.25002 | -1.20836 |
| 1009 | Blount | Alabama | Trump | -0.134059 | 0.567168 | 0.328135 | -0.280965 | -0.708181 | -0.14407 | 0.745974 | -1.02639 | -0.925743 |
… (3101 rows omitted)
As is standard practice in machine learning, we will now randomly split the data into two sets: training, which will be used to train our algorithm, and test, which will be used to evaluate the accuracy of our classifier.
In [15]:
shuf_full = full.sample(with_replacement = False)
training = shuf_full.take(np.arange(1606))
test = shuf_full.take(np.arange(1606,3106))
We’ll now only work with the training table until we have built our classifier. But first, let’s define some functions we will need.
distance: finds the Euclidean distance between two points, represented as arrays
row_distance: uses distance to find the distance between two row objects
distances: returns a copy of training with one extra column containing each row's distance to the given example
closest: returns a table of only the k closest rows
majority_class: finds the most common class among the rows of a given table
classify_1: calls majority_class on the table returned by closest to produce our classifier's prediction
In [16]:
def distance(pt1, pt2):
"""Return the distance between two points, represented as arrays"""
return np.sqrt(sum((pt1 - pt2)**2))
def row_distance(row1, row2):
"""Return the distance between two numerical rows of a table"""
return distance(np.array(row1), np.array(row2))
def distances(training, example):
"""Return training table with distances column"""
distances = make_array()
attributes_only = training.drop('winner',0,1,2)
for row in attributes_only.rows:
distances = np.append(distances, row_distance(row, example))
return training.with_column('Distance', distances)
def closest(training, example, k):
"""
Return a table of the k closest neighbors to example
"""
return distances(training, example).sort('Distance').take(np.arange(k))
def majority_class(topk):
"""Return the class with the highest count"""
return topk.group('winner').sort('count', descending=True).column(0).item(0)
def classify_1(training, example, k):
"""Return the majority class among the k nearest neighbors of example"""
return majority_class(closest(training, example, k))
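Before measuring accuracy over the whole test set, we can sanity-check the pipeline on a single row (a sketch; the choice of k = 5 here is arbitrary):

# Sanity check (a sketch): classify one row of the training table with an
# arbitrary k of 5. The drop mirrors the one inside distances, removing
# the label and the fips/county/state identifier columns.
example = training.drop('winner', 0, 1, 2).row(0)
classify_1(training, example, 5)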
Here, we define one more function, accuracy, which evaluates the classifier on the test set by comparing its predictions to the actual winners and returning the percentage of examples that classify_1 got correct.
In [17]:
def accuracy(training, test, k):
"""Return the proportion of correctly classified examples
in the test set"""
test_attributes = test.drop('winner',0,1,2)
num_correct = 0
for i in np.arange(test.num_rows):
c = classify_1(training, test_attributes.row(i), k)
num_correct = num_correct + (c == test.column('winner').item(i))
return (num_correct / test.num_rows)*100
With some trial and error, we can find that the optimal number of neighbors is k=13; this choice leads to an accuracy of 92.6% on the test set.
In [18]:
accuracy(training, test, 13)
Out [18]:
92.6
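For reference, the trial-and-error search might look like the sketch below (it is slow, since every call to accuracy re-classifies the entire test set):

# A sketch of the search over k; odd values avoid ties between the two classes.
for k in np.arange(1, 26, 2):
    print('k =', k, 'accuracy =', accuracy(training, test, k))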
Thus, we will define the final k-NN function, classify, which always makes its prediction using k=13 and the full table.
In [19]:
def classify(example):
return classify_1(full, example, 13)
Finally, we can create a new function predict which, given a state and county name, will make a prediction using k=13 and return the actual result for comparison. This will use the full table.
In [20]:
def predict(state, county):
ex = full.where('state', state).where('county', county)
example = ex.drop(0,1,2,3).row(0)
result = classify(example)
print(f'My algorithm predicts that {result} won the 2020 Presidential election in {county}, {state}')
return full.where('state',state).where('county',county).select(0,2,1,3)
Below we show a few examples of the predict function in action.
In [21]:
predict('California','Yuba')
Out [21]:
My algorithm predicts that Trump won the 2020 Presidential election in Yuba, California
| fips | state | county | winner |
|---|---|---|---|
| 6115 | California | Yuba | Trump |
In [22]:
predict('Texas','Frio')
Out [22]:
My algorithm predicts that Trump won the 2020 Presidential election in Frio, Texas
| fips | state | county | winner |
|---|---|---|---|
| 48163 | Texas | Frio | Trump |
In [23]:
predict('New York','Nassau')
Out [23]:
My algorithm predicts that Biden won the 2020 Presidential election in Nassau, New York
| fips | state | county | winner |
|---|---|---|---|
| 36059 | New York | Nassau | Biden |
In [24]:
predict('Nebraska','Brown')
Out [24]:
My algorithm predicts that Trump won the 2020 Presidential election in Brown, Nebraska
| fips | state | county | winner |
|---|---|---|---|
| 31017 | Nebraska | Brown | Trump |
In [25]:
predict('Illinois','Lee')
Out [25]:
My algorithm predicts that Trump won the 2020 Presidential election in Lee, Illinois
| fips | state | county | winner |
|---|---|---|---|
| 17103 | Illinois | Lee | Trump |
In [26]:
predict('Mississippi','Bolivar')
Out [26]:
My algorithm predicts that Biden won the 2020 Presidential election in Bolivar, Mississippi
| fips | state | county | winner |
|---|---|---|---|
| 28011 | Mississippi | Bolivar | Biden |
Conclusion and Discussion
As the 92.6% accuracy of the classify function shows, given only basic demographic information about a county and its per-capita COVID-19 case and death counts, we can conclude that the identity of a county's residents and its reaction to the COVID-19 pandemic likely correlate with its political affiliation.
It is possible that the algorithm's ability to predict the winner of each county was due to the demographics alone, or to the COVID-19 data alone. A further project could build a separate classifier for each type of data and compare their accuracies to identify which set of features drives the decision boundary. If both have relatively low accuracy on their own, then the association involves both; if only one is high, we can infer that the classifier in this project succeeded largely because of that set of variables.
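A minimal sketch of that experiment, reusing the training/test split and the accuracy function from above (the column selections are assumptions based on the tables built earlier):

# Sketch: compare a COVID-only classifier with a demographics-only one.
# Each table keeps the identifier and winner columns that accuracy expects.
covid_train = training.select('fips', 'county', 'state', 'winner', 'cases', 'deaths')
covid_test = test.select('fips', 'county', 'state', 'winner', 'cases', 'deaths')
demo_train = training.drop('cases', 'deaths')
demo_test = test.drop('cases', 'deaths')
print('COVID-only accuracy:', accuracy(covid_train, covid_test, 13))
print('Demographics-only accuracy:', accuracy(demo_train, demo_test, 13))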
Another possible extension of this algorithm is to evaluate the classifier's accuracy on other elections: past Presidential elections, or even future ones, such as the 2024 Presidential Election or the 2022 midterm elections.
While this project is limited in scope to these two sets of variables and the 2020 Presidential Election, there are endless variations that could give even deeper insight into the nature of American politics in the context of the two-party system.
If you have any questions about this project or require any clarification on concepts/topics, or need a walkthrough of any of my code, do not hesitate to reach out to me at jonathanferrari@berkeley.com. Cheers!