A Machine Learning Project focused on the 2020 Presidential Election
Introduction
During the pandemic, it has become glaringly obvious that COVID-19 has affected us all, for better or worse. However, many Americans hold vastly different opinions on the issue; COVID-19 has become a political one. So, I decided to dig deeper into this relationship, using techniques from data science to build a machine learning algorithm that makes predictions based on this information. An algorithm with high accuracy will signal a likely relationship between a county's demographics and COVID-19 statistics and its political affiliation.
Abstract
In this notebook, I will create a k-NN (k-nearest neighbors) classifier whose features are COVID-19 information about a certain county and demographic information about that county. The classifier will return a prediction for who won the 2020 Presidential election in that county. The techniques I will use in this project include, but are not limited to: markdown, importing libraries and .csv files, table manipulation, data cleaning and filtering, defining functions, statistical distribution analysis, for loops, standardization, and basic machine learning. The notebook is embedded on this site and styled with raw HTML and CSS.
Note: I will use first-person-plural pronouns, such as “we” and “our”. In doing so, I refer only to myself and the reader.
Definitions:
Classifiers: A classifier in machine learning is an algorithm that automatically orders or categorizes data into one or more of a set of “classes.” One of the most common examples is an email classifier that scans emails to filter them by class label: Spam or Not Spam.
Per Capita: Per capita means per person. It is a Latin term that translates to “by the head.” It’s commonly used in statistics, economics, and business to report an average per person. For example, a county of 50,000 residents with 5,000 cumulative cases has 5,000/50,000 = 0.1 cases per capita, or 10 cases per 100 residents.
Note: For further clarification of any terms in this project, or for information on any basic topic in data science, please refer to Computational and Inferential Thinking: The Foundations of Data Science, by Ani Adhikari, and John DeNero, with contributions from David Wagner and Henry Milner.
Set-up
Here, we import all of the libraries necessary to complete this project.
In [1]:
from datascience import *
import numpy as np
from math import *
import math as math
import scipy.stats
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import pandas as pd
from IPython.display import *
Next, we will load 3 .csv files with the information needed to build the classifier (and drop unnecessary data), acquired from The New York Times; Politico and The New York Times; and the Census Bureau, respectively. We will then import a .csv file from the EPA to validate population data from our other tables. This validation will be done by randomly selecting fips numbers and seeing whether the data from our two sources on population look plausible; however, we should expect to see a small difference, because the data are from two different years (2015 and 2017).
In [2]:
covid_county = Table().read_table('us-counties.csv').drop(0,6,7,8,9)
covid_county.show(5)
| county | state | fips | cases | deaths |
|---|---|---|---|---|
| Autauga | Alabama | 1001 | 7150 | 111 |
| Baldwin | Alabama | 1003 | 21661 | 311 |
| Barbour | Alabama | 1005 | 2337 | 59 |
| Bibb | Alabama | 1007 | 2665 | 64 |
| Blount | Alabama | 1009 | 6887 | 139 |
… (3242 rows omitted)
In [3]:
elect_county = Table().read_table('county-elections.csv').select(1,7,8).relabel(0,'fips')
elect_county.show(5)
| fips | per_gop | per_dem |
|---|---|---|
| 1001 | 0.714368 | 0.270184 |
| 1003 | 0.761714 | 0.22409 |
| 1005 | 0.534512 | 0.457882 |
| 1007 | 0.784263 | 0.206983 |
| 1009 | 0.895716 | 0.0956938 |
… (3147 rows omitted)
In [4]:
complete = Table().read_table('county_complete.csv').drop(3,4,5,6,7,8,9,10).select(0,1,2,3,'median_age_2019','white_2019','hs_grad_2019','bachelors_2019','median_household_income_2019','poverty_2019','unemployment_rate_2019').drop(1,2)
complete.show(5)
| fips | pop2017 | median_age_2019 | white_2019 | hs_grad_2019 | bachelors_2019 | median_household_income_2019 | poverty_2019 | unemployment_rate_2019 |
|---|---|---|---|---|---|---|---|---|
| 1001 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5 |
| 1003 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4 |
| 1005 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4 |
| 1007 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7 |
| 1009 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1 |
… (3137 rows omitted)
In [5]:
pop_county = Table().read_table('county-population.csv').select(7,5)
pop_county.show(5)
| fips | 2015 POPULATION |
|---|---|
| 1001 | 55,347 |
| 1003 | 203,709 |
| 1005 | 26,489 |
| 1007 | 22,583 |
| 1009 | 57,673 |
We can see below that the two population columns are very similar, so it is fair to say that the data are likely accurate.
In [6]:
pop_county.join('fips', complete.select(0,1)).sample(10).show(10)
| fips | 2015 POPULATION | pop2017 |
|---|---|---|
| 42093 | 18,557 | 18272 |
| 12039 | 46,036 | 46071 |
| 6103 | 63,308 | 63926 |
| 13155 | 9,245 | 9410 |
| 51600 | 24,013 | 24097 |
| 21121 | 31,730 | 31227 |
| 40005 | 13,793 | 13887 |
| 8015 | 18,658 | 19638 |
| 48485 | 131,705 | 132000 |
| 48469 | 92,382 | 92084 |
… (3229 rows omitted)
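Beyond eyeballing a random sample, we could quantify the agreement. Below is a minimal sketch of such a check; it assumes the '2015 POPULATION' values are read in as strings with comma separators, as displayed above.

# Optional, more systematic check (a sketch): compute each county's
# percent difference between the 2015 and 2017 population figures.
joined = pop_county.join('fips', complete.select(0, 1))
pop_2015 = joined.apply(lambda s: int(str(s).replace(',', '')), '2015 POPULATION')
pct_diff = abs(joined.column('pop2017') - pop_2015) / pop_2015 * 100
print('Median percent difference:', round(np.median(pct_diff), 2))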
Now, we’ll clean up the data from these tables and join them into one table, which can be used to build the classifier.
In [7]:
county = covid_county
county = county.where('cases', are.above_or_equal_to(0)).where('deaths', are.above_or_equal_to(0)).where('fips', are.below(60000))
counties_1 = county.join('fips', elect_county)
counties = counties_1.join('fips',complete)
counties.show(5)
| fips | county | state | cases | deaths | per_gop | per_dem | pop2017 | median_age_2019 | white_2019 | hs_grad_2019 | bachelors_2019 | median_household_income_2019 | poverty_2019 | unemployment_rate_2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga | Alabama | 7150 | 111 | 0.714368 | 0.270184 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5 |
| 1003 | Baldwin | Alabama | 21661 | 311 | 0.761714 | 0.22409 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4 |
| 1005 | Barbour | Alabama | 2337 | 59 | 0.534512 | 0.457882 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4 |
| 1007 | Bibb | Alabama | 2665 | 64 | 0.784263 | 0.206983 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7 |
| 1009 | Blount | Alabama | 6887 | 139 | 0.895716 | 0.0956938 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1 |
… (3102 rows omitted)
Next, we will relabel the columns of the counties table.
In [8]:
counties = counties.relabel('pop2017','pop').relabel('median_age_2019', 'median age').relabel('white_2019','white').relabel('hs_grad_2019','hs grad').relabel('bachelors_2019','bachelors')
counties = counties.relabel('median_household_income_2019','median household income').relabel('poverty_2019','poverty').relabel('unemployment_rate_2019','unemployment')
counties.show(5)
| fips | county | state | cases | deaths | per_gop | per_dem | pop | median age | white | hs grad | bachelors | median household income | poverty | unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga | Alabama | 7150 | 111 | 0.714368 | 0.270184 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5 |
| 1003 | Baldwin | Alabama | 21661 | 311 | 0.761714 | 0.22409 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4 |
| 1005 | Barbour | Alabama | 2337 | 59 | 0.534512 | 0.457882 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4 |
| 1007 | Bibb | Alabama | 2665 | 64 | 0.784263 | 0.206983 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7 |
| 1009 | Blount | Alabama | 6887 | 139 | 0.895716 | 0.0956938 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1 |
… (3102 rows omitted)
In this cell, we will convert the per_gop and per_dem proportions into percentages, stored in new trump and biden columns; the fix_votes function does the rounding. We will also convert the raw cases and deaths counts into per-capita percentages and reorder the columns.
In [9]:
def fix_votes(x):
"""Return proportion in rouned percentage form"""
return round(x*100,3)
counties['trump'] = counties.apply(fix_votes,'per_gop')
counties['biden'] = counties.apply(fix_votes,'per_dem')
data_1 = counties.drop('poverty', 'per_gop', 'per_dem')
data = data_1.select(0,1,2,5,3,4,12,13,10,11,6,7,8,9).relabel('unemployment', 'unemployed')
data = data.with_columns('cases1',data.column('cases')/data.column('pop'),'deaths1',data.column('deaths')/data.column('pop')).drop('cases', 'deaths')
data = data.relabel('cases1','cases').relabel( 'deaths1', 'deaths')
c = data.apply(fix_votes, 'cases')
d = data.apply(fix_votes, 'deaths')
data = data.drop(12,13).with_columns('cases', c, 'deaths', d)
data = data.select(0,1,2,3,'cases', 'deaths', 4,5,6,7,8,9,10,11)
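As a quick sanity check on the per-capita conversion, we can recompute Autauga County's case percentage from the raw numbers in the earlier tables:

# Sanity check: Autauga County had 7,150 cases and a 2017 population of
# 55,504, so its per-capita case percentage should be about 12.882,
# matching the cases value shown in the table below.
round(7150 / 55504 * 100, 3)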
The Data
In the data table below, the following columns are represented:
Note: All Alaskan counties have been excluded, as Alaskan election results are not reported by county.
fips: An identification number given to each county by the Federal Government
county: The county name
state: The state that county is in, or the District of Columbia
pop: The population of the county (as of 2017)
cases: The number of cumulative COVID-19 cases in that county per capita, expressed as a percentage of the population (as of June 2nd, 2021)
deaths: The number of cumulative COVID-19 deaths in that county per capita, expressed as a percentage of the population (as of June 2nd, 2021)
trump: The percent of votes that were cast for Donald J. Trump in that county in the 2020 Presidential Election
biden: The percent of votes that were cast for Joseph R. Biden in that county in the 2020 Presidential Election
median household income: The median household income of that county (as of 2019)
unemployed: The unemployment rate in that county (as of 2019)
median age: The median age of that county (as of 2019)
white: The percent of residents of that county that are white (as of 2019)
hs grad: The percent of residents of that county that have graduated high school (as of 2019)
bachelors: The percent of residents of that county that have graduated college with a bachelor’s degree (as of 2019)
In [10]:
data.show(5)
| fips | county | state | pop | cases | deaths | trump | biden | median household income | unemployed | median age | white | hs grad | bachelors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga | Alabama | 55504 | 12.882 | 0.2 | 71.437 | 27.018 | 58731 | 3.5 | 38.2 | 76.8 | 88.5 | 26.6 |
| 1003 | Baldwin | Alabama | 212628 | 10.187 | 0.146 | 76.171 | 22.409 | 58320 | 4 | 43 | 86.2 | 90.8 | 31.9 |
| 1005 | Barbour | Alabama | 25270 | 9.248 | 0.233 | 53.451 | 45.788 | 32525 | 9.4 | 40.4 | 46.8 | 73.2 | 11.6 |
| 1007 | Bibb | Alabama | 22668 | 11.757 | 0.282 | 78.426 | 20.698 | 47542 | 7 | 40.9 | 76.8 | 79.1 | 10.4 |
| 1009 | Blount | Alabama | 58013 | 11.871 | 0.24 | 89.572 | 9.569 | 49358 | 3.1 | 40.7 | 95.5 | 80.5 | 13.1 |
… (3101 rows omitted)
We will now modify the data table to make our job easier. The first step is to identify which candidate won each county. To do this, we define the vote function below and create a votes table containing only the two vote-percentage columns. Then, we save the resulting winner array, along with the fips column for later use.
In [11]:
def vote(x,y):
"""Return plurality winner in Presidential Election"""
if x>y:
return 'Trump'
elif y>x:
return 'Biden'
else:
return 'Tie'
votes = data.select('trump', 'biden')
winner = votes.apply(vote,0,1)
fips = data.column('fips')
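As a quick check, we can apply vote to Autauga County's percentages from the data table above:

# Quick check using Autauga County's vote percentages:
vote(71.437, 27.018)   # returns 'Trump', matching that county's actual winner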
We now create two tables: one with the categorical data and one with the numerical data. We also drop each candidate's vote percentage; since we have the winner array, those columns are no longer needed.
In [12]:
cat = data.select(0,1,2).with_column('winner', winner)
num = data.select(3,4,5,8,9,10,11,12,13)
Now, because the units in each column differ, we want to convert the data into standard units (z-scores) so that every column carries the same weight in our classifier. We will do this by defining the s_u and standardize functions.
In [13]:
def s_u(array):
"""Return array in standard units"""
return (array-np.mean(array))/np.std(array)
def standardize(table):
"""Return a table in standard units"""
t = Table()
for column in np.array(table.labels):
col = s_u(table.column(column))
t = t.with_column(column,col)
return t
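As a quick check of s_u on a toy array: the values 1, 2, and 3 have mean 2 and standard deviation of about 0.816, so we expect roughly -1.22, 0, and 1.22.

# Quick check of s_u on a toy array:
s_u(make_array(1, 2, 3))   # returns approximately array([-1.225, 0., 1.225])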
After standardization, each value represents how many standard deviations it lies above or below the mean of its column. We can now call the standardize function on the num table and add back the fips column so we can join all the data into one table.
In [14]:
stan_num = standardize(num)
stan_num_fips = stan_num.with_column('fips', fips)
full = cat.join('fips', stan_num_fips)
full.show(5)
| fips | county | state | winner | pop | cases | deaths | median household income | unemployed | median age | white | hs grad | bachelors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Autauga | Alabama | Trump | -0.141733 | 0.909967 | -0.0287858 | 0.383516 | -0.550973 | -0.608552 | -0.399307 | 0.251456 | 0.487353 |
| 1003 | Baldwin | Alabama | Trump | 0.338787 | -0.0038243 | -0.510629 | 0.354379 | -0.354464 | 0.283253 | 0.176396 | 0.618838 | 1.04212 |
| 1005 | Barbour | Alabama | Trump | -0.234195 | -0.32221 | 0.265674 | -1.47431 | 1.76784 | -0.199808 | -2.23666 | -2.19243 | -1.08275 |
| 1007 | Bibb | Alabama | Trump | -0.242152 | 0.528514 | 0.702903 | -0.409707 | 0.824593 | -0.106912 | -0.399307 | -1.25002 | -1.20836 |
| 1009 | Blount | Alabama | Trump | -0.134059 | 0.567168 | 0.328135 | -0.280965 | -0.708181 | -0.14407 | 0.745974 | -1.02639 | -0.925743 |
… (3101 rows omitted)
As is standard practice in machine learning, we will now randomly split the data into two sets: training, which will be used to train our algorithm, and test, which will be used to evaluate the accuracy of our classifier.
In [15]:
shuf_full = full.sample(with_replacement = False)
training = shuf_full.take(np.arange(1606))
test = shuf_full.take(np.arange(1606,3106))
We’ll now only work with the training table until we have built our classifier. But first, let’s define some functions we will need.
distance: finds the Euclidean distance between two points, represented as arrays
row_distance: uses distance to find the distance between two row objects
distances: returns a copy of training with one extra column containing each row's distance to the given example
closest: returns a table of only the k closest rows
majority_class: finds the most common class among the rows of a given table
classify_1: calls majority_class on the table returned by closest to produce our classifier's prediction
In [16]:
def distance(pt1, pt2):
"""Return the distance between two points, represented as arrays"""
return np.sqrt(sum((pt1 - pt2)**2))
def row_distance(row1, row2):
"""Return the distance between two numerical rows of a table"""
return distance(np.array(row1), np.array(row2))
def distances(training, example):
"""Return training table with distances column"""
distances = make_array()
attributes_only = training.drop('winner',0,1,2)
for row in attributes_only.rows:
distances = np.append(distances, row_distance(row, example))
return training.with_column('Distance', distances)
def closest(training, example, k):
"""
Return a table of the k closest neighbors to example
"""
return distances(training, example).sort('Distance').take(np.arange(k))
def majority_class(topk):
"""Return the class with the highest count"""
return topk.group('winner').sort('count', descending=True).column(0).item(0)
def classify_1(training, example, k):
"""Return the majority class among the k nearest neighbors of example"""
return majority_class(closest(training, example, k))
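Before measuring accuracy over the whole test set, we can sanity-check the pipeline on a single row (a sketch; the choice of k = 5 here is arbitrary):

# Sanity check (a sketch): classify one row of the training table with an
# arbitrary k of 5. The drop mirrors the one inside distances, removing
# the label and the fips/county/state identifier columns.
example = training.drop('winner', 0, 1, 2).row(0)
classify_1(training, example, 5)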
Here, we define one more function, accuracy, which evaluates the classifier on the test set by comparing its predictions to the actual winners and returning the percentage of examples that classify_1 got correct.
In [17]:
def accuracy(training, test, k):
"""Return the proportion of correctly classified examples
in the test set"""
test_attributes = test.drop('winner',0,1,2)
num_correct = 0
for i in np.arange(test.num_rows):
c = classify_1(training, test_attributes.row(i), k)
num_correct = num_correct + (c == test.column('winner').item(i))
return (num_correct / test.num_rows)*100
With some trial and error, we can find that the optimal number of neighbors is k=13; this choice leads to an accuracy of 92.6% on the test set.
In [18]:
accuracy(training, test, 13)
Out [18]:
92.6
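For reference, the trial-and-error search might look like the sketch below (it is slow, since every call to accuracy re-classifies the entire test set):

# A sketch of the search over k; odd values avoid ties between the two classes.
for k in np.arange(1, 26, 2):
    print('k =', k, 'accuracy =', accuracy(training, test, k))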
Thus, we will define the final k-NN function, classify, which always makes its prediction using k=13 and the full table.
In [19]:
def classify(example):
return classify_1(full, example, 13)
Finally, we can create a new function predict which, given a state and county name, will make a prediction using k=13 and return the actual result for comparison. This will use the full table.
In [20]:
def predict(state, county):
ex = full.where('state', state).where('county', county)
example = ex.drop(0,1,2,3).row(0)
result = classify(example)
print(f'My algorithm predicts that {result} won the 2020 Presidential election in {county}, {state}')
return full.where('state',state).where('county',county).select(0,2,1,3)
Below we show a few examples of the predict function in action.
In [21]:
predict('California','Yuba')
Out [21]:
My algorithm predicts that Trump won the 2020 Presidential election in Yuba, California
| fips | state | county | winner |
|---|---|---|---|
| 6115 | California | Yuba | Trump |
In [22]:
predict('Texas','Frio')
Out [22]:
My algorithm predicts that Trump won the 2020 Presidential election in Frio, Texas
| fips | state | county | winner |
|---|---|---|---|
| 48163 | Texas | Frio | Trump |
In [23]:
predict('New York','Nassau')
Out [23]:
My algorithm predicts that Biden won the 2020 Presidential election in Nassau, New York
| fips | state | county | winner |
|---|---|---|---|
| 36059 | New York | Nassau | Biden |
In [24]:
predict('Nebraska','Brown')
Out [24]:
My algorithm predicts that Trump won the 2020 Presidential election in Brown, Nebraska
| fips | state | county | winner |
|---|---|---|---|
| 31017 | Nebraska | Brown | Trump |
In [25]:
predict('Illinois','Lee')
Out [25]:
My algorithm predicts that Trump won the 2020 Presidential election in Lee, Illinois
| fips | state | county | winner |
|---|---|---|---|
| 17103 | Illinois | Lee | Trump |
In [26]:
predict('Mississippi','Bolivar')
Out [26]:
My algorithm predicts that Biden won the 2020 Presidential election in Bolivar, Mississippi
| fips | state | county | winner |
|---|---|---|---|
| 28011 | Mississippi | Bolivar | Biden |
Conclusion and Discussion
As the 92.6% accuracy of the classify function shows, given only basic demographic information about a county and its per-capita COVID-19 case and death counts, we can conclude that the identity of a county's residents and its reaction to the COVID-19 pandemic likely correlate with its political affiliation.
It is possible that the algorithm's ability to predict the winner of each county was due to the demographics alone, or to the COVID-19 data alone. A further project could build a separate classifier for each type of data and compare their accuracies to identify which set of features drives the decision boundary. If both have relatively low accuracy on their own, then the association involves both; if only one is high, we can infer that the classifier in this project succeeded largely because of that set of variables.
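A minimal sketch of that experiment, reusing the training/test split and the accuracy function from above (the column selections are assumptions based on the tables built earlier):

# Sketch: compare a COVID-only classifier with a demographics-only one.
# Each table keeps the identifier and winner columns that accuracy expects.
covid_train = training.select('fips', 'county', 'state', 'winner', 'cases', 'deaths')
covid_test = test.select('fips', 'county', 'state', 'winner', 'cases', 'deaths')
demo_train = training.drop('cases', 'deaths')
demo_test = test.drop('cases', 'deaths')
print('COVID-only accuracy:', accuracy(covid_train, covid_test, 13))
print('Demographics-only accuracy:', accuracy(demo_train, demo_test, 13))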
Another possible extension of this algorithm is to evaluate the classifier's accuracy on other elections: past Presidential elections, or even future ones, such as the 2024 Presidential Election or the 2022 midterm elections.
While this project is limited in scope to these two sets of variables and the 2020 Presidential Election, there are endless variations that could give even deeper insight into the nature of American politics in the context of the two-party system.
If you have any questions about this project or require any clarification on concepts/topics, or need a walkthrough of any of my code, do not hesitate to reach out to me at jonathanferrari@berkeley.com. Cheers!