COVID-19, Demographics, and Political Affiliations

A Machine Learning Project focused on the 2020 Presidential Election

by Jonathan Ferrari

Introduction

During the pandemic, it has become glaringly obvious that COVID-19 has affected us all, for better or worse. However, Americans hold vastly different opinions on the issue; COVID-19 has become a political matter. So, I decided to dig deeper into this relationship, using techniques from data science to build a machine learning algorithm that makes predictions based upon this information. An algorithm with high accuracy will signal a likely relationship between a county’s demographics and COVID-19 statistics on the one hand, and its political affiliation on the other.

Abstract

In this notebook, I will create a k-NN (k-nearest neighbors) classifier whose features are COVID-19 information about a certain county and demographic information about that county. The classifier will return a prediction for who won the 2020 Presidential election in that county. The techniques I will use in this project include, but are not limited to: markdown, importing libraries and .csv files, table manipulation, data cleaning and filtering, defining functions, statistical distribution analysis, for loops, standardization, and basic machine learning. Embedding the notebook on this site and formatting its style were done with raw HTML and CSS.

Note: I will use first-person-plural pronouns, such as “we” and “our”. In doing so, I refer only to myself and the reader.

Definitions:

Classifiers: A classifier in machine learning is an algorithm that automatically orders or categorizes data into one or more of a set of “classes.” One of the most common examples is an email classifier that scans emails to filter them by class label: Spam or Not Spam.
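
A toy sketch of that spam example (the trigger phrases here are made up for illustration; a real classifier learns its rules from data rather than hard-coding them):

def is_spam(email_text):
    """Label an email 'Spam' if it contains any hard-coded trigger phrase."""
    triggers = ['free money', 'act now', 'winner']
    if any(t in email_text.lower() for t in triggers):
        return 'Spam'
    return 'Not Spam'

is_spam('You are a WINNER! Claim your free money now.')  # returns 'Spam'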

Per Capita: Per capita means per person. It is a Latin term that translates to “by the head.” It’s commonly used in statistics, economics, and business to report an average per person. It tells you how a country, state, or city affects its residents.
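
For instance, a per-capita figure is just a total divided by a population. Using Autauga County’s numbers from the tables later in this project:

cases = 7150        # cumulative COVID-19 cases in Autauga County
population = 55504  # Autauga County's 2017 population
round(cases / population * 100, 3)  # 12.882 cases per 100 residents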

Note: For further clarification of any terms in this project, or for information on any basic topic in data science, please refer to Computational and Inferential Thinking: The Foundations of Data Science, by Ani Adhikari and John DeNero, with contributions from David Wagner and Henry Milner.

Set-up

Here, we import all of the libraries necessary to complete this project.

In [1]:

from datascience import *
import numpy as np
from math import *
import math
import scipy.stats
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import pandas as pd
from IPython.display import *

Next, we will load 3 .csv files with the information needed to build the classifier (and drop unnecessary data), acquired from The New York Times, Politico and The New York Times, and the Census Bureau, respectively. We will then import a .csv file from the EPA to validate the population data in our other tables. This validation will be done by randomly selecting fips numbers and checking that the data from our two sources on population look plausible; however, we expect to see small differences, because the data come from two different years (2015 and 2017).

In [2]:

covid_county = Table().read_table('us-counties.csv').drop(0,6,7,8,9)
covid_county.show(5)
county | state | fips | cases | deaths
Autauga | Alabama | 1001 | 7150 | 111
Baldwin | Alabama | 1003 | 21661 | 311
Barbour | Alabama | 1005 | 2337 | 59
Bibb | Alabama | 1007 | 2665 | 64
Blount | Alabama | 1009 | 6887 | 139

… (3242 rows omitted)

In [3]:

elect_county = Table().read_table('county-elections.csv').select(1,7,8).relabel(0,'fips')
elect_county.show(5)
fips | per_gop | per_dem
1001 | 0.714368 | 0.270184
1003 | 0.761714 | 0.22409
1005 | 0.534512 | 0.457882
1007 | 0.784263 | 0.206983
1009 | 0.895716 | 0.0956938

… (3147 rows omitted)

In [4]:

complete = Table().read_table('county_complete.csv').drop(3,4,5,6,7,8,9,10).select(0,1,2,3,'median_age_2019', 'white_2019','hs_grad_2019','bachelors_2019','median_household_income_2019','poverty_2019','unemployment_rate_2019','white_2019').drop(1,2)
complete.show(5)
fips | pop2017 | median_age_2019 | white_2019 | hs_grad_2019 | bachelors_2019 | median_household_income_2019 | poverty_2019 | unemployment_rate_2019
1001 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5
1003 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4
1005 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4
1007 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7
1009 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1

… (3137 rows omitted)

In [5]:

pop_county = Table().read_table('county-population.csv').select(7,5)
pop_county.show(5)
fips | 2015 POPULATION
1001 | 55,347
1003 | 203,709
1005 | 26,489
1007 | 22,583
1009 | 57,673

We can see below that the two sources’ population figures are very similar, so it is fair to say that the data are likely accurate.

In [6]:

pop_county.join('fips', complete.select(0,1)).sample(10).show(10)
fips | 2015 POPULATION | pop2017
42093 | 18,557 | 18272
12039 | 46,036 | 46071
6103 | 63,308 | 63926
13155 | 9,245 | 9410
51600 | 24,013 | 24097
21121 | 31,730 | 31227
40005 | 13,793 | 13887
8015 | 18,658 | 19638
48485 | 131,705 | 132000
48469 | 92,382 | 92084

… (3229 rows omitted)
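
To make “very similar” concrete, here is a minimal sketch (not part of the original pipeline) of a quantitative check: the mean relative difference between the two population figures. It assumes the 2015 column was parsed as text, since it is displayed with commas.

joined = pop_county.join('fips', complete.select(0, 1))
# Strip commas from the 2015 figures (e.g. "55,347") before converting to integers.
pop_2015 = joined.apply(lambda s: int(str(s).replace(',', '')), '2015 POPULATION')
pop_2017 = joined.column('pop2017')
# Mean relative difference between the two sources; a small value (a few
# percent) is consistent with two snapshots taken two years apart.
np.mean(abs(pop_2017 - pop_2015) / pop_2015)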

Now, we’ll clean up the data from these tables and join them into one table, which can be used to build the classifier.

In [7]:

county = covid_county
county = county.where('cases', are.above_or_equal_to(0)).where('deaths', are.above_or_equal_to(0)).where('fips', are.below(60000))
counties_1 = county.join('fips', elect_county)
counties = counties_1.join('fips',complete)
counties.show(5)
fips | county | state | cases | deaths | per_gop | per_dem | pop2017 | median_age_2019 | white_2019 | hs_grad_2019 | bachelors_2019 | median_household_income_2019 | poverty_2019 | unemployment_rate_2019
1001 | Autauga | Alabama | 7150 | 111 | 0.714368 | 0.270184 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5
1003 | Baldwin | Alabama | 21661 | 311 | 0.761714 | 0.22409 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4
1005 | Barbour | Alabama | 2337 | 59 | 0.534512 | 0.457882 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4
1007 | Bibb | Alabama | 2665 | 64 | 0.784263 | 0.206983 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7
1009 | Blount | Alabama | 6887 | 139 | 0.895716 | 0.0956938 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1

… (3102 rows omitted)

Next, we will relabel the columns of the counties table.

In [8]:

counties = counties.drop('population')
counties = counties.relabel('pop2017','pop').relabel('median_age_2019', 'median age').relabel('white_2019','white').relabel('hs_grad_2019','hs grad').relabel('bachelors_2019','bachelors')
counties = counties.relabel('median_household_income_2019','median household income').relabel('poverty_2019','poverty').relabel('unemployment_rate_2019','unemployment')
counties.show(5)
fips | county | state | cases | deaths | per_gop | per_dem | pop | median age | white | hs grad | bachelors | median household income | poverty | unemployment
1001 | Autauga | Alabama | 7150 | 111 | 0.714368 | 0.270184 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5
1003 | Baldwin | Alabama | 21661 | 311 | 0.761714 | 0.22409 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4
1005 | Barbour | Alabama | 2337 | 59 | 0.534512 | 0.457882 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4
1007 | Bibb | Alabama | 2665 | 64 | 0.784263 | 0.206983 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7
1009 | Blount | Alabama | 6887 | 139 | 0.895716 | 0.0956938 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1

… (3102 rows omitted)

In this cell, we convert the per_gop and per_dem vote shares from proportions into percentages, stored in new trump and biden columns. The fix_votes function does this.

In [9]:

def fix_votes(x):
    """Return proportion in rouned percentage form"""
    return round(x*100,3)
counties['trump'] = counties.apply(fix_votes,'per_gop')
counties['biden'] = counties.apply(fix_votes,'per_dem')
data_1 = counties.drop('poverty', 'per_gop', 'per_dem')
data = data_1.select(0,1,2,5,3,4,12,13,10,11,6,7,8,9).relabel('unemployment', 'unemployed')
data = data.with_columns('cases1',data.column('cases')/data.column('pop'),'deaths1',data.column('deaths')/data.column('pop')).drop('cases', 'deaths')
data = data.relabel('cases1','cases').relabel( 'deaths1', 'deaths')
c = data.apply(fix_votes, 'cases')
d = data.apply(fix_votes, 'deaths')
data = data.drop(12,13).with_columns('cases', c, 'deaths', d)
data = data.select(0,1,2,3,'cases', 'deaths', 4,5,6,7,8,9,10,11)

The Data

In the data table below, the following columns are represented:

Note: All Alaskan counties have been excluded, as Alaskan election information is not collected by county

fips: An identification number given to each county by the Federal Government

county: The county name

state: The state that county is in, or the District of Columbia

pop: The population of the county (as of 2017)

cases: The number of cumulative COVID-19 cases in that county per capita, expressed per 100 residents (as of June 2nd, 2021)

deaths: The number of cumulative COVID-19 deaths in that county per capita, expressed per 100 residents (as of June 2nd, 2021)

trump: The percent of votes that were cast for Donald J. Trump in that county in the 2020 Presidential Election

biden: The percent of votes that were cast for Joseph R. Biden in that county in the 2020 Presidential Election

median household income: The median household income of that county (as of 2019)

unemployed: The unemployment rate in that county (as of 2019)

median age: The median age of that county (as of 2019)

white: The percent of residents of that county that are white (as of 2019)

hs grad: The percent of residents of that county that have graduated high school (as of 2019)

bachelors: The percent of residents of that county that have graduated college with a bachelor’s degree (as of 2019)

In [10]:

data.show(5)
fips | county | state | pop | cases | deaths | trump | biden | median household income | unemployed | median age | white | hs grad | bachelors
1001 | Autauga | Alabama | 55504 | 12.882 | 0.2 | 71.437 | 27.018 | 58731 | 3.5 | 38.2 | 76.8 | 88.5 | 26.6
1003 | Baldwin | Alabama | 212628 | 10.187 | 0.146 | 76.171 | 22.409 | 58320 | 4 | 43 | 86.2 | 90.8 | 31.9
1005 | Barbour | Alabama | 25270 | 9.248 | 0.233 | 53.451 | 45.788 | 32525 | 9.4 | 40.4 | 46.8 | 73.2 | 11.6
1007 | Bibb | Alabama | 22668 | 11.757 | 0.282 | 78.426 | 20.698 | 47542 | 7 | 40.9 | 76.8 | 79.1 | 10.4
1009 | Blount | Alabama | 58013 | 11.871 | 0.24 | 89.572 | 9.569 | 49358 | 3.1 | 40.7 | 95.5 | 80.5 | 13.1

… (3101 rows omitted)

We will now modify the data table to make our job easier. The first step is to identify which candidate won each county. To do this, we will define the vote function below and create a votes table containing only the voting percentages. Then, we will save the resulting winner array.

In [11]:

def vote(x,y):
    """Return plurality winner in Presidential Election"""
    if x>y:
        return 'Trump'
    elif y>x:
        return 'Biden'
    else:
        return 'Tie'
votes = data.select('trump', 'biden')
winner = votes.apply(vote,0,1)
fips = data.column('fips')

We now create 2 tables, one with the categorical data and one with the numerical data. We also drop each candidate’s vote percentage; since we have the winner array, those columns are no longer needed.

In [12]:

cat = data.select(0,1,2).with_column('winner', winner)
num = data.select(3,4,5,8,9,10,11,12,13)

Now, because the units in each column are different, we want to convert the data into standard units (z-scores) so that every column has the same weight in our classifier. We will do this by defining the s_u and standardize functions.

In [13]:

def s_u(array):
    """Return array in standard units"""
    return (array-np.mean(array))/np.std(array)
def standardize(table):
    """Return a table in standard units"""
    t = Table()
    for column in np.array(table.labels):
        col = s_u(table.column(column))
        t = t.with_column(column,col)
    return t

After standardization, each value represents how many standard deviations it lies above or below the mean of its column. We can now call the standardize function on the num table and add back the fips column so we can join all the data into one table.

In [14]:

stan_num = standardize(num)
stan_num_fips = stan_num.with_column('fips', fips)
full = cat.join('fips', stan_num_fips)
full.show(5)
fips | county | state | winner | pop | cases | deaths | median household income | unemployed | median age | white | hs grad | bachelors
1001 | Autauga | Alabama | Trump | -0.141733 | 0.909967 | -0.0287858 | 0.383516 | -0.550973 | -0.608552 | -0.399307 | 0.251456 | 0.487353
1003 | Baldwin | Alabama | Trump | 0.338787 | -0.0038243 | -0.510629 | 0.354379 | -0.354464 | 0.283253 | 0.176396 | 0.618838 | 1.04212
1005 | Barbour | Alabama | Trump | -0.234195 | -0.32221 | 0.265674 | -1.47431 | 1.76784 | -0.199808 | -2.23666 | -2.19243 | -1.08275
1007 | Bibb | Alabama | Trump | -0.242152 | 0.528514 | 0.702903 | -0.409707 | 0.824593 | -0.106912 | -0.399307 | -1.25002 | -1.20836
1009 | Blount | Alabama | Trump | -0.134059 | 0.567168 | 0.328135 | -0.280965 | -0.708181 | -0.14407 | 0.745974 | -1.02639 | -0.925743

… (3101 rows omitted)

As is standard practice in machine learning, we will now randomly split the data into 2 sets: training, which will be used to train our algorithm, and test, which will be used to evaluate the accuracy of our classifier.

In [15]:

shuf_full = full.sample(with_replacement = False)
training = shuf_full.take(np.arange(1606))
test = shuf_full.take(np.arange(1606,3106))

We’ll now only work with the training table until we have built our classifier. But first, let’s define some functions we will need.

distance: finds euclidean distance between two points

row_distance: uses previous function to find distance between two row objects

distances: returns a copy of the training table with one more column, which contains the distance from each row to the given example

closest: creates a new table with only the k closest rows

majority_class: finds the class of the majority of the rows in a given table

classify_1: calls majority_class on the table from closest to return our classifier’s prediction

In [16]:

def distance(pt1, pt2):
    """Return the distance between two points, represented as arrays"""
    return np.sqrt(sum((pt1 - pt2)**2))

def row_distance(row1, row2):
    """Return the distance between two numerical rows of a table"""
    return distance(np.array(row1), np.array(row2))

def distances(training, example):
    """Return training table with distances column"""
    distances = make_array()
    attributes_only = training.drop('winner',0,1,2)    
    for row in attributes_only.rows:
        distances = np.append(distances, row_distance(row, example))
    return training.with_column('Distance', distances)

def closest(training, example, k):
    """
    Return a table of the k closest neighbors to example
    """
    return distances(training, example).sort('Distance').take(np.arange(k))

def majority_class(topk):
    """Return the class with the highest count"""
    return topk.group('winner').sort('count', descending=True).column(0).item(0)

def classify_1(training, example, k):
    """Return the majority class among the k nearest neighbors of example"""
    return majority_class(closest(training, example, k))

Here, we define one more function, accuracy, which evaluates the classifier on the test set by comparing its predictions to the actual winners and returning the percentage of examples classify_1 got correct.

In [17]:

def accuracy(training, test, k):
    """Return the proportion of correctly classified examples 
    in the test set"""
    test_attributes = test.drop('winner',0,1,2)
    num_correct = 0
    for i in np.arange(test.num_rows):
        c = classify_1(training, test_attributes.row(i), k)
        num_correct = num_correct + (c == test.column('winner').item(i))
    return (num_correct / test.num_rows)*100

With some trial and error, we can find that the optimal number of neighbors to base our prediction on is k = 13; this k leads to an accuracy of 92.6% on the test set.
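
A minimal sketch of that trial-and-error search, reusing the accuracy function above (the exact range of k values tried here is an assumption; odd values avoid ties between the two classes):

best_k, best_acc = 0, 0
for k in np.arange(1, 26, 2):
    # Evaluate the classifier on the test set for this choice of k.
    acc = accuracy(training, test, k)
    if acc > best_acc:
        best_k, best_acc = k, acc
best_k, best_acc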

In [18]:

accuracy(training, test, 13)

Out [18]:

92.6

Thus, we will define the final k-NN function classify, which makes its prediction with k = 13 built in, using the full table.

In [19]:

def classify(example):
    """Return the predicted winner for example, using k=13 and the full table."""
    return classify_1(full, example, 13)

Finally, we can create a new function predict which, given a state and county name, will make a prediction using k=13 and return the actual result for comparison. This will use the full table.

In [20]:

def predict(state, county):
    """Print the predicted winner for a county and return its actual result."""
    ex = full.where('state', state).where('county', county)
    example = ex.drop(0, 1, 2, 3).row(0)
    result = classify(example)
    print(f'My algorithm predicts that {result} won the 2020 Presidential election in {county}, {state}')
    return full.where('state', state).where('county', county).select(0, 2, 1, 3)

Below we show a few examples of the predict function in action.

In [21]:

predict('California','Yuba')

Out [21]:

My algorithm predicts that Trump won the 2020 Presidential election in Yuba, California

fips | state | county | winner
6115 | California | Yuba | Trump

In [22]:

predict('Texas','Frio')

Out [22]:

My algorithm predicts that Trump won the 2020 Presidential election in Frio, Texas

fips | state | county | winner
48163 | Texas | Frio | Trump

In [23]:

predict('New York','Nassau')

Out [23]:

My algorithm predicts that Biden won the 2020 Presidential election in Nassau, New York

fips | state | county | winner
36059 | New York | Nassau | Biden

In [24]:

predict('Nebraska','Brown')

Out [24]:

My algorithm predicts that Trump won the 2020 Presidential election in Brown, Nebraska

fips | state | county | winner
31017 | Nebraska | Brown | Trump

In [25]:

predict('Illinois','Lee')

Out [25]:

My algorithm predicts that Trump won the 2020 Presidential election in Lee, Illinois

fips | state | county | winner
17103 | Illinois | Lee | Trump

In [26]:

predict('Mississippi','Bolivar')

Out [26]:

My algorithm predicts that Biden won the 2020 Presidential election in Bolivar, Mississippi

fips | state | county | winner
28011 | Mississippi | Bolivar | Biden

Conclusion and Discussion

As we can see from the 92.6% accuracy of the classify function, which uses only basic demographic information about a county and its cumulative COVID-19 cases and deaths per capita, it is likely that the identity of a county’s residents and its reaction to the COVID-19 pandemic correlate with its political affiliation.

It is possible that the algorithm’s ability to predict the winner of each county was due only to the demographics, or only to the COVID-19 data. A further project could create a separate classifier for each type of data and evaluate the accuracy of both, to identify which set of features drives the decision boundary; a sketch of how that might look follows below. If both have relatively low accuracy, then the association involves both; if only one is high, we can infer that the classifier in this project was successful largely because of that set of variables.
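
A sketch of that follow-up, reusing the accuracy function from above (the helper and the column groupings below are hypothetical, not part of this project):

def accuracy_with_features(feature_labels, k=13):
    """Return test-set accuracy using only the given feature columns."""
    # Keep the identifier columns and the class label, which accuracy's
    # internals expect to find at these positions, plus the chosen features.
    keep = ['fips', 'county', 'state', 'winner'] + list(feature_labels)
    return accuracy(training.select(keep), test.select(keep), k)

covid_only = ['cases', 'deaths']
demographics_only = ['pop', 'median household income', 'unemployed',
                     'median age', 'white', 'hs grad', 'bachelors']
accuracy_with_features(covid_only), accuracy_with_features(demographics_only)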

Another further application of this algorithm is evaluating its accuracy on other elections: past Presidential Elections, or even future contests, such as the 2024 Presidential Election or the 2022 Midterm Elections.

While this project is limited in scope to these two sets of variables and the 2020 Presidential Election, there are endless variations that could give even deeper insight into the nature of American politics in the context of the two-party system.

If you have any questions about this project, need clarification on any concepts or topics, or would like a walkthrough of any of my code, do not hesitate to reach out to me at jonathanferrari@berkeley.com. Cheers!
