OPTIONAL: U.S. Census data (Python)
This section is optional. It provides an example of how to acquire potentially interesting predictors of COVID-19 cases from the U.S. Census Bureau.
The COVID-19 dataset we accessed above provides daily COVID-19 case counts for each U.S. state, together with population counts from the 2010 Decennial Census. This should be enough information to produce some interesting visualizations. For modeling, however, we really only have one useful predictor in the dataset: time. This section describes some options for acquiring other potentially interesting predictors of COVID-19 cases.
U.S. Census Bureau API
We may want to use additional demographic information in our visualizations and analysis of the COVID-19 cases. An obvious place to source this information is the U.S. Census Bureau. There are three U.S. Census Bureau data sources, each with its own API:
- Decennial Census: survey of every household in the U.S. every 10 years — used to calculate population of U.S. geographic areas.
- American Community Survey: yearly representative sample of 3.5 million households — used to calculate population estimates of U.S. geographic areas.
- Population Estimates: yearly population estimates of U.S. geographic areas.
The COVID-19 data from Dataverse already contains population values from the 2010 decennial census. But, using the Census Bureau’s Population Estimates API, we can get updated population data for 2019 as well as population data stratified by age groups, race, and sex.
We’re going to use the requests package as an interface to the Census Bureau API.
The first step is to sign up for an API key: http://api.census.gov/data/key_signup.html. Then assign the key to a variable, which we refer to below as API_key.
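One way to populate `API_key` without hard-coding it in the script (a convention we're assuming here, not something the Census API requires) is to read it from an environment variable:

```python
import os

# read the key from an environment variable; fall back to a placeholder
# so the script still runs (replace this with your own key management)
API_key = os.environ.get("CENSUS_API_KEY", "DEMO_KEY")
```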
Next, we can use the requests.get() function to access the Population Estimates API and extract variables of interest into a pandas data frame:
import requests
import pandas as pd

# access the Population Estimates API and extract variables of interest
# provides overall population estimates and population densities
pop_url = f'https://api.census.gov/data/2019/pep/population?get=NAME,POP,DENSITY&for=state:*&key={API_key}'
response = requests.get(pop_url)
pop_data = response.json()
pop_df = pd.DataFrame(pop_data[1:], columns=pop_data[0]).rename(columns={'NAME':'state', 'state':'GEOID'})
print(pop_df.head(20))
## state POP DENSITY GEOID
## 0 Mississippi 2976149 63.42279731700000 28
## 1 Missouri 6137428 89.27757816400000 29
## 2 Montana 1068778 7.34314521650000 30
## 3 Nebraska 1934408 25.18219543000000 31
## 4 Nevada 3080156 28.03700636500000 32
## 5 New Hampshire 1359711 151.86532690000000 33
## 6 New Jersey 8882190 1207.67221710000000 34
## 7 New Mexico 2096829 17.28456209900000 35
## 8 New York 19453561 412.81842864000000 36
## 9 North Carolina 10488084 215.71420423000000 37
## 10 North Dakota 762062 11.04521346200000 38
## 11 Ohio 11689100 286.08552119000000 39
## 12 Oklahoma 3956971 57.68520257700000 40
## 13 Oregon 4217737 43.94025129200000 41
## 14 Pennsylvania 12801989 286.13105806000000 42
## 15 Rhode Island 1059361 1024.62672880000000 44
## 16 South Carolina 5148714 171.26007636000000 45
## 17 South Dakota 884659 11.66947080600000 46
## 18 Tennessee 6829174 165.60404231000000 47
## 19 Texas 28995881 110.98346670000000 48
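For reference, the Census API returns JSON as a list of rows whose first element is the header row, which is why the code above builds the data frame from `pop_data[1:]` with `columns=pop_data[0]`. A minimal sketch with a hard-coded sample payload (values copied from the output above); note that every value, including `POP`, arrives as a string:

```python
import pandas as pd

# a hard-coded two-row sample shaped like the API's JSON payload
sample_json = [
    ["NAME", "POP", "DENSITY", "state"],
    ["Mississippi", "2976149", "63.42279731700000", "28"],
    ["Missouri", "6137428", "89.27757816400000", "29"],
]

# first row -> column names, remaining rows -> data
sample_df = pd.DataFrame(sample_json[1:], columns=sample_json[0]).rename(
    columns={"NAME": "state", "state": "GEOID"}
)
print(sample_df)
```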
Get population estimates by age group:
age_url = f'https://api.census.gov/data/2019/pep/charagegroups?get=NAME,AGEGROUP,POP&for=state:*&key={API_key}'
response = requests.get(age_url)
age_data = response.json()
age_df = pd.DataFrame(age_data[1:], columns=age_data[0]).rename(columns={'NAME':'state', 'state':'GEOID'})
print(age_df.head(20))
## state AGEGROUP POP GEOID
## 0 Mississippi 0 2976149 28
## 1 Mississippi 1 183478 28
## 2 Mississippi 2 189377 28
## 3 Missouri 31 38.9 29
## 4 Mississippi 3 206282 28
## 5 Mississippi 4 201350 28
## 6 Mississippi 5 201517 28
## 7 Mississippi 6 206989 28
## 8 Mississippi 7 186758 28
## 9 Mississippi 8 189068 28
## 10 Mississippi 9 176686 28
## 11 Mississippi 10 181833 28
## 12 Mississippi 11 180207 28
## 13 Mississippi 12 197314 28
## 14 Mississippi 13 188597 28
## 15 Mississippi 14 160611 28
## 16 Mississippi 15 128344 28
## 17 Mississippi 16 87629 28
## 18 Mississippi 17 57811 28
## 19 Mississippi 18 52298 28
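Note that `AGEGROUP` holds category codes rather than ages: code 0 is the total population and code 26 is "65 years and over", which is exactly what the cleaning step below relies on. A small illustration with hypothetical counts:

```python
import pandas as pd

# illustrative rows for one state: AGEGROUP 0 is the total population,
# AGEGROUP 26 is "65 years and over"; the counts are hypothetical
age_sample = pd.DataFrame({
    "state": ["Demoland", "Demoland"],
    "AGEGROUP": [0, 26],
    "POP": [1_000_000, 170_000],
})

total_pop = age_sample.loc[age_sample["AGEGROUP"] == 0, "POP"].iloc[0]
pop_65over = age_sample.loc[age_sample["AGEGROUP"] == 26, "POP"].iloc[0]
percent_age65over = pop_65over / total_pop * 100
print(percent_age65over)  # 17.0
```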
Get population estimates by sex:
sex_url = f'https://api.census.gov/data/2019/pep/charagegroups?get=NAME,SEX,POP&for=state:*&key={API_key}'
response = requests.get(sex_url)
sex_data = response.json()
sex_df = pd.DataFrame(sex_data[1:], columns=sex_data[0]).rename(columns={'NAME':'state', 'state':'GEOID'})
print(sex_df.head(20))
## state SEX POP GEOID
## 0 Mississippi 0 2976149 28
## 1 Mississippi 1 1442292 28
## 2 Mississippi 2 1533857 28
## 3 Missouri 0 6137428 29
## 4 Missouri 1 3012662 29
## 5 Missouri 2 3124766 29
## 6 Montana 0 1068778 30
## 7 Montana 1 538066 30
## 8 Montana 2 530712 30
## 9 Nebraska 0 1934408 31
## 10 Nebraska 1 966449 31
## 11 Nevada 0 3080156 32
## 12 Nevada 1 1544827 32
## 13 New Hampshire 0 1359711 33
## 14 New Hampshire 1 673793 33
## 15 New Jersey 0 8882190 34
## 16 New Jersey 1 4340587 34
## 17 New Mexico 0 2096829 35
## 18 New Mexico 1 1037432 35
## 19 New York 0 19453561 36
Get population estimates by race:
race_url = f'https://api.census.gov/data/2019/pep/charagegroups?get=NAME,RACE,POP&for=state:*&key={API_key}'
response = requests.get(race_url)
race_data = response.json()
race_df = pd.DataFrame(race_data[1:], columns=race_data[0]).rename(columns={'NAME':'state', 'state':'GEOID'})
print(race_df.head(20))
## state RACE POP GEOID
## 0 Mississippi 0 2976149 28
## 1 Mississippi 1 1758081 28
## 2 Mississippi 2 1124559 28
## 3 Mississippi 3 18705 28
## 4 Mississippi 4 33032 28
## 5 Mississippi 5 1806 28
## 6 Mississippi 6 39966 28
## 7 Mississippi 7 1792535 28
## 8 Mississippi 8 1148593 28
## 9 Mississippi 9 31669 28
## 10 Mississippi 10 42550 28
## 11 Mississippi 11 3821 28
## 12 Missouri 0 6137428 29
## 13 Missouri 1 5086188 29
## 14 Missouri 2 725579 29
## 15 Missouri 3 35839 29
## 16 Missouri 4 133312 29
## 17 Missouri 5 9857 29
## 18 Missouri 6 146653 29
## 19 Missouri 7 5222409 29
Clean U.S. Census data
The Census data we extracted contain population estimates for multiple categories of age, race, and sex. It will be useful to simplify these data by creating some derived variables that may be of interest when visualizing and analyzing the data. For example, for each state, we may want to calculate:
- Overall population count and density
- Proportion of people that are 65 years and older
- Proportion of people that are female (or male)
- Proportion of people that are black (or white, or other race)
Overall population estimates:
# convert GEOID and POP to numeric (the API returns all values as strings)
pop_df[["GEOID", "POP"]] = pop_df[["GEOID", "POP"]].apply(pd.to_numeric)
# order by GEOID (same as state FIPS code)
pop_wide = pop_df.sort_values(['GEOID']).reset_index(drop=True)
# exclude Puerto Rico and rename some variables
# data are already in wide format - no need to reshape
pop_wide = pop_wide[pop_wide.state != "Puerto Rico"].rename(columns={'POP':'pop_count_2019', 'DENSITY':'pop_density_2019'})
print(pop_wide.head(20))
## state pop_count_2019 pop_density_2019 GEOID
## 0 Alabama 4903185 96.81165243500000 1
## 1 Alaska 731545 1.28112659200000 2
## 2 Arizona 7278717 64.04325175200000 4
## 3 Arkansas 3017804 57.99283630500000 5
## 4 California 39512223 253.52068818000000 6
## 5 Colorado 5758736 55.56614283700000 8
## 6 Connecticut 3565287 736.22104642000000 9
## 7 Delaware 973764 499.74775877000000 10
## 8 District of Columbia 705749 11543.99993900000000 11
## 9 Florida 21477737 400.34642766000000 12
## 10 Georgia 10617423 183.95873142000000 13
## 11 Hawaii 1415872 220.45750094000000 15
## 12 Idaho 1787065 21.62335488800000 16
## 13 Illinois 12671821 228.26467808000000 17
## 14 Indiana 6732219 187.91203111000000 18
## 15 Iowa 3155070 56.48808963000000 19
## 16 Kansas 2913314 35.63317347300000 20
## 17 Kentucky 4467673 113.13031412000000 21
## 18 Louisiana 4648794 107.59979292000000 22
## 19 Maine 1344212 43.57985940900000 23
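Because the API delivers every value as a string, the `pd.to_numeric` conversion above is essential before sorting or doing arithmetic. A quick sketch using two values from the earlier output:

```python
import pandas as pd

# two rows from the earlier output; note both columns start out as strings
pop_sample = pd.DataFrame({"GEOID": ["28", "29"], "POP": ["2976149", "6137428"]})
print(pop_sample.dtypes)  # both columns are object (string) dtype

# convert both columns to numeric in one step
pop_sample[["GEOID", "POP"]] = pop_sample[["GEOID", "POP"]].apply(pd.to_numeric)
print(pop_sample.dtypes)  # both columns are now integer dtype
```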
Population estimates by age group:
# convert some variables to numeric
age_df[["GEOID", "AGEGROUP", "POP"]] = age_df[["GEOID", "AGEGROUP", "POP"]].apply(pd.to_numeric)
# order by GEOID (same as state FIPS code)
age_df = age_df.sort_values(['GEOID', 'AGEGROUP']).reset_index(drop=True)
# reshape the age groups to wide format
age_wide = age_df.pivot_table(index=["GEOID", "state"], columns='AGEGROUP', values="POP").reset_index()
# create variable for percent of people that are 65 years and older
age_wide["percent_age65over"] = (age_wide[26] / age_wide[0]) * 100
# select columns of interest
age_wide = age_wide.loc[:, ['GEOID', 'state', 'percent_age65over']]
# exclude Puerto Rico
age_wide = age_wide[age_wide.state != "Puerto Rico"]
print(age_wide.head(20))
## AGEGROUP GEOID state percent_age65over
## 0 1 Alabama 17.332346
## 1 2 Alaska 12.519804
## 2 4 Arizona 17.978897
## 3 5 Arkansas 17.359709
## 4 6 California 14.775466
## 5 8 Colorado 14.628418
## 6 9 Connecticut 17.677230
## 7 10 Delaware 19.399567
## 8 11 District of Columbia 12.375930
## 9 12 Florida 20.939529
## 10 13 Georgia 14.287403
## 11 15 Hawaii 18.959906
## 12 16 Idaho 16.265217
## 13 17 Illinois 16.124218
## 14 18 Indiana 16.127565
## 15 19 Iowa 17.525887
## 16 20 Kansas 16.321172
## 17 21 Kentucky 16.799887
## 18 22 Louisiana 15.940091
## 19 23 Maine 21.221727
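The `pivot_table` call is what reshapes the long data (one row per state and group code) into wide data (one column per group code). A toy example with made-up counts:

```python
import pandas as pd

# long format: one row per (state, group code)
long_df = pd.DataFrame({
    "GEOID": [1, 1, 2, 2],
    "state": ["A", "A", "B", "B"],
    "AGEGROUP": [0, 26, 0, 26],
    "POP": [100, 20, 200, 30],
})

# wide format: one row per state, one column per AGEGROUP code
wide = long_df.pivot_table(index=["GEOID", "state"], columns="AGEGROUP", values="POP").reset_index()
print(wide)
```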
Population estimates by sex:
# convert some variables to numeric
sex_df[["GEOID", "SEX", "POP"]] = sex_df[["GEOID", "SEX", "POP"]].apply(pd.to_numeric)
# order by GEOID (same as state FIPS code)
sex_df = sex_df.sort_values(['GEOID', 'SEX']).reset_index(drop=True)
# reshape the sex groups to wide format
sex_wide = sex_df.pivot_table(index=["GEOID", "state"], columns='SEX', values="POP").reset_index()
# create variable for percent of people that are female
sex_wide["percent_female"] = (sex_wide[2] / sex_wide[0]) * 100
# select columns of interest
sex_wide = sex_wide.loc[:, ['GEOID', 'state', 'percent_female']]
# exclude Puerto Rico
sex_wide = sex_wide[sex_wide.state != "Puerto Rico"]
print(sex_wide.head(20))
## SEX GEOID state percent_female
## 0 1 Alabama 51.673922
## 1 2 Alaska 47.861307
## 2 4 Arizona 50.303096
## 3 5 Arkansas 50.904167
## 4 6 California 50.281577
## 5 8 Colorado 49.618319
## 6 9 Connecticut 51.212427
## 7 10 Delaware 51.652248
## 8 11 District of Columbia 52.573649
## 9 12 Florida 51.125875
## 10 13 Georgia 51.396643
## 11 15 Hawaii 49.992655
## 12 16 Idaho 49.865953
## 13 17 Illinois 50.862958
## 14 18 Indiana 50.679560
## 15 19 Iowa 50.204750
## 16 20 Kansas 50.170253
## 17 21 Kentucky 50.734331
## 18 22 Louisiana 51.233589
## 19 23 Maine 51.041056
Population estimates by race:
# convert some variables to numeric
race_df[["GEOID", "RACE", "POP"]] = race_df[["GEOID", "RACE", "POP"]].apply(pd.to_numeric)
# order by GEOID (same as state FIPS code)
race_df = race_df.sort_values(['GEOID', 'RACE']).reset_index(drop=True)
# reshape the race categories to wide format
race_wide = race_df.pivot_table(index=["GEOID", "state"], columns='RACE', values="POP").reset_index()
# create variables for percentages of people that are black and white
race_wide["percent_white"] = (race_wide[1] / race_wide[0]) * 100
race_wide["percent_black"] = (race_wide[2] / race_wide[0]) * 100
# select columns of interest
race_wide = race_wide.loc[:, ['GEOID', 'state', 'percent_white', 'percent_black']]
# exclude Puerto Rico
race_wide = race_wide[race_wide.state != "Puerto Rico"]
print(race_wide.head(20))
## RACE GEOID state percent_white percent_black
## 0 1 Alabama 69.126415 26.784447
## 1 2 Alaska 65.271173 3.705582
## 2 4 Arizona 82.616785 5.179443
## 3 5 Arkansas 79.039527 15.675239
## 4 6 California 71.939104 6.460677
## 5 8 Colorado 86.939426 4.592657
## 6 9 Connecticut 79.700204 12.189341
## 7 10 Delaware 69.223652 23.162080
## 8 11 District of Columbia 45.990997 45.977253
## 9 12 Florida 77.277192 16.917588
## 10 13 Georgia 60.198270 32.570493
## 11 15 Hawaii 25.519044 2.186497
## 12 16 Idaho 93.022078 0.914684
## 13 17 Illinois 76.760199 14.619177
## 14 18 Indiana 84.809956 9.946141
## 15 19 Iowa 90.618592 4.060290
## 16 20 Kansas 86.288193 6.134766
## 17 21 Kentucky 87.515872 8.471502
## 18 22 Louisiana 62.787101 32.798463
## 19 23 Maine 94.429004 1.688052
We can now merge all the cleaned Census data into one object called demographics:
from functools import reduce

data_frames = [pop_wide, age_wide, sex_wide, race_wide]
demographics = reduce(lambda left, right: pd.merge(left, right, on=['GEOID', 'state'], how='left'), data_frames)
print(demographics.head(20))
## state ... percent_black
## 0 Alabama ... 26.784447
## 1 Alaska ... 3.705582
## 2 Arizona ... 5.179443
## 3 Arkansas ... 15.675239
## 4 California ... 6.460677
## 5 Colorado ... 4.592657
## 6 Connecticut ... 12.189341
## 7 Delaware ... 23.162080
## 8 District of Columbia ... 45.977253
## 9 Florida ... 16.917588
## 10 Georgia ... 32.570493
## 11 Hawaii ... 2.186497
## 12 Idaho ... 0.914684
## 13 Illinois ... 14.619177
## 14 Indiana ... 9.946141
## 15 Iowa ... 4.060290
## 16 Kansas ... 6.134766
## 17 Kentucky ... 8.471502
## 18 Louisiana ... 32.798463
## 19 Maine ... 1.688052
##
## [20 rows x 8 columns]
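Here `reduce` folds `pd.merge` over the list pairwise: `pop_wide` is merged with `age_wide`, that result with `sex_wide`, and so on. A sketch with toy frames:

```python
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({"GEOID": [1, 2], "state": ["A", "B"], "x": [10, 20]})
df2 = pd.DataFrame({"GEOID": [1, 2], "state": ["A", "B"], "y": [0.1, 0.2]})
df3 = pd.DataFrame({"GEOID": [1, 2], "state": ["A", "B"], "z": [7, 8]})

# equivalent to merging df1 with df2, then that result with df3
merged = reduce(
    lambda left, right: pd.merge(left, right, on=["GEOID", "state"], how="left"),
    [df1, df2, df3],
)
print(merged.columns.tolist())  # ['GEOID', 'state', 'x', 'y', 'z']
```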
Combine Census and COVID-19 data
Merge the COVID-19 cases data with Census demographic data:
# merge COVID-19 cases with demographics
data_frames = [US_states_cases_selected, demographics]
US_cases_long_demogr = reduce(lambda left,right: pd.merge(left,right,on=['GEOID', 'state'], how='left'), data_frames)
# update the case rate variables to use population estimates from 2019
US_cases_long_demogr["cases_cum_rate_100K"] = (US_cases_long_demogr["cases_cum"] / US_cases_long_demogr["pop_count_2019"]) * 1e5
US_cases_long_demogr["cases_rate_100K"] = (US_cases_long_demogr["cases_count_pos"] / US_cases_long_demogr["pop_count_2019"]) * 1e5
print(US_cases_long_demogr.head(20))
## GEOID state ... percent_white percent_black
## 0 1 Alabama ... 69.126415 26.784447
## 1 1 Alabama ... 69.126415 26.784447
## 2 1 Alabama ... 69.126415 26.784447
## 3 1 Alabama ... 69.126415 26.784447
## 4 1 Alabama ... 69.126415 26.784447
## 5 1 Alabama ... 69.126415 26.784447
## 6 1 Alabama ... 69.126415 26.784447
## 7 1 Alabama ... 69.126415 26.784447
## 8 1 Alabama ... 69.126415 26.784447
## 9 1 Alabama ... 69.126415 26.784447
## 10 1 Alabama ... 69.126415 26.784447
## 11 1 Alabama ... 69.126415 26.784447
## 12 1 Alabama ... 69.126415 26.784447
## 13 1 Alabama ... 69.126415 26.784447
## 14 1 Alabama ... 69.126415 26.784447
## 15 1 Alabama ... 69.126415 26.784447
## 16 1 Alabama ... 69.126415 26.784447
## 17 1 Alabama ... 69.126415 26.784447
## 18 1 Alabama ... 69.126415 26.784447
## 19 1 Alabama ... 69.126415 26.784447
##
## [20 rows x 18 columns]
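The rate arithmetic is just cases divided by population, scaled to 100,000 residents. Checking it with Alabama's 2019 population estimate from the demographics table and a count of 1010 cases (Alabama's week-14 total in the weekly table below):

```python
# cases per 100,000 residents: (cases / population) * 1e5
pop_count_2019 = 4903185    # Alabama's 2019 population estimate (from above)
cases_count_pos = 1010      # Alabama's week-14 positive-case count
cases_rate_100K = cases_count_pos / pop_count_2019 * 1e5
print(round(cases_rate_100K, 2))  # 20.6
```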
Aggregate to weekly level
Once again, for the purposes of modeling, it may be useful to aggregate the daily data to the weekly level:
# take the mean of the demographic variables (constant within a state)
# and the sum of the case-count variables within each state-week
aggs_by_col = {'pop_count_2019': 'mean', 'percent_age65over': 'mean',
               'percent_female': 'mean', 'percent_white': 'mean',
               'percent_black': 'mean', 'cases_cum': 'sum',
               'cases_cum_rate_100K': 'sum', 'cases_count_pos': 'sum',
               'cases_rate_100K': 'sum'}
US_cases_long_demogr_week = US_cases_long_demogr.groupby(['state', 'week_of_year'], as_index=False).agg(aggs_by_col)
print(US_cases_long_demogr_week.head(20))
## state week_of_year ... cases_count_pos cases_rate_100K
## 0 Alabama 4 ... 0.0 0.000000
## 1 Alabama 5 ... 0.0 0.000000
## 2 Alabama 6 ... 0.0 0.000000
## 3 Alabama 7 ... 0.0 0.000000
## 4 Alabama 8 ... 0.0 0.000000
## 5 Alabama 9 ... 0.0 0.000000
## 6 Alabama 10 ... 0.0 0.000000
## 7 Alabama 11 ... 23.0 0.469083
## 8 Alabama 12 ... 134.0 2.732917
## 9 Alabama 13 ... 673.0 13.725772
## 10 Alabama 14 ... 1010.0 20.598856
## 11 Alabama 15 ... 1743.0 35.548322
## 12 Alabama 16 ... 1320.0 26.921277
## 13 Alabama 17 ... 1518.0 30.959468
## 14 Alabama 18 ... 1467.0 29.919328
## 15 Alabama 19 ... 2001.0 40.810208
## 16 Alabama 20 ... 1882.0 38.383214
## 17 Alabama 21 ... 2707.0 55.209012
## 18 Alabama 22 ... 3474.0 70.851905
## 19 Alabama 23 ... 2548.0 51.966222
##
## [20 rows x 11 columns]
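To see what the `groupby`/`agg` step does, here is a toy version: within each state-week, population-type variables are averaged (they are constant within a state) while case counts are summed:

```python
import pandas as pd

# toy daily data for one state across two weeks
daily = pd.DataFrame({
    "state": ["A"] * 4,
    "week_of_year": [11, 11, 12, 12],
    "pop_count_2019": [1000, 1000, 1000, 1000],
    "cases_count_pos": [2, 3, 5, 7],
})

# aggregate to one row per (state, week_of_year)
weekly = daily.groupby(["state", "week_of_year"], as_index=False).agg(
    {"pop_count_2019": "mean", "cases_count_pos": "sum"}
)
print(weekly["cases_count_pos"].tolist())  # [5, 12]
```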
Let’s store the data frame in a .csv file so that we can easily access it later:
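A minimal sketch of that step, with a small stand-in data frame so the snippet runs on its own (the file name is our choice, not prescribed):

```python
import pandas as pd

# stand-in for US_cases_long_demogr_week, built from two rows of the
# weekly output above, so this snippet is self-contained
US_cases_long_demogr_week = pd.DataFrame({
    "state": ["Alabama", "Alabama"],
    "week_of_year": [11, 12],
    "cases_count_pos": [23.0, 134.0],
})

# write to CSV without the row index, then read it back to confirm the round trip
US_cases_long_demogr_week.to_csv("US_cases_long_demogr_week.csv", index=False)
check = pd.read_csv("US_cases_long_demogr_week.csv")
print(check.shape)  # (2, 3)
```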