OPTIONAL: U.S. Census data (Python)

This section is optional. It provides an example of how to acquire potentially interesting predictors of COVID-19 cases from the U.S. Census Bureau.

The COVID-19 dataset we accessed above provides daily COVID-19 case counts for each U.S State, together with population counts from the 2010 Decennial Census. This should be enough information to produce some interesting visualizations. For modeling, however, we really only have one useful predictor in the dataset — time. This section describes some options for acquiring other potentially interesting predictors of COVID-19 cases.

U.S. Census Bureau API

We may want to use additional demographic information in our visualizations and analysis of the COVID-19 cases. An obvious place to source this information is from the U.S. Census Bureau. There are three U.S. Census Bureau data sources, each with their own API:

Decennial Census: survey of every household in the U.S. every 10 years — used to calculate population of U.S. geographic areas.
American Community Survey: yearly representative sample of 3.5 million households — used to calculate population estimates of U.S. geographic areas.
Population Estimates: yearly population estimates of U.S. geographic areas.

The COVID-19 data from Dataverse already contains population values from the 2010 decennial census. But, using the Census Bureau’s Population Estimates API, we can get updated population data for 2019 as well as population data stratified by age groups, race, and sex.

We’re going to use the requests package as an interface to the Census Bureau API.

The first step is to sign-up for an API key: http://api.census.gov/data/key_signup.html. Then give the key a name.

# store API key
API_key = "your-API-key-here"

Next, we can use the .get() method to access the Population Estimates API and extract variables of interest into a pandas data frame:

# access the Population Estimates API and extract variables of interest
# provides overall population estimates and population densities
pop_url = f'https://api.census.gov/data/2019/pep/population?get=NAME,POP,DENSITY&for=state:*&key={API_key}'
response = requests.get(pop_url)
pop_data = response.json()
pop_df = pd.DataFrame(pop_data[1:], columns=pop_data[0]).rename(columns={'NAME':'state', 'state':'GEOID'})

print(pop_df.head(20))

##              state       POP              DENSITY GEOID
## 0      Mississippi   2976149    63.42279731700000    28
## 1         Missouri   6137428    89.27757816400000    29
## 2          Montana   1068778     7.34314521650000    30
## 3         Nebraska   1934408    25.18219543000000    31
## 4           Nevada   3080156    28.03700636500000    32
## 5    New Hampshire   1359711   151.86532690000000    33
## 6       New Jersey   8882190  1207.67221710000000    34
## 7       New Mexico   2096829    17.28456209900000    35
## 8         New York  19453561   412.81842864000000    36
## 9   North Carolina  10488084   215.71420423000000    37
## 10    North Dakota    762062    11.04521346200000    38
## 11            Ohio  11689100   286.08552119000000    39
## 12        Oklahoma   3956971    57.68520257700000    40
## 13          Oregon   4217737    43.94025129200000    41
## 14    Pennsylvania  12801989   286.13105806000000    42
## 15    Rhode Island   1059361  1024.62672880000000    44
## 16  South Carolina   5148714   171.26007636000000    45
## 17    South Dakota    884659    11.66947080600000    46
## 18       Tennessee   6829174   165.60404231000000    47
## 19           Texas  28995881   110.98346670000000    48

Get population estimates by age group:

age_url = f'https://api.census.gov/data/2019/pep/charagegroups?get=NAME,AGEGROUP,POP&for=state:*&key={API_key}'
response = requests.get(age_url)
age_data = response.json()
age_df = pd.DataFrame(age_data[1:], columns=age_data[0]).rename(columns={'NAME':'state', 'state':'GEOID'})

print(age_df.head(20))

##           state AGEGROUP      POP GEOID
## 0   Mississippi        0  2976149    28
## 1   Mississippi        1   183478    28
## 2   Mississippi        2   189377    28
## 3      Missouri       31     38.9    29
## 4   Mississippi        3   206282    28
## 5   Mississippi        4   201350    28
## 6   Mississippi        5   201517    28
## 7   Mississippi        6   206989    28
## 8   Mississippi        7   186758    28
## 9   Mississippi        8   189068    28
## 10  Mississippi        9   176686    28
## 11  Mississippi       10   181833    28
## 12  Mississippi       11   180207    28
## 13  Mississippi       12   197314    28
## 14  Mississippi       13   188597    28
## 15  Mississippi       14   160611    28
## 16  Mississippi       15   128344    28
## 17  Mississippi       16    87629    28
## 18  Mississippi       17    57811    28
## 19  Mississippi       18    52298    28

Get population estimates by sex:

sex_url = f'https://api.census.gov/data/2019/pep/charagegroups?get=NAME,SEX,POP&for=state:*&key={API_key}'
response = requests.get(sex_url)
sex_data = response.json()
sex_df = pd.DataFrame(sex_data[1:], columns=sex_data[0]).rename(columns={'NAME':'state', 'state':'GEOID'})

print(sex_df.head(20))

##             state SEX       POP GEOID
## 0     Mississippi   0   2976149    28
## 1     Mississippi   1   1442292    28
## 2     Mississippi   2   1533857    28
## 3        Missouri   0   6137428    29
## 4        Missouri   1   3012662    29
## 5        Missouri   2   3124766    29
## 6         Montana   0   1068778    30
## 7         Montana   1    538066    30
## 8         Montana   2    530712    30
## 9        Nebraska   0   1934408    31
## 10       Nebraska   1    966449    31
## 11         Nevada   0   3080156    32
## 12         Nevada   1   1544827    32
## 13  New Hampshire   0   1359711    33
## 14  New Hampshire   1    673793    33
## 15     New Jersey   0   8882190    34
## 16     New Jersey   1   4340587    34
## 17     New Mexico   0   2096829    35
## 18     New Mexico   1   1037432    35
## 19       New York   0  19453561    36

Get population estimates by race:

race_url = f'https://api.census.gov/data/2019/pep/charagegroups?get=NAME,RACE,POP&for=state:*&key={API_key}'
response = requests.get(race_url)
race_data = response.json()
race_df = pd.DataFrame(race_data[1:], columns=race_data[0]).rename(columns={'NAME':'state', 'state':'GEOID'})

print(race_df.head(20))

##           state RACE      POP GEOID
## 0   Mississippi    0  2976149    28
## 1   Mississippi    1  1758081    28
## 2   Mississippi    2  1124559    28
## 3   Mississippi    3    18705    28
## 4   Mississippi    4    33032    28
## 5   Mississippi    5     1806    28
## 6   Mississippi    6    39966    28
## 7   Mississippi    7  1792535    28
## 8   Mississippi    8  1148593    28
## 9   Mississippi    9    31669    28
## 10  Mississippi   10    42550    28
## 11  Mississippi   11     3821    28
## 12     Missouri    0  6137428    29
## 13     Missouri    1  5086188    29
## 14     Missouri    2   725579    29
## 15     Missouri    3    35839    29
## 16     Missouri    4   133312    29
## 17     Missouri    5     9857    29
## 18     Missouri    6   146653    29
## 19     Missouri    7  5222409    29

Clean U.S. Census data

The Census data we extracted contain population estimates for multiple categories of age, race, and sex. It will be useful to simplify these data by creating some derived variables that may be of interest when visualizing and analyzing the data. For example, for each state, we may want to calculate:

Overall population count and density
Proportion of people that are 65 years and older
Proportion of people that are female (or male)
Proportion of people that are black (or white, or other race)

Overall population estimates:

# order by GEOID (same as state FIPS code)
pop_df[["GEOID", "POP"]] = pop_df[["GEOID", "POP"]].apply(pd.to_numeric)
pop_wide = pop_df.sort_values(['GEOID']).reset_index(drop=True)
# exclude Puerto Rico and rename some variables
# data are already in wide format - no need to reshape
pop_wide = pop_wide[pop_wide.state != "Puerto Rico"].rename(columns={'POP':'pop_count_2019', 'DENSITY':'pop_density_2019'})

print(pop_wide.head(20))

##                    state  pop_count_2019      pop_density_2019  GEOID
## 0                Alabama         4903185     96.81165243500000      1
## 1                 Alaska          731545      1.28112659200000      2
## 2                Arizona         7278717     64.04325175200000      4
## 3               Arkansas         3017804     57.99283630500000      5
## 4             California        39512223    253.52068818000000      6
## 5               Colorado         5758736     55.56614283700000      8
## 6            Connecticut         3565287    736.22104642000000      9
## 7               Delaware          973764    499.74775877000000     10
## 8   District of Columbia          705749  11543.99993900000000     11
## 9                Florida        21477737    400.34642766000000     12
## 10               Georgia        10617423    183.95873142000000     13
## 11                Hawaii         1415872    220.45750094000000     15
## 12                 Idaho         1787065     21.62335488800000     16
## 13              Illinois        12671821    228.26467808000000     17
## 14               Indiana         6732219    187.91203111000000     18
## 15                  Iowa         3155070     56.48808963000000     19
## 16                Kansas         2913314     35.63317347300000     20
## 17              Kentucky         4467673    113.13031412000000     21
## 18             Louisiana         4648794    107.59979292000000     22
## 19                 Maine         1344212     43.57985940900000     23

Population estimates by age group:

# convert some variables to numeric
age_df[["GEOID", "AGEGROUP", "POP"]] = age_df[["GEOID", "AGEGROUP", "POP"]].apply(pd.to_numeric)
# order by GEOID (same as state FIPS code)
age_df = age_df.sort_values(['GEOID', 'AGEGROUP']).reset_index(drop=True)
# reshape the age groups to wide format
age_wide = age_df.pivot_table(index=["GEOID", "state"], columns='AGEGROUP', values="POP").reset_index()
# create variable for percent of people that are 65 years and older
age_wide["percent_age65over"] = (age_wide[26] / age_wide[0]) * 100
# select columns of interest
age_wide = age_wide.loc[:, ['GEOID', 'state', 'percent_age65over']]
# exclude Puerto Rico
age_wide = age_wide[age_wide.state != "Puerto Rico"]

print(age_wide.head(20))

## AGEGROUP  GEOID                 state  percent_age65over
## 0             1               Alabama          17.332346
## 1             2                Alaska          12.519804
## 2             4               Arizona          17.978897
## 3             5              Arkansas          17.359709
## 4             6            California          14.775466
## 5             8              Colorado          14.628418
## 6             9           Connecticut          17.677230
## 7            10              Delaware          19.399567
## 8            11  District of Columbia          12.375930
## 9            12               Florida          20.939529
## 10           13               Georgia          14.287403
## 11           15                Hawaii          18.959906
## 12           16                 Idaho          16.265217
## 13           17              Illinois          16.124218
## 14           18               Indiana          16.127565
## 15           19                  Iowa          17.525887
## 16           20                Kansas          16.321172
## 17           21              Kentucky          16.799887
## 18           22             Louisiana          15.940091
## 19           23                 Maine          21.221727

Population estimates by sex:

# convert some variables to numeric
sex_df[["GEOID", "SEX", "POP"]] = sex_df[["GEOID", "SEX", "POP"]].apply(pd.to_numeric)
# order by GEOID (same as state FIPS code)
sex_df = sex_df.sort_values(['GEOID', 'SEX']).reset_index(drop=True)
# reshape the sex groups to wide format
sex_wide = sex_df.pivot_table(index=["GEOID", "state"], columns='SEX', values="POP").reset_index()
# create variable for percent of people that are female
sex_wide["percent_female"] = (sex_wide[2] / sex_wide[0]) * 100
# select columns of interest
sex_wide = sex_wide.loc[:, ['GEOID', 'state', 'percent_female']]
# exclude Puerto Rico
sex_wide = sex_wide[sex_wide.state != "Puerto Rico"]

print(sex_wide.head(20))

## SEX  GEOID                 state  percent_female
## 0        1               Alabama       51.673922
## 1        2                Alaska       47.861307
## 2        4               Arizona       50.303096
## 3        5              Arkansas       50.904167
## 4        6            California       50.281577
## 5        8              Colorado       49.618319
## 6        9           Connecticut       51.212427
## 7       10              Delaware       51.652248
## 8       11  District of Columbia       52.573649
## 9       12               Florida       51.125875
## 10      13               Georgia       51.396643
## 11      15                Hawaii       49.992655
## 12      16                 Idaho       49.865953
## 13      17              Illinois       50.862958
## 14      18               Indiana       50.679560
## 15      19                  Iowa       50.204750
## 16      20                Kansas       50.170253
## 17      21              Kentucky       50.734331
## 18      22             Louisiana       51.233589
## 19      23                 Maine       51.041056

Population estimates by race:

# convert some variables to numeric
race_df[["GEOID", "RACE", "POP"]] = race_df[["GEOID", "RACE", "POP"]].apply(pd.to_numeric)
# order by GEOID (same as state FIPS code)
race_df = race_df.sort_values(['GEOID', 'RACE']).reset_index(drop=True)
# reshape the race categories to wide format
race_wide = race_df.pivot_table(index=["GEOID", "state"], columns='RACE', values="POP").reset_index()
# create variables for percentages of people that are black and white
race_wide["percent_white"] = (race_wide[1] / race_wide[0]) * 100
race_wide["percent_black"] = (race_wide[2] / race_wide[0]) * 100
# select columns of interest
race_wide = race_wide.loc[:, ['GEOID', 'state', 'percent_white', 'percent_black']]
# exclude Puerto Rico
race_wide = race_wide[race_wide.state != "Puerto Rico"]

print(race_wide.head(20))

## RACE  GEOID                 state  percent_white  percent_black
## 0         1               Alabama      69.126415      26.784447
## 1         2                Alaska      65.271173       3.705582
## 2         4               Arizona      82.616785       5.179443
## 3         5              Arkansas      79.039527      15.675239
## 4         6            California      71.939104       6.460677
## 5         8              Colorado      86.939426       4.592657
## 6         9           Connecticut      79.700204      12.189341
## 7        10              Delaware      69.223652      23.162080
## 8        11  District of Columbia      45.990997      45.977253
## 9        12               Florida      77.277192      16.917588
## 10       13               Georgia      60.198270      32.570493
## 11       15                Hawaii      25.519044       2.186497
## 12       16                 Idaho      93.022078       0.914684
## 13       17              Illinois      76.760199      14.619177
## 14       18               Indiana      84.809956       9.946141
## 15       19                  Iowa      90.618592       4.060290
## 16       20                Kansas      86.288193       6.134766
## 17       21              Kentucky      87.515872       8.471502
## 18       22             Louisiana      62.787101      32.798463
## 19       23                 Maine      94.429004       1.688052

We can now merge all the cleaned Census data into one object called demographics:

data_frames = [pop_wide, age_wide, sex_wide, race_wide]
demographics = reduce(lambda  left,right: pd.merge(left,right,on=['GEOID', 'state'], how='left'), data_frames)

print(demographics.head(20))

##                    state  ...  percent_black
## 0                Alabama  ...      26.784447
## 1                 Alaska  ...       3.705582
## 2                Arizona  ...       5.179443
## 3               Arkansas  ...      15.675239
## 4             California  ...       6.460677
## 5               Colorado  ...       4.592657
## 6            Connecticut  ...      12.189341
## 7               Delaware  ...      23.162080
## 8   District of Columbia  ...      45.977253
## 9                Florida  ...      16.917588
## 10               Georgia  ...      32.570493
## 11                Hawaii  ...       2.186497
## 12                 Idaho  ...       0.914684
## 13              Illinois  ...      14.619177
## 14               Indiana  ...       9.946141
## 15                  Iowa  ...       4.060290
## 16                Kansas  ...       6.134766
## 17              Kentucky  ...       8.471502
## 18             Louisiana  ...      32.798463
## 19                 Maine  ...       1.688052
## 
## [20 rows x 8 columns]

Combine Census and COVID-19 data

Merge the COVID-19 cases data with Census demographic data:

# merge COVID-19 cases with demographics
data_frames = [US_states_cases_selected, demographics]
US_cases_long_demogr = reduce(lambda  left,right: pd.merge(left,right,on=['GEOID', 'state'], how='left'), data_frames)

# update the case rate variables to use population estimates from 2019
US_cases_long_demogr["cases_cum_rate_100K"] = (US_cases_long_demogr["cases_cum"] / US_cases_long_demogr["pop_count_2019"]) * 1e5
US_cases_long_demogr["cases_rate_100K"] = (US_cases_long_demogr["cases_count_pos"] / US_cases_long_demogr["pop_count_2019"]) * 1e5

print(US_cases_long_demogr.head(20))

##     GEOID    state  ...  percent_white percent_black
## 0       1  Alabama  ...      69.126415     26.784447
## 1       1  Alabama  ...      69.126415     26.784447
## 2       1  Alabama  ...      69.126415     26.784447
## 3       1  Alabama  ...      69.126415     26.784447
## 4       1  Alabama  ...      69.126415     26.784447
## 5       1  Alabama  ...      69.126415     26.784447
## 6       1  Alabama  ...      69.126415     26.784447
## 7       1  Alabama  ...      69.126415     26.784447
## 8       1  Alabama  ...      69.126415     26.784447
## 9       1  Alabama  ...      69.126415     26.784447
## 10      1  Alabama  ...      69.126415     26.784447
## 11      1  Alabama  ...      69.126415     26.784447
## 12      1  Alabama  ...      69.126415     26.784447
## 13      1  Alabama  ...      69.126415     26.784447
## 14      1  Alabama  ...      69.126415     26.784447
## 15      1  Alabama  ...      69.126415     26.784447
## 16      1  Alabama  ...      69.126415     26.784447
## 17      1  Alabama  ...      69.126415     26.784447
## 18      1  Alabama  ...      69.126415     26.784447
## 19      1  Alabama  ...      69.126415     26.784447
## 
## [20 rows x 18 columns]

Aggregate to weekly-level

Once again, for the purposes of modeling, it may be useful to aggregate to the weekly-level:

# COVID-19 data and demographic data
aggs_by_col = {'pop_count_2019': lambda x: np.mean(x), 'percent_age65over': lambda x: np.mean(x), 'percent_female': lambda x: np.mean(x), \
    'percent_white': lambda x: np.mean(x), 'percent_black': lambda x: np.mean(x),'cases_cum': 'sum', 'cases_cum_rate_100K': 'sum', \
    'cases_count_pos': 'sum', 'cases_rate_100K': 'sum'}

US_cases_long_demogr_week = US_cases_long_demogr.groupby(['state', 'week_of_year'], as_index=False).agg(aggs_by_col)

print(US_cases_long_demogr_week.head(20))

##       state  week_of_year  ...  cases_count_pos  cases_rate_100K
## 0   Alabama             4  ...              0.0         0.000000
## 1   Alabama             5  ...              0.0         0.000000
## 2   Alabama             6  ...              0.0         0.000000
## 3   Alabama             7  ...              0.0         0.000000
## 4   Alabama             8  ...              0.0         0.000000
## 5   Alabama             9  ...              0.0         0.000000
## 6   Alabama            10  ...              0.0         0.000000
## 7   Alabama            11  ...             23.0         0.469083
## 8   Alabama            12  ...            134.0         2.732917
## 9   Alabama            13  ...            673.0        13.725772
## 10  Alabama            14  ...           1010.0        20.598856
## 11  Alabama            15  ...           1743.0        35.548322
## 12  Alabama            16  ...           1320.0        26.921277
## 13  Alabama            17  ...           1518.0        30.959468
## 14  Alabama            18  ...           1467.0        29.919328
## 15  Alabama            19  ...           2001.0        40.810208
## 16  Alabama            20  ...           1882.0        38.383214
## 17  Alabama            21  ...           2707.0        55.209012
## 18  Alabama            22  ...           3474.0        70.851905
## 19  Alabama            23  ...           2548.0        51.966222
## 
## [20 rows x 11 columns]

Let’s store the data frame in a .csv file so that we can easily access it later:

# store the data frame in a .csv file
US_cases_long_demogr_week.to_csv("data_py/US_cases_long_demogr_week.csv")