OPTIONAL: U.S. Census data (R)

This section is optional. It provides an example of how to acquire potentially interesting predictors of COVID-19 cases from the U.S. Census Bureau.

The COVID-19 dataset we accessed above provides daily COVID-19 case counts for each U.S State, together with population counts from the 2010 Decennial Census. This should be enough information to produce some interesting visualizations. For modeling, however, we really only have one useful predictor in the dataset — time. This section describes some options for acquiring other potentially interesting predictors of COVID-19 cases.

U.S. Census Bureau API

We may want to use additional demographic information in our visualizations and analysis of the COVID-19 cases. An obvious place to source this information is from the U.S. Census Bureau. There are three U.S. Census Bureau data sources, each with their own API:

Decennial Census: survey of every household in the U.S. every 10 years — used to calculate population of U.S. geographic areas.
American Community Survey: yearly representative sample of 3.5 million households — used to calculate population estimates of U.S. geographic areas.
Population Estimates: yearly population estimates of U.S. geographic areas.

The COVID-19 data from Dataverse already contains population values from the 2010 decennial census. But, using the Census Bureau’s Population Estimates API, we can get updated population data for 2019 as well as population data stratified by age groups, race, and sex.

We’re going to use the tidycensus package as an interface to the Census Bureau API. A basic usage guide is available — https://walker-data.com/tidycensus/articles/basic-usage.html — but we’ll walk through all the necessary steps.

The first step is to sign-up for an API key: http://api.census.gov/data/key_signup.html. Then give the key a name.

# store API key
API_key <- "your-API-key-here"

We can then set the API key for our current R session using the census_api_key() function (or we can include it in an .Renviron file for future use):

# set API key for current session
census_api_key(API_key)

Next, we can use the get_estimates() function to access the Population Estimates API and extract variables of interest:

pop <- get_estimates(
    geography = "state", # we'll select state-level data
    product = "population", # provides overall population estimates and population densities
    year = 2019, # the latest year available
    key = API_key)

glimpse(pop)

## Rows: 104
## Columns: 4
## $ NAME     <chr> "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", …
## $ GEOID    <chr> "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", …
## $ variable <chr> "POP", "POP", "POP", "POP", "POP", "POP", "POP", "POP", "POP…
## $ value    <dbl> 2976149, 6137428, 1068778, 1934408, 3080156, 1359711, 888219…

Get population estimates by age group:

age <- get_estimates(
    geography = "state", 
    product = "characteristics", # provides population estimates stratified by the variable specified in `breakdown`
    breakdown = "AGEGROUP", # population estimates for different age groups
    breakdown_labels = TRUE, # labels for age groups
    year = 2019, 
    key = API_key) 

glimpse(age)

## Rows: 1,664
## Columns: 4
## $ GEOID    <chr> "28", "28", "28", "28", "29", "28", "28", "28", "28", "28", …
## $ NAME     <chr> "Mississippi", "Mississippi", "Mississippi", "Mississippi", …
## $ value    <dbl> 2976149.0, 183478.0, 189377.0, 38.0, 38.9, 206282.0, 201350.…
## $ AGEGROUP <fct> All ages, Age 0 to 4 years, Age 5 to 9 years, Median age, Me…

Get population estimates by sex:

sex <- get_estimates(
    geography = "state", 
    product = "characteristics",
    breakdown = "SEX", # population estimates for different sexes
    breakdown_labels = TRUE,
    year = 2019, 
    key = API_key) 

glimpse(sex)

## Rows: 156
## Columns: 4
## $ GEOID <chr> "28", "28", "28", "29", "29", "29", "30", "30", "30", "31", "31…
## $ NAME  <chr> "Mississippi", "Mississippi", "Mississippi", "Missouri", "Misso…
## $ value <dbl> 2976149, 1442292, 1533857, 6137428, 3012662, 3124766, 1068778, …
## $ SEX   <chr> "Both sexes", "Male", "Female", "Both sexes", "Male", "Female",…

Get population estimates by race:

race <- get_estimates(
    geography = "state", 
    product = "characteristics",
    breakdown ="RACE", # population estimates for different races
    breakdown_labels = TRUE,
    year = 2019, 
    key = API_key)

glimpse(race)

## Rows: 613
## Columns: 4
## $ GEOID <chr> "28", "28", "28", "28", "28", "28", "28", "28", "28", "28", "28…
## $ NAME  <chr> "Mississippi", "Mississippi", "Mississippi", "Mississippi", "Mi…
## $ value <dbl> 2976149, 1758081, 1124559, 18705, 33032, 1806, 39966, 1792535, …
## $ RACE  <chr> "All races", "White alone", "Black alone", "American Indian and…

Clean U.S. Census data

The Census data we extracted contain population estimates for multiple categories of age, race, and sex. It will be useful to simplify these data by creating some derived variables that may be of interest when visualizing and analyzing the data. For example, for each state, we may want to calculate:

Overall population count and density
Proportion of people that are 65 years and older
Proportion of people that are female (or male)
Proportion of people that are black (or white, or other race)

Overall population estimates:

pop_wide <- pop %>% 
    # order by GEOID (same as state FIPS code)
    arrange(GEOID) %>%
    # rename state 
    rename(state = NAME) %>%
    # exclude Puerto Rico
    filter(state != "Puerto Rico") %>% 
    # reshape population variables to wide format
    pivot_wider(names_from = variable, values_from = value) %>%
    # rename population variables
    rename(pop_count_2019 = POP, pop_density_2019 = DENSITY)

glimpse(pop_wide)

## Rows: 51
## Columns: 4
## $ state            <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor…
## $ GEOID            <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11"…
## $ pop_count_2019   <dbl> 4903185, 731545, 7278717, 3017804, 39512223, 5758736…
## $ pop_density_2019 <dbl> 96.811652, 1.281127, 64.043252, 57.992836, 253.52068…

Population estimates by age group:

age_wide <- age %>% 
    # order by GEOID (same as state FIPS code)
    arrange(GEOID) %>%
    # rename state
    rename(state = NAME) %>%
    # reshape the age groups to wide format
    pivot_wider(names_from = AGEGROUP, values_from = value) %>%
    # create variable for percentortion of people that are 65 years and older
    mutate(percent_age65over = (`65 years and over` /  `All ages`) * 100) %>%
    # select columns of interest
    select(GEOID, state, percent_age65over)

glimpse(age_wide)

## Rows: 52
## Columns: 3
## $ GEOID             <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11…
## $ state             <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo…
## $ percent_age65over <dbl> 17.33235, 12.51980, 17.97890, 17.35971, 14.77547, 1…

Population estimates by sex:

sex_wide <- sex %>% 
    # order by GEOID (same as state FIPS code)
    arrange(GEOID) %>%
    # rename state
    rename(state = NAME) %>%
    # reshape the sex categories to wide format
    pivot_wider(names_from = SEX, values_from = value) %>%
    # create variable for percentortion of people that are female
    mutate(percent_female = (Female / `Both sexes`) * 100) %>%
    # select columns of interest
    select(GEOID, state, percent_female)

glimpse(sex_wide)

## Rows: 52
## Columns: 3
## $ GEOID          <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11", …
## $ state          <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Californi…
## $ percent_female <dbl> 51.67392, 47.86131, 50.30310, 50.90417, 50.28158, 49.6…

Population estimates by race:

race_wide <- race %>% 
    # order by GEOID (same as state FIPS code)
    arrange(GEOID) %>%
    # rename state
    rename(state = NAME) %>%
    # reshape the race categories to wide format
    pivot_wider(names_from = RACE, values_from = value) %>%
    # create variables for percentortion of people that are black and white
    mutate(percent_white = (`White alone` /  `All races`) * 100,
           percent_black = (`Black alone` /  `All races`) * 100) %>%
    # select columns of interest
    select(GEOID, state, percent_white, percent_black)

glimpse(race_wide)

## Rows: 52
## Columns: 4
## $ GEOID         <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11", "…
## $ state         <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California…
## $ percent_white <dbl> 69.12641, 65.27117, 82.61679, 79.03953, 71.93910, 86.93…
## $ percent_black <dbl> 26.7844473, 3.7055820, 5.1794430, 15.6752393, 6.4606767…

We can now merge all the cleaned Census data into one object called demographics:

demographics <- list(pop_wide, age_wide, sex_wide, race_wide) %>% 
    reduce(left_join, by = c("GEOID", "state"))

glimpse(demographics)

## Rows: 51
## Columns: 8
## $ state             <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo…
## $ GEOID             <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11…
## $ pop_count_2019    <dbl> 4903185, 731545, 7278717, 3017804, 39512223, 575873…
## $ pop_density_2019  <dbl> 96.811652, 1.281127, 64.043252, 57.992836, 253.5206…
## $ percent_age65over <dbl> 17.33235, 12.51980, 17.97890, 17.35971, 14.77547, 1…
## $ percent_female    <dbl> 51.67392, 47.86131, 50.30310, 50.90417, 50.28158, 4…
## $ percent_white     <dbl> 69.12641, 65.27117, 82.61679, 79.03953, 71.93910, 8…
## $ percent_black     <dbl> 26.7844473, 3.7055820, 5.1794430, 15.6752393, 6.460…

Combine Census and COVID-19 data

Merge the COVID-19 cases data with Census demographic data:

# merge COVID-19 cases with demographics
US_cases_long_demogr <- US_cases_long %>% 
    left_join(demographics, by = c("GEOID", "state")) 

# update the case rate variables to use population estimates from 2019
US_cases_long_demogr <- US_cases_long_demogr %>%
    mutate(cases_cum_rate_100K = (cases_cum / pop_count_2019) * 1e5,
           cases_rate_100K = (cases_count_pos / pop_count_2019) * 1e5)

glimpse(US_cases_long_demogr)

## Rows: 16,014
## Columns: 18
## Groups: state [51]
## $ GEOID               <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ state               <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
## $ pop_count_2010      <dbl> 4779736, 4779736, 4779736, 4779736, 4779736, 4779…
## $ date                <date> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, …
## $ cases_cum           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_of_year         <dbl> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 3…
## $ week_of_year        <dbl> 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6…
## $ month               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2…
## $ cases_count         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_count_pos     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_rate_100K     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_cum_rate_100K <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pop_count_2019      <dbl> 4903185, 4903185, 4903185, 4903185, 4903185, 4903…
## $ pop_density_2019    <dbl> 96.81165, 96.81165, 96.81165, 96.81165, 96.81165,…
## $ percent_age65over   <dbl> 17.33235, 17.33235, 17.33235, 17.33235, 17.33235,…
## $ percent_female      <dbl> 51.67392, 51.67392, 51.67392, 51.67392, 51.67392,…
## $ percent_white       <dbl> 69.12641, 69.12641, 69.12641, 69.12641, 69.12641,…
## $ percent_black       <dbl> 26.78445, 26.78445, 26.78445, 26.78445, 26.78445,…

Aggregate to weekly-level

Once again, for the purposes of modeling, it may be useful to aggregate to the weekly-level:

# COVID-19 data and demographic data
US_cases_long_demogr_week <- US_cases_long_demogr %>%
    group_by(state, week_of_year) %>%
    summarize(pop_count_2019 = mean(pop_count_2019),
              percent_age65over = mean(percent_age65over), 
              percent_female = mean(percent_female), 
              percent_white = mean(percent_white), 
              percent_black = mean(percent_black),
              cases_count_pos = sum(cases_count_pos), 
              cases_rate_100K = sum(cases_rate_100K)) %>%
    drop_na()

Let’s store the data frame in a binary R file so that we can easily access it later:

save(US_cases_long_demogr_week, file = "data_r/US_cases_long_demogr_week.Rdata")