OPTIONAL: U.S. Census data (R)

This section is optional. It provides an example of how to acquire potentially interesting predictors of COVID-19 cases from the U.S. Census Bureau.

The COVID-19 dataset we accessed above provides daily COVID-19 case counts for each U.S. state, together with population counts from the 2010 Decennial Census. This should be enough information to produce some interesting visualizations. For modeling, however, the dataset contains only one useful predictor: time. This section describes some options for acquiring other potentially interesting predictors of COVID-19 cases.

U.S. Census Bureau API

We may want to use additional demographic information in our visualizations and analysis of the COVID-19 cases. An obvious source for this information is the U.S. Census Bureau. Three of its data sources are relevant here, each with its own API:

  1. Decennial Census: survey of every household in the U.S. every 10 years — used to calculate population of U.S. geographic areas.
  2. American Community Survey: yearly representative sample of 3.5 million households — used to calculate population estimates of U.S. geographic areas.
  3. Population Estimates: yearly population estimates of U.S. geographic areas.

The COVID-19 data from Dataverse already contains population values from the 2010 Decennial Census. However, using the Census Bureau’s Population Estimates API, we can get updated population data for 2019, as well as population data stratified by age group, race, and sex.

We’re going to use the tidycensus package as an interface to the Census Bureau API. A basic usage guide is available — https://walker-data.com/tidycensus/articles/basic-usage.html — but we’ll walk through all the necessary steps.

The first step is to sign up for an API key: http://api.census.gov/data/key_signup.html. Then store the key in a named R object so that it can be passed to tidycensus.

We can then set the API key for our current R session using the census_api_key() function (or we can include it in an .Renviron file for future use):
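
A minimal sketch of this step, assuming tidycensus is installed. The object name census_key and the placeholder string are illustrative and do not come from the original text:

```r
library(tidycensus)

# Placeholder only: replace with the key the Census Bureau emails to you.
census_key <- "YOUR_CENSUS_API_KEY"

# Register the key for the current R session.
census_api_key(census_key)

# Alternatively, write the key to .Renviron so that future sessions find it:
# census_api_key(census_key, install = TRUE)
```

When install = TRUE is used, the key only becomes visible after restarting R or calling readRenviron("~/.Renviron").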

Next, we can use the get_estimates() function to access the Population Estimates API and extract variables of interest:
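
A request along the following lines should produce output like that shown below. This is a sketch, not the original code: the object name pop is hypothetical, and it assumes an API key has already been registered and that the installed tidycensus release accepts the year argument.

```r
library(tidycensus)
library(dplyr)

# Assumes census_api_key() has already been called in this session.
pop <- get_estimates(
  geography = "state",
  product   = "population",  # total population (POP) and density (DENSITY)
  year      = 2019           # assumption: newer releases may expect `vintage = 2019`
)

glimpse(pop)
```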

## Rows: 104
## Columns: 4
## $ NAME     <chr> "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", …
## $ GEOID    <chr> "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", …
## $ variable <chr> "POP", "POP", "POP", "POP", "POP", "POP", "POP", "POP", "POP…
## $ value    <dbl> 2976149, 6137428, 1068778, 1934408, 3080156, 1359711, 888219…

Get population estimates by age group:
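
For stratified counts, one plausible form of the request uses the "characteristics" product with a breakdown variable. The object name pop_age is hypothetical, and the same caveat about year versus vintage across tidycensus releases applies:

```r
library(tidycensus)
library(dplyr)

# Assumes an API key is already registered.
pop_age <- get_estimates(
  geography        = "state",
  product          = "characteristics",
  breakdown        = "AGEGROUP",
  breakdown_labels = TRUE,   # readable labels such as "Age 0 to 4 years"
  year             = 2019    # assumption: may need to be `vintage = 2019`
)

glimpse(pop_age)
```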

## Rows: 1,664
## Columns: 4
## $ GEOID    <chr> "28", "28", "28", "28", "29", "28", "28", "28", "28", "28", …
## $ NAME     <chr> "Mississippi", "Mississippi", "Mississippi", "Mississippi", …
## $ value    <dbl> 2976149.0, 183478.0, 189377.0, 38.0, 38.9, 206282.0, 201350.…
## $ AGEGROUP <fct> All ages, Age 0 to 4 years, Age 5 to 9 years, Median age, Me…

Get population estimates by sex:
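
The equivalent sketch for sex changes only the breakdown argument (pop_sex is again a hypothetical name):

```r
library(tidycensus)
library(dplyr)

# Assumes an API key is already registered.
pop_sex <- get_estimates(
  geography        = "state",
  product          = "characteristics",
  breakdown        = "SEX",
  breakdown_labels = TRUE,   # "Both sexes", "Male", "Female"
  year             = 2019    # assumption: may need to be `vintage = 2019`
)

glimpse(pop_sex)
```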

## Rows: 156
## Columns: 4
## $ GEOID <chr> "28", "28", "28", "29", "29", "29", "30", "30", "30", "31", "31…
## $ NAME  <chr> "Mississippi", "Mississippi", "Mississippi", "Missouri", "Misso…
## $ value <dbl> 2976149, 1442292, 1533857, 6137428, 3012662, 3124766, 1068778, …
## $ SEX   <chr> "Both sexes", "Male", "Female", "Both sexes", "Male", "Female",…

Get population estimates by race:
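
For race, the breakdown is "RACE". As before, this is a sketch under the assumption that the key is registered, and pop_race is a hypothetical name:

```r
library(tidycensus)
library(dplyr)

# Assumes an API key is already registered.
pop_race <- get_estimates(
  geography        = "state",
  product          = "characteristics",
  breakdown        = "RACE",
  breakdown_labels = TRUE,   # "All races", "White alone", "Black alone", ...
  year             = 2019    # assumption: may need to be `vintage = 2019`
)

glimpse(pop_race)
```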

## Rows: 613
## Columns: 4
## $ GEOID <chr> "28", "28", "28", "28", "28", "28", "28", "28", "28", "28", "28…
## $ NAME  <chr> "Mississippi", "Mississippi", "Mississippi", "Mississippi", "Mi…
## $ value <dbl> 2976149, 1758081, 1124559, 18705, 33032, 1806, 39966, 1792535, …
## $ RACE  <chr> "All races", "White alone", "Black alone", "American Indian and…

Clean U.S. Census data

The Census data we extracted contain population estimates for multiple categories of age, race, and sex. It will be useful to simplify these data by creating some derived variables that may be of interest when visualizing and analyzing the data. For example, for each state, we may want to calculate:

  1. Overall population count and density
  2. Proportion of people that are 65 years and older
  3. Proportion of people that are female (or male)
  4. Proportion of people that are Black (or White, or another race)

Overall population estimates:
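
One way to get from the long format (one row per state and variable) to one row per state is to pivot the POP and DENSITY rows into columns. The sketch below uses a two-state stand-in whose values are copied from the output in this section. The object names pop_long and pop_clean are hypothetical.

```r
library(dplyr)
library(tidyr)

# Stand-in for the long-format API result (Alabama and Alaska only).
pop_long <- tibble(
  NAME     = c("Alabama", "Alabama", "Alaska", "Alaska"),
  GEOID    = c("01", "01", "02", "02"),
  variable = c("POP", "DENSITY", "POP", "DENSITY"),
  value    = c(4903185, 96.811652, 731545, 1.281127)
)

# Spread the two variables into columns and give them descriptive names.
pop_clean <- pop_long %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  select(state = NAME, GEOID,
         pop_count_2019 = POP, pop_density_2019 = DENSITY)

glimpse(pop_clean)
```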

## Rows: 51
## Columns: 4
## $ state            <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor…
## $ GEOID            <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11"…
## $ pop_count_2019   <dbl> 4903185, 731545, 7278717, 3017804, 39512223, 5758736…
## $ pop_density_2019 <dbl> 96.811652, 1.281127, 64.043252, 57.992836, 253.52068…

Population estimates by age group:
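
The share of people aged 65 and over can be computed by dividing the count for that age group by the "All ages" count within each state. The sketch below runs on invented round numbers for two made-up states (not Census figures), and it assumes that the labelled data include a summary group called "65 years and over". If that label is absent, the individual 65+ age bands would need to be summed instead.

```r
library(dplyr)

# Invented toy rows, for illustration only (not real Census values).
age_long <- tibble(
  GEOID    = c("X1", "X1", "X2", "X2"),
  NAME     = c("Example A", "Example A", "Example B", "Example B"),
  value    = c(1000, 170, 2000, 250),
  AGEGROUP = c("All ages", "65 years and over",
               "All ages", "65 years and over")
)

# Within each state, divide the 65+ count by the total count.
age_clean <- age_long %>%
  group_by(GEOID, state = NAME) %>%
  summarize(
    percent_age65over = 100 * value[AGEGROUP == "65 years and over"] /
      value[AGEGROUP == "All ages"],
    .groups = "drop"
  )

glimpse(age_clean)
```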

## Rows: 52
## Columns: 3
## $ GEOID             <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11…
## $ state             <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo…
## $ percent_age65over <dbl> 17.33235, 12.51980, 17.97890, 17.35971, 14.77547, 1…

Population estimates by sex:
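
A comparable sketch for sex pivots the three categories into columns and divides. The Mississippi and Missouri counts are copied from the output earlier in this section. The names sex_long and sex_clean are hypothetical.

```r
library(dplyr)
library(tidyr)

# Stand-in for the long-format result by sex (two states only).
sex_long <- tibble(
  GEOID = rep(c("28", "29"), each = 3),
  NAME  = rep(c("Mississippi", "Missouri"), each = 3),
  value = c(2976149, 1442292, 1533857,
            6137428, 3012662, 3124766),
  SEX   = rep(c("Both sexes", "Male", "Female"), times = 2)
)

# One column per category, then the female share of the total.
sex_clean <- sex_long %>%
  pivot_wider(names_from = SEX, values_from = value) %>%
  mutate(percent_female = 100 * Female / `Both sexes`) %>%
  select(GEOID, state = NAME, percent_female)

glimpse(sex_clean)
```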

## Rows: 52
## Columns: 3
## $ GEOID          <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11", …
## $ state          <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Californi…
## $ percent_female <dbl> 51.67392, 47.86131, 50.30310, 50.90417, 50.28158, 49.6…

Population estimates by race:
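
The race percentages can be sketched the same way, here with "White alone" and "Black alone" as numerators and "All races" as the denominator. The Mississippi counts come from the output earlier in this section. Using the "alone" categories is an assumption about how the percentages were defined.

```r
library(dplyr)
library(tidyr)

# Stand-in for the long-format result by race (one state, three categories).
race_long <- tibble(
  GEOID = "28",
  NAME  = "Mississippi",
  value = c(2976149, 1758081, 1124559),
  RACE  = c("All races", "White alone", "Black alone")
)

# One column per category, then each group's share of the total.
race_clean <- race_long %>%
  pivot_wider(names_from = RACE, values_from = value) %>%
  mutate(
    percent_white = 100 * `White alone` / `All races`,
    percent_black = 100 * `Black alone` / `All races`
  ) %>%
  select(GEOID, state = NAME, percent_white, percent_black)

glimpse(race_clean)
```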

## Rows: 52
## Columns: 4
## $ GEOID         <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11", "…
## $ state         <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California…
## $ percent_white <dbl> 69.12641, 65.27117, 82.61679, 79.03953, 71.93910, 86.93…
## $ percent_black <dbl> 26.7844473, 3.7055820, 5.1794430, 15.6752393, 6.4606767…

We can now merge all the cleaned Census data into one object called demographics:
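
A chain of joins on GEOID and state is one way to do this. The sketch below uses two-row stand-ins with the Alabama and Alaska values shown above. The names of the four input objects are hypothetical. Starting from the population table with left_join() keeps only the states present in that table.

```r
library(dplyr)

# Two-state stand-ins for the four cleaned tables.
pop_clean <- tibble(
  state = c("Alabama", "Alaska"), GEOID = c("01", "02"),
  pop_count_2019   = c(4903185, 731545),
  pop_density_2019 = c(96.811652, 1.281127)
)
age_clean <- tibble(
  GEOID = c("01", "02"), state = c("Alabama", "Alaska"),
  percent_age65over = c(17.33235, 12.51980)
)
sex_clean <- tibble(
  GEOID = c("01", "02"), state = c("Alabama", "Alaska"),
  percent_female = c(51.67392, 47.86131)
)
race_clean <- tibble(
  GEOID = c("01", "02"), state = c("Alabama", "Alaska"),
  percent_white = c(69.12641, 65.27117),
  percent_black = c(26.7844473, 3.7055820)
)

# Join everything onto the population table, matching on both identifiers.
demographics <- pop_clean %>%
  left_join(age_clean,  by = c("GEOID", "state")) %>%
  left_join(sex_clean,  by = c("GEOID", "state")) %>%
  left_join(race_clean, by = c("GEOID", "state"))

glimpse(demographics)
```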

## Rows: 51
## Columns: 8
## $ state             <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo…
## $ GEOID             <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11…
## $ pop_count_2019    <dbl> 4903185, 731545, 7278717, 3017804, 39512223, 575873…
## $ pop_density_2019  <dbl> 96.811652, 1.281127, 64.043252, 57.992836, 253.5206…
## $ percent_age65over <dbl> 17.33235, 12.51980, 17.97890, 17.35971, 14.77547, 1…
## $ percent_female    <dbl> 51.67392, 47.86131, 50.30310, 50.90417, 50.28158, 4…
## $ percent_white     <dbl> 69.12641, 65.27117, 82.61679, 79.03953, 71.93910, 8…
## $ percent_black     <dbl> 26.7844473, 3.7055820, 5.1794430, 15.6752393, 6.460…

Combine Census and COVID-19 data

Merge the COVID-19 cases data with Census demographic data:
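
Because the demographic table has one row per state and the cases table has one row per state and day, a left join repeats each state's demographics on every daily row. In this sketch, covid_cases and covid_demo are hypothetical names, and only a few of the columns are reproduced, using Alabama values from the output in this section.

```r
library(dplyr)

# Stand-in for the daily cases data (two days for one state).
covid_cases <- tibble(
  GEOID          = "01",
  state          = "Alabama",
  pop_count_2010 = 4779736,
  date           = as.Date(c("2020-01-21", "2020-01-22")),
  cases_cum      = c(0, 0)
)

# Stand-in for the state-level demographics.
demographics <- tibble(
  state = "Alabama", GEOID = "01",
  pop_count_2019   = 4903185,
  pop_density_2019 = 96.811652
)

# Keep every daily row and attach the matching state's demographics.
covid_demo <- covid_cases %>%
  left_join(demographics, by = c("GEOID", "state"))

glimpse(covid_demo)
```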

## Rows: 16,014
## Columns: 18
## Groups: state [51]
## $ GEOID               <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ state               <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
## $ pop_count_2010      <dbl> 4779736, 4779736, 4779736, 4779736, 4779736, 4779…
## $ date                <date> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, …
## $ cases_cum           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_of_year         <dbl> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 3…
## $ week_of_year        <dbl> 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6…
## $ month               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2…
## $ cases_count         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_count_pos     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_rate_100K     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_cum_rate_100K <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pop_count_2019      <dbl> 4903185, 4903185, 4903185, 4903185, 4903185, 4903…
## $ pop_density_2019    <dbl> 96.81165, 96.81165, 96.81165, 96.81165, 96.81165,…
## $ percent_age65over   <dbl> 17.33235, 17.33235, 17.33235, 17.33235, 17.33235,…
## $ percent_female      <dbl> 51.67392, 51.67392, 51.67392, 51.67392, 51.67392,…
## $ percent_white       <dbl> 69.12641, 69.12641, 69.12641, 69.12641, 69.12641,…
## $ percent_black       <dbl> 26.78445, 26.78445, 26.78445, 26.78445, 26.78445,…