OPTIONAL: U.S. Census data (R)
This section is optional. It provides an example of how to acquire potentially interesting predictors of COVID-19 cases from the U.S. Census Bureau.
The COVID-19 dataset we accessed above provides daily COVID-19 case counts for each U.S State, together with population counts from the 2010 Decennial Census. This should be enough information to produce some interesting visualizations. For modeling, however, we really only have one useful predictor in the dataset — time. This section describes some options for acquiring other potentially interesting predictors of COVID-19 cases.
U.S. Census Bureau API
We may want to use additional demographic information in our visualizations and analysis of the COVID-19 cases. An obvious place to source this information is from the U.S. Census Bureau. There are three U.S. Census Bureau data sources, each with their own API:
- Decennial Census: survey of every household in the U.S. every 10 years — used to calculate population of U.S. geographic areas.
- American Community Survey: yearly representative sample of 3.5 million households — used to calculate population estimates of U.S. geographic areas.
- Population Estimates: yearly population estimates of U.S. geographic areas.
The COVID-19 data from Dataverse already contains population values from the 2010 decennial census. But, using the Census Bureau’s Population Estimates API, we can get updated population data for 2019 as well as population data stratified by age groups, race, and sex.
We’re going to use the tidycensus
package as an interface to the Census Bureau API. A basic usage guide is available — https://walker-data.com/tidycensus/articles/basic-usage.html — but we’ll walk through all the necessary steps.
The first step is to sign-up for an API key: http://api.census.gov/data/key_signup.html. Then give the key a name.
We can then set the API key for our current R session using the census_api_key()
function (or we can include it in an .Renviron
file for future use):
Next, we can use the get_estimates()
function to access the Population Estimates API and extract variables of interest:
pop <- get_estimates(
geography = "state", # we'll select state-level data
product = "population", # provides overall population estimates and population densities
year = 2019, # the latest year available
key = API_key)
glimpse(pop)
## Rows: 104
## Columns: 4
## $ NAME <chr> "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", …
## $ GEOID <chr> "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", …
## $ variable <chr> "POP", "POP", "POP", "POP", "POP", "POP", "POP", "POP", "POP…
## $ value <dbl> 2976149, 6137428, 1068778, 1934408, 3080156, 1359711, 888219…
Get population estimates by age group:
age <- get_estimates(
geography = "state",
product = "characteristics", # provides population estimates stratified by the variable specified in `breakdown`
breakdown = "AGEGROUP", # population estimates for different age groups
breakdown_labels = TRUE, # labels for age groups
year = 2019,
key = API_key)
glimpse(age)
## Rows: 1,664
## Columns: 4
## $ GEOID <chr> "28", "28", "28", "28", "29", "28", "28", "28", "28", "28", …
## $ NAME <chr> "Mississippi", "Mississippi", "Mississippi", "Mississippi", …
## $ value <dbl> 2976149.0, 183478.0, 189377.0, 38.0, 38.9, 206282.0, 201350.…
## $ AGEGROUP <fct> All ages, Age 0 to 4 years, Age 5 to 9 years, Median age, Me…
Get population estimates by sex:
sex <- get_estimates(
geography = "state",
product = "characteristics",
breakdown = "SEX", # population estimates for different sexes
breakdown_labels = TRUE,
year = 2019,
key = API_key)
glimpse(sex)
## Rows: 156
## Columns: 4
## $ GEOID <chr> "28", "28", "28", "29", "29", "29", "30", "30", "30", "31", "31…
## $ NAME <chr> "Mississippi", "Mississippi", "Mississippi", "Missouri", "Misso…
## $ value <dbl> 2976149, 1442292, 1533857, 6137428, 3012662, 3124766, 1068778, …
## $ SEX <chr> "Both sexes", "Male", "Female", "Both sexes", "Male", "Female",…
Get population estimates by race:
race <- get_estimates(
geography = "state",
product = "characteristics",
breakdown ="RACE", # population estimates for different races
breakdown_labels = TRUE,
year = 2019,
key = API_key)
glimpse(race)
## Rows: 613
## Columns: 4
## $ GEOID <chr> "28", "28", "28", "28", "28", "28", "28", "28", "28", "28", "28…
## $ NAME <chr> "Mississippi", "Mississippi", "Mississippi", "Mississippi", "Mi…
## $ value <dbl> 2976149, 1758081, 1124559, 18705, 33032, 1806, 39966, 1792535, …
## $ RACE <chr> "All races", "White alone", "Black alone", "American Indian and…
Clean U.S. Census data
The Census data we extracted contain population estimates for multiple categories of age, race, and sex. It will be useful to simplify these data by creating some derived variables that may be of interest when visualizing and analyzing the data. For example, for each state, we may want to calculate:
- Overall population count and density
- Proportion of people that are 65 years and older
- Proportion of people that are female (or male)
- Proportion of people that are black (or white, or other race)
Overall population estimates:
pop_wide <- pop %>%
# order by GEOID (same as state FIPS code)
arrange(GEOID) %>%
# rename state
rename(state = NAME) %>%
# exclude Puerto Rico
filter(state != "Puerto Rico") %>%
# reshape population variables to wide format
pivot_wider(names_from = variable, values_from = value) %>%
# rename population variables
rename(pop_count_2019 = POP, pop_density_2019 = DENSITY)
glimpse(pop_wide)
## Rows: 51
## Columns: 4
## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor…
## $ GEOID <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11"…
## $ pop_count_2019 <dbl> 4903185, 731545, 7278717, 3017804, 39512223, 5758736…
## $ pop_density_2019 <dbl> 96.811652, 1.281127, 64.043252, 57.992836, 253.52068…
Population estimates by age group:
age_wide <- age %>%
# order by GEOID (same as state FIPS code)
arrange(GEOID) %>%
# rename state
rename(state = NAME) %>%
# reshape the age groups to wide format
pivot_wider(names_from = AGEGROUP, values_from = value) %>%
# create variable for percentortion of people that are 65 years and older
mutate(percent_age65over = (`65 years and over` / `All ages`) * 100) %>%
# select columns of interest
select(GEOID, state, percent_age65over)
glimpse(age_wide)
## Rows: 52
## Columns: 3
## $ GEOID <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11…
## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo…
## $ percent_age65over <dbl> 17.33235, 12.51980, 17.97890, 17.35971, 14.77547, 1…
Population estimates by sex:
sex_wide <- sex %>%
# order by GEOID (same as state FIPS code)
arrange(GEOID) %>%
# rename state
rename(state = NAME) %>%
# reshape the sex categories to wide format
pivot_wider(names_from = SEX, values_from = value) %>%
# create variable for percentortion of people that are female
mutate(percent_female = (Female / `Both sexes`) * 100) %>%
# select columns of interest
select(GEOID, state, percent_female)
glimpse(sex_wide)
## Rows: 52
## Columns: 3
## $ GEOID <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11", …
## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Californi…
## $ percent_female <dbl> 51.67392, 47.86131, 50.30310, 50.90417, 50.28158, 49.6…
Population estimates by race:
race_wide <- race %>%
# order by GEOID (same as state FIPS code)
arrange(GEOID) %>%
# rename state
rename(state = NAME) %>%
# reshape the race categories to wide format
pivot_wider(names_from = RACE, values_from = value) %>%
# create variables for percentortion of people that are black and white
mutate(percent_white = (`White alone` / `All races`) * 100,
percent_black = (`Black alone` / `All races`) * 100) %>%
# select columns of interest
select(GEOID, state, percent_white, percent_black)
glimpse(race_wide)
## Rows: 52
## Columns: 4
## $ GEOID <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11", "…
## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California…
## $ percent_white <dbl> 69.12641, 65.27117, 82.61679, 79.03953, 71.93910, 86.93…
## $ percent_black <dbl> 26.7844473, 3.7055820, 5.1794430, 15.6752393, 6.4606767…
We can now merge all the cleaned Census data into one object called demographics
:
demographics <- list(pop_wide, age_wide, sex_wide, race_wide) %>%
reduce(left_join, by = c("GEOID", "state"))
glimpse(demographics)
## Rows: 51
## Columns: 8
## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo…
## $ GEOID <chr> "01", "02", "04", "05", "06", "08", "09", "10", "11…
## $ pop_count_2019 <dbl> 4903185, 731545, 7278717, 3017804, 39512223, 575873…
## $ pop_density_2019 <dbl> 96.811652, 1.281127, 64.043252, 57.992836, 253.5206…
## $ percent_age65over <dbl> 17.33235, 12.51980, 17.97890, 17.35971, 14.77547, 1…
## $ percent_female <dbl> 51.67392, 47.86131, 50.30310, 50.90417, 50.28158, 4…
## $ percent_white <dbl> 69.12641, 65.27117, 82.61679, 79.03953, 71.93910, 8…
## $ percent_black <dbl> 26.7844473, 3.7055820, 5.1794430, 15.6752393, 6.460…
Combine Census and COVID-19 data
Merge the COVID-19 cases data with Census demographic data:
# merge COVID-19 cases with demographics
US_cases_long_demogr <- US_cases_long %>%
left_join(demographics, by = c("GEOID", "state"))
# update the case rate variables to use population estimates from 2019
US_cases_long_demogr <- US_cases_long_demogr %>%
mutate(cases_cum_rate_100K = (cases_cum / pop_count_2019) * 1e5,
cases_rate_100K = (cases_count_pos / pop_count_2019) * 1e5)
glimpse(US_cases_long_demogr)
## Rows: 16,014
## Columns: 18
## Groups: state [51]
## $ GEOID <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
## $ pop_count_2010 <dbl> 4779736, 4779736, 4779736, 4779736, 4779736, 4779…
## $ date <date> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, …
## $ cases_cum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_of_year <dbl> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 3…
## $ week_of_year <dbl> 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6…
## $ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2…
## $ cases_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_count_pos <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_rate_100K <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_cum_rate_100K <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pop_count_2019 <dbl> 4903185, 4903185, 4903185, 4903185, 4903185, 4903…
## $ pop_density_2019 <dbl> 96.81165, 96.81165, 96.81165, 96.81165, 96.81165,…
## $ percent_age65over <dbl> 17.33235, 17.33235, 17.33235, 17.33235, 17.33235,…
## $ percent_female <dbl> 51.67392, 51.67392, 51.67392, 51.67392, 51.67392,…
## $ percent_white <dbl> 69.12641, 69.12641, 69.12641, 69.12641, 69.12641,…
## $ percent_black <dbl> 26.78445, 26.78445, 26.78445, 26.78445, 26.78445,…
Aggregate to weekly-level
Once again, for the purposes of modeling, it may be useful to aggregate to the weekly-level:
# COVID-19 data and demographic data
US_cases_long_demogr_week <- US_cases_long_demogr %>%
group_by(state, week_of_year) %>%
summarize(pop_count_2019 = mean(pop_count_2019),
percent_age65over = mean(percent_age65over),
percent_female = mean(percent_female),
percent_white = mean(percent_white),
percent_black = mean(percent_black),
cases_count_pos = sum(cases_count_pos),
cases_rate_100K = sum(cases_rate_100K)) %>%
drop_na()
Let’s store the data frame in a binary R file so that we can easily access it later: