DAY 1: Acquiring and cleaning data (R)
Acquiring data from APIs
Often, we want to acquire data that is stored online. Online data sources live on remote servers: remotely located computers that are optimized to process requests for information. Usually we make requests using a browser, also known as a client, but we can also make requests programmatically. An Application Programming Interface (API) is the part of a remote server that receives requests and sends responses. When we make requests programmatically, the responses an API sends typically contain data in a structured format, such as JSON or XML.
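For example, we could make such a request "by hand" with the httr package. The sketch below assumes the Dataverse native API's dataset metadata endpoint; in what follows we'll let a dedicated R package handle these details instead:
library(httr)
# request dataset metadata from the Dataverse native API (illustration only)
response <- GET("https://dataverse.harvard.edu/api/datasets/:persistentId",
                query = list(persistentId = "doi:10.7910/DVN/HIDLTK"))
status_code(response)                              # 200 indicates success
content(response, as = "text", encoding = "UTF-8") # the JSON response body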
For the project we’ll pursue during DataFest, we’re going to access data stored on the Harvard Dataverse. Dataverse is open source software for building repositories of research data. Once data are stored in a Dataverse repository, they can be accessed programmatically using the Dataverse API. We will use the R package dataverse as an interface to this API.
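If you don’t already have the package, it can be installed from CRAN:
install.packages("dataverse")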
Here are three COVID-19 datasets from the Harvard Dataverse:
- US data on COVID-19 cases and deaths, daily at state-level or county-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HIDLTK
- US data on COVID-19 cases and deaths, daily at metropolitan-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5B8YM8
- World data on COVID-19 cases and deaths, daily at country-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L20LOT
As an example of how to use the Dataverse API, we’re going to extract daily data on COVID-19 cases for the U.S. at the state level (dataset #1 above). These data cover each U.S. state (and the District of Columbia) from January 21st 2020 through November 29th 2020. If you wish, you may choose one of the other datasets for your project.
All we need to get started is the dataset’s digital object identifier (DOI), which appears as the persistentId parameter at the end of the URLs above. We can then use the get_dataset() function to retrieve the metadata for the dataset:
library(dataverse)
# point the package at the Harvard Dataverse installation
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
# get the digital object identifier for the Dataverse dataset
DOI <- "doi:10.7910/DVN/HIDLTK"
# retrieve the dataset's metadata, including its file listing
covid <- get_dataset(DOI)
The covid object is a list of metadata that includes a data frame describing all the files stored within this dataset. Let’s look at the structure of this object:
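# inspect the structure: a compact, one-level summary of the metadata list
str(covid, max.level = 1)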
## List of 16
## $ id : int 218273
## $ datasetId : int 3679837
## $ datasetPersistentId: chr "doi:10.7910/DVN/HIDLTK"
## $ storageIdentifier : chr "s3://10.7910/DVN/HIDLTK"
## $ versionNumber : int 46
## $ versionMinorNumber : int 0
## $ versionState : chr "RELEASED"
## $ UNF : chr "UNF:6:l+QTrceV0xEn3GGLAskwEQ=="
## $ lastUpdateTime : chr "2020-12-01T22:15:13Z"
## $ releaseTime : chr "2020-12-01T22:15:13Z"
## $ createTime : chr "2020-12-01T22:11:46Z"
## $ license : chr "CC0"
## $ termsOfUse : chr "CC0 Waiver"
## $ fileAccessRequest : logi FALSE
## $ metadataBlocks :List of 1
## $ files :'data.frame': 12 obs. of 22 variables:
## - attr(*, "class")= chr "dataverse_dataset"
Let’s dig further and display the available files:
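# file names are stored in the label column of the files data frame
covid$files$label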
## [1] "COUNTY_MAP.zip"
## [2] "INDEX.txt"
## [3] "METRO_MAP_2018_ESRI.zip"
## [4] "METRO_MAP.zip"
## [5] "Metropolitan_statistical_areas_for_US_counties__Sept_2018.xml"
## [6] "Metropolitan_statistical_areas_for_US_counties__Sept_2018.zip"
## [7] "README.txt"
## [8] "STATE_MAP.zip"
## [9] "us_county_confirmed_cases.tab"
## [10] "us_county_deaths_cases.tab"
## [11] "us_state_confirmed_case.tab"
## [12] "us_state_deaths_case.tab"
For our example project, we’re going to use the data on cumulative COVID-19 cases at the state level, contained in the us_state_confirmed_case.tab file. We can use the get_file() function to extract these data into a raw vector:
# get data file for COVID-19 cases
US_cases_file <- get_file("us_state_confirmed_case.tab", dataset = DOI)
To convert the data from the raw vector into a more user-friendly data frame, we can use the read_csv() function from the readr package:
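library(tidyverse) # loads readr (for read_csv), plus dplyr and tidyr used below
# read_csv() can parse the raw vector returned by get_file() directly
US_cases <- read_csv(US_cases_file)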
We can now examine the structure of the data:
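# preview the first six rows of the parsed data
head(US_cases)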
## # A tibble: 6 x 326
## fips NAME POP70 HHD70 POP80 HHD80 POP90 HHD90 POP00 HHD00 POP10
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01 Alab… 3.43e6 1.03e6 3.89e6 1.34e6 4.04e6 1.51e6 4.45e6 1.74e6 4.78e6
## 2 02 Alas… 2.25e5 6.15e4 3.93e5 1.30e5 5.50e5 1.89e5 6.27e5 2.22e5 7.10e5
## 3 04 Ariz… 1.77e6 5.39e5 2.71e6 9.53e5 3.66e6 1.37e6 5.13e6 1.90e6 6.39e6
## 4 05 Arka… 1.90e6 6.08e5 2.25e6 8.06e5 2.35e6 8.91e5 2.67e6 1.04e6 2.92e6
## 5 06 Cali… 1.98e7 6.57e6 2.36e7 8.62e6 2.97e7 1.04e7 3.39e7 1.15e7 3.73e7
## 6 08 Colo… 2.18e6 6.85e5 2.83e6 1.04e6 3.28e6 1.28e6 4.28e6 1.65e6 5.03e6
## # … with 315 more variables: HHD10 <dbl>, `2020-01-21` <dbl>,
## # `2020-01-22` <dbl>, `2020-01-23` <dbl>, `2020-01-24` <dbl>,
## # `2020-01-25` <dbl>, `2020-01-26` <dbl>, `2020-01-27` <dbl>,
## # `2020-01-28` <dbl>, `2020-01-29` <dbl>, `2020-01-30` <dbl>,
## # `2020-01-31` <dbl>, `2020-02-01` <dbl>, `2020-02-02` <dbl>,
## # `2020-02-03` <dbl>, `2020-02-04` <dbl>, `2020-02-05` <dbl>,
## # `2020-02-06` <dbl>, `2020-02-07` <dbl>, `2020-02-08` <dbl>,
## # `2020-02-09` <dbl>, `2020-02-10` <dbl>, `2020-02-11` <dbl>,
## # `2020-02-12` <dbl>, `2020-02-13` <dbl>, `2020-02-14` <dbl>,
## # `2020-02-15` <dbl>, `2020-02-16` <dbl>, `2020-02-17` <dbl>,
## # `2020-02-18` <dbl>, `2020-02-19` <dbl>, `2020-02-20` <dbl>,
## # `2020-02-21` <dbl>, `2020-02-22` <dbl>, `2020-02-23` <dbl>,
## # `2020-02-24` <dbl>, `2020-02-25` <dbl>, `2020-02-26` <dbl>,
## # `2020-02-27` <dbl>, `2020-02-28` <dbl>, `2020-02-29` <dbl>,
## # `2020-03-01` <dbl>, `2020-03-02` <dbl>, `2020-03-03` <dbl>,
## # `2020-03-04` <dbl>, `2020-03-05` <dbl>, `2020-03-06` <dbl>,
## # `2020-03-07` <dbl>, `2020-03-08` <dbl>, `2020-03-09` <dbl>,
## # `2020-03-10` <dbl>, `2020-03-11` <dbl>, `2020-03-12` <dbl>,
## # `2020-03-13` <dbl>, `2020-03-14` <dbl>, `2020-03-15` <dbl>,
## # `2020-03-16` <dbl>, `2020-03-17` <dbl>, `2020-03-18` <dbl>,
## # `2020-03-19` <dbl>, `2020-03-20` <dbl>, `2020-03-21` <dbl>,
## # `2020-03-22` <dbl>, `2020-03-23` <dbl>, `2020-03-24` <dbl>,
## # `2020-03-25` <dbl>, `2020-03-26` <dbl>, `2020-03-27` <dbl>,
## # `2020-03-28` <dbl>, `2020-03-29` <dbl>, `2020-03-30` <dbl>,
## # `2020-03-31` <dbl>, `2020-04-01` <dbl>, `2020-04-02` <dbl>,
## # `2020-04-03` <dbl>, `2020-04-04` <dbl>, `2020-04-05` <dbl>,
## # `2020-04-06` <dbl>, `2020-04-07` <dbl>, `2020-04-08` <dbl>,
## # `2020-04-09` <dbl>, `2020-04-10` <dbl>, `2020-04-11` <dbl>,
## # `2020-04-12` <dbl>, `2020-04-13` <dbl>, `2020-04-14` <dbl>,
## # `2020-04-15` <dbl>, `2020-04-16` <dbl>, `2020-04-17` <dbl>,
## # `2020-04-18` <dbl>, `2020-04-19` <dbl>, `2020-04-20` <dbl>,
## # `2020-04-21` <dbl>, `2020-04-22` <dbl>, `2020-04-23` <dbl>,
## # `2020-04-24` <dbl>, `2020-04-25` <dbl>, `2020-04-26` <dbl>,
## # `2020-04-27` <dbl>, `2020-04-28` <dbl>, …
Cleaning data
COVID-19 cases data
The COVID-19 cases data are in wide format, with individual columns for each day’s case counts. To visualize and analyze the data, it will be much easier to reshape the data so that it is organized in long format, with a single column for case counts and another column indicating the date those counts are associated with.
In addition, it will be useful to derive some time-related variables (e.g., day of year, week of year) from the dates. Finally, we should transform the cumulative case counts into daily counts and create some rate variables by normalizing the counts by population.
# load lubridate for the date helpers used below (ymd, yday, week, month)
library(lubridate)
US_cases_long <- US_cases %>%
  # select columns of interest (the date columns all begin with a digit)
  select(fips, NAME, POP10, matches("^\\d")) %>%
  # rename some columns
  rename(GEOID = fips, state = NAME, pop_count_2010 = POP10) %>%
  # reshape to long format, with one row per state-date combination
  pivot_longer(cols = matches("^\\d"),
               names_to = "date", values_to = "cases_cum") %>%
  # create new derived time variables from the dates
  mutate(date = ymd(date), # year-month-day format
         day_of_year = yday(date),
         week_of_year = week(date),
         month = month(date)) %>%
  group_by(state) %>%
  # derive daily case counts from the cumulative counts
  mutate(cases_count = cases_cum - lag(cases_cum, default = 0),
         # tidy up negative counts by setting them to zero
         cases_count_pos = ifelse(cases_count < 0, 0, cases_count),
         # create case rates per 100,000 people
         cases_rate_100K = (cases_count_pos / pop_count_2010) * 1e5,
         cases_cum_rate_100K = (cases_cum / pop_count_2010) * 1e5)
glimpse(US_cases_long) # 16,014 observations: (50 states + DC) * 314 days
## Rows: 16,014
## Columns: 12
## Groups: state [51]
## $ GEOID <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
## $ pop_count_2010 <dbl> 4779736, 4779736, 4779736, 4779736, 4779736, 4779…
## $ date <date> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, …
## $ cases_cum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_of_year <dbl> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 3…
## $ week_of_year <dbl> 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6…
## $ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2…
## $ cases_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_count_pos <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_rate_100K <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_cum_rate_100K <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Aggregating data
The cleaned data object US_cases_long has 16,014 observations ((50 states + DC) * 314 days). For visualization this should be fine in most cases, but models built on these data may take a long time to run. If we’re mainly interested in longer-term trends, we can get a good approximation by aggregating the data to the weekly level for modeling:
# aggregate to weekly level (for later modeling)
US_cases_long_week <- US_cases_long %>%
group_by(GEOID, state, week_of_year) %>%
summarize(pop_count_2010 = mean(pop_count_2010),
cases_count_pos = sum(cases_count_pos),
cases_rate_100K = sum(cases_rate_100K)) %>%
drop_na()
glimpse(US_cases_long_week)
## Rows: 2,346
## Columns: 6
## Groups: GEOID, state [51]
## $ GEOID <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01",…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
## $ week_of_year <dbl> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ pop_count_2010 <dbl> 4779736, 4779736, 4779736, 4779736, 4779736, 4779736,…
## $ cases_count_pos <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 39, 203, 757, 1198, 1756, 137…
## $ cases_rate_100K <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000…
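As a quick sanity check on the aggregation (a sketch using only the objects created above), the per-state totals of the weekly counts should match those of the daily counts they were built from:
# totals per state should be identical before and after weekly aggregation
daily_totals <- US_cases_long %>%
  group_by(state) %>%
  summarize(total = sum(cases_count_pos))
weekly_totals <- US_cases_long_week %>%
  group_by(state) %>%
  summarize(total = sum(cases_count_pos))
all.equal(daily_totals$total, weekly_totals$total)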