DAY 1: Acquiring and cleaning data (R)

Acquiring data from APIs

Often, we want to acquire data that is stored online. Online data sources live on a remote server, a remotely located computer that is optimized to process requests for information. Usually we make requests using a client such as a web browser, but we can also make requests programmatically. An Application Programming Interface (API) is the part of a remote server that receives requests and sends responses. When we make requests programmatically, the responses an API sends typically contain data in a structured format such as JSON or XML.

For the project we’ll pursue during DataFest, we’re going to access data stored on the Harvard Dataverse. Dataverse is open-source software for storing and sharing research data. Once data are stored in a Dataverse repository, they can be accessed programmatically through the Dataverse API. We will use the R package dataverse as an interface to this API.

Here are three COVID-19 datasets from the Harvard Dataverse:

  1. US data on COVID-19 cases and deaths, daily at state-level or county-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HIDLTK
  2. US data on COVID-19 cases and deaths, daily at metropolitan-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5B8YM8
  3. World data on COVID-19 cases and deaths, daily at country-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L20LOT

As an example of how to use the Dataverse API, we’re going to extract daily data on COVID-19 cases at the state level for the U.S. (dataset #1 above). These data cover each U.S. state (and the District of Columbia) from January 21, 2020 through November 29, 2020. If you wish, you may choose one of the other datasets for your project.

All we need to get started is the dataset’s digital object identifier (DOI), which appears as the persistentId parameter at the end of the URLs above. We can then pass this DOI to the get_dataset() function to retrieve the dataset’s metadata:
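A minimal sketch of this step might look as follows (the dataverse package reads the target installation from the DATAVERSE_SERVER environment variable; it can also be supplied via the server argument):

library(dataverse)

# point the package at the Harvard Dataverse installation
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

# retrieve the dataset's metadata by its DOI
covid <- get_dataset("doi:10.7910/DVN/HIDLTK")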

The covid object is a list of metadata that includes a data frame of all the files stored within this dataset. Let’s look at the structure of this object:
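For example, using str() with max.level = 1 so the nested elements stay collapsed:

str(covid, max.level = 1)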

## List of 16
##  $ id                 : int 218273
##  $ datasetId          : int 3679837
##  $ datasetPersistentId: chr "doi:10.7910/DVN/HIDLTK"
##  $ storageIdentifier  : chr "s3://10.7910/DVN/HIDLTK"
##  $ versionNumber      : int 46
##  $ versionMinorNumber : int 0
##  $ versionState       : chr "RELEASED"
##  $ UNF                : chr "UNF:6:l+QTrceV0xEn3GGLAskwEQ=="
##  $ lastUpdateTime     : chr "2020-12-01T22:15:13Z"
##  $ releaseTime        : chr "2020-12-01T22:15:13Z"
##  $ createTime         : chr "2020-12-01T22:11:46Z"
##  $ license            : chr "CC0"
##  $ termsOfUse         : chr "CC0 Waiver"
##  $ fileAccessRequest  : logi FALSE
##  $ metadataBlocks     :List of 1
##  $ files              :'data.frame':	12 obs. of  22 variables:
##  - attr(*, "class")= chr "dataverse_dataset"

Let’s dig further and display the available files:
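The files element of the metadata is a data frame with one row per file; listing the file names (here via its label column, one of several identifying columns it contains) might look like:

covid$files$label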

##  [1] "COUNTY_MAP.zip"                                               
##  [2] "INDEX.txt"                                                    
##  [3] "METRO_MAP_2018_ESRI.zip"                                      
##  [4] "METRO_MAP.zip"                                                
##  [5] "Metropolitan_statistical_areas_for_US_counties__Sept_2018.xml"
##  [6] "Metropolitan_statistical_areas_for_US_counties__Sept_2018.zip"
##  [7] "README.txt"                                                   
##  [8] "STATE_MAP.zip"                                                
##  [9] "us_county_confirmed_cases.tab"                                
## [10] "us_county_deaths_cases.tab"                                   
## [11] "us_state_confirmed_case.tab"                                  
## [12] "us_state_deaths_case.tab"

For our example project, we’re going to use the data on cumulative COVID-19 cases at the state-level contained in the us_state_confirmed_case.tab file. We can use the get_file() function to extract these data into a raw vector:
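A sketch of this step, requesting the file by name from the same dataset DOI (the object name US_cases_raw is our choice):

# download the file contents as a raw vector
US_cases_raw <- get_file(
  file    = "us_state_confirmed_case.tab",
  dataset = "doi:10.7910/DVN/HIDLTK"
)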

To convert the data from the raw vector into a more user-friendly data frame, we can use the read_csv() function from the readr package:
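read_csv() accepts a raw vector directly, so (assuming the raw vector from the previous step is named US_cases_raw and the underlying file is comma-separated):

library(readr)

US_cases <- read_csv(US_cases_raw)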

We can now examine the structure of the data:
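For example, printing the first few rows (assuming the data frame is named US_cases):

head(US_cases)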

## # A tibble: 6 x 326
##   fips  NAME   POP70  HHD70  POP80  HHD80  POP90  HHD90  POP00  HHD00  POP10
##   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 01    Alab… 3.43e6 1.03e6 3.89e6 1.34e6 4.04e6 1.51e6 4.45e6 1.74e6 4.78e6
## 2 02    Alas… 2.25e5 6.15e4 3.93e5 1.30e5 5.50e5 1.89e5 6.27e5 2.22e5 7.10e5
## 3 04    Ariz… 1.77e6 5.39e5 2.71e6 9.53e5 3.66e6 1.37e6 5.13e6 1.90e6 6.39e6
## 4 05    Arka… 1.90e6 6.08e5 2.25e6 8.06e5 2.35e6 8.91e5 2.67e6 1.04e6 2.92e6
## 5 06    Cali… 1.98e7 6.57e6 2.36e7 8.62e6 2.97e7 1.04e7 3.39e7 1.15e7 3.73e7
## 6 08    Colo… 2.18e6 6.85e5 2.83e6 1.04e6 3.28e6 1.28e6 4.28e6 1.65e6 5.03e6
## # … with 315 more variables: HHD10 <dbl>, `2020-01-21` <dbl>,
## #   `2020-01-22` <dbl>, `2020-01-23` <dbl>, `2020-01-24` <dbl>,
## #   `2020-01-25` <dbl>, `2020-01-26` <dbl>, `2020-01-27` <dbl>,
## #   `2020-01-28` <dbl>, `2020-01-29` <dbl>, `2020-01-30` <dbl>,
## #   `2020-01-31` <dbl>, `2020-02-01` <dbl>, `2020-02-02` <dbl>,
## #   `2020-02-03` <dbl>, `2020-02-04` <dbl>, `2020-02-05` <dbl>,
## #   `2020-02-06` <dbl>, `2020-02-07` <dbl>, `2020-02-08` <dbl>,
## #   `2020-02-09` <dbl>, `2020-02-10` <dbl>, `2020-02-11` <dbl>,
## #   `2020-02-12` <dbl>, `2020-02-13` <dbl>, `2020-02-14` <dbl>,
## #   `2020-02-15` <dbl>, `2020-02-16` <dbl>, `2020-02-17` <dbl>,
## #   `2020-02-18` <dbl>, `2020-02-19` <dbl>, `2020-02-20` <dbl>,
## #   `2020-02-21` <dbl>, `2020-02-22` <dbl>, `2020-02-23` <dbl>,
## #   `2020-02-24` <dbl>, `2020-02-25` <dbl>, `2020-02-26` <dbl>,
## #   `2020-02-27` <dbl>, `2020-02-28` <dbl>, `2020-02-29` <dbl>,
## #   `2020-03-01` <dbl>, `2020-03-02` <dbl>, `2020-03-03` <dbl>,
## #   `2020-03-04` <dbl>, `2020-03-05` <dbl>, `2020-03-06` <dbl>,
## #   `2020-03-07` <dbl>, `2020-03-08` <dbl>, `2020-03-09` <dbl>,
## #   `2020-03-10` <dbl>, `2020-03-11` <dbl>, `2020-03-12` <dbl>,
## #   `2020-03-13` <dbl>, `2020-03-14` <dbl>, `2020-03-15` <dbl>,
## #   `2020-03-16` <dbl>, `2020-03-17` <dbl>, `2020-03-18` <dbl>,
## #   `2020-03-19` <dbl>, `2020-03-20` <dbl>, `2020-03-21` <dbl>,
## #   `2020-03-22` <dbl>, `2020-03-23` <dbl>, `2020-03-24` <dbl>,
## #   `2020-03-25` <dbl>, `2020-03-26` <dbl>, `2020-03-27` <dbl>,
## #   `2020-03-28` <dbl>, `2020-03-29` <dbl>, `2020-03-30` <dbl>,
## #   `2020-03-31` <dbl>, `2020-04-01` <dbl>, `2020-04-02` <dbl>,
## #   `2020-04-03` <dbl>, `2020-04-04` <dbl>, `2020-04-05` <dbl>,
## #   `2020-04-06` <dbl>, `2020-04-07` <dbl>, `2020-04-08` <dbl>,
## #   `2020-04-09` <dbl>, `2020-04-10` <dbl>, `2020-04-11` <dbl>,
## #   `2020-04-12` <dbl>, `2020-04-13` <dbl>, `2020-04-14` <dbl>,
## #   `2020-04-15` <dbl>, `2020-04-16` <dbl>, `2020-04-17` <dbl>,
## #   `2020-04-18` <dbl>, `2020-04-19` <dbl>, `2020-04-20` <dbl>,
## #   `2020-04-21` <dbl>, `2020-04-22` <dbl>, `2020-04-23` <dbl>,
## #   `2020-04-24` <dbl>, `2020-04-25` <dbl>, `2020-04-26` <dbl>,
## #   `2020-04-27` <dbl>, `2020-04-28` <dbl>, …

Cleaning data

COVID-19 cases data

The COVID-19 case data are in wide format, with a separate column for each day’s case counts. For visualization and analysis, it will be much easier to reshape the data into long format, with a single column for case counts and another column for the date each count refers to.

In addition, it will be useful to derive some time-related variables (e.g., day of year, week of year) from the dates. Finally, we should convert the cumulative case counts into daily counts of new cases and create rate variables by normalizing the counts by population.
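A sketch of these steps with dplyr, tidyr, and lubridate follows. The renaming of fips, NAME, and POP10 to GEOID, state, and pop_count_2010, and the exact definitions of the derived variables (lagged differences for daily counts, negative corrections floored at zero, rates per 100,000 of the 2010 population), are assumptions about how the cleaned data were produced:

library(dplyr)
library(tidyr)
library(lubridate)

US_cases_long <- US_cases %>%
  # keep the identifiers, the 2010 population, and the daily date columns
  select(GEOID = fips, state = NAME, pop_count_2010 = POP10,
         matches("^\\d{4}-\\d{2}-\\d{2}$")) %>%
  # wide to long: one row per state-date combination
  pivot_longer(cols = -c(GEOID, state, pop_count_2010),
               names_to = "date", values_to = "cases_cum") %>%
  # derive time-related variables from the date
  mutate(date = as_date(date),
         day_of_year = yday(date),
         week_of_year = week(date),
         month = month(date)) %>%
  group_by(state) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # daily new cases from the cumulative series
    cases_count = cases_cum - lag(cases_cum, default = 0),
    # reporting corrections can make daily counts negative; floor them at zero
    cases_count_pos = pmax(cases_count, 0),
    # rates per 100,000 residents (2010 population)
    cases_rate_100K = cases_count_pos / pop_count_2010 * 1e5,
    cases_cum_rate_100K = cases_cum / pop_count_2010 * 1e5)

glimpse(US_cases_long)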

## Rows: 16,014
## Columns: 12
## Groups: state [51]
## $ GEOID               <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ state               <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
## $ pop_count_2010      <dbl> 4779736, 4779736, 4779736, 4779736, 4779736, 4779…
## $ date                <date> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, …
## $ cases_cum           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_of_year         <dbl> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 3…
## $ week_of_year        <dbl> 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6…
## $ month               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2…
## $ cases_count         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_count_pos     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_rate_100K     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_cum_rate_100K <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Aggregate data

The cleaned data object US_cases_long has 16,014 observations ((50 states + DC) × 314 days). For visualization, this should be fine in most cases. When we come to build models for these data, however, they may take a long time to run. If we’re mainly interested in longer-term trends, we can probably get a good approximation by aggregating the data to the weekly level for modeling:
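One way this aggregation might look (summing the daily new cases and daily rates within each state-week is an assumption about how the weekly figures were derived):

US_cases_long_week <- US_cases_long %>%
  group_by(GEOID, state, week_of_year) %>%
  summarise(pop_count_2010  = first(pop_count_2010),
            cases_count_pos = sum(cases_count_pos),
            cases_rate_100K = sum(cases_rate_100K),
            # keep the result grouped by GEOID and state
            .groups = "drop_last")

glimpse(US_cases_long_week)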

## Rows: 2,346
## Columns: 6
## Groups: GEOID, state [51]
## $ GEOID           <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01",…
## $ state           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
## $ week_of_year    <dbl> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ pop_count_2010  <dbl> 4779736, 4779736, 4779736, 4779736, 4779736, 4779736,…
## $ cases_count_pos <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 39, 203, 757, 1198, 1756, 137…
## $ cases_rate_100K <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000…