DAY 1: Acquiring and cleaning data (R)
Acquiring data from APIs
Often, we want to acquire data that is stored online. Online data sources live on remote servers: remotely located computers that are optimized to process requests for information. Usually we make requests using a browser, also known as a client, but we can also make requests programmatically. An Application Programming Interface (API) is the part of a remote server that receives requests and sends responses. When we make requests programmatically, the responses an API sends typically contain data in a structured format, such as JSON or XML.
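For example, we could make such a request "by hand" with the httr package. The sketch below assumes the Dataverse native API's dataset metadata endpoint; in what follows we'll let a dedicated R package handle these details instead:
library(httr)
# request dataset metadata from the Dataverse native API (illustration only)
response <- GET("https://dataverse.harvard.edu/api/datasets/:persistentId",
                query = list(persistentId = "doi:10.7910/DVN/HIDLTK"))
status_code(response)                              # 200 indicates success
content(response, as = "text", encoding = "UTF-8") # the JSON response body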
For the project we’ll pursue during DataFest, we’re going to access data stored on the Harvard Dataverse. Dataverse is open source software for building repositories of research data. Once data are stored in a Dataverse repository, they can be accessed programmatically using the Dataverse API. We will use the R package dataverse as an interface to this API.
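If you don’t already have the package, it can be installed from CRAN:
install.packages("dataverse")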
Here are three COVID-19 datasets from the Harvard Dataverse:
- US data on COVID-19 cases and deaths, daily at state-level or county-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HIDLTK
- US data on COVID-19 cases and deaths, daily at metropolitan-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5B8YM8
- World data on COVID-19 cases and deaths, daily at country-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L20LOT
As an example of how to use the Dataverse API, we’re going to extract daily data on COVID-19 cases for the U.S. at the state level (dataset #1 above). These data cover each U.S. state (and the District of Columbia) from January 21st 2020 through November 29th 2020. If you wish, you may choose one of the other datasets for your project.
All we need to get started is the dataset’s digital object identifier (DOI), which appears as the persistentId parameter at the end of the URLs above. We can then use the get_dataset() function to retrieve the metadata for the dataset:
library(dataverse)
# point the package at the Harvard Dataverse installation
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
# get the digital object identifier for the Dataverse dataset
DOI <- "doi:10.7910/DVN/HIDLTK"
# retrieve the dataset's metadata, including its file listing
covid <- get_dataset(DOI)
The covid object is a list of metadata that includes a data frame describing all the files stored within this dataset. Let’s look at the structure of this object:
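# inspect the structure: a compact, one-level summary of the metadata list
str(covid, max.level = 1)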
## List of 16
## $ id : int 218273
## $ datasetId : int 3679837
## $ datasetPersistentId: chr "doi:10.7910/DVN/HIDLTK"
## $ storageIdentifier : chr "s3://10.7910/DVN/HIDLTK"
## $ versionNumber : int 46
## $ versionMinorNumber : int 0
## $ versionState : chr "RELEASED"
## $ UNF : chr "UNF:6:l+QTrceV0xEn3GGLAskwEQ=="
## $ lastUpdateTime : chr "2020-12-01T22:15:13Z"
## $ releaseTime : chr "2020-12-01T22:15:13Z"
## $ createTime : chr "2020-12-01T22:11:46Z"
## $ license : chr "CC0"
## $ termsOfUse : chr "CC0 Waiver"
## $ fileAccessRequest : logi FALSE
## $ metadataBlocks :List of 1
## $ files :'data.frame': 12 obs. of 22 variables:
## - attr(*, "class")= chr "dataverse_dataset"
Let’s dig further and display the available files:
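# file names are stored in the label column of the files data frame
covid$files$label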
## [1] "COUNTY_MAP.zip"
## [2] "INDEX.txt"
## [3] "METRO_MAP_2018_ESRI.zip"
## [4] "METRO_MAP.zip"
## [5] "Metropolitan_statistical_areas_for_US_counties__Sept_2018.xml"
## [6] "Metropolitan_statistical_areas_for_US_counties__Sept_2018.zip"
## [7] "README.txt"
## [8] "STATE_MAP.zip"
## [9] "us_county_confirmed_cases.tab"
## [10] "us_county_deaths_cases.tab"
## [11] "us_state_confirmed_case.tab"
## [12] "us_state_deaths_case.tab"
For our example project, we’re going to use the data on cumulative COVID-19 cases at the state level, contained in the us_state_confirmed_case.tab file. We can use the get_file() function to extract these data into a raw vector:
# get data file for COVID-19 cases
US_cases_file <- get_file("us_state_confirmed_case.tab", dataset = DOI)
To convert the data from the raw vector into a more user-friendly data frame, we can use the read_csv() function from the readr package:
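library(tidyverse) # loads readr (for read_csv), plus dplyr and tidyr used below
# read_csv() can parse the raw vector returned by get_file() directly
US_cases <- read_csv(US_cases_file)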
We can now examine the structure of the data:
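# preview the first six rows of the parsed data
head(US_cases)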
## # A tibble: 6 x 326
## fips NAME POP70 HHD70 POP80 HHD80 POP90 HHD90 POP00 HHD00 POP10
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01 Alab… 3.43e6 1.03e6 3.89e6 1.34e6 4.04e6 1.51e6 4.45e6 1.74e6 4.78e6
## 2 02 Alas… 2.25e5 6.15e4 3.93e5 1.30e5 5.50e5 1.89e5 6.27e5 2.22e5 7.10e5
## 3 04 Ariz… 1.77e6 5.39e5 2.71e6 9.53e5 3.66e6 1.37e6 5.13e6 1.90e6 6.39e6
## 4 05 Arka… 1.90e6 6.08e5 2.25e6 8.06e5 2.35e6 8.91e5 2.67e6 1.04e6 2.92e6
## 5 06 Cali… 1.98e7 6.57e6 2.36e7 8.62e6 2.97e7 1.04e7 3.39e7 1.15e7 3.73e7
## 6 08 Colo… 2.18e6 6.85e5 2.83e6 1.04e6 3.28e6 1.28e6 4.28e6 1.65e6 5.03e6
## # … with 315 more variables: HHD10 <dbl>, `2020-01-21` <dbl>,
## # `2020-01-22` <dbl>, `2020-01-23` <dbl>, `2020-01-24` <dbl>,
## # `2020-01-25` <dbl>, `2020-01-26` <dbl>, `2020-01-27` <dbl>,
## # `2020-01-28` <dbl>, `2020-01-29` <dbl>, `2020-01-30` <dbl>,
## # `2020-01-31` <dbl>, `2020-02-01` <dbl>, `2020-02-02` <dbl>,
## # `2020-02-03` <dbl>, `2020-02-04` <dbl>, `2020-02-05` <dbl>,
## # `2020-02-06` <dbl>, `2020-02-07` <dbl>, `2020-02-08` <dbl>,
## # `2020-02-09` <dbl>, `2020-02-10` <dbl>, `2020-02-11` <dbl>,
## # `2020-02-12` <dbl>, `2020-02-13` <dbl>, `2020-02-14` <dbl>,
## # `2020-02-15` <dbl>, `2020-02-16` <dbl>, `2020-02-17` <dbl>,
## # `2020-02-18` <dbl>, `2020-02-19` <dbl>, `2020-02-20` <dbl>,
## # `2020-02-21` <dbl>, `2020-02-22` <dbl>, `2020-02-23` <dbl>,
## # `2020-02-24` <dbl>, `2020-02-25` <dbl>, `2020-02-26` <dbl>,
## # `2020-02-27` <dbl>, `2020-02-28` <dbl>, `2020-02-29` <dbl>,
## # `2020-03-01` <dbl>, `2020-03-02` <dbl>, `2020-03-03` <dbl>,
## # `2020-03-04` <dbl>, `2020-03-05` <dbl>, `2020-03-06` <dbl>,
## # `2020-03-07` <dbl>, `2020-03-08` <dbl>, `2020-03-09` <dbl>,
## # `2020-03-10` <dbl>, `2020-03-11` <dbl>, `2020-03-12` <dbl>,
## # `2020-03-13` <dbl>, `2020-03-14` <dbl>, `2020-03-15` <dbl>,
## # `2020-03-16` <dbl>, `2020-03-17` <dbl>, `2020-03-18` <dbl>,
## # `2020-03-19` <dbl>, `2020-03-20` <dbl>, `2020-03-21` <dbl>,
## # `2020-03-22` <dbl>, `2020-03-23` <dbl>, `2020-03-24` <dbl>,
## # `2020-03-25` <dbl>, `2020-03-26` <dbl>, `2020-03-27` <dbl>,
## # `2020-03-28` <dbl>, `2020-03-29` <dbl>, `2020-03-30` <dbl>,
## # `2020-03-31` <dbl>, `2020-04-01` <dbl>, `2020-04-02` <dbl>,
## # `2020-04-03` <dbl>, `2020-04-04` <dbl>, `2020-04-05` <dbl>,
## # `2020-04-06` <dbl>, `2020-04-07` <dbl>, `2020-04-08` <dbl>,
## # `2020-04-09` <dbl>, `2020-04-10` <dbl>, `2020-04-11` <dbl>,
## # `2020-04-12` <dbl>, `2020-04-13` <dbl>, `2020-04-14` <dbl>,
## # `2020-04-15` <dbl>, `2020-04-16` <dbl>, `2020-04-17` <dbl>,
## # `2020-04-18` <dbl>, `2020-04-19` <dbl>, `2020-04-20` <dbl>,
## # `2020-04-21` <dbl>, `2020-04-22` <dbl>, `2020-04-23` <dbl>,
## # `2020-04-24` <dbl>, `2020-04-25` <dbl>, `2020-04-26` <dbl>,
## # `2020-04-27` <dbl>, `2020-04-28` <dbl>, …
Cleaning data
COVID-19 cases data
The COVID-19 cases data are in wide format, with individual columns for each day’s case counts. To visualize and analyze the data, it will be much easier to reshape the data so that it is organized in long format, with a single column for case counts and another column indicating the date those counts are associated with.
In addition, it will be useful to derive some time-related variables (e.g., day of year, week of year) from the dates. Finally, we should transform the cumulative case counts into daily counts and create some rate variables by normalizing the counts by population.
# load lubridate for the date helpers used below (ymd, yday, week, month)
library(lubridate)
US_cases_long <- US_cases %>%
  # select columns of interest (the date columns all begin with a digit)
  select(fips, NAME, POP10, matches("^\\d")) %>%
  # rename some columns
  rename(GEOID = fips, state = NAME, pop_count_2010 = POP10) %>%
  # reshape to long format, with one row per state-date combination
  pivot_longer(cols = matches("^\\d"),
               names_to = "date", values_to = "cases_cum") %>%
  # create new derived time variables from the dates
  mutate(date = ymd(date), # year-month-day format
         day_of_year = yday(date),
         week_of_year = week(date),
         month = month(date)) %>%
  group_by(state) %>%
  # derive daily case counts from the cumulative counts
  mutate(cases_count = cases_cum - lag(cases_cum, default = 0),
         # tidy up negative counts by setting them to zero
         cases_count_pos = ifelse(cases_count < 0, 0, cases_count),
         # create case rates per 100,000 people
         cases_rate_100K = (cases_count_pos / pop_count_2010) * 1e5,
         cases_cum_rate_100K = (cases_cum / pop_count_2010) * 1e5)
glimpse(US_cases_long) # 16,014 observations: (50 states + DC) * 314 days
## Rows: 16,014
## Columns: 12
## Groups: state [51]
## $ GEOID <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
## $ pop_count_2010 <dbl> 4779736, 4779736, 4779736, 4779736, 4779736, 4779…
## $ date <date> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, …
## $ cases_cum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_of_year <dbl> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 3…
## $ week_of_year <dbl> 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6…
## $ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2…
## $ cases_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_count_pos <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_rate_100K <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cases_cum_rate_100K <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Aggregating data
The cleaned data object US_cases_long has 16,014 observations ((50 states + DC) * 314 days). For visualization this should be fine in most cases, but models built on these data may take a long time to run. If we’re mainly interested in longer-term trends, we can get a good approximation by aggregating the data to the weekly level for modeling:
# aggregate to weekly level (for later modeling)
US_cases_long_week <- US_cases_long %>%
group_by(GEOID, state, week_of_year) %>%
summarize(pop_count_2010 = mean(pop_count_2010),
cases_count_pos = sum(cases_count_pos),
cases_rate_100K = sum(cases_rate_100K)) %>%
drop_na()
glimpse(US_cases_long_week)
## Rows: 2,346
## Columns: 6
## Groups: GEOID, state [51]
## $ GEOID <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01",…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
## $ week_of_year <dbl> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ pop_count_2010 <dbl> 4779736, 4779736, 4779736, 4779736, 4779736, 4779736,…
## $ cases_count_pos <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 39, 203, 757, 1198, 1756, 137…
## $ cases_rate_100K <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000…
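As a quick sanity check on the aggregation (a sketch using only the objects created above), the per-state totals of the weekly counts should match those of the daily counts they were built from:
# totals per state should be identical before and after weekly aggregation
daily_totals <- US_cases_long %>%
  group_by(state) %>%
  summarize(total = sum(cases_count_pos))
weekly_totals <- US_cases_long_week %>%
  group_by(state) %>%
  summarize(total = sum(cases_count_pos))
all.equal(daily_totals$total, weekly_totals$total)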