DAY 1: Acquiring and cleaning data (Python)
Acquiring data from APIs
Often, we want to acquire data that is stored online. Online data sources live on a remote server, a remotely located computer that is optimized to process requests for information. Usually, we make requests using a client such as a web browser, but we can also make requests programmatically. An Application Programming Interface (API) is the part of a remote server that receives requests and sends responses. When we make requests programmatically, the responses an API sends are typically data-based, often in a structured format such as JSON or XML.
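Before turning to the tools we use below, here is a minimal sketch of what such a programmatic request looks like, using the requests library against the Dataverse native API's dataset-metadata endpoint. It is purely illustrative and not part of the project code that follows:
import requests
# request a dataset's metadata from the Dataverse native API and parse the JSON response
url = "https://dataverse.harvard.edu/api/datasets/:persistentId/"
response = requests.get(url, params={"persistentId": "doi:10.7910/DVN/HIDLTK"})
print(response.status_code)   # 200 indicates a successful request
metadata = response.json()    # the structured (JSON) payload, now a Python dict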
For the project we’ll pursue during DataFest, we’re going to access data stored on the Harvard Dataverse. Dataverse is open source software for hosting research data repositories. Once data is stored in a Dataverse, it can be accessed programmatically using the Dataverse API. We will use the Python module pyDataverse as an interface for the Dataverse API.
Here are three COVID-19 datasets from the Harvard Dataverse:
- US data on COVID-19 cases and deaths, daily at state-level or county-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HIDLTK
- US data on COVID-19 cases and deaths, daily at metropolitan-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5B8YM8
- World data on COVID-19 cases and deaths, daily at country-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L20LOT
As an example of how to use the Dataverse API, we’re going to extract daily data on COVID-19 cases from the U.S. at the state-level (from dataset #1 above). These data span the period from January 21st 2020 until November 29th 2020 for each U.S. state (and the District of Columbia). If you wish, you may choose to use one of the other datasets for your project.
We can use the pyDataverse module (installable from PyPI with pip install pyDataverse) as an interface for the API:
from pyDataverse.api import Api
# get the digital object identifier for the Dataverse dataset
DOI = "doi:10.7910/DVN/HIDLTK"
# establish connection
base_url = 'https://dataverse.harvard.edu/'
api = Api(base_url)
print(api.status)
## OK
# retrieve the contents of the dataset
covid = api.get_dataset(DOI)
The covid object contains the dataset metadata, including information on all the files stored within this dataset.
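If you want to see where that file information lives, the response body is JSON that parses into a nested Python dictionary. The path used below (data → latestVersion → files) follows the standard Dataverse native API response layout, so treat it as an assumption about the metadata structure rather than something specific to this dataset:
# parse the JSON response into a nested dictionary and inspect its top level
covid_metadata = covid.json()
print(covid_metadata.keys())                           # typically 'status' and 'data'
print(covid_metadata['data']['latestVersion'].keys())  # includes a 'files' entry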
Let’s dig further and display the available files:
# extract the list of file records from the dataset metadata
covid_files_list = covid.json()['data']['latestVersion']['files']
# view available files
for fileObject in covid_files_list:
    print("File name is {}; id is {}".format(fileObject["dataFile"]["filename"], fileObject["dataFile"]["id"]))
## File name is COUNTY_MAP.zip; id is 3758784
## File name is INDEX.txt; id is 3758782
## File name is METRO_MAP_2018_ESRI.zip; id is 3842202
## File name is METRO_MAP.zip; id is 3826156
## File name is Metropolitan_statistical_areas_for_US_counties__Sept_2018.xml; id is 3867748
## File name is Metropolitan_statistical_areas_for_US_counties__Sept_2018.zip; id is 3867747
## File name is README.txt; id is 3758780
## File name is STATE_MAP.zip; id is 3758781
## File name is us_county_confirmed_cases.tab; id is 4202274
## File name is us_county_deaths_cases.tab; id is 4202273
## File name is us_state_confirmed_case.tab; id is 4201597
## File name is us_state_deaths_case.tab; id is 4201596
For our example project, we’re going to use the data on cumulative COVID-19 cases at the state-level contained in the us_state_confirmed_case.tab file, which has the ID 4201597.
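If you prefer, you can also look this ID up programmatically rather than reading it off the printed list. A small convenience sketch using the covid_files_list built above (the file name must match exactly):
# look up a file ID by file name
target = "us_state_confirmed_case.tab"
file_id = next(rec["dataFile"]["id"] for rec in covid_files_list
               if rec["dataFile"]["filename"] == target)
print(file_id)   # 4201597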
To convert the data into a more user-friendly data frame, we have to jump through a few hoops: retrieve the raw file content by its ID, write it to disk, and then read it back in with pandas.
import pandas as pd
# retrieve the raw file content by its ID (newer pyDataverse versions expose this via a separate DataAccessApi class)
US_states_cases_file = api.get_datafile("4201597")
# write the raw bytes to a tab-delimited file, then read it into a data frame
in_text = US_states_cases_file.content
tmp = "data_py/US_states_cases.tab"
f = open(tmp, "wb")
f.write(in_text)
## 91400
f.close()
US_states_cases = pd.read_csv(tmp, sep="\t")
We can now inspect the first few rows of the data:
print(US_states_cases.head(10))
## fips NAME ... 2020-11-28 2020-11-29
## 0 1 Alabama ... 244993 247229
## 1 2 Alaska ... 31279 31896
## 2 4 Arizona ... 322774 325995
## 3 5 Arkansas ... 155026 156247
## 4 6 California ... 1206278 1219496
## 5 8 Colorado ... 226567 230057
## 6 9 Connecticut ... 112581 112581
## 7 10 Delaware ... 34670 35251
## 8 11 District of Columbia ... 21308 21448
## 9 12 Florida ... 985289 992652
##
## [10 rows x 326 columns]
Cleaning data
COVID-19 cases data
The COVID-19 cases data are in wide format, with individual columns for each day’s case counts. To visualize and analyze the data, it will be much easier to reshape the data so that it is organized in long format, with a single column for case counts and another column indicating the date those counts are associated with.
In addition, it will be useful to derive some time-related variables (e.g., day of year, week of year) from the dates. Finally, we should transform our cumulative case counts into daily counts and create some rate variables by normalizing by each state's population count.
import numpy as np
# select columns of interest: the date columns all begin with a digit (e.g. "2020-01-21")
US_states_cases_filtered = US_states_cases.filter(regex="^\\d", axis=1)
US_states_cases_selected = US_states_cases.loc[:, ['fips', 'NAME', 'POP10']]
US_states_cases_selected = US_states_cases_selected.assign(**US_states_cases_filtered)
# rename some columns
US_states_cases_selected = US_states_cases_selected.rename(columns={'fips':'GEOID', 'NAME':'state', 'POP10':'pop_count_2010'})
# reshape to long format for dates
US_states_cases_selected = pd.melt(US_states_cases_selected,
                                   id_vars=["GEOID", "state", "pop_count_2010"],
                                   var_name='date', value_name="cases_cum")
US_states_cases_selected = US_states_cases_selected.sort_values(['GEOID', 'date']).reset_index(drop=True)
# create new derived time variables from dates
US_states_cases_selected["day_of_year"] = pd.to_datetime(US_states_cases_selected.date).dt.dayofyear
US_states_cases_selected["week_of_year"] = pd.to_datetime(US_states_cases_selected.date).dt.isocalendar().week
US_states_cases_selected["month"] = pd.to_datetime(US_states_cases_selected.date).dt.month
# create daily case counts by differencing the cumulative counts within each state
US_states_cases_selected["cases_count"] = US_states_cases_selected.groupby('state').cases_cum.diff().fillna(0)
# tidy-up negative counts
US_states_cases_selected["cases_count_pos"] = np.where(US_states_cases_selected["cases_count"] < 0, 0, US_states_cases_selected["cases_count"])
# create cases rates
US_states_cases_selected["cases_rate_100K"] = (US_states_cases_selected["cases_count_pos"] / US_states_cases_selected["pop_count_2010"]) * 1e5
US_states_cases_selected["cases_cum_rate_100K"] = (US_states_cases_selected["cases_cum"] / US_states_cases_selected["pop_count_2010"]) * 1e5
US_states_cases_selected.to_csv("data_py/US_states_cases_selected.csv") # 16,014 observations: (50 states + DC) * 314 days
Aggregating data
The cleaned data object US_states_cases_selected has 16,014 observations ((50 states + DC) * 314 days). For visualization, this should be fine in most cases. When we come to build models for these data, they may take a long time to run. If we’re mainly interested in longer-term trends, we can probably get a good approximation by aggregating the data to the weekly level for modeling:
# aggregate to weekly level (for later modeling)
aggs_by_col = {'pop_count_2010': lambda x: np.mean(x), 'cases_cum': 'sum', 'cases_cum_rate_100K': 'sum', 'cases_count_pos': 'sum', 'cases_rate_100K': 'sum'}
US_states_cases_week = US_states_cases_selected.groupby(['GEOID', 'state', 'week_of_year'], as_index=False).agg(aggs_by_col)
print(US_states_cases_week.head(20))
## GEOID state ... cases_count_pos cases_rate_100K
## 0 1 Alabama ... 0.0 0.000000
## 1 1 Alabama ... 0.0 0.000000
## 2 1 Alabama ... 0.0 0.000000
## 3 1 Alabama ... 0.0 0.000000
## 4 1 Alabama ... 0.0 0.000000
## 5 1 Alabama ... 0.0 0.000000
## 6 1 Alabama ... 0.0 0.000000
## 7 1 Alabama ... 23.0 0.481198
## 8 1 Alabama ... 134.0 2.803502
## 9 1 Alabama ... 673.0 14.080276
## 10 1 Alabama ... 1010.0 21.130874
## 11 1 Alabama ... 1743.0 36.466449
## 12 1 Alabama ... 1320.0 27.616588
## 13 1 Alabama ... 1518.0 31.759076
## 14 1 Alabama ... 1467.0 30.692072
## 15 1 Alabama ... 2001.0 41.864237
## 16 1 Alabama ... 1882.0 39.374560
## 17 1 Alabama ... 2707.0 56.634927
## 18 1 Alabama ... 3474.0 72.681838
## 19 1 Alabama ... 2548.0 53.308384
##
## [20 rows x 8 columns]
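As with the daily table, you may want to save the weekly aggregate for the later modeling sessions; the file path below simply mirrors the one used above and is only a suggestion:
# save the weekly aggregate alongside the daily data (file name is a suggestion)
US_states_cases_week.to_csv("data_py/US_states_cases_week.csv")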