DAY 1: Acquiring and cleaning data (Python)
Acquiring data from APIs
Often, we want to acquire data that is stored online. Online data sources live on a remote server, a remotely located computer that is optimized to process requests for information. Usually, we make requests using a client such as a web browser, but we can also make requests programmatically. An Application Programming Interface (API) is the part of a remote server that receives requests and sends responses. When we make requests programmatically, the responses an API sends are typically data-based, often in a structured format such as JSON or XML.
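Before turning to the tools we use below, here is a minimal sketch of what such a programmatic request looks like, using the requests library against the Dataverse native API's dataset-metadata endpoint. It is purely illustrative and not part of the project code that follows:
import requests
# request a dataset's metadata from the Dataverse native API and parse the JSON response
url = "https://dataverse.harvard.edu/api/datasets/:persistentId/"
response = requests.get(url, params={"persistentId": "doi:10.7910/DVN/HIDLTK"})
print(response.status_code)   # 200 indicates a successful request
metadata = response.json()    # the structured (JSON) payload, now a Python dict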
For the project we’ll pursue during DataFest, we’re going to access data stored on the Harvard Dataverse. Dataverse is open source software for hosting research data repositories. Once data is stored in a Dataverse, it can be accessed programmatically using the Dataverse API. We will use the Python module pyDataverse as an interface for the Dataverse API.
Here are three COVID-19 datasets from the Harvard Dataverse:
- US data on COVID-19 cases and deaths, daily at state-level or county-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HIDLTK
- US data on COVID-19 cases and deaths, daily at metropolitan-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5B8YM8
- World data on COVID-19 cases and deaths, daily at country-level: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L20LOT
As an example of how to use the Dataverse API, we’re going to extract daily data on COVID-19 cases from the U.S. at the state-level (from dataset #1 above). These data span the period from January 21st 2020 until November 29th 2020 for each U.S. state (and the District of Columbia). If you wish, you may choose to use one of the other datasets for your project.
We can use the pyDataverse module (installable from PyPI with pip install pyDataverse) as an interface for the API:
from pyDataverse.api import Api
# get the digital object identifier for the Dataverse dataset
DOI = "doi:10.7910/DVN/HIDLTK"
# establish connection
base_url = 'https://dataverse.harvard.edu/'
api = Api(base_url)
print(api.status)
## OK
# retrieve the contents of the dataset
covid = api.get_dataset(DOI)
The covid object contains the dataset metadata, including information on all the files stored within this dataset.
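If you want to see where that file information lives, the response body is JSON that parses into a nested Python dictionary. The path used below (data → latestVersion → files) follows the standard Dataverse native API response layout, so treat it as an assumption about the metadata structure rather than something specific to this dataset:
# parse the JSON response into a nested dictionary and inspect its top level
covid_metadata = covid.json()
print(covid_metadata.keys())                           # typically 'status' and 'data'
print(covid_metadata['data']['latestVersion'].keys())  # includes a 'files' entry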
Let’s dig further and display the available files:
# extract the list of file records from the dataset metadata
covid_files_list = covid.json()['data']['latestVersion']['files']
# view available files
for fileObject in covid_files_list:
    print("File name is {}; id is {}".format(fileObject["dataFile"]["filename"], fileObject["dataFile"]["id"]))
## File name is COUNTY_MAP.zip; id is 3758784
## File name is INDEX.txt; id is 3758782
## File name is METRO_MAP_2018_ESRI.zip; id is 3842202
## File name is METRO_MAP.zip; id is 3826156
## File name is Metropolitan_statistical_areas_for_US_counties__Sept_2018.xml; id is 3867748
## File name is Metropolitan_statistical_areas_for_US_counties__Sept_2018.zip; id is 3867747
## File name is README.txt; id is 3758780
## File name is STATE_MAP.zip; id is 3758781
## File name is us_county_confirmed_cases.tab; id is 4202274
## File name is us_county_deaths_cases.tab; id is 4202273
## File name is us_state_confirmed_case.tab; id is 4201597
## File name is us_state_deaths_case.tab; id is 4201596
For our example project, we’re going to use the data on cumulative COVID-19 cases at the state-level contained in the us_state_confirmed_case.tab file, which has the ID 4201597.
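If you prefer, you can also look this ID up programmatically rather than reading it off the printed list. A small convenience sketch using the covid_files_list built above (the file name must match exactly):
# look up a file ID by file name
target = "us_state_confirmed_case.tab"
file_id = next(rec["dataFile"]["id"] for rec in covid_files_list
               if rec["dataFile"]["filename"] == target)
print(file_id)   # 4201597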
To convert the data into a more user-friendly data frame, we have to jump through a few hoops: retrieve the raw file content by its ID, write it to disk, and then read it back in with pandas.
import pandas as pd
# retrieve the raw file content by its ID (newer pyDataverse versions expose this via a separate DataAccessApi class)
US_states_cases_file = api.get_datafile("4201597")
# write the raw bytes to a tab-delimited file, then read it into a data frame
in_text = US_states_cases_file.content
tmp = "data_py/US_states_cases.tab"
f = open(tmp, "wb")
f.write(in_text)
## 91400
f.close()
US_states_cases = pd.read_csv(tmp, sep="\t")
We can now inspect the first few rows of the data:
print(US_states_cases.head(10))
## fips NAME ... 2020-11-28 2020-11-29
## 0 1 Alabama ... 244993 247229
## 1 2 Alaska ... 31279 31896
## 2 4 Arizona ... 322774 325995
## 3 5 Arkansas ... 155026 156247
## 4 6 California ... 1206278 1219496
## 5 8 Colorado ... 226567 230057
## 6 9 Connecticut ... 112581 112581
## 7 10 Delaware ... 34670 35251
## 8 11 District of Columbia ... 21308 21448
## 9 12 Florida ... 985289 992652
##
## [10 rows x 326 columns]
Cleaning data
COVID-19 cases data
The COVID-19 cases data are in wide format, with individual columns for each day’s case counts. To visualize and analyze the data, it will be much easier to reshape the data so that it is organized in long format, with a single column for case counts and another column indicating the date those counts are associated with.
In addition, it will be useful to derive some time-related variables (e.g., day of year, week of year) from the dates. Finally, we should transform our cumulative case counts into daily counts and create some rate variables by normalizing by each state's population count.
import numpy as np
# select columns of interest: the date columns all begin with a digit (e.g. "2020-01-21")
US_states_cases_filtered = US_states_cases.filter(regex="^\\d", axis=1)
US_states_cases_selected = US_states_cases.loc[:, ['fips', 'NAME', 'POP10']]
US_states_cases_selected = US_states_cases_selected.assign(**US_states_cases_filtered)
# rename some columns
US_states_cases_selected = US_states_cases_selected.rename(columns={'fips':'GEOID', 'NAME':'state', 'POP10':'pop_count_2010'})
# reshape to long format for dates
US_states_cases_selected = pd.melt(US_states_cases_selected,
                                   id_vars=["GEOID", "state", "pop_count_2010"],
                                   var_name='date', value_name="cases_cum")
US_states_cases_selected = US_states_cases_selected.sort_values(['GEOID', 'date']).reset_index(drop=True)
# create new derived time variables from dates
US_states_cases_selected["day_of_year"] = pd.to_datetime(US_states_cases_selected.date).dt.dayofyear
US_states_cases_selected["week_of_year"] = pd.to_datetime(US_states_cases_selected.date).dt.isocalendar().week
US_states_cases_selected["month"] = pd.to_datetime(US_states_cases_selected.date).dt.month
# create daily case counts by differencing the cumulative counts within each state
US_states_cases_selected["cases_count"] = US_states_cases_selected.groupby('state').cases_cum.diff().fillna(0)
# tidy-up negative counts
US_states_cases_selected["cases_count_pos"] = np.where(US_states_cases_selected["cases_count"] < 0, 0, US_states_cases_selected["cases_count"])
# create cases rates
US_states_cases_selected["cases_rate_100K"] = (US_states_cases_selected["cases_count_pos"] / US_states_cases_selected["pop_count_2010"]) * 1e5
US_states_cases_selected["cases_cum_rate_100K"] = (US_states_cases_selected["cases_cum"] / US_states_cases_selected["pop_count_2010"]) * 1e5
US_states_cases_selected.to_csv("data_py/US_states_cases_selected.csv") # 16,014 observations: (50 states + DC) * 314 days
Aggregating data
The cleaned data object US_states_cases_selected has 16,014 observations ((50 states + DC) * 314 days). For visualization, this should be fine in most cases. When we come to build models for these data, they may take a long time to run. If we’re mainly interested in longer-term trends, we can probably get a good approximation by aggregating the data to the weekly level for modeling:
# aggregate to weekly level (for later modeling)
aggs_by_col = {'pop_count_2010': lambda x: np.mean(x), 'cases_cum': 'sum', 'cases_cum_rate_100K': 'sum', 'cases_count_pos': 'sum', 'cases_rate_100K': 'sum'}
US_states_cases_week = US_states_cases_selected.groupby(['GEOID', 'state', 'week_of_year'], as_index=False).agg(aggs_by_col)
print(US_states_cases_week.head(20))
## GEOID state ... cases_count_pos cases_rate_100K
## 0 1 Alabama ... 0.0 0.000000
## 1 1 Alabama ... 0.0 0.000000
## 2 1 Alabama ... 0.0 0.000000
## 3 1 Alabama ... 0.0 0.000000
## 4 1 Alabama ... 0.0 0.000000
## 5 1 Alabama ... 0.0 0.000000
## 6 1 Alabama ... 0.0 0.000000
## 7 1 Alabama ... 23.0 0.481198
## 8 1 Alabama ... 134.0 2.803502
## 9 1 Alabama ... 673.0 14.080276
## 10 1 Alabama ... 1010.0 21.130874
## 11 1 Alabama ... 1743.0 36.466449
## 12 1 Alabama ... 1320.0 27.616588
## 13 1 Alabama ... 1518.0 31.759076
## 14 1 Alabama ... 1467.0 30.692072
## 15 1 Alabama ... 2001.0 41.864237
## 16 1 Alabama ... 1882.0 39.374560
## 17 1 Alabama ... 2707.0 56.634927
## 18 1 Alabama ... 3474.0 72.681838
## 19 1 Alabama ... 2548.0 53.308384
##
## [20 rows x 8 columns]
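As with the daily table, you may want to save the weekly aggregate for the later modeling sessions; the file path below simply mirrors the one used above and is only a suggestion:
# save the weekly aggregate alongside the daily data (file name is a suggestion)
US_states_cases_week.to_csv("data_py/US_states_cases_week.csv")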