Reads a Dataverse file into the R environment with any user-specified function, such as read.csv or one of the readr functions.
Use get_dataframe_by_name if you know the name of the datafile and the DOI of the dataset. Use get_dataframe_by_doi if you know the DOI of the datafile itself. Use get_dataframe_by_id if you know the numeric ID of the datafile. For files that are not datasets, the more generic get_file, which downloads the content as a binary, is simpler.
The function can read datasets that are unpublished and are still drafts, as long as the entry has a UNF. See the download vignette for details.
get_dataframe_by_name(
  filename,
  dataset = NULL,
  .f = NULL,
  original = FALSE,
  ...
)

get_dataframe_by_id(fileid, .f = NULL, original = FALSE, ...)

get_dataframe_by_doi(filedoi, .f = NULL, original = FALSE, ...)
filename
The name of the file of interest, with file extension, for example "roster-bulls-1996.tab". Can be a vector for multiple files.
dataset
A character specifying a persistent identification ID for a dataset, for example "10.70122/FK2/HXJVJU". Alternatively, an object of class “dataverse_dataset” obtained by dataverse_contents().
.f
The function used to read in the raw dataset. The user must choose the appropriate function: for example, if the target is a .rds file, then .f should be readRDS or readr::read_rds. It can also be a custom function defined by the user. See the examples for details.
original
A logical, defaulting to FALSE, which reads the ingested, archival version of the datafile if one exists. If TRUE, the original file is read instead, and users should supply in .f a function that can read the original format. The archival versions are tab-delimited .tab files, so if original = FALSE, .f is set to readr::read_tsv.
...
Arguments passed on to get_file
file
An integer specifying a file identifier; or a vector of integers specifying file identifiers; or, if used with the prefix "doi:", a character with the file-specific DOI; or, if used without the prefix, a filename accompanied by a dataset DOI in the dataset argument, or an object of class “dataverse_file” as returned by dataset_files. Can be a vector for multiple files.
format
A character string specifying a file format for download. By default, this is “original” (the original file format). If NULL, no query is added, so ingested files are returned in their ingested TSV form. For tabular datasets, the option “bundle” downloads the bundle of the original and archival versions, as well as the documentation. See https://guides.dataverse.org/en/latest/api/dataaccess.html for details.
vars
A character vector specifying one or more variable names, used to extract a subset of the data.
key
A character string specifying a Dataverse server API key. If one is not specified, functions calling authenticated API endpoints will fail. Keys can be specified per call (atomically) or globally using Sys.setenv("DATAVERSE_KEY" = "examplekey").
server
A character string specifying a Dataverse server. Multiple Dataverse installations exist, with "dataverse.harvard.edu" being the most prominent. The server can be specified in each function call, or it can be set as a default via an environment variable. To set a default, run Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu") or add DATAVERSE_SERVER = "dataverse.harvard.edu" to one's .Renviron file (usethis::edit_r_environ()), with the appropriate domain as its value.
version
A character specifying a version of the dataset. This can be of the form "1.1" or "1" (where in "x.y", x is a major version and y is an optional minor version), or ":latest" (the default, the latest published version). We recommend using the number format so that the function stores a cache of the data (see cache_dataset). If the user supplies the key argument or sets the DATAVERSE_KEY environment variable, they can access the draft version with ":draft" (the current draft) or ":latest" (which will prioritize the draft over the latest published version). Finally, set use_cache = "none" to skip the cache and download afresh even when version is provided.
return_url
Instead of downloading the file, return the URL for download. Defaults to FALSE.
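As a minimal sketch of the environment-variable approach described for key and server, the following sets session-wide defaults and reads one back. The values are placeholders: "examplekey" is not a real API key, and the demo server is chosen only because the examples on this page use it.

```r
# Set session-wide defaults; both values are placeholders for illustration.
Sys.setenv("DATAVERSE_SERVER" = "demo.dataverse.org")
Sys.setenv("DATAVERSE_KEY" = "examplekey")

# Read the default back to confirm what the environment now holds.
default_server <- Sys.getenv("DATAVERSE_SERVER")
default_server
```

Values set with Sys.setenv() last only for the current R session; an entry in .Renviron persists across sessions.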
fileid
A numeric ID internally used for get_file_by_id. Can be a vector for multiple files.
filedoi
A DOI for a single file (not the entire dataset), of the form "10.70122/FK2/PPIAXE/MHDB0O" or "doi:10.70122/FK2/PPIAXE/MHDB0O". Can be a vector for multiple files.
An R object that is returned by the default or user-supplied function in the .f argument. For example, if .f = readr::read_tsv, the function will return a dataframe as read in by readr::read_tsv(). If the file identifier is a vector, it will return a list where each slot corresponds to an element of the vector.
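To illustrate the custom-function option for .f, here is a small sketch of a reader that keeps every column as character, which avoids losing leading zeros in codes such as ZIP codes. It assumes, as the examples on this page suggest, that .f is called with a single path to the downloaded file. The name read_csv_as_character and the sample data are hypothetical, and the check runs on a local temporary file rather than on Dataverse.

```r
# Hypothetical custom reader: read a CSV and keep all columns as character.
read_csv_as_character <- function(path) {
  utils::read.csv(path, colClasses = "character", stringsAsFactors = FALSE)
}

# Local check on a temporary file; no Dataverse connection is needed.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,zip", "1,02138", "2,00501"), csv_path)

roster <- read_csv_as_character(csv_path)
roster$zip
# "02138" "00501"
```

A reader like this could then be supplied as .f = read_csv_as_character, together with original = TRUE when the original upload is a CSV.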
if (FALSE) { # \dontrun{
# 1. For files originally in plain-text (.csv, .tsv), we recommend
# retrieving a data.frame by the dataset DOI and file name, or by the file's DOI.
df_tab <-
  get_dataframe_by_name(
    filename = "roster-bulls-1996.tab",
    dataset  = "doi:10.70122/FK2/HXJVJU",
    server   = "demo.dataverse.org"
  )

df_tab <-
  get_dataframe_by_doi(
    filedoi = "10.70122/FK2/HXJVJU/SA3Z2V",
    server  = "demo.dataverse.org"
  )

# 2. For files where Dataverse's ingest loses information (Stata .dta, SPSS .sav)
# or cannot be ingested (R .rds), we recommend
# specifying `original = TRUE` and specifying a read-in function in .f.

# Rds files are not ingested so original = TRUE and .f is required.
if (requireNamespace("readr", quietly = TRUE)) {
  df_from_rds_original <-
    get_dataframe_by_name(
      filename = "nlsw88_rds-export.rds",
      dataset  = "doi:10.70122/FK2/PPIAXE",
      server   = "demo.dataverse.org",
      original = TRUE,
      .f       = readr::read_rds
    )
}

# Stata dta files lose attributes such as value labels upon ingest so
# reading the original version by a Stata reader such as `haven` is recommended.
if (requireNamespace("haven", quietly = TRUE)) {
  df_stata_original <-
    get_dataframe_by_name(
      filename = "nlsw88.tab",
      dataset  = "doi:10.70122/FK2/PPIAXE",
      server   = "demo.dataverse.org",
      original = TRUE,
      .f       = haven::read_dta
    )
}

# 3. RData files are read in by `base::load()` but cannot be assigned to an
# object name. The following shows two possible ways to read in such files.

# First, the RData object can be loaded to the environment without object assignment.
get_dataframe_by_doi(
  filedoi  = "10.70122/FK2/PPIAXE/X2FC5V",
  server   = "demo.dataverse.org",
  original = TRUE,
  .f       = function(x) load(x, envir = .GlobalEnv)
)

# If you are certain each RData contains only one object, one could define a
# custom function used in https://stackoverflow.com/a/34926943
load_object <- function(file) {
  tmp <- new.env()
  load(file = file, envir = tmp)
  tmp[[ls(tmp)[1]]]
}

# https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/PPIAXE/X2FC5V
# The argument is named `fileid`, matching the usage of get_dataframe_by_id().
as_rda <- get_dataframe_by_id(
  fileid   = 1939003,
  server   = "demo.dataverse.org",
  .f       = load_object,
  original = TRUE
)
} # }
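The load_object helper from the examples can be tried locally before pointing it at a Dataverse file. The sketch below repeats the helper so it is self-contained, saves a small made-up data frame (scores is hypothetical) to a temporary .RData file, and loads it back.

```r
# Same helper as in the examples: load an RData file into a scratch
# environment and return the first object found there.
load_object <- function(file) {
  tmp <- new.env()
  load(file = file, envir = tmp)
  tmp[[ls(tmp)[1]]]
}

# Hypothetical data saved to a temporary RData file.
scores <- data.frame(player = c("a", "b"), points = c(10, 12))
rdata_path <- tempfile(fileext = ".RData")
save(scores, file = rdata_path)

# The helper returns the saved object, which can now be assigned to any name.
restored <- load_object(rdata_path)
all.equal(restored, scores)
```

Because the helper returns only the first object that ls() lists, it is suited to RData files known to contain a single object.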