Download dataverse file as a dataframe — get_dataframe_by

Reads a Dataverse file into the R environment with a user-specified function, such as read.csv or one of the readr functions.

Use get_dataframe_by_name if you know the name of the datafile and the DOI of the dataset. Use get_dataframe_by_doi if you know the DOI of the datafile itself. Use get_dataframe_by_id if you know the numeric ID of the datafile. For files that are not tabular datasets, the more generic get_file, which downloads the content as a binary, is simpler.

The function can read datasets that are unpublished and are still drafts, as long as the entry has a UNF. See the download vignette for details.

get_dataframe_by_name(
  filename,
  dataset = NULL,
  .f = NULL,
  original = FALSE,
  ...
)

get_dataframe_by_id(fileid, .f = NULL, original = FALSE, ...)

get_dataframe_by_doi(filedoi, .f = NULL, original = FALSE, ...)

Arguments

filename

The name of the file of interest, with file extension, for example "roster-bulls-1996.tab". Can be a vector for multiple files.

dataset

A character specifying a persistent identifier (DOI) for a dataset, for example "10.70122/FK2/HXJVJU". Alternatively, an object of class “dataverse_dataset” obtained by dataverse_contents().

.f

The function to use for reading in the raw dataset. The user must choose the appropriate function: for example, if the target is a .rds file, then .f should be readRDS or readr::read_rds. It can also be a custom function defined by the user. See examples for details.
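As a minimal sketch of a user-defined reader, the function below assumes (as the examples on this page suggest) that .f is called with the path to the downloaded file as its first argument. The name read_semicolon and the toy file are hypothetical and are not part of the package.

```r
# Hypothetical custom reader for a semicolon-delimited text file.
# It accepts a file path and returns a data.frame.
read_semicolon <- function(path, ...) {
  utils::read.csv(path, sep = ";", stringsAsFactors = FALSE, ...)
}

# Try the reader locally on a small made-up file before using it as `.f`.
tmp <- tempfile(fileext = ".csv")
writeLines(c("name;number", "Jordan;23", "Pippen;33"), tmp)
roster <- read_semicolon(tmp)
```

A reader like this would then be supplied as `.f = read_semicolon`, together with `original = TRUE`, in a call to get_dataframe_by_name.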

original

A logical, whether to read the original version of the datafile instead of the ingested, archival version, if one exists. If TRUE, users should supply a function in .f to read in the original. The archival versions are tab-delimited .tab files, so if original = FALSE, .f is set to readr::read_tsv.

...

Arguments passed on to get_file

file: An integer specifying a file identifier; or a vector of integers specifying file identifiers; or, if used with the prefix "doi:", a character with the file-specific DOI; or, if used without the prefix, a filename accompanied by a dataset DOI in the dataset argument, or an object of class “dataverse_file” as returned by dataset_files. Can be a vector for multiple files.
format: A character string specifying a file format for download. By default, this is “original” (the original file format). If NULL, no query is added, so ingested files are returned in their ingested TSV form. For tabular datasets, the option “bundle” downloads the bundle of the original and archival versions, as well as the documentation. See https://guides.dataverse.org/en/latest/api/dataaccess.html for details.
vars: A character vector specifying one or more variable names, used to extract a subset of the data.
key: A character string specifying a Dataverse server API key. If one is not specified, functions calling authenticated API endpoints will fail. Keys can be specified atomically or globally using Sys.setenv("DATAVERSE_KEY" = "examplekey").
server: A character string specifying a Dataverse server. Multiple Dataverse installations exist, with "dataverse.harvard.edu" being the most prominent. The server can be defined each time within a function, or it can be set as a default via an environment variable. To set a default, run Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu") or add DATAVERSE_SERVER = "dataverse.harvard.edu" to one's .Renviron file (usethis::edit_r_environ()), with the appropriate domain as its value.
version: A character specifying a version of the dataset. This can be of the form "1.1" or "1" (where in "x.y", x is a major version and y is an optional minor version), or ":latest" (the default, the latest published version). We recommend using the number format so that the function stores a cache of the data (see cache_dataset). If the user specifies a key or DATAVERSE_KEY argument, they can access the draft version by ":draft" (the current draft) or ":latest" (which will prioritize the draft over the latest published version). Finally, set use_cache = "none" to not read from the cache and re-download afresh even when version is provided.
return_url: Instead of downloading the file, return the URL for download. Defaults to FALSE.
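The key and server defaults described above rely on ordinary environment variables, so they can be set and inspected with base R alone. The sketch below uses "demo.dataverse.org" (the server from the examples on this page) and the placeholder key "examplekey"; substitute real values in practice.

```r
# Set session-wide defaults; both values here are placeholders.
Sys.setenv("DATAVERSE_SERVER" = "demo.dataverse.org")
Sys.setenv("DATAVERSE_KEY" = "examplekey")

# Inspect what later calls would pick up when `server` and `key` are omitted.
server_default <- Sys.getenv("DATAVERSE_SERVER")
key_is_set <- nzchar(Sys.getenv("DATAVERSE_KEY"))
```

Values set with Sys.setenv last only for the current R session; the .Renviron route mentioned under server makes them persist across sessions.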

fileid

A numeric ID internally used for get_file_by_id. Can be a vector for multiple files.

filedoi

A DOI for a single file (not the entire dataset), of the form "10.70122/FK2/PPIAXE/MHDB0O" or "doi:10.70122/FK2/PPIAXE/MHDB0O". Can be a vector for multiple files.

Value

An R object that is returned by the default or user-supplied function in the .f argument. For example, if .f = readr::read_tsv, the function will return a dataframe as read in by readr::read_tsv(). If the file identifier is a vector, it will return a list where each slot corresponds to an element of the vector.

Examples

if (FALSE) { # \dontrun{
# 1. For files originally in plain-text (.csv, .tsv), we recommend
# retrieving the data.frame by the dataset DOI and file name, or by the file's DOI.

df_tab <-
  get_dataframe_by_name(
    filename = "roster-bulls-1996.tab",
    dataset  = "doi:10.70122/FK2/HXJVJU",
    server   = "demo.dataverse.org"
  )

df_tab <-
  get_dataframe_by_doi(
    filedoi      = "10.70122/FK2/HXJVJU/SA3Z2V",
    server       = "demo.dataverse.org"
  )

# 2. For files where Dataverse's ingest loses information (Stata .dta, SPSS .sav)
# or cannot be ingested (R .rds), we recommend
# specifying `original = TRUE` and specifying a read-in function in .f.

# Rds files are not ingested so original = TRUE and .f is required.
if (requireNamespace("readr", quietly = TRUE)) {
  df_from_rds_original <-
    get_dataframe_by_name(
      filename   = "nlsw88_rds-export.rds",
      dataset    = "doi:10.70122/FK2/PPIAXE",
      server     = "demo.dataverse.org",
      original   = TRUE,
      .f         = readr::read_rds
    )
}

# Stata dta files lose attributes such as value labels upon ingest so
# reading the original version by a Stata reader such as `haven` is recommended.
if (requireNamespace("haven", quietly = TRUE)) {
  df_stata_original <-
    get_dataframe_by_name(
      filename   = "nlsw88.tab",
      dataset    = "doi:10.70122/FK2/PPIAXE",
      server     = "demo.dataverse.org",
      original   = TRUE,
      .f         = haven::read_dta
    )
}

# 3. RData files are read in by `base::load()` but cannot be assigned to an
# object name. The following shows two possible ways to read in such files.
# First, the RData object can be loaded to the environment without object assignment.

get_dataframe_by_doi(
  filedoi = "10.70122/FK2/PPIAXE/X2FC5V",
  server = "demo.dataverse.org",
  original = TRUE,
  .f = function(x) load(x, envir = .GlobalEnv))

# If you are certain each RData file contains only one object, you could define
# a custom function such as the one in https://stackoverflow.com/a/34926943
load_object <- function(file) {
  tmp <- new.env()
  load(file = file, envir = tmp)
  tmp[[ls(tmp)[1]]]
}

# https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/PPIAXE/X2FC5V
as_rda <- get_dataframe_by_id(
  fileid = 1939003,
  server = "demo.dataverse.org",
  .f = load_object,
  original = TRUE)
} # }
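The load_object helper from the examples can be tried locally without contacting a Dataverse server. The sketch below saves a made-up data.frame to a temporary .RData file and reads it back; the file and its contents are purely illustrative.

```r
# Same helper as in the examples: load an RData file into a private
# environment and return the first object found there.
load_object <- function(file) {
  tmp <- new.env()
  load(file = file, envir = tmp)
  tmp[[ls(tmp)[1]]]
}

# Save a toy data.frame to a temporary .RData file, then remove the original.
path <- tempfile(fileext = ".RData")
roster <- data.frame(player = c("Jordan", "Pippen"), number = c(23, 33))
save(roster, file = path)
rm(roster)

# The helper returns the stored object, so it can be assigned to any name.
restored <- load_object(path)
```

Because the helper returns only the first object listed by ls(), it is best suited to files known to contain a single object.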