7 Data
It’s often useful to include data in a package. If the primary purpose of a package is to distribute useful functions, example datasets make it easier to write excellent documentation. These datasets can be hand-crafted to provide compelling use cases for the functions in the package. Here are some examples of this type of package data:
-
tidyr:
billboard
(song rankings),who
(tuberculosis data from the World Health Organization) -
dplyr:
starwars
(Star Wars characters),storms
(storm tracks)
At the other extreme, some packages exist solely for the purpose of distributing data, along with its documentation. These are sometimes called “data packages”. A data package can be a nice way to share example data across multiple packages. It is also a useful technique for getting relatively large, static files out of a more function-oriented package, which might require more frequent updates. Here are some examples of data packages:
Finally, many packages benefit from having internal data that is used for internal purposes, but that is not directly exposed to the users of the package.
In this chapter we describe useful mechanisms for including data in your package. The practical details differ depending on who needs access to the data, how often it changes, and what they will do with it:
If you want to store R objects and make them available to the user, put them in
data/
. This is the best place to put example datasets. All the concrete examples above for data in a package and data as a package use this mechanism. See section sec-data-data.If you want to store R objects for your own use as a developer, put them in
R/sysdata.rda
. This is the best place to put internal data that your functions need. See section sec-data-sysdata.If you want to store data in some raw, non-R-specific form and make it available to the user, put it in
inst/extdata/
. For example, readr and readxl each use this mechanism to provide a collection of delimited files and Excel workbooks, respectively. See section sec-data-extdata.If you want to store dynamic data that reflects the internal state of your package within a single R session, use an environment. This technique is not as common or well-known as those above, but can be very useful in specific situations. See section sec-data-state.
If you want to store data persistently across R sessions, such as configuration or user-specific data, use one of the officially sanctioned locations. See section sec-data-persistent.
7.1 Exported data
The most common location for package data is (surprise!) data/
. We recommend that each file in this directory be an .rda
file created by save()
containing a single R object, with the same name as the file. The easiest way to achieve this is to use usethis::use_data()
.
Let’s imagine we are working on a package named “pkg”. The snippet above creates data/my_pkg_data.rda
inside the source of the pkg package and adds LazyData: true
in your DESCRIPTION
. This makes the my_pkg_data
R object available to users of pkg via pkg::my_pkg_data
or, after attaching pkg with library(pkg)
, as my_pkg_data
.
The snippet above is something the maintainer executes once (or every time they need to update my_pkg_data
). This is workflow code and should not appear in the R/
directory of the source package. (We’ll talk about a suitable place to keep this code below.) For larger datasets, you may want to experiment with the compression setting, which is under the control of the compress
argument. The default is “bzip2”, but sometimes “gzip” or “xz” can create smaller files.
It’s possible to use other types of files below data/
, but we don’t recommend it because .rda
files are already fast, small, and explicit. The other possibilities are described in the documentation for utils::data()
and in the Data in packages section of Writing R Extensions. In terms of advice to package authors, the help topic for data()
seems to implicitly make the same recommendations as we do above:
- Store one R object in each
data/*.rda
file - Use the same name for that object and its
.rda
file - Use lazy-loading, by default
If the DESCRIPTION
contains LazyData: true
, then datasets will be lazily loaded. This means that they won’t occupy any memory until you use them. The following example shows memory usage before and after loading the nycflights13 package. You can see that memory usage doesn’t change significantly until you inspect the flights
dataset stored inside the package.
We recommend that you include LazyData: true
in your DESCRIPTION
if you are shipping .rda
files below data/
. If you use use_data()
to create such datasets, it will automatically make this modification to DESCRIPTION
for you.
7.1.1 Preserve the origin story of package data
Often, the data you include in data/
is a cleaned up version of raw data you’ve gathered from elsewhere. We highly recommend taking the time to include the code used to do this in the source version of your package. This makes it easy for you to update or reproduce your version of the data. This data-creating script is also a natural place to leave comments about important properties of the data, i.e. which features are important for downstream usage in package documentation.
We suggest that you keep this code in one or more .R
files below data-raw/
. You don’t want it in the bundled version of your package, so this folder should be listed in .Rbuildignore
. usethis has a convenience function that can be called when you first adopt the data-raw/
practice or when you add an additional .R
file to the folder:
usethis::use_data_raw()
usethis::use_data_raw("my_pkg_data")
use_data_raw()
creates the data-raw/
folder and lists it in .Rbuildignore
. A typical script in data-raw/
includes code to prepare a dataset and ends with a call to use_data()
.
These data packages all use the approach recommended here for data-raw/
:
7.1.2 Documenting datasets
Objects in data/
are always effectively exported (they use a slightly different mechanism than NAMESPACE
but the details are not important). This means that they must be documented. Documenting data is like documenting a function with a few minor differences. Instead of documenting the data directly, you document the name of the dataset and save it in R/
. For example, the roxygen2 block used to document the who
data in tidyr is saved in R/data.R
and looks something like this:
#' World Health Organization TB data
#'
#' A subset of data from the World Health Organization Global Tuberculosis
#' Report ...
#'
#' @format ## `who`
#' A data frame with 7,240 rows and 60 columns:
#' \describe{
#' \item{country}{Country name}
#' \item{iso2, iso3}{2 & 3 letter ISO country codes}
#' \item{year}{Year}
#' ...
#' }
#' @source <https://www.who.int/teams/global-tuberculosis-programme/data>
"who"
There are two roxygen tags that are especially important for documenting datasets:
@format
gives an overview of the dataset. For data frames, you should include a definition list that describes each variable. It’s usually a good idea to describe variables’ units here.@source
provides details of where you got the data, often a URL.
Never @export
a data set.
7.1.3 Non-ASCII characters in data
The R objects you store in data/*.rda
often contain strings, with the most common example being character columns in a data frame. If you can constrain these strings to only use ASCII characters, it certainly makes things simpler. But of course, there are plenty of legitimate reasons why package data might include non-ASCII characters.
In that case, we recommend that you embrace the UTF-8 Everywhere manifesto and use the UTF-8 encoding. The DESCRIPTION
file placed by usethis::create_package()
always includes Encoding: UTF-8
, so by default a devtools-produced package already advertises that it will use UTF-8.
Making sure that the strings embedded in your package data have the intended encoding is something you accomplish in your data preparation code, i.e. in the R scripts below data-raw/
. You can use Encoding()
to learn the current encoding of the elements in a character vector and functions such as enc2utf8()
or iconv()
to convert between encodings.
7.2 Internal data
Sometimes your package functions need access to pre-computed data. If you put these objects in data/
, they’ll also be available to package users, which is not appropriate. Sometimes the objects you need are small and simple enough that you can define them with c()
or data.frame()
in the code below R/
, perhaps in R/data.R
. Larger or more complicated objects should be stored in your package’s internal data in R/sysdata.rda
.
Here are some examples of internal package data:
- Two colour-related packages, munsell and dichromat, use
R/sysdata.rda
to store large tables of colour data. -
googledrive and googlesheets4 wrap the Google Drive and Google Sheets APIs, respectively. Both use
R/sysdata.rda
to store data derived from a so-called Discovery Document which “describes the surface of the API, how to access the API and how API requests and responses are structured”.
The easiest way to create R/sysdata.rda
is to use usethis::use_data(internal = TRUE)
:
internal_this <- ...
internal_that <- ...
usethis::use_data(internal_this, internal_that, internal = TRUE)
Unlike data/
, where you use one .rda
file per exported data object, you store all of your internal data objects together in the single file R/sysdata.rda
.
Let’s imagine we are working on a package named “pkg”. The snippet above creates R/sysdata.rda
inside the source of the pkg package. This makes the objects internal_this
and internal_that
available for use inside of the functions defined below R/
and in the tests. During interactive development, internal_this
and internal_that
are available after a call to devtools::load_all()
, just like an internal function.
Much of the advice given for external data holds for internal data as well:
- It’s a good idea to store the code that generates your individual internal data objects, as well as the
use_data()
call that writes all of them intoR/sysdata.rda
. This is workflow code that belongs belowdata-raw/
, not belowR/
. -
usethis::use_data_raw()
can be used to initiate the use ofdata-raw/
or to initiate a new.R
script there. - If your package is uncomfortably large, experiment with different values of
compress
inuse_data(internal = TRUE)
.
There are also key distinctions, where the handling of internal and external data differs:
- Objects in
R/sysdata.rda
are not exported (they shouldn’t be), so they don’t need to be documented. - Usage of
R/sysdata.rda
has no impact on DESCRIPTION, i.e. the need to specify theLazyData
field is strictly about the exported data belowdata/
.
7.3 Raw data file
If you want to show examples of loading/parsing raw data, put the original files in inst/extdata/
. When the package is installed, all files (and folders) in inst/
are moved up one level to the top-level directory, which is why they can’t have names that conflict with standard parts of an R package, like R/
or DESCRIPTION
. The files below inst/extdata/
in the source package will be located below extdata/
in the corresponding installed package. You may want to revisit Figure fig-package-files to review the file structure for different package states.
The main reason to include such files is when a key part of a package’s functionality is to act on an external file. Examples of such packages include:
- readr, which reads rectangular data out of delimited files
- readxl, which reads rectangular data out of Excel spreadsheets
- xml2, which can read XML and HTML from file
- archive, which can read archive files, such as tar or ZIP
All of these packages have one or more example files below inst/extdata/
, which are useful for writing documentation and tests.
It is also common for data packages to provide, e.g., a csv version of the package data that is also provided as an R object. Examples of such packages include:
- palmerpenguins:
penguins
andpenguins_raw
are also represented asextdata/penguins.csv
andextdata/penguins_raw.csv
- gapminder:
gapminder
,continent_colors
, andcountry_colors
are also represented asextdata/gapminder.tsv
,extdata/continent-colors.tsv
, andextdata/country-colors.tsv
This has two payoffs: First, it gives teachers and other expositors more to work with once they decide to use a specific dataset. If you’ve started teaching R with palmerpenguins::penguins
or gapminder::gapminder
and you want to introduce data import, it can be helpful to students if their first use of a new command, like readr::read_csv()
or read.csv()
, is applied to a familiar dataset. They have pre-existing intuition about the expected result. Finally, if package data evolves over time, having a csv or other plain text representation in the source package can make it easier to see what’s changed.
7.3.1 Filepaths
The path to a package file found below extdata/
clearly depends on the local environment, i.e. it depends on where installed packages live on that machine. The base function system.file()
can report the full path to files distributed with an R package. It can also be useful to list the files distributed with an R package.
system.file("extdata", package = "readxl") |> list.files()
#> [1] "clippy.xls" "clippy.xlsx" "datasets.xls" "datasets.xlsx"
#> [5] "deaths.xls" "deaths.xlsx" "geometry.xls" "geometry.xlsx"
#> [9] "type-me.xls" "type-me.xlsx"
system.file("extdata", "clippy.xlsx", package = "readxl")
#> [1] "D:/1.study/R/R-4.2.0/library/readxl/extdata/clippy.xlsx"
These filepaths present yet another workflow dilemma: When you’re developing your package, you engage with it in its source form, but your users engage with it as an installed package. Happily, devtools provides a shim for base::system.file()
that is activated by load_all()
. This makes interactive calls to system.file()
from the global environment and calls from within the package namespace “just work”.
Be aware that, by default, system.file()
returns the empty string, not an error, for a file that does not exist.
system.file("extdata", "I_do_not_exist.csv", package = "readr")
#> [1] ""
If you want to force a failure in this case, specify mustWork = TRUE
:
system.file("extdata", "I_do_not_exist.csv", package = "readr", mustWork = TRUE)
#> Error in system.file("extdata", "I_do_not_exist.csv", package = "readr", : no file found
The fs package offers fs::path_package()
. This is essentially base::system.file()
with a few added features that we find advantageous, whenever it’s reasonable to take a dependency on fs:
- It errors if the filepath does not exist.
- It throws distinct errors when the package does not exist vs. when the file does not exist within the package.
- During development, it works for interactive calls, calls from within the loaded package’s namespace, and even for calls originating in dependencies.
fs::path_package("extdata", package = "idonotexist")
#> Error: Can't find package `idonotexist` in library locations:
#> - 'D:/1.study/R/R-4.2.0/library'
fs::path_package("extdata", "I_do_not_exist.csv", package = "readr")
#> Error: File(s) 'D:/1.study/R/R-4.2.0/library/readr/extdata/I_do_not_exist.csv' do not exist
fs::path_package("extdata", "chickens.csv", package = "readr")
#> D:/1.study/R/R-4.2.0/library/readr/extdata/chickens.csv
7.3.2 pkg_example()
path helpers
We like to offer convenience functions that make example files easy to access. These are just user-friendly wrappers around system.file()
or fs::path_package()
, but can have added features, such as the ability to list the example files. Here’s the definition and some usage of readxl::readxl_example()
:
readxl_example <- function(path = NULL) {
if (is.null(path)) {
dir(system.file("extdata", package = "readxl"))
} else {
system.file("extdata", path, package = "readxl", mustWork = TRUE)
}
}
readxl::readxl_example()
#> [1] "clippy.xls" "clippy.xlsx" "datasets.xls" "datasets.xlsx"
#> [5] "deaths.xls" "deaths.xlsx" "geometry.xls" "geometry.xlsx"
#> [9] "type-me.xls" "type-me.xlsx"
readxl::readxl_example("clippy.xlsx")
#> [1] "D:/1.study/R/R-4.2.0/library/readxl/extdata/clippy.xlsx"
7.4 Internal state
Sometimes there’s information that multiple functions from your package need to access that:
- Must be determined at load time (or even later), not at build time. It might even be dynamic.
- Doesn’t make sense to pass in via a function argument. Often it’s some obscure detail that a user shouldn’t even know about.
A great way to manage such data is to use an environment.1 This environment must be created at build time, but you can populate it with values after the package has been loaded and update those values over the course of an R session. This works because environments have reference semantics (whereas more pedestrian R objects, such as atomic vectors, lists, or data frames have value semantics).
Consider a package that can store the user’s favorite letters or numbers. You might start out with code like this in a file below R/
:
favorite_letters <- letters[1:3]
#' Report my favorite letters
#' @export
mfl <- function() {
favorite_letters
}
#' Change my favorite letters
#' @export
set_mfl <- function(l = letters[24:26]) {
old <- favorite_letters
favorite_letters <<- l
invisible(old)
}
favorite_letters
is initialized to (“a”, “b”, “c”) when the package is built. The user can then inspect favorite_letters
with mfl()
, at which point they’ll probably want to register their favorite letters with set_mfl()
. Note that we’ve used the super assignment operator <<-
in set_mfl()
in the hope that this will reach up into the package environment and modify the internal data object favorite_letters
. But a call to set_mfl()
fails like so:2
mfl()
#> [1] "a" "b" "c"
set_mfl(c("j", "f", "b"))
#> Error in set_mfl() :
#> cannot change value of locked binding for 'favorite_letters'
Because favorite_letters
is a regular character vector, modification requires making a copy and rebinding the name favorite_letters
to this new value. And that is what’s disallowed: you can’t change the binding for objects in the package namespace (well, at least not without trying harder than this). Defining favorite_letters
this way only works if you will never need to modify it.
However, if we maintain state within an internal package environment, we can modify objects contained in the environment (and even add completely new objects). Here’s an alternative implementation that uses an internal environment named “the”.
the <- new.env(parent = emptyenv())
the$favorite_letters <- letters[1:3]
#' Report my favorite letters
#' @export
mfl2 <- function() {
the$favorite_letters
}
#' Change my favorite letters
#' @export
set_mfl2 <- function(l = letters[24:26]) {
old <- the$favorite_letters
the$favorite_letters <- l
invisible(old)
}
Now a user can register their favorite letters:
mfl2()
#> [1] "a" "b" "c"
set_mfl2(c("j", "f", "b"))
mfl2()
#> [1] "j" "f" "b"
Note that this new value for the$favorite_letters
persists only for the remainder of the current R session (or until the user calls set_mfl2()
again). More precisely, the altered state persists only until the next time the package is loaded (including via load_all()
). At load time, the environment the
is reset to an environment containing exactly one object, named favorite_letters
, with value (“a”, “b”, “c”). It’s like the movie Groundhog Day. (We’ll discuss more persistent package- and user-specific data in the next section.)
Jim Hester introduced our group to the nifty idea of using “the” as the name of an internal package environment. This lets you refer to the objects inside in a very natural way, such as the$token
, meaning “the token”. It is also important to specify parent = emptyenv()
when defining an internal environment, as you generally don’t want the environment to inherit from any other (nonempty) environment.
As seen in the example above, the definition of the environment should happen as a top-level assignment in a file below R/
. (In particular, this is a legitimate reason to define a non-function at the top-level of a package; see section sec-code-when-executed for why this should be rare.) As for where to place this definition, there are two considerations:
Define it before you use it. If other top-level calls refer to the environment, the definition must come first when the package code is being executed at build time. This is why
R/aaa.R
is a common and safe choice.Make it easy to find later when you’re working on related functionality. If an environment is only used by one family of functions, define it there. If environment usage is sprinkled around the package, define it in a file with package-wide connotations.
Here are some examples of how packages use an internal environment:
- googledrive: Various functions need to know the file ID for the current user’s home directory on Google Drive. This requires an API call (a relatively expensive and error-prone operation) which yields an eye-watering string of ~40 seemingly random characters that only a computer can love. It would be inhumane to expect a user to know this or to pass it into every function. It would also be inefficient to rediscover the ID repeatedly. Instead, googledrive determines the ID upon first need, then caches it for later use.
- usethis: Most functions need to know the active project, i.e. which directory to target for file modification. This is often the current working directory, but that is not an invariant usethis can rely upon. One potential design is to make it possible to specify the target project as an argument of every function in usethis. But this would create significant clutter in the user interface, as well as internal fussiness. Instead, we determine the active project upon first need, cache it, and provide methods for (re)setting it.
The blog post Package-Wide Variables/Cache in R Packages gives a more detailed development of this technique.
7.5 Persistent user data
Sometimes there is data that your package obtains, on behalf of itself or the user, that should persist even across R sessions. This is our last and probably least common form of storing package data. For the data to persist this way, it has to be stored on disk and the big question is where to write such a file.
This problem is hardly unique to R. Many applications need to leave notes to themselves. It is best to comply with external conventions, which in this case means the XDG Base Directory Specification. You need to use the official locations for persistent file storage, because it’s the responsible and courteous thing to do and also to comply with CRAN policies.
The primary function you should use to derive acceptable locations for user data is tools::R_user_dir()
3. Here are some examples of the generated filepaths:
tools::R_user_dir("pkg", which = "data")
#> [1] "C:\\Users\\13081\\AppData\\Roaming/R/data/R/pkg"
tools::R_user_dir("pkg", which = "config")
#> [1] "C:\\Users\\13081\\AppData\\Roaming/R/config/R/pkg"
tools::R_user_dir("pkg", which = "cache")
#> [1] "C:\\Users\\13081\\AppData\\Local/R/cache/R/pkg"
One last thing you should consider with respect to persistent data is: does this data really need to persist? Do you really need to be the one responsible for storing it?
If the data is potentially sensitive, such as user credentials, it is recommended to obtain the user’s consent to store it, i.e. to require interactive consent when initiating the cache. Also consider that the user’s operating system or command line tools might provide a means of secure storage that is superior to any DIY solution that you might implement. The packages keyring, gitcreds, and credentials are examples of packages that tap into externally-provided tooling. Before embarking on any creative solution for storing secrets, consider that your effort is probably better spent integrating with an established tool.
If you don’t know much about R environments and what makes them special, a great resource is the Environments chapter of Advanced R.↩︎
This example will execute without error if you define
favorite_letters
,mfl()
, andset_mfl()
in the global workspace and callset_mfl()
in the console. But this code will fail oncefavorite_letters
,mfl()
, andset_mfl()
are defined inside a package.↩︎Note that
tools::R_user_dir()
first appeared in R 4.0. If you need to support older versions of R, then you should use the rappdirs package, which is a port of the Python appdirs module, and which follows the tidyverse policy regarding R version support, meaning the minimum supported R version is advancing and will eventually slide past R 4.0. rappdirs produces different filepaths thantools::R_user_dir()
. However, both tools implement something that is consistent with the XDG spec, just with different opinions about how to create filepaths beyond what the spec dictates.↩︎