
Reproducible environments in R

Shifting sands to solid ground

TL;DR 

The renv package brings virtual environments to the R language. Packages are installed into a global cache that carefully records version information, and project-specific libraries are constructed as folders of symlinks to the appropriate package versions. renv keeps package versions consistent between collaborators, and, when paired with versioned package repositories that serve binary packages, restoring a project’s package library on a new computer is a breeze.

Key features:

  1. Backed by RStudio as a replacement for packrat, so it should receive continued development and support
  2. Global package cache reduces wasted space—a big plus when using large packages (e.g. genome packages from Bioconductor)
  3. Plays well with Python virtual environments (see renv::use_python)

Try it out on an existing project:

install.packages("renv")
renv::init()

Why an R package version manager is important 

In short, R package updates frequently introduce breaking changes. Without careful R package management, analyses will not be easily reproducible in the future. This does not mean we should avoid using packages; package version management is a solvable problem.

I was introduced to the R programming language in the spring of 2012 as part of a class on molecular systems biology during the first year of my PhD. The experience was a bit rough at first, but I was determined to learn a language that could help me with my summer rotation project. And, since it was required for my coursework, it was only natural to focus on learning R.

Using R, I encountered a number of pain points, and often considered jumping ship to focus on learning other popular languages. But, every time I was about to turn away from R, somebody would solve my problem in a new package, softening R’s rough edges. All I needed to do was install.packages("solve.erics.problem.now"), and I was happy as a clam¹.

Some notable examples of this:

  • I need to track a project’s progress with dynamic reports—knitr (2012)
  • I need these reports to have proper bibliographies—rmarkdown (2014), an integration of knitr and pandoc (2006)
  • I need better cross-referencing for my thesis—bookdown (2016)
  • I need to do image processing—EBImage (2010)
  • I need to quickly understand analysis code I wrote 6 months ago—dplyr (2014)
  • I need to iterate as quickly as I can think for data visualization—ggplot2 (2007)
  • I need some way for my students to enter data, and to check and version control that data for analyses—googlesheets (2015)
  • I have a bad bottleneck in my code—Rcpp (2008)
  • I want to do some network analysis—igraph (2006)
  • I need more perplexity in my life—Rtsne (2014)

This list could go on forever. There are seemingly innumerable problems that have been solved by an R package. Installing and using packages in R is dead simple. At this moment I have 230 packages installed on my system. And that is me being fairly conservative when adopting new packages!

Unfortunately, casually installing and using packages comes at a cost (see here for some insightful discussion). If you are not careful, you may inadvertently step into dependency hell, which is often not obvious until you need to share your project with a collaborator. They will ask,

What do I need to install?

You respond,

These 130 packages are required. 90 can be installed as usual. Wait, actually, I have some Bioconductor packages in there, so I guess you should use their package installer? Oh, and I’m using the development version of this and that package, so be sure to install from GitHub for those. And, I seem to remember there were some breaking changes to such and such package, so you will need to install the old version until I get around to fixing that. Oh, and I used a new function recently added to R, so be sure to update R—you might need to re-install everything if there has been a minor version change. Don’t worry, I wrote an install script. You ran into an error? You’re on Windows at work, macOS at home and Ubuntu at the gym? Well, then it’s probably a problem with your setup. I recommend testing installation at work, home, and gym. Do you have Rtools/Xcode/build-essential installed? Ooooh, actually, I forgot that I started using superAwesomeNewPackage. Be sure to install that as well; I’ll update the install script. Simply run that script and everything should work great.

And they never want to collaborate with you again. Of course, this is an exaggeration. Nevertheless, complicated installation instructions and package updates that introduce breaking changes can make reproducing an old analysis feel nearly impossible, especially if package versions were not meticulously recorded. Certainly, for an infrequent R user, old analyses that rely on many packages are often not reproducible².

So, this is the most pressing pain point I have with R right now:

  • I need to stabilize package versions to protect my projects from bit rot. I also need to make it easy for a collaborator to install a specific constellation of fixed package versions in a way that does not break their existing setup, or cause too much administrative burden. In short, I want to move my projects off of shifting sands and onto solid ground.

My journey to renv 

Over time I have made incremental progress towards solving the package versioning problem for myself. The first part of the solution was to begin using (semi-)stable repositories³. I did this by adding the following to my ~/.Rprofile (an R script that gets sourced when an R session starts).

options(
  repos = c(
    CRAN          = "https://cran.microsoft.com/snapshot/2019-12-12",
    BioCsoft      = "https://bioconductor.org/packages/3.10/bioc",
    BioCann       = "https://bioconductor.org/packages/3.10/data/annotation",
    BioCexp       = "https://bioconductor.org/packages/3.10/data/experiment",
    BioCworkflows = "https://bioconductor.org/packages/3.10/workflows"
  )
)

Once this is set, install.packages() will install the package versions that existed on CRAN on 2019-12-12, and from Bioconductor version 3.10. Of course, the appropriate date and Bioconductor version depend on the version of R currently in use. So, my ~/.Rprofile dynamically sets the appropriate repositories depending on the active version of R, which looks roughly like the following.

.First <- function() .repos_set_stable()

# ---- Hidden convenience functions for personal use ----
.repos_set_stable <- function(r_version = getRversion()) {
  options(
    repos = switch(
      as.character(r_version),
      # Remember to update 3.6.2 to R-4.0.0 release date
      "3.6.2" = .repos_stable("2019-12-12", bioc = "3.10"),
      "3.5.3" = .repos_stable("2019-12-12", bioc = "3.8"),
      "3.4.4" = .repos_stable("2019-03-11", bioc = "3.6"),
      "3.3.3" = .repos_stable("2018-03-19", bioc = "3.4")
    )
  )
}

.repos_stable <- function(date, bioc,
                          url_mran = "https://cran.microsoft.com/snapshot",
                          url_bioc = "https://bioconductor.org/packages") {
  c(CRAN          = file.path(url_mran, date, fsep = "/"),
    BioCsoft      = file.path(url_bioc, bioc, "bioc", fsep = "/"),
    BioCann       = file.path(url_bioc, bioc, "data/annotation", fsep = "/"),
    BioCexp       = file.path(url_bioc, bioc, "data/experiment", fsep = "/"),
    BioCworkflows = file.path(url_bioc, bioc, "workflows", fsep = "/")
  )
}
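
As a quick sanity check, calling the helper for a given R version and then inspecting the repos option shows the pinned CRAN snapshot:

.repos_set_stable("3.6.2")
getOption("repos")[["CRAN"]]
## [1] "https://cran.microsoft.com/snapshot/2019-12-12"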

For personal use, this setup goes a long way towards saving my sanity. Any given version of R gets one supported constellation of package versions from its set of stable repositories, and most of these versions are available as fast-installing pre-compiled binaries on macOS and Windows. If I want to run an old analysis of mine, I switch to the version of R recorded in the project, and everything should work. If I am missing a package, I can use install.packages() as usual without too much fuss.

The problem comes when imposing my workflow onto collaborators. Some collaborators, understandably, want the latest and greatest of everything, and certainly do not want to overwrite any previously installed packages with out-of-date versions. And, if I want to reproduce other people’s work, I have to install the versions they used, which pollutes my package library with “unsupported” (by me) versions. My solution to this was to isolate my “stable” packages into their own library⁴. I cobbled this together into a package called rprofile, which streamlines the creation and configuration of .Rprofile files that stabilize project package versions (and versioned libraries could be shared across projects). This solution was OK, but not great: I was happy with it, and my collaborators seemed to tolerate it, even though it required a small setup cost and yelled at them whenever they installed packages into the default system library (nobody likes being yelled at).

Recently, however, I was helping a collaborator get set up with a project on a Windows computer and ran into some issues⁵.

The renv package arrived on CRAN towards the end of October 2019, just in time to solve my problem (mostly). I still find it useful to use stable repositories, but now, when starting a project, I run renv::init() and renv::snapshot(), and the project gets set up with an isolated package library. Best of all, the repositories and package versions used in the project are recorded in a JSON-formatted file called renv.lock. It looks something like the following.

{
  "R": {
    "Version": "3.6.2",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cran.microsoft.com/snapshot/2019-12-12"
      },
      {
        "Name": "BioCsoft",
        "URL": "https://bioconductor.org/packages/3.10/bioc"
      },
      {
        "Name": "BioCann",
        "URL": "https://bioconductor.org/packages/3.10/data/annotation"
      },
      {
        "Name": "BioCexp",
        "URL": "https://bioconductor.org/packages/3.10/data/experiment"
      },
      {
        "Name": "BioCworkflows",
        "URL": "https://bioconductor.org/packages/3.10/workflows"
      }
    ]
  },
  "Packages": {
    "BH": {
      "Package": "BH",
      "Version": "1.69.0-1",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "f4605d46264b35f53072fc9ee7ace15f"
    },
    ...
  }
}

Getting this project working on another computer just requires calling renv::restore(), and, if the R version doesn’t match the version recorded in renv.lock, a warning is issued on startup. The package library is also fully isolated. For example, the libraries in use as I write this post are shown below. Note that the system default library containing all the base R packages is not present (the second library just holds symlinks to the base R packages). No more worrying about polluting the system library!

.libPaths()
## [1] "/Library/Frameworks/R.framework/Versions/4.2/Resources/library-user"
## [2] "/Library/Frameworks/R.framework/Versions/4.2/Resources/library"
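
Putting it all together, the whole round trip amounts to a handful of calls. A minimal sketch (the comments are my own summary of each step):

# On my machine, in the project directory:
renv::init()      # set up an isolated, project-local package library
# ...install packages and develop as usual...
renv::snapshot()  # record repositories and package versions in renv.lock

# On a collaborator's machine, after cloning the project:
renv::restore()   # reinstall the exact versions recorded in renv.lock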

Also, because I use stable repositories, and because renv records these in renv.lock, binary packages are available for old package versions without my collaborators having to do any configuration.

Perhaps the nicest feature of renv is that it installs packages into a global cache, so that multiple projects using the same version of a package can share resources. In fact, the project’s private package library, stored in renv/library, just contains symlinks into the global cache, so there is almost no overhead to using renv on every new project. All in, renv requires very little investment to use and is certain to save me from the passage of time breaking all my things.
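
If you are curious where the cache lives on a given machine, renv exposes its filesystem paths; the location shown below is only an illustration of what you might see on macOS:

renv::paths$cache()
## e.g. "~/Library/Application Support/renv/cache"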

Remaining rough edges 

renv solves R package stabilization nicely, but, of course, there are still a few rough edges (some of these are clearly out-of-scope for renv to solve). I’m recording these here mostly as a reminder to myself of some avoidable pitfalls when using renv to create reproducible analyses.

  1. CRAN does not maintain binaries for old packages, so, when running renv::restore(), if a collaborator must install an old package, they will often need to install package development prerequisites first to be able to compile source packages. This is not a major issue for most packages, but can lead to installation failures for packages that require esoteric system dependencies. This is why I still use stable repositories that host binaries for the package versions I use. The downside of this is that I have to be diligent about managing stable repositories for every R upgrade (not a huge deal, just an update to my ~/.Rprofile described above). The other downside is that I rely on Microsoft and Bioconductor to host these repositories forever, which is kind of a big ask. 🤷

  2. Depending on the text editor in use, project-wide searches will also crawl the entire contents of renv/library, which slows things down and clutters results with unimportant hits from the package library. Usually this just requires being a little more specific about where to search, so not a huge deal.

  3. Many packages are wrappers for APIs that are out of their control and therefore inherently unstable. googlesheets is a good example of this problem: it provides access to Google Sheets by wrapping version 3 of the Sheets API. Well, now there is googlesheets4, which wraps version 4, and version 3 will stop working in March 2020. Thus, using renv to track and install old versions of googlesheets is kinda pointless and potentially harmful, since the old version will no longer work. For this reason, it is probably best not to strictly rely on packages that wrap APIs. I will still use googlesheets4, because it is extremely useful when you need a simple data-entry method and want to retrieve that data easily and programmatically, but I commit the imported data to my project so that I can always re-run my analysis without strictly requiring googlesheets4 and a stable API from Google. Fortunately, renv does provide a mechanism to exclude packages from renv.lock. I like to add these packages to renv/settings.dcf with a line like the following (a programmatic alternative is sketched after this list):

    ignored.packages: googlesheets

    This prevents renv::snapshot() from adding googlesheets to renv.lock. To be honest, I don’t worry too much about this, because I don’t consider it a huge problem to have a broken API-wrapping package in my project library, so long as that package isn’t strictly necessary to clean and rebuild the project.

  4. I have run into some problems using renv with devtools, but I think this is going to be fixed soon, so I am not going to elaborate on this too much (see here for details).

  5. Rough edge #4 is also related to a small issue I have had with large projects that contain rmarkdown files in subdirectories. The crux of the issue is that rmarkdown/knitr use an independent R process that is launched from the directory containing the rmarkdown file. When this happens, the project’s .Rprofile is not loaded, which means renv is never activated. There are a couple of ways to solve this problem. Currently, I add the following to the first code chunk of my reports:

    knitr::opts_knit$set(root.dir = here::here())

    This tells knitr to run code using an R process launched from the project-level directory, which is found by here::here() (which happens to sound pleasingly congratulatory). Let me know if you have a good way to set this globally for all rmarkdown files in a project!

  6. renv does not perfectly solve the “reproducible environment” problem. For example, rmarkdown has a system requirement for the pandoc command line utility. rmarkdown searches for an existing system installation of pandoc, and for a pandoc that comes bundled with RStudio. rmarkdown then chooses the latest version of pandoc that exists between the system and RStudio installations. This caused a problem when pandoc made major breaking changes with the release of version 2 in 2017. Now, all of my old rmarkdown reports require me to do a back-flip to re-compile. This was especially problematic for me because I upgraded RStudio while a paper was in review only to find that my manuscript would no longer build because of the breaking changes in pandoc. After a brief panic attack, I realized that I should be freezing pandoc the same way renv freezes R packages. Unfortunately, since rmarkdown always picks the latest version it finds from the system or RStudio, I have to use the following magic words in each project’s .Rprofile to override this behavior:

    # Add ./renv/bin to PATH, and override RStudio pandoc
    local({
      bin <- normalizePath("renv/bin")
      Sys.setenv(
        PATH = paste(bin, Sys.getenv("PATH"), sep = ":"),
        RSTUDIO_PANDOC = bin
      )
    })

    This basically adds renv/bin to the front of the PATH variable, so that programs installed there get priority, and points the RSTUDIO_PANDOC environment variable at renv/bin so that rmarkdown will only find pandoc there. I currently copy all system requirements to renv/bin when a project has matured and is ready to be archived, but even then, I often have to figure out how to get R packages to look there for their system requirements. For example, to get blogdown to use the hugo in renv/bin I had to use:

    options(blogdown.hugo.dir = normalizePath("renv/bin"))

    This renv/bin solution of mine feels like an ad-hoc error-prone field repair that is almost certainly not platform independent and doesn’t make it easy to install these requirements on a new system.
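
One more note on rough edge #3: the ignored.packages setting can also be written programmatically via renv’s settings accessors, with the same effect as editing renv/settings.dcf by hand:

# Equivalent to the ignored.packages line in renv/settings.dcf:
renv::settings$ignored.packages("googlesheets")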

Conclusion 

In conclusion, renv is my favorite new development in the R world, and I intend to begin using it with all of my projects, including some of my archived work. Rough edge #6 is my remaining big hurdle on the way to full reproducibility. Eventually I will leap this hurdle with Docker, as that seems to be the way the world is moving, but right now, asking my collaborators to use Docker feels like asking them to travel to my project, which is currently hosted on the moon. I hope to change that feeling soon!

If you have any questions or comments, please reach out to me on Twitter!

Footnotes and references 

Billon, Pierre, Eric E Bryant, Sarah A Joseph, Tarun S Nambiar, Samuel B Hayward, Rodney Rothstein, and Alberto Ciccia. 2017. “CRISPR-Mediated Base Editing Enables Efficient Disruption of Eukaryotic Genes Through Induction of STOP Codons.” Molecular Cell 67 (6): 1068–1079.e4. https://doi.org/10.1016/j.molcel.2017.08.008.

  1. I am not really sure how happy clams are, but I think they are probably this happy 🤩↩︎

  2. My condolences to anyone attempting to reproduce my own analysis, with 90 package dependencies, described in Billon et al. (2017), available here↩︎

  3. Repositories are where R packages are downloaded from.↩︎

  4. A library is a directory where R packages are installed. R packages themselves are directories containing code and documentation.↩︎

  5. Ultimately my rprofile issue was just that Windows does not allow writing to the directory that R lives in, which is where rprofile tried to make the versioned libraries by default. It was an easy fix, but le sigh 😭↩︎