If all the world were a monorepo

As a software engineer raised on a traditional diet of C, Java, and Lisp, I’ve found myself downright baffled by R. I’m no stranger to mastering new programming languages, but learning R was something else: it felt like studying Finnish after a lifetime of speaking Romance languages.

I’m not alone in this experience. There are piles of discussions online revealing the difficulty of using R, with some users becoming so enraged as to claim that R is “not actually a programming language”. My struggle with R continued even after developing my first package, grf. Once it was feature-complete, it took me nearly two weeks to navigate the publication process for R’s main package manager, CRAN.

In the years since, my discomfort has given way to fascination. I’ve come to respect R’s bold choices, its clarity of focus, and the R community’s continued confidence to ‘do their own thing’. In what other ecosystem would a top package introduce itself using an eight-variable equation?

Learning R has expanded how I think as a software engineer, precisely because its perspective and community are so different to my own. This post explores one truly unique aspect of the R ecosystem, reverse dependency checks, and how it changed the way I approach software maintenance.

With many package managers like npm and PyPI, developers essentially publish and update packages ‘at will’. It’s largely the author’s responsibility to test the package before it’s released. Not so in the R ecosystem. CRAN, R’s central package manager, builds each package before publication, testing against a variety of R versions and operating systems. If one of your package’s unit tests fails on Windows Server 2022 plus a development version of R, you’ll receive an email from the CRAN team explaining why it can’t be published. Until 2021, CRAN even required packages to build against Sun Microsystems’ Solaris, an operating system for which it’s hard to track down so much as a VM!
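If you want a feel for this gauntlet, you can approximate CRAN’s incoming checks locally before submitting. Here’s a minimal sketch using the devtools package (the package path is a placeholder, and CRAN’s own farm additionally fans out across R versions and operating systems):

# A rough local approximation of CRAN's incoming checks.
# "path/to/grf" is a placeholder for your package's source directory.
install.packages("devtools")
devtools::check("path/to/grf", args = "--as-cran")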

After an initial push to release my package grf on CRAN and several smooth version updates, it came time for version 2.0. I sent the package to CRAN for review and received a surprising email in response:

Dear maintainer,

package grf_2.0.0.tar.gz has been auto-processed. The auto-check found problems when checking the first order strong reverse dependencies.

Please reply-all and explain: Is this expected or do you need to fix anything in your package? If expected, have all maintainers of affected packages been informed well in advance? Are there false positives in our results?

What on earth did the CRAN team mean by “checking the first order strong reverse dependencies”? The package had passed all my tests against the full matrix of R versions and platforms. But… CRAN had also rerun the tests for every package that depends on mine, even though those packages don’t belong to me! The email went on to name the precise package and tests that were failing:

══ Failed tests ══════════════════════════════

── Error (test_cran_smoke_test.R:10:3): a simple workflow works on CRAN ────────
Error: unused argument (orthog.boosting = FALSE)
Backtrace:
 1. └─policytree::multi_causal_forest(X, Y, W) test_cran_smoke_test.R:10:2
 2.   └─base::lapply(...)
 3.     └─policytree:::FUN(X[[i]], ...)

Taking advantage of the major version bump, we had snuck in a small API change. This change then caused a test failure in policytree, a separate CRAN package, which meant our package was blocked from publication until we helped the other package with their upgrade. Fortunately, the author of policytree was also a core grf contributor, so we could quickly address the issue and our release was only delayed by a day or so.
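In hindsight, the break was avoidable with a standard R deprecation idiom: keep the removed argument around for one more release, warn, and ignore it. Here’s a hypothetical sketch (the names echo the error above; grf’s real signatures differ):

# Hypothetical shim: accept the old argument, warn, and ignore it, so
# downstream callers passing orthog.boosting keep working for one release.
regression_forest <- function(X, Y, ..., orthog.boosting = NULL) {
  if (!is.null(orthog.boosting)) {
    warning("'orthog.boosting' is deprecated and has no effect; ",
            "it will be removed in the next major release.")
  }
  # ... fit the forest as before, ignoring the deprecated argument ...
}

With a shim like this, policytree’s tests would have passed with a warning, and its maintainer could have upgraded on their own schedule.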

My initial reaction to CRAN’s reverse dependency checks was one of shock and concern. It’s critical to be able to make breaking changes to address clear API inconsistencies and other usability issues. Will reverse dependency checks make it nearly impossible to update my package as it becomes more popular? One of the authors of glmnet, a widely used R package for regularized regression, told me they had to coordinate with no fewer than 150 reverse dependencies in a recent release!
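That number is easy to check for yourself: ‘first order strong reverse dependencies’ are simply the CRAN packages whose Depends, Imports, or LinkingTo fields name yours, and base R can enumerate them. A quick sketch using the standard tools package (substitute any package name):

# Enumerate the first-order strong reverse dependencies of a CRAN package.
db <- available.packages(repos = "https://cloud.r-project.org")
tools::package_dependencies("glmnet", db = db, which = "strong", reverse = TRUE)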

I also didn’t understand why it was my responsibility to help update other packages (whose code I may not understand or endorse) before mine could be released. The concept of reverse dependency checks felt truly radical. So why does CRAN perform these checks?

The world of R packages is like one huge monorepo. When declaring dependencies, most packages don’t specify any version requirements, and when they do, it’s usually just a lower bound like ‘grf (>= 1.0)’. This lets a user sit down at their computer, update all the packages in their environment to the latest versions, and resume their work, much like performing a ‘git pull’ in a single repo.
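Concretely, a package’s DESCRIPTION file might declare its dependencies like this (an illustrative fragment; the parenthesized lower bound is the only constraint most packages ever state):

Imports:
    grf (>= 1.0),
    Matrix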

CRAN’s approach clearly puts a burden on package developers, and in my experience, can materially slow down the release of new versions. But it results in an excellent workflow for R’s central user base: the researchers and data scientists who want to spend as much time as possible on the details of data analysis, not stuck in transitive dependency hell.

Earlier in my career, I led the removal of mapping types in the popular search engine Elasticsearch. This was the biggest API break in Elasticsearch’s history, changing the signature of every core method in the service.

We ran the migration in a classic ‘federated’ manner: we shipped the changes, updated the docs, and our users figured out the details of upgrading their own applications. A big benefit was that my team could complete the API changes fairly quickly, while continuing to make progress on exciting new features like vector search. But the migration had a steep cost: over 6 years later, there are thousands of projects still stuck on an older version. Just this year, a friend reached out for personal help with their upgrade!

One perspective is that R simply chooses a different point on the trade-off curve between ease of software evolution and integration cost. By design, CRAN’s ‘reverse dependency’ checks encourage a culture where breaking changes are less common. In my experience, R packages are more likely to have inconsistent APIs, confusing naming, or syntactic ‘sharp edges’ than packages in the other language ecosystems I’ve worked in. But as a user of other R packages, it’s refreshing to rarely think about dependency versions or upgrades at all.

I’ve come to hold a much bolder perspective: R doesn’t just occupy a different point on the trade-off curve; the trade-off curve itself is not quite what we think.

Obviously, most software ecosystems function nothing like monorepos. There are hundreds of thousands of open source repositories that depend on the Elasticsearch APIs from that migration, and that’s not even counting the many closed-source uses. Even for this one project, the scale of the dependency graph far exceeds that of the R ecosystem.

But although the reality is messy, I believe CRAN’s mindset is truly the right one for running migrations. It places you in a powerful state of extreme empathy: our users’ code is our responsibility, and their success is ours too. Revisiting the Elasticsearch update, I could have easily reached out to the most popular open source projects affected by the change with detailed examples of how to migrate. The ‘monorepo mindset’ might have even led me to perform some migrations myself. As it always goes, I would have become frustrated after a few refactors, and created tooling that could automate common changes. It would’ve been more work for me and my team, and we would’ve still missed many cases due to the sheer number and diversity of ‘reverse dependencies’. But it would’ve saved enormous amounts of time for tens of thousands of developers, and ultimately been better for Elasticsearch itself.

What changed my perspective on CRAN? I started a new position at Databricks in developer infrastructure, giving me the opportunity to both run and observe complex migrations in a large tech company. After one large migration after another languished, the engineering team had the insight to centralize as many migrations as possible: if a team makes a breaking change to a library or framework, it is their responsibility to update other teams’ code from start to finish. Compared to the traditional ‘federated’ strategy, we’ve observed that more migrations reach full completion, in a much shorter time and with fewer regressions. And with LLM assistance, we’ve been able to centralize and complete more ambitious migrations than we previously thought feasible. I really saw the ‘monorepo mindset’ at work.

So here’s to empathy — for all our ‘reverse dependencies’ out there, and for those languages like R that we initially find shocking!

Thank you to Brett Wines and Stefan Wager for their valuable suggestions on this post.
