Why Use a Monorepo Build Tool?

4 months ago 25

Li Haoyi, 17 December 2024

Software build tools mostly fall into two categories:

One question that comes up constantly is why do people use Monorepo build tools? Tools like Bazel are orders of magnitude more complicated and hard to use than tools like Poetry or Cargo, so why do people use them at all?

It turns out that Monorepo build tools like Bazel or Mill do a lot of non-obvious things that other build tools don’t, that become important in larger codebases (100-10,000 active developers). These features are generally irrelevant for smaller projects, which explains why most people do not miss them. But past a certain size of codebase and engineering organization these features become critical. We’ll explore some of the core features of "Monorepo Build Tools" below, from the perspective of Bazel (which I am familiar with) and Mill (which this technical blog is about).

Support for Multiple Languages

While small software projects usually start in one programming language, larger ones inevitably grow more heterogeneous over time. For example, you may be building a Go binary and Rust library which are both used in a Python executable, which is then tested using a Bash script and deployed as part of a Java backend server. The Java backend server may also server a front-end web interface compiled from Typescript, and the whole deployment again tested using Selenium in Python or Playwright in Javascript.

The reality of working in any large codebase and organization, such rube-goldberg code paths do happen on a regular basis, and any monorepo build tool has to accommodate them. If the build tool does not accommodate multiple languages, then what ends up happening is you end up having lots of small build tools wired together. Taking the example above, you may have:

  • A simple Maven build for your backend server,

  • A simple Webpack build for the Web frontend

  • A simple Poetry build for your Python executable

  • A simple Cargo build for your Rust library

  • A simple Go build for your Go binary

Although each tool does its job, none of those tools are sufficient to build/test/deploy your project! Instead, you end up having a bin/ or build/ folder full of .sh scripts that wire up these simpler per-language build tools in ad-hoc ways. And while the individual language-specific build tools may be clean and simple, the rats nest of shell scripts that you also need usually ends up being a mess that is impossible to work with.

That is why monorepo build tools like Bazel and Mill try to be language agnostic. Although they may come with some built in functionality (e.g. Bazel comes with C/C++/Java support built in, Mill comes with Java/Scala/Kotlin), monorepo build tools must make it extending them to support additional languages easy. Bazel via its ecosystem of rules_* libraries, Mill via it’s extensibility APIs which make it easy to implement your own support for additional languages like Python or Typescript. That means that when the built-in language support runs out - which is inevitable in large growing monorepos - the user can smoothly extend the build tool to keep going rather than falling back to ad-hoc shell scripts.

Support for Custom Build Tasks

As projects get large, they also get more unique. Every hello-world Java or Python or Javascript project looks about the same, but larger projects start having unusual requirements that no-one else does, for example:

  • Invoking a bespoke code-generator to integrate with your company’s internal RPC system

  • Generating custom deployment artifact formats to support that one legacy datacenter you need to get your code running in

  • Downloading third-party dependency sources, patching them, and building them from source to work around issues that you have fixed but not yet succeeded in upstreaming

  • Compiling the compiler you need to compile the rest of your codebase, again perhaps to make use of compiler bugfixes that you have not yet managed to get into an upstream release.

The happy paths in build tools are usually great, and the slightly-off-the-beaten-path workflows usually have third-party plugins supporting them: things like linting, generating docker containers, and so on. But as any growing software organization quickly finds itself with build-time use cases that nobody else in the world has. At that point the paved paths have ended and the build engineers will need to implement the custom build tasks themselves

Every build tool allows some degree of customization, but how easy and flexible they are differs from tool to tool. e.g. a build tool like Maven requires its plugins to fit into a very restricted Build Lifecycle (link): this is good when compiling Java source code is all you need to do, but can be problematic when you need to do something more far afield. The alternative is the aforementioned rats-nest of shell scripts - either wrapping or wrapped by the traditional build tools - that implement the custom build tasks you require.

That is why monorepo build tools like Bazel and Mill make it easy to write custom tasks. In Bazel a custom task is just a genrule(), in Mill just def foo = Task { …​ } with a block of code doing what you need, and you can even use any third-party JVM library you are already familiar with as part of your custom task. This helps ensure your custom tasks are written in concise type-checked code with automatic caching and parallelism, which are all things that are lacking if you start implementing your logic outside of your build tool in ad-hoc scripts.

Automatically Caching and Parallelizing Everything

In most build tools, caching is opt in, so the core build/compile tasks usually end up getting cached but everything else is not and ends up being wastefully recomputed all the time. In monorepo build tools like Bazel or Mill, everything is cached and everything is parallelized. Even tests are cached so if you run a test twice on the same code and inputs (transitively), the second time it is skipped.

The importance of caching and parallelism grows together with the codebase:

  • For smaller codebases, you do not need to cache or parallelize at all: compilation and testing are fast enough that you can just run them every time from a clean slate without inconvenience

  • For medium-sized codebases, caching and parallelizing the slowest tasks (e.g. compilation or testing) is usually enough. Most build tools have some support for manually opting-in to some kind of caching or parallelization framework, and although you will likely miss out on many "ad-hoc" build tasks that still run un-cached and sequentially, those are few enough not to matter

  • For large codebases, you want everything to be cached and parallelized. Caching just the "core" build tasks is no longer enough, and any non-cached or non-parallel build tasks results in noticeable slowness and inconvenience.

Take ad-hoc source code generation as an example: a small codebase may not have any. A medium-sized codebase may have some but little enough that it doesn’t matter if it runs sequentially un-cached. But a large codebase may have multiple RPC IDL code generators (e.g. protobuf, thrift, static resource pre-processors, and other custom tasks that not caching and parallelizing these causes visible slowdowns and inconvenience.

In monorepo build tools like Mill or Bazel, caching and parallelism are automatic and enabled by default. That means that it doesn’t matter what you are running - whether it’s the core compilation workflows or some ad-hoc custom tasks - you always get the benefits of caching and parallelization to keep your build system fast and responsive.

Seamless Remote Caching

"Remote caching" means I can compile something on my laptop, you download it to your laptop for usage. "Seamless" means I don’t need to do anything to get this behavior - no manual commands to upload and download - so the distribution of build outputs from my laptop to yours happens completely automatically.

This also applies to tests: e.g. if TestFoo was run in CI on master, if I pull master and run all tests without changing the Foo code, TestFoo is skipped and uses the CI result.

Bazel, Pants, and many other monorepo build tools support this out of the box, with open source back-end servers such as Bazel Remote. The clients and servers all conform to a standardize protocol, so you can easily drop in a new server or new build client and have it work with all your existing infrastructure. Mill does not yet support remote caching, but there are some prototypes and work in progress that will hopefully add support in the not-too-distant future.

Remote Execution

"Remote execution" means that I can run "compile" on my laptop and have it automatically happen in the cloud on 96 core machines, or I run a lot of tests (e.g. after a big refactor) on my laptop and it seamlessly gets farmed out to run 1024x parallel on a large compute cluster.

Remote execution is valuable for two reasons:

  1. Better Parallelism: The largest cloud machines you can get are typically around 96 cores, whereas if you farm out the execution to a cluster you can easily run on many 1024 or more cores in parallel

  2. Better Utilization: e.g. If you give every individual a 96 core devbox, most of the time when they are not actively running anything (e.g. they are thinking, typing, talking to someone, etc.) those 96 cores are completely idle. It’s not usual for utilization on devboxes to be <1% while you are still paying for the other 99% of idle CPU time. In contrast, an auto-scaling remote execution cluster can spin down machines that are not in use, and achieve >50% utilization rates

One surprising thing is that remote execution can be both faster and cheaper_than running things locally on a laptop or devbox! Running 256 cores for 1 minute doesn’t cause any more cloud spending than running 16 cores for 16 minutes, even though the former finishes 16x faster! And due to the improved utilization from remote execution clusters, the total savings can be significant.

Monorepo build tools like Bazel, Pants, and Buck all support remote execution out of the box. Mill does not support it, which means it might not be suitable for the largest monorepos with >10,000 active developers.

Dependency based test selection

When using Bazel to build a large project, you can use bazel query to determine the possible targets and tests affected by a code change, allowing you to easily set up pull-request validation to only run tests downstream of a PR diff and skip unrelated ones. The Mill build tool also supports this, as Selective Execution, letting you snapshot your code before and after a code change and only run tasks that are downstream of those changes.

Fundamentally, running "all tests" in CI is wasteful when you know from the build tool that only some tests are relevant to the code change being tested. If every pull request always runs every single test in a monorepo, then it’s natural for PR validation times to grow unbounded as the monorepo grows. Sooner or later this will start causing issues.

Any large codebase that doesn’t use a monorepo build tool ends up re-inventing this manually, e.g. consider this code in apache/spark that re-implements this in a Python script that wraps mvn or sbt (link). With a proper monorepo build tool, such functionality comes for free out-of-the-box with better precision and correctness than anything you could hack together manually.

Build Task Sandboxing

There are two kinds of sandboxing that monorepo build tools like Bazel do:

  1. Semantic sandboxing: this ensures your build tasks do not make use of un-declared files, or write to places on disk that can affect other tasks. In most build tools, this kind of mistake results in confusing nondeterministic parallelism and cache invalidation problems down the road, where e.g. your build step may rely on a file on disk but not realize it needs to re-compute when the file changes. In Bazel, these mis-configurations result in a deterministic error up front, enforced via a variety of mechanisms (e.g. CGroups on Linux, Seatbelt Sandboxes on Mac-OSX).

  2. Resource sandboxing: Bazel also has the ability to limit CPU/Memory usage (https://github.com/bazelbuild/bazel/pull/21322), which eliminates the noisy neighbour problem and ensures a build step or test gets the same compute footprint whether run alone during development or 96x parallel on a CI worker. Otherwise, it’s common for tests to pass when run alone during manual development, then timeout or OOM when run in CI under resource pressure from other tests hogging the CPU or RAM

Both kinds of sandboxing have the same goal: to make sure your build tasks behave the same way no matter how they are run sequentially or in parallel with one another. Even Bazel’s sandboxes aren’t 100% hermetic, but are hermetic enough

The Mill build tool’s sandboxing is less powerful than Bazel’s CGroup/Seatbelt sandboxes, and simply runs tasks and subprocesses in sandbox directories to try and limit cross-task interference. But it has the same goal of adding best-effort guardrails to mitigate race conditions and non-determinism.

Most small projects never need the features listed above: small projects build quickly without any optimizations, use a single language toolchain without customization, and any bugs related to non-determinism or resource footprint can usually be investigated and dealt with manually. Any missing build-tool features can be papered over with shell scripts.

That is how every small project starts, and as most small projects never grow big you can go quite a distance without needing anything more. While the features above would be nice to have, they are wants rather than needs.

But once in a while, a project does grow large. Sometimes the rocket-ship really does take off! In such cases, as the number of developers grows from 1 to 10 to 1,000, you will inevitably start feeling pain:

  1. Local build times slowing to a crawl on your laptop, using 1 out of 16 available CPUs

  2. Pull-request validation taking 4 hours to run mostly-unnecessary tests with a 50% flake rate

  3. An unmaintainable multi-layer jungle of shell, Python, and Make scripts layered on top of your classic build tools like Maven/Poetry/Cargo, that everyone knows should be cleaned up but nobody knows how.

Monorepo build tools bring performance optimizations to bring down CI times, sandboxing improvements to reduce flakiness, and structured way of replacing the ubiquitous folder-full-of-bash-scripts. It is these features that really let a codebase scale, allowing you to grow your developer team from 100 to 1,000 developers and beyond without everything grinding to a halt. That is why people use "monorepo build tools" like Mill (most suitable for projects 10-1,000 active developers) or Bazel (most suitable for larger projects 100-10,000 active developers) .

Read Entire Article