TPDE is a single-pass compiler backend for LLVM that was open-sourced earlier this year by researchers at TUM. The comprehensive documentation walks you through integrating TPDE into custom builds of Clang and Flang. Currently, it supports LLVM 19 and LLVM 20 release versions.
Integration in LLVM ORC JIT
TPDE’s primary strength lies in delivering low-latency code generation while maintaining reasonable -O0 code quality — making it an ideal choice for a baseline JIT compiler. LLVM’s On-Request Compilation (ORC) framework provides a set of libraries for building JIT compilers for LLVM IR. While ORC uses LLVM’s built-in backends by default, its flexible architecture makes it straightforward to swap in TPDE instead!
Let’s say we use the LLJITBuilder interface to instantiate an off-the-shelf JIT:
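A minimal sketch of such a stock JIT (ExitOnErr is LLVM's usual shorthand for unwrapping Expected values):

```cpp
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/Support/TargetSelect.h"

using namespace llvm;
using namespace llvm::orc;

int main() {
  // LLJIT resolves the host target through LLVM's TargetRegistry, so the
  // native backend must be registered first (more on that caveat below).
  InitializeNativeTarget();
  InitializeNativeTargetAsmPrinter();

  ExitOnError ExitOnErr;
  auto JIT = ExitOnErr(LLJITBuilder().create());
  // ... add modules and look up symbols ...
}
```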
The builder offers several extension points to customize the JIT instance it creates. To integrate TPDE, we’ll override the CreateCompileFunction member, which defines how LLVM IR gets compiled into machine code:
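The builder populates that member through its setCompileFunctionCreator() setter. A sketch, with TPDECompiler being the wrapper class we define next:

```cpp
LLJITBuilder Builder;
Builder.setCompileFunctionCreator(
    [](JITTargetMachineBuilder JTMB)
        -> Expected<std::unique_ptr<IRCompileLayer::IRCompiler>> {
      // Hand the target description to our TPDE wrapper (defined below).
      return std::make_unique<TPDECompiler>(std::move(JTMB));
    });
auto JIT = ExitOnErr(Builder.create());
```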
To use TPDE in this context, we need to wrap it in a class that’s compatible with ORC’s interface:
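ORC expects an IRCompileLayer::IRCompiler subclass here. A sketch, assuming TPDE's public header tpde_llvm/LLVMCompiler.hpp and its LLVMCompiler::create() factory (check the TPDE sources for the exact names in the version you build against):

```cpp
#include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
#include "llvm/ExecutionEngine/Orc/JITTargetMachineBuilder.h"
#include "tpde_llvm/LLVMCompiler.hpp"

class TPDECompiler : public IRCompileLayer::IRCompiler {
public:
  explicit TPDECompiler(JITTargetMachineBuilder TMB)
      : IRCompiler(irManglingOptionsFromTargetOptions(TMB.getOptions())),
        JTMB(std::move(TMB)),
        Compiler(tpde_llvm::LLVMCompiler::create(JTMB.getTargetTriple())) {}

  // The actual compile callback invoked by ORC's IRCompileLayer.
  Expected<std::unique_ptr<MemoryBuffer>> operator()(Module &M) override;

private:
  JITTargetMachineBuilder JTMB; // kept around; useful for a fallback later
  std::unique_ptr<tpde_llvm::LLVMCompiler> Compiler;
};
```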
In the constructor, we initialize TPDE with a target triple (like x86_64-pc-linux-gnu). TPDE currently works on ELF-based systems and supports both 64-bit Intel and ARM architectures (x86_64 and aarch64). For now, let's assume that's all we need. Next, let's implement the actual wrapper code:
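A sketch, assuming compile_to_elf() fills a std::vector&lt;uint8_t&gt; with the finished object file and returns false on failure:

```cpp
Expected<std::unique_ptr<MemoryBuffer>> TPDECompiler::operator()(Module &M) {
  // B receives the complete ELF object file in memory.
  std::vector<uint8_t> B;
  if (!Compiler || !Compiler->compile_to_elf(M, B))
    return make_error<StringError>("TPDE failed to compile module",
                                   inconvertibleErrorCode());

  // Hand the result to ORC; note the uint8_t -> char cast.
  return MemoryBuffer::getMemBufferCopy(
      StringRef(reinterpret_cast<const char *>(B.data()), B.size()),
      M.getModuleIdentifier());
}
```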
Here’s what’s happening: we create a new buffer B to store the compiled binary code, then pass both the buffer and the module M to TPDE for compilation. If TPDE fails, we bail out with an error. On success, we wrap the result in a MemoryBuffer and return it. (Note: LLVM still uses char pointers for binary buffers, and the C standard’s three-way definition of char trips us up here occasionally, but that’s difficult to change in a mature codebase like LLVM.)
That’s it for the basic integration! There’s no need to patch LLVM; this works with official release versions. We can already compile simple LLVM IR code:
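For instance, a trivial function like this (a hypothetical test input) goes through the pipeline just fine:

```llvm
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add nsw i32 %a, %b
  ret i32 %sum
}
```

Parse it with parseIRFile(), wrap module and context in a ThreadSafeModule, hand it to JIT->addIRModule(), and JIT->lookup("add") returns the address of the TPDE-compiled function.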
I’ve created a complete working demo on GitHub that you can try out. The code in the repo handles a few more details that we’ll explore shortly. Here’s what the output looks like:
We can already see a 4x speedup with TPDE compared to the built-in LLVM codegen for 100 repetitions of a large, self-contained module generated with csmith:
LLJITBuilder has a Catch
While LLJITBuilder is convenient, it comes with a minor trade-off. The interface incorporates standard LLVM components including TargetRegistry, which is perfectly reasonable for most use cases. However, this creates a dependency we might not want: the built-in LLVM target backend must be initialized first via InitializeNativeTarget(). This means we still need to ship the LLVM backend, even though TPDE could theoretically replace it entirely.
If you want to avoid this dependency, you’ll need to set up your ORC JIT manually. For inspiration on this approach, check out how the tpde-lli tool implements it. Before diving into that rabbit hole though, let’s explore another important aspect!
Implementing an LLVM Fallback
One reason why TPDE is so fast and compact is that it focuses on the most common use cases rather than covering every edge case in the LLVM instruction set. The documentation provides this guideline:
Code generated by Clang (-O0/-O1) will typically compile; -O2 and higher will typically fail due to unsupported vector operations.
When your code includes advanced features like vector operations or non-trivial floating-point types, TPDE won’t be able to handle it. In these cases, we need a fallback to LLVM. Since this scenario is quite common in real-world applications, most tools will include both backends (and we can keep using LLJITBuilder). Implementing the fallback is straightforward using ORC’s CompileUtils:
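A sketch of the extended callback: try TPDE first and, only if it rejects the module, construct a one-shot SimpleCompiler from the JITTargetMachineBuilder we stashed in the constructor:

```cpp
#include "llvm/ExecutionEngine/Orc/CompileUtils.h"

Expected<std::unique_ptr<MemoryBuffer>> TPDECompiler::operator()(Module &M) {
  // Fast path: TPDE.
  std::vector<uint8_t> B;
  if (Compiler && Compiler->compile_to_elf(M, B))
    return MemoryBuffer::getMemBufferCopy(
        StringRef(reinterpret_cast<const char *>(B.data()), B.size()),
        M.getModuleIdentifier());

  // Slow path: TPDE rejected the module, fall back to the LLVM backend.
  auto TM = JTMB.createTargetMachine();
  if (!TM)
    return TM.takeError();
  return SimpleCompiler(**TM)(M);
}
```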
Let’s test our fallback mechanism with the following IR file that uses an unsupported type:
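For illustration, a tiny module built around fp128 arithmetic, assuming this counts among the non-trivial floating-point types TPDE rejects (any unsupported vector operation would serve just as well):

```llvm
define fp128 @mulq(fp128 %a, fp128 %b) {
entry:
  %prod = fmul fp128 %a, %b
  ret fp128 %prod
}
```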
Here’s what happens when we run it:
In this implementation, we create a new SimpleCompiler instance for each fallback case. While this adds some overhead, it’s acceptable since we’re already on the slow path. The key assumption is that most code in your workload will successfully compile with TPDE — if that’s not the case, then TPDE might not be the right choice in the first place. Interestingly, this approach has a valuable side-effect that becomes important in the next section: it’s inherently thread-safe!
Adding Concurrent Compilation Support
ORC JIT has built-in support for concurrent compilation. This is neat, but it requires attention when customizing the JIT. Our current setup uses a single TPDECompiler instance, but TPDE’s compile_to_elf() method isn’t thread-safe. Enabling concurrent compilation would cause multiple threads to call this method simultaneously, leading to failures.
How can we solve this? One option would be creating a new tpde_llvm::LLVMCompiler instance for each compilation job, but that adds an overhead of O(#jobs), which is not ideal for our fast path. Essentially, we want to avoid concurrent calls into compile_to_elf() on the same compiler instance. We can achieve this easily by making the TPDE compiler instance thread-local, reducing the overhead to just O(#threads):
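A sketch of the reworked callback (TLCompiler is our name for the per-thread instance):

```cpp
Expected<std::unique_ptr<MemoryBuffer>> TPDECompiler::operator()(Module &M) {
  // One TPDE instance per thread, created lazily on first use and reused
  // for all later jobs on that thread. The single Compiler member from
  // before is no longer needed.
  thread_local std::unique_ptr<tpde_llvm::LLVMCompiler> TLCompiler =
      tpde_llvm::LLVMCompiler::create(JTMB.getTargetTriple());

  std::vector<uint8_t> B;
  if (TLCompiler && TLCompiler->compile_to_elf(M, B))
    return MemoryBuffer::getMemBufferCopy(
        StringRef(reinterpret_cast<const char *>(B.data()), B.size()),
        M.getModuleIdentifier());

  // The fallback stays as-is: a fresh SimpleCompiler per call is
  // inherently thread-safe.
  auto TM = JTMB.createTargetMachine();
  if (!TM)
    return TM.takeError();
  return SimpleCompiler(**TM)(M);
}
```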
We also need to guard access to our underlying buffers:
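What exactly needs guarding depends on your wrapper. As a sketch, suppose it kept shared state across compilations, say a hypothetical running total of emitted object bytes for statistics; any such state now needs a lock (or must become thread-local, like the compiler above):

```cpp
#include <mutex>

// Hypothetical shared state in TPDECompiler; anything like this must be
// protected once multiple compile threads run operator() concurrently.
std::mutex StatsMutex;        // guards TotalObjectBytes
size_t TotalObjectBytes = 0;

void recordObjectSize(size_t NumBytes) {
  std::lock_guard<std::mutex> Lock(StatsMutex);
  TotalObjectBytes += NumBytes;
}
```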
With thread safety handled, we can now enable concurrent compilation:
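With LLJITBuilder, this is a single setter, setNumCompileThreads(). Note that each module must then live in its own LLVMContext (via ThreadSafeModule) so modules can be compiled in parallel:

```cpp
LLJITBuilder Builder;
Builder.setNumCompileThreads(8); // dispatch compile jobs to a thread pool
Builder.setCompileFunctionCreator(
    [](JITTargetMachineBuilder JTMB)
        -> Expected<std::unique_ptr<IRCompileLayer::IRCompiler>> {
      return std::make_unique<TPDECompiler>(std::move(JTMB));
    });
auto JIT = ExitOnErr(Builder.create());
```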
Exercising Concurrent Lookup
It takes quite a bit more support code to actually exercise concurrent compilation and take basic performance measurements. The complete sample project on GitHub has one possible implementation: after loading the input module, it creates 100 duplicates of it with different entry-point names and issues a single JIT lookup for all the entry points at once. Here’s a simplified version of how this works:
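A sketch, assuming the input is available as a ThreadSafeModule named InputTSM and its entry point is a function called entry (both names are placeholders):

```cpp
#include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"

// Add 100 clones of the input module, each with its own LLVMContext
// (a prerequisite for parallel compilation) and a unique entry-point name.
constexpr unsigned NumCopies = 100;
for (unsigned I = 0; I < NumCopies; ++I) {
  ThreadSafeModule TSM = cloneToNewContext(InputTSM);
  TSM.withModuleDo([I](Module &M) {
    // Assumes the entry function exists in the input module.
    M.getFunction("entry")->setName("entry." + std::to_string(I));
  });
  ExitOnErr(JIT->addIRModule(std::move(TSM)));
}

// A single lookup for all entry points at once lets ORC hand the
// compile jobs to its thread pool in parallel.
SymbolLookupSet Symbols;
for (unsigned I = 0; I < NumCopies; ++I)
  Symbols.add(JIT->mangleAndIntern("entry." + std::to_string(I)));
auto Result = ExitOnErr(JIT->getExecutionSession().lookup(
    {&JIT->getMainJITDylib()}, std::move(Symbols)));
```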
Compile times for our csmith example drop from ~2200ms to just ~740ms when utilizing 8 threads in parallel:
Et voilà!
Let’s wrap it up and appreciate the remarkable complexity that LLVM effortlessly handles in our little example. We parse a well-defined, human-readable representation of Turing-complete programs generated from various general-purpose languages like C++, Fortran, Rust, Swift, Julia, and Zig.
LLVM’s composable JIT engine seamlessly manages these parsed modules, automatically resolving symbols and dependencies. It compiles machine code in the native object format on-demand for multiple platforms and CPU architectures, while giving us complete control over the optimization pipeline, code generator (like our TPDE integration) and many more components. The engine then links everything into an executable form — all in-memory and without external tools or platform-specific dynamic library tricks! It’s really impressive that we can simply enable compilation on N threads in parallel and have it “just work” :-)