Making a Custom CPU platform and automated build toolchain

This is the first post in the series describing the development of a custom, small CPU platform, Mrav. Mrav is a set of tools and build flows for constructing simple embedded hardware and software solutions side by side, with maximum automation and portability.

Many ideas in this series are heavily inspired by my experience working on the development of TPU chips at Google several years ago. Reflecting on my time there, I’ve come up with many questions regarding computer engineering, and Mrav is my first deep-dive independent exploration.

In this text, I will cover the high-level ideas, briefly outline the milestones, and set expectations for the rest of the series.

I will reiterate it later in the text, but think of the Mrav project as a concrete Proof of Concept for some of the ideas presented here, essentially an essay in the form of buildable code, rather than a useful project in itself.

Background

I’ve been coding for most of my life, but it wasn’t until I studied electrical engineering that I understood that hardware and software in digital computers (as we generally reason about them) are fundamentally the same. Both typically implement algorithms, differing primarily in the format of inputs, outputs, and state transitions.

In the embedded world, engineers tend to be more aware of these differences than software engineers operating very high in the abstraction stack. For these low-level tasks, keeping the hardware and software layers in sync can be difficult and, at times, very challenging to debug.

To further study this problem and understand the gaps in hardware/software co-design for simpler use cases, I decided to create a fully custom CPU core.

I would like to emphasize ‘simpler’ here, as the goal is not to implement a high-performance platform capable of hosting LLMs, despite my earlier reference to TPUs. The goal is to create a very simple and enjoyable workflow for experimenting with tiny computers.

To be consistent with the terminology and to further reinforce the idea that hardware and software are the same, I will refer to the presented workflows as ‘computer engineering’ flows, rather than hardware/software co-development or similar terms.

Therefore, the rest of this series will focus on hardware + software almost as one monolithic “computer,” rather than fully separate levels of abstraction. Mrav should be thought of as a Proof of Concept for the ideas from these texts, rather than a production-ready solution (at least at this point in time).

Inspiration

I first engaged in combined hardware and software development during my graduation project. At the time, I really enjoyed exploring FPGAs, and soft-core CPUs seemed infinitely interesting because they allowed me to reconfigure a single chip to run different architectures, swap out peripherals, and so on. MicroBlaze caught my attention, and I wanted to base my project on it.

However, integrating everything ultimately proved to be overkill for my project and, if I recall correctly, required expensive licenses and the installation of various software packages. Consequently, my attention shifted to PicoBlaze.

PicoBlaze is extremely simple to understand and, for the most part, trivial to use. It lacks a C compiler and is instead programmed through a simple assembly language, but it gets the job done for small use cases.

The experience wasn’t entirely without discomfort, though. The assembler ran only on Windows, if I recall correctly, or at least I had trouble running it on Linux. I had to iterate on my design by running the tools inside a Windows VM, exporting to my Linux host, refining there, and repeating the process.

However, it was clear that PicoBlaze had all the features I needed for my project, and I assumed (correctly, I believe in retrospect) that the iteration discomfort was still lower than the cost of other platforms, which would have required me to set up an elaborate development environment, follow an opinionated iteration flow, and so forth.

Nevertheless, in the back of my mind, I remained curious about how what I did for my graduation project could be generalized and expanded into something more useful.

Iteration speed in computer engineering

I credit Google (as a workplace, not the search engine) for teaching me almost everything I know about engineering discipline. One of the most important lessons I still remember came when, as an intern, I was asked to write unit tests for a very simple C++ component I had written. I thought tests were unnecessary given how simple the component was, but the Google culture was clear: if you write production logic, you write tests for it as well, and you cannot push your code without them, even in simple cases.

I quickly started seeing the benefits of having code properly tested at different levels (unit, integration, etc.), and I formed an opinion that for code with a mid-to-long lifespan (this distinction is very important!), paying the initial cost of setting up tests actually prepares you for rapid development in the long run.

This is an opinion I still find myself debating with people, and I generally dislike over-indexing on the fact that writing tests alongside the first version of the code takes longer than writing the first version alone, because that framing ignores the cost of subsequent iteration. This is why I emphasized the mid-to-long lifespan of the code. A piece of code can stop being relevant 5 minutes after its first run (and that can be 100% legitimate), but code that plays a crucial role in our daily lives can stay relevant for 20 years or even much longer.

Over this extended lifetime, many people will iterate on the code, some with more context on the project, some with less. It should be uncontroversial to say that, ideally, we want to reduce the context needed to the bare minimum.

At the same time, good testing and iteration require that the system be easy to run in a test environment. If the context needed for this is extensive, few people can iterate in a self-service fashion: the required domain knowledge grows, and iteration speed drops.

Linus Torvalds can’t use Debian!

I’d like to pause here and explain one of the biggest engineering revelations I’ve had: watching Linus Torvalds state that Debian is too hard to use!

You can definitely see something is amiss here. A legendary software engineer who is behind some of the most influential code ever, including Linux itself, states that a Linux distribution is too difficult to use.

When I “grew up” as an engineer, I think I started to understand the message here: “too difficult to use” doesn’t mean “can’t use,” but rather something along the lines of “it’s not worth it.” Linus Torvalds is definitely smart enough to figure out just about anything related to Debian but prefers not to expend effort there.

Years later, I consider myself a Linux enthusiast, and I’m completely fine installing Ubuntu or Fedora on my personal machines and moving on with my life. I can put together an Arch or Gentoo installation, but the efficiency isn’t there. Setting up a computer is a task I want to do quickly, as simply as possible, and I want to focus on building something else.

The message here is: keep things as simple as possible; otherwise, even if they're doable, they become less appealing. My goal when engineering now is to automate everything as much as possible and leave no room for error, not because it's hard to do things in less automated ways, but because automation eliminates the chance of human error.

Automation for iteration

I cannot discuss the details of TPU development for obvious reasons, but I can say that in a complex project like that, iteration is challenging and absolutely necessary.

There are technical difficulties, of course; you are working on bleeding-edge hardware, and working on hardware already means there aren’t many abstractions you depend on — you’re at the bottom of the stack. However, what makes development very challenging is the number of people with vastly different profiles who are involved. As mentioned before, hardware and software are tightly coupled here, and they need to be in sync. Unfortunately, you can’t expect everyone on the project to keep everything in mind at all times. The project evolves every hour, and assumptions inevitably evolve as well.

The concrete takeaway here is to simply automate as much as possible and make couplings as explicit as possible (and as few as possible, ideally).

Simplicity of Mrav

Given that this is only the zeroth text in the series about ideas in Mrav, I’ll conclude this first write-up by explaining how Mrav achieves some of the values highlighted above.

Let’s look at the following problem: we want to deploy a Mrav-based system that looks like an SoC, loaded with software that continuously blinks the LEDs on an FPGA board.

I’ll go into more detail (the build system and everything else) in the following texts; for now, let’s just stick with the high-level idea.

mrav_small(
    name = "soc",
    software = ":sw.bin",
    gpio_verilog = "//hardware/rtl/soc:gpio.sv",
    soc_top = "//hardware/rtl/soc:soc.sv",
)

This instantiates a couple of “targets” we can build, the main one being the implicit soc_bundle.sv. Building it simply means running this:

bazel build //deployments/led:soc_bundle.sv

That yields a .sv file that can go directly into Vivado and onto the FPGA!

This does many things under the hood, including:

  1. Instantiates the peripherals in the SoC (e.g. GPIO). Among other things, libraries are generated for targeting the peripherals from the assembly code.
  2. Generates the “stitching” code that puts the whole system together: core + bus + peripherals.
  3. Declares a dependency on the sw.bin file, which contains the machine code for the aforementioned application; the memory is hardcoded to include this machine code at FPGA initialization.
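To make this more concrete, below is a minimal Starlark sketch of how such a macro could be structured. This is my own illustration, not Mrav’s actual implementation: the //tools:* helper binaries, the flag names, and the rule bodies are all hypothetical.

# Illustrative sketch only; the helper tools and flags below are hypothetical.
def mrav_small(name, software, gpio_verilog, soc_top):
    """Expands an SoC description into buildable targets."""

    # 1. Generate an assembly-level library for driving the GPIO peripheral,
    #    so software targets can depend on it by name (e.g. ":soc_gpio_lib").
    native.genrule(
        name = name + "_gpio_lib",
        srcs = [gpio_verilog],
        outs = [name + "_gpio_lib.mrav"],
        cmd = "$(location //tools:periph_libgen) $(SRCS) > $@",
        tools = ["//tools:periph_libgen"],
    )

    # 2. Stitch core + bus + peripherals together, preload the memory with
    #    the machine code, and emit a single SystemVerilog bundle.
    native.genrule(
        name = name + "_bundle",
        srcs = [soc_top, gpio_verilog, software],
        outs = [name + "_bundle.sv"],
        cmd = ("$(location //tools:stitcher)" +
               " --soc_top=$(location {})".format(soc_top) +
               " --gpio=$(location {})".format(gpio_verilog) +
               " --software=$(location {})".format(software) +
               " > $@"),
        tools = ["//tools:stitcher"],
    )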

Similarly, the software target is created like this:

mrav_binary(
    name = "sw",
    srcs = [
        "sw.mrav",
    ],
    deps = [
        ":soc_gpio_lib",
    ],
    out = "sw.bin",
)

This should be pretty self-explanatory, illustrating how the software depends on the generated software library for targeting the GPIO peripheral.
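Assuming these targets live in the same //deployments/led package as before, the machine code alone can presumably also be built on its own:

bazel build //deployments/led:sw.bin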

For simulation purposes, it’s possible to set up a simulator run like this:

run_binary(
    name = "sw_run_simple",
    srcs = [":sw.bin"],
    outs = [":sw_mrav_state.txt"],
    args = [
        "--software=$(location :sw.bin)",
        "--instructions_to_sim=500",
        "--core_state_output=$(location :sw_mrav_state.txt)",
        "--verbose",
    ],
    tool = "//system/binaries/memonly",
)

The simulation run depends on the software in the same way that the FPGA bundle image does, by declaring a dependency on sw.bin. For a quick verification, the user can simply run this:

bazel build //deployments/led:sw_mrav_state.txt

This provides a .txt file which shows the CPU core state at the end of the software run:

PC = 0024, [ r0 = 0000 r1 = 0070 r2 = 00FF r3 = 5600 r4 = 00E0 r5 = 0001 r6 = 0000 r7 = 0000 r8 = 0000 r9 = 0000 r10 = 0000 r11 = 0000 r12 = 0000 r13 = 0000 r14 = 0000 r15 = 000C ]

The simulation run and the FPGA image generation both assemble the software on the fly. The user doesn’t even have to know where the assembler comes from or how exactly it is invoked to produce the machine code. What really happens is that an assembler written in Go is built transparently as a dependency in the build graph and then invoked over the .mrav assembly sources to produce the final result. The only context the user needs is that bazel build should be run, and nothing else. The specifics of running the Go compiler, building the assembler, and invoking it are handled transparently by the build system. The same applies to the simulator: it is just another Go binary.
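As a rough illustration of how this kind of transparency is typically wired up in Bazel (a generic sketch; the target names and assembler flags are mine, not necessarily Mrav’s), the assembler can be declared as a go_binary and listed as the tool of a rule that consumes the .mrav sources. Bazel then builds the assembler with its own Go toolchain before running it, all within a single bazel build:

# Illustrative BUILD snippet; target names and assembler flags are hypothetical.
# (In older WORKSPACE-based setups the repo is @io_bazel_rules_go instead.)
load("@rules_go//go:def.bzl", "go_binary")

# The assembler is just another node in the build graph.
go_binary(
    name = "assembler",
    srcs = ["assembler.go"],
)

# Anything that needs machine code lists the assembler as a tool; Bazel
# builds it first and invokes it transparently over the assembly sources.
genrule(
    name = "sw_asm",
    srcs = ["sw.mrav"],
    outs = ["sw.bin"],
    cmd = "$(location :assembler) --input=$(location sw.mrav) --output=$@",
    tools = [":assembler"],
)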

A hidden benefit that is easy to overlook: the build flow makes no assumptions about the availability of a Go toolchain on your build machine. It dynamically fetches the version the project pins and uses it in the background, without affecting any toolchains on the host. In other words, the build here is entirely hermetic.
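For context, this is how that hermeticity is usually expressed with Bazel’s Go rules. The snippet below is a generic example, not a copy of Mrav’s configuration, and the version numbers are placeholders:

# MODULE.bazel (generic example; versions are placeholders)
bazel_dep(name = "rules_go", version = "0.46.0")

go_sdk = use_extension("@rules_go//go:extensions.bzl", "go_sdk")

# Bazel downloads exactly this SDK and uses it for every build,
# regardless of what is (or isn't) installed on the host machine.
go_sdk.download(version = "1.22.0")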

Concepts like these are just some of the ways that iteration speed is improved and the context needed to make progress with Mrav is minimized. I previously wrote a fair bit about testing but didn’t demonstrate it here — this is something I would like to leave for subsequent texts.

A note on the name

Mrav comes from my native language, Serbian, and simply means ‘ant’. It shouldn’t be too hard for speakers of other languages to remember this word, and the association here is that this CPU is small but a hard worker.

Conclusion & next steps

Check out the Mrav project page and explore the GitHub repo to start playing with the Mrav CPU. In the next few weeks, I will release more texts explaining the intuition behind the entire design.

In the meantime, please explore the project’s automation and consider how you feel about making everything runnable uniformly, without any background knowledge. Focus on how the different build system targets are set up so that a simple bazel build rolls out the entire deployment artifact, whether it’s an RTL simulation, a bundle for FPGA deployment, machine code for Mrav, or anything else. That’s the sole focus of today’s text, and we will deep dive into the implementation in the following texts.

I hope this is useful!

Please consider following on Twitter/X and LinkedIn to stay updated.
