I'm Open-Sourcing My Custom Benchmark GUI


I think one of the reasons why I was able to do good performance work over the years is that at some point I started taking benchmarking seriously enough to write my own library. I used to use Google Benchmark, which is a fine library, but at some point you realize that you need a GUI to really scale up benchmarks. Here is a GitHub link, and this video gives a quick intro:

The main problems it tries to address are:

  1. Getting good numbers by running benchmarks repeatedly, visualizing them in context, picking a single opinionated good visualization, handling noise and even adding a bit of well-justified noise, and being careful about what statistics to do on the numbers.
  2. Dealing with the inevitable combinatorial explosion of benchmarks when you want to try different data structures (min-max heap vs interval heap vs binary heap) with different operations (make_heap, push, pop) on different types (int vs string), different compilers, debug build vs release build, different variants of the code (e.g. trying loop unrolling), different input lengths etc. The full combinatorial explosion might be millions or billions of possible benchmarks. I want to be able to get a first impression for a subset in a few minutes. And then if I want less noisy results I can let it run overnight. And then I can try a new variation and visualize it together with the overnight results in under a minute.
  3. Various ergonomic issues. Making it easy to select which numbers are together on the screen. Having the numbers as a graph first, CSV second. Being robust to the code crashing halfway through a long run: Record the partial results and be able to resume the same run. Making it easy to attach a profiler to one specific benchmark that I’m interested in.

This sounds complicated, and I have to admit that this is very much an app written by a programmer for a programmer, but the whole point of a GUI is that I can make this both more powerful and easier to use at the same time. In fact I think the patterns might be more widely useful for people who do slow-running experiments of other kinds (like training an ML model).

2. Main Lessons to Take From This

The main improvements I made that are relevant for other areas are:

  1. Having a GUI
  2. Making it easy to add tags to your runs
  3. Writing down all the results in a sqlite file

All of these are about adding structure to data. It really sucks when you run a long-running experiment and your only output is a log file plus a CSV. Because then if you want to compare today’s run to the run from a week ago, you have to copy both into one spreadsheet. Do this ten times and your spreadsheet is a giant mess. You’ll be confused and unable to answer questions like

  • “Which run brought the improvement again?”
  • “Wait, I remember this measure being better, when did this get worse?”
  • “That run from two weeks ago is still the best one we had, what changes were active in that again?”

You’ll also have trouble being confident in your improvements. The results are inherently noisy. So you just run each benchmark repeatedly until you get an idea of how much noise there is. How does that repetition compose with whatever you’re doing with your CSV files?

The sqlite database and the tags just solve this. Each result is recorded, each result is tagged. When the same run happened multiple times, you can either visualize all the runs, or show median and error bars. When you try a new variation (e.g. “don’t early-out in function foo”) just add a new tag and make sure to keep the old code working so you can run with the tag turned off and on.
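
As a rough sketch of the kind of schema that makes this work (illustrative table and column names, not necessarily what the tool actually uses), recording results and tags in sqlite can be as simple as:

#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3 * db = nullptr;
    if (sqlite3_open("benchmark_results.db", &db) != SQLITE_OK)
        return 1;
    // one row per individual measurement, never overwritten
    const char * schema =
        "CREATE TABLE IF NOT EXISTS results("
        "  id INTEGER PRIMARY KEY,"
        "  benchmark TEXT NOT NULL,"          // e.g. "benchmark_memory_access"
        "  argument INTEGER NOT NULL,"        // the state.range(0) value
        "  nanoseconds REAL NOT NULL,"
        "  items_processed INTEGER NOT NULL,"
        "  recorded_at TEXT NOT NULL);"       // timestamp, so old runs stay comparable
        // tags are just extra rows, so a result can have any number of them
        "CREATE TABLE IF NOT EXISTS result_tags("
        "  result_id INTEGER NOT NULL REFERENCES results(id),"
        "  tag TEXT NOT NULL);";              // e.g. "clang-15", "no_early_out"
    char * error = nullptr;
    if (sqlite3_exec(db, schema, nullptr, nullptr, &error) != SQLITE_OK)
    {
        std::fprintf(stderr, "%s\n", error);
        sqlite3_free(error);
    }
    sqlite3_close(db);
}

The important part is that every measurement is a row that never gets overwritten, and tags are cheap, so any past run can be pulled back into a comparison later.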

With the repeated runs you have also slowed down your iteration times. Yes you could add a toggle to your scripts to switch between fast/noisy and slow/confident mode, but in practice how many times have you waited for a run that was slower than necessary because you just ran with the default? So the GUI doesn’t just help to organize all of these results, it also automatically transitions between quick/noisy runs and slow/confident runs: Whenever there is anything in your current selection that doesn’t have the quick/noisy result yet, it runs that first. When those are done it repeats runs to reduce noise.
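
A sketch of that scheduling rule, with made-up types and names rather than the tool’s actual code:

#include <string>
#include <vector>

struct BenchmarkSelection
{
    std::string name;
    int num_results = 0; // measurements recorded so far for this selection
};

// first give every selected benchmark at least one quick result,
// then keep re-running whatever has the fewest results to shrink its error bars
const BenchmarkSelection * pick_next(const std::vector<BenchmarkSelection> & selected)
{
    if (selected.empty())
        return nullptr;
    const BenchmarkSelection * fewest = &selected.front();
    for (const BenchmarkSelection & candidate : selected)
    {
        if (candidate.num_results == 0)
            return &candidate; // anything unmeasured goes first
        if (candidate.num_results < fewest->num_results)
            fewest = &candidate;
    }
    return fewest;
}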

3. Further Design Decisions and Features

This started off as very similar to Google Benchmark, and evolved from there.

3.1 The Same Interface as Google Benchmark

Google Benchmark has a good interface. Here is a benchmark for looking at a random item in memory:

void benchmark_memory_access(skb::State & state)
{
    size_t num_bytes = state.range(0) / sizeof(size_t);
    std::vector<size_t> bytes(num_bytes);
    std::iota(bytes.begin(), bytes.end(), size_t(100));
    std::uniform_int_distribution<size_t> random_index(0, num_bytes - 1);
    while (state.KeepRunning())
    {
        for (size_t i = 0; i < num_loops; ++i)
            skb::DoNotOptimize(bytes[random_index(global_randomness)]);
    }
    state.SetItemsProcessed(state.iterations() * num_loops);
}

The structure is

  • Lines 3 to 6: Set up the benchmark
  • Lines 9 and 10: Run the benchmark (the inner loop is part of the benchmark, not of the framework)
  • Lines 1, 7 and 12 are part of the framework.

This is pretty much identical to Google Benchmark’s interface as of a couple of years ago. The “State” parameter has the arguments for the benchmark. I only support one argument, so you always have to call state.range(0). Then you get to write your setup code, and then use a loop with state.KeepRunning(). Finally, since the inner loop does multiple memory lookups per outer iteration, I have to tell the benchmarking framework that I processed more items than it thought. The last line is not required if you just do one iteration in the inner loop.

3.2 Always Specify Baseline Benchmarks

The above benchmark doesn’t tell you much because calculating a random index might take longer than the actual memory lookup. So you can specify a baseline benchmark that will be subtracted out:

void benchmark_memory_access_baseline(skb::State & state)
{
    size_t num_bytes = state.range(0) / sizeof(size_t);
    std::uniform_int_distribution<size_t> random_index(0, num_bytes - 1);
    while (state.KeepRunning())
    {
        for (size_t i = 0; i < num_loops; ++i)
            skb::DoNotOptimize(random_index(global_randomness));
    }
    state.SetItemsProcessed(state.iterations() * num_loops);
}

This is a benchmark that only does the “overhead” work of the previous benchmark. The framework will run both of these together and will always subtract the second benchmark from the first. The difference is the actual cost of a memory lookup to a random location in memory. (allowing for instruction-level parallelism. I have another benchmark where each read depends on the previous read, not allowing for parallelism)
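
The arithmetic behind the subtraction is simple. A sketch, assuming both benchmarks report total time and items processed (a hypothetical helper, not the framework’s actual code):

// reported cost per item = main benchmark's time per item
//                          minus the baseline's time per item
double net_nanoseconds_per_item(double main_nanoseconds, double main_items,
                                double baseline_nanoseconds, double baseline_items)
{
    return main_nanoseconds / main_items - baseline_nanoseconds / baseline_items;
}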

3.3 Add Noise by Running Each Benchmark in a New Process

This is inspired by stabilizer, which isn’t being maintained any more. If you run enough benchmarks you will eventually notice that you can measure noticeable differences between two copies of an identical benchmark in different compilation units. (or compiled on different days)

When you see a speedup you have three different cases:

  1. It’s too noisy to be sure if the speedup is real
  2. The noise is small and the speedup is real
  3. The noise is small and the speedup was caused by compilation luck and can go away on the next recompilation

Like everyone else I tried to ignore this for a long time. You just hope that the third case won’t happen that often. But then I spent way too long chasing a tiny optimization of an inner loop which made no difference. Eventually I noticed that literally the same benchmark compiled twice would give me different results that could not be explained by noise. I could reproduce big speedups over and over again for exactly the same benchmark with identical assembly. Only then did I decide to finally address this.

Since stabilizer is not maintained any more, the closest easy thing you can do is to run with "Address Space Layout Randomization" (ASLR), which means you get a randomized memory layout on each run. This isn’t as comprehensive as what stabilizer does, but I’ve found it to be good enough in practice. After turning this on and running each benchmark in a new process, you mostly remove case 3. Instead you’ve added more noise, so some optimizations that might fall into case 2 will instead fall into case 1. But I visualize how much noise there is, and if something is too noisy, you can try to redesign the benchmark until there is less noise. (usually by looping more often or by optimizing the baseline)
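
A sketch of the relaunch-per-benchmark idea on Linux, with a hypothetical --run-one flag and helper name (for ASLR to do anything, the benchmark binary also has to be built as a position-independent executable, e.g. with -fPIE -pie):

#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// launch the current executable again to run a single benchmark, so every
// measurement gets a fresh randomized address-space layout
int run_benchmark_in_child(const char * self_path, const std::string & benchmark_name)
{
    pid_t child = fork();
    if (child == 0)
    {
        execl(self_path, self_path, "--run-one", benchmark_name.c_str(), (char *)nullptr);
        _exit(127); // exec failed
    }
    int status = 0;
    waitpid(child, &status, 0);
    return status;
}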

In practice this change allowed me to be more accurate, measuring fractions of a nanosecond. Where before you always had to doubt whether you were fooling yourself, now you can actually go after those small improvements that add up.

3.4 Log X Axis, Linear Y axis

The log/log plot is popular for benchmarks because that’s how your numbers grow. You want to measure sorting a list with 10 items, 10,000 items and 10,000,000 items. Putting those numbers on the same graph seems to require a log y-axis. The problem is that you can’t see anything on a log/log plot. An O(n) radix sort will look the same as an O(n log n) quick sort on a log/log plot. Mar’s law says

Everything is linear if plotted log-log with a fat magic marker.

So how do you solve that? You divide by X. Now the O(n) radix sort will look like you’re just plotting a constant value and the O(n log n) quick sort will look like you’re plotting log(x). Which are two completely different plots. I knew this was the right choice when I started seeing all kinds of interesting patterns in the radix sort plot. You can clearly see when you have to change from two passes to three passes, and when your data gets too big to fit into CPU caches.

So this is the only visualization I support: log-scale on the x-axis, and Y/X with linear scale on the Y axis. I’m actually not opposed to adding other visualizations, it’s just that this visualization is so good that I haven’t found the need to add more.
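
Concretely, the transformation from raw measurements to plotted points looks roughly like this (a sketch, assuming measurements of input size and total nanoseconds; taking log2 of the size stands in for the log-scaled x-axis):

#include <cmath>
#include <utility>
#include <vector>

// turn raw (n, total nanoseconds) measurements into plotted points:
// x = log2(n), y = nanoseconds per item on a linear scale
std::vector<std::pair<double, double>> to_plot_points(
    const std::vector<std::pair<double, double>> & measurements)
{
    std::vector<std::pair<double, double>> points;
    for (const auto & [n, nanoseconds] : measurements)
        points.emplace_back(std::log2(n), nanoseconds / n);
    return points;
}

With this transformation an O(n) algorithm comes out as a roughly flat line and an O(n log n) algorithm as a slowly rising curve, which is exactly the difference you want to be able to see.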

3.5 Profiler Mode

When you’re curious about why two lines look different you’ll often want to look at the assembly. (Especially when we’re talking nanoseconds) I do this so often that I had to make it easy. So there is a “Profile Mode” button, which does two things:

  1. Runs a benchmark in an infinite loop. (you click on a point in a graph to decide which benchmark to run)
  2. Stops recording the results. The profiler might add overhead, so we don’t want to persist those results to the database.

With this you can easily attach to the running benchmark and see what it’s doing. I demo this in the video at the beginning, where I show that Clang has a pessimization enabled by default which makes heaps much slower.
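
What profile mode boils down to is roughly this (a hypothetical helper, not the actual implementation):

#include <cstdio>
#include <unistd.h>

// print the pid so a profiler can attach, then run the chosen benchmark body
// forever and never record a result
template<typename BenchmarkBody>
void profile_forever(BenchmarkBody && run_once)
{
    std::printf("attach your profiler to pid %d\n", (int)getpid());
    for (;;)
        run_once();
}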

3.6 Don’t Try to Be Clever About Noise

I don’t have the answer to noisiness in benchmarking. There are lots of different opinions here. But whenever I see someone being clever, I see them get this wrong. I tried using Rust’s Criterion.rs once to benchmark something, and it gets this completely wrong: It will tell you confidence intervals for your benchmark, and then you rerun the same benchmark and the new confidence interval doesn’t even overlap with the old one. What are you supposed to do with that result? When it gets the answer this wrong, you know it’s not just a bug in the code, it’s a bug in the assumptions about how benchmarking works.

So sure, I show error bars because that tells you at a glance if the benchmark is producing exact results or not, but I also allow you to show the data as points, which shows every single time that a benchmark was run. I click that checkbox often. I think Criterion.rs should also just show all the numbers it measured, and not do extra math on it. But I understand that it’s difficult to show lots of numbers if you don’t have a GUI.

I also plot the median result by default. I know Alexandrescu’s opinion that you should choose the min, because noise is only additive. Noise can never make your benchmark faster, it can only make it slower. But I have lots of random numbers in my benchmarks. E.g. I generate a new random array to sort every time that I run my sorting benchmark. If I take the min across all runs, who’s to say that I won’t just end up with a really lucky run where the input was presorted? (especially when measuring sorting of short arrays) Could you avoid that with a deterministic seed? Sure, but now you have other problems like a branch predictor that memorizes too much. You don’t want to be in a world where you have to trick the compiler and trick the CPU. Better to generate new random numbers each time. You want the CPU to actually do the work. (my equivalent of DoNotOptimize also does actual work that can’t be eliminated) You can always subtract out that cost as a baseline.
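
To make the min-versus-median distinction concrete, here is a sketch of the two summaries over repeated measurements (illustrative only, and assuming at least one sample):

#include <algorithm>
#include <vector>

// min assumes noise is purely additive and can only ever slow you down
double summarize_min(std::vector<double> samples)
{
    return *std::min_element(samples.begin(), samples.end());
}

// median tolerates benchmarks whose inputs are themselves random, so one lucky
// input can't become the reported number (for an even count this returns the
// upper middle element, which is close enough for plotting)
double summarize_median(std::vector<double> samples)
{
    size_t middle = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + middle, samples.end());
    return samples[middle];
}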

My justification here is similar to the ASLR approach above: I’d rather have a benchmark that’s a bit more noisy but where I can trust that I get a realistic distribution of results. If one algorithm is reliably 0.5 nanoseconds faster than another on that benchmark, I can see that after enough runs and I can trust that. If the benchmark ends up too noisy to see that, I can always optimize the code until it’s less noisy. (this just means optimizing all the code that you’re not trying to measure. E.g. use Lemire’s nearly branchless range reduction to make “random numbers in a range” faster. Then the benchmark will have less noise)

3.7 Other things I tried

I spent a lot of time on trying to run benchmarks in different orders. Like sometimes also running benchmarks that are not currently visualized, to eventually fill out the whole database. Or doing a Halton-like sequence to get the rough shape of the overall line before filling in the gaps. (if your line goes from 0 to 1, then you start off running the benchmark at 0, then 1, then 0.5, then 0.25, then 0.75, then 0.125, 0.625… always filling in the largest gap between the previous numbers) I ended up removing that for the open source version because even though it used to be the default, I always ended up turning it off. Too complicated. Better to just run the benchmarks in sequence.
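
For reference, the gap-filling order described above can be generated like this (a sketch; ties between equally wide gaps can be broken in any order):

#include <algorithm>
#include <utility>
#include <vector>

// start with both endpoints, then always measure the midpoint
// of the widest remaining gap
std::vector<double> gap_filling_order(size_t num_points)
{
    std::vector<double> order = { 0.0, 1.0 };
    std::vector<std::pair<double, double>> gaps = { { 0.0, 1.0 } };
    while (order.size() < num_points)
    {
        auto widest = std::max_element(gaps.begin(), gaps.end(),
            [](const std::pair<double, double> & a, const std::pair<double, double> & b)
            { return a.second - a.first < b.second - b.first; });
        double low = widest->first;
        double high = widest->second;
        double mid = (low + high) / 2.0;
        order.push_back(mid);
        gaps.erase(widest);
        gaps.push_back({ low, mid });
        gaps.push_back({ mid, high });
    }
    return order;
}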

I also have a "Normalize for Memory" checkbox, and I barely remember what that feature even did. I think it was something to do with hash tables and trying to be fair to hash tables with different amounts of memory overhead. (it’s easy to have a fast hash table if you set the max_load_factor to 0.25) Or something like that. I think this feature still works.

4. Code and License

The code is available here under the MIT license.

I ship the benchmark library with two examples:

  1. Memory benchmarks which you can use to measure how fast the memory is on your machine
  2. Heap benchmarks

The second one has a fast implementation of pairing heaps, not that those get used by many people. (I had one time where I absolutely needed a node-based heap, and I noticed that you can safely eliminate one of the two passes in merge-pairs) It also has a fast quaternary heap, which might just be the fastest heap out there. I haven’t done the work to confirm that, I was going to do that for a future blog post, but who knows when I’ll get around to that.

The way to use the library is to add another executable, following those examples. This means it’s more of a "framework" than a "library": you’re expected to plug your code into this framework, not plug this library into your code. I know this isn’t good, but I think copy+paste is fine for this. Just make a copy of the benchmarking library, make a copy of your code, and create a new executable. Then do your experiments in that and when you’re done, copy the results back. There is no good workflow yet since it’s a library written by me for me. I think if you have something self-contained like a hash table or a sorting algorithm, you can easily follow the examples.
