Or: Hey, keeping the CPU busy for a given amount of time should be easy, right?
Welcome back to my blog. Last week, I showed you how to profile your Cloudfoundry application, and the week before, how I made the CPU-time profiler a tiny bit better by removing redundant synchronization. This week’s blog post will be closer to the latter, trying to properly waste CPU.
As a short backstory, my profiler needed a test to check that the queue size of the sampler really increased dynamically (see Java 25’s new CPU-Time Profiler: Queue Sizing (3)), so I needed a way to let a thread spend a pre-defined number of seconds running natively on the CPU. You can find the test case in its hopefully final form here, but be aware that writing such cases is more complicated than it looks.
So here we are: We need to properly waste CPU time, preferably in user-land, for a fixed amount of time. The problem: There are only a few scant resources on this online, so I decided to create my own. I'll show you seven different ways to implement a simple void my_wait(int seconds); method that works on both macOS and Linux, and you'll learn far more about this topic than you ever wanted to. All the code is MIT licensed; you can find it on GitHub in my waste-cpu-experiments repository, alongside some profiling results.
As another tangent: Apparently, my Java 25’s new CPU-Time Profiler (1) blog post blew up on Hacker News. Fun times.
Basic Implementation
Let us start with the most basic implementation: Just a loop that checks the time continuously.
```c
void my_wait(int seconds) {
    clock_t end_time = clock() + seconds * CLOCKS_PER_SEC;
    while (clock() < end_time) {
        // cpu wasting loop
    }
}
```

No surprise in the assembler, it's literally just clock calls and a jump:
We assume here and in the following that the code is compiled with full optimisations enabled (-O3), although the compilers probably produce different results for -O2.
The basic code is the solution that I initially used. Running it one hundred times for 10 seconds each (my_wait(10)) on a calm x86 Linux machine showed promising results: the program ran for 10.009 seconds on average (±0.01% standard deviation), which corresponds to an accuracy of 99.907%. Not that accuracy matters much, but it's nice to know that the method does what it says on the tin.
The only problem is that the method spends 85% of its time in the kernel and not in user-land. The main culprit is, of course, the sheer number of clock() calls, each of which resulted in a clock_gettime system call. To be precise, the small program made 10,628,885 (± 0.37%) of these system calls in just 10 seconds. Interestingly, this lets us determine that a clock() call takes roughly one microsecond, which is pretty fast.
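If you want a quick sanity check of that number yourself, a tiny sketch like the following (my own, not part of the linked repository) times a batch of clock() calls with clock() itself:

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    const long calls = 1000000;
    clock_t start = clock();
    for (long i = 0; i < calls; i++) {
        clock();  // each of these typically ends up in a clock_gettime system call
    }
    clock_t end = clock();
    // clock() measures consumed CPU time, which for this tight loop is
    // essentially the same as the elapsed time.
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    printf("~%.2f microseconds per clock() call\n", seconds * 1e6 / calls);
    return 0;
}
```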
But we can do better:
For-Loop
The main idea for the following four implementations is to use an inner loop that wastes CPU-time without calling clock(). Of course, we could just add a for-loop directly:
```c
void my_wait(int seconds) {
    clock_t end_time = clock() + seconds * CLOCKS_PER_SEC;
    while (clock() < end_time) {
        for (int i = 0; i < 100000; i++);
    }
}
```

But sadly/gladly, modern compilers are intelligent enough to see through this: the loop doesn't have any side effects, so it is optimised out:
So, this has the exact same performance characteristics as the basic implementation. Of course, when we turn off optimisations (-O0), the inner loop is emitted correctly in the assembly:
The program then spent only 0.035% (± 16%) of its runtime outside user-land, with only around 55 thousand system calls.
But how can we force the compiler not to optimise the inner for-loop without restricting the optimisation of the whole program? We have three options that I’ll show you in the following:
- Method attributes
- Volatile memory
- Inline assembly
These all produce similar assembly output and therefore have similar performance characteristics.
Method Attributes
Every compiler (I checked clang and GCC) has a method attribute that prevents it from optimising a method. We can use this to our advantage. The main issue is that these attributes are highly compiler-specific. But luckily, someone on StackOverflow has already asked How to change optimization level of one function? and Evan Nemerson answered:
- GCC has an optimize(X) function attribute
- Clang has optnone and minsize function attributes (use __has_attribute to test for support). Since I believe 3.5 it also has #pragma clang optimize on|off.
- Intel C/C++ compiler has #pragma intel optimization_level 0 which applies to the next function after the pragma
- MSVC has #pragma optimize, which applies to the first function after the pragma
- IBM XL has #pragma option_override(funcname, "opt(level,X)"). Note that 13.1.6 (at least) returns true for __has_attribute(optnone) but doesn’t actually support it.
- ARM has #pragma Onum, which can be coupled with #pragma push/pop
- ODS has #pragma opt X (funcname)
- Cray has #pragma _CRI [no]opt
- TI has #pragma FUNCTION_OPTIONS(func,"…") (C) and #pragma FUNCTION_OPTIONS("…") (C++)
- IAR has #pragma optimize=...
- Pelles C has #pragma optimize time/size/none
Adding all attributes for all compilers is cumbersome, so I restricted myself to the function attributes for GCC and clang:
```c
__attribute__((optnone, optimize(0))) void my_wait(int seconds) {
    clock_t end_time = clock() + seconds * CLOCKS_PER_SEC;
    while (clock() < end_time) {
        for (int i = 0; i < 1000000; i++);
    }
}
```

These attributes reliably turn off the optimisation, therefore leaving the inner for-loop as it is:
In my experiments, the code only spent 0.063% (± 39%) of its runtime in non-user-land. Of course, you can reduce this number even further by increasing the number of inner iterations, but at the cost of reduced accuracy.
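Compilers tend to warn about attributes they don't recognise (GCC about optnone, clang about optimize), so one could hide them behind a small per-compiler macro. This is just a sketch of mine, not something from the waste-cpu-experiments repository:

```c
#include <time.h>

// Each compiler only sees the attribute it understands, which should avoid
// the "unknown attribute ignored" warnings. Note that __clang__ has to be
// checked first, because clang also defines __GNUC__.
#if defined(__clang__)
#  define NO_OPTIMIZE __attribute__((optnone))
#elif defined(__GNUC__)
#  define NO_OPTIMIZE __attribute__((optimize(0)))
#else
#  define NO_OPTIMIZE /* other compilers: no guarantee the loop survives */
#endif

NO_OPTIMIZE void my_wait(int seconds) {
    clock_t end_time = clock() + seconds * CLOCKS_PER_SEC;
    while (clock() < end_time) {
        for (int i = 0; i < 1000000; i++);
    }
}
```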
Of course, adding compiler-specific attributes has the main drawback of only working with specific compilers. Isn’t there another way?
Volatile Memory
Yes, there is. We can just add an optimisation barrier in the form of accesses to a volatile variable:
```c
void my_wait(int seconds) {
    clock_t end_time = clock() + seconds * CLOCKS_PER_SEC;
    while (clock() < end_time) {
        for (volatile int i = 0; i < 100000; i++);
    }
}
```

This volatile int i prevents the compiler from optimising away any accesses to our for-loop variable:
The assembly looks slightly different than before, because we allowed the compiler to optimise the rest of the function. The performance characteristics are very similar to the previous code's, so I'll skip them.
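A closely related variant (my own sketch, not one of the measured implementations) keeps the loop counter as a normal int and instead writes it into a volatile sink. Each write is a side effect the compiler has to preserve, so the loop survives as well:

```c
#include <time.h>

void my_wait(int seconds) {
    clock_t end_time = clock() + seconds * CLOCKS_PER_SEC;
    volatile int sink;  // writes to sink are side effects the compiler must keep
    while (clock() < end_time) {
        for (int i = 0; i < 100000; i++) {
            sink = i;
        }
    }
}
```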
I'll instead show you another ingenious solution, which probably relies on compiler-specific behaviour:
Inline Assembly
Inline assembly lets you write assembly instructions directly in your C code. It is used, for example, to access hardware-specific registers or to call functions that expect a different calling convention. The main problem with inline assembly is that the compiler's optimisation passes don't have any knowledge of the instructions. We can use this to our advantage:

```c
void my_wait(int seconds) {
    clock_t end_time = clock() + seconds * CLOCKS_PER_SEC;
    while (clock() < end_time) {
        for (int i = 0; i < 1000000; i++) {
            __asm__ ("");
        }
    }
}
```

Apparently, this empty asm statement is enough to trick compilers into not optimising the loop away. This behaviour could change in the future, but for now it's an interesting solution and results in the tightest assembly loop of all three non-optimisation solutions:
As before, the performance characteristics are similar.
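A slightly more explicit variant of this trick, in the style of benchmarking libraries (again my own sketch, not one of the measured implementations), passes the loop variable into the asm statement as an input operand, forcing the compiler to actually materialise the value each iteration:

```c
#include <time.h>

// The empty template emits no instructions, but the input operand and the
// memory clobber tell the compiler that the value is "used", so neither the
// variable nor the loop can be optimised away.
static inline void keep(int value) {
    __asm__ __volatile__("" : : "g"(value) : "memory");
}

void my_wait(int seconds) {
    clock_t end_time = clock() + seconds * CLOCKS_PER_SEC;
    while (clock() < end_time) {
        for (int i = 0; i < 1000000; i++) {
            keep(i);
        }
    }
}
```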
Are we finished? No, I have two more solutions to my original problem that don’t involve preventing optimisation. Let’s start with the most obvious one:
Monotonic Clock
Our main problem with the original solution was the high number of system calls needed to get the current time. What if I told you that there is a clock source that requires only a minimal number of system calls? It's called the monotonic clock. The CPU itself increments this clock, and it isn't affected by system time adjustments. I actually wrote a blog post on this clock two and a half years ago: JFR Timestamps and System.nanoTime.
The solution looks as follows:
```c
long own_clock_nanos() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

void my_wait(int seconds) {
    long end_time = own_clock_nanos() + seconds * 1e9;
    while (own_clock_nanos() < end_time) {
        // cpu wasting loop
    }
}
```

Looking at the assembler here is pretty pointless, as all the magic is hidden in the standard library functions.
Looking at the list of system calls is more revealing: There are no time-related ones. This also means that the program spends only 0.025% (± 46%) of its runtime in non-userspace, probably the minimal amount possible.
It's a good solution, although I would probably still go for the volatile variant, as it is easier for a non-OS engineer to comprehend. A solution that is slightly confusing even for OS people is the next and last one:
Alarm Signal
After all the previous solutions, you might ask yourself: Couldn’t the operating system just signal me after a fixed number of seconds, so my user-land program can focus on the most important task of idly looping?
In comes the alarm standard library function:
NAME
alarm – set signal timer alarm
LIBRARY
Standard C Library (libc, -lc)
SYNOPSIS
#include <unistd.h>

unsigned alarm(unsigned seconds);

DESCRIPTION
This interface is made obsolete by setitimer(2).
The alarm() function sets a timer to deliver the signal SIGALRM to the calling process after the specified number of seconds. If an alarm has already been set with alarm() but has not been delivered, another call to alarm() will supersede the prior call. The request alarm(0) voids the current alarm and the signal SIGALRM will not be delivered.
Oh yes, one could use itimers for more fine-grained control, but alarm is enough for our use case. Now, let's look at the code:
```c
static jmp_buf jump_buffer;

void alarm_handler(int sig) {
    // Jump to the label when alarm fires
    longjmp(jump_buffer, 1);
}

void my_wait(int seconds) {
    // Set up the alarm handler
    signal(SIGALRM, alarm_handler);
    // Set the jump point
    if (setjmp(jump_buffer) == 0) {
        // First time through - set alarm and start infinite loop
        alarm(seconds);
        while (1);
    }
}
```

First, let us ignore all the jump-related code and focus on the rest.
Our while-loop is pretty simple, as we don’t need to check the time. We set a timer beforehand so that the operating system calls the alarm_handler after the specified number of seconds.
But how can we continue the execution of our program once we have caught the alarm? We use setjmp to store the current calling environment in the jump_buffer. Then, in the alarm_handler, we jump back to this location via longjmp. setjmp then returns a non-zero value (the 1 we passed to longjmp), so the if-branch is skipped and the my_wait method returns.
Jumping from the signal handler back to the function is risky because it leads to unmaintainable code. However, projects like the OpenJDK use these methods for jumping back from segfaults, too.
Let’s see what the compiler does with this code:
The while-loop became a loop in its purest form, and the setjmp/longjmp pair is directly translated into a few assembly instructions.
Of course, the main issue is that the code is really hard to read, and it depends on the rest of the program not using the same signal. I would not recommend using this solution. But it was still fun to learn about (initial idea via StackOverflow).
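As the man page hinted, setitimer(2) gives you the same one-shot behaviour with sub-second resolution. Here is a hedged sketch of how the alarm call could be swapped out; the my_wait_micros name is my own, and the handler/setjmp setup simply mirrors the code above:

```c
#include <setjmp.h>
#include <signal.h>
#include <sys/time.h>

static jmp_buf jump_buffer;

static void alarm_handler(int sig) {
    (void)sig;
    longjmp(jump_buffer, 1);
}

void my_wait_micros(long micros) {
    signal(SIGALRM, alarm_handler);
    if (setjmp(jump_buffer) == 0) {
        struct itimerval timer = {0};              // it_interval stays zero: one-shot timer
        timer.it_value.tv_sec = micros / 1000000;  // whole seconds
        timer.it_value.tv_usec = micros % 1000000; // remaining microseconds
        setitimer(ITIMER_REAL, &timer, NULL);      // delivers SIGALRM after the interval
        while (1);
    }
}
```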
Conclusion
Who would have thought that wasting CPU is such an intricate problem, with six solutions that all have their advantages and disadvantages? Ultimately, I chose the volatile-memory-based implementation because it works across compilers, with and without optimisations enabled.
Thanks for following me down this rabbit hole; I hope I didn't waste your time. See you again next week with an OpenJDK/JFR-related blog post.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.
Johannes Bechberger is a JVM developer working on profilers and their underlying technology in the SapMachine team at SAP. This includes improvements to async-profiler and its ecosystem, a website to view the different JFR event types, and improvements to the FirefoxProfiler, making it usable in the Java world. He started at SAP in 2022 after two years of research studies at the KIT in the field of Java security analyses. His work today is comprised of many open-source contributions and his blog, where he writes regularly on in-depth profiling and debugging topics, and of working on his JEP Candidate 435 to add a new profiling API to the OpenJDK.