When Compiler Optimizations Hurt Performance


Recently I have been working on benchmarks comparing techniques for calculating UTF-8 sequence lengths.

The method I described in Decoding UTF-8. Part IV: Determining Sequence Length - Counting Leading Bits uses hardware-assisted counting of leading zero bits:

#include <bit>

int utf8_sequence_length(unsigned char lead_byte) {
    switch (std::countl_one(lead_byte)) {
        case 0: return 1;
        case 2: return 2;
        case 3: return 3;
        case 4: return 4;
        default: return 0; // invalid lead
    }
}
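
To give a sense of how the throughput numbers below were obtained, here is a minimal harness in the spirit of my benchmark. The buffer contents, size, and timing details are illustrative only, not the actual benchmark code:

#include <bit>
#include <chrono>
#include <cstdio>
#include <string>

int utf8_sequence_length(unsigned char lead_byte); // definition shown above

// Walk the buffer lead byte by lead byte and count sequences, so the
// call cannot be optimized away.
static std::size_t scan(const std::string& text) {
    std::size_t i = 0, sequences = 0;
    while (i < text.size()) {
        int len = utf8_sequence_length(static_cast<unsigned char>(text[i]));
        i += (len > 0) ? static_cast<std::size_t>(len) : 1; // skip invalid bytes
        ++sequences;
    }
    return sequences;
}

int main() {
    std::string text(64 * 1024 * 1024, 'a'); // pure ASCII input, as in the benchmark
    auto start = std::chrono::steady_clock::now();
    std::size_t n = scan(text);
    double elapsed = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    std::printf("%zu sequences, %.0f MB/s\n", n, text.size() / (1024.0 * 1024.0) / elapsed);
}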

I was pretty disappointed with the performance of this function. It processed text at between 438 MB/s and 462 MB/s, which was far less than the “naive” method described in Decoding UTF-8. Part II: Determining Sequence Length - a Straightforward Approach.
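
I will not reproduce the Part II code here, but a straightforward branching classification looks roughly like the sketch below; it is written in the spirit of that post, not copied from it:

// Classify the lead byte by range with plain comparisons.
int utf8_sequence_length_naive(unsigned char lead_byte) {
    if (lead_byte < 0x80) return 1; // 0xxxxxxx: ASCII
    if (lead_byte < 0xC0) return 0; // 10xxxxxx: continuation byte, invalid as lead
    if (lead_byte < 0xE0) return 2; // 110xxxxx
    if (lead_byte < 0xF0) return 3; // 1110xxxx
    if (lead_byte < 0xF8) return 4; // 11110xxx
    return 0;                       // 11111xxx: invalid
}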

As we saw in the Part IV post, the compiler (clang++ 18.1.3, aarch64) emits a small lookup table to avoid branching:

        adrp    x9, .Lswitch.table.utf8_sequence_length(unsigned char)
        add     x9, x9, :lo12:.Lswitch.table.utf8_sequence_length(unsigned char)
        ldr     w0, [x9, w8, uxtw #2]
        ...
.Lswitch.table.utf8_sequence_length(unsigned char):
        .word   1
        .word   0
        .word   2
        .word   3
        .word   4

The lookup table generated by the compiler is a degenerate case of a jump table, which is a common optimization for switch-case statements. The “naive” code from Part II uses branching instructions instead and processes over 2000 MB/s. Obviously, the branch predictor did its magic for the case of pure ASCII text, but the performance gap still surprised me.
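
In C++ terms, the compiler effectively rewrote the switch into a bounds check plus a load from a small constant array. The sketch below is my reconstruction of that transformation, not actual compiler output:

#include <bit>

// Hand-written equivalent of the generated code: index a constant table
// with the leading-ones count and fall back to 0 for out-of-range values.
int utf8_sequence_length_table(unsigned char lead_byte) {
    static const int table[5] = {1, 0, 2, 3, 4}; // indexed by std::countl_one result
    unsigned count = static_cast<unsigned>(std::countl_one(lead_byte));
    return count <= 4 ? table[count] : 0;
}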

Playing with the Linux perf tool on a WSL Linux installation turned out to be a pretty frustrating experience. It did point me to a very high number of cycles stalled in the back-end, which often indicates that the CPU is waiting for memory, but I couldn’t get perf annotate to work.
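
For reference, the kind of invocation I was poking at looks roughly like this; the binary name is a placeholder, and which events are actually available depends on the kernel and CPU:

perf stat -e cycles,instructions,stalled-cycles-backend ./utf8_bench
perf record ./utf8_bench
perf annotate --stdio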

In any case, what I had was enough to let me try disabling jump tables; with

clang++ -O2 -fno-jump-tables --std=c++20

the switch-case part of the function compiles to:

        cmp     w0, #2
        b.gt    .LBB0_4
        cbz     w0, .LBB0_6
        cmp     w0, #2
        b.eq    .LBB0_5
.LBB0_3:
        mov     w0, wzr
        ret
.LBB0_4:
        cmp     w0, #3
        ccmp    w0, #4, #4, ne
        b.ne    .LBB0_3
.LBB0_5:
        ret
.LBB0_6:
        mov     w0, #1
        ret

Lo and behold, the performance became comparable to the “naive” function from Part II.

I repeated the experiment using GNU g++ for AArch64. In that case, -fno-jump-tables had no effect because g++ never emitted a lookup table.

Not surprisingly, it turned out I was not the first to make this observation. For instance, there is an excellent article by Julian Squires from 2017 on the same topic, but focusing on the x86-64 platform: Are Jump Tables Always Fastest?
