Decoding UTF-8

2 weeks ago 2

In the first part of this series, we saw how to manually decode a UTF-8 sequence. In part two, we started looking into determining sequence length. Part three was about avoiding branches when calculating sequence length with a lookup table.

Now, we will see how we can use some help from the hardware to accomplish the same task: based on the lead byte, calculate the UTF-8 sequence length. I recommend reading my recent post Counting Leading Zeros in a Byte to get familiar with the standard C and C++ functions for counting leading zeroes in a byte and the assembly instructions they use.

We will use a C++ 20 std::countl_one() function from <bit> header to get the number of consecutive set bits, starting from the most significant bit. As we saw in the part one:

  • Having 0 set bits on the left side means the most significant bit is zero. If a lead byte starts with zero, it means it is the only byte in the sequence - the length is 1.

  • 1 set bit means the byte starts with the 10 sequence - these are continuation bytes, so we report an invalid lead

  • 2 set bits mean the lead byte starts with 110; the sequence length is 2.

  • 3 set bits mean the lead byte starts with 1110; the sequence length is 3.

  • 4 set bits mean the lead byte starts with 11110; the sequence length is 4

  • Anything else indicates an invalid lead byte.

Without further ado, here is the code:

#include <bit> int utf8_sequence_length(unsigned char lead_byte) { switch (std::countl_one(lead_byte)) { case 0: return 1; case 2: return 2; case 3: return 3; case 4: return 4; default: return 0; // invalid lead } }

Note that we do not catch all invalid values of the lead byte here. It is OK, as long as we do proper validation at some point.

Let’s have a look at the generated assembly (clang 18.1, aarch64, -O2 --std=c++20):

utf8_sequence_length(unsigned char): mov w8, #255 bics w9, w8, w0 clz w9, w9 bics wzr, w8, w0 mov w8, #8 sub w9, w9, #24 csel w8, w8, w9, eq cmp w8, #4 b.hi .LBB0_2 adrp x9, .Lswitch.table.utf8_sequence_length(unsigned char) add x9, x9, :lo12:.Lswitch.table.utf8_sequence_length(unsigned char) ldr w0, [x9, w8, uxtw #2] ret .LBB0_2: mov w0, wzr ret .Lswitch.table.utf8_sequence_length(unsigned char): .word 1 .word 0 .word 2 .word 3 .word 4

Like x86-64, Aarch64 instruction set does not have an instruction for counting leading ones - therefore we need to invert the bits and count zeros. That’s what CLZ instruction does. The switch-case statement generated a small lookup table and there is no branching except to check for the invalid value (something that a branch predictor will be able to deal with).

In the next post in the series, I plan to discuss some benchmarks that compare the three methods of calculating a UTF-8 sequence length.

Read Entire Article