Rust: Optimizing Integer to String Conversions

5 hours ago 1

Doing integer to string conversions is very common when doing development (logging, or even showing frames per second (FPS) in video games...), not teaching you anything new here. As such, performance in this area is very important.

Explanations of the initial situation

If a type implements the Display trait, it also gets the ToString trait implementation "for free" because of this blanket implementation:

Runimpl<T> ToString for T where T: Display + ?Sized {}

Internally, ToString is implemented as such:

We create a String.
We create a Formatter and we give it this new String as buffer.
We call Display::fmt and we pass T (as self) and the Formatter as arguments.
The buffer (our String) was now filled, we can return it.

So based on this, it's difficult to see where the issue could be. As usual, devil is in the details.

Display implementation

For integers in particular, the Display implementation needs to check a few things to correctly generate a string representation:

Is it negative? If so we need to prepend with the - sign.
Was the + fmt sign used (like {:+})? If so and if the number is not negative, we prepend with a + sign.
Is there a minimum width used (like {:02})?
Is there...

There are quite a lot of checks. You can see the code here if you want the technical details.

Anyway, all these checks have a cost, so why not skip them? Well, that's exactly what was done.

First improvement was #136264. It specialized the implementation of ToString for all integers except i128/u128. The results were (interesting result for u8/i8):

name 1.87.0-nightly (3ea711f17 2025-03-09) With #136264 diff

bench_i16	32.06 ns/iter (+/- 0.12)	17.62 ns/iter (+/- 0.03)	-45%
bench_i32	31.61 ns/iter (+/- 0.04)	15.10 ns/iter (+/- 0.06)	-52%
bench_i64	31.71 ns/iter (+/- 0.07)	15.02 ns/iter (+/- 0.20)	-52%
bench_i8	13.21 ns/iter (+/- 0.14)	14.93 ns/iter (+/- 0.16)	+13%
bench_u16	31.20 ns/iter (+/- 0.06)	16.14 ns/iter (+/- 0.11)	-48%
bench_u32	33.27 ns/iter (+/- 0.05)	16.18 ns/iter (+/- 0.10)	-51%
bench_u64	31.44 ns/iter (+/- 0.06)	16.62 ns/iter (+/- 0.21)	-47%
bench_u8	10.57 ns/iter (+/- 0.30)	13.00 ns/iter (+/- 0.43)	+22%

Second improvement was #142294 for i128/u128. It's exactly the same idea as the first improvement, we specialize ToString implementation. Here's the result:

bench name 1.89 nightly with #142294 diff

bench_i128	29.25 ns/iter (+/- 0.66)	17.52 ns/iter (+/- 0.7)	-40.1%
bench_u128	34.06 ns/iter (+/- 0.21)	16.1 ns/iter (+/- 0.6)	-52.7%

Improving the conversion itself (not 128 bits)

Specializing the ToString implementations was an easy win, but now it's time to actually take a look at the conversion itself for integers smaller than 128 bits.

Let's take a look at #128204.

First thing to note: we have a buffer [MaybeUninit::<u8>::uninit(); SIZE] where SIZE is the maximum number of digits which can represent the integer. This buffer is filled starting by the end and then we convert the content to a &str starting at the current offset.

So if you have a u8 of value 9, SIZE is 3 (because the maximum value is 256, which is 3 digits), the buffer will look like this [<uninit>, <uninit>, b'9'], so the offset is 2.

Initially, SIZE was always the number of digits of the biggest 128 bits integer (so 40), which is pretty bad, especially for the smallest integers which only need a size of 3. So the first obvious things to do was to make the array size to depend on the integer maximum number of digits. Luckily for us, there is a (const!) method for that: ilog:

Runconst SIZE: usize = usize::MAX.ilog(10) as usize + 1; let mut buf = [MaybeUninit::<u8>::uninit(); SIZE];

You can replace usize with any integer and you will always have the maximum of digits for this integer.

So that's the first improvement, you can see this change here.

The second improvement was to remove a loop at compile time for u8/i8 since it only works if the value is bigger or equal to 10.000.

You can see this change here.

Unsurprisingly, very small performance improvements there:

bench name current std with #128204 diff

bench_std_fmt::bench_i16_0	16.45 ns/iter (+/- 0.25)	16.50 ns/iter (+/- 0.15)	0.3%
bench_std_fmt::bench_i16_max	17.83 ns/iter (+/- 0.66)	17.58 ns/iter (+/- 0.10)	-1.4%
bench_std_fmt::bench_i16_min	20.97 ns/iter (+/- 0.49)	20.50 ns/iter (+/- 0.28)	-2.3%
bench_std_fmt::bench_i32_0	16.63 ns/iter (+/- 0.06)	16.62 ns/iter (+/- 0.07)	-0.1%
bench_std_fmt::bench_i32_max	19.79 ns/iter (+/- 0.43)	19.55 ns/iter (+/- 0.14)	-1.2%
bench_std_fmt::bench_i32_min	22.97 ns/iter (+/- 0.50)	22.08 ns/iter (+/- 0.08)	-4.0%
bench_std_fmt::bench_i64_0	16.63 ns/iter (+/- 0.39)	16.69 ns/iter (+/- 0.44)	0.4%
bench_std_fmt::bench_i64_half	19.60 ns/iter (+/- 0.05)	19.10 ns/iter (+/- 0.05)	-2.6%
bench_std_fmt::bench_i64_max	25.22 ns/iter (+/- 0.34)	24.43 ns/iter (+/- 0.02)	-3.2%
bench_std_fmt::bench_i8_0	16.27 ns/iter (+/- 0.32)	15.80 ns/iter (+/- 0.17)	-3.0%
bench_std_fmt::bench_i8_max	16.71 ns/iter (+/- 0.09)	16.25 ns/iter (+/- 0.01)	-2.8%
bench_std_fmt::bench_i8_min	20.07 ns/iter (+/- 0.22)	19.80 ns/iter (+/- 0.30)	-1.4%
bench_std_fmt::bench_u128_0	21.37 ns/iter (+/- 0.24)	21.35 ns/iter (+/- 0.35)	-0.1%
bench_std_fmt::bench_u128_max	48.13 ns/iter (+/- 0.20)	48.78 ns/iter (+/- 0.29)	1.3%
bench_std_fmt::bench_u16_0	16.48 ns/iter (+/- 0.46)	16.03 ns/iter (+/- 0.39)	-2.8%
bench_std_fmt::bench_u16_max	17.31 ns/iter (+/- 0.32)	17.41 ns/iter (+/- 0.32)	0.6%
bench_std_fmt::bench_u16_min	16.40 ns/iter (+/- 0.45)	16.02 ns/iter (+/- 0.39)	-2.4%
bench_std_fmt::bench_u32_0	16.17 ns/iter (+/- 0.04)	16.29 ns/iter (+/- 0.16)	0.7%
bench_std_fmt::bench_u32_max	19.00 ns/iter (+/- 0.10)	19.16 ns/iter (+/- 0.28)	0.8%
bench_std_fmt::bench_u32_min	16.16 ns/iter (+/- 0.09)	16.28 ns/iter (+/- 0.11)	0.7%
bench_std_fmt::bench_u64_0	16.22 ns/iter (+/- 0.22)	16.14 ns/iter (+/- 0.18)	-0.5%
bench_std_fmt::bench_u64_half	19.25 ns/iter (+/- 0.07)	18.95 ns/iter (+/- 0.05)	-1.6%
bench_std_fmt::bench_u64_max	24.31 ns/iter (+/- 0.08)	24.18 ns/iter (+/- 0.08)	-0.5%
bench_std_fmt::bench_u8_0	15.76 ns/iter (+/- 0.08)	15.66 ns/iter (+/- 0.08)	-0.6%
bench_std_fmt::bench_u8_max	16.53 ns/iter (+/- 0.03)	16.29 ns/iter (+/- 0.02)	-1.5%
bench_std_fmt::bench_u8_min	15.77 ns/iter (+/- 0.06)	15.67 ns/iter (+/- 0.02)	-0.6%

It was followed by a small cleanup which simplified the implementations in #133247. In short: instead of having the same implementation for signed and unsigned integers, we convert signed into unsigned and then call the unsigned formatting.

Lastly there was #135265 which made the following improvements (quoting the pull request author):

The speed gain is because (1) the loop decision no longer depends on the digit calculation, (2) it produces fewer instructions, and (3) it no longer needs a check for special case zero, which has a "significant" leading zero.

So in short:

A loop is skipped for 8 bits integers.
Fewer assembly instructions generated for the same result.
Integer to string conversion doesn't need to special-case 0 anymore.

Sadly there is no benchmark numbers I can show off...

Improving the conversion itself (128 bits)

Let's start with #136594. It did the following improvements:

Batches per 16 instead of 19 digits.
Buffer access as array instead of unsafe pointer.

About the second one, before this pull request we used pointers like this:

Runlet buf_ptr = MaybeUninit::slice_as_mut_ptr(buf); *buf_ptr.add(curr) = nb + b'0';

Now we do like this:

Runbuf[curr].write(nb + b'0');

And here are the results:

bench name current std with #128204 diff

fmt::write_u128_max	51.44 ns/iter (+/- 0.44)	46.65 ns/iter (+/- 0.39)	-11.8%
fmt::write_u128_min	30.63 ns/iter (+/- 0.92)	26.77 ns/iter (+/- 0.58)	-15.6%

There was also a small cleanup to replace the manually written number of digits of u128 (39) with .ilog(10) done in #142114.

Adding extra tests

Working on performance is nice, but having tests to ensure we didn't break the world is always a good thing to have. We added new tests in #138546 to ensure that the formatting was working as expected.

What's next?

Before all these changes, if you wanted to have the fastest integer to string conversion possible in Rust, you needed to use the itoa crate. Now, the main difference is that it uses a buffer you need to provide everytime to store the generated string, which allows to skip the buffer allocations.

Such an API is planned to be added to the core library. A pull request (#142098) is already open to implement it. The new code will look like this:

Runuse core::fmt::NumBuffer; let nb = 0u32; let mut buf = NumBuffer::new(); let string_repr: &str = n.format_into(&mut buf);

You can reuse buf as many times as you want as long as it's for the same integer size.

Once all these pull requests will be merged, I think we'll be able to have some very nice performance for integer to string conversions in Rust.

What remains to be done

The Display trait implementation for integers is still slow because of all the checks (as a reminder, you can see the code here). Improving this part seems like the logical next step. Hopefully someone will get to that at some point. :)