Doing integer to string conversions is very common when doing development (logging, or even showing frames per second (FPS) in video games...), not teaching you anything new here. As such, performance in this area is very important.
Explanations of the initial situation
If a type implements the Display trait, it also gets the ToString trait implementation "for free" because of this blanket implementation:
Runimpl<T> ToString for T where T: Display + ?Sized {}Internally, ToString is implemented as such:
- We create a String.
- We create a Formatter and we give it this new String as buffer.
- We call Display::fmt and we pass T (as self) and the Formatter as arguments.
- The buffer (our String) was now filled, we can return it.
So based on this, it's difficult to see where the issue could be. As usual, devil is in the details.
Display implementation
For integers in particular, the Display implementation needs to check a few things to correctly generate a string representation:
- Is it negative? If so we need to prepend with the - sign.
- Was the + fmt sign used (like {:+})? If so and if the number is not negative, we prepend with a + sign.
- Is there a minimum width used (like {:02})?
- Is there...
There are quite a lot of checks. You can see the code here if you want the technical details.
Anyway, all these checks have a cost, so why not skip them? Well, that's exactly what was done.
First improvement was #136264. It specialized the implementation of ToString for all integers except i128/u128. The results were (interesting result for u8/i8):
bench_i16 | 32.06 ns/iter (+/- 0.12) | 17.62 ns/iter (+/- 0.03) | -45% |
bench_i32 | 31.61 ns/iter (+/- 0.04) | 15.10 ns/iter (+/- 0.06) | -52% |
bench_i64 | 31.71 ns/iter (+/- 0.07) | 15.02 ns/iter (+/- 0.20) | -52% |
bench_i8 | 13.21 ns/iter (+/- 0.14) | 14.93 ns/iter (+/- 0.16) | +13% |
bench_u16 | 31.20 ns/iter (+/- 0.06) | 16.14 ns/iter (+/- 0.11) | -48% |
bench_u32 | 33.27 ns/iter (+/- 0.05) | 16.18 ns/iter (+/- 0.10) | -51% |
bench_u64 | 31.44 ns/iter (+/- 0.06) | 16.62 ns/iter (+/- 0.21) | -47% |
bench_u8 | 10.57 ns/iter (+/- 0.30) | 13.00 ns/iter (+/- 0.43) | +22% |
Second improvement was #142294 for i128/u128. It's exactly the same idea as the first improvement, we specialize ToString implementation. Here's the result:
bench_i128 | 29.25 ns/iter (+/- 0.66) | 17.52 ns/iter (+/- 0.7) | -40.1% |
bench_u128 | 34.06 ns/iter (+/- 0.21) | 16.1 ns/iter (+/- 0.6) | -52.7% |
Improving the conversion itself (not 128 bits)
Specializing the ToString implementations was an easy win, but now it's time to actually take a look at the conversion itself for integers smaller than 128 bits.
Let's take a look at #128204.
First thing to note: we have a buffer [MaybeUninit::<u8>::uninit(); SIZE] where SIZE is the maximum number of digits which can represent the integer. This buffer is filled starting by the end and then we convert the content to a &str starting at the current offset.
So if you have a u8 of value 9, SIZE is 3 (because the maximum value is 256, which is 3 digits), the buffer will look like this [<uninit>, <uninit>, b'9'], so the offset is 2.
Initially, SIZE was always the number of digits of the biggest 128 bits integer (so 40), which is pretty bad, especially for the smallest integers which only need a size of 3. So the first obvious things to do was to make the array size to depend on the integer maximum number of digits. Luckily for us, there is a (const!) method for that: ilog:
Runconst SIZE: usize = usize::MAX.ilog(10) as usize + 1; let mut buf = [MaybeUninit::<u8>::uninit(); SIZE];You can replace usize with any integer and you will always have the maximum of digits for this integer.
So that's the first improvement, you can see this change here.
The second improvement was to remove a loop at compile time for u8/i8 since it only works if the value is bigger or equal to 10.000.
You can see this change here.
Unsurprisingly, very small performance improvements there:
bench_std_fmt::bench_i16_0 | 16.45 ns/iter (+/- 0.25) | 16.50 ns/iter (+/- 0.15) | 0.3% |
bench_std_fmt::bench_i16_max | 17.83 ns/iter (+/- 0.66) | 17.58 ns/iter (+/- 0.10) | -1.4% |
bench_std_fmt::bench_i16_min | 20.97 ns/iter (+/- 0.49) | 20.50 ns/iter (+/- 0.28) | -2.3% |
bench_std_fmt::bench_i32_0 | 16.63 ns/iter (+/- 0.06) | 16.62 ns/iter (+/- 0.07) | -0.1% |
bench_std_fmt::bench_i32_max | 19.79 ns/iter (+/- 0.43) | 19.55 ns/iter (+/- 0.14) | -1.2% |
bench_std_fmt::bench_i32_min | 22.97 ns/iter (+/- 0.50) | 22.08 ns/iter (+/- 0.08) | -4.0% |
bench_std_fmt::bench_i64_0 | 16.63 ns/iter (+/- 0.39) | 16.69 ns/iter (+/- 0.44) | 0.4% |
bench_std_fmt::bench_i64_half | 19.60 ns/iter (+/- 0.05) | 19.10 ns/iter (+/- 0.05) | -2.6% |
bench_std_fmt::bench_i64_max | 25.22 ns/iter (+/- 0.34) | 24.43 ns/iter (+/- 0.02) | -3.2% |
bench_std_fmt::bench_i8_0 | 16.27 ns/iter (+/- 0.32) | 15.80 ns/iter (+/- 0.17) | -3.0% |
bench_std_fmt::bench_i8_max | 16.71 ns/iter (+/- 0.09) | 16.25 ns/iter (+/- 0.01) | -2.8% |
bench_std_fmt::bench_i8_min | 20.07 ns/iter (+/- 0.22) | 19.80 ns/iter (+/- 0.30) | -1.4% |
bench_std_fmt::bench_u128_0 | 21.37 ns/iter (+/- 0.24) | 21.35 ns/iter (+/- 0.35) | -0.1% |
bench_std_fmt::bench_u128_max | 48.13 ns/iter (+/- 0.20) | 48.78 ns/iter (+/- 0.29) | 1.3% |
bench_std_fmt::bench_u16_0 | 16.48 ns/iter (+/- 0.46) | 16.03 ns/iter (+/- 0.39) | -2.8% |
bench_std_fmt::bench_u16_max | 17.31 ns/iter (+/- 0.32) | 17.41 ns/iter (+/- 0.32) | 0.6% |
bench_std_fmt::bench_u16_min | 16.40 ns/iter (+/- 0.45) | 16.02 ns/iter (+/- 0.39) | -2.4% |
bench_std_fmt::bench_u32_0 | 16.17 ns/iter (+/- 0.04) | 16.29 ns/iter (+/- 0.16) | 0.7% |
bench_std_fmt::bench_u32_max | 19.00 ns/iter (+/- 0.10) | 19.16 ns/iter (+/- 0.28) | 0.8% |
bench_std_fmt::bench_u32_min | 16.16 ns/iter (+/- 0.09) | 16.28 ns/iter (+/- 0.11) | 0.7% |
bench_std_fmt::bench_u64_0 | 16.22 ns/iter (+/- 0.22) | 16.14 ns/iter (+/- 0.18) | -0.5% |
bench_std_fmt::bench_u64_half | 19.25 ns/iter (+/- 0.07) | 18.95 ns/iter (+/- 0.05) | -1.6% |
bench_std_fmt::bench_u64_max | 24.31 ns/iter (+/- 0.08) | 24.18 ns/iter (+/- 0.08) | -0.5% |
bench_std_fmt::bench_u8_0 | 15.76 ns/iter (+/- 0.08) | 15.66 ns/iter (+/- 0.08) | -0.6% |
bench_std_fmt::bench_u8_max | 16.53 ns/iter (+/- 0.03) | 16.29 ns/iter (+/- 0.02) | -1.5% |
bench_std_fmt::bench_u8_min | 15.77 ns/iter (+/- 0.06) | 15.67 ns/iter (+/- 0.02) | -0.6% |
It was followed by a small cleanup which simplified the implementations in #133247. In short: instead of having the same implementation for signed and unsigned integers, we convert signed into unsigned and then call the unsigned formatting.
Lastly there was #135265 which made the following improvements (quoting the pull request author):
The speed gain is because (1) the loop decision no longer depends on the digit calculation, (2) it produces fewer instructions, and (3) it no longer needs a check for special case zero, which has a "significant" leading zero.
So in short:
- A loop is skipped for 8 bits integers.
- Fewer assembly instructions generated for the same result.
- Integer to string conversion doesn't need to special-case 0 anymore.
Sadly there is no benchmark numbers I can show off...
Improving the conversion itself (128 bits)
Let's start with #136594. It did the following improvements:
- Batches per 16 instead of 19 digits.
- Buffer access as array instead of unsafe pointer.
About the second one, before this pull request we used pointers like this:
Runlet buf_ptr = MaybeUninit::slice_as_mut_ptr(buf); *buf_ptr.add(curr) = nb + b'0';Now we do like this:
Runbuf[curr].write(nb + b'0');And here are the results:
fmt::write_u128_max | 51.44 ns/iter (+/- 0.44) | 46.65 ns/iter (+/- 0.39) | -11.8% |
fmt::write_u128_min | 30.63 ns/iter (+/- 0.92) | 26.77 ns/iter (+/- 0.58) | -15.6% |
There was also a small cleanup to replace the manually written number of digits of u128 (39) with .ilog(10) done in #142114.
Adding extra tests
Working on performance is nice, but having tests to ensure we didn't break the world is always a good thing to have. We added new tests in #138546 to ensure that the formatting was working as expected.
What's next?
Before all these changes, if you wanted to have the fastest integer to string conversion possible in Rust, you needed to use the itoa crate. Now, the main difference is that it uses a buffer you need to provide everytime to store the generated string, which allows to skip the buffer allocations.
Such an API is planned to be added to the core library. A pull request (#142098) is already open to implement it. The new code will look like this:
Runuse core::fmt::NumBuffer; let nb = 0u32; let mut buf = NumBuffer::new(); let string_repr: &str = n.format_into(&mut buf);You can reuse buf as many times as you want as long as it's for the same integer size.
Once all these pull requests will be merged, I think we'll be able to have some very nice performance for integer to string conversions in Rust.
What remains to be done
The Display trait implementation for integers is still slow because of all the checks (as a reminder, you can see the code here). Improving this part seems like the logical next step. Hopefully someone will get to that at some point. :)
Words of the end
All this was done thanks(?) to my relaxing cat: