A couple of years ago I wrote about random floating point numbers. In that article I was mainly concerned about how neat the code is, and I didn’t pay attention to its performance.
Recently, a comment from Oliver Hunt and a blog post from Alisa Sireneva prompted me to wonder if I made an unwarranted assumption. So I wrote a little benchmark, which you can find in pcg-dxsm.git.
As a brief recap, there are two basic ways to convert a random integer to a floating point number between 0.0 and 1.0:
- Use bit fiddling to construct an integer whose format matches a float between 1.0 and 2.0; this is the same span as the result but with a simpler exponent. Bitcast the integer to a float and subtract 1.0 to get the result.
- Shift the integer down to the same range as the mantissa, convert to float, then multiply by a scaling factor that reduces it to the desired range. This produces one more bit of randomness than the bithacking conversion.
(There are other less basic ways.)
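For concreteness, here's a rough sketch of the two conversions in C. This is the standard idiom rather than the exact code in my benchmark; the hex constants are the usual IEEE 754 bit patterns for 1.0:

```c
#include <stdint.h>
#include <string.h>

/* Bithacking: OR 23 random bits into the mantissa of 1.0f, bitcast,
 * then subtract 1.0 to map [1.0, 2.0) down to [0.0, 1.0). */
static float bithack32(uint32_t u) {
    uint32_t bits = 0x3F800000u | (u >> 9);    /* exponent of 1.0f + top 23 bits of u */
    float f;
    memcpy(&f, &bits, sizeof(f));              /* bitcast without aliasing trouble */
    return f - 1.0f;
}

/* Multiplying: keep the top 24 bits (exactly representable in a float),
 * convert, and scale into [0.0, 1.0). One more bit than the bithack. */
static float multiply32(uint32_t u) {
    return (float)(u >> 8) * 0x1.0p-24f;
}

/* The 64-bit versions are analogous: 0x3FF0000000000000 is 1.0,
 * the double mantissa holds 52 bits, and the multiply keeps 53. */
static double bithack64(uint64_t u) {
    uint64_t bits = 0x3FF0000000000000u | (u >> 12);
    double d;
    memcpy(&d, &bits, sizeof(d));
    return d - 1.0;
}

static double multiply64(uint64_t u) {
    return (double)(u >> 11) * 0x1.0p-53;
}
```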
My benchmark has 2 × 2 × 2 = 8 tests:
- bithacking vs multiplying
- 32 bit vs 64 bit
- sequential integers vs random integers
Each operation is isolated from the benchmark loop by putting it in a separate translation unit (to prevent the compiler from inlining) and there is a fence instruction (ISB SY on ARM, MFENCE on AMD) in the loop to stop the CPU from overlapping successive iterations.
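In outline, each timing loop looks something like the following sketch; the names are illustrative rather than the exact ones in pcg-dxsm.git:

```c
#include <stddef.h>
#include <stdint.h>

/* Defined in a separate translation unit so the compiler can't inline it
 * or hoist the conversion out of the loop. */
extern double convert(uint64_t u);

volatile double sink; /* keep each result live */

void bench(const uint64_t *inputs, size_t count) {
    for (size_t i = 0; i < count; i++) {
        sink = convert(inputs[i]);
        /* serialize so the CPU can't overlap successive iterations */
#if defined(__aarch64__)
        __asm__ volatile("isb sy" ::: "memory");
#elif defined(__x86_64__)
        __asm__ volatile("mfence" ::: "memory");
#endif
    }
}
```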
I ran the benchmark on my Apple M1 Pro and my AMD Ryzen 7950X.
In the table below, the leftmost column is the number of random bits. The top half measures sequential numbers, the bottom half is random numbers. The times are nanoseconds per operation, which includes the overheads of the benchmark loop and function call.
| bits | arm   | amd   |
|------|-------|-------|
| 23   | 12.15 | 11.22 |
| 24   | 13.37 | 11.21 |
| 52   | 12.11 | 11.02 |
| 53   | 13.38 | 11.20 |
| 23   | 14.75 | 12.62 |
| 24   | 15.85 | 12.81 |
| 52   | 16.78 | 14.23 |
| 53   | 18.02 | 14.41 |

The times vary a little from run to run but the difference in speed of the various loops is reasonably consistent.
I think my conclusion is that the bithacking conversion is about 1ns faster than the multiply conversion on my ARM box. There's only a subnanosecond difference on my AMD box, which might indicate that the conversion takes different amounts of time depending on the value? Dunno.