I’ve spent a lot of time working with distributed systems, and if you’re building anything that needs time-sorted, unique identifiers, you’ve probably come across formats like ULID, UUIDv7, KSUID, Snowflake IDs, and others. But which one should you pick? Nope — not that kind of article.
You already know why you want to use them, so, that aside, you’d probably pick whichever one is efficient and fast, right? So let’s pick one and see how fast we can go.
In this article, I’ll walk through how I made Ferroid, a Rust library for generating time-ordered IDs, fast. How fast?
How about ~288M ULIDs / second on a single M1 core.
Aside from the image — I refuse to use AI in this article because I think it takes away from my own voice. So be prepared for grammatical errors, incomplete sentences and the like.
Spoiler: Generators that produce monotonic orderings are faster and more collision resistant but can leak info
To understand how you can extract that kind of performance, it helps to understand the bit-layout:
Bit Index:  127            80 79          0
            +----------------+-------------+
Field:      | timestamp (48) | random (80) |
            +----------------+-------------+
            |<-- MSB -- 128 bits -- LSB -->|
- 48 bits of millisecond-resolution timestamp
- 80 bits of randomness
A naive generator will simply generate a timestamp, get random bytes, and mash them together to form an ID. But we can do better! A clever reader may have guessed there are a few syscalls involved — getting the time and generating random bytes. If you can minimize them, you’ll unlock performance.
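To make the layout concrete, here is a minimal sketch of that naive approach (an illustration, not Ferroid’s API): pack 48 bits of milliseconds and 80 random bits into a u128.

use rand::{Rng, rng};
use std::time::{SystemTime, UNIX_EPOCH};

// Naive ULID: one clock read and one RNG call per ID.
fn naive_ulid() -> u128 {
    let millis = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_millis() as u128;
    // Keep only 80 random bits for the low part of the ID
    let random = rng().random::<u128>() & ((1u128 << 80) - 1);
    // 48-bit timestamp in the high bits, 80 random bits in the low bits
    ((millis & 0xFFFF_FFFF_FFFF) << 80) | random
}

Every call pays for both a clock read and a fresh random number; the rest of this article is about amortizing those two costs.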
The Clock
A naive clock implementation that returns the current milliseconds may look like this:
use std::time::{SystemTime, UNIX_EPOCH};

pub trait TimeSource<T> {
    fn current_millis(&self) -> T;
}

struct MonotonicClock;

impl TimeSource<u64> for MonotonicClock {
    fn current_millis(&self) -> u64 {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock before Unix epoch");
        now.as_millis() as u64
    }
}

impl TimeSource<u128> for MonotonicClock {
    fn current_millis(&self) -> u128 {
        <Self as TimeSource<u64>>::current_millis(self) as u128
    }
}
Benchmarking current_millis for u64 takes ~24.5ns on my 14" Macbook Pro M1, 32GB, 10 core (8 perf, 2 efficiency):
mono/clock/current_millis
time: [24.519 ns 24.577 ns 24.645 ns]
This is pretty fast, but what if we maintain the state of the clock internally and update a shared atomic instead? That should be quite fast, right?
I think ~935ps is about as fast as you can get:
mono/clock/u64/current_millis
time: [933.75 ps 934.26 ps 934.92 ps]
mono/clock/u128/current_millis
time: [934.29 ps 934.61 ps 934.89 ps]
And here’s the code, minus the comments:
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};

#[derive(Debug)]
struct SharedTickerInner {
    current: AtomicU64,
    _handle: OnceLock<JoinHandle<()>>,
}

#[derive(Clone, Debug)]
pub struct MonotonicClock {
    inner: Arc<SharedTickerInner>,
    epoch_offset: u64, // in milliseconds
}

impl Default for MonotonicClock {
    fn default() -> Self {
        // CUSTOM_EPOCH is a crate-level Duration constant defining what "zero" means
        Self::with_epoch(CUSTOM_EPOCH)
    }
}

impl MonotonicClock {
    pub fn with_epoch(epoch: Duration) -> Self {
        let start = Instant::now();
        let system_now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("System clock before UNIX_EPOCH");
        let offset = system_now
            .checked_sub(epoch)
            .expect("System clock before custom epoch")
            .as_millis() as u64;
        let inner = Arc::new(SharedTickerInner {
            current: AtomicU64::new(0),
            _handle: OnceLock::new(),
        });
        let weak_inner = Arc::downgrade(&inner);
        let handle = thread::spawn(move || {
            let mut tick = 0;
            loop {
                let Some(inner_ref) = weak_inner.upgrade() else {
                    break;
                };
                // Compute the absolute target time of the next tick
                let target = start + Duration::from_millis(tick);
                // Sleep if we are early
                let now = Instant::now();
                if now < target {
                    thread::sleep(target - now);
                }
                // After waking, recompute how far we actually are from the
                // start
                let now_ms = start.elapsed().as_millis() as u64;
                // Monotonic store, aligned to elapsed milliseconds since start
                inner_ref.current.store(now_ms, Ordering::Relaxed);
                // Align to next tick after the current actual time
                tick = now_ms + 1;
            }
        });
        inner
            ._handle
            .set(handle)
            .expect("failed to set thread handle");
        Self {
            inner,
            epoch_offset: offset,
        }
    }
}

impl TimeSource<u64> for MonotonicClock {
    fn current_millis(&self) -> u64 {
        self.epoch_offset + self.inner.current.load(Ordering::Relaxed)
    }
}

impl TimeSource<u128> for MonotonicClock {
    fn current_millis(&self) -> u128 {
        <Self as TimeSource<u64>>::current_millis(self) as u128
    }
}
It’s quite a lot to understand, but the gist of it is a background thread updates a shared atomic integer that tracks the elapsed milliseconds — reducing syscalls to a handful per millisecond.
It is thread-safe and cheap to clone so a single clock can be used with many generators. You’ll notice some logic around an offset — this is to allow the user to define what point in time a value of zero milliseconds represents in the ULID.
We can use the Relaxed memory ordering because we’re simply reading a single atomic value that controls no other state — the many threads consuming that value will eventually see the current millisecond, and that’s acceptable behavior for our ID generators.
My implementation only uses SystemTime during construction and Instant inside the background thread, which guarantees a monotonic reading — barring platform bugs, this implementation is safe from going backwards. However, this clock may experience time dilation (each tick may not be the same length) or jump forward (sleep for longer than 1ms). This is okay because we still have monotonic time; if the clock performs poorly we only lose throughput — the IDs generated are always valid.
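As a quick usage sketch (assuming the TimeSource trait and MonotonicClock defined above), one clock can be cloned into any number of threads, and every read is just a relaxed atomic load:

fn main() {
    // One clock, many readers; cloning only bumps the Arc refcount.
    let clock = MonotonicClock::default();
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let clock = clock.clone();
            std::thread::spawn(move || {
                <MonotonicClock as TimeSource<u64>>::current_millis(&clock)
            })
        })
        .collect();
    for handle in handles {
        println!("millis since custom epoch: {}", handle.join().unwrap());
    }
}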
Great, now that we have an optimized time source, what’s next?
The RNG
A naive random number generator is… well, perfect! Here’s my implementation:
use crate::RandSource;
use rand::{Rng, rng};

#[derive(Default, Clone, Debug)]
pub struct ThreadRandom;

impl RandSource<u64> for ThreadRandom {
    fn rand(&self) -> u64 {
        rng().random()
    }
}

impl RandSource<u128> for ThreadRandom {
    fn rand(&self) -> u128 {
        rng().random()
    }
}
Thankfully, the rand crate’s thread-local generator is suitable for our needs. And it is fast enough:
thread/rng/u64/rand
time: [11.918 ns 11.948 ns 11.981 ns]
thread/rng/u128/rand
time: [20.949 ns 20.953 ns 20.958 ns]
So how can we extract more performance here? Instead of messing with RNG configurations which could be dangerous, we can do something similar to our clock hack — less work per millisecond.
Time-sortable is not the same as a monotonic ordering. ULIDs can be ordered by time, but within the same millisecond IDs could be out of order if you generate new random numbers. So the distinction is producing a strict ordering vs. a random ordering within the same millisecond.
And of course we’re talking about a single generator and not distributed generators — you cannot guarantee a global ordering without coordination (Snowflake IDs are better suited for this).
So what is the difference with respect to a single generator? Why would you care about local ordering if you cannot guarantee a global ordering for ULIDs?
Well…most likely, you don’t.
I read a lot of debate on the topic of “you don’t need monotonic IDs”, but my hot take? Monotonic IDs are a side-effect of generator optimization. And you should prefer them at the cost of leaking information about IDs generated within the same millisecond (you can statistically guess the prev/next sequence of an ID to find another ID within the same millisecond).
A quick guide to monotonic vs. non-monotonic generators:
- Need local ordering with a single generator? monotonic.
- Need global ordering across multiple distributed generators? monotonic — but use coordinated Snowflake-like generators.
- Concerned about information leaking with IDs in the same millisecond? non-monotonic.
If you don’t require ordering but are considering whether it’s acceptable, ask yourself one question:
- Can your application afford to leak info in high throughput generation scenarios? This only affects IDs within the same millisecond.
Okay, moving on! This article is about performance 🦾.
I claimed performance is much better — let’s test that theory. Here’s a bench of a non-monotonic ULID generator that always generates a new random number for every invocation within the same millisecond:
mono/sequential/ulid/basic/elems/4096
time: [92.548 µs 92.592 µs 92.639 µs]
thrpt: [44.215 Melem/s 44.237 Melem/s 44.258 Melem/s]
Note: I’m benchmarking in chunks of 4096 elements to simplify and re-use benchmarking code with other (Snowflake) generators, but this has no effect on the results.
Now, ~44M ULIDs per second isn’t bad, but…
The ULID spec mentions monotonic IDs to provide an ordering for IDs generated within a single millisecond. Bingo — we can do less work by generating a single random value per millisecond and avoiding extra RNG calls, instead incrementing the least significant bits and guarding against overflow. The control flow and increment are much faster than a fresh RNG call. How fast? ~288M ULIDs per second (6.5x faster @ ~3.47ns/ULID):
Finished `bench` profile [optimized] target(s) in 2.45s
mono/sequential/ulid/basic/elems/4096
time: [14.200 µs 14.216 µs 14.245 µs]
thrpt: [287.55 Melem/s 288.12 Melem/s 288.46 Melem/s]
change:
time: [-84.667% -84.645% -84.615%] (p = 0.00 < 0.05)
thrpt: [+549.99% +551.25% +552.17%]
Performance has improved.
Massive gains!
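To show the shape of the trick, here is a simplified, single-threaded sketch (not Ferroid’s actual generator). It assumes the TimeSource and RandSource traits from above; the hot path only touches the clock, and the RNG is called once per millisecond:

// Simplified monotonic ULID generator: one RNG call per millisecond,
// increments within the same millisecond.
const RANDOM_MASK: u128 = (1 << 80) - 1;

pub struct MonotonicUlidGen<C, R> {
    clock: C,
    rng: R,
    last_millis: u64,
    sequence: u128, // 80-bit random value, incremented within a millisecond
}

impl<C: TimeSource<u64>, R: RandSource<u128>> MonotonicUlidGen<C, R> {
    pub fn new(clock: C, rng: R) -> Self {
        Self { clock, rng, last_millis: 0, sequence: 0 }
    }

    pub fn next_id(&mut self) -> Option<u128> {
        let now = self.clock.current_millis();
        if now > self.last_millis {
            // New millisecond: one RNG call, then increments until the next tick.
            self.last_millis = now;
            self.sequence = self.rng.rand() & RANDOM_MASK;
        } else {
            // Same millisecond: bump the random bits, guarding the 80-bit field.
            if self.sequence == RANDOM_MASK {
                return None; // exhausted this millisecond; caller can retry next tick
            }
            self.sequence += 1;
        }
        Some(((self.last_millis as u128) << 80) | self.sequence)
    }
}

At ~288K IDs per millisecond against an 80-bit sequence space, the overflow branch is essentially never taken; a real generator would typically spin or wait for the next millisecond when it is.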
Encoding
Generating numeric ULIDs is great and all, but you may want to store them as lexicographically sortable strings, right? So let’s encode/decode using Crockford base32 strings. This is the only part of the code that makes a heap allocation — everything else stays on the stack. Check out the benchmarks:
base32/ulid/encode/string
time: [22.176 ns 22.196 ns 22.224 ns]
thrpt: [44.997 Melem/s 45.052 Melem/s 45.094 Melem/s]
base32/ulid/encode/buffer
time: [7.8981 ns 7.9133 ns 7.9319 ns]
thrpt: [126.07 Melem/s 126.37 Melem/s 126.61 Melem/s]
base32/ulid/decode
time: [9.4287 ns 9.4364 ns 9.4435 ns]
thrpt: [105.89 Melem/s 105.97 Melem/s 106.06 Melem/s]
Clearly this is a bottleneck no matter what you do here — especially if you’re serializing to strings.
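For reference, the core of a Crockford base32 encoder is tiny. Here is a sketch (an illustration, not Ferroid’s implementation) that writes a 128-bit ULID into a fixed 26-byte buffer, most significant bits first:

// Crockford base32 alphabet (no I, L, O, U).
const ALPHABET: &[u8; 32] = b"0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Encode a 128-bit ULID into 26 characters (5 bits each; the first
// character only carries the top 3 bits). No heap allocation.
fn encode_ulid(mut value: u128, out: &mut [u8; 26]) {
    for slot in out.iter_mut().rev() {
        *slot = ALPHABET[(value & 0x1F) as usize];
        value >>= 5;
    }
}

fn main() {
    let mut buf = [0u8; 26];
    // Arbitrary example value
    encode_ulid(0x0189_6494_91A2_B3C4_D5E6_F708_192A_3B4C, &mut buf);
    println!("{}", std::str::from_utf8(&buf).unwrap());
}

The buffer variant in the benchmarks above is faster presumably because it skips the heap allocation that the string variant pays for.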
I wrote about it here, but in summary the chance of collisions changes because you’re now accounting for range overlap — the probability that one generator’s random number and incrementing sequence will overlap with another generator’s random number and incrementing sequence within the same millisecond. This is defined by the following:
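As a back-of-the-envelope approximation (my notation: g generators, k IDs per generator per millisecond, 80 random bits):

$$
P_{\text{ms}} \;\approx\; \binom{g}{2} \cdot \frac{2k - 1}{2^{80}}
$$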
The approximate formula assumes the random ranges will not overflow (and be cut off) to estimate worst-case outcomes.
We can also estimate the time required for the probability of at least one collision to reach 50%:
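Using the per-millisecond probability from above, the 50% point is roughly:

$$
T_{50\%} \;\approx\; \frac{\ln 2}{P_{\text{ms}}} \ \text{milliseconds}
$$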
What does this look like on my machine? Let’s plug in 288,000 IDs/ms (288M/sec) for k and use 10 generators for g:
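Here is a rough sanity check of that arithmetic, assuming the approximation above and single-draw Powerball odds of about 1 in 292 million:

fn main() {
    let k = 288_000.0_f64; // IDs per generator per millisecond (~288M/s)
    let g = 10.0_f64; // generators
    let space = 2.0_f64.powi(80); // 80 random bits

    // Monotonic case: pairs of generators whose [start, start + k) ranges overlap.
    let pairs = g * (g - 1.0) / 2.0;
    let p_ms = pairs * (2.0 * k - 1.0) / space;
    println!("P(collision per ms) ≈ {p_ms:.2e}"); // ≈ 2.1e-17

    // Compare against single-draw Powerball odds (~1 in 292 million).
    let powerball = 1.0 / 292_000_000.0;
    println!("Powerball is ≈ {:.2e}x more likely", powerball / p_ms); // ≈ 1.6e8

    // Milliseconds until a 50% chance of at least one collision.
    let ms_to_half = 2.0_f64.ln() / p_ms;
    let years = ms_to_half / 1000.0 / 60.0 / 60.0 / 24.0 / 365.25;
    println!("~{years:.2e} years to reach 50%"); // ≈ 1.0e6 years
}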
You are about 160 million times more likely to win the Powerball lottery in a single draw than to observe a collision in any given millisecond.
However, if you generate IDs continuously, collisions eventually become inevitable — but at this rate and with this number of generators, it would take roughly 1.02 million years to reach a 50% chance of any collision.
So how does this compare to non-monotonic generators? The formula changes to the simplified birthday problem:
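With n total IDs generated per millisecond across all generators, this is roughly:

$$
P_{\text{ms}} \;\approx\; \frac{n^2}{2 \cdot 2^{80}}
$$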
Substituting our values:
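A rough sketch of that substitution, assuming each non-monotonic generator runs at its measured rate (~44.2M IDs/s from the earlier benchmark, about 44,237 IDs per millisecond):

fn main() {
    let k = 44_237.0_f64; // IDs per generator per ms at the non-monotonic rate (~44.2M/s)
    let g = 10.0_f64; // generators
    let n = k * g; // total IDs per millisecond
    let space = 2.0_f64.powi(80);

    // Simplified birthday approximation
    let p_ms = n * n / (2.0 * space);
    let powerball = 1.0 / 292_000_000.0;
    println!("P(collision per ms) ≈ {p_ms:.2e}"); // ≈ 8.1e-14
    println!("Powerball is ≈ {:.2e}x more likely", powerball / p_ms); // ≈ 4.2e4 (~42k)
}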
Whoa, that’s actually a WAY worse collision risk! You’re now “only” ~42k times more likely to win the Powerball…
Use monotonic ULID generators — they’re saving you cycles and are vastly more collision resistant.
But this is not free — the tradeoff is that you leak information about IDs generated within the same millisecond.
Happy coding!