This year was my first time attending p99conf, and I had the amazing opportunity to be both an attendee and a speaker. It was also the first fully remote conference I've tried to follow with real focus. Joining from Bangkok meant the time difference was a bit of a marathon (I had planned to watch from BCN, but plans change), but I did my best to stay focused.
I'm so glad I did. The topics were all eye-catching for me: every talk was laser-focused on performance and optimization, which are subjects I always find fascinating.
For those interested, my talk was about our journey migrating Kubernetes policy management from Gatekeeper to Kyverno. I shared the story, the challenges, and the significant performance gains we achieved. You can check it out here:
But this post isn't about my talk; it's a recap of the other incredible sessions I attended and what I learned over those two days.
The Big Themes
There were a few major trends that surfaced across many of the talks:
- The Rust Wave: So many teams shared their stories of transitioning services (often from Go or C++) to Rust and the massive performance benefits they saw.
- LLM Inference: This was a huge topic, with multiple talks digging into the nitty-gritty of LLM inference optimization, from KV caching to token generation.
- Database Internals: We saw deep dives into how databases are built for ludicrous speed, with a focus on storage engines, compaction, and sharding.
- eBPF & Observability: A lot of cool tech was shown using eBPF for fine-grained, low-overhead performance observability.
I want to share some quick highlights from the sessions I found interesting. My apologies to the speakers if I’ve oversimplified — I’m just sharing the key things I learned. Please watch the full talks on https://www.p99conf.io/
LLM & AI Optimization
LLM Inference Optimization (by Chip Huyen)
This talk was a fantastic overview of the LLM inference landscape.
My key takeaway was the difference between Time to First Token (TTFT), which measures responsiveness, and Time Per Output Token (TPOT), which measures throughput.
Chip broke down how to make inference fast vs. cheap:
- Faster Hardware: Mostly out of your control here.
- More Efficient Models by:
Quantization: Reducing the precision of model weights (e.g., from 32-bit to 8-bit). This shrinks the model (roughly 28GB down to 7GB), letting it fit on smaller hardware and run faster with less memory. The risk is a loss in accuracy (see the sketch after this list).
Distillation: Training a smaller “student” model to mimic a larger “teacher” model. This transfers the knowledge into a much faster, smaller package.
- More Efficient Service:
Batching (Static/Dynamic): Grouping requests to maximize GPU utilization.
Prefill/Decode Decoupling: Input tokens (prefill) can be processed in parallel, while output tokens (decode) are sequential. You can use different hardware setups for each stage to optimize TTFT and TPOT independently.
Prompt Caching: This is huge. Since many requests share a common system prompt or examples, you can cache the processed state of these shared tokens and reuse them, dramatically speeding up processing.
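To make the quantization point concrete, here's a minimal sketch (my own illustration, not code from the talk) of symmetric int8 quantization: every float32 weight is mapped to an 8-bit integer with a single scale factor, a 4x shrink, which is exactly where numbers like 28GB down to 7GB come from, at the cost of small rounding errors.

```go
package main

import (
	"fmt"
	"math"
)

// quantizeInt8 maps float32 weights to int8 using a single symmetric
// scale factor: w ≈ float32(q) * scale. This is the simplest possible
// post-training quantization scheme, shown only to illustrate the
// memory/precision trade-off.
func quantizeInt8(weights []float32) (q []int8, scale float32) {
	var maxAbs float32
	for _, w := range weights {
		if a := float32(math.Abs(float64(w))); a > maxAbs {
			maxAbs = a
		}
	}
	if maxAbs == 0 {
		return make([]int8, len(weights)), 1
	}
	scale = maxAbs / 127 // map [-maxAbs, +maxAbs] onto [-127, 127]
	q = make([]int8, len(weights))
	for i, w := range weights {
		q[i] = int8(math.Round(float64(w / scale)))
	}
	return q, scale
}

// dequantize recovers an approximation of the original weights.
func dequantize(q []int8, scale float32) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = float32(v) * scale
	}
	return out
}

func main() {
	weights := []float32{0.12, -0.80, 0.33, 0.02}
	q, scale := quantizeInt8(weights)
	fmt.Println("int8:", q, "scale:", scale)
	fmt.Println("approx:", dequantize(q, scale)) // the small rounding error is the accuracy risk
}
```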
LLM KV Cache Offloading: Analysis and Practical Considerations (By Eshcar)
This talk dove deeper into one specific optimization: the KV Cache.
Inference has two phases: prefill (processing the prompt to get the first token) and decode (generating subsequent tokens one by one). The prefill phase builds the KV cache, which the decode phase uses.
The problem? In a multi-turn chat, the KV cache from the previous turn is often evicted, forcing the model to recompute it. KV Cache Offloading saves this cache to a separate store. When the next turn comes, it only computes the new part of the prompt and loads the old cache.
Why is this faster? Because retrieving the cache from storage (linear I/O) is often faster than recomputing it (quadratic compute cost).
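As a rough sketch of the offloading idea (my own toy interpretation, with a made-up store interface, not the system from the talk): key the prefill state by a hash of the conversation prefix, push it to a cheaper tier instead of evicting it, and on the next turn load it back so only the newly added tokens need prefilling.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// KVState stands in for the serialized key/value tensors produced by prefill.
// In a real system this would be gigabytes of tensor data, not a tiny blob.
type KVState struct {
	NumTokens int
	Blob      []byte
}

// OffloadStore is any slower-but-larger tier (CPU RAM, local SSD, remote cache).
type OffloadStore interface {
	Put(key string, st KVState)
	Get(key string) (KVState, bool)
}

type memStore map[string]KVState

func (m memStore) Put(k string, st KVState) { m[k] = st }

func (m memStore) Get(k string) (KVState, bool) {
	st, ok := m[k]
	return st, ok
}

// prefixKey identifies a cached prefix by hashing its token ids.
func prefixKey(tokens []int) string {
	h := sha256.New()
	for _, t := range tokens {
		fmt.Fprintf(h, "%d,", t)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// planPrefill decides how much of the prompt actually needs recomputing.
// If the previous turn's KV cache was offloaded, we load it and only
// prefill the newly appended tokens (linear I/O instead of quadratic compute).
func planPrefill(store OffloadStore, prevTurn, fullPrompt []int) (reused int) {
	if st, ok := store.Get(prefixKey(prevTurn)); ok {
		reused = st.NumTokens
	}
	fmt.Printf("prefill %d of %d tokens (reused %d from offloaded cache)\n",
		len(fullPrompt)-reused, len(fullPrompt), reused)
	return reused
}

func main() {
	store := memStore{}
	turn1 := []int{1, 2, 3, 4, 5}
	// After serving turn 1, offload its KV cache instead of discarding it.
	store.Put(prefixKey(turn1), KVState{NumTokens: len(turn1), Blob: []byte("...")})

	turn2 := append(append([]int{}, turn1...), 6, 7) // turn 1 + new user message
	planPrefill(store, turn1, turn2)                 // only the 2 new tokens need prefill
}
```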
KV Caching for LLM Inference (By John Thomson)
This session offered a great mental model for how KV caching works.
When an LLM generates text, it’s autoregressive, meaning it predicts the next token based on all the previous tokens.
To do this, it uses a mechanism called self-attention. For every token in the prompt, the model calculates two special vectors (hidden states):
- K (Key): A vector that represents the token’s “identity” or what it is.
- V (Value): A vector that represents the token’s “content” or what it means.
Because the K and V vectors for past tokens never change, they only need to be calculated once. Without a cache, the model would wastefully recompute them for every previous token on every single generation step; with a KV cache, they are computed once and reused.
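Here's a toy single-head version of that idea in Go (entirely my own illustration, with fabricated projection functions standing in for the model's learned weights): each decode step computes K and V only for the newest token, appends them to the cache, and attends over the cached history instead of reprocessing every previous token.

```go
package main

import (
	"fmt"
	"math"
)

// Toy "projections": in a real model these are learned weight matrices.
// Here they just fabricate small vectors from a token id so the example runs.
func keyOf(token int) []float64   { return []float64{float64(token), 1} }
func valueOf(token int) []float64 { return []float64{float64(token) * 2, 0.5} }
func queryOf(token int) []float64 { return []float64{1, float64(token)} }

func dot(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// kvCache grows by one entry per token; nothing already cached is ever recomputed.
type kvCache struct {
	keys, values [][]float64
}

func (c *kvCache) appendToken(token int) {
	c.keys = append(c.keys, keyOf(token))
	c.values = append(c.values, valueOf(token))
}

// attend runs single-head attention for the newest token against the cache:
// a softmax over q·K scores, then a weighted sum over the cached V vectors.
func (c *kvCache) attend(q []float64) []float64 {
	scores := make([]float64, len(c.keys))
	var sum float64
	for i, k := range c.keys {
		scores[i] = math.Exp(dot(q, k))
		sum += scores[i]
	}
	out := make([]float64, len(q))
	for i, v := range c.values {
		w := scores[i] / sum
		for j := range out {
			out[j] += w * v[j]
		}
	}
	return out
}

func main() {
	cache := &kvCache{}
	for _, token := range []int{3, 1, 4} { // pretend these are generated one by one
		cache.appendToken(token)                  // O(1) new K/V work per step
		fmt.Println(cache.attend(queryOf(token))) // attends over all cached K/V
	}
}
```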
The talk explored a Radix Tree implementation for managing these cache blocks. A key challenge is smart cache eviction: when the GPU memory is full, what do you evict? You want to evict the “leaf” blocks (the LLM’s answers) while keeping the “system prompt” blocks that are shared by many requests.
🦀 Rust & System Rewrites
Timeseries Storage at Ludicrous Speed (By Datadog)
Datadog shared how they built their timeseries storage engine in Rust, replacing an older Go/RocksDB system. The new design is brilliant:
- Per-shard architecture: Each shard is single-threaded, completely eliminating lock contention.
- LSM-tree storage: Optimized for the heavy write-load of metrics.
- Unified caching: Handles both scalar and distribution metrics in one platform.
The results are staggering: 60x faster ingestion, 5x faster queries, and 2x more cost-efficient.
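Here's a tiny Go sketch of the shard-per-core pattern (my own illustration of the idea, not Datadog's Rust code): each shard owns its data inside a single goroutine and receives writes over a channel, so there is no shared state and nothing to lock.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

type point struct {
	series string
	value  float64
}

// shard owns its map exclusively; only its goroutine ever touches it,
// so no mutex is needed anywhere on the write path.
type shard struct {
	in   chan point
	data map[string][]float64
}

func newShard(wg *sync.WaitGroup) *shard {
	s := &shard{in: make(chan point, 1024), data: map[string][]float64{}}
	wg.Add(1)
	go func() {
		defer wg.Done()
		for p := range s.in {
			s.data[p.series] = append(s.data[p.series], p.value)
		}
	}()
	return s
}

// route picks a shard deterministically from the series name,
// so the same series always lands on the same single-threaded owner.
func route(shards []*shard, p point) {
	h := fnv.New32a()
	h.Write([]byte(p.series))
	shards[int(h.Sum32())%len(shards)].in <- p
}

func main() {
	var wg sync.WaitGroup
	shards := make([]*shard, 4)
	for i := range shards {
		shards[i] = newShard(&wg)
	}
	for i := 0; i < 10; i++ {
		route(shards, point{series: fmt.Sprintf("cpu.user.host%d", i%3), value: float64(i)})
	}
	for _, s := range shards {
		close(s.in)
	}
	wg.Wait()
	for i, s := range shards {
		fmt.Println("shard", i, "holds", len(s.data), "series")
	}
}
```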
Translations at Scale: Memory Optimization Techniques That Kept Uber’s P99 Under 1ms (By Uber)
Uber had a service with high GC pause times. Profiling showed the culprit was a translation library that created 8 million objects. The first fix was replacing a sync.Map with a simple map[string]Translation protected by an RWMutex, as writes were extremely rare.
But the service still hit OOMs during reloads, because the atomic reload process briefly held two full copies of the 800MB cache in memory (1.6GB total).
The final fix was a hybrid disk-based solution. They store the full data on disk in a custom format. All keys and their file offsets are loaded into memory (only 40MB!), and an LRU cache keeps hot values in memory. This hugely improved the memory footprint and solved the OOMs.
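A minimal sketch of that first fix (type and method names are my own guesses, not Uber's code): a plain map guarded by an RWMutex, where lookups take only the cheap read lock and the rare reload swaps in a freshly built map under the write lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Translation is a stand-in for whatever value type the real library stores.
type Translation struct {
	Text string
}

// store replaces sync.Map: reads vastly outnumber writes, so an RWMutex
// around a plain map is cheaper and allocates far fewer objects.
type store struct {
	mu sync.RWMutex
	m  map[string]Translation
}

// Get is the hot path: many concurrent readers share the read lock.
func (s *store) Get(key string) (Translation, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	t, ok := s.m[key]
	return t, ok
}

// Reload is the rare path: build the new map outside the lock,
// then swap it in while holding the write lock only briefly.
func (s *store) Reload(fresh map[string]Translation) {
	s.mu.Lock()
	s.m = fresh
	s.mu.Unlock()
}

func main() {
	s := &store{m: map[string]Translation{"greeting.en": {Text: "Hello"}}}
	if t, ok := s.Get("greeting.en"); ok {
		fmt.Println(t.Text)
	}
	s.Reload(map[string]Translation{"greeting.en": {Text: "Hi"}})
}
```

Note that during the swap the old and new maps briefly coexist, which is exactly the 2x memory spike that eventually pushed them to the disk-backed, offset-indexed design.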
Data, Pipelines, & Serialisation
8x Better Than Protobuf: Rethinking Serialization for Data Pipelines (By Almog Gavra)
This session was about the pain of data pipelines.
Almog described a Kafka streaming app where processing time would spike to 7 minutes. The problem was a classic O(n²) pattern: to add one event to a batch, the code would deserialise the entire batch, append the new event, and re-serialise the whole list.
The talk introduced Imprint, a new serialization format built for pipelines. It avoids this by allowing operations like joins, projections (picking specific fields), and composition without full deserialisation. It pays a few bytes per field but unlocks massive performance gains by making deserialisation a rare event.
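The anti-pattern itself is easy to sketch (in Go here, with a made-up event type): every append pays to decode and re-encode the whole batch, so processing n events costs O(n²).

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Event struct {
	ID int `json:"id"`
}

// appendEvent is the O(n^2) anti-pattern: to add one event we decode the
// entire batch, append, and re-encode everything. With n events already
// in the batch, each append touches all n of them.
func appendEvent(batch []byte, e Event) []byte {
	var events []Event
	if len(batch) > 0 {
		if err := json.Unmarshal(batch, &events); err != nil {
			panic(err)
		}
	}
	events = append(events, e)
	out, err := json.Marshal(events)
	if err != nil {
		panic(err)
	}
	return out
}

func main() {
	var batch []byte
	for i := 0; i < 5; i++ {
		batch = appendEvent(batch, Event{ID: i}) // full decode/encode round-trip every time
	}
	fmt.Println(string(batch))
	// A pipeline-oriented format like Imprint sidesteps this by letting you
	// append, project, and join without fully decoding the existing bytes.
}
```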
Achieving sub-10 millisec at Climatiq (by Climatiq)
This was a very good lesson in edge computing latency. To get sub-10ms P99, they focused on a few key areas:
- Keep data close: Most latency is network I/O. They embed static data directly into their binary and use a distributed DB (FaunaDB) that’s co-located with their edge app.
- Cache aggressively: Use stale-while-revalidate (sketched after this list) to serve stale data instantly while fetching an update in the background.
- Zero-Copy Deserialisation: This was the coolest part. Deserialising JSON or CSV on every request is slow. They switched to bincode (faster), then to rkyv. With rkyv, the in-memory representation of the data is the serialised format. You can use the data directly from the byte buffer with zero parsing or allocation, just by following pointers. This took their processing time from 28ms down to just 7 nanoseconds (unsafe rkyv).
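And a bare-bones Go sketch of the stale-while-revalidate pattern mentioned above (my own illustration, nothing to do with Climatiq's actual stack): serve whatever is cached immediately, and if it has gone stale, refresh it in the background so the next request sees fresh data.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value     string
	fetchedAt time.Time
}

type swrCache struct {
	mu         sync.Mutex
	entries    map[string]entry
	ttl        time.Duration
	refreshing map[string]bool
	fetch      func(key string) string // the slow origin call
}

// Get never blocks on the origin if we have *any* cached value:
// stale data is served instantly and revalidated off the request path.
func (c *swrCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		return "", false // cold miss: caller must fetch synchronously
	}
	if time.Since(e.fetchedAt) > c.ttl && !c.refreshing[key] {
		c.refreshing[key] = true
		go func() { // background revalidation
			v := c.fetch(key)
			c.mu.Lock()
			c.entries[key] = entry{value: v, fetchedAt: time.Now()}
			c.refreshing[key] = false
			c.mu.Unlock()
		}()
	}
	return e.value, true
}

func main() {
	c := &swrCache{
		entries:    map[string]entry{"factor:eu": {value: "0.23", fetchedAt: time.Now().Add(-time.Hour)}},
		ttl:        10 * time.Minute,
		refreshing: map[string]bool{},
		fetch:      func(key string) string { return "0.25" }, // pretend origin lookup
	}
	v, _ := c.Get("factor:eu") // served instantly even though it's stale
	fmt.Println("served:", v)
	time.Sleep(50 * time.Millisecond) // give the background refresh time to land
	v, _ = c.Get("factor:eu")
	fmt.Println("after refresh:", v)
}
```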
Low-Level Tuning
Go Faster: Tuning the Go Runtime for Latency and Throughput (By Pawel from ShareChat)
A great practical talk on Go tuning. The key message: observe before you optimize. Use pprof and runtime metrics (go_gc, go_sched, etc.).
The talk covered the “big three” tuning knobs:
- GOMAXPROCS: Now container-aware by default in Go 1.25, but still important to understand.
- GOGC: Controls GC frequency. A higher value (e.g., GOGC=200) means GC runs less often, trading higher memory use for lower CPU overhead.
- GOMEMLIMIT: A newer setting (Go 1.19+) that sets a limit on memory, forcing the GC to run more frequently to stay under it. A good starting point is ~90% of your container's memory limit.
He also covered Profile-Guided Optimization (PGO), where you feed production CPU profiles back into the compiler to help it make smarter inlining decisions.
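For reference, the same GC knobs can also be set from application code via runtime/debug rather than environment variables (the values below are placeholders, not recommendations from the talk):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to GOGC=200: let the heap grow to ~3x live data before the
	// next GC cycle, trading higher memory use for less GC CPU overhead.
	oldGC := debug.SetGCPercent(200)

	// Equivalent to GOMEMLIMIT (Go 1.19+): a soft ceiling the GC works to
	// stay under. Placeholder value; the rule of thumb from the talk was
	// ~90% of the container's memory limit.
	oldLimit := debug.SetMemoryLimit(900 << 20) // 900 MiB

	fmt.Println("previous GOGC:", oldGC, "previous GOMEMLIMIT:", oldLimit)
}
```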
Final Thoughts
This was just a fraction of the talks I saw. The engineering level and the depth of the talks at p99conf were just incredible. I'm leaving with a notebook full of ideas, a new appreciation for Rust, and a much deeper understanding of what it takes to build truly high-performance systems.
If any of these topics sound interesting, I highly recommend checking out the full list!
Cheers!
