Ring Cache for High-Frequency Trading in Go: Microsecond-Level Data Access


The other day I was preparing for a call with a potential HFT client and remembered the ring cache - one of those optimizations that can cut data-access latency from 15μs to under 1μs. I decided to write down the main points.

Key Takeaways

  • Ring cache provides O(1) lookup with predictable latency for recent data
  • Lock-free implementation achieves sub-microsecond access times
  • Memory pre-allocation eliminates GC pauses in hot paths
  • Cache line alignment is critical for concurrent access patterns
  • Ring cache trades memory for deterministic performance

Why conventional data structures are not suitable

A typical HFT system processes market data from 500+ symbols at 50,000+ updates per second. Each tick needs historical context: last price, VWAP, volume profile.

The naive implementation uses map[string]*Quote for lookups. This works until you measure the latency distribution:

P50:    2.3μs  ✓ acceptable
P95:    8.5μs  ⚠ borderline
P99:   15.2μs  ✗ missed trading opportunities
P99.9:   47μs  ✗ disaster

The P99 tail kills trading performance. In HFT, being 15μs late means the opportunity is gone. The problem isn't average latency—it's unpredictability.
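For reference, here is the shape of that naive hot path, as a minimal sketch (the trimmed Quote type and the RWMutex guard are assumptions for illustration; the full Quote type appears later in the article):

package quotes

import "sync"

// Quote is trimmed for the example.
type Quote struct {
	Symbol string
	Price  float64
	Volume uint64
}

// QuoteBook is the naive baseline: an RWMutex-guarded map keyed by symbol.
// Every Get pays for string hashing, possible lock contention, and
// pointer-chasing through the map's buckets.
type QuoteBook struct {
	mu     sync.RWMutex
	quotes map[string]*Quote
}

func (qb *QuoteBook) Get(symbol string) (*Quote, bool) {
	qb.mu.RLock()
	defer qb.mu.RUnlock()
	q, ok := qb.quotes[symbol]
	return q, ok
}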

Root causes:

  • Map lookups involve hash computation and collision resolution
  • Go map grows and reallocates, causing GC pressure
  • Hash map memory access patterns defeat CPU cache
  • sync.Map adds overhead for concurrent access

The solution: deterministic O(1) lookup with zero allocations. Enter ring cache.

Ring Cache: The Concept

Ring cache (circular buffer) is a fixed-size array that wraps around. It's perfect for "most recent N items" scenarios:

Index mapping:
  Symbol "AAPL" → hash("AAPL") % size → index 42
  Symbol "MSFT" → hash("MSFT") % size → index 133

Ring structure:
  [0] [1] [2] ... [42:AAPL] ... [133:MSFT] ... [255]
   ↑                                              ↓
   └──────────────────────────────────────────────┘
                    Wraps around

Key properties:

  • Fixed size: Pre-allocated, zero GC pressure
  • Direct indexing: hash(key) % size = index
  • Bounded memory: Size determines max entries
  • Collision handling: Linear probing or overwrite
  • No reallocation: Size never changes

Trade-off: You sacrifice "store everything" for "store most recent N with guaranteed speed."

Lock-Free Ring Cache Implementation

Here's a production-ready lock-free implementation suitable for market data:

package ringcache

import (
	"sync/atomic"
	"unsafe"
)

// Entry represents a single cache entry
type Entry struct {
	// Key stored as uint64 for faster comparison
	// Use hash of string key for symbol lookups
	key uint64

	// Value stored as unsafe.Pointer for zero-cost generics
	value unsafe.Pointer

	// Version for optimistic locking (CAS)
	version uint64

	// Padding to prevent false sharing (64-byte cache line)
	_ [40]byte
}

// RingCache is a lock-free circular buffer cache
type RingCache struct {
	// Size must be power of 2 for fast modulo
	size uint64
	mask uint64 // size - 1, for bitwise AND instead of modulo

	// Pre-allocated ring buffer
	entries []Entry

	// Statistics (separate cache line)
	_      [56]byte
	hits   uint64
	misses uint64
}

// New creates a ring cache with size as power of 2
func New(sizePowerOf2 uint) *RingCache {
	size := uint64(1) << sizePowerOf2 // 2^n
	return &RingCache{
		size:    size,
		mask:    size - 1,
		entries: make([]Entry, size),
	}
}

// Put stores a value in the cache (lock-free)
func (rc *RingCache) Put(key uint64, value unsafe.Pointer) {
	idx := key & rc.mask // Fast modulo: key % size
	entry := &rc.entries[idx]

	// Increment version for optimistic concurrency control
	version := atomic.AddUint64(&entry.version, 1)

	// Memory barrier: ensure version update visible before writes
	atomic.StoreUint64(&entry.key, key)
	atomic.StorePointer(&entry.value, value)

	// Final version increment signals write complete
	atomic.AddUint64(&entry.version, 1)

	_ = version // Used by readers for consistency check
}

// Get retrieves a value from cache (lock-free)
func (rc *RingCache) Get(key uint64) (unsafe.Pointer, bool) {
	idx := key & rc.mask
	entry := &rc.entries[idx]

	// Optimistic read: check version before and after
	versionBefore := atomic.LoadUint64(&entry.version)

	// Version must be even (not mid-write)
	if versionBefore&1 != 0 {
		atomic.AddUint64(&rc.misses, 1)
		return nil, false
	}

	// Read key and value
	storedKey := atomic.LoadUint64(&entry.key)
	value := atomic.LoadPointer(&entry.value)

	// Check version hasn't changed
	versionAfter := atomic.LoadUint64(&entry.version)
	if versionBefore != versionAfter {
		// Write happened during read, retry or fail
		atomic.AddUint64(&rc.misses, 1)
		return nil, false
	}

	// Verify key matches
	if storedKey != key {
		atomic.AddUint64(&rc.misses, 1)
		return nil, false
	}

	atomic.AddUint64(&rc.hits, 1)
	return value, true
}

// GetWithRetry attempts multiple reads if version changes
func (rc *RingCache) GetWithRetry(key uint64, maxRetries int) (unsafe.Pointer, bool) {
	for i := 0; i < maxRetries; i++ {
		if value, ok := rc.Get(key); ok {
			return value, true
		}
	}
	return nil, false
}

// Stats returns cache hit/miss statistics
func (rc *RingCache) Stats() (hits, misses uint64) {
	return atomic.LoadUint64(&rc.hits), atomic.LoadUint64(&rc.misses)
}

Type-Safe Wrapper for Market Quotes

package quotes

import (
	"unsafe"

	"ringcache"

	"github.com/cespare/xxhash/v2"
)

// Quote represents market data snapshot
type Quote struct {
	Symbol    string
	Price     float64
	Volume    uint64
	Timestamp int64
	BidPrice  float64
	AskPrice  float64
}

// QuoteCache is a type-safe wrapper around RingCache
type QuoteCache struct {
	ring *ringcache.RingCache
}

func NewQuoteCache(sizePowerOf2 uint) *QuoteCache {
	return &QuoteCache{
		ring: ringcache.New(sizePowerOf2),
	}
}

// hashString converts symbol to uint64 key
// Uses xxhash - thread-safe and faster than maphash
func hashString(s string) uint64 {
	return xxhash.Sum64String(s)
}

// Put stores a quote
func (qc *QuoteCache) Put(q *Quote) {
	key := hashString(q.Symbol)
	ptr := unsafe.Pointer(q)
	qc.ring.Put(key, ptr)
}

// Get retrieves a quote
func (qc *QuoteCache) Get(symbol string) (*Quote, bool) {
	key := hashString(symbol)
	ptr, ok := qc.ring.Get(key)
	if !ok {
		return nil, false
	}
	return (*Quote)(ptr), true
}

// GetWithRetry for high-contention scenarios
func (qc *QuoteCache) GetWithRetry(symbol string, maxRetries int) (*Quote, bool) {
	key := hashString(symbol)
	ptr, ok := qc.ring.GetWithRetry(key, maxRetries)
	if !ok {
		return nil, false
	}
	return (*Quote)(ptr), true
}
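To make the wrapper concrete, a minimal usage sketch (the import path and the quote values are illustrative):

package main

import (
	"fmt"
	"time"

	"quotes" // the wrapper package above; import path is illustrative
)

func main() {
	cache := quotes.NewQuoteCache(11) // 2^11 = 2048 slots

	cache.Put(&quotes.Quote{
		Symbol:    "AAPL",
		Price:     189.42,
		BidPrice:  189.41,
		AskPrice:  189.43,
		Volume:    1200,
		Timestamp: time.Now().UnixNano(),
	})

	if q, ok := cache.Get("AAPL"); ok {
		fmt.Printf("AAPL last=%.2f bid=%.2f ask=%.2f\n", q.Price, q.BidPrice, q.AskPrice)
	}
}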

Critical Optimizations

1. Power-of-2 Sizing for Fast Modulo

// SLOW: Modulo operation
index := hash(key) % size // Division operation, ~10-20 cycles

// FAST: Bitwise AND (when size is power of 2)
index := hash(key) & (size - 1) // 1 cycle

// Example:
// size  = 256 (2^8)
// mask  = 255 (0b11111111)
// hash  = 1337
// index = 1337 & 255 = 57 // Instant

2. Cache Line Padding

// BROKEN: False sharing kills performance
type Entry struct {
	key     uint64         // 8 bytes
	value   unsafe.Pointer // 8 bytes
	version uint64         // 8 bytes
	// Total: 24 bytes - multiple entries per cache line!
}

// FIXED: Pad to 64 bytes (cache line size)
type Entry struct {
	key     uint64
	value   unsafe.Pointer
	version uint64
	_       [40]byte // Padding: 24 + 40 = 64 bytes
}

// Benchmark results:
// Without padding: 450ns/op (concurrent)
// With padding:     89ns/op (concurrent)
// 5x speedup!
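One cheap way to keep the padding honest is a test that pins the struct size, so adding a field can't silently reintroduce false sharing (a sketch; the test name is mine):

package ringcache

import (
	"testing"
	"unsafe"
)

// TestEntryIsOneCacheLine fails if Entry drifts from exactly 64 bytes.
func TestEntryIsOneCacheLine(t *testing.T) {
	if got := unsafe.Sizeof(Entry{}); got != 64 {
		t.Fatalf("Entry size = %d bytes, want 64 (one cache line)", got)
	}
}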

3. Optimistic Versioning

Instead of locks, we use version counters:

// Writer increments version twice (odd = writing, even = stable)
version = atomic.AddUint64(&entry.version, 1) // Now odd (writing)
atomic.StoreUint64(&entry.key, key)
atomic.StorePointer(&entry.value, value)
atomic.AddUint64(&entry.version, 1) // Now even (stable)

// Reader checks version before and after
versionBefore := atomic.LoadUint64(&entry.version)
if versionBefore&1 != 0 {
	// Odd version = writer in progress, abort
	return nil, false
}

// Read data...
key := atomic.LoadUint64(&entry.key)
value := atomic.LoadPointer(&entry.value)

versionAfter := atomic.LoadUint64(&entry.version)
if versionBefore != versionAfter {
	// Version changed = concurrent write, retry
	return nil, false
}
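The version check is the part most likely to be subtly wrong, so it is worth hammering it under the race detector. A sketch of such a test, with two keys deliberately chosen to land in the same slot (test and type names are mine):

package ringcache

import (
	"testing"
	"unsafe"
)

func TestSeqlockConsistency(t *testing.T) {
	rc := New(8)                              // 256 slots, mask = 255
	const keyA, keyB = uint64(1), uint64(257) // 257 & 255 == 1: same slot

	type tagged struct{ key uint64 }
	a, b := &tagged{keyA}, &tagged{keyB}

	stop := make(chan struct{})
	go func() {
		for {
			select {
			case <-stop:
				return
			default:
				rc.Put(keyA, unsafe.Pointer(a))
				rc.Put(keyB, unsafe.Pointer(b))
			}
		}
	}()

	// A successful Get(keyA) must never hand back keyB's value.
	for i := 0; i < 1_000_000; i++ {
		if p, ok := rc.Get(keyA); ok {
			if got := (*tagged)(p).key; got != keyA {
				t.Fatalf("Get(keyA) returned entry tagged %d", got)
			}
		}
	}
	close(stop)
}

Run it with go test -race to catch ordering bugs the happy path hides.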

4. Memory Pre-Heating

// Pre-fault all pages to avoid page faults in hot path
func (rc *RingCache) Warm() {
	for i := range rc.entries {
		// Touch each entry to load into physical memory
		atomic.StoreUint64(&rc.entries[i].version, 0)
	}

	// Optional: pin the backing memory with mlock to prevent swapping
	// (requires CAP_IPC_LOCK); see the sketch below.
}
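If you do want to pin the buffer, golang.org/x/sys/unix exposes Mlock over a byte slice; a Linux-only sketch (the Lock method name is mine, and the byte-slice view uses unsafe.Slice from Go 1.17+):

//go:build linux

package ringcache

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// Lock pins the entry array into RAM so the hot path never page-faults.
// Requires CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK.
func (rc *RingCache) Lock() error {
	if len(rc.entries) == 0 {
		return nil
	}
	buf := unsafe.Slice(
		(*byte)(unsafe.Pointer(&rc.entries[0])),
		len(rc.entries)*int(unsafe.Sizeof(Entry{})),
	)
	return unix.Mlock(buf)
}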

Benchmarks: Ring Cache vs Alternatives

// White-box benchmarks: same package as the cache so New is in scope.
package ringcache

import (
	"sync"
	"testing"
	"unsafe"
)

var testKey = uint64(123456789)
var testValue = unsafe.Pointer(new(int))

// Standard map with mutex
type MapCache struct {
	mu   sync.RWMutex
	data map[uint64]unsafe.Pointer
}

func (m *MapCache) Get(key uint64) (unsafe.Pointer, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	v, ok := m.data[key]
	return v, ok
}

func (m *MapCache) Put(key uint64, value unsafe.Pointer) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
}

// sync.Map
type SyncMapCache struct {
	data sync.Map
}

func (m *SyncMapCache) Get(key uint64) (unsafe.Pointer, bool) {
	v, ok := m.data.Load(key)
	if !ok {
		return nil, false
	}
	return v.(unsafe.Pointer), true
}

func (m *SyncMapCache) Put(key uint64, value unsafe.Pointer) {
	m.data.Store(key, value)
}

// Benchmarks: Single-threaded reads
func BenchmarkRingCache_Read(b *testing.B) {
	rc := New(16) // 65536 entries
	rc.Put(testKey, testValue)

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		rc.Get(testKey)
	}
}

func BenchmarkMapCache_Read(b *testing.B) {
	mc := &MapCache{data: make(map[uint64]unsafe.Pointer)}
	mc.Put(testKey, testValue)

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mc.Get(testKey)
	}
}

func BenchmarkSyncMap_Read(b *testing.B) {
	sm := &SyncMapCache{}
	sm.Put(testKey, testValue)

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sm.Get(testKey)
	}
}

// Benchmarks: Concurrent reads (realistic HFT scenario)
func BenchmarkRingCache_ConcurrentRead(b *testing.B) {
	rc := New(16)
	rc.Put(testKey, testValue)

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			rc.Get(testKey)
		}
	})
}

func BenchmarkMapCache_ConcurrentRead(b *testing.B) {
	mc := &MapCache{data: make(map[uint64]unsafe.Pointer)}
	mc.Put(testKey, testValue)

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mc.Get(testKey)
		}
	})
}

func BenchmarkSyncMap_ConcurrentRead(b *testing.B) {
	sm := &SyncMapCache{}
	sm.Put(testKey, testValue)

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			sm.Get(testKey)
		}
	})
}
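The results below also include a mixed 99%-read / 1%-write workload over 100 keys that isn't in the listing above; a sketch of what that harness could look like, added to the same benchmark file (the exact key distribution and write pattern are assumptions):

func BenchmarkRingCache_Mixed(b *testing.B) {
	rc := New(16)
	keys := make([]uint64, 100)
	for i := range keys {
		keys[i] = uint64(i + 1)
		rc.Put(keys[i], testValue)
	}

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			key := keys[i%len(keys)]
			if i%100 == 0 { // ~1% writes
				rc.Put(key, testValue)
			} else { // ~99% reads
				rc.Get(key)
			}
			i++
		}
	})
}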

Benchmark Results (Apple M3 Max, Go 1.25):

Single-threaded reads (sequential hot key):
BenchmarkRingCache-14    1000000000     3.4 ns/op    0 B/op    0 allocs/op
BenchmarkMapCache-14      356512695    10.0 ns/op    0 B/op    0 allocs/op
BenchmarkSyncMap-14       174772400    20.2 ns/op    0 B/op    0 allocs/op

Concurrent - Single hot key (extreme contention):
BenchmarkRingCache-14      43857600    72.3 ns/op    0 B/op    0 allocs/op
BenchmarkMapCache-14       27517500   118.1 ns/op    0 B/op    0 allocs/op
BenchmarkSyncMap-14       361833454    10.5 ns/op    0 B/op    0 allocs/op

Concurrent - Realistic HFT (100 keys, 99% reads, 1% writes):
BenchmarkRingCache-14     178787863    17.2 ns/op    0 B/op    0 allocs/op
BenchmarkMapCache-14       96590238    40.9 ns/op    0 B/op    0 allocs/op
BenchmarkSyncMap-14        98043474    45.7 ns/op    0 B/op    0 allocs/op

Summary by use case:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Single-threaded:      Ring cache wins (3.4ns vs 10ns vs 20ns)
Hot key (read-only):  sync.Map wins   (10.5ns vs 72ns vs 118ns)
Mixed workload:       Ring cache wins (17ns vs 41ns vs 46ns)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️ IMPORTANT: Results vary significantly by CPU architecture (ARM vs x86),
cache topology, and workload pattern. Always benchmark for your specific use case.

Production Deployment: The Reality Check

Sizing Strategy

// Calculate required size
// Formula: size = numSymbols * overProvisionFactor
// Over-provision to reduce hash collisions
const (
	numSymbols      = 500     // Actively traded symbols
	collisionSafety = 4       // 4x over-provisioning
	targetSize      = 500 * 4 // 2000 entries
)

// Round up to power of 2
// 2000 → 2048 (2^11)
cache := NewQuoteCache(11) // 2^11 = 2048 entries

// Memory usage:
// 2048 entries * 64 bytes = 128KB
// Total with metadata: ~150KB
// Fits in L2 cache (256KB typical)
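If you'd rather compute the exponent than hard-code it, math/bits does the rounding; a hypothetical helper (not part of the package above):

package ringcache

import "math/bits"

// SizePowerOf2For returns n such that 1<<n is the smallest power of two
// with room for numSymbols*overProvision entries.
func SizePowerOf2For(numSymbols, overProvision int) uint {
	target := numSymbols * overProvision
	if target <= 1 {
		return 0
	}
	return uint(bits.Len(uint(target - 1)))
}

// SizePowerOf2For(500, 4) == 11, i.e. 2048 entries.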

Collision Handling

Ring cache has a critical limitation: hash collisions overwrite old data. For a most-recent-quote cache this is usually an acceptable trade-off:

// Collision scenario:
// hash("AAPL") % 2048 = 42
// hash("ZYXW") % 2048 = 42 ← Collision!

// Ring cache behavior:
cache.Put(quoteAAPL) // Stored at index 42
cache.Put(quoteZYXW) // Overwrites index 42
cache.Get("AAPL")    // Misses: the stored 64-bit key is ZYXW's hash, so the key check fails

// Solution: size the cache so collisions stay manageable
// With 2048 entries and 500 symbols:
//   Load factor: 500/2048 ≈ 24.4%
//   Rough birthday-problem estimate: a given symbol shares a slot with at
//   least one other with probability 1 - (1 - 1/2048)^499 ≈ 22%, so
//   collisions are uncommon per lookup but not negligible, hence the
//   fallback strategy below
// Trade-off: 4x over-provisioning keeps most slots empty
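Because the symbol universe is usually known before the session starts, you can detect colliding pairs up front and decide whether to grow the cache or accept the overlap; a sketch (the function name is mine):

package quotes

import "github.com/cespare/xxhash/v2"

// CollidingSymbols groups symbols that map to the same slot for a given mask.
// Run it at startup against the day's symbol list.
func CollidingSymbols(symbols []string, mask uint64) map[uint64][]string {
	slots := make(map[uint64][]string, len(symbols))
	for _, s := range symbols {
		idx := xxhash.Sum64String(s) & mask
		slots[idx] = append(slots[idx], s)
	}

	collisions := make(map[uint64][]string)
	for idx, group := range slots {
		if len(group) > 1 {
			collisions[idx] = group
		}
	}
	return collisions
}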

Fallback Strategy

type HybridCache struct {
	ring      *QuoteCache // Fast path
	fallback  *sync.Map   // Slow path (complete data)
	hits      atomic.Uint64
	fallbacks atomic.Uint64
}

func (hc *HybridCache) Get(symbol string) (*Quote, bool) {
	// Try ring cache first (fast path)
	if q, ok := hc.ring.Get(symbol); ok {
		hc.hits.Add(1)
		return q, true
	}

	// Fallback to sync.Map (slow path)
	hc.fallbacks.Add(1)
	if v, ok := hc.fallback.Load(symbol); ok {
		return v.(*Quote), true
	}
	return nil, false
}

func (hc *HybridCache) Put(q *Quote) {
	hc.ring.Put(q)                 // Always update ring cache
	hc.fallback.Store(q.Symbol, q) // Keep complete data in fallback
}

// Production stats:
// Ring cache hit rate: 99.7%
// Fallback queries: 0.3%
// P99 latency: 850ns (vs 15μs with map only)
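One detail worth calling out: the fallback *sync.Map has to be initialized before use. A hypothetical constructor:

func NewHybridCache(sizePowerOf2 uint) *HybridCache {
	return &HybridCache{
		ring:     NewQuoteCache(sizePowerOf2),
		fallback: &sync.Map{},
	}
}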

Hash Function Selection

import ( "hash/maphash" "github.com/cespare/xxhash/v2" ) // Option 1: maphash (Go stdlib, randomized seed) type HasherMapHash struct { h maphash.Hash } func (h *HasherMapHash) Hash(s string) uint64 { h.h.Reset() h.h.WriteString(s) return h.h.Sum64() } // Benchmark: 18ns/op for 8-char string // Option 2: xxHash (deterministic, faster) func HashXX(s string) uint64 { return xxhash.Sum64String(s) } // Benchmark: 9ns/op for 8-char string // For HFT: Use xxHash // - Deterministic (important for debugging) // - 2x faster than maphash // - Excellent distribution quality

Monitoring and Debugging

Real-Time Metrics

import (
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
)

type CacheMetrics struct {
	hits       atomic.Uint64
	misses     atomic.Uint64
	collisions atomic.Uint64
	retries    atomic.Uint64

	// Latency histogram (low overhead)
	latencyBuckets [10]atomic.Uint64 // 0-100ns, 100-200ns, etc.
}

func (cm *CacheMetrics) RecordLatency(ns int64) {
	bucket := ns / 100
	if bucket >= 10 {
		bucket = 9
	}
	cm.latencyBuckets[bucket].Add(1)
}

func (cm *CacheMetrics) HitRate() float64 {
	hits := cm.hits.Load()
	misses := cm.misses.Load()
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total) * 100
}

// Export to Prometheus: register collectors that read the atomics lazily
// (client_golang's CounterFunc/GaugeFunc) rather than pushing values.
func (cm *CacheMetrics) Collectors() []prometheus.Collector {
	return []prometheus.Collector{
		prometheus.NewCounterFunc(
			prometheus.CounterOpts{Name: "ring_cache_hits"},
			func() float64 { return float64(cm.hits.Load()) },
		),
		prometheus.NewCounterFunc(
			prometheus.CounterOpts{Name: "ring_cache_misses"},
			func() float64 { return float64(cm.misses.Load()) },
		),
		prometheus.NewGaugeFunc(
			prometheus.GaugeOpts{Name: "ring_cache_hit_rate"},
			func() float64 { return cm.HitRate() },
		),
	}
}

Collision Detection

// Enhanced Get with collision tracking.
// Assumes the RingCache carries an extra collisions counter alongside hits/misses.
func (rc *RingCache) GetDebug(key uint64) (unsafe.Pointer, bool, bool) {
	idx := key & rc.mask
	entry := &rc.entries[idx]

	// ... version checking as in Get ...

	storedKey := atomic.LoadUint64(&entry.key)
	value := atomic.LoadPointer(&entry.value)

	// Check for collision: slot is occupied by a different key
	collision := storedKey != 0 && storedKey != key
	if collision {
		atomic.AddUint64(&rc.collisions, 1)
	}

	// ... version validation as in Get ...

	if storedKey != key {
		return nil, false, collision
	}
	return value, true, false
}

// Periodic collision analysis
func (rc *RingCache) AnalyzeCollisions() {
	occupied := 0
	for i := range rc.entries {
		if atomic.LoadUint64(&rc.entries[i].key) != 0 {
			occupied++
		}
	}

	loadFactor := float64(occupied) / float64(rc.size) * 100
	log.Printf("Ring cache load factor: %.2f%%", loadFactor)

	if loadFactor > 80 {
		log.Printf("WARN: ring cache over 80%% full - consider increasing size")
	}
}

When NOT to Use Ring Cache

Ring cache isn't a silver bullet. Consider alternatives when:

  • Small hot set (<100 keys): sync.Map is simpler and often faster for read-heavy workloads with few keys. The benchmarks above showed sync.Map roughly 7x faster in the single-hot-key scenario.
  • Read-only workload: sync.Map excels at read-heavy patterns with minimal writes. It uses lock-free reads for "promoted" keys.
  • Data size is unbounded: Ring cache requires fixed size. If you need to store millions of keys, use a proper database or sharded map.
  • Exact lookups are critical: Hash collisions cause data loss. If you can't tolerate occasional misses, use a conflict-free structure like sync.Map or map+RWMutex.
  • Latency isn't critical: If 100μs is acceptable, standard sync.Map is simpler, safer, and collision-free.
  • Write-heavy workloads: Ring cache optimizes reads. Heavy writes cause version conflicts and retries. sync.Map is better for write-heavy scenarios.
  • Complex queries: Ring cache only supports key lookup. No range scans, no secondary indexes, no atomic multi-key operations.

Rule of thumb: Use sync.Map first. Only switch to ring cache if profiling shows it's a bottleneck AND you have:

  • Large key space (500+ keys)
  • Mixed read/write workload
  • Strict latency requirements (<1μs P99)
  • Predictable keyspace (can afford collisions)

Implementation Notes

Go Version Requirements:

  • Go 1.19+ is required for the metrics and hybrid-cache examples, which use the typed atomics (atomic.Uint64) added in that release
  • Go 1.19 also revised the memory model, giving atomic operations sequentially consistent semantics
  • The core RingCache uses only the classic atomic.* functions and compiles on earlier versions, but may run 10-15% slower there
  • Benchmarks in this article were run with Go 1.25

Thread-Safety Note:

  • The implementation uses xxhash.Sum64String() which is thread-safe (stateless function)
  • Avoid sharing a maphash.Hash across goroutines - it is not safe for concurrent use and needs per-goroutine instances or a sync.Pool (see the sketch after this list)
  • Always test with go test -race to verify thread-safety
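If you prefer to stay in the standard library, Go 1.19 added maphash.String, which is stateless and safe for concurrent use; a minimal sketch (the helper name is mine):

package quotes

import "hash/maphash"

// One process-wide seed so every goroutine computes the same hash for a symbol.
var seed = maphash.MakeSeed()

// hashSymbolStdlib is a concurrency-safe stdlib alternative to sharing a
// maphash.Hash value across goroutines.
func hashSymbolStdlib(s string) uint64 {
	return maphash.String(seed, s)
}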

Memory Model:

  • Go's atomic operations provide sequentially consistent semantics
  • On ARM/weak memory architectures, atomics include necessary memory barriers
  • No need for explicit memory fences - Go's atomics handle this

The Bottom Line: Realistic Expectations

Performance Comparison (Apple M3 Max):

Implementation     Single-threaded    Concurrent (hot key)    Mixed (99% R / 1% W)
Ring Cache         3.4 ns             72 ns                   17 ns
sync.Map           20 ns              10.5 ns                 46 ns
map + RWMutex      10 ns              118 ns                  41 ns

Key Insights:

  • No silver bullet: Each approach wins in different scenarios
  • sync.Map surprises: Fastest for read-heavy workloads with hot keys
  • Ring cache shines: Best for mixed workloads (reads + writes) with large keyspace
  • Architecture matters: Results vary between ARM and x86, different cache topologies

Production Benefits (when properly applied):

  • ✓ Predictable latency (no GC pauses in hot path)
  • ✓ Zero allocations per operation
  • ✓ Fixed memory footprint (150KB vs 2MB+ for growing maps)
  • ✓ Better cache locality (all data in contiguous array)
  • ⚠️ Trade-off: potential collisions vs guaranteed bounds

P.S.

Ring cache isn't about being "always faster" — it's about predictability and control. sync.Map might be faster for read-heavy workloads, but ring cache gives you:

  • Guaranteed worst-case bounds (no unbounded lock contention)
  • Zero GC pauses in the hot path
  • Explicit trade-offs (you choose collision rate via sizing)
  • Consistent performance across different access patterns

For critical HFT systems, the hybrid approach works best: ring cache for the hot path (predictable latency), sync.Map for fallback (collision recovery). This provides sub-microsecond latency with map-level reliability.

Remember: Always benchmark your specific workload on your target hardware. These results were measured on ARM (Apple M3 Max) and will differ on x86 servers (Intel Xeon, AMD EPYC). The "fastest" solution depends on your cache topology, memory bandwidth, and access patterns.
