A high-performance embedded key-value database with a radix trie index structure
kv_log is a persistent key-value store designed for high performance and reliability. It uses a combination of append-only logging for data storage and a radix trie for indexing.
- Append-only log file: Data is written sequentially for optimal write performance
- Radix trie indexing: Fast key lookups with O(k) complexity where k is key length
- ACID transactions: Support for atomic operations with commit/rollback
- Write-Ahead Logging (WAL): Ensures durability and crash recovery
- Configurable write modes: Balance between performance and durability
- Efficient page cache: Minimizes disk I/O with intelligent caching
- Concurrent access: Thread-safe operations with appropriate locking
A kv_log database consists of three files:
- Main file: Stores all key-value data in an append-only log format (modeled in the sketch below)
- Index file: Contains the radix trie structure for efficient key lookups
- WAL file: Holds flushed index pages for high durability
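To make the main-file layout concrete, here is a minimal, self-contained Go sketch of how an append-only log can be written: length-prefixed key/value records followed by a commit marker. The record layout and the `commitMarker` byte are illustrative assumptions, not kv_log's actual on-disk format.

```go
// A minimal, self-contained model of an append-only main file:
// length-prefixed key/value records followed by a commit marker.
// The layout and marker byte are illustrative assumptions, not
// kv_log's actual on-disk format.
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const commitMarker = 0xC0 // hypothetical marker byte

// appendRecord writes one length-prefixed key/value pair sequentially.
func appendRecord(f *os.File, key, value []byte) error {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(value)))
	for _, buf := range [][]byte{hdr[:], key, value} {
		if _, err := f.Write(buf); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	f, err := os.CreateTemp("", "kvlog-*.main")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	// Records are only appended, never rewritten in place.
	appendRecord(f, []byte("user:1"), []byte("alice"))
	appendRecord(f, []byte("user:2"), []byte("bob"))

	// A commit marker makes everything before it visible; recovery
	// ignores any bytes that follow the last marker.
	f.Write([]byte{commitMarker})
	fmt.Println("appended 2 records and a commit marker to", f.Name())
}
```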
The index uses a radix trie (also known as a Patricia trie) to locate keys efficiently; a simplified model follows the list below:
- Each node in the trie represents a part of the key
- The trie is organized by byte values (0-255)
- Leaf pages store entries with the same prefix
- When a leaf page becomes full, it's converted to a radix page
- Radix Pages: Internal nodes of the trie, containing pointers to other pages
- Leaf Pages: Terminal nodes containing key suffixes and pointers to data
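The sketch below is a simplified in-memory model of how the two page types cooperate during a lookup. The `radixPage`/`leafPage` structs and the byte-indexed child table are assumptions made for illustration; real pages live on disk and are more compact.

```go
// A simplified in-memory model of the two page types. Real pages
// live on disk; these structs and the byte-indexed child table are
// assumptions made for illustration.
package main

import "fmt"

type leafPage struct {
	entries map[string]int64 // key suffix -> offset into the main file
}

type radixPage struct {
	children [256]interface{} // indexed by next key byte: *radixPage or *leafPage
}

// lookup consumes one byte of the key per radix level, so it runs in
// O(k) where k is the key length.
func lookup(root *radixPage, key []byte) (int64, bool) {
	var node interface{} = root
	for i := 0; i < len(key); i++ {
		r, ok := node.(*radixPage)
		if !ok {
			break
		}
		node = r.children[key[i]]
		if node == nil {
			return 0, false
		}
		// A leaf stores the remaining suffixes of all keys that
		// share the prefix consumed so far.
		if leaf, ok := node.(*leafPage); ok {
			off, found := leaf.entries[string(key[i+1:])]
			return off, found
		}
	}
	return 0, false
}

func main() {
	leaf := &leafPage{entries: map[string]int64{"pple": 100, "pricot": 164}}
	root := &radixPage{}
	root.children['a'] = leaf // all keys starting with 'a' share this leaf

	fmt.Println(lookup(root, []byte("apple"))) // 100 true
}
```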
- Write Modes: Choose between durability and performance (see the sketch after this list)
  - CallerThread_WAL_Sync: Maximum durability, lowest performance
  - WorkerThread_WAL: Good performance and durability (default)
  - WorkerThread_NoWAL_NoSync: Maximum performance, lowest durability
- Cache Size: Adjust based on available memory and workload
  - Larger cache improves read performance but uses more memory
- Checkpoint Threshold: Controls how large the WAL file can grow before a checkpoint is triggered
  - Smaller values reduce recovery time but may impact performance
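As a rough illustration of the tradeoff, the sketch below contrasts syncing on the caller's thread with handing writes to a background worker. The mode names come from the list above, but the mechanics shown are a simplified model, not kv_log internals.

```go
// A sketch of the durability spectrum: sync on the caller's thread
// versus handing writes to a background worker. The mode names come
// from the list above; the mechanics are a simplified model, not
// kv_log internals.
package main

import (
	"fmt"
	"os"
)

type writeMode int

const (
	CallerThread_WAL_Sync     writeMode = iota // fsync before commit returns
	WorkerThread_WAL                           // worker flushes the WAL asynchronously
	WorkerThread_NoWAL_NoSync                  // rely entirely on the OS page cache
)

func commit(f *os.File, data []byte, mode writeMode, worker chan<- []byte) error {
	switch mode {
	case CallerThread_WAL_Sync:
		if _, err := f.Write(data); err != nil {
			return err
		}
		return f.Sync() // durable before the caller proceeds: slowest, safest
	default:
		worker <- data // background write; a crash may lose recent commits
		return nil
	}
}

func main() {
	f, _ := os.CreateTemp("", "kvlog-*.wal")
	defer os.Remove(f.Name())

	worker := make(chan []byte, 16)
	done := make(chan struct{})
	go func() { // background writer, as in the WorkerThread_* modes
		for data := range worker {
			f.Write(data)
		}
		close(done)
	}()

	commit(f, []byte("txn-1"), CallerThread_WAL_Sync, worker)
	commit(f, []byte("txn-2"), WorkerThread_WAL, worker)
	close(worker)
	<-done
	fmt.Println("committed under two different modes")
}
```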
- Keys are limited to 2KB
- Values are limited to 128MB
- The database uses a page size of 4KB (these limits appear as constants in the sketch below)
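A small sketch with the stated limits as constants, plus a size check a store could run before accepting a write; the constants mirror the limits above, while the `validate` function itself is hypothetical.

```go
// The stated limits as constants, with a size check a store could run
// before accepting a write. The constants mirror the limits above; the
// validate function itself is hypothetical.
package main

import (
	"errors"
	"fmt"
)

const (
	MaxKeySize   = 2 * 1024          // keys are limited to 2KB
	MaxValueSize = 128 * 1024 * 1024 // values are limited to 128MB
	PageSize     = 4 * 1024          // the fixed 4KB page size
)

func validate(key, value []byte) error {
	if len(key) > MaxKeySize {
		return errors.New("key exceeds 2KB limit")
	}
	if len(value) > MaxValueSize {
		return errors.New("value exceeds 128MB limit")
	}
	return nil
}

func main() {
	fmt.Println(validate([]byte("ok"), []byte("small value"))) // <nil>
	fmt.Println(validate(make([]byte, 3000), nil))             // key exceeds 2KB limit
}
```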
The database automatically recovers from crashes through the following steps (modeled in the sketch after this list):
- Reading the main file header
- Checking for a valid index file
- Scanning for commit markers in the main file
- Rebuilding the index if necessary
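Here is a toy version of the commit-marker scan, reusing the hypothetical record layout from the earlier append-only sketch: records after the last commit marker are treated as uncommitted and dropped from the rebuilt index.

```go
// A toy version of the crash-recovery scan: walk the log, track the
// last commit marker, and rebuild an index from the records before it.
// The record layout matches the earlier append-only sketch and is an
// assumption, not kv_log's real format.
package main

import (
	"encoding/binary"
	"fmt"
)

const commitMarker = 0xC0 // hypothetical marker byte

// recoverIndex scans the log, keeping only records confirmed by a
// commit marker.
func recoverIndex(log []byte) map[string]int64 {
	index := map[string]int64{}
	pending := map[string]int64{}
	pos := 0
	for pos < len(log) {
		if log[pos] == commitMarker {
			for k, v := range pending { // promote the committed batch
				index[k] = v
			}
			pending = map[string]int64{}
			pos++
			continue
		}
		if pos+8 > len(log) {
			break // truncated header: crash mid-write, discard the tail
		}
		klen := int(binary.LittleEndian.Uint32(log[pos : pos+4]))
		vlen := int(binary.LittleEndian.Uint32(log[pos+4 : pos+8]))
		if pos+8+klen+vlen > len(log) {
			break // truncated record, same treatment
		}
		key := string(log[pos+8 : pos+8+klen])
		pending[key] = int64(pos) // remember where this record starts
		pos += 8 + klen + vlen
	}
	return index // records after the last marker were never committed
}

func main() {
	rec := func(k, v string) []byte {
		var hdr [8]byte
		binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(k)))
		binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(v)))
		return append(append(hdr[:], k...), v...)
	}
	// One committed record, then a torn write with no commit marker.
	log := append(rec("a", "1"), commitMarker)
	log = append(log, rec("b", "2")...) // no marker: lost in the crash

	fmt.Println(recoverIndex(log)) // map[a:0]
}
```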
- Single connection per database file: Only one process can open the database in write mode at a time
- Concurrent access model: Supports one writer thread or multiple reader threads simultaneously (illustrated below)
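The one-writer/many-readers model maps naturally onto a reader-writer lock. This generic Go illustration shows the access pattern; it is not kv_log's actual locking code.

```go
// A generic illustration of the one-writer / many-readers pattern
// using a reader-writer lock; not kv_log's actual locking code.
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu   sync.RWMutex
	data map[string]string
}

func (s *store) get(k string) string {
	s.mu.RLock() // many readers may hold this at once
	defer s.mu.RUnlock()
	return s.data[k]
}

func (s *store) set(k, v string) {
	s.mu.Lock() // a single writer excludes readers and other writers
	defer s.mu.Unlock()
	s.data[k] = v
}

func main() {
	s := &store{data: map[string]string{}}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { defer wg.Done(); s.set("k", "v") }() // one writer
	for i := 0; i < 4; i++ {                         // several readers
		wg.Add(1)
		go func() { defer wg.Done(); _ = s.get("k") }()
	}
	wg.Wait()
	fmt.Println("final:", s.get("k"))
}
```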
- Pro: It is extremely fast on reads for a disk-based database engine
- Con: The index uses a large amount of disk space for bigger databases
The index does not grow linearly; it grows in phases.
Example: the index file can reach ~25GB once the main database file grows beyond ~4GB.
Even so, it is more than 3.5x faster than BadgerDB on reads.
| Benchmark | | | | |
|---|---|---|---|---|
| Set 2M values | 2m 44.45s | 13.81s | 18.95s | 6.98s |
| 20K txns (10 items each) | 1m 0.09s | 1.32s | 2.78s | 1.70s |
| Space after write | 1052.08 MB | 2002.38 MB | 1715.76 MB | 1501.09 MB |
| Space after close | 1158.78 MB | 1203.11 MB | 2223.16 MB | 1899.50 MB |
| Read 2M values (fresh) | 1m 26.87s | 35.14s | 17.21s | 11.20s |
| Read 2M values (cold) | 1m 34.46s | 38.36s | 16.84s | 10.81s |
The benchmark writes 2 million records (key: 33 random bytes, value: 750 random bytes) in a single transaction and then reads them back in non-sequential order.
The transaction benchmark inserts the same data using 20 thousand transactions of 10 items each.
Apache 2.0