I vibe coded a database engine from scratch

2 hours ago 1
Relevant source files

This document introduces the LLKV database system, explaining its purpose, architecture, and core components. LLKV is a SQL database engine built in Rust that provides ACID transactions with MVCC snapshot isolation, columnar storage, and comprehensive SQL support.

For detailed information about specific subsystems:

Purpose and Scope

LLKV is a relational database system that executes SQL queries against columnar data structures. The system is organized as a Cargo workspace containing 15 specialized crates, each responsible for a distinct layer of the database stack—from SQL parsing and query planning through execution and storage.

The codebase emphasizes modularity, performance optimization through SIMD acceleration, and correctness validation via extensive SQL Logic Testing. The architecture follows a layered design where higher-level crates depend on lower-level abstractions, enabling independent testing and optimization of each subsystem.

Sources: Cargo.toml1-81 llkv-sql/src/sql_engine.rs141-185

System Architecture

The following diagram illustrates the major subsystems and their dependencies:

Crate Responsibilities:

CratePurpose
llkvRoot crate aggregating all subsystems
llkv-sqlSQL parsing, preprocessing, and SqlEngine API
llkv-planLogical query plan structures (SelectPlan, InsertPlan, etc.)
llkv-exprExpression AST types (Expr, ScalarExpr, Literal)
llkv-runtimeTransaction orchestration (RuntimeEngine, RuntimeSession)
llkv-executorQuery plan execution and result streaming
llkv-tableTable abstraction, catalog management, schema handling
llkv-column-mapColumnar storage implementation
llkv-storageStorage pager interface (Pager, BoxedPager, MemPager)
llkv-transactionMVCC transaction management
llkv-aggregateAggregate function evaluation
llkv-joinJoin operation execution
llkv-resultCommon error types (Error, Result)
llkv-test-utilsTesting utilities and tracing initialization
llkv-slt-testerSQL Logic Test harness

Sources: Cargo.toml61-80 Diagram 1 from high-level overview

Core Components

SqlEngine - Entry Point

The SqlEngine struct serves as the primary interface for executing SQL statements. It wraps a RuntimeEngine and handles SQL preprocessing, parsing, and statement routing.

Key methods:

  • SqlEngine::new() - Create engine with storage pager
  • SqlEngine::execute() - Execute one or more SQL statements
  • SqlEngine::sql() - Execute single SELECT and return batches

Sources: llkv-sql/src/sql_engine.rs272-314 llkv-sql/src/sql_engine.rs685-760

RuntimeEngine - Transaction Coordinator

The RuntimeEngine manages transaction sessions and coordinates between the execution layer and storage. Each engine maintains a RuntimeContext (shared catalog and storage) and a RuntimeSession (per-connection transaction state).

Sources: Diagram 4 from high-level overview, llkv-sql/src/sql_engine.rs55-60

Query Execution Pipeline

The execution pipeline transforms SQL text into Arrow RecordBatch results through multiple stages:

Plan Types:

  • SelectPlan - Query with projections, filters, joins, aggregates
  • InsertPlan - Insert rows (literal values or SELECT source)
  • UpdatePlan - Update rows with assignments and filter
  • DeletePlan - Delete rows matching filter
  • CreateTablePlan - Create table with column specifications

Sources: llkv-plan/src/plans.rs1-1230 Diagram 2 from high-level overview

Data Representation

LLKV uses Apache Arrow for columnar data throughout the system:

The llkv-column-map crate implements columnar storage that maps Arrow schemas to storage blobs via the Pager interface. All query results flow through RecordBatch structures, ensuring consistent data representation across subsystem boundaries.

Sources: Cargo.lock100-117 (Arrow dependencies), Diagram 3 from high-level overview

Transaction Model

LLKV implements Multi-Version Concurrency Control (MVCC) with snapshot isolation:

ComponentRole
TxnIdManagerGenerates monotonic transaction IDs and tracks watermarks
TransactionSnapshotCaptures txn_id + snapshot_id for visibility checks
MVCC TagsEach row has created_by and deleted_by transaction IDs
Row Visibility FilterCompares MVCC tags against snapshot to determine visibility

Transactions maintain two execution contexts:

  1. Base Context - Access to committed tables with MVCC filtering
  2. Staging Context - Isolated in-memory storage for new tables created in the transaction

Sources: Diagram 4 from high-level overview, Cargo.toml25 (llkv-transaction dependency)

Storage Architecture

The storage layer provides pluggable persistence through the Pager trait:

The simd-r-drive crate provides SIMD-optimized storage operations. EntryHandle represents persistent blob references that the ColumnStore uses to serialize and deserialize Arrow arrays.

Sources: Cargo.toml26-27 (simd-r-drive dependencies), llkv-sql/src/sql_engine.rs34-35

Testing Infrastructure

LLKV includes comprehensive testing at multiple levels:

Test TypeFrameworkCoverage
Unit Testscargo testPer-crate functionality
Integration Testscargo testCross-crate interactions
SQL Logic Testsllkv-slt-testerThousands of SQL test cases
BenchmarksCriterion + CodSpeedPerformance regression tracking

The llkv-slt-tester crate implements a SQL Logic Test runner that executes .test files containing SQL statements and expected results. The harness supports test preprocessing (loop expansion, conditional filtering) and integrates with CI/CD for continuous validation.

Sources: Cargo.toml20 (llkv-slt-tester), Diagram 5 from high-level overview

Key Features

SQL Capabilities:

  • DDL: CREATE TABLE, ALTER TABLE, DROP TABLE, CREATE INDEX, CREATE VIEW
  • DML: INSERT, UPDATE, DELETE, TRUNCATE, SELECT
  • Queries: Joins, subqueries, aggregates, GROUP BY, HAVING, ORDER BY, set operations
  • Transaction control: BEGIN, COMMIT, ROLLBACK

Performance Optimizations:

  • INSERT buffering: Batches multiple literal INSERT statements (up to 8,192 rows)
  • SIMD acceleration: Vectorized operations via simd-r-drive
  • Parallel execution: Rayon-based parallelism in table scans and aggregations
  • Streaming results: Memory-efficient batch processing
  • Roaring bitmaps: Efficient indexing

Concurrency:

  • Snapshot isolation preventing dirty reads
  • Optimistic concurrency control with conflict detection
  • Session-isolated transactions
  • Watermark-based visibility tracking

Sources: llkv-sql/src/sql_engine.rs186-191 (MAX_BUFFERED_INSERT_ROWS), Diagram 3 from high-level overview

Development Model

The workspace follows a modular design where each crate has a focused responsibility and minimal dependencies on peers. Shared types are defined in foundational crates:

  • llkv-result - Common Result<T> and Error types used by all crates
  • llkv-expr - Expression AST types shared between planning and execution
  • llkv-plan - Plan structures consumed by the runtime and executor

This layering enables:

  • Independent testing of each subsystem
  • Clear separation of concerns
  • Ability to swap storage backends via the Pager trait
  • Performance optimization at individual layers

Sources: Cargo.toml9-28 (workspace dependencies), Diagram 7 from high-level overview

Read Entire Article