The Index Is the Database

1 month ago 6

The hidden workhorse of your database

Press enter or click to view image in full size

There’s a well-known saying in database systems: “The WAL is the database.”

That’s partially true. The Write-Ahead Log (WAL) is critical for durability. Having worked on the WAL layer of databases for a decade I used to think this too. Because without it, a database can’t guarantee recovery after a crash. But durability is not the primary reason people use databases.

The real selling point of a database lies elsewhere: the index.

Without indexes, a database is just a heap — a raw pile of bytes, no different than a flat file. Indexes are what differentiate a simple file from a database.

What is an index?

An index points you to the location of information.

That’s the generic definition and the same applies to databases. The only difference from an index in a book is that the database indexes are hidden. You never query an index directly, the query specifies the table, and the database automatically makes use of the index that offers the most efficient query plan.

If you don’t need an index, then you don’t need a database for it!

There is an index for that

Press enter or click to view image in full size

Let’s look at the most common classes of indexes encounter today. An easy way to understand these is to look at the questions they help answer.

1. B-Trees: The Workhorse

The most common index in relational systems. B-Trees are balanced search trees where keys are sorted, making them perfect for:

What’s the phone number of a user? (WHERE id = 123)
How many people are in their 20s? (WHERE age BETWEEN 20 AND 30)
What are the last 5 orders? (ORDER BY timestamp)

In Postgres, MySQL, and most relational databases, your default CREATE INDEX is a B-Tree. Without it, even simple queries would degrade into full table scans.

Press enter or click to view image in full size

2. GIS Indexes: Making Maps Work

When working with geospatial data, B-Trees aren’t enough. Structures like R-Trees, Quadtrees, and Postgres’s GiST indexes allow spatial queries like:

Find all restaurants within 5 miles.
Which polygons overlap this region?

Without these specialized indexes, every geospatial query would require scanning entire datasets of coordinates and shapes.

Press enter or click to view image in full size

3. LSM Trees: Memory became cheap

Modern databases like RocksDB, LevelDB, Cassandra, and YugabyteDB rely on Log-Structured Merge Trees (LSMs). It takes advantage of the tremendous price reduction of main memory storage in recent years. They are very similar to B-Trees in capability.

Instead of updating rows in place, LSMs:

Buffer writes in memory (a memtable).
Flush them sequentially to disk as immutable files (SSTables).
Periodically compact those files into larger, sorted structures.

This design turns random writes into sequential ones, making it ideal for SSDs. Indexing happens across these sorted files, allowing reads to remain efficient while sustaining massive write throughput.

Press enter or click to view image in full size

4. Vector Indexes: Fueling the AI Era

The latest evolution is vector indexes, powering similarity search in embeddings. Structures like HNSW (Hierarchical Navigable Small World graphs) or IVF (Inverted File Indexes) make it possible to:

Get me images that contain a dog
What are the documents that talk about the NFL?

Without vector indexes, every similarity search would require brute-force scanning across millions of vectors — completely impractical in real-world AI systems.

Press enter or click to view image in full size

Even NoSQL Runs on Indexes

A decade ago, the NoSQL movement positioned itself as “simpler, faster, schema-less, and superior” to traditional databases. The sales pitch was that you didn’t need the complexity of relational indexes — just store JSON or key-value pairs and scale horizontally.

But reality caught up quickly. Even NoSQL databases discovered that without indexes, they couldn’t support real-world workloads.

MongoDB added B-Tree and geospatial indexes early on, and later secondary, text search, wildcard and vector indexes.
Cassandra uses LSM-based SSTables but still relies on secondary indexes for non-primary key queries.
DynamoDB markets itself as “index-free,” yet depends heavily on Global Secondary Indexes (GSIs) and Local Secondary Indexes (LSIs) to serve anything beyond simple key lookups.

The lesson? No matter how revolutionary the branding, a database needs indexes.

Indexes are expensive

Indexes require extra disk space.

Indexes slow down writes.

Reindexing is a pain!

Indexes are a copy of the data that’s already in the underlying table, so you are increasing the disk space, and memory usage of the system. It’s thus crucial to choose the right indexes and not have unused indexes. But that’s not an easy task either. The data is constantly changing, and so are the apps, so the queries that need to be optimized are fluid and ever evolving. Ai the struggle to keep the right indexes is constant.

Indexes are not free, so choose them wisely.

Indexes are hard to build!

That’s an understatement!

The trouble with an index is that it is automatically managed by the database. So, all the heavy lifting must happen within the database. Index consistency violation is data consistency violation.

The index that’s missing a row results in data loss.
The index that has multiple copies of the same row results in duplication of data.
The index that points to the wrong row results in data corruption.

And the index needs to be atomically updated with the base table. There is absolutely no room for error in managing indexes. They are unforgiving and some of the hardest things to build in a database.

The one to rule them all

Is there one index that can do everything? No.

Is there one database that can do everything? Yes.

It’s of course the database which has the most versatile set of indexes — PostgreSQL

In addition to the native indexes and PostGiS there are companies like Mooncake(DuckDB) and ParadeDB(BM25) with a primary focus on adding new index capabilities to Postgres.

Decades ago when Michael Stonebraker and Lawrence A. Rowe wrote the design for Postgres they chose index (access method) extensibility as a core design element of the project.

“Our second goal is to allow new data types, new operators and new access methods to be included in the DBMS. Moreover, it is crucial that they be implementable by non-experts which means easy-to-use interfaces should be preserved for any code that will be written by a user.”

That decision is one of the key reasons why Postgres is the most popular database today.

Closing Thought

The evolution of databases is inextricably linked to the evolution of the index. As we’ve moved from simple record-keeping to navigating vast geospatial terrains and now to searching the abstract dimensions of AI embeddings, indexes have been the critical innovation enabling each leap forward.
So, the next time you marvel at a query that executes in milliseconds across billions of rows, take a moment to appreciate the hidden workhorse that made it possible. It’s not just an add-on for performance; it’s the fundamental structure that defines a database’s power — the index is the database.

Read Entire Article

The Index Is the Database

What is an index?

There is an index for that

1. B-Trees: The Workhorse

2. GIS Indexes: Making Maps Work

3. LSM Trees: Memory became cheap

4. Vector Indexes: Fueling the AI Era

Even NoSQL Runs on Indexes

Indexes are expensive

Indexes are hard to build!

The one to rule them all

Closing Thought

Related

A Theory of Dumb: Why Are IQ Scores Suddenly Falling?

C23: A Slightly Better C

New rare earth crisis is brewing as yttrium shortages spread...