Optimise PostgreSQL queries: correlation and index-only scans


Luigi Tanzini

A couple of months ago, our team encountered a severe performance issue with one of our core services running on Postgres: service latency was spiking, database replica lag exceeded 40 seconds, and service operations were timing out.

Because of the high query times, we focused our investigation on our database servers.

In this blog post, we’ll share how we discovered the root cause of our problem and explain the key concepts that helped us understand our database’s behaviour. We’ll dive into some of PostgreSQL’s data structures, such as heap tables and B-Tree indexes, and examine how correlation plays a crucial role in query performance.

Finally, we’ll demonstrate how these insights led us to an elegant solution.

Unpacking our metrics

Our journey begins with an examination of our database metrics.

While the main instance was operating normally, we noticed that CPU utilisation on one of the read replicas was approaching 100%. The metrics revealed a very high DiskQueueDepth, and analysis of Postgres wait events showed a significant amount of time spent on IO:DataFileRead.
All vital signs pointed to massive disk access.

Additionally, IPC:BufferIO events were consuming a large portion of the timeline, indicating heavy buffer cache contention.

Briefly, what do these metrics tell us?

  • High DiskQueueDepth: This metric shows pending I/O requests waiting for storage device access. A high value indicates a backlog of requests.
  • High IO:DataFileRead Waits: This Postgres wait event indicates that processes (Postgres backends) are stalling while waiting for disk reads. These high waits typically result from buffer cache misses, forcing Postgres to fetch data directly from disk.
  • High IPC:BufferIO (LWLock:BufferIO in older versions) Waits: These Postgres events indicate contention for shared memory buffers; the buffers are locked during I/O operations, causing client connections to wait for cache access.
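A quick way to see whether these waits dominate a live instance is to group the active backends by their current wait event. A minimal sketch, assuming you can run queries against the replica’s pg_stat_activity view:

SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active'          -- idle sessions are not interesting here
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;
-- A result dominated by IO:DataFileRead and BufferIO rows points at heavy
-- disk reads and buffer contention, which is exactly what we were seeing.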

With the replica clearly under strain, we investigated the database logs and quickly identified a specific query pattern that consistently failed to complete within its 60-second timeout window:

SELECT id
FROM my_table
WHERE string_field IN ($1, $2, ..., $200) -- IN list with many values
AND timestamp_field IS NULL
AND id >= $201 -- Range scan on primary key
ORDER BY id ASC -- Order by primary key
LIMIT $202; -- Limit results

Although this query pattern was common, the timeouts occurred during specific “outlier” executions. These typically happened when background jobs, triggered by certain system events, needed to process a large set of related items selected by specific attributes (the string_field IN list) and still marked as valid by a NULL timestamp_field.

Making matters worse, the application automatically retried these failed queries, which increased the load on the already struggling replica, likely contributing to the severe system-wide impact.

Having identified both our primary suspect and smoking gun, we needed to understand how and why this was happening.

Let’s examine the structure of our “victim” table (NOTE: Table and column names have been anonymised for illustration):

postgres=> \d+ my_table
                                               Table "public.my_table"
      Column      |            Type             | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
------------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 id               | text                        |           | not null |         | extended |             |              |
 string_field     | text                        |           | not null |         | extended |             |              |
 data_field       | jsonb                       |           | not null |         | extended |             |              |
 some_other_field | jsonb                       |           |          |         | extended |             |              |
 timestamp_field  | timestamp without time zone |           |          |         | plain    |             |              |
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)
    "my_table_string_field_idx" btree (string_field)
    "my_table_string_field_timestamp_field_idx" btree (string_field, timestamp_field)
Access method: heap

There was nothing particularly complex about the table or query structure, but we needed to understand why this query was failing despite having indexes in place.

Was it due to a poor execution plan? To find out, we asked the Postgres optimiser to show us how it was executing the query:

EXPLAIN SELECT id
FROM my_table
WHERE string_field IN ($1, $2, ..., $200)
AND timestamp_field IS NULL
AND id >= $201
ORDER BY id ASC
LIMIT $202;

-- Original Plan Output:
----------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.06..119578.44 rows=201 width=8)
   ->  Index Scan using my_table_pkey on my_table  (cost=1.06..96759518.79 rows=162645 width=8)
         Index Cond: (id >= 'abc123'::text) -- Uses PK index for range scan and ordering
         Filter: ((timestamp_field IS NULL) AND (string_field = ANY ('{abc345,...}'::text[]))) -- Filters applied *after* index scan
(4 rows)

Surprisingly, the planner was using an index. That should be good news, right? Yet we still had timeouts.
The reality is, simply having an index doesn’t guarantee fast query execution.

To understand why, we need to explore Postgres internals and see how index scans actually work.

Let’s start by examining how Postgres organises our data.

Heap Tables

When we create a standard table in Postgres, rows are typically stored in an unordered structure called a heap table.
The heap table is divided into 8kB chunks, called pages, and record rows (tuples in Postgres parlance) fill these pages.
Each row has a unique physical address, called the ctid, which serves as the record identifier and specifies the tuple’s page and its location within that page.
Think of the ctid as a direct coordinate: given the ctid, Postgres knows exactly which page to look at and where the row sits within that page. The idea is much like indexing into an array: array[index] gives you direct access to an element without searching.
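Since ctid is exposed as a system column, you can look at these coordinates yourself. A minimal sketch against our (anonymised) table:

-- ctid is shown as (page, offset): the page number and the tuple's
-- position within that page.
SELECT ctid, id
FROM my_table
LIMIT 5;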

The diagram below shows a simplified view of a heap table: tuples reside within 8kB pages, and each one is uniquely identified by its ctid.

Heap pages are cached in memory in a structure called the buffer cache: accessing a row via its ctid is very fast if its page is already in memory.
If not, the page must first be read from disk into the buffer cache. This I/O operation is orders of magnitude slower than reading from memory.

While lots of engineering work has gone into making disk reads as efficient as possible, reading from contiguous disk locations is still much faster than accessing data in random disk blocks.
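If you want a rough sense of how often a table’s heap pages are served from the buffer cache rather than from disk, the cumulative counters in pg_statio_user_tables help. A sketch, assuming the statistics have not been reset recently:

SELECT relname,
       heap_blks_hit,    -- heap pages found in the buffer cache
       heap_blks_read,   -- heap pages that had to be read from disk
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 4) AS heap_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'my_table';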

Keep this in mind as we progress on our investigation.

Indexes

OK, so what are indexes? Indexes are database objects used to accelerate access to heap table data. They are implemented as dedicated data structures built over the record data, and their purpose is to map a key to the table rows that contain it.

B-Tree is Postgres’s default index type.
Within B-Trees, keys are stored in sorted order, enabling efficient lookups that complete in logarithmic time. Each key in the index is linked to a ctid, which serves as a pointer to the complete row record in the heap table.
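You can even see this key-to-ctid mapping directly if the pageinspect extension is available; a hedged sketch (block 1 is just an arbitrary non-metapage block of the index, and the output columns vary slightly across Postgres versions):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Each entry on a B-Tree page pairs an index key ('data') with the ctid of
-- the heap tuple it points to.
SELECT itemoffset, ctid, data
FROM bt_page_items('my_table_pkey', 1);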

The illustration below shows a (greatly simplified) B-Tree index. The darker cells illustrate the search path for finding the ctid when the requested key is 41.

Now that we understand how tables and indexes are represented, let’s break down how a typical index search operates. This type of search is called an index scan — the one seen in our original query plan.

First, Postgres uses the B-Tree index (my_table_pkey in our case) to efficiently locate entries matching the query's index conditions, such as id >= ? or id = ?.

For each matching index entry, it retrieves the corresponding ctid, the physical address of the row within the main heap table.

Next, Postgres uses this ctid to access the appropriate heap table page and retrieve the required row data.

This fetch must retrieve all columns needed for filtering or selection.

Let’s examine our query more closely:

SELECT id
FROM my_table
WHERE string_field IN ($1, $2, ..., $200) -- IN list with many values
AND timestamp_field IS NULL
AND id >= $201 -- Range scan on primary key
ORDER BY id ASC -- Order by primary key
LIMIT $202; -- Limit results

The query needs id, string_field, and timestamp_field to determine which records belong in the result set, even though it only returns id values.

Consequently, if the index contains only id values, Postgres must fetch string_field and timestamp_field from the heap table.

When the required heap page isn’t in Postgres’s memory buffers, this triggers a disk read.

The figure below depicts this process.

Only after fetching the full row data from the heap does Postgres apply any remaining filter conditions from the WHERE clause that couldn't be handled by the initial index condition alone.

In our original plan, these were the checks for timestamp_field IS NULL and string_field IN(...).

This entire sequence of index lookup, ctid retrieval, heap fetch, and subsequent filtering repeats until enough rows satisfying all conditions are found (up to the LIMIT), or until all relevant index entries have been processed.
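You can watch this loop’s heap traffic with EXPLAIN (ANALYZE, BUFFERS). Keep in mind that ANALYZE actually executes the query, so this is something to run with care on an already struggling replica. A sketch with illustrative literals standing in for the real parameters:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM my_table
WHERE string_field IN ('abc345')   -- illustrative literal in place of the long IN list
  AND timestamp_field IS NULL
  AND id >= 'abc123'               -- illustrative literal, as in the plan above
ORDER BY id ASC
LIMIT 200;
-- The "Buffers: shared hit=... read=..." lines show how many pages came from
-- the buffer cache (hit) versus from disk (read) at each plan node.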

Our timed-out query performed this type of scan. But why was the query slow? After all, this should be standard work for a database engine.
Well, it turns out this is only part of the puzzle. We need one more piece: index correlation.

Correlation is one of the statistics that Postgres collects during table analysis, which the query optimiser uses to select the best execution plan.

Here’s the correlation we found in our table:

postgres=> select attname, correlation from pg_stats where tablename = 'my_table' order by abs(correlation) desc;
     attname     | correlation
-----------------+--------------
 string_field    |   0.26841092
 timestamp_field |  -0.25695387
 id              |  0.009034876   -- Very low correlation for the PK!

Correlation ranges from -1 to +1: if a column’s values are physically stored in ascending order, its correlation approaches +1; if they are stored in descending order, it approaches -1.

Essentially, for a given column, this statistic tells us how well the physical order of rows in the heap table follows the logical sorted order of that column’s values.

With respect to B-Tree indexes, a correlation close to 0 (like our id column’s 0.009) means the physical order of rows in the heap is vastly different from the logical order of the keys in the index.
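To make the statistic less abstract, here is a small, self-contained experiment (the corr_demo table is ours, purely for illustration) showing how physical insert order drives correlation:

CREATE TABLE corr_demo AS
SELECT g AS ordered_id,                         -- values inserted in ascending order
       (random() * 1000000)::int AS random_id   -- values inserted in effectively random order
FROM generate_series(1, 100000) AS g;

ANALYZE corr_demo;                              -- collect statistics, including correlation

SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'corr_demo';
-- ordered_id comes out close to +1, random_id close to 0.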

Recalling the index scan process, this low correlation meant that fetching rows pointed to by logically sequential id index entries required jumping randomly all over the table heap to read the pages specified by the ctids.

This means retrieving thousands of rows before filtering could potentially trigger thousands of heap table accesses, which would cause a massive number of I/O operations if the pages weren’t already in the buffer cache.

The picture above illustrates high correlation: notice how sequential index keys point to rows on the same or nearby heap pages (like 11 and 14). This allows for efficient sequential access during heap fetches.

The picture below illustrates low correlation instead: scanning the index sequentially requires fetching pages in a physically non-sequential order.

We now had a theory that perfectly explained the high DiskQueueDepth and IO:DataFileRead waits, as required pages were frequently not found in Postgres’ shared buffers.

The resulting I/O saturation also led to the observed IPC:BufferIO waits, indicating intense contention for buffer access while disk operations were in progress.

Each shared buffer in the buffer cache has an I/O latch. Whenever a page has to be fetched from outside the shared buffer pool, this latch is held until the page has been read from disk into the buffer.

The IPC:BufferIO wait event precedes the IO:DataFileRead event, which occurs when data is being read from storage.

This high lock contention explained why CPU usage became saturated alongside I/O operations.

The key takeaway so far is that a standard index scan’s performance isn’t just about the index lookup; it’s heavily influenced by the cost of subsequent heap fetches, which is directly tied to index correlation.

In other words, low correlation can kill performance even in the face of an index scan.

We now understood the problem: the back-and-forth between the index and the randomly accessed heap pages was too slow.

Could we avoid the heap altogether?
The answer is yes, using an index-only scan.

An index-only scan retrieves all required data directly from the index pages, completely skipping the heap fetch step. This eliminates the need for heap pages to be read into the buffer cache, which in turn drastically reduces the potential for disk reads for this query path (though the index’s own pages might still need reading if not cached).

For this to be possible, a couple of conditions must hold.

First of all, all columns needed by the query (SELECT, WHERE, ORDER BY, etc.) must be available within the index itself.

Secondly, the rows referenced by the index must be known to be “visible” according to Postgres’ visibility map. This map helps avoid heap fetches just to check visibility — a deep dive into the visibility map is beyond this post, but know it's crucial for index-only scan efficiency.
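As a rough health check, you can compare how many of the table’s pages are marked all-visible against its total page count; both counters live in pg_class and are maintained by VACUUM and ANALYZE. A sketch:

SELECT relname,
       relpages,        -- heap pages, as of the last VACUUM/ANALYZE
       relallvisible    -- pages marked all-visible in the visibility map
FROM pg_class
WHERE relname = 'my_table';
-- The closer relallvisible is to relpages, the fewer heap visits an
-- index-only scan needs just to check row visibility.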

Our plan was now clear: create an index that enabled the planner to use an index-only scan.

Executing the plan involved creating a new index designed specifically for this query:

CREATE INDEX CONCURRENTLY my_table_covering_idx
ON my_table (string_field, timestamp_field, id);
-- Note how the index now includes all columns needed for WHERE and SELECT
-- 'id' is included so the index also covers the SELECT list and the id range condition;
-- the ORDER BY may still require an explicit sort, as the plan below shows

We used CREATE INDEX CONCURRENTLY to build it without downtime. This new index included string_field, timestamp_field, and id, covering all columns referenced by the query.

With the new index in place, we manually triggered a VACUUM operation to rebuild the visibility map and then tested the planner:

-- New Plan Output:
----------------------------------------------------------------------------------------------------------------
 Limit  (cost=14776.76..14777.26 rows=201 width=8)
   ->  Sort  (cost=14776.76..15177.33 rows=160227 width=8) -- Sort still present!
         Sort Key: id
         ->  Index Only Scan using my_table_covering_idx on my_table  (cost=0.56..7846.10 rows=160227 width=8) -- Index-Only scan is used!
               Index Cond: ((string_field = ANY ('{abc345,...}'::text[])) AND (timestamp_field IS NULL) AND (id >= 'abc345'::text))
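To confirm that the heap really is skipped at run time, EXPLAIN (ANALYZE) reports a “Heap Fetches” counter on Index Only Scan nodes. A sketch, again with illustrative literals in place of the real parameters:

EXPLAIN (ANALYZE)
SELECT id
FROM my_table
WHERE string_field IN ('abc345')
  AND timestamp_field IS NULL
  AND id >= 'abc123'
ORDER BY id ASC
LIMIT 200;
-- "Heap Fetches: 0" (or a very small number) under the Index Only Scan node
-- means the visibility map let Postgres avoid the heap almost entirely.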

We could already see large improvements by just comparing the total estimated cost of the new plan against the old one:

  • New plan total cost: 14777.26
  • Old plan total cost: 119578.44

But the massive win was in terms of actual query times, which dropped from consistent 60+ second timeouts to a worst case of ~950 milliseconds.

In many cases, execution time was as low as 9 milliseconds — an improvement of over three orders of magnitude compared to the timeouts!

Query timeouts vanished. Execution times dropped consistently to the millisecond range. I/O load normalised, replica lag disappeared, and system stability was restored.
All of this was achieved with zero downtime and a single, quick optimisation.

As part of this optimisation effort, we also identified and removed several unused indexes on my_table, which helped further reduce unnecessary buffer cache occupation and general write overhead.
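Spotting those candidates is straightforward with the cumulative index statistics; a sketch (idx_scan only accumulates since the last statistics reset, so interpret a zero with care):

SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'my_table'
  AND idx_scan = 0      -- never used since the statistics were last reset
ORDER BY indexrelname;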

Of course, there are no free lunches in software: this larger covering index does come with trade-offs, namely increased storage space and a slight overhead on write operations involving the indexed columns.
But both were acceptable costs for this critical performance gain.
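For a sense of scale, the storage cost of the new index is easy to quantify; a sketch using the index and table names from above:

SELECT pg_size_pretty(pg_relation_size('my_table_covering_idx')) AS covering_index_size,
       pg_size_pretty(pg_table_size('my_table'))                 AS table_size;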

This experience highlights an important fact: effective Postgres optimisation isn’t just about having indexes — it’s about understanding how those indexes interact with your data’s physical storage patterns.

Designing indexes thoughtfully can turn performance issues into major quick wins!

By recognising the critical relationship between index correlation and query performance, we transformed a system-crippling bottleneck into a millisecond-level operation without changing a single line of application code.
