One year ago, we published the post “When PostgreSQL is not enough: PostgreSQL vs distributed system performance comparison”. PostgreSQL showed exceptional performance in scenarios where service availability and data durability were not a priority. We called this configuration “fault-intolerant but extremely performant PostgreSQL.” However, performance drops significantly once write-ahead logging (WAL) and replication are enabled. Replication can become a bottleneck, limiting the system’s ability to scale vertically. According to the TPC-C benchmark results, the “fault-tolerant PostgreSQL” configuration is only 5% faster than YDB in terms of throughput (tpmC) and suffers from much higher latency.
Naturally, we expected most of the discussion to focus on PostgreSQL tuning, benchmark design, and our comparison methodology. Instead, the most heated debate centered on our claim that PostgreSQL scales only vertically and that solutions like Citus don’t offer full ACID guarantees for multi-shard (i.e., distributed) transactions.
This is due to certain misconceptions and myths that surround sharding, two-phase commit (2PC), and distributed transactions. For example, it may be counterintuitive that 2PC guarantees transaction atomicity but not isolation. That’s why we’ve written this follow-up to help readers dig deeper into these topics and make informed decisions when PostgreSQL is not enough.
Most PostgreSQL sharding solutions are based on a straightforward idea: instead of a single PostgreSQL server, multiple servers are deployed, and each server is responsible for a range of table keys. Key-to-range mapping is handled by a special coordination layer, which becomes a new entry point for user requests. This coordination layer can be server-side (Citus-like solutions) or implemented as part of a client application. It’s worth noting that the PostgreSQL instances in such a setup do not know about each other and do not communicate in any way.
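To make this concrete, here is a minimal sketch of such a coordination layer implemented on the client side. The slot count, shard ranges, and connection strings are illustrative assumptions, not taken from any real deployment; Citus uses a broadly similar hash-to-range scheme internally.

```python
import hashlib

# Hypothetical mapping of hash-slot ranges to shard connection strings.
SHARDS = [
    (range(0, 1366),    "postgres://pg-shard-0/app"),
    (range(1366, 2731), "postgres://pg-shard-1/app"),
    (range(2731, 4096), "postgres://pg-shard-2/app"),
]

def route(key: str) -> str:
    """Map a table key to the connection string of the shard that owns it."""
    # Hash the key into a fixed slot space; each shard owns a slot range.
    slot = int(hashlib.md5(key.encode()).hexdigest(), 16) % 4096
    for slot_range, dsn in SHARDS:
        if slot in slot_range:
            return dsn
    raise KeyError(key)

# Every statement is sent to exactly one shard; the shards themselves
# never communicate with each other.
print(route("alice-account-x"))
```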
If user transactions affect only one shard, there is no difference between such a setup and a standalone PostgreSQL server. However, once a transaction spans multiple shards, a sharded database can no longer provide the full transaction guarantees of a monolith and has to settle for weaker ones. In this regard, sharding fundamentally differs from distributed systems, where transaction guarantees do not depend on how many shards a transaction touches. In this post, we will use Citus as an example, but the principles also apply to other sharding approaches. Let’s dig in, starting with a bit of theory.
When talking about database transactions, we assume that they have ACID properties:
- Atomicity: all parts of the transaction are either applied (committed) or completely discarded (aborted).
- Consistency: this one is more of a historical artifact. It was coined mainly to complete the acronym. In practice, consistency is often application-specific rather than database-enforced. A transaction is considered consistent if it moves the system from one valid state to another, as defined by constraints, triggers, and business logic. The concept of a “valid” state usually comes from the application layer: the database simply provides tools for validation, but doesn’t guarantee correctness.
- Isolation: transactions executed simultaneously should not interfere with each other. Changes made by one transaction are available to other transactions only after it successfully completes (commits). Ideally, concurrent transactions should behave as if they were executed sequentially. In reality, though, most databases offer weaker isolation levels, and it’s up to developers to choose the appropriate one based on their use case.
- Durability: once a transaction has been committed, its changes are never lost, even if the system crashes afterward.
We’ve covered isolation levels in more detail in a separate post: “Do we fear the serializable isolation level more than we fear subtle bugs?” In this post, we want to focus on atomicity. Martin Kleppmann, author of the award-winning Designing Data-Intensive Applications book, suggests that “abortability” is a better term. It helps distinguish between atomic commits and atomic visibility. “Abortability” also captures the essence of atomicity more accurately: either everything happens or nothing does.
It’s worth noting that the “atomicity” property applies to each individual transaction, while “isolation” governs the interactions between different transactions. Two-phase commit, widely used in distributed systems, is an atomic commit protocol: after the commit is finished, all participants either apply changes or discard them. However, 2PC provides no guarantees about atomic visibility. In other words, even though all nodes agree on the outcome, there’s no assurance that the changes become visible to all participants at the same time or in a way that preserves the illusion of a single, serializable transaction.
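A minimal sketch of the protocol may help, with in-process objects standing in for database nodes (a real coordinator would issue PREPARE TRANSACTION and COMMIT PREPARED over the network). Note the gap in the second phase: atomic commit is guaranteed, atomic visibility is not.

```python
class Participant:
    """Stand-in for one database shard taking part in 2PC."""

    def __init__(self, name: str):
        self.name = name
        self.visible = False  # are the transaction's changes readable here?

    def prepare(self) -> bool:
        # Phase 1: durably stage the changes and vote on the outcome.
        return True

    def commit(self) -> None:
        # Phase 2: make the staged changes locally visible.
        self.visible = True

    def abort(self) -> None:
        pass  # discard the staged changes

def two_phase_commit(participants) -> bool:
    # Phase 1: the transaction proceeds only if every participant votes yes.
    if not all(p.prepare() for p in participants):
        for p in participants:
            p.abort()
        return False
    # Phase 2: commit messages go out one by one. Every participant will
    # eventually commit (atomicity), but between the first and the last
    # commit a reader can observe one shard committed and another not yet
    # committed: there is no atomic visibility.
    for p in participants:
        p.commit()
    return True

print(two_phase_commit([Participant("shard-x"), Participant("shard-y")]))
```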
We want to illustrate the implications of atomic visibility — or rather, the lack of it — with a simple example. Consider a database that stores information about two accounts belonging to user Alice: account X and account Y. The data is sharded by account number, so that one server handles account X and another handles account Y. Initially, both accounts have a balance of 100 dollars.
Alice initiates a transfer of 50 dollars from account X to account Y. The transaction is coordinated using 2PC. We also assume that users can only read data that has been committed. Now, during the transaction, Alice’s husband Bob checks their total balance. But instead of seeing the expected $200, he sees only $150. Despite their agreement to save for a new iPhone, something seems off. Here is an illustration of how this happens:
Let’s take a closer look at the diagram above. In this case, the first shard has already committed its part of the transaction, so when Bob performs his read, he sees the updated balance of $50 in account X. Meanwhile, the second shard hasn’t yet received the commit request from the 2PC coordinator, so Bob still sees the original $100 balance in account Y. As a result, the total balance appears to be $150, even though logically it should be $200 both before and after the transfer. In practice, there are three possible outcomes: Bob might observe a total balance of $150, $200, or even $250, depending on the order in which each shard receives the commit request from the 2PC coordinator relative to Bob’s read requests.
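The possible outcomes are easy to enumerate with a toy model of the diagram: Bob’s read of each shard lands either before or after that shard’s local commit, and the four combinations yield exactly the three totals.

```python
# Balances before and after Alice's transfer of $50 from X to Y.
before = {"X": 100, "Y": 100}
after  = {"X": 50,  "Y": 150}

totals = set()
for x_committed in (False, True):      # did Bob read X after its commit?
    for y_committed in (False, True):  # did Bob read Y after its commit?
        x = after["X"] if x_committed else before["X"]
        y = after["Y"] if y_committed else before["Y"]
        totals.add(x + y)

print(sorted(totals))  # [150, 200, 250]
```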
This behavior is known as eventual consistency: given enough time, the system will converge to the correct value — $200 — once all shards have applied the committed changes. However, because this setup lacks a distributed snapshot mechanism, it cannot provide anything stronger than “read committed” isolation. There’s no global view of the transaction state at a single point in time — which means clients may observe intermediate states, even when all individual operations are technically ACID-compliant.
So, 2PC ensures atomic commit, but as soon as we need to reason about how two transactions interact, we enter the domain of isolation levels, where 2PC offers no guarantees.
Citus is a very popular PostgreSQL sharding solution implemented as a database extension. It’s easy to install and straightforward to use. This ease of use is likely the reason why many developers overlook the fact that Citus effectively transforms PostgreSQL into a different kind of database, with different behavior and guarantees. Everything works as expected only when transactions affect a single shard, or when multi-shard consistency is not required. Let’s take a closer look at why that’s the case.
It’s important to emphasize that we haven’t discovered anything new here — and we’re certainly not the first to write about it. We strongly recommend the following two blog posts by Franck Pachot: “Citus is not ACID but Eventually Consistent” and “How ACID is Citus? (compared to YugabyteDB)”, as well as “Sharded Does Not Imply Distributed” by Denis Magda. We’re not aiming to repeat what’s already been said. Instead, our goal is to approach the topic from a different angle, highlighting some often-overlooked implications of multi-shard consistency and isolation.
In the paper “Citus: Distributed PostgreSQL for Data-Intensive Applications”, the authors describe multi-node transaction behavior as follows:
3.7.4 Multi-node transaction trade-offs. Multi-node transactions in Citus provide atomicity, consistency, and durability guarantees, but do not provide distributed snapshot isolation guarantees. A concurrent multi-node query could obtain a local MVCC snapshot before commit on one node, and after commit on another. Addressing this would require changes to PostgreSQL to make the snapshot manager extensible.
This behavior is exactly what we saw in the earlier example, where Bob observed an inconsistent total balance across Alice’s accounts, simply because they were stored on separate shards. What’s surprising is that the Citus documentation barely addresses this issue. For example, in the “Table Management / Limitations” section, the only mention is a single line: “No support for serializable isolation level.” Nothing is said about how other isolation levels behave in multi-shard transactions.
The situation became even more ambiguous after version 11.2, which introduced the ability to specify transaction isolation levels. But the only place where you’ll find what isolation level Citus actually uses by default is in the release notes:
“By default, when multiple nodes are involved in a transaction, Citus always sets the remote transaction’s isolation level with BEGIN TRANSACTION ISOLATION READ COMMITTED.” Even this comes with an important disclaimer: “Note that this does not mean that Citus supports comprehensive repeatable read / serializable semantics for all workloads.”
In other words, Citus defaults to “Read Committed” for multi-shard transactions but does not provide any kind of distributed snapshot, and, in our humble opinion, this isn’t well documented outside of the release notes.
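For illustration, here is how this looks from an application, assuming a hypothetical connection string and an `accounts` table. Per the release notes quoted above, since version 11.2 the isolation level set on the coordinator is propagated to the workers, but each worker still takes its snapshot independently.

```python
import psycopg2

conn = psycopg2.connect("postgres://coordinator/app")  # placeholder DSN
# Ask for REPEATABLE READ; Citus starts worker transactions at the same
# level, but their MVCC snapshots are taken independently on each node,
# so this is still not a distributed snapshot.
conn.set_session(isolation_level="REPEATABLE READ")
with conn.cursor() as cur:
    cur.execute("SELECT sum(balance) FROM accounts")  # multi-shard read
    print(cur.fetchone()[0])
conn.commit()
```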
It’s worth noting that Citus does support consistent backup, implemented through a mechanism described as follows:
Citus supports periodically creating a consistent restore point, which is a record in the WAL of each node. The restore point is created while blocking writes to the commit records table(s) on the coordinator(s), which prevents in-flight 2PC commits while creating the restore point.
According to this description, creating backups might substantially affect performance (we did not test that, though).
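For reference, here is a hedged sketch of how such a restore point could be created from Python; `citus_create_restore_point()` is the Citus function the description refers to, while the connection string and the restore point name are placeholders.

```python
import psycopg2

conn = psycopg2.connect("postgres://coordinator/app")  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    # Writes a named restore point into the WAL of every node, blocking
    # 2PC commit records on the coordinator while it does so.
    cur.execute("SELECT citus_create_restore_point(%s)", ("nightly-backup",))
    print("coordinator WAL position:", cur.fetchone()[0])
```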
Why is Citus implemented this way? Let’s cite the original architecture paper:
Existing distributed snapshot isolation techniques have a significant performance cost due to the need for additional network round trips or waiting for the clock, which increases response times and lowers achievable throughput. In the context of the synchronous PostgreSQL protocol, throughput is ultimately capped by #connections / response time. Since making a very large number of database connections is often impractical from the application perspective, low response time is the only way to achieve high throughput. Hence, we would likely make distributed snapshot isolation optional if we implement it in the future.
Citus is not the only one facing such a hard choice. The following is an excerpt from “The Seattle Report on Database Research”, dated 2022:
There is an ongoing debate between two schools of thought: (a) Distributed transactions are hard to process at scale with high throughput and availability and low latency without giving up some traditional transactional guarantees. Therefore, consistency and isolation guarantees are reduced at the expense of increased developer complexity. (b) The complexity of implementing a bug-free application is extremely high unless the system guarantees strong consistency and isolation. Therefore, the system should offer the best throughput, availability, and low-latency service it can, without sacrificing correctness guarantees. This debate will likely not be fully resolved anytime soon, and industry will offer systems consistent with each school of thought. However, it is critical that application bugs and limitations in practice that result from weaker system guarantees be better identified and quantified, and tools be built to help application developers using both types of system achieve their correctness and performance goals.
We have already written about the frequency of critical bugs caused by weaker isolation levels. Now it’s time to figure out when it’s reasonable to trade consistency for performance and make developers’ lives harder in the process.
The cost of a transaction can be measured in the number of network round trips (RTTs) between database hosts and the number of input/output operations. Modern NVMe disks are fast enough that we can ignore I/O latency. Under this model, PostgreSQL without synchronous replication is extremely fast, with a cost of 0 RTT. Enabling synchronous replication increases the cost to 1 RTT.
Citus adds a two-phase commit, which costs an additional 2 RTTs, bringing the total to 3 RTTs per transaction. A consistent global snapshot, as offered by fully distributed databases, is not free either: for example, a YDB distributed transaction costs approximately 4.5 RTTs, plus a small additional overhead for planning and batching.
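A quick back-of-the-envelope model makes these numbers easier to compare. The per-transaction RTT costs are the figures quoted above; the sample RTT values are our own assumptions.

```python
# RTT cost per transaction, taken from the figures above.
RTT_COST = {
    "PostgreSQL, no sync replication": 0.0,
    "PostgreSQL + sync replication":   1.0,
    "Citus (sync replication + 2PC)":  3.0,
    "distributed DB (e.g. YDB)":       4.5,
}

def latency_ms(rtt_ms: float) -> dict:
    """Network-only transaction latency for a given round-trip time."""
    return {name: cost * rtt_ms for name, cost in RTT_COST.items()}

for rtt in (0.5, 2.0, 7.0):  # plausible single-DC and multi-AZ RTTs
    print(f"RTT = {rtt} ms:", latency_ms(rtt))
```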
With these numbers, Citus is theoretically slower than standalone PostgreSQL, and distributed databases appear even worse than Citus. But in practice, everything depends on the actual RTT values, which are negligibly small within a single data center. Even multi-availability-zone setups often involve data centers located close to one another, with RTTs ranging from hundreds of microseconds to a few milliseconds, depending on distance.
The following diagram demonstrates how transaction time varies with RTT:
With RTT below 7 ms, the difference between Citus and distributed databases is only 10 ms, and the total transaction time remains under 50 ms, which is acceptable for most practical applications. Do we really want to sacrifice the correctness guarantees (i.e., consistency) offered by the database for such a marginal gain?
PostgreSQL is, without a doubt, an excellent and highly efficient database. But under heavy load, it eventually reaches a point where its performance is no longer sufficient, forcing developers to choose between sharded PostgreSQL and a distributed database.
It’s crucial for developers to understand that Citus-like solutions are generally not fully ACID-compliant and do not offer the same guarantees as vanilla PostgreSQL. Of course, not all applications require distributed transactions or strict consistency guarantees, but we sincerely hope that banks are not among them. It’s also worth noting that modern distributed databases offer full ACID guarantees for all transactions and are far more efficient than they were once thought to be. Consider using them if PostgreSQL is no longer enough for your needs.
This post was inspired by the publications and talks of Franck Pachot and Denis Magda. We’d also like to thank PostgreSQL expert Evgeny Efimkin and PostgreSQL contributor Andrey Borodin for their help with this publication.