The Write Stuff: Concurrent Write Transactions in SQLite


Introduction

Versatile as it is, SQLite still suffers from one major drawback: write concurrency. In the Write-Ahead Log (WAL) journaling mode, SQLite supports an unlimited number of readers but only a single writer at any given moment, so all write operations are effectively serialized. And while SQLite is very fast at performing writes, it stalls on the fsync operation required to ensure the durability of the data after each transaction (the D in ACID).

Engines with a server component that can coordinate concurrent writes are able to group writes together and spread the cost of fsync operations across multiple transactions while still maintaining consistency.

In this article we will explore the concurrency modes available for SQLite and touch on a new one I am working on at the moment, built, for now, specifically for Ruby applications, but portable to other languages as well. This new model tries to overcome the current limitations of SQLite’s concurrency and transform it into something suitable for write-heavy, highly concurrent applications.

Current SQLite concurrency overview

Depending on its configuration, and even the branch you are using from the SQLite source tree, you can get different concurrency levels with SQLite.

The DELETE/TRUNCATE/PERSIST journaling modes

In any of these modes, SQLite requires an exclusive lock before committing any write transaction; no other writers or readers can access the database during the commit operation (just the commit, not the whole write transaction). This is the default SQLite configuration (the DELETE journaling mode to be specific, the others being variants of it) and it offers the least possible concurrency.

The Write Ahead Log (WAL) journaling mode

In this mode, writes go to a separate log file and are later merged into the main database file. It allows an unlimited number of readers to access the database even while a writer (still only a single writer at a time) is committing its transaction. This is a big step up from the default journaling mode and results in much greater performance. Still, each writer gets the database all to itself and blocks all other writers until it finishes.
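To see the reader/writer behavior WAL enables, here is a small sketch using Python’s built-in sqlite3 module (Python is used purely for illustration; the prototype discussed later targets Ruby):

```python
import os
import sqlite3
import tempfile

# Create a database file and switch it to WAL mode.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, isolation_level=None)  # autocommit; we issue BEGIN ourselves
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]

writer.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Open a write transaction and leave it uncommitted...
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO users (name) VALUES ('alice')")

# ...while a second connection reads concurrently. In WAL mode the reader
# is not blocked; it simply sees the last committed snapshot.
reader = sqlite3.connect(path)
before_commit = reader.execute("SELECT count(*) FROM users").fetchone()[0]

writer.execute("COMMIT")
after_commit = reader.execute("SELECT count(*) FROM users").fetchone()[0]

print(mode, before_commit, after_commit)  # wal 0 1
```

In the rollback-journal modes that same read would have to wait for the writer’s commit to release its lock.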

The BEGIN CONCURRENT mode

This is not a canonical mode but is available in a branch of the SQLite source tree, and it offers a new model of concurrency. To use it, you start your transactions with BEGIN CONCURRENT. That does two things. First, it starts the transaction without acquiring a write lock, and it never attempts to acquire one until you commit. Second, it tracks every database page you read/write during your transaction and records its version number. Once the transaction is concluded and you attempt to commit, a write lock is obtained, then the page versions your transaction holds are compared to the current state of the database. If no differences are found, the commit proceeds. If there is any mismatch, the transaction fails with an SQLITE_BUSY_SNAPSHOT error. This locking model is called optimistic locking, as opposed to pessimistic locking, which locks the resources it needs before using them.

Pros

  • Transactions are cheap to start and can truly proceed in parallel
  • Tracking page versions doesn’t add much overhead
  • Locks are now taken only at the commit phase, allowing for much faster writes than WAL mode

Cons

  • Very coarse grained, causing many transactions to conflict when in fact they don’t
  • Unfriendly to autoincrement or even rowid-based primary keys (they will conflict often)
  • Unfriendly to indexes on data that increase/decrease with time (e.g. dates and timestamps)
  • Requires handling of SQLITE_BUSY_SNAPSHOT errors in the application code
  • Requires compiling a specific SQLite branch and not the main/official one
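Because commits under BEGIN CONCURRENT can fail with a snapshot conflict, application code typically wraps each transaction in a retry loop. A minimal, library-agnostic sketch in Python, where ConflictError and flaky_txn are placeholders standing in for the driver’s actual error class and your transaction body:

```python
import random
import time

class ConflictError(Exception):
    """Placeholder for the snapshot-conflict error surfaced by the driver."""

def with_retries(txn, attempts=5, base_delay=0.001):
    """Run txn (a callable wrapping one BEGIN CONCURRENT ... COMMIT cycle),
    retrying with jittered exponential backoff on commit-time conflicts."""
    for attempt in range(attempts):
        try:
            return txn()
        except ConflictError:
            if attempt == attempts - 1:
                raise
            # Back off before retrying so concurrent writers spread out.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Simulated transaction body that conflicts twice, then commits.
calls = {"n": 0}
def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConflictError
    return "committed"

result = with_retries(flaky_txn)
print(result, calls["n"])  # committed 3
```

The backoff matters: retrying immediately tends to make the same transactions collide again.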

HCTree

HCTree is a new high-concurrency btree backend for SQLite. It is still a work in progress but is showing a lot of promise already. Much like BEGIN CONCURRENT, this is an optimistic locking implementation. It differs in that it tracks the records visited/modified instead of tracking page versions. This greatly reduces the potential for conflicts and improves transaction throughput.

A (slightly) different approach

I have been faced with some scalability issues when working on write-heavy Ruby applications that use SQLite. SQLite is generally fast, but when you have multiple processes each trying to write to the database you will see contention on resources, and write performance will actually go down rather than up.

Another issue that was bothering me was that in order to get good write performance from SQLite we have to set the synchronous level to 1 (NORMAL). That means writes to the WAL are not followed by an fsync call, lowering the durability of the database and risking loss of data in case of a machine restart or power loss.
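For reference, this tradeoff maps to PRAGMA synchronous: level 2 (FULL) fsyncs on every commit, while level 1 (NORMAL) defers it to checkpoints. A quick check with Python’s built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NORMAL (1): fsync only at WAL checkpoints -- fast, less durable.
conn.execute("PRAGMA synchronous = NORMAL")
normal = conn.execute("PRAGMA synchronous").fetchone()[0]

# FULL (2): fsync on every commit -- durable, historically much slower.
conn.execute("PRAGMA synchronous = FULL")
full = conn.execute("PRAGMA synchronous").fetchone()[0]

print(normal, full)  # 1 2
```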

Thus I started building a solution with the following characteristics:

Optimistic locking

Transactions proceed freely without acquiring locks; only at commit time are they checked for potential conflicts. This means that a HUGE number of transactions can be in flight without lock contention.

Fine grained conflict detection

Very few cases can potentially cause transaction conflicts. Here we list the different types of operations and whether or not they can potentially cause conflicts:

Read query without an explicit transaction

Cannot conflict

Example:

SELECT * FROM users WHERE id = 1;

Such a query will never cause another transaction to conflict, and it will always have a consistent view of the database throughout its lifetime, thus it cannot generate conflicts itself.

Pure read transaction

Cannot conflict

Example:

BEGIN;
SELECT * FROM users WHERE id = 1;
SELECT * FROM orders WHERE user_id = 1;
COMMIT;

Like the singular read operation above, this transaction is operating on a consistent snapshot of the database. No other transaction can influence what it sees and it doesn’t affect any other transaction.

Write query without an explicit transaction

Cannot conflict, but can cause other transactions to conflict.

Example:

UPDATE users SET approved_at = datetime('now') WHERE id = 1 RETURNING approved_at;

Such a query, even though it returns data, cannot conflict with other running transactions. This is because all of its operations happen atomically, without any data being returned to the caller DURING the execution of the query. The state change resulting from this operation can cause other concurrent transactions to conflict, but this particular query will always conclude successfully.

Pure write transaction

Cannot conflict (if no data is returned to the caller), but can cause other transactions to conflict

Example:

BEGIN;
UPDATE users SET approved_at = datetime('now') WHERE id = 1;
UPDATE orders SET approved = TRUE WHERE user_id = 1;
COMMIT;

In this case, much like the one above, the transaction does not leak any state from the database to the outside world. Thus, the transaction concludes in an atomic manner, and while it changes the state of the database and can cause other transactions to conflict, it will always conclude successfully.

Mixed read/write transaction

Can conflict

Since these transactions return data to the caller WHILE they are changing the state of the database, there is a potential for them to end up in a conflicting state that renders the data returned to the caller outdated. Thus, at commit, this will be detected and the transaction will roll back with an error. This conflict detection is done at a very fine granularity though, helping to reduce the number of potential conflicts.

Example 1: Conflicting amount in bank account

-- tx A
BEGIN;
UPDATE accounts SET value = value - 100 WHERE id = 123456 RETURNING value;
END;

-- tx B
UPDATE accounts SET value = value - 10 WHERE id = 123456 RETURNING value;

In this example, transaction B will always succeed, as it is an implicit transaction; it does return values, but that happens AFTER its conclusion. On the other hand, transaction A will fail if transaction B runs after it started and before it committed, since the value A reported would be $10 higher than reality.

Example 2: Conflicting order count

-- tx A
BEGIN;
UPDATE orders SET approved = TRUE WHERE amount > 100;
SELECT count(*) FROM orders WHERE approved = TRUE;
END;

-- tx B
BEGIN;
INSERT INTO orders (amount, approved) VALUES (150.0, TRUE);
SELECT count(*) FROM orders WHERE approved = TRUE;
END;

Transactions A & B can each cause the other to conflict, as each of them changes the state of the same set of data and makes it outdated for the other transaction. Thus, if they run concurrently, they are guaranteed to conflict, and one of them will fail, depending on the order of operations.

Example 3: Non conflicting updates to the same records

-- tx A
BEGIN;
UPDATE users SET priority = 'LOW', status = 'approved' WHERE id = 1 RETURNING priority, status;
END;

-- tx B
BEGIN;
UPDATE users SET priority = 'high', status = 'pending' WHERE id = 1 RETURNING priority, status;
END;

Unintuitively, these two transactions will never conflict, even though they operate on the same record, and on the same columns even. They only return data that they change themselves, and the data they look at (the WHERE condition) is static, thus cannot be changed by other transactions in the system.

Example 4: Non conflicting writes

-- tx A
BEGIN;
UPDATE orders SET priority = 'high' WHERE amount > 100.0;
SELECT count(*) AS high_priority_count FROM orders WHERE priority = 'high';
END;

-- tx B
BEGIN;
INSERT INTO orders (amount, priority) VALUES (150.0, 'high');
UPDATE orders SET priority = 'low' WHERE id IN (
  SELECT id FROM orders WHERE priority = 'high' ORDER BY amount LIMIT 1
);
END;

These two transactions will not conflict. Even though the second one alters the state of the database (entering a new high priority order and demoting another one to low priority), the net effect on the high priority order count is zero. The number returned by the select statement in transaction A remains the same whether transaction B ran before, after, or even during the operation of transaction A.

In summary, conflicts depend only on the data returned from your transaction to the caller. If no data is returned, then there are no conflicts. If data is returned, and it is a change done inside the transaction, not dependent on a previous state, then there are no conflicts. If it depends on a previous state, then conflicts can happen only if that state is changed by a different transaction in a conflicting manner.
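The commit-time validation described above can be sketched as a toy versioned store: a transaction records the version of any pre-existing state it returns to the caller, and the commit fails if any of those versions changed in the meantime. This is an illustrative simulation, not the actual implementation:

```python
class Store:
    """Toy versioned key-value store to illustrate commit-time validation."""
    def __init__(self):
        self.data = {}      # key -> value
        self.versions = {}  # key -> version counter

    def write(self, key, value):
        self.data[key] = value
        self.versions[key] = self.versions.get(key, 0) + 1

class Tx:
    def __init__(self, store):
        self.store = store
        self.read_set = {}  # key -> version observed when returned to caller
        self.pending = {}

    def read(self, key):
        # Returning pre-existing state to the caller records a dependency.
        self.read_set[key] = self.store.versions.get(key, 0)
        return self.store.data.get(key)

    def write(self, key, value):
        self.pending[key] = value

    def commit(self):
        # Validate: every version we exposed must still be current.
        for key, seen in self.read_set.items():
            if self.store.versions.get(key, 0) != seen:
                raise RuntimeError("conflict")
        for key, value in self.pending.items():
            self.store.write(key, value)

store = Store()
store.write("balance", 100)

a = Tx(store)
a.write("balance", a.read("balance") - 100)  # returns prior state to caller

b = Tx(store)
b.write("balance", b.read("balance") - 10)
b.commit()            # B commits first

try:
    a.commit()        # A's returned balance is now stale
    outcome = "committed"
except RuntimeError:
    outcome = "conflict"
print(outcome)  # conflict
```

This mirrors the bank account example earlier: whichever transaction exposed stale state is the one that fails at commit.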

As a result of this, one of the biggest sources of conflicts would be this:

CREATE TABLE users (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  login TEXT NOT NULL UNIQUE,
  email TEXT NOT NULL UNIQUE
);

BEGIN;
INSERT INTO users (login, email) VALUES (:login, :email) RETURNING id;
END;

Those pointless explicit BEGIN/END calls will result in all concurrent instances of such calls conflicting, as they all generate a new id based on the existing database state, and each of them must conflict with any similarly running transaction.

A solution would be to either use implicit transactions, or use a better strategy for generating ids, e.g. ULIDs.
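For illustration, here is a minimal ULID-style generator (48-bit millisecond timestamp plus 80 random bits, Crockford base32 encoded); a proper ULID library should be preferred in practice:

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid():
    """128 bits: 48-bit ms timestamp + 80 random bits, encoded as 26
    base32 characters that sort lexicographically by creation time."""
    ts = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")
    n = (ts << 80) | rand
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[n & 31])
        n >>= 5
    return "".join(reversed(chars))

a = ulid()
time.sleep(0.005)
b = ulid()
print(len(a), a < b)  # 26 True
```

Because such ids are generated without reading any database state, concurrent inserts no longer depend on the current maximum rowid and cannot conflict over id generation.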

Durability

In traditional WAL operation, if you care about transaction performance, you are required to set the synchronous level to 1 (NORMAL), which means that SQLite will NOT call fsync after each transaction. Instead, after a certain number of database pages have been written (1000 by default), it will do a WAL checkpoint, and only then will it issue an fsync so that the OS ensures the data really reaches the disk.

What that means is that between WAL checkpoints, an OS crash or a power loss can result in data loss. But as mentioned, this is a required tradeoff, since performance falls off a cliff once you turn on fsync for every transaction.

With our approach though, we can do transaction grouping, distributing the cost of a single fsync call across many concurrent transactions and achieving much higher performance in high durability mode. As we will see from the benchmarks, we are able to outperform normal WAL mode at synchronous level 1 while operating at the (previously performance-killing) synchronous level 2.
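The group commit idea in miniature: buffer the log records of many concurrent transactions and pay for a single fsync for the whole batch. A toy Python sketch, with the class and file layout invented for illustration:

```python
import os
import tempfile

class GroupCommitLog:
    """Append log records for many transactions, fsync once per batch."""
    def __init__(self, path):
        self.f = open(path, "ab")
        self.pending = []
        self.fsyncs = 0

    def append(self, record):
        # Each record stands in for one transaction's log entry.
        self.pending.append(record)

    def commit_batch(self):
        # One write + one fsync durably commits every pending transaction.
        self.f.write(b"".join(self.pending))
        self.f.flush()
        os.fsync(self.f.fileno())
        self.fsyncs += 1
        n = len(self.pending)
        self.pending.clear()
        return n

log = GroupCommitLog(os.path.join(tempfile.mkdtemp(), "commit.log"))
for i in range(100):
    log.append(f"txn {i}\n".encode())
committed = log.commit_batch()
print(committed, log.fsyncs)  # 100 1
```

One fsync for 100 transactions instead of 100 fsyncs is what makes synchronous level 2 affordable under concurrency.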

Resource Usage

A lot of care was taken to minimize resource usage in the new system. And while there are unavoidable costs, the model lends itself to lowering resource usage, for larger applications at least. For example, these principles come into play:

  • No polling; data is processed on demand
  • Minimum number of connections
  • Release connections at the earliest point possible
  • Use efficient data transport mechanisms
  • Use efficient data serialization/compression
  • Optimize hot loops as much as possible

Generally, there is a ~50MB memory overhead for the synchronization component and an occasional ~7.5MB for checkpointing. These numbers can momentarily increase based on the size of the transactions but should eventually come back to these levels.

Prototype Evaluation

To evaluate the prototype, a simple Rack application was used. The app would start an immediate transaction, insert a record into a table and commit the transaction. Depending on configuration, the app either used the vanilla SQLite drivers or the modified ones that can talk to the new concurrency components.

The application was run in a Fiber Scheduler based server (it currently requires the fiber scheduler) and wrk was used to send multiple requests to the servers. Only requests to the modified write protocol used fibers; requests that reached the WAL-based backend were run in serial mode, which was the fastest for WAL.

Multiple concurrency levels against multiple numbers of processes were tested. What is shown here is the performance of the new concurrency system relative to the WAL performance at the same process and client count.

Throughput (> 1.0 means higher performance)

Clients/Processes | 1 Process | 2 Processes | 4 Processes | 8 Processes
100               |      0.97 |        0.72 |        1.02 |        0.90
300               |      1.20 |        1.76 |        1.96 |        1.53
1000              |      1.17 |        1.25 |        2.51 |        3.76
3000              |      0.84 |        1.91 |        2.38 |        4.04
10000             |      1.21 |        1.75 |        2.67 |        3.66
Requests/Sec relative to WAL

P90 Latency (< 1.0 means lower latency)

Clients/Processes | 1 Process | 2 Processes | 4 Processes | 8 Processes
100               |      0.73 |        0.61 |        0.44 |        0.34
300               |      0.38 |        0.50 |        0.24 |        0.18
1000              |      0.19 |        0.25 |        0.15 |        0.11
3000              |      0.23 |        0.40 |        0.16 |        0.08
10000             |      0.73 |        0.43 |        0.14 |        0.05
P90 latency relative to WAL

P99 Latency (< 1.0 means lower latency)

Clients/Processes | 1 Process | 2 Processes | 4 Processes | 8 Processes
100               |      0.69 |        0.51 |        0.42 |        0.25
300               |      0.25 |        0.45 |        0.23 |        0.18
1000              |      0.54 |        0.06 |        0.15 |        0.11
3000              |      1.34 |        0.43 |        0.16 |        0.08
10000             |      1.10 |        0.57 |        0.17 |        0.07
P99 latency relative to WAL

From these figures we can conclude the following:

  • Except for single process servers, the new concurrent writes backend delivers higher throughput at almost all concurrency levels, by up to 4X under heavy loads
  • P90 latencies are reduced for ALL process and client counts, by up to 93%.
  • For all multi process servers, P99 latencies are reduced across the board, by up to 95%.
  • All this with high durability writes, compared to WAL operation at the normal synchronous level

Current limitations

The current prototype falls short in some areas, including but not limited to:

  • No ActiveRecord/Sequel integration yet
  • No support for prepared statements (coming soon)
  • The build process is quite complex
  • Incomplete error handling/reporting
  • Automated retries on conflict detection are not implemented yet

Where is the source code? How can I try this?

The current state of the prototype is not ready for public consumption. That said, even when the code shapes up, I am not sure if I will release the source code. This development is a cornerstone of an upcoming set of features that should enable building platforms for hosting highly concurrent, high performance and highly available SQLite based applications (built atop Litestack). Initially focused on Ruby/Rails, but these features should be transferable to other languages.

Keep watching this space; God willing, I hope to be sharing more exciting developments in the future.
