Pomerium v0.31 introduces a new custom local file storage backend with a clustered mode implemented via Raft for automatic failover and recovery, as an alternative to our existing Postgres backend. At first glance, this seems to fly in the face of much received wisdom about system design: that Postgres is sufficient for almost any task, and that the panoply of services used in modern microservice architectures is overengineering at its worst. And yet here we are, essentially rolling our own database and moving away from Postgres. How did this happen?
But first: why does Pomerium need a storage backend at all?
The Databroker
Pomerium is an identity-aware access proxy, providing centralized authentication and authorization for upstream applications. It consists of several components:
The proxy service handles incoming requests and matches them against route definitions.
Each request results in a call to the authorize service, which enforces access policies for routes.
If a route requires a user to be logged in, requests to that route are redirected to the authenticate service.
The authenticate service initiates an OIDC login with an identity provider.
If the login is successful, a session is created with the user's identity information (user ID, claims, etc.).
The session is saved to the databroker service and the user is redirected back to the originating request.
This new request is handled by the proxy service, which calls the authorize service; the authorize service now finds the new session (via a cookie) and uses its identity information to make access decisions.
The databroker is a crucial part of this process. It’s where sessions are stored along with other data. But is it actually necessary?
Indeed, early versions of Pomerium did not have a databroker. Instead, all identity information was stored in the user's session cookie, obviating the need for backend storage. However, a cookie-based approach has several deficiencies:
Cookies are size-limited, and although things like a user ID and most claims are relatively small, other claims can be quite large. It may seem that a user would only belong to a small number of groups, so storing an array of group IDs ought to be fine, but in many organizations users are members of thousands of groups. Furthermore, some identity providers return large data in their claims, such as a user's entire avatar as an embedded image.
Although claims are available as part of the OIDC login process, many identity providers expose additional directory data via a separate API (Google and Microsoft Entra, for example). This data may not be available in the OIDC claims, and even when it is, it will only be updated when a user logs in. Ideally, policy changes take effect faster than that: if a user is removed from a group, they should lose access to applications within minutes, not hours (or even days). Therefore, directory sync needs to be a background process that happens independently of user login events.
In addition to directory sync, there are other sources of external context data that are useful for authorization decisions.
Data stored in cookies can’t be managed — you can’t see how many sessions are active, nor can you delete them.
Because of these deficiencies, we switched to storing sessions (and other data) in a Databroker — a service that supports key-value lookups, querying, streaming synchronization, distributed leases and change notifications.
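To make that more concrete, here is a rough sketch, in Go, of the kind of interface such a service exposes. The names and signatures are simplified illustrations, not Pomerium's actual gRPC API:

```go
// Package databroker sketches the kinds of operations the Databroker exposes.
// This is a simplified, hypothetical interface, not Pomerium's actual gRPC API.
package databroker

import "context"

// Record is a versioned entry of a particular type (a session, a directory
// user, an external context record, and so on).
type Record struct {
	Type    string
	ID      string
	Data    []byte
	Version uint64
}

// Databroker captures the capabilities described above.
type Databroker interface {
	// Get returns a single record by type and ID.
	Get(ctx context.Context, recordType, id string) (*Record, error)
	// Put writes (or updates) a record and returns its new version.
	Put(ctx context.Context, record *Record) (uint64, error)
	// Query filters records of a given type (e.g. by indexed fields).
	Query(ctx context.Context, recordType, filter string) ([]*Record, error)
	// Sync streams changes after a given version, so callers can keep a
	// local copy up to date and react to change notifications.
	Sync(ctx context.Context, recordType string, afterVersion uint64) (<-chan *Record, error)
	// AcquireLease grants a named, time-limited distributed lease, useful
	// for ensuring only one instance runs tasks like directory sync.
	AcquireLease(ctx context.Context, name string, ttlSeconds int) (leaseID string, err error)
	// ReleaseLease releases a previously acquired lease.
	ReleaseLease(ctx context.Context, name, leaseID string) error
}
```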
At first this was based on groupcache, but a caching approach is inherently volatile and prone to inconsistency (e.g. one component thinks you’re logged in, another does not). We then transitioned to Redis and finally Postgres.
As time went on, the data we stored in the Databroker became more sophisticated, and the need to index it, particularly CIDR data for GeoIP lookups, pushed us toward supporting only Postgres.
Postgres Problems
On paper Postgres is great. It has tons of features like CIDR indexing, first-class JSON types, and LISTEN/NOTIFY to support multiple databroker instances. Data is easily explored for debugging and it’s readily available in many different environments. This is important because Pomerium can be installed in a variety of ways: from a Kubernetes ingress controller, to bare-metal VMs and even in air-gapped environments.
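LISTEN/NOTIFY, for instance, is what lets several Databroker instances share one database and still hear about each other's writes promptly. A minimal sketch of that pattern using the pgx driver; the connection string, channel name, and payload here are invented for illustration:

```go
// Minimal sketch of Postgres LISTEN/NOTIFY fan-out using github.com/jackc/pgx/v5.
// The "record_changes" channel name is hypothetical.
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	conn, err := pgx.Connect(ctx, "postgres://user:pass@localhost:5432/pomerium")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Subscribe to change notifications emitted by other instances
	// (e.g. via a trigger or an explicit NOTIFY after each write).
	if _, err := conn.Exec(ctx, "LISTEN record_changes"); err != nil {
		log.Fatal(err)
	}

	for {
		// Blocks until another connection runs: NOTIFY record_changes, '...'
		n, err := conn.WaitForNotification(ctx)
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("record changed: %s", n.Payload)
	}
}
```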
But managing Postgres brings its own challenges. It's one thing to pick Postgres as your main database at a SaaS company, where you're willing to accept the maintenance burdens of that choice: backups, scaling, storage, on-call, and so on. Or, as seems increasingly common, you opt for a managed Postgres instance. Which works well. Until it doesn't.
At scale, Pomerium can generate a lot of reads and writes against a Postgres instance. Each session has to be written to the database and replicated to each Pomerium instance. Directory data has to be stored and replicated as well, which for some customers means tens or even hundreds of thousands of users and/or groups. Managed databases often aren't set up to handle this kind of load. Sometimes they're provisioned with surprisingly small hardware, or placed in other regions, incurring significant latency as traffic crosses the internet. Sometimes the same database is shared by multiple applications, leading to unexpected slowdowns entirely unrelated to Pomerium.
These problems aren’t insurmountable. A DBA would certainly know how to set things up to work well. But as we have discovered, DBAs are a rare breed these days. Some companies have no institutional knowledge about how to operate a Postgres database, which invariably leads to problems with Pomerium, and puts us in the position of providing expertise and support for infrastructure we know little about. Working through those problems can take a long time and is frustrating for everyone involved.
That led us to re-evaluate our decision to use Postgres as our primary storage backend. Is there something easier to operate and configure that still provides all the features we need?
Towards a File-Based Storage Backend with Automatic Failover
In addition to the Postgres backend, Pomerium has an in-memory backend, primarily used for testing. The in-memory version uses a B-tree to store data and also supports CIDR indexing using the bart package (see the sketch after this list). However, it has two major deficiencies compared to the Postgres version:
It has no persistence, so if Pomerium is restarted all data is lost. The practical outcome is that all users are logged out, and external context data will be missing until it is populated again (which for large directories could take a while).
It only supports a single instance. Whereas multiple instances of Pomerium can all use the same Postgres database, the in-memory backend can only be used by a single Databroker, and although it is possible to share one Databroker across multiple Pomerium proxies (via split-service mode), this still leaves a single point of failure.
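To illustrate what CIDR indexing means in practice, here is a small sketch using the bart package: a longest-prefix-match lookup of an address against CIDR ranges. The ranges and payloads are made up for illustration:

```go
// Minimal sketch of CIDR-indexed lookups with github.com/gaissmai/bart,
// the kind of longest-prefix matching used for GeoIP-style records.
package main

import (
	"fmt"
	"net/netip"

	"github.com/gaissmai/bart"
)

func main() {
	// Map CIDR ranges to an arbitrary payload (here, a country code).
	tbl := new(bart.Table[string])
	tbl.Insert(netip.MustParsePrefix("203.0.113.0/24"), "NL")
	tbl.Insert(netip.MustParsePrefix("198.51.100.0/24"), "US")

	// Longest-prefix match for a single address.
	if country, ok := tbl.Lookup(netip.MustParseAddr("203.0.113.7")); ok {
		fmt.Println("matched:", country) // matched: NL
	}
}
```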
Persistence
To solve the first issue we introduced a file-based storage backend using a local key-value store. There are many options here, from SQLite to Go-native key-value stores like bbolt or Badger. SQLite has the downside of requiring cgo (unless you use modernc.org/sqlite), and our requirements were fairly minimal (largely simple key-value access with no relational data), so we opted for Pebble instead.
Pebble is based on log-structured merge trees, a design used in many databases. It is the storage engine behind CockroachDB, and it is both high-performance and rigorously tested.
The Databroker maintains a set of keyspaces, each identified by a single-byte key prefix.
Each keyspace holds a sorted list of keys and values that allows efficient retrieval of data. For example, each record is written to the record keyspace under a key derived from its type and ID.
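A hypothetical sketch of this kind of key encoding follows; the actual prefixes and separators Pomerium uses may differ:

```go
// Hypothetical key encoding for the record keyspace; the real layout in
// Pomerium may differ. The point is that a single-byte prefix selects the
// keyspace and the rest of the key sorts records by type and ID.
package main

import "fmt"

const recordKeyspace byte = 0x01 // hypothetical keyspace prefix

// recordKey builds a store key for a record of the given type and ID.
func recordKey(recordType, id string) []byte {
	key := []byte{recordKeyspace}
	key = append(key, recordType...)
	key = append(key, 0x00) // separator so types can't collide
	key = append(key, id...)
	return key
}

func main() {
	fmt.Printf("%q\n", recordKey("session", "abc123"))
	// Output: "\x01session\x00abc123"
}
```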
Pomerium relies heavily on two operations: loading the current values of a given record type, and loading changes as they are made. Because Pebble stores keys in sorted order, both operations are very efficient, albeit with some write amplification to support both use cases.
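With keys laid out this way, both operations become range scans over a key prefix. A minimal sketch with Pebble, reusing the hypothetical key layout above (error handling abbreviated; the NewIter signature matches recent Pebble releases):

```go
// Minimal sketch of a prefix scan over a (hypothetical) record keyspace
// using github.com/cockroachdb/pebble.
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("databroker-data", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a couple of session records under the hypothetical key layout.
	_ = db.Set([]byte("\x01session\x00abc123"), []byte(`{"user":"alice"}`), pebble.Sync)
	_ = db.Set([]byte("\x01session\x00def456"), []byte(`{"user":"bob"}`), pebble.Sync)

	// Load the current values of a record type: scan everything whose key
	// starts with the keyspace prefix plus the record type.
	iter, err := db.NewIter(&pebble.IterOptions{
		LowerBound: []byte("\x01session\x00"),
		UpperBound: []byte("\x01session\x01"), // first key past the prefix
	})
	if err != nil {
		log.Fatal(err)
	}
	defer iter.Close()

	for iter.First(); iter.Valid(); iter.Next() {
		fmt.Printf("%q = %s\n", iter.Key(), iter.Value())
	}
}
```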
Because of how Pomerium works, state can always be rebuilt from other sources: users can simply log in again to restore sessions, and directory data can be re-synced. This lets us relax our reliability requirements. The goal of persistence is to reduce downtime, not to avoid catastrophic failure, so there's no need for backups, and it's up to the user what level of durability they're comfortable with (for example, whether to use a Kubernetes persistent volume).
Replication
To support multiple instances we opted to use the Raft protocol. Raft implements distributed consensus for a multi-node cluster in such a way that only a quorum of nodes needs to be available for a leader to be elected.
Go has two mature implementations of Raft, hashicorp/raft and etcd-io/raft, which means we didn't have to roll our own.
However, full consistency is not free: it comes with a performance cost, and it's also more of a guarantee than we strictly need. If we lose some writes, the consequence is that a user might get logged out or directory data may go stale, but the system will recover on its own, so we chose to relax the consistency guarantees.
Instead of replicating all the data via the Raft state machine, we use Raft only as a leader election mechanism. Once a leader is in place, we forward requests from any followers to the leader. In addition, the followers replicate state from the leader so that they are ready to take over if needed.
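A rough sketch of that pattern, using hashicorp/raft for illustration; the forwarding and local-apply functions are stand-ins, not Pomerium's actual code:

```go
// Sketch of using Raft purely for leader election, with writes forwarded to
// the leader. Hypothetical; Pomerium's implementation differs in detail.
package cluster

import (
	"fmt"

	"github.com/hashicorp/raft"
)

// forwardToLeader stands in for "send this request over gRPC to addr".
func forwardToLeader(addr raft.ServerAddress, req string) error {
	fmt.Printf("forwarding %q to leader at %s\n", req, addr)
	return nil
}

// applyLocally stands in for writing the record to the local store
// (and streaming the change out to followers).
func applyLocally(req string) error {
	fmt.Printf("applying %q locally\n", req)
	return nil
}

// handleWrite applies the request if this node is the leader, otherwise
// forwards it to whichever node Raft currently considers the leader.
func handleWrite(r *raft.Raft, req string) error {
	if r.State() == raft.Leader {
		return applyLocally(req)
	}
	leaderAddr, _ := r.LeaderWithID()
	if leaderAddr == "" {
		return fmt.Errorf("no leader elected yet")
	}
	return forwardToLeader(leaderAddr, req)
}
```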
If the leader fails, a new leader will be elected. That leader will have replicated the state of the previous leader. Its copy may not be entirely up to date, though in practice replication keeps up well and writes are rarely lost.
Services that use the Databroker rely on standard gRPC load balancing and will connect to any available node. If they were connected to the previous leader when it went down, they will reconnect to a different node; it doesn't matter whether that node is the new leader or not, the request will be handled either way.
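On the client side, that can be as simple as gRPC's built-in round_robin policy. A minimal sketch; the target address is a placeholder, and TLS is omitted for brevity:

```go
// Sketch of a gRPC client that spreads requests across Databroker nodes
// using the built-in round_robin policy. The DNS target is a placeholder.
package cluster

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialDatabroker connects to all addresses behind the DNS name and
// round-robins requests across whichever nodes are currently reachable.
func dialDatabroker() (*grpc.ClientConn, error) {
	return grpc.NewClient(
		"dns:///databroker.internal:5443",
		grpc.WithTransportCredentials(insecure.NewCredentials()), // TLS omitted for brevity
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}
```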
This does mean that all Databroker requests are handled by a single node. Is that a problem? Most of the time, no. The Databroker can handle a lot of requests, and Pomerium can usually be scaled more easily by breaking workloads up into separate clusters. Future improvements may explore a multi-leader approach using key ranges.
Conclusion
With the addition of the file-based storage backend and Databroker clustering, we believe we've struck a nice balance between ease of use and features. Performance is at least as good as the Postgres version, while being significantly easier to manage. The file-based storage backend is available for the Databroker in Pomerium v0.31 and can be enabled through the Databroker storage configuration options.
Instructions for how to set up Raft can be found in the documentation.