Exploring PostgreSQL to Parquet archival for JSON data with S3 range reads


PostgreSQL handles large JSON payloads reasonably well until you start updating or deleting them frequently. Once payloads grow large enough to be pushed out to TOAST storage and churn becomes high, autovacuum can dominate your I/O budget and cause other issues. I have been exploring the idea of moving older JSON data (read: cold data) to Parquet on S3 while keeping recent data hot in PostgreSQL daily partitions, then simply dropping those partitions instead of running expensive DELETE operations and subsequent vacuum cycles. What I am about to discuss should be fairly well-known patterns, so let's dive in.

The TOAST problem space

PostgreSQL’s TOAST mechanism kicks in when a row grows past the TOAST threshold (roughly 2 KB by default, a quarter of the 8 KB page). The server compresses oversized values and, if they are still too large, slices them into roughly 2 KB chunks stored in a separate toast relation, keeping the main heap tuples small. This design works well for read-heavy workloads, but creates challenges when large JSON payloads experience frequent updates or deletes.

Consider a typical json_payloads table with columns like id, key, output (the large JSON blob), customer_id, group_id, and created_at. When the JSON in the output column grows large, PostgreSQL automatically moves it to TOAST storage, creating the update and delete performance issues we’re exploring.
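
To get a feel for how much of such a table actually lives in TOAST, you can ask the system catalogs. A minimal sketch, written with an ActiveRecord-style execute helper (the same convention the sync code later in this post uses):

toast_usage = execute(<<~SQL)
  -- heap size vs. the size of the associated toast relation (and its index)
  SELECT c.relname,
         pg_size_pretty(pg_relation_size(c.oid))                 AS heap_size,
         pg_size_pretty(pg_total_relation_size(c.reltoastrelid)) AS toast_size
  FROM pg_class c
  WHERE c.relname = 'json_payloads'
SQL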

When you update or delete a large JSON payload, PostgreSQL creates a new row version in the main table and typically writes fresh chunks to the toast table. The old row and toast chunks become dead tuples that autovacuum must scan and reclaim. Under high churn, this creates cascading work across both relations, often resulting in vacuum operations that take hours to complete on tables with hundreds of millions of pages. I will have to write another post just on how bad some of these autovacuums can get.
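
You can also watch the dead-tuple churn build up directly in pg_stat_user_tables; a quick sketch in the same style:

bloat_check = execute(<<~SQL)
  -- dead tuples waiting to be reclaimed, and when autovacuum last touched the table
  SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
  WHERE relname = 'json_payloads'
SQL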

The core issue is that PostgreSQL’s MVCC design assumes most data modifications affect relatively small row versions. Large JSON documents break this assumption, especially when they’re frequently updated or deleted.

Hot and cold data separation

Let’s explore a two-tier storage approach that tiers JSON payload data by access pattern rather than forcing everything through PostgreSQL’s heap and toast machinery. Recent payloads stay hot in PostgreSQL daily partitions for fast transactional access, while older data moves to cold storage in Parquet files on S3. This kind of setup is useful when your query patterns seek recent data far more often than older data.

The main thing here is using DROP PARTITION instead of DELETE operations. Dropping a partition removes both the heap and toast relations instantly, completely avoiding the expensive vacuum cycles described above.
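
A minimal sketch of what the hot tier could look like, assuming declarative range partitioning on created_at (table and column names follow the post's examples; the exact DDL and retention job are my illustration):

# Parent table partitioned by day; recent days stay hot in PostgreSQL.
execute(<<~SQL)
  CREATE TABLE json_payloads (
    id          BIGINT      NOT NULL,
    key         TEXT        NOT NULL,
    output      JSONB,
    customer_id BIGINT      NOT NULL,
    group_id    BIGINT      NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL
  ) PARTITION BY RANGE (created_at);

  CREATE TABLE json_payloads_2025_10_03 PARTITION OF json_payloads
    FOR VALUES FROM ('2025-10-03') TO ('2025-10-04');
SQL

# Retention: once a day has been synced to Parquet on S3, detach and drop the
# whole partition instead of running DELETE and waiting for vacuum.
execute("ALTER TABLE json_payloads DETACH PARTITION json_payloads_2025_09_26")
execute("DROP TABLE json_payloads_2025_09_26")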

PostgreSQL (hot: last ~7 daily partitions)              S3 (cold tier: 7+ days old)
  json_payloads_2025_10_03                                Parquet files:
  json_payloads_2025_10_02                                  - columnar and compressed
  ...                                        ── sync ──▶    - row group pruning
  json_payloads_2025_09_27                                  - HTTP range requests

This separation lets PostgreSQL focus on what it does best: ACID transactions, indexed lookups, and managing active data. The cold tier handles what S3 and Parquet excel at: cost-effective storage, compression, and analytical-style queries with column projection.

Parquet file structure and schema mapping

This is a good read to familiarize yourself with Parquet: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/

When we export data from our json_payloads table to Parquet, we define a schema that maps our PostgreSQL columns to Parquet’s columnar format:

PostgreSQL (json_payloads)            Parquet schema
  id (payload_id)         ──▶   id:          INT64   (col 0)
  key                     ──▶   key:         STRING  (col 1)
  output (JSON blob)      ──▶   output:      STRING  (col 2)
  customer_id             ──▶   customer_id: INT64   (col 3)
  group_id                ──▶   group_id:    INT64   (col 4)
  created_at              ──▶   created_at:  INT64   (col 5)

Parquet’s internal organization enables the selective reading that makes this architecture viable. Each file consists of row groups containing columnar data, with a footer that stores schema information and statistics for efficient pruning.

Example file: shard_07_segment_00.parquet (~256 MB, ≈60 row groups of 1-4 MB each)

  Row Group 0   rows with payload_id 5001-5200   (id, key, output, customer_id, ...)
  Row Group 1   rows with payload_id 5201-5400
  Row Group 2   rows with payload_id 5401-5600
  ...
  Footer
    schema + per-row-group metadata:
      Row Group 0: offset, compressed size, row count, min_payload_id=5001, max_payload_id=5200
      Row Group 1: offset, compressed size, row count, min_payload_id=5201, max_payload_id=5400
      Row Group 2: offset, compressed size, row count, min_payload_id=5401, max_payload_id=5600

The footer statistics enable predicate pushdown. When searching for payload_id = 5225, we can examine the min/max values and immediately skip Row Groups 0 and 1, knowing they don’t contain our target. This eliminates the need to download and scan irrelevant data.
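
In application code, that pruning is just a containment check against the cached row group statistics. A minimal sketch (RowGroupStat is an illustrative struct, not from any particular library; the offsets match the worked example later in the post):

RowGroupStat = Struct.new(:ordinal, :offset_bytes, :size_bytes, :min_payload_id, :max_payload_id)

# Keep only the row groups whose min/max statistics could contain the target id.
def candidate_row_groups(row_group_stats, payload_id)
  row_group_stats.select do |rg|
    rg.min_payload_id <= payload_id && payload_id <= rg.max_payload_id
  end
end

stats = [
  RowGroupStat.new(0,         4, 1_049_596, 5001, 5200),
  RowGroupStat.new(1, 1_049_600, 1_048_576, 5201, 5400),
  RowGroupStat.new(2, 2_098_176, 1_048_576, 5401, 5600),
]

candidate_row_groups(stats, 5225).map(&:ordinal)   # => [1]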

Deterministic sharding and file organization

We distribute payloads across shards using a hash function to ensure predictable file locations while controlling file proliferation. Each payload_id maps to exactly one shard within each time period.

payload_id = 5225
shard      = hash64(5225) % 32   # results in shard 7

S3 key: customer_123/group_456/2025/10/03/hour_14/shard_07_segment_00.parquet
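
hash64 just needs to be a stable 64-bit hash that every writer and reader agrees on. One possible sketch using only the Ruby standard library (the helper names are mine, and with this particular hash 5225 will not necessarily land on shard 7 as in the example above):

require "digest"

# Stable 64-bit hash: take the first 8 bytes of a SHA-256 digest.
def hash64(payload_id)
  Digest::SHA256.digest(payload_id.to_s)[0, 8].unpack1("Q>")
end

def s3_key_for(customer_id:, group_id:, created_at:, payload_id:, segment: 0)
  shard = hash64(payload_id) % 32
  format(
    "customer_%d/group_%d/%s/hour_%02d/shard_%02d_segment_%02d.parquet",
    customer_id, group_id, created_at.strftime("%Y/%m/%d"), created_at.hour, shard, segment
  )
end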

This creates a hierarchical S3 structure that supports efficient partition pruning:

s3://payloads/customer_123/group_456/2025/10/03/
├── hour_13/
│   ├── shard_00_segment_00.parquet
│   └── shard_01_segment_00.parquet
└── hour_14/
    ├── shard_00_segment_00.parquet
    └── shard_07_segment_00.parquet   ← target file

Writers append rows to their assigned shard until reaching approximately 128-256 MB, then roll to the next segment. If data volume is high, we might see multiple segments per shard within the same hour—for example, shard_07_segment_00.parquet, shard_07_segment_01.parquet, and shard_07_segment_02.parquet. Within each file, we target row groups of 1-4 MB compressed to optimize for HTTP range requests. This row group sizing is critical because it determines the granularity of our range requests—small enough to minimize data transfer, large enough to amortize request overhead. This approach caps file proliferation while ensuring any payload_id maps to a single predictable location without requiring an index lookup.

To efficiently query this packed data, we can maintain a lightweight catalog in PostgreSQL that tracks S3 files and enables efficient point lookups without pulling the entire Parquet file on every request. Something like:

CREATE TABLE s3_file_catalog (
  id                   SERIAL PRIMARY KEY,
  customer_id          BIGINT NOT NULL,
  group_id             BIGINT NOT NULL,
  date                 DATE NOT NULL,
  hour                 INTEGER NOT NULL,
  shard                INTEGER NOT NULL,
  segment              INTEGER NOT NULL,
  s3_key               TEXT NOT NULL,
  min_payload_id       BIGINT NOT NULL,
  max_payload_id       BIGINT NOT NULL,
  total_size_bytes     BIGINT NOT NULL,
  externalized_ranges  INTEGER[] DEFAULT '{}',
  created_at           TIMESTAMP DEFAULT NOW()
);

CREATE TABLE s3_row_groups (
  id                   SERIAL PRIMARY KEY,
  s3_file_catalog_id   BIGINT REFERENCES s3_file_catalog(id),
  ordinal              INTEGER NOT NULL,
  offset_bytes         BIGINT NOT NULL,
  size_bytes           BIGINT NOT NULL,
  row_count            INTEGER NOT NULL,
  min_payload_id       BIGINT NOT NULL,
  max_payload_id       BIGINT NOT NULL,
  created_at           TIMESTAMP DEFAULT NOW()
);

The catalog remains small because it’s O(files) rather than O(payloads). Each file entry contains the metadata needed to determine whether it could contain a target payload_id, while row group entries provide the byte offsets needed for precise HTTP range requests.
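
A point lookup against this catalog might look like the following; a sketch assuming the schema above, with the post's example values hardcoded and parameter binding elided:

# Find the candidate file, then the one row group whose cached min/max
# statistics bracket the target payload_id.
candidate = execute(<<~SQL).first
  SELECT f.s3_key, rg.offset_bytes, rg.size_bytes
  FROM s3_file_catalog f
  JOIN s3_row_groups rg ON rg.s3_file_catalog_id = f.id
  WHERE f.customer_id = 123
    AND f.group_id    = 456
    AND f.date        = '2025-10-03'
    AND f.hour        = 14
    AND f.shard       = 7
    AND rg.min_payload_id <= 5225
    AND rg.max_payload_id >= 5225
  LIMIT 1
SQL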

The s3_row_groups table serves as a performance optimization that caches parsed Parquet footer metadata. While Parquet footers contain all the information we need—including row group byte offsets and column statistics—parsing this metadata on every lookup would require reading the footer and decoding the statistics.

When we build Parquet files from our json_payloads data, Parquet automatically generates column statistics for each row group, including min/max values for our id column (which contains the business payload_id values). These statistics are stored as encoded bytes indexed by column position—our id column statistics appear at index 0 in the footer’s statistics map.

The s3_row_groups table caches these decoded min/max payload_id values along with the row group byte offsets. This optimization is well-established in data lake architectures—systems like Iceberg and Delta Lake maintain similar metadata catalogs to avoid repeated footer parsing. For our use case, this allows us to skip both the footer read and the statistics decoding on every lookup.

Syncing partitions to Parquet

When a daily partition becomes eligible for archival, we can stream its rows through Parquet writers that handle sharding, size-based externalization, and metadata capture. Parquet writer libraries are available in many languages (Go, Rust, Python, and so on).

def sync_partition_to_s3(partition_name)
  writers = {}

  # "SELECT * FROM #{partition_name} LIMIT ..." under the hood
  fetch_data_in_batches.each do |row|
    shard = hash64(row[:payload_id]) % 32
    key = [row[:customer_id], row[:group_id], row[:hour], shard]
    writer = writers[key] ||= create_parquet_writer(key)

    # Very large payloads get externalized to their own S3 objects
    if estimate_compressed_size(row[:output]) > 4.megabytes
      externalize_to_s3(row)
      row[:output] = nil
      row[:is_external] = true
    end

    writer.append_row(row)

    # Roll to a new segment once the current file is large enough
    if writer.size_bytes > 256.megabytes
      close_and_catalog_writer(writer)
      writers[key] = create_parquet_writer(key)
    end
  end

  writers.each { |_key, writer| close_and_catalog_writer(writer) }

  execute("ALTER TABLE json_payloads DETACH PARTITION #{partition_name}")
  # When ready, drop the partition
end

The close_and_catalog_writer function reads the completed Parquet file’s footer metadata and extracts the information needed for efficient lookups. For each row group, it captures the byte offset, compressed size, and row count from the footer directory. More importantly, it decodes the column statistics for our id column (stored at index 0 in the statistics map) to extract the actual min and max payload_id values from the encoded byte arrays.

This metadata extraction step is crucial because Parquet stores statistics as min_bytes and max_bytes that need decoding. For INT64 columns like our payload_id, we decode these as little-endian 8-byte integers. By doing this work once during file creation and caching the results in PostgreSQL, we enable fast indexed lookups without the overhead of footer parsing and byte decoding on every read.
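
For example, in Ruby the decoding step is a one-liner, assuming the footer parser hands you each row group's statistics as raw binary strings (min_bytes / max_bytes below are those raw values):

# Parquet stores INT64 statistics as 8-byte little-endian values.
def decode_int64_stat(bytes)
  bytes.unpack1("q<")   # signed 64-bit, little-endian
end

min_payload_id = decode_int64_stat(min_bytes)
max_payload_id = decode_int64_stat(max_bytes)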

Note: there are other ways to stream this data too, such as logical replication. That kind of setup works as long as you only need to stream INSERTs, your data has no UPDATEs, and you can rely on something like an S3 lifecycle rule for deletes. There are also other methods, like incrementally syncing data. I am leaving the details of efficiently streaming data from PostgreSQL to S3 in Parquet format for now, but know that there are good levers that can get you close to near-real-time data as well.

Point lookup mechanics

When the application requests a specific payload, we follow a sequence designed to minimize both database queries and S3 requests through predicate pushdown and range targeting.

Application requests payload_id = 5225

1. Shard calculation: hash64(5225) % 32 = 7
2. Catalog lookup: SELECT against s3_file_catalog / s3_row_groups for this
   customer, group, and shard, filtering on min_payload_id <= 5225 <= max_payload_id
3. Row group pruning: the cached statistics identify the single matching row group
   along with its byte offset and size
4. S3 ranged GET: fetch only that row group's bytes from the target .parquet file
5. Decode the row group, pick out the matching row, and return the JSON payload

This approach typically completes point lookups in 100-200ms, with most time spent on the S3 range request rather than database operations or Parquet decoding. The main optimization is predicate pushdown combined with precise range targeting.
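
The S3 side of that flow is just a ranged GET over the byte window the catalog handed back. A minimal sketch with the aws-sdk-s3 gem, reusing the candidate row from the catalog query sketched earlier (the bucket name is assumed, and the final decode step depends on which Parquet reader you use):

require "aws-sdk-s3"

def fetch_row_group(s3, s3_key, offset_bytes, size_bytes)
  last_byte = offset_bytes + size_bytes - 1
  resp = s3.get_object(
    bucket: "payloads",                            # bucket name assumed
    key:    s3_key,
    range:  "bytes=#{offset_bytes}-#{last_byte}"   # fetch only this row group
  )
  resp.body.read
end

s3 = Aws::S3::Client.new
row_group_bytes = fetch_row_group(s3, candidate["s3_key"],
                                  candidate["offset_bytes"], candidate["size_bytes"])
# ...hand row_group_bytes to a Parquet reader to decode and pick out payload_id = 5225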

Predicate pushdown and I/O efficiency

The statistics stored in Parquet footers enable sophisticated predicate pushdown that dramatically reduces I/O compared to naive approaches. Rather than downloading entire files and filtering in memory, we can eliminate entire row groups based on cached metadata from our s3_row_groups table.

Consider a query for payloads where payload_id BETWEEN 5300 AND 5350. Without predicate pushdown, we’d download the entire 256 MB file. With row group statistics, we can identify that only Row Group 1 could contain matching records and fetch just that 1-4 MB segment:

Row Group Analysis (from cached metadata):

  Row Group 0: min_payload_id=5001, max_payload_id=5200 → SKIP  (max 5200 < 5300)
  Row Group 1: min_payload_id=5201, max_payload_id=5400 → FETCH (range 5300-5350 overlaps)
  Row Group 2: min_payload_id=5401, max_payload_id=5600 → SKIP  (min 5401 > 5350)

Result: a single range request for Row Group 1 only
S3 GET Range: bytes=1049600-2098175 (1 MB instead of 256 MB)
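
The overlap test is a small generalization of the point-lookup pruning sketched earlier; reusing the illustrative RowGroupStat list from that sketch:

# A row group can match a BETWEEN query iff the two intervals overlap.
def overlapping_row_groups(row_group_stats, lo, hi)
  row_group_stats.select do |rg|
    rg.min_payload_id <= hi && lo <= rg.max_payload_id
  end
end

def byte_range(rg)
  "bytes=#{rg.offset_bytes}-#{rg.offset_bytes + rg.size_bytes - 1}"
end

overlapping_row_groups(stats, 5300, 5350).map { |rg| byte_range(rg) }
# => ["bytes=1049600-2098175"]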

This predicate pushdown reduces network transfer by over 99% compared to downloading the full file. In a quick proof of concept, fetching a single payload downloaded just 2.5 MB (about 4.2% of the file, per the logs), showing how row group targeting minimizes I/O while keeping the flexibility to handle arbitrary payload_id ranges.

Performance characteristics and trade-offs

The two-tier architecture delivers predictable performance across different access patterns. Hot data in PostgreSQL maintains sub-10ms response times for indexed lookups, while cold data in Parquet typically responds in 100-200ms depending on row group size and S3 latency.

Storage costs favor the cold tier significantly. PostgreSQL storage includes not just the JSON data but also indexes, toast overhead, and free space left behind by vacuum operations. Parquet, meanwhile, achieves 60-80% compression ratios on JSON data, and S3 storage costs roughly $0.023 per GB per month.

The trade-off comes in operational complexity. We now manage two storage systems with different consistency models, backup strategies, and failure modes. However, this complexity is contained within the storage layer and remains invisible to application code through a unified payload access interface.
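
That unified interface can be as thin as a read path that tries the hot tier first and falls back to the cold tier; a sketch with illustrative names:

# Hot path: indexed PostgreSQL lookup against the recent partitions.
# Cold path: catalog lookup plus S3 ranged read, as sketched above.
def fetch_payload(payload_id)
  hot = JsonPayload.find_by(id: payload_id)
  return hot.output if hot

  fetch_payload_from_cold_storage(payload_id)
end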

The most significant operational benefit is eliminating autovacuum pressure on large JSON workloads. By using DROP PARTITION instead of DELETE operations, we avoid the expensive vacuum cycles that previously dominated our I/O budget during data retention cleanup.

Considerations

This approach works well for append-heavy workloads where older data becomes read-only, but it introduces some trade-offs worth understanding.

Data consistency: S3 Parquet data is eventually consistent with PostgreSQL. During sync failures or partial writes, temporary inconsistencies can occur. The catalog helps detect these, but applications need to handle cases where metadata exists but S3 objects don’t.
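
A periodic reconciliation job can catch the "catalog entry without an S3 object" case cheaply with HEAD requests; a sketch using aws-sdk-s3 (bucket name assumed):

# Return the catalog rows whose S3 object is missing.
def missing_s3_objects(s3, catalog_rows)
  catalog_rows.select do |row|
    begin
      s3.head_object(bucket: "payloads", key: row["s3_key"])
      false                                  # object exists, nothing to flag
    rescue Aws::S3::Errors::NotFound
      true                                   # metadata exists, object does not
    end
  end
end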

Updates and deletes on archived data: Once JSON payloads move to Parquet, updates become expensive since they require rewriting entire row groups. Deletes need tombstone tracking or file compaction. This design assumes older data rarely changes and you could just rely on S3 lifecycle policies for expiration. If you need frequent updates on archived data, keeping it in PostgreSQL may make more sense.

Future optimizations and considerations

Several enhancements could further improve this architecture, such as async I/O to fetch multiple ranges from one or more Parquet files at once, or Bloom filters in Parquet footers to reduce false positives when row group min/max ranges overlap with query predicates.

The key architectural principle remains unchanged: leverage PostgreSQL for transactional consistency and fast indexed access on hot data, while using Parquet and S3 for cost-effective storage and analytical-style queries on cold data. This separation of concerns allows each system to operate within its optimal performance envelope while providing a unified interface to applications.

Lastly, there are also tools like DuckDB that let you query this same kind of data directly with SQL, which is quite nice and something I plan on exploring a bit more.
