My quest for a one-size-fits-all format
At the Center for Coastal Climate Resilience, we’ve been chasing a simple dream: one vector storage format in a data lake that does everything. A single, authoritative dataset that our pipelines keep fresh, where scientists can run complex queries in SQL, and where developers can build maps that real people use. No sync jobs. No “oh right, we didn’t re-tile after last night’s run.” No lag between analysis and what the world sees.
GeoParquet feels like it should be that thing. It’s columnar, it’s friendly to predicate pushdown, it lives happily in object storage, and it plays nice with engines like DuckDB. In our vision, the holy grail looked like: write once to GeoParquet, keep it updated in a cloud-hosted data lake, and let both analytics and web visualizations run on the same source of truth.
All that said, I’ve mostly used GeoParquet for analysis. I hadn’t really used it to serve maps from the browser. So I went into this with some naïveté and a bunch of questions. Could a single partitioning strategy make GeoParquet fast enough for web maps while still being performant for analysis? Or will we always be splitting the proverbial baby?
On the visualization side, I already know PMTiles works. PMTiles is ridiculously effective for putting polygons on a map fast and cheaply — host it, range-request it, done. But PMTiles is also a derivative product. It’s a build artifact that has to be kept in sync with whatever your “truth” is in the lake. That’s not a dealbreaker, it’s just a tax. If I could avoid that tax — skip the extra pipeline, the extra storage, the extra failure mode — that’d be a meaningful simplification for small teams like our visualization team at the UC Santa Cruz Center for Coastal Climate Resilience.
The plan is simple enough: keep writing to the lake in GeoParquet, vary partitioning schemes, and measure both ends — analytics with DuckDB and visualization in the browser — without introducing a separate tiling step. If it worked, great: fewer moving parts. If it didn’t, at least I’d know why and where the line gets crossed.
💡 An important note: I’m far from an expert in optimizing GeoParquet. For instance, I have not tried the following tests with Hilbert curve ordering/partitioning in this round (as recommended by Chris Holmes) and that deserves a proper follow-up. I’m sure there’s a whole world of optimizations I could make that improve this setup. I’m definitely interested in your suggestions for improvement. As new approaches emerge, I’ll continue exploring and sharing my results here.
How GeoParquet actually gets read
(feel free to skip if you’re a parquet whiz)
When I say “read,” I mostly mean “fetch the fewest bytes needed and do the least work possible.” Parquet is columnar with a footer that lists row groups, column chunks, and min/max stats. Over HTTP, readers usually issue one or two range requests to grab that footer, decide which row groups are relevant, then pull only those byte ranges. If the summary stats alone are enough for filtering (e.g., StateAbbr), whole row groups can be skipped. If they’re not (e.g., most spatial filters), you’re paying for geometry bytes anyway—so file size and row group size quietly set your “time-to-first-features.”
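If you want to see these structures for yourself, DuckDB can dump them. The sketch below assumes a local GeoParquet file (the path is a placeholder); parquet_metadata() lists every row group and column chunk along with the min/max stats a reader uses to decide which byte ranges to fetch.
-- Inspect the Parquet footer: one row per column chunk, per row group
SELECT row_group_id,
       path_in_schema,        -- column name
       num_values,
       stats_min_value,       -- min/max stats used for row group skipping
       stats_max_value,
       total_compressed_size  -- bytes a reader pays to fetch this chunk
FROM parquet_metadata('tracts.parquet');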
In the browser, the common JavaScript stacks — parquet-wasm+Arrow and hyparquet+HySnappy — do support some clever filtering methods but not everything a full SQL engine does. Hyparquet can project specific columns and limit row ranges, and it uses HTTP range requests so it only fetches the byte ranges it needs; you can also read metadata first and stream chunks/pages as they’re ready. That combo helps “first features” a lot. parquet-wasm, by contrast, focuses on Arrow compliance and performant handling of larger datasets. Check out a more detailed comparison in parquet-wasm’s README.
DuckDB changes the equation because it brings an execution engine to the bytes. It can push down predicates, prune partitions (Hive-style paths like state=CA/…), skip row groups via stats, parallelize scans, and run aggregates/joins—before materializing full results. Attribute filters benefit the most; spatial filters still force geometry IO unless you’ve partitioned with spatial locality (which I have not… yet). It’s doing real query planning instead of “decode then filter”.
📖 Definition: predicate pushdown
The ability of a Parquet reader to apply filters by using Parquet’s metadata (row group min/max stats or partition paths) so that irrelevant chunks of data are skipped before being read or decoded.
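As a concrete (if simplified) sketch, assuming a GeoParquet file with a StateAbbr column (the path is a placeholder): the filter below can be checked against row group min/max stats before any data pages are fetched, so when the file is sorted or clustered by state, most row groups are skipped outright.
-- Predicate pushdown sketch
SELECT COUNT(*)
FROM read_parquet('tracts.parquet')
WHERE StateAbbr = 'CA';  -- evaluated against stats first; only matching row groups are decoded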
Why is this harder in JavaScript land? The browser sandbox has tight memory ceilings and higher overhead per message to web workers, and every random read is a network hop. You can parallelize, but you fight the main thread and GC. That’s why WebAssembly (WASM) shows up everywhere here: it lets us run compiled code (vectorized kernels, column decoders, even SQL engines) at near-native speed inside the browser. Tools like DuckDB-WASM narrow the gap — though you still pay startup costs and you’re still bound by network and memory.
All of that is the backdrop for the tests I ran. If the reader can skip big chunks early, “first geometry decoded” happens fast; if not, partition shape and file size dominate UX. That’s the lens I used going forward: favor strategies that minimize bytes to first features in the browser while not kneecapping analytics.
What I tested and how
I picked U.S. Census tracts (~85k polygons) as the test dataset. It’s big enough that you’d never load it all in the browser (source GeoParquet is ~1.6 GB). Also, I’m using this data in a current project, so it was convenient ¯\_(ツ)_/¯
I compared four partitioning strategies for the same dataset, all written by DuckDB with Snappy compression (a write sketch follows below):
- unpartitioned
- attribute by state (median file size 25 MB)
- H3 (res 3; median file size 1.2 MB)
- hybrid (state + H3 res 3; median file size 0.8 MB)
⚠️ An aside: I tried using higher H3 resolutions to get the median file size closer to an optimal “tile” size (500 KB-ish) but found the sheer number of files from resolution 4+ slowed my analysis queries down so dramatically (> 5 minutes) that it quickly proved unviable.
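For reference, here’s a minimal sketch of how the attribute (state) layout can be written with DuckDB’s partitioned COPY. Paths are placeholders; the H3 and hybrid layouts additionally need an H3 cell column (e.g., from DuckDB’s h3 community extension) to include in PARTITION_BY.
-- Write a Hive-partitioned GeoParquet tree (directories like StateAbbr=CA/)
COPY (SELECT * FROM 'tracts_source.parquet')
TO 'tracts/attribute_state'
(FORMAT PARQUET, PARTITION_BY (StateAbbr), COMPRESSION SNAPPY);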
For analytics, I used DuckDB locally (Apple M2 Pro, 16 GB RAM), reading from S3 over HTTP. The goal on this side was simple: does the partitioning that feels good for maps still feel good for SQL?
For visualization, I ran three browser paths: parquet-wasm+Arrow, then hyparquet+HySnappy, then DuckDB-WASM. Same data, same machine, same network — just different client stacks to see how much work each could skip before drawing anything.
For completeness:
- Storage & access: S3 over HTTPS (no CDN), 500 Mbps connection
- Browser: Chrome (Chromium 139.0.7258.67)
- Key metrics: time-to-first-features (first geometry decoded), total bytes & requests, peak memory; plus DuckDB query time and scanned bytes for the analytics runs
All the code used for these benchmarks can be found on GitHub. Your feedback is appreciated! Full disclosure: Claude 4 Sonnet helped write much of this code (credit where credit is due).
Analytics path: DuckDB against S3
I ran DuckDB locally on my M2 Pro machine, reading Parquet straight from S3 (via httpfs) with the spatial extension loaded. No real performance tuning — all defaults for DuckDB settings and I treated S3 as cold cache.
For partitioned layouts I pointed DuckDB at the whole tree using a wildcard (e.g., s3://…/state=CA/*.parquet or …/h3=…/*.parquet). That lets DuckDB prune partitions from the path (Hive-style) and skip row groups by Parquet stats when the filters are selective. Attribute filters really benefit here; spatial filters still tend to pull geometry bytes unless the partitioning itself is spatially aware.
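Concretely, the setup looked roughly like this (bucket and prefix are placeholders; S3 credentials are configured separately and omitted):
-- Extensions for S3 reads and spatial functions
INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial;
-- hive_partitioning=true lets DuckDB prune directories like StateAbbr=CA/ from the path
SELECT COUNT(*)
FROM read_parquet('s3://my-bucket/tracts/attribute_state/**/*.parquet', hive_partitioning = true)
WHERE StateAbbr = 'CA';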
Below are the exact queries I used, grouped by the behavior they probe.
1) State-level aggregation (tests partition pruning + column projection)
Counts by StateAbbr. On the state-partitioned layout, this prunes hard and never touches the geometry column.
SELECT StateAbbr,
COUNT(*) AS tract_count,
COUNT(*) * 100.0 / (SELECT COUNT(*) FROM '<DATASET_PATH>') AS percentage
FROM '<DATASET_PATH>'
GROUP BY StateAbbr
ORDER BY tract_count DESC
LIMIT 10;
2) Simple spatial filtering (tests geometry IO + light compute)
Filter to California and compute areas. This forces geometry decode, but state-partitioned and hybrid layouts reduce how many files we touch.
SELECT StateAbbr,
Tract,
ST_Area(ST_GeomFromWKB(geometry)) AS area
FROM '<DATASET_PATH>'
WHERE StateAbbr = 'CA'
AND ST_Area(ST_GeomFromWKB(geometry)) > 0.001
ORDER BY area DESC
LIMIT 100;
3) Multi-state analysis (tests pruning across multiple partitions)
A wider IN-list still prunes effectively on attribute-partitioned trees; H3-only layouts tend to touch more files here.
SELECT StateAbbr,
COUNT(*) AS tract_count
FROM '<DATASET_PATH>'
WHERE StateAbbr IN ('CA', 'TX', 'FL', 'NY', 'PA')
GROUP BY StateAbbr
ORDER BY tract_count DESC;
4) Full table scan (baseline)
A control to see raw scan performance. DuckDB can avoid materializing the geometry column here, so it’s mostly reading lightweight columns and/or using row group metadata.
SELECT COUNT(*) AS total_tracts,
COUNT(DISTINCT StateAbbr) AS unique_states
FROM '<DATASET_PATH>';
5) Complex spatial + attribute filter (tests hybrid advantages)
Combined attribute pre-pruning with a geometry predicate. This is where the hybrid (state → H3) is supposed to shine: fewer files touched and fewer heavy rows read.
SELECT StateAbbr,
COUNT(*) AS tract_count,
AVG(ST_Area(ST_GeomFromWKB(geometry))) AS avg_area
FROM '<DATASET_PATH>'
WHERE StateAbbr IN ('CA', 'NV', 'OR', 'WA')
AND ST_Area(ST_GeomFromWKB(geometry)) > 0.001
GROUP BY StateAbbr
ORDER BY avg_area DESC;
I’ll drop the results in the “at a glance” section below. The big things I watched here were scanned bytes, partitions touched, and whether the geometry column showed up in the physical plan at all.
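For that last check, EXPLAIN is enough: the Parquet scan operator lists the columns it projects, so for an aggregation like query 1 the geometry column shouldn’t appear at all (sketch; the path is a placeholder).
EXPLAIN
SELECT StateAbbr, COUNT(*) AS tract_count
FROM read_parquet('s3://my-bucket/tracts/attribute_state/**/*.parquet', hive_partitioning = true)
GROUP BY StateAbbr;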
Results at a glance — Analytics
Top line:
- ⭐️ The attribute (state) layout is the best overall for analytics ⭐️: ~33.9 s total (~6.8 s/query).
- H3-only is the slowest by a mile (~405 s total).
- Hybrid (state→H3) helps when filters mix attribute + spatial, but still pays a small-file tax (~241 s total).
- No-partition remains a strong baseline for broad scans/aggregations (~119.6 s total).
Patterns I see:
- Small-file syndrome hurts analytics. H3/hybrid explode request/metadata overhead on global scans.
- Attribute partitioning is the best compromise here: excellent when filters touch StateAbbr, acceptable elsewhere.
- No-partition can be fastest when geometry isn’t materialized (column projection wins).
- Geometry IO dominates spatial predicates unless the partitioning itself is spatially aware.
- Resources. Peak memory stayed modest (≤ ~60 MB). The differences are mostly I/O and pruning, not RAM limits.
Visualization path: three browser pipelines
I took the same Bay Area bbox and ran it through three clients that represent the common ways teams load GeoParquet in the browser. The point wasn’t to build a perfect app — it was to measure how quickly each path can get first geometry decoded on a realistic partition tree, and what it costs in bytes and memory to get there. The three: Arrow.js + parquet-wasm (full download + parse), Hyparquet + HySnappy (incremental streaming), and DuckDB-WASM (SQL with HTTP range requests).
I kept the test set-up dead simple: serve the static clients locally, open each runner, click a strategy, and export JSON with timing + bytes + memory. (There are CLI scripts too, but the gist is “npm install → npx serve → hit /arrow_client, /hyparquet_client, /duckdb_client.”)
💡 Note: I’m not actually rendering these polygons. Due to the large variety of visualization clients with their own intricacies, I’m focusing just on the process of loading GeoParquet in the browser.
Shared test region
All three clients target the California Bay Area bbox, keeping the workload stable across strategies. Each client saves JSON with HTTP patterns and bytes, “time to first batch” (my proxy for first geometry decoded), memory behavior, and simple spatial-filter efficiency.
Client 1 — Arrow.js + parquet-wasm (download + parse)
This path downloads the target Parquet file(s) and decodes with parquet-wasm, then converts to Arrow for filtering. I skipped the no_partition case (1.6 GB is a non-starter in-browser), but ran attribute_state (CA only), spatial_h3_l3, and hybrid_state_h3 with strategy-specific file lists. Metrics captured download time, parse time, Arrow convert time, and memory.
// Full download, decode with parquet-wasm, convert to Arrow (simplified)
const resp = await fetch(parquetURL);
const buf = new Uint8Array(await resp.arrayBuffer());
const wasmTable = readParquet(buf);                          // parquet-wasm decode
const table = arrow.tableFromIPC(wasmTable.intoIPCStream()); // Arrow JS table
// ...filter rows to Bay Area bbox, then to GeoJSON for rendering
Client 2 — Hyparquet + hysnappy (incremental streaming)
Hyparquet streams row groups via range requests and lets me stop early (once I’ve decoded enough rows to draw). I used it across the same four strategies, leaning on parallel hexagon pulls for the H3 tree. The harness records request counts/bytes, time to first batch, and stream efficiency.
// Hyparquet streaming (simplified)
const asyncBuffer = await createRangeAsyncBuffer(url); // range-request-backed file handle
const readOptions = {
  file: asyncBuffer,
  // Snappy decompression wired in via hysnappy (harness-specific options)
  snappy: hysnappy.uncompress || hysnappy.decompress || hysnappy.decode,
  enableSnappyDecompression: true,
  trackDecompressionMetrics: true
};
const table = await hy.parquetRead(readOptions);
Client 3 — DuckDB-WASM (SQL + HTTP range requests)
DuckDB-WASM executes SQL directly against HTTPS Parquet with range requests. I used it for attribute-filtered single files, UNIONs across H3 partitions, and hybrid paths, all with spatial WHERE clauses. The runner also logs range-request patterns, giving a nice peek at how selective reads actually behave in-browser.
-- Executed inside DuckDB-WASM
SELECT *
FROM 'https://…/hybrid_state_h3/StateAbbr=CA/h3=…/*.parquet'
WHERE ST_Intersects(geometry, ST_MakeEnvelope(-122.6, 37.2, -121.8, 38.0));
Results at a glance — Visualization
What we learn about partition strategies (for visualizations)
- Small partitions definitely help when the client can read them.
- Arrow + Hyparquet had similar levels of performance, but Hyparquet was consistently 0.6–0.75 seconds faster (regardless of data size). I’m not sure why exactly.
- No-partition remains impractical for Arrow (skipped due to size) and Hyparquet (very slow initial read), and slowest for DuckDB-WASM (~33 s).
What we learn about visualization clients
Arrow + parquet-wasm (download → parse)
- Works well when files are tiny: ~2 s for H3/hybrid with ~20 MB on the wire.
- Slows as files grow: attribute ~5.3 s / ~109 MB.
- Can’t handle parsing the giant single file in-browser (no-partition skipped).
Hyparquet + HySnappy
- The fastest of our client libraries: similar to Arrow + parquet-wasm, but with faster initialization.
- Slows as files grow: attribute ~4.9 s / ~109 MB.
DuckDB-WASM (SQL + range requests)
- Best when the tree is selective: hybrid (5.36 s) fastest; H3 (6.48 s) close; attribute (12.2 s) slower; no-partition (32.6 s) slowest.
- (Bytes weren’t logged in this run, but the timing pattern matches the selectivity story.)
⭐ I hadn’t used Hyparquet previously but will definitely reach for it in future projects when time to first feature is key. It’s a bit more involved to set up, but it delivered good end-to-end performance in my tests without the startup costs.
Conclusion
I wanted a single GeoParquet layout that could be the “one true” dataset for both analytics and maps. What I actually found is a familiar trade-off: larger files with categorical partitions (e.g., by state) are fastest for my DuckDB analytics, while smaller, spatially partitioned files (H3 or hybrid) load faster in browser-side parquet clients. On small datasets this tension mostly disappears; on big, polygon-heavy ones it becomes the whole story.
On the analytics side, DuckDB likes fewer, bigger files plus categorical partitions it can prune from the path. With that layout it can skip whole directories (e.g., StateAbbr=CA/), push down projections so geometry isn’t even read for many queries, and scan row groups in parallel. Flip to thousands of tiny spatial files and performance craters — not because DuckDB can’t filter, but because you’ve traded one big, cheap scan for a storm of HTTP opens, footers, and range reads. Essentially, small-file syndrome strikes again.
In the browser, the trade-offs invert. The core UX metric is “how many bytes do I have to move before I can draw my first polygon?” Spatial partitions (H3 or state → H3) keep the working set tiny, which helps every client I tried — Arrow + parquet-wasm, hyparquet+HySnappy, and DuckDB-WASM. Small, targeted files mean fewer decoded geometries, less memory churn, and quicker first features, even when you still need to do a little spatial filtering client-side.
So can I get one simple layout to rule them all? I haven’t found an obvious option yet. A moderate categorical partition with sensibly sized files can be “good enough” for both sides. But as size and complexity grow, the pragmatic answer is usually two tracks: keep an analysis-optimized GeoParquet (categorical partitions, larger files) as the lake’s source of truth, and generate a viz-optimized path (spatial micro-partitions or a derivative like PMTiles) for maps.
But of course, I can’t claim that this review is comprehensive yet. There are many other partitioning strategies that leverage spatial partitioning in smarter ways (I’m very much still learning—please send ideas). I’ll be exploring other options in future blog posts.