Show HN: Read, Sort, and Write Parquet Faster Than DuckDB in Java


A tool that reads a Parquet file, sorts it, and writes the result to a destination file. In the benchmark below it outperforms DuckDB on two of the three Java runtimes tested. Written in pure Java using ParquetForge.

Java can outperform highly optimized ETL tools like DuckDB, especially with GraalVM native-image
You don’t need C++ or explicit vectorization (SIMD) to get blazing-fast Parquet performance
ParquetForge is the missing toolkit to read, write and manipulate Parquet

Efficient parallelized sort order computation (bucket sort or quicksort)
Column-wise in-memory reordering
Parallel processing across row groups
Parallel final file assembly that copies blocks of data without reprocessing them

1. Sort Order Computation

A mapping from source row indices to destination row indices is computed based on the sort column in SortOrderer.java:

If the sort column has a narrow range of values (e.g., dates), bucket sort is used.
Otherwise, fastutil's parallel quicksort is used.

In the benchmark case (l_shipdate), bucket sort is used due to the limited range of date values.
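
The two strategies can be sketched as follows. This is a minimal illustration, not the code in SortOrderer.java: the class name, the method names, and the MAX_BUCKETS threshold are all hypothetical, the keys are assumed to be ints (for example, dates as days since the epoch), and the JDK's Arrays.parallelSort stands in for fastutil's parallel quicksort so the example needs no dependencies.

```java
import java.util.Arrays;

/** Sketch: map each source row index to its destination row index in sorted order. */
final class SortOrderSketch {

    // Hypothetical cutoff: key ranges wider than this fall back to a comparison sort.
    static final long MAX_BUCKETS = 1L << 20;

    /** Returns dest, where dest[i] is the sorted position of source row i. */
    static int[] sortOrder(int[] keys) {
        if (keys.length == 0) return new int[0];
        int min = keys[0], max = keys[0];
        for (int k : keys) { min = Math.min(min, k); max = Math.max(max, k); }
        long range = (long) max - min + 1; // long arithmetic avoids int overflow
        return range <= MAX_BUCKETS ? bucketOrder(keys, min, (int) range)
                                    : comparisonOrder(keys);
    }

    /** Counting (bucket) sort: O(n + range) and stable. */
    static int[] bucketOrder(int[] keys, int min, int range) {
        int[] start = new int[range + 1];
        for (int k : keys) start[k - min + 1]++;                   // histogram, shifted by one
        for (int b = 0; b < range; b++) start[b + 1] += start[b];  // prefix sums: first slot per bucket
        int[] dest = new int[keys.length];
        for (int i = 0; i < keys.length; i++) dest[i] = start[keys[i] - min]++;
        return dest;
    }

    /** Fallback: pack (key, row index) into a long and sort the longs in parallel. */
    static int[] comparisonOrder(int[] keys) {
        long[] packed = new long[keys.length];
        for (int i = 0; i < keys.length; i++) packed[i] = ((long) keys[i] << 32) | i;
        Arrays.parallelSort(packed); // orders by key, then by source row index
        int[] dest = new int[keys.length];
        for (int pos = 0; pos < packed.length; pos++) dest[(int) packed[pos]] = pos;
        return dest;
    }
}
```

For keys {5, 3, 5, 1, 3}, both paths yield the mapping {3, 1, 4, 0, 2}: row 3 (key 1) moves to position 0, row 1 (key 3) to position 1, and so on. Rows with equal keys keep their source order.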

2. Column-by-Column Reordering

Parquet is a columnar format, so each column can be processed independently:

Each column is read into memory in the sorted row order
Each column chunk in each row group is read in parallel
The reordered column data is written in parallel to temporary Parquet files, one per row group

See: ColumnReader.java
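
As a rough illustration of column-wise reordering, the sketch below scatters each column into sorted order using a precomputed source-to-destination mapping, processing the columns in parallel. It is an assumption-laden stand-in, not the logic in ColumnReader.java: plain long arrays take the place of decoded Parquet column chunks, and the class and method names are hypothetical.

```java
import java.util.stream.IntStream;

/** Sketch: apply one row permutation to every column independently. */
final class ColumnReorderSketch {

    /** dest[i] is the output position of source row i; returns the reordered column. */
    static long[] reorder(long[] column, int[] dest) {
        long[] out = new long[column.length];
        for (int i = 0; i < column.length; i++) {
            out[dest[i]] = column[i]; // scatter each value to its sorted position
        }
        return out;
    }

    /** Columns share the same mapping and no mutable state, so they can run in parallel. */
    static long[][] reorderAll(long[][] columns, int[] dest) {
        long[][] out = new long[columns.length][];
        IntStream.range(0, columns.length)
                 .parallel()
                 .forEach(c -> out[c] = reorder(columns[c], dest)); // each task writes its own slot
        return out;
    }
}
```

Because the mapping is computed once from the sort column and then reused, every other column is reordered with a single pass and no comparisons.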

Parquet requires that all column chunks in a row group be contiguous in the file. However, the size of each row group is not known until all columns are encoded and compressed.

To work around this:

Each row group is written to a separate temporary file.
After all row groups are written, their dictionary and data pages are concatenated into a single output file.
The Parquet footer metadata is generated to reflect the byte offsets of the row groups.

This approach avoids buffering entire row groups in memory.
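
The concatenation step might look roughly like the sketch below, which copies each temporary file's bytes into the output and records where each one starts. It makes several simplifying assumptions: each temporary file is treated as an opaque blob of page bytes, the copy is sequential rather than parallel, and the footer itself is not written (a real footer is Thrift-encoded metadata built from offsets like these). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Sketch: append row-group blobs to one file and record their starting offsets. */
final class RowGroupConcatSketch {

    /** Returns offsets, where offsets[i] is the byte position at which parts.get(i) begins. */
    static long[] concatenate(List<Path> parts, Path output) throws IOException {
        long[] offsets = new long[parts.size()];
        try (FileChannel out = FileChannel.open(output,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            // Parquet files begin with the 4-byte magic "PAR1".
            ByteBuffer magic = ByteBuffer.wrap("PAR1".getBytes(StandardCharsets.US_ASCII));
            while (magic.hasRemaining()) out.write(magic);

            for (int i = 0; i < parts.size(); i++) {
                offsets[i] = out.position(); // would feed the footer's row-group metadata
                try (FileChannel in = FileChannel.open(parts.get(i), StandardOpenOption.READ)) {
                    long size = in.size();
                    long copied = 0;
                    // transferTo may move fewer bytes than requested, so loop until done.
                    while (copied < size) {
                        copied += in.transferTo(copied, size - copied, out);
                    }
                }
            }
        }
        return offsets;
    }
}
```

Since the bytes are copied without decoding or recompressing, the cost of this step is bounded by I/O throughput. Once every temporary file's size is known, the offsets can also be derived up front, which is what makes a parallel copy possible.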

Time to read, sort, and write a 59-million-row Parquet file

Inspired by this benchmark.

Sorting Software	Time to Sort (SF=10)
DuckDB 1.3.2	20.54s
Parquet Sort (Corretto 24)	18.34s
Parquet Sort (GraalVM 24)	24.47s
Parquet Sort (GraalVM Native)	16.30s

Benchmark Environment
OS: Amazon Linux 2023
Kernel: Linux 6.1.147-172.259.amzn2023.x86_64
Instance: EC2 c7i.4xlarge
16 vCPUs, 32GB RAM
50GB EBS SSD (gp3, 16,000 IOPS, 1000 MiB/s throughput)
Filesystem: XFS