A tool that reads a Parquet file, sorts it by a column, and writes the result to a destination file. It can outperform DuckDB. Written in pure Java using ParquetForge.
- Java can outperform highly optimized ETL tools like DuckDB, especially with GraalVM native-image
- You don’t need C++ or explicit vectorization (SIMD) to get blazing-fast Parquet performance
- ParquetForge is the missing toolkit to read, write and manipulate Parquet
- Efficient parallelized sort order computation (bucket sort or quicksort)
- Column-wise in-memory reordering
- Parallel processing across row groups
- Parallel final file assembly that copies (rather than re-processes) blocks of data
A mapping from source row indices to destination row indices is computed based on the sort column in SortOrderer.java:
- If the sort column has a narrow range of values (e.g., dates), bucket sort is used.
- Otherwise, it uses fastutil's parallel quicksort
In the benchmark case (l_shipdate), bucket sort is used due to the limited range of date values.
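As a rough illustration of the bucket-sort path, here is a minimal sketch (not the project's actual `SortOrderer.java`; names and the exact mapping direction are assumptions). It builds a stable mapping where `order[pos]` is the source row index that lands at sorted position `pos`, using a histogram and prefix sums over the narrow key range:

```java
// Hypothetical sketch of a stable bucket (counting) sort over keys with a
// narrow integer range, e.g. dates encoded as days since epoch.
public final class BucketSortOrder {
    /** Returns order[pos] = source row index that belongs at sorted position pos. */
    static int[] computeOrder(int[] keys, int min, int max) {
        int range = max - min + 1;
        int[] counts = new int[range];
        for (int k : keys) counts[k - min]++;            // histogram per bucket
        int[] offsets = new int[range];                  // prefix sums: first slot per bucket
        for (int b = 1; b < range; b++) offsets[b] = offsets[b - 1] + counts[b - 1];
        int[] order = new int[keys.length];
        for (int i = 0; i < keys.length; i++)            // stable: ties keep source order
            order[offsets[keys[i] - min]++] = i;
        return order;
    }
}
```

This runs in O(n + range) time, which beats comparison sorting when the key range (e.g. a few thousand distinct dates) is small relative to the row count.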
Parquet is a columnar format, so each column can be processed independently:
- Each column is read into memory in the sorted row order
- Each column chunk in each row group is read in parallel
- The reordered column data is written into temporary Parquet files, one per row group, in parallel
See: ColumnReader.java
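The per-column independence can be sketched as follows (an assumption-laden simplification of `ColumnReader.java`: in-memory `long[]` columns stand in for decoded column chunks, and a parallel stream stands in for the tool's row-group parallelism). Each column is permuted by the precomputed sort order independently of the others:

```java
import java.util.Arrays;

// Hypothetical sketch: apply a precomputed sort order to each column
// independently and in parallel, as a columnar format permits.
public final class ColumnReorder {
    /** order[pos] = source row index for sorted position pos. */
    static long[][] reorderAll(long[][] columns, int[] order) {
        return Arrays.stream(columns).parallel()
            .map(col -> {
                long[] out = new long[col.length];
                for (int pos = 0; pos < order.length; pos++)
                    out[pos] = col[order[pos]];          // gather in sorted order
                return out;
            })
            .toArray(long[][]::new);
    }
}
```

Because every column applies the same `order` array, no cross-column coordination is needed beyond computing the mapping once.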
Parquet requires that all column chunks in a row group be contiguous in the file. However, the size of each row group is not known until all columns are encoded and compressed.
To work around this:
- Each row group is written to a separate temporary file
- After all row groups are written, their dictionary and data pages are concatenated into a single output file.
- The Parquet footer metadata is generated to reflect the byte offsets of the row groups.
This approach avoids buffering entire row groups in memory.
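The assembly step might look like the sketch below (an assumption, not the project's actual code): since each temporary file's size is known up front, the byte offset of every row group can be computed first, and the copies can then proceed in parallel with `FileChannel.transferTo`, which can use zero-copy kernel paths. The returned offsets are what the footer metadata would reference.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical sketch of parallel final assembly: compute each row group's
// destination offset from the temp-file sizes, then copy all parts in parallel.
public final class Assembler {
    static long[] assemble(List<Path> parts, Path out) throws IOException {
        long[] offsets = new long[parts.size()];
        long total = 0;
        for (int i = 0; i < parts.size(); i++) {
            offsets[i] = total;                          // offsets known before any copy
            total += Files.size(parts.get(i));
        }
        FileChannel.open(out, StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING).close();
        IntStream.range(0, parts.size()).parallel().forEach(i -> {
            try (FileChannel src = FileChannel.open(parts.get(i), StandardOpenOption.READ);
                 FileChannel dst = FileChannel.open(out, StandardOpenOption.WRITE)) {
                dst.position(offsets[i]);                // each task writes its own region
                long size = src.size(), done = 0;
                while (done < size) done += src.transferTo(done, size - done, dst);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        return offsets;                                  // feed these into the footer metadata
    }
}
```

A real implementation would also prepend the `PAR1` magic bytes and append the rewritten footer; this sketch only shows the copy-without-reprocessing idea.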
Inspired by this benchmark.
| Tool | Time |
| --- | --- |
| DuckDB 1.3.2 | 20.54s |
| Parquet Sort (Corretto 24) | 18.34s |
| Parquet Sort (GraalVM 24) | 24.47s |
| Parquet Sort (GraalVM Native) | 16.30s |
Benchmark Environment
OS: Amazon Linux 2023
Kernel: Linux 6.1.147-172.259.amzn2023.x86_64
Instance: EC2 c7i.4xlarge
16 vCPUs, 32GB RAM
50GB EBS SSD (gp3, 16,000 IOPS, 1000 MiB/s throughput)
Filesystem: XFS
See: benchmarks.sh - generates the input file and executes the sort benchmark.
This benchmark sorts lineitem.parquet as generated by DuckDB's TPC-H benchmark at scale factor (SF) 10, identical to the benchmark that inspired this project.
If you have questions/comments/feedback, I'd love to hear from you. My e-mail is in the git commit log.
Sorting the deprecated int96 column type is not supported.
This project is a performance demo and not intended for production use.


