Show HN: Read Sort and Write Parquet Faster Than DuckDB in Java


A tool that reads a Parquet file, sorts it, and writes the result to a destination file. In the benchmark below it is faster than DuckDB on two of the three Java runtimes tested. Written in pure Java using ParquetForge.

  • Java can outperform highly optimized ETL tools like DuckDB, especially with GraalVM native-image
  • You don’t need C++ or explicit vectorization (SIMD) to get blazing-fast Parquet performance
  • ParquetForge is the missing toolkit to read, write and manipulate Parquet

  • Efficient parallelized sort order computation (bucket sort or quicksort)
  • Column-wise in-memory reordering
  • Parallel processing across row groups
  • Parallel final file assembly with copying (and not processing) blocks of data

1. Sort Order Computation

A mapping from source row indices to destination row indices is computed from the sort column. See: SortOrderer.java

In the benchmark case (l_shipdate), bucket sort is used due to the limited range of date values.
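
To illustrate why a limited key range makes bucket sort attractive, here is a minimal single-threaded sketch of computing a source-to-destination row mapping with a counting pass. It is not the code from SortOrderer.java: the class name, method name, and the assumption that keys are plain int values (for example, dates stored as days since the epoch) with a known minimum and maximum are all hypothetical, and the parallelization the project describes is left out.

```java
/** Hypothetical sketch: bucket (counting) sort that yields a row mapping. */
final class SortOrderSketch {

    /**
     * Returns destOf, where destOf[sourceRow] is the row's position in sorted order.
     * Assumes every key lies in [minKey, maxKey] and that the range is small
     * enough to allocate one counter per distinct key value.
     */
    static int[] destinationOf(int[] keys, int minKey, int maxKey) {
        int buckets = maxKey - minKey + 1;

        // Pass 1: count how many rows fall into each bucket (shifted by one slot).
        int[] start = new int[buckets + 1];
        for (int key : keys) {
            start[key - minKey + 1]++;
        }

        // Prefix sum: start[b] becomes the first destination index of bucket b.
        for (int b = 0; b < buckets; b++) {
            start[b + 1] += start[b];
        }

        // Pass 2: hand out destination slots in source order, which keeps
        // rows with equal keys in their original relative order (stable).
        int[] destOf = new int[keys.length];
        for (int row = 0; row < keys.length; row++) {
            destOf[row] = start[keys[row] - minKey]++;
        }
        return destOf;
    }
}
```

Both passes are linear in the number of rows, so the cost is O(rows + distinct key range) rather than the O(rows log rows) of a comparison sort.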


2. Column-by-Column Reordering

Parquet is a columnar format, so each column can be processed independently:

  • Each column is read into memory in the sorted row order
  • Each column chunk in each row group is read in parallel
  • The reordered column data is written into temporary Parquet files one per row group in parallel

See: ColumnReader.java
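
As a rough illustration of the reordering step (not the code in ColumnReader.java), the sketch below applies a source-to-destination row mapping to several in-memory columns in parallel. It assumes, hypothetically, that each column has already been decoded into a long[] and that destOf is a valid permutation; the real tool works on Parquet column chunks and supports more types.

```java
import java.util.stream.IntStream;

/** Hypothetical sketch: reorder decoded columns independently and in parallel. */
final class ColumnReorderSketch {

    /** Scatters one column: the value at source row s lands at destOf[s]. */
    static long[] reorder(long[] column, int[] destOf) {
        long[] out = new long[column.length];
        for (int src = 0; src < column.length; src++) {
            out[destOf[src]] = column[src];
        }
        return out;
    }

    /**
     * Reorders every column with the same mapping. Columns do not depend on
     * each other, so each one can be handled by a different worker thread.
     */
    static long[][] reorderAll(long[][] columns, int[] destOf) {
        long[][] result = new long[columns.length][];
        IntStream.range(0, columns.length)
                 .parallel()
                 .forEach(c -> result[c] = reorder(columns[c], destOf));
        return result;
    }
}
```

Because the mapping is computed once and shared read-only, the per-column work needs no locking; each task writes only to its own output array.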


3. Final File Assembly

Parquet requires that all column chunks in a row group be contiguous in the file. However, the size of each row group is not known until all columns are encoded and compressed.

To work around this:

  • Each row group is written to a separate temporary file
  • After all row groups are written, their dictionary and data pages are concatenated into a single output file
  • The Parquet footer metadata is generated to reflect the byte offsets of the row groups.

This approach avoids buffering entire row groups in memory.
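
The concatenation step can be sketched as follows. This is an assumption-laden illustration rather than the project's implementation: the class and method names are hypothetical, the copy is sequential instead of parallel, and no footer is written. It only shows how already-encoded bytes can be copied block by block with FileChannel.transferTo while recording the byte offset at which each row group starts, which is the information the footer metadata needs. The 4-byte "PAR1" magic at the start of the file is part of the Parquet format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Hypothetical sketch: concatenate per-row-group temp files and record offsets. */
final class ConcatSketch {

    /**
     * Writes the Parquet magic bytes, then appends each temporary file unchanged.
     * Returns offsets, where offsets[i] is the byte position in the output file
     * at which the contents of parts.get(i) begin. The footer is NOT written.
     */
    static long[] concatenate(List<Path> parts, Path output) throws IOException {
        long[] offsets = new long[parts.size()];
        try (FileChannel out = FileChannel.open(output,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {

            ByteBuffer magic = ByteBuffer.wrap("PAR1".getBytes(StandardCharsets.US_ASCII));
            while (magic.hasRemaining()) {
                out.write(magic);
            }

            for (int i = 0; i < parts.size(); i++) {
                offsets[i] = out.position();
                try (FileChannel in = FileChannel.open(parts.get(i), StandardOpenOption.READ)) {
                    long size = in.size();
                    long copied = 0;
                    // transferTo may move fewer bytes than requested, so loop until done.
                    while (copied < size) {
                        copied += in.transferTo(copied, size - copied, out);
                    }
                }
            }
        }
        return offsets;
    }
}
```

Since the bytes are copied without being decoded or recompressed, the cost of this stage is dominated by I/O rather than CPU.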


Time to read, sort, and write a 59-million-row Parquet file

Inspired by this benchmark.

Sorting Software                 Time to Sort (SF=10)
DuckDB 1.3.2                     20.54s
Parquet Sort (Corretto 24)       18.34s
Parquet Sort (GraalVM 24)        24.47s
Parquet Sort (GraalVM Native)    16.30s

Benchmark Environment
OS: Amazon Linux 2023
Kernel: Linux 6.1.147-172.259.amzn2023.x86_64
Instance: EC2 c7i.4xlarge
16 vCPUs, 32GB RAM
50GB EBS SSD (gp3, 16,000 IOPS, 1000 MiB/s throughput)
Filesystem: XFS

Run the Benchmarks yourself!

    git clone https://github.com/Earnix/parquet-sort
    cd parquet-sort
    ./benchmarks.sh

See: benchmarks.sh - generates the input file and executes the sort benchmark.

This benchmark sorts lineitem.parquet as generated by DuckDB's TPC-H benchmark with a scale factor (SF) of 10, the same setup as the benchmark that inspired it.

If you have questions/comments/feedback, I'd love to hear from you. My e-mail is in the git commit log.

Sorting the deprecated int96 column type is not supported.

This project is a performance demo and not intended for production use.
