Rationale for this change
I have been working on improving Parquet's deduplication efficiency for content-addressable storage (CAS) systems. These systems generally use some kind of content-defined chunking (CDC) algorithm, which is better suited to uncompressed, row-major formats. However, thanks to Parquet's unique features I was able to reach good deduplication results by chunking the data pages consistently, maintaining a gearhash-based chunker for each column.
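For intuition, here is a minimal, byte-level sketch of the general gearhash CDC idea: a rolling hash is updated per byte from a random lookup table, and a chunk boundary is cut whenever the low bits of the hash are zero. This is only an illustration; the implementation in this PR operates on column data pages and additionally applies min/max chunk size limits and a normalization factor, which the sketch omits.

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random table
MASK = (1 << 13) - 1  # 13 zero bits -> ~8 KiB expected average chunk size


def gear_cdc_boundaries(data: bytes):
    """Yield cut points where the rolling gear hash matches the mask."""
    h = 0
    for i, byte in enumerate(data):
        # shift out old history, mix in the new byte via the gear table
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK) == 0:
            yield i + 1
            h = 0


# Because boundaries depend only on local content, inserting a few bytes at the
# front shifts the early chunks but the later cut points re-synchronize, which
# is what lets a CAS deduplicate the unchanged chunks.
data = random.randbytes(1 << 18)
shifted = b"x" * 10 + data
print(list(gear_cdc_boundaries(data))[:5])
print(list(gear_cdc_boundaries(shifted))[:5])
```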
Deduplication efficiency
The feature enables efficient data deduplication for compressed Parquet files on content-addressable storage (CAS) systems such as the Hugging Face Hub. A purpose-built evaluation tool is available at https://github.com/kszucs/de; it was used during development to continuously check the improvements and to visually inspect the results. Please take a look at the repository's README to see how different changes made to Parquet files affect the deduplication ratio when they are stored in CAS systems.
Some results calculated on all revisions of datasets.parquet
❯ de stats /tmp/datasets
Writing CDC Parquet files with ZSTD compression
100%|███████████████████████████████████████████████████████████████████████████████████| 194/194 [00:12<00:00, 15.73it/s]
Writing CDC Parquet files with Snappy compression
100%|███████████████████████████████████████████████████████████████████████████████████| 194/194 [00:10<00:00, 17.95it/s]
Estimating deduplication for Parquet
Estimating deduplication for CDC ZSTD
Estimating deduplication for CDC Snappy
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ ┃ ┃ ┃ Compressed Chunk ┃ ┃ Compressed Dedup ┃ Transmitted XTool ┃
┃ Title ┃ Total Size ┃ Chunk Size ┃ Size ┃ Dedup Ratio ┃ Ratio ┃ Bytes ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ Parquet │ 16.2 GiB │ 15.0 GiB │ 13.4 GiB │ 93% │ 83% │ 13.5 GiB │
│ CDC ZSTD │ 8.8 GiB │ 5.6 GiB │ 5.6 GiB │ 64% │ 64% │ 6.0 GiB │
│ CDC Snappy │ 16.2 GiB │ 8.6 GiB │ 8.1 GiB │ 53% │ 50% │ 9.4 GiB │
└────────────┴────────────┴────────────┴──────────────────────┴─────────────┴─────────────────────┴──────────────────────┘
Some results calculated on all revisions of food.parquet
❯ de stats /tmp/food --max-processes 4
Writing CDC Parquet files with ZSTD compression
100%|█████████████████████████████████████████████████████████████████████████████████████| 32/32 [10:28<00:00, 19.64s/it]
Writing CDC Parquet files with Snappy compression
100%|█████████████████████████████████████████████████████████████████████████████████████| 32/32 [08:11<00:00, 15.37s/it]
Estimating deduplication for Parquet
Estimating deduplication for CDC ZSTD
Estimating deduplication for CDC Snappy
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ ┃ ┃ ┃ Compressed Chunk ┃ ┃ Compressed Dedup ┃ Transmitted XTool ┃
┃ Title ┃ Total Size ┃ Chunk Size ┃ Size ┃ Dedup Ratio ┃ Ratio ┃ Bytes ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ Parquet │ 182.6 GiB │ 148.0 GiB │ 140.5 GiB │ 81% │ 77% │ 146.4 GiB │
│ CDC ZSTD │ 107.1 GiB │ 58.0 GiB │ 57.9 GiB │ 54% │ 54% │ 66.2 GiB │
│ CDC Snappy │ 176.7 GiB │ 79.6 GiB │ 77.2 GiB │ 45% │ 44% │ 101.0 GiB │
└────────────┴────────────┴────────────┴──────────────────────┴─────────────┴─────────────────────┴──────────────────────┘
Chunk Size shows the actual storage required to store the CDC-chunked Parquet files in a simple CAS implementation. The (Compressed) Dedup Ratio columns express that (compressed) chunk size as a fraction of the total size, so lower is better.
What changes are included in this PR?
A new column chunker implementation based on a CDC algorithm; see the docstrings for more details. The implementation is added to the C++ Parquet writer and exposed in PyArrow as well.
Are these changes tested?
Yes. Tests have been added for the C++ implementation as well as for the exposed PyArrow API.
Are there any user-facing changes?
There are two new Parquet writer properties on the C++ side:
- enable_content_defined_chunking() to enable the feature
- content_defined_chunking_options(min_chunk_size, max_chunk_size, norm_factor) to provide additional options
There is a new pq.write_table(..., use_content_defined_chunking=) keyword argument to expose the feature on the Python side.
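A minimal usage sketch of the new keyword, assuming it accepts either True (use the default chunking parameters) or a dict of options mirroring the C++ content_defined_chunking_options(min_chunk_size, max_chunk_size, norm_factor); the exact accepted values, option names, and defaults may differ, so please refer to the docstrings. The chunk sizes below are illustrative only.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": list(range(100_000)),
    "value": [str(i) for i in range(100_000)],
})

# Enable content-defined chunking with the default parameters.
pq.write_table(table, "cdc_defaults.parquet", use_content_defined_chunking=True)

# Assuming the keyword also accepts explicit options corresponding to the
# C++ content_defined_chunking_options(min_chunk_size, max_chunk_size, norm_factor).
pq.write_table(
    table,
    "cdc_tuned.parquet",
    use_content_defined_chunking={
        "min_chunk_size": 256 * 1024,   # illustrative value
        "max_chunk_size": 1024 * 1024,  # illustrative value
        "norm_factor": 0,               # illustrative value
    },
)
```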
I marked all user-facing changes as EXPERIMENTAL.