Show HN: Sempress – 2× better compression for numeric data


Research → Open Source → Product

Sempress learns patterns in IoT sensor readings, time-series metrics, and ML features, delivering 50-125% higher compression ratios than gzip. Locked columns are stored losslessly; quantized columns carry a bounded reconstruction error.

5.8× average compression ratio on numeric-heavy data

125% improvement over gzip on IoT telemetry

100% lossless preservation of locked columns

Real-World Performance

Tested on 400K+ rows across IoT sensors, ML features, and financial data

Dataset           Sempress   Gzip
Telemetry (IoT)   8.08×      3.58×
Sensor Physics    5.88×      2.76×
ML Features       5.46×      3.09×
Financial Data    3.80×      2.51×

Results vary by data characteristics. Best for numeric-heavy tables (>60% numeric columns).
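The headline percentages follow directly from the benchmark ratios listed in this section. A minimal sketch of that arithmetic, where the function and variable names are illustrative and not part of Sempress:

```python
# Compression ratios from the benchmark figures: (Sempress, gzip).
RATIOS = {
    "Telemetry (IoT)": (8.08, 3.58),
    "Sensor Physics": (5.88, 2.76),
    "ML Features": (5.46, 3.09),
    "Financial Data": (3.80, 2.51),
}


def ratio_improvement(sempress, gzip):
    """Percent increase in compression ratio relative to gzip."""
    return (sempress / gzip - 1.0) * 100.0


def size_reduction(sempress, gzip):
    """Percent by which the Sempress output is smaller than the gzip output."""
    return (1.0 - gzip / sempress) * 100.0


# Mean of the four Sempress ratios (the "average compression ratio" figure).
average_ratio = sum(s for s, _ in RATIOS.values()) / len(RATIOS)

for name, (s, g) in RATIOS.items():
    print(f"{name}: +{ratio_improvement(s, g):.1f}% ratio, "
          f"{size_reduction(s, g):.1f}% smaller than gzip")
```

A higher ratio and a smaller file are different measures: the financial dataset's ratio is about 51% higher than gzip's, which corresponds to output about 34% smaller than the gzip file.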

Built for Modern Analytics

Reduce storage costs and transfer times for data-intensive workloads

🌐 IoT & Telemetry

Compress sensor data 2× better than gzip. Perfect for industrial IoT, smart cities, and fleet management where millions of numeric readings flow continuously.

🤖 ML Feature Stores

Reduce S3 costs for training data. Store high-dimensional continuous features with near-zero error, enabling efficient model training at scale.

💰 Financial Analytics

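Archive tick data losslessly by locking precision-critical columns or storing residuals; on other columns the reconstruction error is bounded, which helps with compliance requirements. Financial data compressed 3.80× versus 2.51× for gzip in the benchmarks, a roughly 50% higher ratio.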

How It Works

Semantic compression via learned vector quantization

1. Learn Structure

Per-column k-means vector quantization (VQ) learns a codebook of representative values for each numeric column. For example, temperatures cluster around 20-25°C, and prices follow smooth distributions.
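To make the idea concrete, here is a minimal sketch of per-column k-means quantization in pure Python. It is not Sempress's implementation: the function names, the quantile initialisation, and the iteration count are assumptions chosen for illustration.

```python
from bisect import bisect_left


def nearest(centroids, v):
    """Index of the centroid closest to v (centroids must be sorted)."""
    i = bisect_left(centroids, v)
    if i == 0:
        return 0
    if i == len(centroids):
        return len(centroids) - 1
    # Ties go to the lower centroid.
    return i if centroids[i] - v < v - centroids[i - 1] else i - 1


def fit_codebook(values, k, iters=20):
    """1-D k-means (Lloyd's algorithm) for one numeric column."""
    data = sorted(values)
    if not data:
        raise ValueError("column is empty")
    n = len(data)
    # Assumed initialisation: evenly spaced quantiles, duplicates removed.
    centroids = sorted({data[min(n - 1, (2 * i + 1) * n // (2 * k))]
                        for i in range(k)})
    for _ in range(iters):
        sums = [0.0] * len(centroids)
        counts = [0] * len(centroids)
        for v in data:
            j = nearest(centroids, v)
            sums[j] += v
            counts[j] += 1
        # Move each centroid to the mean of its cluster; keep empty ones.
        updated = sorted({sums[j] / counts[j] if counts[j] else centroids[j]
                          for j in range(len(centroids))})
        if updated == centroids:
            break
        centroids = updated
    return centroids


def encode(values, centroids):
    """Replace each value with the index of its nearest centroid."""
    return [nearest(centroids, v) for v in values]


def decode(codes, centroids):
    """Reconstruct approximate values from centroid indices."""
    return [centroids[c] for c in codes]
```

With a codebook of size k, each value is stored as a small integer code instead of a full float, and the reconstruction error for a value is its distance to the nearest centroid.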

2. Preserve Fidelity

Auto-locks strings and categoricals for lossless storage. Optional residuals eliminate quantization error on precision-critical columns.
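One way residuals can remove quantization error is sketched below. It assumes the column is fixed-point (for example, prices with two decimals) and works on scaled integers, because subtracting and re-adding floats is not always exact. The function names and the `scale` parameter are hypothetical, not Sempress's API.

```python
def encode_with_residuals(values, centroids, scale=100):
    """Quantize a fixed-point column and keep exact integer residuals.

    Assumption: every value is representable with the decimal precision
    implied by `scale` (scale=100 means two decimal places).
    """
    ticks = [round(v * scale) for v in values]      # values as integers
    levels = [round(c * scale) for c in centroids]  # codebook as integers
    codes = [min(range(len(levels)), key=lambda j: abs(levels[j] - t))
             for t in ticks]
    # Residual = what the codebook entry misses, stored exactly.
    residuals = [t - levels[c] for t, c in zip(ticks, codes)]
    return codes, residuals, levels


def decode_with_residuals(codes, residuals, levels, scale=100):
    """Codebook entry plus residual recovers the original integer tick."""
    return [(levels[c] + r) / scale for c, r in zip(codes, residuals)]
```

The residuals are typically small integers clustered near zero, so they compress well in the entropy-coding stage while making the column lossless.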

3. Package Smart

Msgpack + Zstd container with uncertainty tracking. Self-describing format includes schema and reconstruction metadata.
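The sketch below illustrates what a self-describing container of this kind might look like. To stay dependency-free it uses the standard library's `json` and `zlib` as stand-ins for Msgpack and Zstd, and every field name (`format`, `schema`, `kind`, `max_abs_error`, and so on) is an assumption rather than the actual Sempress layout.

```python
import json
import zlib


def pack(columns):
    """Serialize columns plus a schema header, then compress.

    `columns` maps a column name to either
      {"kind": "locked", "values": [...]}                  (stored verbatim)
      {"kind": "vq", "codebook": [...], "codes": [...],
       "max_abs_error": float}                             (quantized)
    """
    container = {
        "format": "sketch-v0",  # hypothetical version tag
        "schema": [
            {"name": name,
             "kind": col["kind"],
             # Per-column uncertainty; locked columns have none.
             "max_abs_error": col.get("max_abs_error", 0.0)}
            for name, col in columns.items()
        ],
        "columns": columns,
    }
    # json + zlib stand in for Msgpack + Zstd here.
    return zlib.compress(json.dumps(container).encode("utf-8"))


def unpack(blob):
    """Decompress, then rebuild each column from its stored form."""
    container = json.loads(zlib.decompress(blob).decode("utf-8"))
    table = {}
    for name, col in container["columns"].items():
        if col["kind"] == "locked":
            table[name] = col["values"]
        else:
            table[name] = [col["codebook"][c] for c in col["codes"]]
    return container["schema"], table
```

Because the schema and codebooks travel inside the file, a decoder needs no side-channel information to reconstruct the table or to report how much error each column may carry.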

Research Paper

Sempress: Semantic Compression for Numeric Tabular Data

Traditional compression algorithms treat tabular data as byte streams, ignoring semantic structure. We present Sempress, which achieves 50-125% higher compression ratios than gzip on numeric-heavy datasets through column-wise vector quantization.

Published: January 2025
Authors: Keaton Anderson (Independent Researcher)
License: Open access

Open Source

Install with pip, integrate in minutes

# Install
$ pip install sempress

# Encode
$ sempress encode \
    --in data.csv \
    --out data.smp \
    --lock-cols id,timestamp \
    --k 64

# Decode
$ sempress decode \
    --in data.smp \
    --out data.csv

Ready to compress smarter?

Join the research community building the future of semantic compression
