SEE – Searchable JSON Compression Beyond Zstd


combined ≈ 19.5% • lookup p50 ≈ 0.18 ms • skip ≈ 99%

Why it matters: SEE reduces both the data tax (storage/egress) and the CPU tax (decompression/parsing) by keeping JSON searchable while compressed. It may not always be smaller than Zstd, but searchability, low I/O, and random access lead to better TCO/ROI for many workloads.

① Download (Release) ② OnePager (ROI) ③ Try in 10 minutes

Enterprise / NDA inquiry: private contact form. Under NDA, a full VDR pack is available. Please provide a company email (no confidential data required).


  • Schema-aware JSON compression: combines structure × delta × Zstd (+ Bloom / Skip) to stay searchable while compressed, with page-level random access.
  • Design trade-off: favors low I/O & low latency (ms) and ~99% skip rate over minimal size.
  • Combined size: ≈19.5% of raw
  • Lookup present (ms): p50 ≈ 0.18 / p95 ≈ 0.28 / p99 ≈ 0.34
  • Skip ratio: present ≈ 0.99 / absent ≈ 0.992, Bloom density ≈ 0.30
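SEE's on-disk format is not published in this post, but the Bloom/Skip mechanism behind the ~99% skip rate can be illustrated in principle: attach a small Bloom filter to each compressed page, and consult it before touching the page. The class and page layout below are hypothetical, not SEE's actual implementation.

```python
import hashlib

class PageBloom:
    """Toy per-page Bloom filter: test key presence before reading a page."""
    def __init__(self, bits=1024, hashes=4):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, key):
        # Derive `hashes` deterministic bit positions from the key.
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        # No false negatives; false positives possible at higher density.
        return all(self.array >> p & 1 for p in self._positions(key))

    def density(self):
        # Fraction of bits set -- the "Bloom density" metric above.
        return bin(self.array).count("1") / self.bits

# One filter per compressed page: pages whose filter reports "absent"
# are skipped without being decompressed or parsed.
pages = []
for page_keys in (["user_id", "ts"], ["user_id", "error"], ["ts", "host"]):
    bf = PageBloom()
    for k in page_keys:
        bf.add(k)
    pages.append(bf)

hits = [i for i, bf in enumerate(pages) if bf.might_contain("error")]
```

Because Bloom filters never produce false negatives, every page that truly contains the key is in `hits`; the density metric bounds how often an absent key falsely triggers a page read.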

Savings/TB = (1 − 0.195) × Price_per_GB × 1000
Example: $0.05/GB → ≈ $40/TB; $0.25/GB → ≈ $200/TB
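The savings formula above translates directly into a few lines of Python (the function name is illustrative, not part of the SEE package):

```python
def savings_per_tb(combined_ratio, price_per_gb):
    """Storage savings in $/TB: (1 - ratio) * price_per_GB * 1000 GB/TB."""
    return (1.0 - combined_ratio) * price_per_gb * 1000.0

savings_per_tb(0.195, 0.05)  # -> 40.25, i.e. ~$40/TB at $0.05/GB
savings_per_tb(0.195, 0.25)  # -> 201.25, i.e. ~$200/TB at $0.25/GB
```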


python samples/quick_demo.py

Prints compression ratio, skip rate, Bloom density, and lookup latency (p50/p95/p99).

Demo package (Release v0.1.0):

  • Includes Python wheel, .see files, demo scripts, metrics, and OnePager PDF.

  • Reproducible on Windows / macOS / Linux.

  • Verify integrity using:

    pwsh tools/verify_checksums.ps1 # or manually check SHA256SUMS.txt
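On platforms without PowerShell, the manual check can be done with Python's standard library, assuming `SHA256SUMS.txt` uses the common `<hex digest>  <filename>` line format (a sketch, not the shipped `verify_checksums.ps1`):

```python
import hashlib
from pathlib import Path

def verify_checksums(sums_file="SHA256SUMS.txt"):
    """Compare each '<hex>  <filename>' line against the file's actual SHA-256."""
    results = {}
    for line in Path(sums_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        actual = hashlib.sha256(Path(name.strip()).read_bytes()).hexdigest()
        results[name.strip()] = (actual == expected.lower())
    return results

# Usage: run from the unpacked Release ZIP directory.
# all(verify_checksums().values()) should be True for an intact download.
```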

KPI (demo): combined ≈ 19.5%, lookup p50 ≈ 0.18 ms, skip ≈ 99%, bloom ≈ 0.30. Trade-off: not always smaller than Zstd, but stays searchable while compressed, cutting I/O and CPU costs.


  • Zstd-only can be smaller, but not searchable; you still pay I/O + CPU to decompress and parse JSON.
  • SEE trades a small size increase for millisecond lookups and page-level random access, reducing I/O and CPU — resulting in better TCO.

  • Q. Will it ever be larger than Zstd? A. Sometimes yes; in return you get ms lookups and ~99% skipping. For I/O/CPU-bound workloads, TCO decreases.

  • Q. Best-fit data? A. Repetitive JSON/NDJSON such as logs, events, telemetry, and metrics.

  • Q. How long to reproduce? A. About 10 minutes using the included Demo ZIP.

  • Q. Why not build a separate index? A. Separate indexes add extra I/O, space, and consistency risk. SEE keeps searchability inside the storage format, reducing random I/O and parsing overhead.

  • Q. How to tune for different data? A. Adjust Bloom density (default ≈0.30, works best in 0.25–0.55). Demo prints all metrics for validation.
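For intuition on why the 0.25–0.55 density band matters: in standard Bloom-filter theory, the false-positive probability is approximately the bit density raised to the number of hash probes. The probe count of 4 below is an assumption for illustration; SEE's actual parameters are not stated in this post.

```python
def bloom_fp_rate(density, hashes=4):
    """Approximate Bloom false-positive probability: density ** hashes.
    `density` is the fraction of set bits; `hashes` (assumed 4) is the
    number of probes per lookup."""
    return density ** hashes

# Lower density -> fewer wasted page reads; higher density -> smaller filters.
for d in (0.25, 0.30, 0.55):
    print(f"density {d:.2f} -> fp ~ {bloom_fp_rate(d):.4%}")
```

At the default density of ~0.30 this gives a false-positive rate under 1% per probed page, which is consistent with the ~99% skip rates reported by the demo.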


What’s included in the Release ZIP

  • Python Wheel (.whl)
  • Demo scripts: samples/quick_demo.py, samples/quick_bench.py (prints KPIs)
  • OnePager (PDF) and metrics/ summaries
  • Integrity check script: tools/verify_checksums.ps1
  • README_FIRST.md — concise reproduction guide

Note: The GitHub Discussions “Enterprise (NDA)” category is public. Do not post confidential information or emails there — use the private form above.


Optional: For reproducibility or citation

If you reproduce benchmarks or use SEE in your research, please cite:

SEE (Semantic Entropy Encoding) https://github.com/kodomonocch1/see_proto

  1. Clone and run the 10-min demo to verify KPIs.
  2. Read the OnePager (ROI) for TCO and savings formulas.
  3. For enterprise evaluation under NDA, submit your company email via the private form.



