A year ago, we made an important internal decision which set Earthmover on a new course—we decided to refactor and open source our core technology for storing array-based data in the cloud. This took the form of the Icechunk project, an open source package and specification enabling database-style transactions against massive array datasets using only cloud object storage as infrastructure. Icechunk is an important part of our company mission, benefiting both Earthmover customers and the broader scientific data community.
Today, we’re excited to announce the 1.0 release of Icechunk. This release is a declaration that the Icechunk format is stable, the software is robust and correct at scale, and that we’re committed to maintaining compatibility with any data written from this point forward. If you’ve been waiting to adopt Icechunk because of its beta status, now’s the time: Icechunk is ready for production.
True Database-Like Features for Array Data
Multidimensional arrays (a.k.a. tensors) are a foundational data structure for scientific and technical computing, spanning applications from weather to neuroimaging to fusion simulations. The data in these domains don’t fit into the standard tabular / relational data model and need the richer array-based model. (For more on why, check out Tensors vs Tables.) Many of these fields have already adopted the popular Zarr storage format.
Icechunk works together with Zarr, augmenting the Zarr core data model with features that enhance performance, collaboration, and safety in a cloud-computing context. Version 1.0 solidifies Icechunk’s core value proposition: bringing database-grade reliability to array storage.
Key features include:
Transactional Safety: The key improvement Icechunk brings on top of regular Zarr is the concept of transactions (groups of operations that either succeed or fail together) with serializable isolation between them. This means that Icechunk data is safe to read and write in parallel from multiple uncoordinated processes, making it suitable for use as a production database. (A minimal sketch of this workflow follows this list.)
Efficient Versioning: Icechunk never copies or rewrites your data. The storage used to maintain versions is simply the storage needed for the new data you write in each version. This makes branching, experimentation, and rollbacks practically free, enabling data version control.
Unmatched I/O Performance: Icechunk’s Rust-based I/O layer is designed to fully saturate the connection between compute and storage, enabling the fastest possible data analytics and maximum GPU utilization for demanding AI/ML workloads.
Virtual HDF5, NetCDF, and more: Icechunk can be layered on top of archival file formats like HDF5, NetCDF, GRIB, TIFF, and more, unlocking transactions and cloud-native performance without duplicating data. This massively simplifies the migration of legacy data archives to the cloud.
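Here is a minimal sketch of that transactional workflow using the Icechunk Python API together with Zarr. It assumes a local filesystem repository at a made-up path, and exact function signatures may vary between releases, so treat it as illustrative rather than canonical.

```python
# Minimal sketch: transactional writes and time travel with Icechunk + Zarr.
# Assumes a local filesystem repo at a made-up path; for cloud object storage
# you would build the storage with icechunk.s3_storage(...) instead.
import icechunk
import zarr

storage = icechunk.local_filesystem_storage("/tmp/icechunk-demo")
repo = icechunk.Repository.open_or_create(storage)

# All writes happen inside a session and remain invisible to readers
# until the session is committed as a single transaction.
session = repo.writable_session("main")
group = zarr.group(store=session.store, overwrite=True)
group.create_array(
    "temperature", shape=(365, 721, 1440), chunks=(1, 721, 1440), dtype="f4"
)
snapshot_id = session.commit("initialize temperature array")

# Time travel: open a read-only session pinned to that snapshot.
old = repo.readonly_session(snapshot_id=snapshot_id)
print(zarr.open_group(old.store, mode="r")["temperature"].shape)
```

Because a commit either lands in full or not at all, a crashed or conflicting writer can never leave readers looking at a half-written array.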
The Road to 1.0
Since the initial release back in October, our journey to 1.0 has focused on three areas of development: a) adding features needed to support end-to-end workflows; b) evolving the on-disk format to ensure it scales to anticipated use cases; and c) verifying performance, correctness, and stability through both automated testing and real-world usage.
Some of the important recent developments include:
- Manifest Splitting – With 1.0, users can customize how the chunk manifests are split across files, enabling Icechunk to scale to repos with tens of millions of chunks. For an example at this scale, check out the ERA5 sample dataset below.
- Distributed Writes – Icechunk 1.0 supports large-scale distributed writes with Xarray, Dask, or any similar parallel execution framework, enabling highly scalable data processing pipelines (many TB and beyond).
- Advanced Conflict Detection and Resolution – Icechunk 1.0 can detect conflicts between uncoordinated commits, preventing accidental data corruption, while also offering strategies for resolving the conflicts. (Read about the details in this blog post.)
- Data Expiration and Garbage Collection – Icechunk never overwrites data—as a result, data can accumulate on disk as new versions are created. Icechunk 1.0 enables sophisticated controls over when past versions expire and old data is removed, allowing users to fine-tune their repos to balance storage costs with time travel capabilities; a short sketch of this workflow follows this list. (Read about the details in this blog post.)
- Detailed Open Spec – The Icechunk specification enables other projects (such as Neuroglancer) to interact directly with the underlying on-disk files if needed. (Read the spec.)
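As a rough sketch of how the expiration and garbage collection workflow looks from Python (the method names follow the Icechunk docs at the time of writing; double-check them against your installed version):

```python
# Sketch: expire old snapshots, then garbage-collect unreferenced data.
# Method names (expire_snapshots, garbage_collect) reflect the current docs;
# verify them against your installed Icechunk version before running.
from datetime import datetime, timedelta, timezone

import icechunk

repo = icechunk.Repository.open(
    icechunk.local_filesystem_storage("/tmp/icechunk-demo")
)

# Snapshots older than 30 days become eligible for expiration...
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
expired = repo.expire_snapshots(older_than=cutoff)

# ...and garbage collection then deletes chunks, manifests, and snapshots
# that are no longer reachable from any branch or tag.
summary = repo.garbage_collect(cutoff)
print(f"expired {len(expired)} snapshots; gc summary: {summary}")
```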
We’ve also validated the performance and correctness of Icechunk through a range of complementary approaches:
- Building on Rust’s strong type system and memory safety, which together eliminate entire categories of potential bugs.
- Extensive stateful testing based on the excellent Hypothesis package is used to verify correctness and find edge cases.
- Our pilot with NASA demonstrated the ability of Icechunk to provide 100x speedups in timeseries extraction against existing NetCDF files, enabling cloud-native access to the GPM IMERG dataset stored in S3 without a costly transformation to a new format.
- We explored the scale-out performance of Icechunk in a blog post, achieving throughput of over 230,000 chunk reads per second.
- We created a 30 TB sample dataset from the ECMWF ERA5 weather reanalysis. The dataset is publicly accessible via the Earthmover platform, as described below.
- We’ve migrated all Earthmover customers to Icechunk over the past few months, processing nearly 1 PB of data. Data teams at companies like Sylvera and Pelmorex are using Icechunk in production today.
Icechunk: A Growing Ecosystem
Even during its beta period, we’ve been thrilled to see a growing community of adopters exploring creative applications of this new technology. Some notable adopters and projects include:
- NASA is implementing Icechunk for optimized access to cloud-based data archives, in partnership with Development Seed.
- The team at Dynamical is adopting Icechunk as the format for their analysis-ready, cloud-optimized weather data. In the words of Alden Keefe, CTO of Upstream Tech and co-founder of Dynamical:
We see Icechunk as completing the missing features of Zarr. Our datasets need to be updated and read from simultaneously and Icechunk ensures this is done correctly. We are working on implementing Icechunk for our datasets now.
- Data scientists at the UK Atomic Energy Authority are adopting Icechunk for a petabyte-scale archive of fusion experiment data.
- Neuroglancer has implemented support for Icechunk, enabling sophisticated visualization and exploration of brain imaging data.
- Pranav Sateesh developed ParamLake. Built on Icechunk, ParamLake provides Git-like version control for AI models, enabling collaborative ML development with complete model history, branching, merging, and time travel capabilities.
- Tobias Hölzer developed Smart Geocubes, a high-performance library for intelligent loading and caching of remote geospatial raster data, built with Xarray, Zarr and Icechunk.
- Icechunk has reached 451 stars on GitHub. More importantly, it has 32 contributors, including 24 from outside Earthmover!
We know this is just the beginning. The 1.0 release and the commitment to a stable on-disk format now enable organizations to finally adopt Icechunk in production.
Icechunk in the Earthmover Platform
Icechunk is fully functional in its standalone, open-source form. It’s completely serverless and requires only access to storage (object store or local filesystem). But it’s even better as part of the Earthmover Platform! Our goal is to give our customers everything they need to operationalize Icechunk and build array-based data products in production—without the hassle of managing cloud infrastructure.
Here’s what the platform provides on top of Icechunk’s high-performance on-disk format:
Data Delivery via Flux APIs – Instantly connect your Icechunk data to geospatial APIs such as Web Map Service (WMS), Environmental Data Retrieval (EDR), and OPeNDAP. Effortlessly build interactive visualizations like this one (from a recent webinar).

Interactive Data Catalog – List, browse, and search your Icechunk repos from the web, CLI, or Python client. A central source of truth for all your team’s data breaks down data silos and enhances collaboration.
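For example, here is a quick sketch of browsing the catalog from Python; the list_repos call and the organization name are assumptions for illustration, so check the Arraylake client docs for the exact method names.

```python
# Hypothetical sketch: listing an organization's repos from the Python client.
# The list_repos method and the "my-org" name are assumptions for illustration.
from arraylake import Client

client = Client()
client.login()
for repo in client.list_repos("my-org"):
    print(repo)
```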

Fine-Grained Access Controls – Managing permissions and access in the cloud is hard, and robust data governance and compliance are critical for organizations, especially in regulated industries such as finance, insurance, and energy. With the Earthmover platform, all data reside in your own cloud storage bucket; our platform brokers access via a sophisticated “credential delegation” mechanism, ensuring the right teams can see the right data without compromising performance. This enables you to safely deliver data to internal and external stakeholders.

Automatic Optimization – Arraylake automatically optimizes your Icechunk data for performance and cost savings! Configure a data expiration window and let the platform handle garbage collection, keeping your storage bill in check.

Get Started with our Earthmover ERA5 Sample Dataset
Want to get your hands on some large-scale Icechunk data and see how useful the Earthmover platform is for scientific data teams everywhere? We have prepared a sample from one of the most important and useful weather / climate datasets: the ECMWF ERA5 reanalysis.

What’s inside
- Eighteen single-level variables: 2 m temperature, 10 m winds, surface pressure, soil moisture, cloud cover, snow depth, total-column water vapour, and more.
- Hourly data for every day from 1 January 1975 to 31 December 2024—that’s 438,312 time steps spanning five decades.
- Roughly 30 TB of data stored as about 7.9 million chunks, ready to stream straight from object storage.
Why it matters
- Cloud efficiency – open the dataset on demand and read only the bytes you actually need.
- Spatial-slicing speed – designed for quick access to whole maps or sub-regions.
- Temporal depth – half a century of hourly records is ideal for training ML/AI weather models, validating forecasts, or running climate-trend analyses.
- Standards compliance – CF-1.6 metadata lets the dataset slide straight into your Python/xarray workflow with zero schema wrangling.
You can find more details about the dataset in the Earthmover documentation. To access the data, head over to app.earthmover.io, log in, and start exploring!
Or for a quick start, just follow these instructions:
```python
# first: pip install arraylake icechunk xarray
import xarray as xr
from arraylake import Client

client = Client()
client.login()

repo = client.get_repo("earthmover-public/era5-surface-aws")
ds = xr.open_zarr(
    repo.readonly_session("main").store,
    group="spatial",
    chunks=None,
    zarr_format=3,
    consolidated=False,
)
```
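Once the dataset is open, lazy selections read only the chunks they need. The variable and coordinate names below are assumptions for illustration; inspect ds.data_vars and ds.coords for the real ones.

```python
# Illustrative only: variable and coordinate names are assumptions.
# A single global field at one timestamp reads just that snapshot's chunks.
field = ds["t2m"].sel(time="2024-12-31T12:00").load()

# A regional subset (ERA5 longitudes typically run 0-360, latitude descending).
region = ds["t2m"].sel(
    time="2024-12-31T12:00",
    latitude=slice(60, 30),
    longitude=slice(230, 300),
).load()
```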
What’s Next
Icechunk 1.0 is stable, but there is still lots of exciting development ahead. Check out our Roadmap or open a discussion on GitHub about how we can support your use case.
Ready for Your Workload
Icechunk 1.0 arrives at a pivotal moment. Zarr adoption is surging. Organizations of all kinds and sizes are increasingly choosing Zarr for a diverse array of data products: weather, climate, geospatial, and more. With major platforms like Google Earth Engine developing Zarr support and the new Copernicus Earth Observation Processing Framework (EOPF) leveraging Zarr for storing Level 1 and 2 datasets, Icechunk provides the production-grade foundation this ecosystem needs.
Icechunk 1.0 represents more than a technological milestone—it’s a commitment to stability, performance, and reliability. Whether you’re processing satellite imagery, running ML pipelines on massive datasets, or building the next generation of scientific computing applications, Icechunk provides the transactional, versioned, cloud-native storage layer you need.
The future of array storage is here.
Ready to get started? Visit icechunk.io for documentation and examples, or check out the GitHub repository to dive into the code.