Today’s organizations don’t just use a single data storage solution – they operate across on-prem servers, multiple cloud providers, and hybrid environments. This distributed approach has become necessary, but it comes with significant costs: teams struggle with siloed tools, duplicated processes, and an endless cycle of environment management that diverts focus from delivering actual value.
At lakeFS, we believe your data infrastructure should accelerate innovation, not create friction. That’s why we’re excited to introduce Multiple Storage Backends support – a new capability that unifies data management across all your storage systems.
The Problem: Distributed Data, Disconnected Management
Whether you’re operating across AWS S3, Azure Blob Storage, and MinIO, separating dev and prod environments, or maintaining storage-specific copies of data for compliance or performance, the reality is this:
You’re managing multiple storage systems.
Until now, each lakeFS instance was tightly coupled to a single storage backend. That forced users to either deploy and operate multiple instances or leave valuable datasets unmanaged in their silos.
In our deep dive on distributed data management challenges, we outlined how fragmentation increases operational burden, slows delivery, and complicates collaboration. However, managing distributed data shouldn’t mean giving up visibility, consistency, or control.
The Solution: One lakeFS, Multiple Storage Backends
With Multiple Storage Backends support, lakeFS now allows you to register and manage data stored in multiple storage systems – all from a single lakeFS installation.
This means:
- A single lakeFS instance can version data across multiple object stores and control access to all of them through one interface.
- You can create repositories on any of your storage backends, each backed by a single system, and manage them all through a unified API.
- You can now build reproducible pipelines and enforce standards across storage boundaries.
What You Gain by Managing Multiple Storage Systems with lakeFS
| Benefit | Description |
| --- | --- |
| Unified Data Access | Interact with data across AWS, Azure, GCP, and any S3-compatible system using one API and unified namespace. One interface, one set of tooling – just lakefs://. This unified access model simplifies workflows, reduces context switching, and creates a seamless developer experience, regardless of where the data lives. |
| Centralized Governance | Enforce policies, access controls, and audits across all storage systems using lakeFS RBAC and hooks, eliminating the need to duplicate access control logic in every cloud account. |
| Lineage Across Storage Systems | Track data lineage from ingestion to production seamlessly, even as data flows between different storage environments. |
| Lower Operational Overhead | Managing fewer lakeFS instances translates to lower maintenance requirements and a cleaner, more efficient architecture. |
Example: Managing a Multi-Cloud Data Lake with lakeFS
Imagine your organization stores data across three cloud providers:
- Raw data in AWS S3
- Refined data in Google Cloud Storage
- Sensitive models in Azure Blob Storage
Traditionally, managing this kind of multi-cloud setup would require three separate lakeFS instances – each siloed, with its own configuration, governance, and policies.
With Multiple Storage Backends support, you can now manage all three from a single lakeFS instance. Each backend is registered as a separate blockstore, and each storage environment is versioned via its own lakeFS repository.
Multiple Storage Backends Configuration Example
Here’s how you’d connect these storage systems to a single lakeFS instance:
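Below is a minimal sketch of what the server configuration could look like. The per-store settings (type, s3.region, gs.credentials_file, azure.storage_account) are standard lakeFS blockstore options, but the surrounding blockstores/stores layout, the store IDs, and the signing key shown here are illustrative assumptions: multiple storage backends is a lakeFS Enterprise capability, so follow the configuration guide referenced below for the authoritative schema and credential setup.

```yaml
# lakeFS server configuration (illustrative sketch; verify against the configuration guide)
blockstores:
  signing:
    secret_key: "replace-with-a-strong-secret"     # placeholder
  stores:
    - id: aws-raw                                  # illustrative store ID
      type: s3
      s3:
        region: us-east-1
    - id: gcp-curated                              # illustrative store ID
      type: gs
      gs:
        credentials_file: /secrets/lakefs-gcs.json # placeholder path
    - id: azure-models                             # illustrative store ID
      type: azure
      azure:
        storage_account: examplestorageaccount     # placeholder
        storage_access_key: "<storage-access-key>"
```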
Once configured, each storage backend becomes a fully supported blockstore in lakeFS. You can create repositories on top of them, apply version control, and manage data access and governance – all through a single, unified lakeFS instance.
For full setup instructions and advanced options, refer to the official configuration guide.
Creating Repositories
Once lakeFS is connected to your storage systems, you can create repositories to start managing data in each one.
Here’s how to do it using the lakeFS UI:
Step 1: Select the storage backend you want to use
Step 2: Enter the repository name and details
For example, let’s create a raw-data repository for raw data ingested by a data pipeline. We’ll store this data on the AWS S3 backend we previously configured.
Note: Each lakeFS repository is tied to a single storage system. In the next section, we’ll explore strategies for tracking and versioning data across multiple repositories connected to different backends.
To continue our setup, we will also create a second repository, curated-data, which uses the GCS blockstore for refined datasets.
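If you prefer to script these steps instead of clicking through the UI, here is a minimal sketch using the high-level lakeFS Python SDK (the lakefs package). It assumes your lakeFS endpoint and credentials are already configured, for example via environment variables or ~/.lakectl.yaml, and the bucket names are placeholders. Depending on your lakeFS version, pointing a repository at a specific configured backend may also involve an explicit storage identifier, so check the configuration guide for the exact parameters.

```python
import lakefs

# Assumes lakeFS endpoint/credentials are configured (environment variables or ~/.lakectl.yaml).
# Bucket names and prefixes below are placeholders.

# Repository for raw data, backed by the S3 blockstore configured earlier
raw = lakefs.repository("raw-data").create(
    storage_namespace="s3://example-raw-bucket/repos/raw-data",
    default_branch="main",
    exist_ok=True,
)

# Repository for refined data, backed by the GCS blockstore
curated = lakefs.repository("curated-data").create(
    storage_namespace="gs://example-curated-bucket/repos/curated-data",
    default_branch="main",
    exist_ok=True,
)
```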
At this point, we’ve created two repositories:
- raw-data (on S3)
- curated-data (on GCS)
Centralized Access Control
With our repositories in place, the next step is to manage who can do what using lakeFS Role-Based Access Control (RBAC).
lakeFS RBAC serves as a centralized access control layer, allowing you to define and enforce fine-grained permissions across all connected storage systems – without duplicating or syncing roles in each cloud provider.
Here’s how to use it:
Step 1: Define Groups
We define two groups:
- DataPlatform: responsible for ingesting and transforming data
- DataScience: focused on exploration and model development
The DataPlatform team needs full read/write access to both repositories.
The DataScience team, however, should have read-only access to the raw-data repository.
Step 2: Create Policies & Assign Permissions
We create the following lakeFS policy to restrict the DataScience team’s access:
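A policy along these lines would do it. The policy ID and resource ARN below are illustrative; lakeFS policies use fs:* action names and arn:lakefs:fs::: resource ARNs, and the exact patterns are documented in the RBAC reference.

```json
{
  "id": "DataScienceReadRawOnly",
  "statement": [
    {
      "effect": "allow",
      "action": ["fs:Read*", "fs:List*"],
      "resource": "arn:lakefs:fs:::repository/raw-data*"
    }
  ]
}
```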
Then, we attach the policy to the DataScience group.
We attach the pre-configured FSFullAccess policy to the DataPlatform group, granting it full permissions across all repositories.
Building Cross-Storage Workflows
Now that we’ve connected multiple storage systems to lakeFS, let’s see what it’s like to actually work across them. The following example uses the high-level lakeFS Python SDK to build a simple transformation workflow that:
- Reads raw data stored in AWS S3
- Writes refined results to GCS
- Tracks lineage between the two using commit metadata
All of this happens without needing to worry about which underlying storage system is being used.
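Here is a sketch of that workflow with the two repositories we created above. The object paths and the transformation itself are placeholders, and lakeFS credentials are assumed to be configured in the environment.

```python
import csv

import lakefs

# Both repositories live in the same lakeFS installation,
# even though one is backed by S3 and the other by GCS.
raw_repo = lakefs.repository("raw-data")          # AWS S3 backend
curated_repo = lakefs.repository("curated-data")  # GCS backend

# Work in isolation on a branch of the curated repository
branch = curated_repo.branch("refine-events").create(
    source_reference="main", exist_ok=True
)

# Read raw data; the lakefs:// addressing hides the underlying store
raw_main = raw_repo.branch("main")
with raw_main.object("events/2024-01-01.csv").reader(mode="r") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    rows = list(reader)

# Placeholder transformation: keep only successfully processed events
refined = [row for row in rows if row.get("status") == "ok"]

# Write the refined results to the GCS-backed repository
with branch.object("refined/events/2024-01-01.csv").writer(mode="w") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(refined)

# Commit with metadata that records lineage back to the source commit in raw-data
branch.commit(
    message="Refine events for 2024-01-01",
    metadata={
        "source-repository": "raw-data",
        "source-commit": raw_main.get_commit().id,
    },
)
```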
Whether your raw data lives in S3 and your analytics layer in GCS, or vice versa, lakeFS gives you a consistent interface, seamless branching and committing, and built-in lineage tracking.
You can go one step further: now that refined data is available in the curated-data repository, workflows built around the sensitive models stored in Azure Blob Storage can consume it directly through lakeFS, alongside the models themselves. This sets the stage for reproducible, cross-cloud ML pipelines, all under one control plane.
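For instance, a training or evaluation job could pin both inputs, the refined dataset from the GCS-backed repository and a model artifact from a hypothetical Azure-backed sensitive-models repository, to specific commits and read them through the exact same API (repository name and object paths below are illustrative):

```python
import lakefs

curated_repo = lakefs.repository("curated-data")     # GCS backend
models_repo = lakefs.repository("sensitive-models")  # hypothetical Azure-backed repository

# Pin both inputs to specific commits for reproducibility
data_commit = curated_repo.branch("main").get_commit().id
model_commit = models_repo.branch("main").get_commit().id

dataset = curated_repo.ref(data_commit).object("refined/events/2024-01-01.csv")
model = models_repo.ref(model_commit).object("models/churn/v3/model.pkl")

with dataset.reader(mode="r") as data_f, model.reader(mode="rb") as model_f:
    # ... feed both into your training or evaluation code ...
    pass
```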
Getting Started
Want to try out lakeFS Enterprise with Multiple Storage Backends support? Contact us to get access!
Tal Sofer is a product manager at Treeverse, the company behind lakeFS, an open-source platform that delivers a Git-like experience to object-storage-based data lakes. Tal is a former engineering manager who led engineering teams building scalable tools for developers and started her journey at Treeverse as an R&D team lead. Tal holds a B.Sc. in Computer Science and Chinese studies from the Hebrew University of Jerusalem. In her free time you can find her running, cooking, or brushing up on her Chinese.