Using data version control to build better AI factories

It usually starts the same way. A late night message pops up in the channel:

“Which v5_final dataset did we use to train the model that just failed in production?”

You sigh, open three S3 buckets, and begin the forensic treasure hunt.
If that sounds familiar, you’ve already built a portion of an AI Factory. And you should finish building it.

What is an AI Factory and how is it different from a data center?

Think of it this way: traditional factories transform raw materials into products. AI factories transform raw data into intelligence at scale, operating like an assembly line with familiar stations. GPUs provide compute power, storage systems hold raw materials, training frameworks shape the product, and deployment infrastructure ships it.

An AI Factory operates through four interconnected stations, each critical to the production pipeline:

Compute

The engine room where GPUs and specialized processors deliver the raw computational power needed for AI workloads. Think of it as the heavy machinery that makes everything else possible.

Storage

The warehouse for your raw materials: training datasets, model checkpoints, and feature stores. Modern AI Factories use object storage systems that can handle petabyte-scale data while maintaining high throughput for parallel training jobs.

Training

The assembly line where raw data gets transformed into intelligence. Here, frameworks like PyTorch and TensorFlow work alongside experiment trackers to iterate on models, test hypotheses, and optimize performance.

Deployment

The shipping dock where finished models get packaged, tested, and released into production. This includes model serving infrastructure, monitoring systems, and the APIs that deliver predictions to your applications.

The distinction from traditional data centers also matters. While data centers simply store and process information, AI factories are purpose-built for one thing: producing intelligence as their primary output. Each station has a job, and when they work in concert, success isn’t measured in storage capacity or compute cycles, but in token throughput: the real-time predictions and decisions that drive business value.

This represents a fundamental shift. We’re moving beyond extracting insights from existing data to generating new intelligence. Whether that’s personalized content, automated decisions, or predictive capabilities, the outcome is the same: value at scale. Companies like Uber, Google, and Netflix have already made this transition, turning AI from an R&D initiative into their competitive moat.

The promise is compelling: AI factories optimize your entire AI lifecycle from data ingestion through high-volume inference, delivering performance gains that traditional infrastructure can’t match. Built to handle compute-intensive workloads at scale, they grow with your ambitions. No architectural overhauls. No rip-and-replace nightmares. Just seamless expansion when demand spikes.

And unlike walled gardens that trap your data, modern AI factories embrace open standards. Plug in your favorite tools. Swap components as better options emerge. Your data, your models, your choice of infrastructure.

The shift is fundamental: from science project to production line. From experimental to industrial. From hoping it works to knowing it scales.

But how can teams tell if their AI Factories are running efficiently?

3 new KPIs measure the efficiency and value of AI Factories

Not long ago, leaders of AI programs stopped asking how many GPUs you provisioned and started asking how much intelligence those GPUs generate. The new KPI, according to eminent semiconductor analyst Ben Bajarin, is $/token: AI Factories should be judged by dollars per token of useful inference, a measure that directly links infrastructure spend to model output. To put a finer point on it:

“Infrastructure isn’t overhead anymore – it’s the product.”

AI Factories: Reframing Infrastructure from Cost Center to Profit Center

When Bajarin penned that line, he also handed every AI leader three scoreboard numbers:

1. Cost per token – How much does each prediction cost to generate?
2. Revenue per token – How much value can each prediction carry?
3. Time-to-monetization – How quickly can a new model start paying its way?
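
As a back-of-the-envelope illustration (not from Bajarin's piece), here is how those three numbers fall out of figures most teams already track; the variable names and sample values below are hypothetical:

```python
from datetime import date

# Hypothetical figures for illustration only.
monthly_infra_spend = 250_000.00   # GPUs, storage, networking, tooling ($)
tokens_served = 1_200_000_000      # useful inference tokens produced this month
revenue_attributed = 400_000.00    # revenue attributed to those predictions ($)

cost_per_token = monthly_infra_spend / tokens_served
revenue_per_token = revenue_attributed / tokens_served
margin_per_token = revenue_per_token - cost_per_token

# Time-to-monetization: calendar time from first training run to first paid prediction.
time_to_monetization = date(2025, 3, 14) - date(2025, 1, 6)

print(f"Cost per token:       ${cost_per_token:.8f}")
print(f"Revenue per token:    ${revenue_per_token:.8f}")
print(f"Margin per token:     ${margin_per_token:.8f}")
print(f"Time to monetization: {time_to_monetization.days} days")
```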

Now you know which KPIs to use when evaluating the efficiency and value of your AI Factory. But what factors influence these metrics? And is there anything you can do to improve them?

While some things, like fluctuations in GPU pricing, are out of your control, others are well within your reach and can have a dramatic impact on the quality of inference per token. 

The highest ROI lever you have? Your data infrastructure.

Data infrastructure and its outsized impact on token economics

Most AI teams already have sophisticated monitoring for their models and infrastructure. They can tell you GPU utilization down to the second, but ask them which version of the training dataset powers their production model, and you may get three different answers from three different engineers.

Why is this a problem? Your AI factory will hemorrhage value in three ways. Teams waste days playing detective to reconstruct training datasets, burning engineering hours on archaeology instead of innovation. Risk-averse behavior sets in—why experiment when every model failure triggers a week-long investigation? And when compliance comes calling, models sit in review purgatory because you can’t prove what data influenced which decisions. The result: higher costs, slower innovation, and delayed time-to-market. Every missing data trail is money left on the table.

If you can’t point to, or trace, the exact data that shaped a model, the scoreboard does not move in your favor.

So where exactly does traceability break down on the factory floor? Let’s walk the line and find the weak link.

The ripples from a lack of traceability

Here’s where teams see traceability break down in each station:

  • ML Pipelines can’t reproduce results because they’re pulling from “latest” data that changed overnight
  • Model Training teams can’t compare experiment results because they’re not sure which datasets were actually used
  • Model Registries store models but lose the connection to their training data sources
  • Edge Deployments fail, but rollback strategies only address model versions, not the data that trained them

Just as a physical assembly line sputters or stops when parts can’t be traced, an AI Factory stalls when data is untraceable. But here’s what makes it worse: the failures compound.

The Cascade Effect:

  • Engineers stop experimenting because failed experiments can’t be debugged
  • Innovation velocity drops as teams play it safe with “known good” datasets
  • Technical debt accumulates as workarounds pile on workarounds
  • Compliance audits stretch from days to months as teams scramble to prove model lineage
  • Customer trust erodes when you can’t explain why models made certain decisions
  • Talent leaves for companies that have their data house in order

The Hidden Costs: Beyond the obvious productivity hit, poor traceability creates shadow costs. Teams over-provision compute “just to be safe,” burning more resources than necessary. Models get completely retrained instead of incrementally updated. Data scientists become data janitors, spending 80% of their time on archaeology instead of innovation.

Data versioning: Your AI Factory’s quality control

So, how do you stop the bleeding? The same way software engineering solved this problem decades ago. Remember Git? Data versioning does the same thing for your datasets. Every transformation, every cleaning operation, every augmentation gets stamped with a unique ID—creating an unbreakable chain of custody for your data.

Teams can branch off experimental datasets without breaking production. Merge the winners, discard the losers. Need to know exactly what data trained that problematic model from six months ago? One commit ID reconstructs the entire dataset, transformations and all.
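
Because lakeFS exposes an S3-compatible endpoint, that reconstruction can be done with an ordinary S3 client by addressing objects as repository/ref/path, where the ref is the commit ID. A minimal sketch; the endpoint, credentials, repository name, commit ID, and paths are placeholders:

```python
import boto3

# Point a standard S3 client at the lakeFS S3-compatible gateway
# (endpoint, credentials, repo name, and commit ID below are placeholders).
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

REPO = "training-data"
COMMIT_ID = "c1f8a2..."  # the commit recorded alongside the six-month-old model

# Keys are addressed as <ref>/<path>, so the same code reads any historical version.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=REPO, Prefix=f"{COMMIT_ID}/datasets/train/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```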

Think of data versioning as the barcode and tracking system in a factory. When production needs to ramp up, the tracking system tells you where to source materials. When a defective product ships, it tells you exactly where the problem originated so you can fix it.

In your AI Factory, data versioning enables:

  • Instant rollback – Defective data detected? Revert to any previous version in seconds
  • Parallel production lines – Teams experiment on isolated branches without contaminating the main assembly line
  • Complete traceability – Every dataset has a barcode showing its full history and transformations
  • Efficient operations – Branch and merge massive datasets without duplicating storage

Each commit is a barcode on your assembly line. Branches become routing decisions. This batch goes to experimental training, that one to production. When compliance comes knocking, you don’t scramble through logs. You have a commit ID. Done.
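
A minimal sketch of that barcode-and-routing workflow, assuming the high-level lakeFS Python SDK; the repository, branch names, paths, and exact method signatures are illustrative, so verify them against your installed SDK version:

```python
from pathlib import Path

import lakefs  # high-level lakeFS Python SDK; method names may vary by version

repo = lakefs.repository("training-data")

# Routing decision: this batch goes to experimental training, on its own line.
exp = repo.branch("exp-dedup-filter").create(source_reference="main")

# Stamp the new batch with a commit -- the barcode for this assembly step.
exp.object("datasets/train/part-0001.parquet").upload(
    data=Path("part-0001.parquet").read_bytes()
)
ref = exp.commit(
    message="Re-deduplicated training set with stricter filter",
    metadata={"pipeline_run": "airflow-2024-11-02"},
)
print("commit:", ref)  # the ID you hand to compliance later

# If the experiment wins, merge it back into the main assembly line.
exp.merge_into(repo.branch("main"))
```

A losing experiment is simply a deleted branch; the main line never sees it.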

This isn’t just about compliance. It’s about moving fast without breaking things. About running hundreds of experiments in parallel without data chaos. About knowing that when something works, you can reproduce it exactly. Not “pretty close” or “I think we used this version.”

Data versioning in an AI Factory: Real life example from Lockheed Martin

The best way to understand an AI Factory isn’t through abstract metrics – it’s by walking the production floor. At NVIDIA’s AI Summit in Washington, DC, Lockheed Martin pulled back the curtain on their own factory floor.

Most companies already run half of those stations: Airflow jobs here, a feature store there, a GPU cluster in the corner. Look at the station list again. Storage? Check. Training? Check. Deployment? Check.

But Lockheed Martin’s example shows that, without data versioning, your factory isn’t a factory at all. At best, it’s several disconnected workshops trying to generate intelligence. When Lockheed Martin eventually brought lakeFS into their AI Factory, it addressed challenges in three areas:

Experiment monitoring and data lineage

Thanks to enterprise-scale data version control, Lockheed Martin engineers can now track the complete lineage of every model, capturing which datasets, parameters, and configurations produced each result. This comprehensive audit trail ensures that successful experiments can be reliably reproduced and validated, while failed approaches are documented for organizational learning. For organizations operating across classified and unclassified environments, this level of data provenance is essential for maintaining compliance with federal AI standards.

Security and compliance 

With increasingly stringent regulatory requirements for AI systems, especially in organizations handling sensitive data, AI Factories need robust data governance capabilities. lakeFS provides Lockheed Martin immutable versioning and granular access controls that create an auditable chain of custody for all training data and model artifacts. This enables teams to demonstrate compliance, implement data retention policies, and ensure that sensitive datasets remain isolated – critical requirements when deploying AI across different security domains.

Scalability and collaboration

Modern AI Factories like Lockheed Martin’s must coordinate thousands of engineers and data scientists working on interconnected projects. lakeFS enables this scale by providing isolated development environments through its branching mechanism, allowing teams to experiment safely without impacting production data pipelines. The platform’s open-source foundation aligns with the composable, vendor-agnostic architectures that enterprises require, integrating seamlessly with existing MLOps tools while maintaining the flexibility to adapt as requirements evolve.

Lockheed Martin’s NVIDIA AI Factory. Source: https://www.nvidia.com/en-us/on-demand/session/aisummitdc24-sdc1052/

Building a durable AI Factory

Returning to the new KPIs for an AI Factory, enterprise-scale data version control brings three immediate gains:

  • Cost per token ↓ – How lakeFS moves the needle: data branches create test environments without copying data. Results: lower cost of errors and no duplicate storage costs.
  • Revenue per token ↑ – How lakeFS moves the needle: automated quality controls block bad or incomplete data, raising model quality. Results: 75% fewer data quality issues.
  • Time-to-monetization ↓ – How lakeFS moves the needle: immutable commits tie every model to its training data, so audits and rollbacks take minutes and more models pass regulatory review faster. Results: 80% faster delivery of data products (from weeks to days), getting value-generating models to market sooner.

The implementation itself is surprisingly straightforward. Whether you’re retrofitting a complex system or building fresh, the core actions remain the same: point lakeFS at your object store, create branches for safe experimentation, and add validation hooks to protect production. 
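
lakeFS implements such gates natively as server-side hooks; as a simpler client-side stand-in for the same idea, here is a hedged sketch that validates an experiment branch before merging it to production. The repository, branch names, file path, and schema contract are hypothetical, and the SDK calls should be checked against your installed version:

```python
import io

import lakefs
import pandas as pd

REQUIRED_COLUMNS = {"user_id", "label", "features"}  # hypothetical schema contract

repo = lakefs.repository("training-data")
exp = repo.branch("exp-dedup-filter")

# Pull the candidate dataset from the experiment branch and validate it
# before anything touches the production branch.
raw = exp.object("datasets/train/part-0001.parquet").reader(mode="rb").read()
df = pd.read_parquet(io.BytesIO(raw))

missing = REQUIRED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f"validation failed: missing columns {missing}")
if df["label"].isna().any():
    raise ValueError("validation failed: null labels in training data")

# Only validated data is merged into production.
exp.merge_into(repo.branch("main"))
```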

But the real transformation isn’t in the deployment steps. It’s in what happens to your daily operations once that tracking system goes live.

Life with reliable data versioning

Remember that late-night forensic investigation from the beginning? It’s now a five-minute fix. Here’s what changes when every dataset carries a traceable ID:

When models fail: Pull the commit history, identify the exact dataset version, and roll back. No more archaeology expeditions through S3 buckets.

When audits are needed: Hand a complete chain of custody for every model in production to the auditor. Who changed what data, when, and why? It’s all there in the commit log. Compliance reviews that used to take weeks now take days.

When teams need to collaborate: Isolated branches let everyone experiment safely from the same versioned truth. No collisions. No confusion. No “which dataset version?” debates.

When edge deployments go sideways: Roll back both model and training data to the last known good state. Your factory recalls are surgical, not scattershot.
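
A sketch of that surgical recall, again assuming the hypothetical repository above and the lakeFS Python SDK (revert semantics and method names should be verified against your version): undo the commit that introduced the defective data, then redeploy the model associated with the last known-good data commit.

```python
import lakefs  # method names per the lakeFS Python SDK; verify against your version

repo = lakefs.repository("training-data")
main = repo.branch("main")

BAD_DATA_COMMIT = "9d41e7..."   # the commit that introduced the defective batch
LAST_GOOD_COMMIT = "c1f8a2..."  # the data commit the healthy model was trained on

# Like `git revert`: create a new commit on main that undoes the defective change.
main.revert(BAD_DATA_COMMIT)

# Model and data roll back together: redeploy the artifact your registry
# associates with the last known-good data commit (lookup logic not shown).
print(f"data reverted; redeploy the model tagged with data commit {LAST_GOOD_COMMIT}")
```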

The AI Factory with data versioning doesn’t just run smoother. It fundamentally changes how teams interact with their data. Trust replaces uncertainty. Speed replaces caution. And that infrastructure you’ve been building? It finally starts paying dividends.

Where smart money is going next

Turning infrastructure into a profit center isn’t just about faster GPUs or smarter schedulers. It starts with data infrastructure that turns chaos into confidence, archaeology into innovation, and weeks of debugging into minutes of diagnosis. Put that foundation in place and the rest of the AI factory falls into line. Pipelines, registries, edge deployments, all working in concert to deliver more valuable tokens for fewer dollars.

Ready to build a durable AI Factory?

Getting started with enterprise-scale data version control doesn’t require overhauling your infrastructure. lakeFS integrates directly with your existing data stack, whether you’re using cloud object storage like S3, on-prem object storage like VAST Data, deep learning frameworks like PyTorch or TensorFlow, or experiment trackers like MLflow. The implementation is straightforward: connect lakeFS to your storage, configure access controls, and teams can begin versioning data immediately. Most organizations deploy their first production use case within weeks.
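
For the experiment-tracker side of that stack, one lightweight pattern is to record the lakeFS commit ID on every MLflow run, so any registered model traces straight back to its exact dataset version. The tracking URI, experiment name, repository, and commit value below are placeholders:

```python
import mlflow

# Placeholders: your MLflow server and the lakeFS commit your pipeline read from.
mlflow.set_tracking_uri("http://mlflow.example.com")
mlflow.set_experiment("churn-model")

DATA_COMMIT = "c1f8a2..."  # lakeFS commit ID of the training dataset
DATA_URI = f"s3://training-data/{DATA_COMMIT}/datasets/train/"  # via the lakeFS S3 gateway

with mlflow.start_run():
    mlflow.set_tag("lakefs_commit", DATA_COMMIT)
    mlflow.log_param("training_data_uri", DATA_URI)
    # ... train and log the model as usual; the run now carries its data lineage.
```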

Let’s discuss how data versioning fits into your AI Factory roadmap.

Einat Orr is the CEO and Co-founder of lakeFS, a scalable data version control platform that delivers a Git-like experience to object-storage-based data lakes. She received her PhD in Mathematics from Tel Aviv University in the field of optimization in graph theory. Einat previously led several engineering organizations, most recently as CTO at SimilarWeb.