Since 2021, we’ve published the annual State of Data Engineering Report, which includes a summary of all key categories that directly impact data engineering infrastructure.
In 2025, we see five primary trends influencing the categories covered in this report.
Trend #1: MLOps space is slowly diminishing
The MLOps space is slowly diminishing as the market undergoes rapid consolidation and strategic pivots. Weights & Biases, a leader in this category, was recently acquired by CoreWeave, signaling a shift toward infrastructure-driven AI solutions. ClearML, for example, has pivoted its focus toward GPU optimization, adapting to the growing demand for high-efficiency compute solutions.
Meanwhile, DataChain has transitioned to specializing in LLM utilization, another response to the pull of LLM-centric workloads. Many other MLOps players have either shut down or been absorbed by their customers for internal use, highlighting a fundamental shift in the MLOps landscape.
Trend #2: LLM accuracy, monitoring, and performance solutions are blooming
Offering model accuracy monitoring has been a core category within ML for some time, but in 2024, the focus noticeably shifted toward monitoring the accuracy of LLMs, including the outputs generated by RAG pipelines and autonomous agents.
As the industry evolved, many older tools like Arize AI and Deepchecks pivoted their offerings to address these new challenges, while a wave of new startups emerged specifically to tackle the complexities of LLM evaluation and trustworthiness – think Galileo or Patronus AI. This transition reflects a broader realignment in the AI ecosystem, where ensuring the reliability of generative models has become just as critical as their performance.
Trend #3: AWS Glue is the only way out of catalog vendor lock-in, but for how long?
While BigQuery, Databricks, and Snowflake support the federation of Iceberg REST catalogs in a read-only mode, AWS Glue stands out by enabling both read and write operations when integrated with Databricks and Snowflake.
This capability positions Glue as a powerful, neutral catalog that helps prevent vendor lock-in, unlike proprietary solutions such as Databricks’ Unity Catalog or Snowflake’s internal catalog.
By using Glue as a flexible data access layer, organizations can maintain greater control over their data strategy, ensuring platform interoperability without being tied to a specific vendor’s ecosystem.
Breaking news!
Snowflake just announced read/write federation of the Apache Iceberg REST catalog using Catalog-linked Databases, becoming the first of the Big Three to relax its hold on the catalog. Now we can only wait for the actual GA release, as other federation announcements – for example Databricks – never came to fruition.
Trend #4: Storage providers prioritize performance
In response to the growing demand for ultra-low-latency storage, Google Cloud introduced the GCS Fast Tier, designed to compete directly with AWS’s S3 Express and high-performance storage offerings from providers like CoreWeave.
These advancements reflect a broader trend in which cloud providers and specialized infrastructure companies are racing to meet the storage needs of AI and real-time analytics workloads, emphasizing not just capacity and cost but also access speed and efficiency as key differentiators.
Trend #5: BigQuery is leaving Databricks and Snowflake in the dust
BigQuery, on the market since 2011, has grown dramatically, and Google revealed just how dominant it has become: BigQuery now has five times as many customers as Snowflake and Databricks combined, underlining its strength as a foundational piece of Google Cloud's broader data and AI strategy.
Now that we've got you interested, let's dive into the details of each category. 👀
The State of Data & AI Engineering in 2025
Ingestion
This layer consists of streaming technologies and SaaS services used to build pipelines that continuously move data from operational systems into storage and analytical platforms.
In 2025, ingestion tooling has shifted toward fully managed, event-driven architectures with built-in change data capture (CDC) support. Platforms like Confluent Cloud, Striim, and Materialize lead this space, offering native integrations with databases, cloud object stores, and AI feature stores. Streaming-first ingestion is now the default, with transformation, schema evolution, and observability increasingly embedded at the ingestion layer to support real-time machine learning, LLM fine-tuning, and low-latency RAG pipelines.
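At the heart of CDC-based ingestion is a simple contract: each change event carries an operation type plus before/after row images, which the consumer folds into downstream state. A minimal sketch of that fold, using a Debezium-style envelope (field names follow the common Debezium defaults; real connectors are configurable and far richer):

```python
# Toy consumer of Debezium-style CDC envelopes: fold create/update/delete
# events into a key-value view of the table. Illustrative only.
def apply_change(state: dict, event: dict) -> dict:
    payload = event["payload"]
    # Deletes carry the row image in "before"; creates/updates in "after".
    row = payload["before"] if payload["op"] == "d" else payload["after"]
    key = row["id"]
    if payload["op"] == "d":
        state.pop(key, None)          # delete: drop the row
    else:
        state[key] = payload["after"]  # create/update: upsert the row
    return state

events = [
    {"payload": {"op": "c", "before": None, "after": {"id": 1, "name": "a"}}},
    {"payload": {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "b"}}},
    {"payload": {"op": "d", "before": {"id": 1, "name": "b"}, "after": None}},
]
table = {}
for e in events:
    table = apply_change(table, e)
print(table)  # {} -- the row was created, updated, then deleted
```

Managed platforms handle ordering, schema evolution, and exactly-once delivery around this core loop, which is precisely the work that has moved into the ingestion layer.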
Data Lakes
This layer consists of object storage technologies that serve as data lakes, providing scalable, low-cost repositories for structured, semi-structured, and unstructured data. They form the foundation for analytical engines, machine learning pipelines, and real-time querying.
In 2025, data lake architectures have standardized around open table formats like Apache Iceberg and Delta Lake, decoupling compute from storage and enabling multi-engine interoperability. Cloud providers like AWS, Google Cloud, and Azure continue to optimize their object storage for high-throughput AI and analytics workloads, while vendors like Tabular and Onehouse drive the adoption of lakehouse-native metadata management and transactional consistency at scale.
Metadata Management
Open Table Formats (OTF)
The Open Table Formats category is anchored by three major open-source projects: Apache Hudi, Apache Iceberg, and Delta Lake. Each of these formats provides transactional consistency, schema evolution, and time-travel capabilities on top of object storage, fundamentally transforming raw data lakes into ACID-compliant data warehouses.
In 2025, these formats have become essential for enabling multi-engine compute, simplifying governance, and avoiding vendor lock-in in cloud-native analytics architectures.
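The time-travel capability all three formats share rests on an append-only snapshot log: every commit produces a new immutable snapshot, and readers can pin any past one. A toy in-memory model of the idea (greatly simplified: the real formats track immutable data files in metadata rather than copying table state):

```python
from copy import deepcopy

# Toy model of snapshot-based time travel, the mechanism Hudi, Iceberg,
# and Delta Lake all build on. Not any real format's implementation.
class ToyTable:
    def __init__(self):
        self.snapshots = []   # append-only snapshot log
        self.current = {}     # row_id -> value

    def commit(self, changes):
        self.current = {**self.current, **changes}
        self.snapshots.append(deepcopy(self.current))
        return len(self.snapshots) - 1   # snapshot id

    def read(self, snapshot_id=None):
        if snapshot_id is None:
            return self.current              # latest state
        return self.snapshots[snapshot_id]   # time travel

t = ToyTable()
s0 = t.commit({1: "v1"})
s1 = t.commit({1: "v2", 2: "x"})
print(t.read(s0))  # {1: 'v1'} -- the table as of the first commit
print(t.read())    # {1: 'v2', 2: 'x'}
```

Because old snapshots are never mutated, concurrent readers and writers can proceed without locking each other out, which is what makes ACID semantics on object storage practical.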
Metastores
Metastores provide the critical metadata layer for data lakes, enabling SQL-based querying, transaction management, and schema discovery across open table formats.
BigQuery and Databricks currently support only read-only federation of Iceberg REST catalogs, and while Snowflake has announced read/write federation of the Apache Iceberg REST catalog using Catalog-linked Databases, only time will tell how that announcement plays out against the Big Three's track record. Until then, Glue remains the neutral catalog that prevents vendor lock-in: a game-changer for data strategy flexibility.
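For teams wiring this up from Spark, the Iceberg documentation describes a Glue-backed catalog configured through a handful of properties. A representative fragment (the catalog alias `glue` and the warehouse path are placeholders):

```properties
# Register an Iceberg catalog named "glue" backed by AWS Glue.
spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.warehouse=s3://my-bucket/warehouse
spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```

With this in place, any engine that speaks the Iceberg catalog API can read and write the same tables, which is exactly the neutrality argument made above.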
Read more on our coverage of how Hive’s Metastore is “open” and why the leading candidates to replace it are closed.
Data Version Control, or Git for Data Lakes
This category includes data version control systems that open the door to implementing engineering best practices for data products.
Recent advances in data version control have significantly enhanced the management of data products, particularly with the growing complexity of data pipelines and AI workflows. Solutions like lakeFS are evolving to offer better versioning of large datasets and model artifacts.
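The core semantics, branching, committing, and merging over object paths, can be sketched in a few lines. This is a toy in-memory model to illustrate the "Git for data" idea, not the lakeFS API (which versions pointers to objects in the lake rather than copying data):

```python
# Toy "Git for data" model: branches are cheap forks of a path -> version
# mapping, so experiments are isolated from production until merged.
class ToyRepo:
    def __init__(self):
        self.branches = {"main": {}}  # branch -> {path: object version}

    def create_branch(self, name, source="main"):
        # A fork copies only the mapping, not the underlying objects.
        self.branches[name] = dict(self.branches[source])

    def put(self, branch, path, version):
        self.branches[branch][path] = version

    def merge(self, source, dest):
        self.branches[dest].update(self.branches[source])

repo = ToyRepo()
repo.put("main", "raw/events.parquet", "v1")
repo.create_branch("experiment")                 # isolate changes
repo.put("experiment", "raw/events.parquet", "v2")
print(repo.branches["main"]["raw/events.parquet"])  # v1 -- main untouched
repo.merge("experiment", "main")
print(repo.branches["main"]["raw/events.parquet"])  # v2 -- promoted after validation
```

The payoff is the same workflow engineers already trust for code: experiment on a branch, validate, then promote to production atomically.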
And if you’re ready to test your data versioning skills in the lake, the best place to start is by spinning up a local environment.
Compute
Distributed Compute
This category includes technologies for distributed computation.
In May 2025, the Apache Spark Kubernetes Operator was officially launched as a subproject, bringing support for Spark 3.5+ and the latest Kubernetes capabilities. With a rapid release cadence – v0.1.0 debuting in early May followed by v0.2.0 just weeks later – the Spark community demonstrated a strong and active commitment to the project’s development.
Another notable innovation comes from AWS: SageMaker Unified Studio, a comprehensive data and AI development environment that provides access to organizational data across a range of use cases. It unifies capabilities from AWS services like Amazon EMR, AWS Glue, Amazon Athena, Amazon Redshift, Amazon Bedrock, and Amazon SageMaker AI.
Analytics Engines
This category includes databases that provide analytics capabilities for data analysis.
In 2025, analytics engines are evolving rapidly to meet the demands of AI and real-time processing at scale. Platforms like Presto, Trino, and Apache Flink continue to dominate for interactive querying and stream processing, while tools like Clickhouse and Elastic are expanding into the vector search domain, aligning with the growing need for LLM data management. These engines now serve multiple roles within data architectures, making them more versatile and manageable.
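The vector-search capability these engines are adding boils down to nearest-neighbor ranking by a similarity metric. A brute-force sketch of the scoring semantics in pure Python (the corpus and vectors are made up; engines like ClickHouse and Elastic add approximate indexes such as HNSW on top of exactly this ordering):

```python
import math

# Brute-force cosine-similarity top-k: the semantics behind vector search,
# minus the approximate indexing that makes it fast at scale.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query, corpus, k=2):
    scored = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.0, 0.0], corpus))  # ['doc_a', 'doc_b']
```

Embedding this capability in a general-purpose analytics engine is what lets one system serve both classic aggregations and LLM retrieval workloads.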
Orchestration and Observability
Orchestration
This category includes solutions for designing and managing data pipelines.
In 2025, orchestration and observability solutions for data pipelines are advancing to support increasingly complex, multi-cloud, and AI-driven workflows. With the rise of data mesh and data fabric architectures, orchestration platforms are becoming more flexible, offering better visibility and governance across decentralized data systems.
At the forefront, platforms like Dagster, Prefect, and Flyte are embedding AI to enable context-aware scheduling, anomaly detection, and dynamic DAG generation. These systems leverage metadata, data lineage, and predictive analytics to anticipate failures, optimize resource usage, and automate remediation.
Integrations with LLMs and agent-based frameworks like LangChain and Microsoft AutoGen are pushing boundaries further by enabling natural language-based pipeline creation and autonomous task coordination.
Meanwhile, cloud-native tools such as Azure Data Factory Copilot and Google Cloud’s Vertex AI are democratizing orchestration through conversational interfaces and AI-assisted workflow design. This shift marks a move from static, code-heavy orchestration to more responsive, intelligent, and self-healing systems that adapt in real time to the evolving nature of data and business logic.
Observability
This category includes tools that provide data quality testing and monitoring and that monitor the health of data pipelines.
Solutions like Monte Carlo and WhyLabs are at the forefront of observability, offering real-time monitoring of data quality, lineage, and drift, thereby ensuring the accuracy and reliability of AI models. Notably, the data observability and AI/ML observability spaces are converging, as the positioning of both Monte Carlo and WhyLabs suggests.
Data Science + Analytics Usability
In 2025, the Data Science and Analytics Usability landscape saw rapid innovation, with several of dbt's competitors, such as Transform, Mozart Data, Knoema, and Y42, gaining ground by making data workflows more accessible and collaborative.
These platforms introduced features like visual data modeling, AI-powered query builders, and seamless integration with popular cloud warehouses, allowing less technical users to contribute to data transformation and analytics.
For example, Y42 emphasized end-to-end data orchestration with a low-code approach, while Mozart Data simplified setup and maintenance for smaller teams. This evolution in usability pushed the industry toward more democratized and agile data operations, challenging dbt to evolve beyond its developer-centric model.
End-to-End MLOps tools
The MLOps space is contracting as rapid consolidation and strategic pivots reshape the market. Once crowded with general-purpose platforms promising full-lifecycle support, the field is now thinning as companies either exit, specialize, or get acquired.
For example, Weights & Biases, a pioneer in experiment tracking and model monitoring, was acquired by CoreWeave to strengthen its vertically integrated AI infrastructure. Meanwhile, ClearML shifted its focus from general MLOps tooling to GPU resource optimization, aligning with the growing demand for efficient model training at scale.
These moves reflect a broader trend: organizations increasingly seek tools that deliver specific, high-performance capabilities rather than sprawling end-to-end platforms. As a result, many legacy MLOps vendors have either shut down or been absorbed by their enterprise customers, who prefer to internalize core components. The current market favors leaner, interoperable tools that integrate seamlessly into custom pipelines, signaling a new phase of maturity in operational AI infrastructure.
Data-Centric AI/ML
Data-centric ML tools innovate by treating data as a first-class citizen, not just a static input to models. They automate, optimize, and standardize how data is labeled, cleaned, monitored, and evaluated across the entire ML lifecycle.
Recent innovations focus on intelligent data labeling through weak supervision and active learning (Snorkel, Labelbox), real-time data observability to detect drift and anomalies (WhyLabs, Evidently AI), and robust data versioning systems (lakeFS) that ensure reproducibility across experiments.
The emergence of synthetic data generators (Gretel AI, YData) further addresses challenges like class imbalance and privacy constraints. Collectively, these innovations shift the focus from building ever-more complex models to ensuring the data feeding them is clean, representative, and continuously monitored, leading to more robust and scalable ML systems.
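The weak-supervision idea mentioned above can be illustrated in a few lines: several noisy heuristics ("labeling functions") vote on each example, and their combined vote becomes a training label. This toy majority-vote version is only in the spirit of Snorkel, which actually fits a generative model over labeling-function agreements rather than counting votes; the heuristics and labels here are invented for illustration:

```python
from collections import Counter

ABSTAIN = None

# Three noisy heuristics for a toy support-ticket classifier.
def lf_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_thanks(text):
    return "praise" if "thanks" in text.lower() else ABSTAIN

def lf_angry(text):
    return "complaint" if "!" in text else ABSTAIN

def weak_label(text, lfs=(lf_refund, lf_thanks, lf_angry)):
    votes = [lf(text) for lf in lfs if lf(text) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(weak_label("I want a refund!"))     # complaint (two heuristics agree)
print(weak_label("thanks for the help"))  # praise
```

The data-centric payoff: labels come from cheap, auditable rules that can be rewritten and re-run, instead of from one-off manual annotation.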
Vector Databases
Vector databases have seen significant advancement driven by the increasing demand for high-performance search and retrieval in AI, especially for applications like LLMs and real-time analytics.
For example, to improve retrieval accuracy, Pinecone introduced cascading retrieval, combining dense and sparse vector retrieval methods. This approach, along with new reranking technologies, enhances the performance of AI applications, with reported accuracy improvements of up to 48%.
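One common way to combine dense and sparse signals is a convex combination of the two scores. The sketch below illustrates that general hybrid-scoring idea only; Pinecone's cascading retrieval and reranking pipeline is more involved, and the weight, scores, and documents here are all made up:

```python
# Toy hybrid retrieval scoring: blend a dense (semantic) score with a
# sparse (keyword) score. alpha is a hypothetical tuning weight.
def hybrid_score(dense, sparse, alpha=0.7):
    return alpha * dense + (1 - alpha) * sparse

candidates = {
    "doc_a": {"dense": 0.95, "sparse": 0.10},  # semantically close
    "doc_b": {"dense": 0.55, "sparse": 0.95},  # strong keyword match
}
ranked = sorted(
    candidates,
    key=lambda d: hybrid_score(candidates[d]["dense"], candidates[d]["sparse"]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b'] with alpha=0.7
```

Shifting alpha trades semantic recall against exact-match precision, which is why hybrid systems typically expose it as a per-query knob.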
Notebooks and workflow management
Advances in the LLM space also impacted notebooks. For example, NotebookOS is a GPU-efficient platform designed for interactive deep learning training (IDLT). It employs a replicated notebook kernel design, where each kernel consists of three replicas distributed across separate GPU servers synchronized via Raft. This architecture allows dynamic GPU allocation only during active notebook cell execution, significantly improving GPU utilization and reducing costs associated with idle resources.
Another interesting tool is BISCUIT (JupyterLab Extension prototype), which enhances the user experience in computational notebooks by introducing ephemeral user interfaces that scaffold LLM-generated code.
Google’s NotebookLM evolved into a more interactive AI-assisted note-taking tool as well. In late 2024, it introduced features like “Audio Overviews,” which summarize documents in a conversational, podcast-like format, and interactive AI hosts that users can converse with in real time.
Catalogs, permissions, and governance
Data catalogs provide a central repository for metadata, enabling easier discovery and understanding of data assets. Permissions are becoming more granular and automated to ensure appropriate access control, which is particularly important in the context of AI. Governance frameworks are also evolving to include machine learning and automation for enhanced data quality and compliance.
The data catalog space has evolved rapidly, with tools introducing key innovations that go beyond traditional metadata storage. For example, Atlan has pioneered the concept of an active metadata platform, turning the catalog into a collaborative workspace with embedded context, Git-style versioning, and integrations with modern data tools like dbt and Snowflake. Collibra leads in enterprise-grade governance, offering robust policy enforcement, data stewardship workflows, and compliance tracking, making it a cornerstone for organizations with strict regulatory needs.
Meanwhile, Gravitino introduces a cutting-edge approach by combining data virtualization with cataloging, allowing federated access to distributed data sources and real-time metadata synchronization. These tools reflect the shift from static metadata repositories to intelligent, interconnected platforms that power discovery, governance, and data democratization.
Conclusion
The 2025 State of Data & AI Engineering Report showcases a transformative year marked by consolidation, innovation, and strategic shifts across the data infrastructure landscape.
The MLOps space is contracting as companies pivot toward AI infrastructure and LLM-specific applications, reflecting the changing needs of the market. In contrast, solutions focused on monitoring LLM performance and trustworthiness are rapidly expanding, underscoring the industry’s new priorities.
AWS Glue emerges as a crucial tool in overcoming vendor lock-in, offering unmatched flexibility in data catalog operations. Meanwhile, storage providers are prioritizing ultra-low-latency performance to support demanding AI and analytics workloads. Lastly, BigQuery’s explosive growth positions it well ahead of Databricks and Snowflake, solidifying its role as a central pillar in Google Cloud’s data and AI strategy.
Together, these trends define a data engineering ecosystem that is increasingly AI-driven, performance-focused, and strategically realigned.
Take a look back at previous reports and compare the evolution of the State of Data & AI Engineering:
Einat Orr is the CEO and Co-founder of lakeFS, a scalable data version control platform that delivers a Git-like experience to object-storage-based data lakes. She received her PhD in Mathematics from Tel Aviv University, in the field of optimization in graph theory. Einat previously led several engineering organizations, most recently as CTO at SimilarWeb.