Over the past few months, I’ve seen a growing number of posts on social media promoting the idea of a “zero-copy” integration between Apache Kafka and Apache Iceberg. The idea is that Kafka topics could live directly as Iceberg tables. On the surface it sounds efficient: one copy of the data, unified access for both streaming and analytics. But from a systems point of view, I think this is the wrong direction for the Apache Kafka project. In this post, I’ll explain why.
Zero-copy is a bit of a marketing buzzword. I prefer the term shared tiering. The idea behind shared tiering is that Apache Kafka tiers colder data to an Apache Iceberg table instead of tiering closed log segment files. It’s called shared tiering because the tiered data serves both Kafka and data analytics workloads.
The idea has been popularized recently in the Kafka world by Aiven, with a tiered storage plugin for Apache Kafka that adds Iceberg table tiering to the tiered storage abstraction inside Kafka brokers.
Before we can understand shared tiering, though, we need to understand the difference between tiering and materialization:
Tiering is about moving data from one storage tier (and possibly storage format) to another, such that both tiers are readable by the source system and data is only durably stored in one tier. While the system may use caches, only one tier is the source of truth for any given data item. Usually it’s about moving data to a cheaper long-term storage tier.
Materialization is about making data of a primary system available to a secondary system, by copying data from the primary storage system (and format) to the secondary storage system, such that both data copies are maintained (albeit with different formats). The second copy is not readable by the source system, as its purpose is to feed another data system. Copying data to a lakehouse for access by various analytics engines is a prime example.
There are two types of tiering:
Internal Tiering is data tiering where only the primary data system can access the various storage tiers. For example, Kafka tiered storage is internal tiering. These internal storage tiers form the primary storage as a whole.
Shared Tiering is data tiering where one or more data tiers is shared between multiple systems. The result is a tiering-materialization hybrid, serving both purposes. Tiering to a lakehouse is an example of shared tiering.
Two approaches to populating Iceberg from Kafka
There are two broad approaches to populating an Iceberg table from a Kafka topic:
Internal Tiering + Materialization
Shared Tiering
Option 1: Internal Tiering + Materialization, where Kafka continues to use traditional tiered storage of log segment files (internal tiering) and also materializes the topic as an Iceberg table (such as via Kafka Connect). Catch-up Kafka consumers will be served from tiered Kafka log segments, whereas compute engines such as Spark will use the Iceberg table.
Option 2: Shared Tiering (zero-copy), where Kafka tiers topic data to Iceberg directly. The Iceberg table will serve both Kafka brokers (for catch-up Kafka consumers) and analytics engines such as Spark and Flink.
On the surface, shared tiering sounds attractive: one copy of the data, lower storage costs, no need to keep two copies of the data in sync. But the reality is more complicated.
Zero-copy shared tiering appears efficient at first glance as it eliminates duplication between Kafka and the data lake. However, rather than simply taking one cost away, it shifts that cost from storage to compute.
By tiering directly to Iceberg, brokers take on substantial compute overhead as they must both construct columnar Parquet files from log segment files (instead of simply copying them to object storage), and they must download Parquet files and convert them back into log segment files to serve lagging consumers.
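To make the asymmetry concrete, here is a minimal sketch in Python (pyarrow and boto3) contrasting the two tiering paths. The decode_segment helper is hypothetical, and real brokers do this work in the JVM, but the shape of the work is the same: the log-segment path is a byte-for-byte upload, while the Parquet path must decode every record, pivot rows into columns, and encode and compress a new file.

```python
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")


def decode_segment(segment_path: str) -> list[dict]:
    """Hypothetical helper: deserialize every record value in a closed log
    segment into a Python dict (e.g. via an Avro or JSON deserializer)."""
    ...


def tier_segment_as_is(segment_path: str, bucket: str, key: str) -> None:
    """Classic tiered storage: the closed log segment is copied byte-for-byte.
    No decoding, no re-encoding; the cost is almost purely network I/O."""
    s3.upload_file(segment_path, bucket, key)


def tier_segment_as_parquet(segment_path: str, bucket: str, key: str) -> None:
    """Shared tiering: every record is decoded, pivoted into a columnar
    layout, then encoded and compressed as Parquet -- CPU- and memory-heavy
    work that now runs on a partition leader."""
    records = decode_segment(segment_path)          # decode every record value
    table = pa.Table.from_pylist(records)           # row -> column pivot in memory
    pq.write_table(table, "/tmp/segment.parquet",   # encode + compress
                   compression="zstd")
    s3.upload_file("/tmp/segment.parquet", bucket, key)
```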
Richie Artoul of WarpStream blogged about why tiered storage is such a bad place to put Iceberg tiering work:
“First off, generating parquet files is expensive. Like really expensive. Compared to copying a log segment from the local disk to object storage, it uses at least an order of magnitude more CPU cycles and significant amounts of memory.
That would be fine if this operation were running on a random stateless compute node, but it’s not, it’s running on one of the incredibly important Kafka brokers that is the leader for some of the topic-partitions in your cluster. This is the worst possible place to perform computationally expensive operations like generating parquet files.” – Richard Artoul, The Case for an Iceberg-Native Database: Why Spark Jobs and Zero-Copy Kafka Won’t Cut It
Richard is pretty sceptical of tiered storage in general, a sentiment I don’t entirely share, but where we agree is that Parquet file writing is far more expensive than log segment uploads.
Things can get worse (for Kafka) when we consider optimizing the Iceberg table for analytics queries. I recently wrote about how open table formats optimize query performance.
“Ultimately, in open table formats, layout is king. Here, performance comes from layout (partitioning, sorting, and compaction) which determines how efficiently engines can skip data… Since the OTF table data exists only once, its data layout must reflect its dominant queries. You can’t sort or cluster the table twice without making a copy of it.” – Beyond Indexes: How Open Table Formats Optimize Query Performance
Layout, layout, layout. Performance comes from pruning, pruning efficiency comes from layout. So what layout should we choose for our shared Iceberg tables?
Offset-based layout (good for Kafka, bad for analytics)
To serve Kafka consumers efficiently, the data must be laid out in offset order. Each Parquet file would contain contiguous ranges of offsets of one or more topic partitions. This is ideal for Kafka as it needs only to fetch individual Parquet files or even row groups within files, in order to reconstruct a log segment to serve a lagging consumer. Best case is a 1:1 mapping from log-segment to Parquet file; otherwise read amplification grows quickly.
For analytics queries, however, this layout provides no useful pruning information, leading to large table scans. A query like WHERE EventType = 'click' must read every file, since each one contains a random mixture of event types. Min/max column statistics on columns such as EventType are meaningless for predicate pushdown, so you lose all the performance advantages of Iceberg’s table abstraction.
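As a minimal sketch of what the offset-ordered layout looks like, here is some illustrative Iceberg DDL issued from PySpark. It assumes a Spark build with the Iceberg runtime and SQL extensions and a catalog named lake; all table and column names are illustrative.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime and SQL extensions are on the classpath and a
# catalog named "lake" is configured; names here are illustrative.
spark = SparkSession.builder.appName("offset-layout-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE lake.kafka.clicks_raw (
        kafka_partition INT,
        kafka_offset    BIGINT,
        event_type      STRING,
        payload         BINARY
    )
    USING iceberg
    PARTITIONED BY (kafka_partition)  -- each data file holds a contiguous offset range
""")

# The Kafka read path is happy: one partition, one offset range, a few files.
spark.sql("""
    SELECT payload FROM lake.kafka.clicks_raw
    WHERE kafka_partition = 3 AND kafka_offset BETWEEN 1000000 AND 1100000
""")

# The analytics query is not: every file contains a mix of event types, so the
# event_type min/max stats prune nothing and the whole table is scanned.
spark.sql("SELECT count(*) FROM lake.kafka.clicks_raw WHERE event_type = 'click'")
```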
Analytics-based layout (good for analytics queries, bad for Kafka)
An analytics-optimized layout, by contrast, partitions and sorts data by business dimensions, such as EventType, Region and EventTime. Files are rewritten or merged to tighten column statistics and make pruning effective. That makes queries efficient: scans touch only a handful of partitions, and within each partition, min/max stats allow skipping large chunks of data.
With this analytics-optimized layout, a lagging Kafka consumer is going to cause a Kafka broker to have a really bad day. Offset order no longer maps to file order. To fetch an offset range, the broker may have to fetch dozens of Parquet files (as the offset range is spread across them), because the partition scheme and sort order were chosen for the analytics workload. The read path becomes highly fragmented, resulting in massive read amplification. All of this downloading, scanning and segment reconstruction happens inside the Kafka broker, and it costs a lot of CPU and memory.
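For contrast, here is a hedged sketch of the analytics-friendly layout, under the same assumptions as the previous sketch (Iceberg SQL extensions, a catalog named lake, illustrative names). The pruning is great; the catch-up read path is not.

```python
from pyspark.sql import SparkSession

# Same assumptions as the previous sketch: Iceberg runtime + SQL extensions,
# a catalog named "lake", illustrative names throughout.
spark = SparkSession.builder.appName("analytics-layout-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE lake.analytics.clicks (
        event_time      TIMESTAMP,
        event_type      STRING,
        region          STRING,
        kafka_partition INT,
        kafka_offset    BIGINT,
        payload         BINARY
    )
    USING iceberg
    PARTITIONED BY (days(event_time), event_type)
""")

# Cluster rows within files so min/max stats on region and event_time prune well.
spark.sql("ALTER TABLE lake.analytics.clicks WRITE ORDERED BY region, event_time")

# Analytics queries now touch only a handful of partitions and files...
spark.sql("""
    SELECT count(*) FROM lake.analytics.clicks
    WHERE event_type = 'click' AND event_time >= date_sub(current_date(), 7)
""")

# ...but a catch-up consumer's offset range is now scattered across many files
# in many partitions, so the broker must download and scan far more data.
spark.sql("""
    SELECT payload FROM lake.analytics.clicks
    WHERE kafka_partition = 3 AND kafka_offset BETWEEN 1000000 AND 1100000
""")
```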
So what do we do?
Normally I would say: given two equally important workloads, optimize for the workload that is less predictable, in this case, the lakehouse. The Kafka workload is sequential and therefore predictable, so we can use tricks such as read-aheads. Analytics workloads are far less predictable, slicing and dicing the data in different ways, so optimize for that.
But this is simply not practical for Apache Kafka. We can’t ask it to reconstruct sequential log segments from Parquet files that have been completely reorganized into a different data distribution. The best we could do is some half-way house, firmly leaning towards Kafka’s needs. Basically, partitioning by ingestion time (hour or date) so we preserve the sequential nature of the data and hope that most analytical queries only touch recent data.
The alternative is to use the Kafka Iceberg table as a staging table for incrementally populating a cleaner (silver) table, but then haven’t we just made a copy anyway?
In practice, the additional compute and I/O overhead can easily offset the storage savings, leading to a less efficient and less predictable performance profile overall. How we optimize the Iceberg table is critical to the efficiency of both the Kafka brokers and the analytics workload. Without separation of concerns, each workload is trapped by the other.
Layout is not the only problem; the schema causes trouble too. The first issue is that the schema of the Kafka topic may not even be suitable for the Iceberg table, for example, a CDC stream with before and after payloads. When we materialize these streams as tables, we don’t materialize them in their raw format; we use the change information to materialize the current state of the data. But shared tiering requires that we write everything: every column, every header field and so on. For CDC, we’d then run some kind of MERGE operation from this topic table into a useful table (sketched below), which reintroduces a copy.
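As an illustration, this is roughly the kind of MERGE that turns a raw, verbatim CDC table into something queryable. It is a sketch only: it assumes Debezium-style op/before/after columns and the same Iceberg-enabled Spark session as the earlier sketches, and all names are illustrative.

```python
from pyspark.sql import SparkSession

# customers_cdc_raw is the tiered topic stored verbatim (op / before / after,
# Debezium-style); customers is the table people actually want to query.
spark = SparkSession.builder.appName("cdc-merge-sketch").getOrCreate()

spark.sql("""
    MERGE INTO lake.silver.customers AS t
    USING lake.kafka.customers_cdc_raw AS s
    ON t.id = coalesce(s.after.id, s.before.id)
    WHEN MATCHED AND s.op = 'd' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.name = s.after.name, t.email = s.after.email
    WHEN NOT MATCHED AND s.op <> 'd' THEN
        INSERT (id, name, email) VALUES (s.after.id, s.after.name, s.after.email)
""")
```

In practice you would merge only the latest increment of changes, in change order, but whichever engine runs it, this is a second copy of the data plus the compute to keep it up to date.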
But it is evolution that concerns me more.
When materializing Kafka data into an Iceberg table for long-term storage, schema evolution becomes a challenge. A Kafka topic’s schema often changes over time as new fields are added and others are renamed or deprecated, but the historical data still exists in the old shape. A Kafka topic partition is an immutable log where events remain in their original form until expired.
If you want to retain all records forever in Iceberg, you need a strategy that allows old and new events to coexist in a consistent, queryable way. We need to avoid cases where either queries fail on missing fields or historical data becomes inaccessible to Kafka. Let’s look at two approaches to schema evolution of tables that are tiered Kafka topics.
Schema evolution approaches
The Uber-Schema (Superset) Approach
One common solution is to build an uber-schema, which is essentially the union of all fields that have ever existed in the Kafka topic schema. In this model, the Iceberg table schema grows as new Kafka fields are introduced, with each new field added as a nullable column. Old records simply leave these columns null, while newer events populate them. Deprecated fields remain in the schema but stay null for modern data. This approach preserves history in its original form, keeps ingestion pipelines simple, and leverages Iceberg’s built-in schema evolution capabilities.
The trade-off is that the schema can accumulate many rarely used columns, and analysts must know which columns are meaningful for which time ranges. Views can help, performing the necessary coalesces and the like to turn a messy set of columns into something cleaner, but they require careful maintenance.
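As a minimal sketch (same Iceberg-enabled Spark session assumptions as before, illustrative names), the uber-schema approach boils down to appending nullable columns as the topic schema evolves and hiding the column creep behind a view, assuming your catalog supports views.

```python
from pyspark.sql import SparkSession

# Same Iceberg-enabled Spark session assumptions; illustrative names.
spark = SparkSession.builder.appName("uber-schema-sketch").getOrCreate()

# v2 of the producer schema renamed customer_id to account_id and added a
# channel field: both are appended as nullable columns, old rows stay null.
spark.sql("""
    ALTER TABLE lake.kafka.orders_raw
    ADD COLUMNS (account_id string, channel string)
""")

# A view papers over the mess for analysts (assuming the catalog supports
# views; otherwise a temporary view serves the same purpose).
spark.sql("""
    CREATE OR REPLACE VIEW lake.kafka.orders AS
    SELECT
        coalesce(account_id, customer_id) AS account_id,
        coalesce(channel, 'unknown')      AS channel,
        amount,
        event_time
    FROM lake.kafka.orders_raw
""")
```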
The Migrate-Forward Approach
The alternative is to periodically migrate old data forward into the latest schema. Instead of carrying every historical field forever, old records are rewritten so that they conform to the current schema.
Missing values may be backfilled with defaults, derived from other fields, or simply set to null.
Columns that are no longer used are dropped.
This limits the messiness of the table’s schema over time, turning it from a shameful mess of nullable columns into something that looks reasonable. While this strategy produces a tidier schema and simplifies querying, it can be complex to implement. No vendor that I have seen supports this approach, leaving it as a job for their customers, who know their data best.
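A hedged sketch of the migrate-forward approach using plain Iceberg SQL (same assumptions and illustrative names as above); real migrations are usually more involved, deriving values rather than simply copying or defaulting them.

```python
from pyspark.sql import SparkSession

# Same Iceberg-enabled Spark session assumptions, continuing the orders_raw
# example above; names are illustrative.
spark = SparkSession.builder.appName("migrate-forward-sketch").getOrCreate()

# Backfill the new column from the deprecated one for historical rows...
spark.sql("""
    UPDATE lake.kafka.orders_raw
    SET account_id = customer_id
    WHERE account_id IS NULL
""")

# ...then drop the deprecated column once no reader depends on it.
spark.sql("ALTER TABLE lake.kafka.orders_raw DROP COLUMN customer_id")

# The rewritten rows are no longer identical to what was produced to Kafka --
# exactly the fidelity problem discussed below.
```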
Enter the trap
From the lakehouse point of view, we’d prefer to use the migrate-forward approach over the long term, but use the uber-schema in the shorter term. Once we know that no more producers are using an old schema, we can migrate the rows of that schema forward.
But that migration can cause a loss of fidelity for the Kafka workload. What gets written to Kafka may not be what gets read back. This could be a good thing or a bad thing. Good if you want to go back and reshape old events to more closely match the newest schema. Really bad for some compliance and auditing purposes; imagine telling your stakeholders or auditors that you can’t guarantee that the data read from Kafka is the same data that was written to Kafka!
The needs of SQL-writing analysts and the needs of Kafka conflict. Kafka wants fidelity (and doesn’t care about column creep) and Iceberg wants clean tables.
In some cases we might get a reprieve: if Kafka clients only need to consume 7 or 30 days of data, we can use the superset method for the period that covers Kafka and the migrate-forward method for the rest. But we are still coupling the needs of Kafka clients to the needs of lakehouse clients. And if Kafka only needs 7 days of data while Iceberg stores years’ worth, why do we care about data duplication anyway?
Zero-copy may reduce data duplication but does not necessarily reduce cost. It shifts cost from storage to compute and operational overhead. When Kafka brokers handle Iceberg files directly, they assume responsibilities that are both CPU and I/O intensive: file format conversion, table maintenance, and reconstruction of ordered log segments. The result is higher broker load and less predictable performance. Any storage savings are easily offset by the increased compute requirements.
On the subject of data duplication, there are reasons to believe that Kafka Iceberg tables may degenerate into staging tables, due to the unoptimized layout and the proliferation of columns over time as schemas change. Using them as staging tables eliminates the duplication argument.
But even so, there is already plenty of duplication going on in the lakehouse. We already have copies. We have bronze then silver tables. We have all kinds of staging tables already. How much impact does the duplication avoidance of shared tiering actually make?
Also consider:
With a good materializer, we can write to silver tables directly rather than treating tiered Kafka data as a bronze table (especially in the CDC case), which itself reduces data duplication.
Usually, Kafka topic retention is only a few days, or up to a month, so the scope for duplication is limited to the Kafka retention period.
Bidirectional fidelity is a real concern when converting between formats like Avro and formats like Parquet, which have different type systems and evolution rules. Storing the original Kafka bytes as a column in the Iceberg table is a legitimate approach as an insurance against conversion issues (but involves a copy).
The deeper problem with zero-copy tiering is its erosion of boundaries and the coupling that results. When Kafka uses Iceberg tables as its primary storage layer, the boundary between Kafka’s storage and the lakehouse disappears.
Who is responsible for the Iceberg table? If Kafka doesn’t take on ownership, then Kafka becomes vulnerable to the decisions of whoever owns and maintains the Iceberg table. After all, the Iceberg table will host the vast majority of Kafka’s data! Who’s on call if tiered data can’t be read? Kafka engineers? Lakehouse engineers? Will they cooperate well?
The mistake of this zero-copy thing is that it assumes that Kafka/Iceberg unification requires a single physical representation of the data. But unification is better served as logical and semantic unification when the workloads differ so much. A logical unification of Kafka and Iceberg allows for the storage of each workload to remain separate and optimized. Physical unification removes flexibility by adding coupling (the traps). Once Kafka and Iceberg share the same storage, each system constrains the other’s optimization and evolution. The result is a single, entangled system that must manage two storage formats and be tuned for two conflicting workloads, making both less reliable and harder to operate.
This entanglement also leads to Kafka becoming a data lakehouse manager, which I believe is a mistake for the project. Once Kafka stores its data in Iceberg, it must take ownership of that data. I don’t agree with that direction. There are both open-source and SaaS data lakehouse platforms which include table maintenance, as well as access and security controls. Let lakehouses do lakehouse management, and leave Kafka to do what it does best.
Materialization isn’t perfect, but it decouples concerns and avoids forcing Kafka to become a lakehouse. The benefits of maintaining storage decoupling are:
Lakehouse performance optimization doesn’t penalize the Kafka workload or vice-versa.
Kafka can use a native log-centric tiering mechanism that does not overly burden brokers.
Schema evolution of the lakehouse does not impact Kafka topics, giving the lakehouse more flexibility and leading to cleaner tables.
There is less risk and overhead, as there is no bidirectional conversion from Kafka to Iceberg and back again. 100% fidelity is a concern for all system builders working in this space.
Materialized data can be freely projected and transformed, as it does not form the actual Kafka data. Kafka does not depend on the materialized data at all.
Apache Kafka doesn’t need to perform lakehouse management itself.
The drawback is that we need two copies of the data, but hopefully I’ve argued why that isn’t the biggest concern here. Duplication is already a fact of life in modern data platforms.
Tools such as Kafka Connect and Apache Flink already exist for materializing Kafka data in secondary systems. They move data across boundaries in a controlled, one-way flow, preserving clear ownership and interfaces on each side. Modern lakehouse platforms provide managed tables with ingestion APIs (such as Snowpipe Streaming, the Databricks Unity Catalog REST API, and others). Kafka Connect can work really well with these ingestion APIs. Flink also has sinks for Iceberg, Delta, Hudi and Paimon. We don’t need Kafka brokers to do this work, let alone perform table maintenance on top of it.
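As a rough illustration of how lightweight the materialization route can be, this is the general shape of registering an Iceberg sink through the Kafka Connect REST API. It is a sketch: the connector class and the iceberg.* property names are assumptions based on the Apache Iceberg Kafka Connect sink, so verify them against the documentation of whichever sink you deploy.

```python
import json
import urllib.request

# Sketch only: the connector class and iceberg.* keys below are assumptions
# to verify against the documentation of the Iceberg sink you deploy.
connector = {
    "name": "orders-to-iceberg",
    "config": {
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "topics": "orders",
        "iceberg.tables": "lake.kafka.orders_raw",
        "iceberg.catalog.type": "rest",
        "iceberg.catalog.uri": "https://catalog.example.com",
    },
}

# POST /connectors is the standard Kafka Connect REST call for registering
# a new connector on a running Connect cluster.
request = urllib.request.Request(
    "http://connect.example.com:8083/connectors",
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(request)
```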
Unifying operational and analytical systems doesn’t mean merging their physical storage into one copy. It’s about logical and semantic unification, achieved through consistent schemas and reliable data movement, not shared files. It is tempting to put everything into one cluster, but separation of concerns is what keeps complex systems comprehensible, performant and resilient.