Stream processing engines are hard to build. But as stream processors have advanced over the last decade, it somehow feels like the technology has only become more complex and costly to use, even as the underlying theory and tech evolved.
In this blog, I’ll share a bit of my thinking on why streaming technologies are hard to build, and my beliefs about the future of streaming!
Why Is Streaming Inherently Hard?
Fundamentally, I do believe stream processing is inherently harder and more complex than batch processing, for two main reasons:
1 — Stream processors are highly coupled to other data products
By definition, a stream processor is meant to ingest data, transform it, and push it to downstream data products (databases / data stores), not to be the final “endpoint” that other systems (backend / BI / end users) pull data from.
Unlike a database (or batch processor), which can usually be a full “end-to-end” data solution that ingests, transforms, and serves data to the backend, a stream processor can almost never be deployed without an additional data product (a database or batch processor) that lets users access, index, and search its output. A stream processor without a database after it is like email without inboxes: the data flows, but there’s no easy way to view or interact with anything after it’s sent.
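To make that coupling concrete, here is a minimal sketch of such a pipeline in Python (the table, columns, and connection details are made up, and psycopg2 is just one possible client): even when the stream processor does all the heavy lifting, everything users actually touch happens in the database it writes to.

```python
import psycopg2  # assuming the sink happens to be PostgreSQL

def run_pipeline(change_stream, pg_dsn):
    """Toy stream processor: ingest -> transform -> hand off to a real database."""
    conn = psycopg2.connect(pg_dsn)
    cur = conn.cursor()
    for event in change_stream:                     # 1. ingest a change event
        total = event["quantity"] * event["price"]  # 2. transform it
        cur.execute(                                # 3. sink it into the database
            "INSERT INTO order_totals (order_id, total) VALUES (%s, %s)",
            (event["order_id"], total),
        )
        conn.commit()
    # Indexing, ad-hoc queries, and BI dashboards all live in PostgreSQL,
    # not in this loop.
```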
If your backend reads from and writes to a specific PostgreSQL instance that uses a partitioned schema, even the best stream processor in the world probably won’t help unless it can properly ingest from and sink data to your specific PostgreSQL instance and version, and handle the scenario of a new partition being added.
As living testimony to this: some of the hardest bugs we ever encountered while building Epsio were actually bugs in other data products we integrated with. Since we relied so heavily on the specific format and behavior of each data store, the smallest glitch in a source or destination database could easily translate into a glitch in Epsio.
2 — Stream processors are batch processors, plus more
Fundamentally, batch processing is just a “subset” of stream processing. A batch processor only works on the CURRENT data you have. A stream processor must both process the CURRENT data you have and “prepare” (build structures) for any FUTURE data that arrives.
A great example of this difference is how JOIN operations are handled (specifically, hash joins). In batch processing, since the system knows it will never receive additional data, it can optimize the JOIN by building a hash table only for the “smaller” side of the join (say, the left side) and then iterating over the current rows of the larger side (the right) in no particular order, looking up the corresponding rows in the other side’s hash table.
A stream processor, in contrast, needs to build not only a hash table for the small side, but ALSO a hash table for the larger side of the JOIN, so that it can quickly look up corresponding rows if any future change is made to the smaller side. It needs to build all the structures a batch processor needs for the same operation, PLUS MORE.
(Some stream processing systems — or similar systems like TimescaleDB’s continuous aggregates — mitigate this overhead by limiting support for processing updates to only one side of the join).
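A minimal sketch of the difference in Python (illustrative only, not how any particular engine implements it): the batch join keeps one hash table, while the streaming join must keep state for both sides so that future changes on either side can find their matches.

```python
def batch_hash_join(small_rows, large_rows, key):
    # Batch: hash only the smaller side once, then scan the larger side.
    index = {}
    for row in small_rows:
        index.setdefault(row[key], []).append(row)
    for row in large_rows:
        for match in index.get(row[key], []):
            yield {**match, **row}

class StreamingHashJoin:
    """Streaming: keep hash tables for BOTH sides, because a future change
    on either side must be able to find its matches on the other."""

    def __init__(self, key):
        self.key = key
        self.left_index = {}    # state the batch join never needed
        self.right_index = {}

    def on_left(self, row):
        self.left_index.setdefault(row[self.key], []).append(row)
        for match in self.right_index.get(row[self.key], []):
            yield {**row, **match}

    def on_right(self, row):
        self.right_index.setdefault(row[self.key], []).append(row)
        for match in self.left_index.get(row[self.key], []):
            yield {**match, **row}
```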
As evidence of a stream processor being “a batch processor, plus more”, you can frequently find stream processing systems that also have support for batch queries (e.g., Flink, Materialize, and Epsio soon ;)), while no batch processor supports streaming queries. To build a batch processor — simply take a stream processor and remove some components!
How Were These Difficulties Overcome Until Today?
Historically, the creator of Kafka (Jay Kreps) talked about tackling the difficulties of stream processing by breaking down these huge complexities into many small “bite size” components, thereby making it easier to build each one of them: “By cutting back to a single query type or use case each system is able to bring its scope down into the set of things that are feasible to build”. Specifically, he portrayed a world where many “small” open source tools play together to form the complete “data infrastructure layer”:
Data infrastructure could be unbundled into a collection of services and application-facing system apis. You already see this happening to a certain extent in the Java stack:
- Zookeeper handles much of the system co-ordination (perhaps with a bit of help from higher-level abstractions like Helix or Curator).
- Mesos and YARN do process virtualization and resource management
- Embedded libraries like Lucene and LevelDB do indexing
- Netty, Jetty and higher-level wrappers like Finagle and rest.li handle remote communication
- Avro, Protocol Buffers, Thrift, and umpteen zillion other libraries handle serialization
- Kafka and Bookeeper provide a backing log.
If you stack these things in a pile and squint a bit, it starts to look a bit like a lego version of distributed data system engineering. You can piece these ingredients together to create a vast array of possible systems. This is clearly not a story relevant to end-users who presumably care primarily more about the API then how it is implemented, but it might be a path towards getting the simplicity of the single system in a more diverse and modular world that continues to evolve. If the implementation time for a distributed system goes from years to weeks because reliable, flexible building blocks emerge, then the pressure to coalesce into a single monolithic system disappears.
Even though this architecture definitely succeeded in improving the “specialization” of individual components and dramatically advanced the capabilities of the streaming ecosystem, it feels like somewhere along the road the end user who was supposed to be shielded from this implementation choice (“This is clearly not a story relevant to end-users who presumably care primarily more about the API then how it is implemented”) ended up paying the price.
In today’s world, there is almost no technical barrier to great stream processing. Whether it’s scale, complexity, or robustness, nearly any application can be built with the tools the market has to offer. The interesting question, though, is: at what cost?
Even setting aside the sheer number of components users must manage (Debezium, Kafka Connect, Kafka, Flink, Avro Schema Registry, and more), the internal knowledge required to operate these tools effectively is staggering. Whether it’s understanding how Flink parallelizes jobs (does an average user need to understand how PostgreSQL parallelizes queries?), reasoning about the subtleties of checkpointing and watermarks, or figuring out how to propagate schema changes to transformations, the end user is definitely not abstracted away from the internal implementation choices.
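To give a flavor of that, here is a tiny PyFlink-style snippet (assuming a recent PyFlink release; it is meant only to show the kind of knobs involved, not a complete job): parallelism, checkpointing, and watermarking are all decisions the user is expected to make explicitly.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# How many parallel subtasks should each operator run? (The user decides.)
env.set_parallelism(8)

# How often should state be snapshotted for fault tolerance, in milliseconds?
env.enable_checkpointing(60_000)

# Still left to the user: watermark strategies for out-of-order events,
# state backends, restart strategies, and how schema changes flow through
# every transformation downstream.
```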
A New Generation Of Stream Processors
Innovation is dirty. And perhaps this historical unbundling and complexity had to take place for us to get where we are today and overcome all these difficulties. But much like the evolution of batch processors, where the “dirty” and “complex” innovation of Oracle had to happen before the easy-to-use PostgreSQL could rise, a new generation of stream processors has emerged in the past couple of years, one that focuses its innovation on the end user, not just on technical capabilities.
Similar to the move from Oracle to PostgreSQL, stream processors of this new generation are not necessarily less internally “complex” than the previous ones. They build on all the “internal” innovation of the previous generation, while somehow still abstracting that complexity away from the end user.
In this new generation, simplicity is the default and complexity is a choice. End users can dive deep into the internals and configuration of the stream processor if they wish, but always have the option of defaults that “just work” for 99% of use cases.
For this new generation of stream processors to be built, a radical change in the underlying technology had to happen. The endless process of stitching and gluing together generic, old components was abandoned, and a new, much more holistic paradigm had to be built in its place.
New stream processing algorithms with much stronger end-to-end guarantees (differential dataflow and DBSP, with their internal consistency promises) are used, and top-tier replication practices inspired by great replication tools like PeerDB and Fivetran are incorporated.
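To give a feel for what “incremental” means here, below is a toy sketch in the spirit of differential dataflow / DBSP (not their actual APIs): instead of recomputing a GROUP BY from scratch, the state is updated from a changelog of (key, amount, diff) entries, where diff is +1 for an insert and -1 for a delete.

```python
from collections import defaultdict

class IncrementalGroupedSum:
    """Toy incrementally maintained view of: SELECT key, SUM(amount) ... GROUP BY key."""

    def __init__(self):
        self.sums = defaultdict(int)

    def apply(self, key, amount, diff):
        # diff = +1 for an inserted row, -1 for a deleted row;
        # an UPDATE is modeled as a delete followed by an insert.
        self.sums[key] += diff * amount
        return key, self.sums[key]  # the updated output row for this key

view = IncrementalGroupedSum()
print(view.apply("eu", 100, +1))  # INSERT (eu, 100)     -> ('eu', 100)
print(view.apply("eu", 30, +1))   # INSERT (eu, 30)      -> ('eu', 130)
print(view.apply("eu", 100, -1))  # DELETE the first row -> ('eu', 30)
```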
In this new generation of stream processors, each component is designed specifically for its role in the larger system, with strong “awareness” of all the external systems and integrations it must serve.
The concept of “database transactions,” for example, is enforced in all internal components in order to abstract ordering and correctness issues away from the user. While stream processors of the old generation (Debezium, Flink, etc.) by default lose the bundling of transactions when processing changes (meaning large operations performed atomically on source databases might not appear atomically in sink databases), our new stream processor must (and does!) automatically translate a single transaction at the source database into a single “transaction” in the transformation layer, and into a single “transaction” in the sink database.
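A minimal sketch of what preserving that bundling looks like in practice (illustrative only, not Epsio’s internals; psycopg2 and the `transform` callback are assumptions): all changes produced by one source transaction are applied to the sink inside one sink transaction, so the sink never exposes a half-applied source transaction.

```python
import psycopg2  # assuming a PostgreSQL sink

def apply_source_transaction(sink_dsn, changes, transform):
    """Apply every change from a single source transaction atomically at the sink."""
    conn = psycopg2.connect(sink_dsn)
    try:
        # psycopg2 commits on a clean exit from the `with conn` block and
        # rolls back on exception: one sink transaction per source transaction.
        with conn, conn.cursor() as cur:
            for change in changes:
                sql, params = transform(change)  # transformation layer keeps the grouping
                cur.execute(sql, params)
    finally:
        conn.close()
```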
The transformation layer is aware of the source database it transforms data from, so it can build optimal query plans (based on schemas, table statistics, etc.), and the replication layer is aware of the transformation layer, so it can automatically start replicating new tables that new transformations require.
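Below is a simplified sketch of that second kind of awareness (hypothetical names throughout, and a deliberately naive regex where a real system would consult the source database’s parser and catalog): registering a new transformation automatically tells the replication layer which tables it now needs to stream.

```python
import re

class Pipeline:
    def __init__(self, replication_layer):
        self.replication = replication_layer  # hypothetical replication component
        self.views = {}

    def create_view(self, name, query):
        # Naively extract the tables the transformation reads; a real system
        # would use the source database's SQL parser, schemas, and statistics.
        tables = set(re.findall(r"\b(?:FROM|JOIN)\s+([\w\.]+)", query, re.IGNORECASE))
        for table in tables:
            self.replication.ensure_replicating(table)  # hypothetical method
        self.views[name] = query
```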
Parallelism, fault tolerance, and schema evolution are all abstracted away from the user, and a single, simple interface controls all of the stream processor’s mechanisms (ingesting data, transforming data, sinking data).
Moving Forward
Streaming is hard. But as the market’s barrier to adoption shifted from a theoretical one (is reliable stream processing at scale even possible, and a real solution with real use cases?) to a usability one (how hard is it for me to use?), the underlying technologies that power streaming had to change, and still need to.
Although stream processing will probably always be harder than batch processing, if we can truly make it (almost) as easy as batch processing, the implications for the data world would be staggering. Heavy queries would be easily pre-calculated (and kept up to date) using incremental materialized views, data movement between data stores would be a solved issue, dbt models and ETL processes could be made incremental using a streaming engine, and front ends would be far more reactive (e.g. https://skiplabs.io/). The cost of data stacks would drop dramatically, and engineers would be able to focus much more on what they are supposed to do: building application logic!
Similar to how Snowflake revolutionized the data warehousing world with its ease of use and scalability, I truly believe, as a software engineer, that an “easy to implement” yet robust stream processor can break the existing trade-off between batch processing and stream processing (easy to use vs. real-time and performant) and unlock a new and exciting world in the data ecosystem!
Shameless plug
If you are interested in learning more about one of these "newer generation" stream processors, feel free to check out Epsio's docs at this link!