Hydrolix Handles Late-Arriving Data

Late-arriving data is one of the biggest challenges in real-time analytics. If a service goes down or sensors don't transmit data immediately, you might not get logs until hours or even days later. And it's common for devices like mobile phones to send late-arriving data.

Many real-time analytics platforms don't do a good job of handling late-arriving data, and some can't handle it at all. There's also the issue of sorting it: late-arriving data is out of order in addition to being late, and out-of-order data leads to problems like inaccurate data sets and inefficient queries. Some use cases also depend on data being in the correct order.

Hydrolix is a streaming data lake optimized for timestamped data that uses a merge service to sort late-arriving and out-of-order data. Whether the data arrives a minute late or months late (or even a year late), Hydrolix can handle it. Let's take a look at some of the challenges of late-arriving data at scale and how Hydrolix merges and sorts late-arriving data.

The Challenges of Late-Arriving Data at Scale

In complex distributed systems, many different services emit logs, and they aren't always punctual about it. Real-time analytics platforms need to process this data as soon as it arrives, store it, and then make it available for dashboards, queries, and alerts. 

Even in a best-case scenario, late-arriving data can skew your dashboards and alerts. Your dashboard may show a low error count over the last five minutes, but what if you have a service that's failing and sends log data late as a result? 

Everything may look fine in your dashboards, but you just haven't received the bad news yet.

To make matters worse, many platforms have limited tolerance for late-arriving data, and any data that arrives after a certain period is simply discarded. This leads to inaccurate data and creates significant issues for use cases ranging from observability to machine learning.

Some enterprises use complex ELT (extract-load-transform) pipelines to handle late-arriving data, duplicate and reprocess entire tables, or rely on transactional OLTP datastores to handle updates. 

None of these solutions are ideal for data at petabyte scale. They are costly, less performant, overly complex, or some combination of the three.

This doesn't even account for the fact that the data is not just late but also out of order. Even if a solution has a mechanism for sorting out-of-order data, it may involve reprocessing entire tables.

If you are working with data at scale, late-arriving data isn't just theoretical but an operational reality. 

Large systems can behave unpredictably. There are simply too many different services in too many different places for all data to arrive in an orderly manner. Even in a best-case scenario, there can be network bottlenecks when transmitting logs from services that are distributed globally. There are edge services, sensors, and devices which won't always send log data in a timely fashion. And then there are more problematic scenarios, such as code issues and network outages.

Can Your Data Solution Handle Late-Arriving Data?

Late-arriving data is a headache, and ideally your data solution should take care of it for you. Here are a few considerations to keep in mind when evaluating whether a solution can handle data that arrives late:

  • Can your solution handle late-arriving data at all? Solutions that simply discard late-arriving data can leave you with unacceptable gaps in your data.
  • How long will your solution handle late-arriving data? Some solutions may handle late-arriving data for a short period of time and then discard it. If your log data is arriving outside of this window, it can have a negative impact on the accuracy of your analytics.
  • Is late-arriving data handled efficiently? Depending on the architecture, late-arriving data can result in costly upserts (slower writes) or costly queries (due to disorganized partitioning). And some solutions may delay pipeline processing to wait for late-arriving data, which isn't just inefficient—it essentially makes all of the data late, not just some of it.
  • Can you monitor ingest pipelines? Late-arriving data can be planned (such as scheduled batch jobs) or unexpected. You should have visibility into unexpected latency so you can find and fix the underlying issues causing it. Even if your system handles late data efficiently, late-arriving data will still skew your real-time analytics and impact use cases like observability.

Let's take a look at how Hydrolix handles late-arriving data to address the first two considerations. In a future post, you'll learn how to monitor latency in your ingest pipelines.

How Hydrolix Handles Late-Arriving Data

Unlike many other platforms, Hydrolix handles late-arriving and out-of-order data by design, so there's no separate or special process for handling late-arriving data. It's one of the many benefits of partitioning data by time. As partitions are merged and reordered, sorting happens naturally. So whether your data is fresh or aging (hopefully gracefully), Hydrolix ingests, merges, and makes it available for query in exactly the same way.

Ingesting Both Fresh and Late-Arriving Data

At ingest time, Hydrolix transforms, indexes, compresses, and partitions incoming data. And yes, that is a lot of things to do, especially when millions of rows per second are arriving in real time, but with Kubernetes and massive parallelism in play, the time from ingest to being available for query is typically 15-30 seconds.

Hydrolix orders and partitions data by time during ingest, with basic information about partitions including their minimum and maximum timestamps kept in a database catalog.
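
To make the catalog idea concrete, here's a minimal Python sketch using invented names like PartitionEntry and partitions_for_window (not Hydrolix internals): each catalog record carries a partition's minimum and maximum timestamps, so a time-bounded query only needs to touch partitions whose ranges overlap the query window.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PartitionEntry:
    """Illustrative catalog record: one row per partition."""
    path: str
    min_ts: datetime
    max_ts: datetime

def partitions_for_window(catalog, start: datetime, end: datetime):
    """Keep only partitions whose [min_ts, max_ts] range overlaps the query window."""
    return [p for p in catalog if p.max_ts >= start and p.min_ts <= end]
```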

Even though Hydrolix essentially handles fresh and late-arriving data the same way, there are separate configurations for both kinds of data. You can fine-tune ingest and query performance to meet your use case, so if you really don't care much about that late data coming in, you can put fewer resources into making it available quickly. All of these settings are configurable as shown in the stream settings documentation. If you need late-arriving data to have the same level of priority as fresh data, it's a simple configuration change.
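
As a rough illustration, per-table settings might look something like the sketch below. Only the two age settings, which come up later in this post, are taken from it; treat the shape of this snippet as a placeholder and consult the stream settings documentation for the real schema and the full set of options.

```python
# Illustrative sketch only, not the actual stream settings schema.
stream_settings = {
    "hot_data_max_age_minutes": 3,   # timestamps newer than this count as fresh
    "cold_data_max_age_days": 365,   # timestamps older than this are rejected
}
```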

Hydrolix distinguishes between fresh, real-time data and older, late-arriving data by checking the timestamp of incoming logs. By default, data is fresh if its timestamp is no more than three minutes old. Hydrolix flushes fresh partitions to storage more frequently to ensure they are immediately available for querying, and it keeps the span between their minimum and maximum timestamps smaller (one minute, as opposed to five for late-arriving partitions).

The next image shows how Hydrolix stores incoming data in both fresh and late-arriving data partitions based on the defaults.

This data enters Hydrolix in real time but out of order. Intake heads, which consist of scalable, stateless Kubernetes infrastructure, examine the primary timestamp and sort the log lines into partitions. In this example, two of the log lines are over six hours old, but their primary timestamps are within five minutes of each other, so Hydrolix writes both to the same late-arriving data partition.

Note that in this example, hot_data_max_age_minutes is set to the default of three minutes, so data whose timestamp is less than three minutes old is considered fresh; anything older is treated as late-arriving. And how late can data arrive? You can set cold_data_max_age_days to whatever value you need. For instance, if you set it to 365 days, Hydrolix will handle data even if it arrives a year late. Anything older than cold_data_max_age_days is rejected.
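
Putting those two settings together, the routing decision described above can be sketched in a few lines of Python. This is illustrative pseudologic, not Hydrolix source:

```python
from datetime import datetime, timedelta, timezone

def classify(event_ts: datetime,
             hot_data_max_age_minutes: int = 3,
             cold_data_max_age_days: int = 365) -> str:
    """Route a record by the age of its primary timestamp (illustrative)."""
    age = datetime.now(timezone.utc) - event_ts
    if age <= timedelta(minutes=hot_data_max_age_minutes):
        return "fresh"      # flushed frequently; one-minute partition spans
    if age <= timedelta(days=cold_data_max_age_days):
        return "late"       # late-arriving partitions; five-minute spans
    return "rejected"       # older than cold_data_max_age_days
```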

By design, partitions are very small at ingest time, regardless of whether they include fresh or late-arriving data. (The default max row size of a partition is the same for both fresh and late-arriving data.) Hydrolix prioritizes real-time availability for incoming data, so initially, data in partitions can be out of order and partitions may not have an optimal structure yet. The merge service is where the magic happens.

Merging and Sorting Late-Arriving and Out-of-Order Data

Hydrolix's background merge service compacts and further optimizes these smaller partitions into larger ones. The larger partitions typically achieve higher compression ratios, and the data within them is fully ordered. During this process, Hydrolix automatically sorts both recent and late-arriving data, so once a merge completes, all data is partitioned by its primary timestamp regardless of whether it arrived late or out of order.
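
Conceptually, this step behaves like a k-way merge of already-sorted runs. Here's an illustrative Python sketch of how several small partitions, each sorted by primary timestamp, combine into one larger, fully ordered partition:

```python
import heapq

def merge_partitions(partitions):
    """Merge small partitions (each already sorted by primary timestamp,
    the first tuple field) into one larger, globally ordered partition."""
    return list(heapq.merge(*partitions, key=lambda row: row[0]))

small_partitions = [
    [(1, "a"), (4, "d")],   # a fresh partition
    [(2, "b"), (3, "c")],   # a late-arriving partition
]
print(merge_partitions(small_partitions))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```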

No Issues With Stale Caches or Misses

Unlike many solutions that use object storage, Hydrolix doesn't rely on caching for performance. While query caching can be a valuable tool, caches quickly become stale when working with real-time data, and late-arriving data makes them even more likely to be inaccurate. With Hydrolix, you don't have to worry about stale caches or cache misses, even though partitions are regularly changing over time.

Hydrolix Summary Tables Account for Late-Arriving Data

Many solutions include support for aggregation tables or materialized views. Aggregation tables precompute and store the results of aggregations like sums, quantiles, and uniques so you can quickly retrieve them without querying the underlying raw data. A materialized view saves the result of a query for fast retrieval and doesn't necessarily need to be an aggregation.

Platforms that use aggregate tables or materialized views often aren't able to incorporate late-arriving data. This leads to inaccurate aggregations and views, which will negatively impact dashboards, queries, and alerts.

That's not the case with Hydrolix. By design, Hydrolix retrieves data for aggregates only once from parent tables and stores it in summary partitions, regardless of whether the data is fresh or late-arriving. Hydrolix merges and optimizes summary partitions just like the partitions for raw data. Because Hydrolix stores intermediate states for aggregates like averages and quantiles, it can recalculate them when late data arrives without retrieving data from parent tables more than once. We'll be writing more about summary tables in a future post.
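
To see why intermediate states mean raw data only has to be read once, consider an average. Storing only the final number leaves you stuck when a late partition arrives, but storing a mergeable (sum, count) pair lets you combine it with the late partition's own pair. Here's a minimal Python sketch (illustrative, not Hydrolix internals):

```python
def avg_state(values):
    """Intermediate state for an average: a mergeable (sum, count) pair."""
    return (sum(values), len(values))

def merge_states(a, b):
    """Combine two intermediate states without touching the raw rows again."""
    return (a[0] + b[0], a[1] + b[1])

def finalize(state):
    total, count = state
    return total / count

fresh = avg_state([10, 20, 30])   # computed when the data first arrived
late = avg_state([100])           # computed once, when the late data arrives
print(finalize(merge_states(fresh, late)))   # 40.0
```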

Late to the Party? Try Hydrolix

If you're using a platform that either can't handle late-arriving data or handles it inefficiently, it may be time to reconsider your solution.

Handling late-arriving data is just one of the advantages of Hydrolix. You also get sub-second querying even on trillion-row datasets, streaming ingest at terabyte scale, and perhaps most significantly of all, costs that are typically 75% lower (or more) than other platforms. That's because Hydrolix uses stateless architecture, decoupled object storage, and 20x-50x compression to dramatically reduce the cost of your data.

Start a trial or try a demo.
