We’re excited to announce Restate v1.4, a significant update for developers and operators building and supporting resilient applications. The new release improves cluster resiliency and workload balancing, and adds a multitude of efficiency and ergonomics improvements across the board. Expect less downtime and achieve more with fewer resources.
Restate v1.4 enhances multi-node clusters with a new network fabric and gossip-based failure detection. This means lower overheads, more predictable latencies, and much improved reconfiguration and recovery during partial failures like network partitions.
A significant improvement in v1.4 is the new gossip-based failure detection system. One of the hardest problems to crack in distributed systems is reliably and correctly deciding whether a given node has failed or is just temporarily unreachable from another node’s point of view. The new detection mechanism and partition placement deliver up to 10x faster detection of network partitions or node failures. Gossip is also used to disseminate partition leadership and other metadata updates for faster reaction to cluster reconfigurations.
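To make the failure-detection idea more concrete, here is a minimal, self-contained sketch of the heartbeat-counter style of gossip failure detection. It illustrates the general technique only – it is not Restate’s actual implementation – and the node names, counters, and suspicion timeout are made up:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Each node keeps the latest heartbeat counter it has seen for every peer,
/// plus the local time at which that counter last advanced.
struct GossipView {
    entries: HashMap<String, (u64, Instant)>,
    suspect_after: Duration,
}

impl GossipView {
    fn new(suspect_after: Duration) -> Self {
        Self { entries: HashMap::new(), suspect_after }
    }

    /// Merge a gossip message received from a peer: keep the highest counter
    /// per node, and refresh the local timestamp whenever a counter advances.
    fn merge(&mut self, remote: &HashMap<String, u64>) {
        let now = Instant::now();
        for (node, &counter) in remote {
            let entry = self.entries.entry(node.clone()).or_insert((0, now));
            if counter > entry.0 {
                *entry = (counter, now);
            }
        }
    }

    /// A node is suspected to have failed if its counter has not advanced
    /// within the suspicion timeout, no matter which peer we heard it from.
    fn suspected(&self, now: Instant) -> Vec<&str> {
        self.entries
            .iter()
            .filter(|(_, (_, last_seen))| now.duration_since(*last_seen) > self.suspect_after)
            .map(|(node, _)| node.as_str())
            .collect()
    }
}

fn main() {
    let mut view = GossipView::new(Duration::from_millis(200));

    // Gossip round 1: we hear (directly or indirectly) about two peers.
    view.merge(&HashMap::from([("node-1".to_string(), 3), ("node-2".to_string(), 5)]));

    std::thread::sleep(Duration::from_millis(300));

    // Gossip round 2: only node-1's counter advanced, so node-2 is suspected.
    view.merge(&HashMap::from([("node-1".to_string(), 4)]));
    println!("suspected nodes: {:?}", view.suspected(Instant::now()));
}
```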

The effect of faster failure detection and reconfiguration is visible in our Jepsen tests, where we measure request latencies while inducing network partitions. The test runs a three-node cluster and randomly partitions a single node from the rest before connectivity is re-established. With Restate v1.4, request latency was significantly lower than with v1.3, demonstrating how quickly the cluster detects node failures and reconfigures to remain available.

The new messaging fabric creates separate network connections for different classes of inter-cluster traffic. This ensures that high-volume, data-intensive traffic doesn’t delay latency-sensitive traffic, maintaining predictable quality of service across the board. We also replaced gzip with Zstd, a more CPU-efficient compression algorithm, to reduce the size of messages that benefit from compression. All of this means your applications will experience more consistent performance during normal operation and recover faster when nodes fail or network issues occur, improving overall system responsiveness and availability.
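As a standalone illustration of the codec swap (not Restate’s internal wire path), here is a small Rust program that compresses a synthetic, repetitive payload with both algorithms using the flate2 and zstd crates; the payload and numbers are purely illustrative:

```rust
// Cargo.toml dependencies (assumed versions): flate2 = "1", zstd = "0.13"
use flate2::write::GzEncoder;
use flate2::Compression;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // A made-up, repetitive payload standing in for a large inter-node message.
    let payload = b"partition-state-record;".repeat(50_000);

    // Gzip, roughly what was used before v1.4.
    let mut gz = GzEncoder::new(Vec::new(), Compression::default());
    gz.write_all(&payload)?;
    let gzipped = gz.finish()?;

    // Zstd at a moderate compression level (3 is the crate's common default).
    let zstded = zstd::encode_all(&payload[..], 3)?;

    println!(
        "raw: {} B, gzip: {} B, zstd: {} B",
        payload.len(),
        gzipped.len(),
        zstded.len()
    );
    Ok(())
}
```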
Restate clusters now automatically rebalance partitions across the configured nodes, which means better space and compute utilization across the board. Previously, the partition scheduler made placement decisions based on the best information available at the time; as nodes left and rejoined the cluster, this could result in an uneven spread of partitions across cluster members. With v1.4, the cluster actively rebalances the load and revisits earlier placement decisions that are no longer optimal. Another improvement is that partition processors now go through explicit drain and warm-up phases, which further minimises the disruption during a leadership change – this in turn means shorter pauses and more predictable tail latencies for your applications.

The Replicated loglet provider sits at the core of our Bifrost distributed log, and it is now the default in v1.4. When we shipped Restate v1.3, enabling cluster support required explicitly opting into the Replicated loglet as the segmented log backend. In keeping with our “batteries included” philosophy, and thanks in no small part to the design of Bifrost, migration is completely seamless.
If you have not configured a log provider, Restate will automatically migrate your existing setup to the Replicated loglet implementation, even on single nodes, giving you enhanced performance and a smoother path to distributed deployments. The Local loglet remains available if explicitly configured, and downgrades to v1.3 are safe and supported.
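For most deployments there is nothing to do here. If you do want to pin the log provider explicitly, it is set in the server configuration; the sketch below assumes the `bifrost` section and `default-provider` key used in recent releases – check the configuration reference for the authoritative names and values:

```toml
# restate.toml -- provider selection (key names as in recent releases;
# consult the configuration reference for your version).
[bifrost]
# "replicated" is the v1.4 default; set "local" only if you explicitly
# want to keep the single-node Local loglet.
default-provider = "replicated"
```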
We have further streamlined configuration by removing the previously deprecated local metadata backend for single-node deployments. The metadata store holds a few tiny yet crucial pieces of information that need to be carefully managed, whether in single-node or cluster deployments. The replicated metadata server is now the default, and nodes will similarly perform an automatic migration on startup if the local metadata backend was previously in use.
Whether you are running Restate locally on your development machine or rolling out containers to multi-node clusters, nobody likes to wait. Restate startup latency has improved substantially in v1.4 – the server is now ready to serve requests up to 2.5x faster compared to v1.3.
As always, the fastest way to get started with Restate is by following the quickstart guide.
Whether you are still considering Restate for your project, have questions about the upgrade path, or feedback about the new features, we are always keen to hear from you on Discord or Slack.
Restate is open, free, and available on GitHub and at the Restate website. Star the GitHub project if you like what we are doing!
Restate is also available as a fully managed service running in our cloud or yours. Try Restate Cloud for free and contact us for more information.