We Bet on Rust to Supercharge Feature Store at Agoda


Agoda Engineering

By Worakarn Isaratham


Rust has become the language everyone seems to be talking about. From startups to tech giants, companies are rewriting core systems and touting dramatic gains in performance, reliability, and safety. With so much buzz, it’s easy to wonder if Rust is just the latest industry fad or if there’s real substance behind the hype.

At Agoda, several teams had already begun experimenting with Rust for performance-critical workloads, and their positive experiences caught our attention. Still, our decision to adopt Rust for Feature Store wasn’t about chasing trends or joining the crowd. It was a response to real, pressing challenges: unpredictable latency, scaling bottlenecks, and the limits of our existing JVM-based stack. We needed a solution that could deliver consistently high performance under heavy load, and Rust seemed like the perfect answer.

This post isn’t about Rust as a silver bullet. It’s about what it actually took for a team with no prior Rust experience to migrate a critical, high-traffic system. We’ll share why, for us, the Rust “bandwagon” turned out to be the right move.

Architecture

Feature Store is a centralized repository for managing and serving machine learning features. At Agoda, the scope of our Feature Store extends beyond traditional storage to include tools that facilitate generating features from internal Kafka pipelines, help train models using historical features from our data lake, and more. That said, this post focuses on the more fundamental service of delivering features for model inference, which is handled by a component we refer to as Feature Store Serving.

[Diagram: Feature Store architecture]

The component has a straightforward job: fetch features from ScyllaDB storage and deliver them to online services within Agoda. The real challenge is performance. One of our key SLAs is a P99 latency of 10 ms, even while handling millions of requests per second. This latency budget includes cache lookups, database engine time, network, and transport delays. Despite our efforts to optimize cache and database performance, there’s very little headroom left for Serving itself.

Quick History

Feature Store Serving was not built from scratch. It started as a fork of Feast, a popular open-source feature store built in Java using Spring Boot. This gave us a solid foundation, but as our requirements grew and our code evolved, we diverged significantly from what Feast provided out of the box.

Several pain points pushed us toward migration. For example, we wanted to change the details of our gRPC calls to better match how features are stored in our database, reducing unnecessary data conversion in Serving. Feast’s support for multiple storage backends, while flexible, became a burden for us, since we only used a single storage system. Any interface change meant dealing with extra complexity and code paths we didn’t need. Additionally, parts of the codebase were challenging to make fully asynchronous, resulting in inefficiencies and performance issues due to blocking, synchronous code.

Over time, these challenges compounded. Maintenance friction continued to increase, and it became clear that we needed a solution that better integrated with the rest of our stack. Since all other parts of Agoda Feature Store were already written in Scala, migrating Serving to Scala at the end of 2022 allowed us to reuse code, leverage internal frameworks, and tap into our colleagues’ Scala expertise (Scala was Agoda’s primary language at the time, now largely replaced by Kotlin).

This setup worked well for a couple of years, but as our traffic and data volume continued to grow, new limitations began to surface. We began to experience increased latency, higher resource consumption, and a growing number of operational issues as we attempted to optimize the JVM's performance. The most critical issue was garbage collection, which started to impact our P99 latency in unpredictable ways. We needed a solution that could deliver consistent, low-latency performance and make more efficient use of hardware resources. This forced us to seriously consider another migration, and this time, away from the JVM entirely.

Why Rust?

With the decision to move away from the JVM, we began evaluating our options for a new language and runtime. Rust quickly stood out for its reputation: high performance, predictable low latency, and efficient resource usage. Its strong safety guarantees and zero-cost abstractions offered the kind of control and reliability we simply couldn’t achieve with JVM-based languages. On paper, Rust appeared to be the perfect fit for our needs.

However, we were also well aware of Rust’s steep learning curve. None of us had significant prior experience with the language, and the ecosystem was still relatively new to us. The prospect of ramping up on a new language, especially one known for being unforgiving, was daunting.

What tipped the balance was the experience of a handful of other teams at Agoda who had already adopted Rust for performance-critical workloads. Their positive feedback, along with the opportunity to consult with them if needed, gave us the confidence to move forward. The potential benefits were too compelling to ignore, and we decided to take the leap and rebuild Feature Store Serving in Rust.

Kicking Off the Migration

To validate our choice, we began with a Rust proof of concept. We focused on the core serving logic, intentionally omitting unit tests, some of the more nuanced business logic, and most integrations. For example, measurement and monitoring hooks were skipped, and we quickly discovered that some Agoda-specific integrations didn’t have existing Rust bindings.

The POC was completed by one developer in about a week, a pace comparable to our earlier migration to Scala. This was surprising, considering our team’s much greater proficiency in Scala compared to Rust. The Rust compiler itself was a huge help. Its comprehensive and clear error messages made it much easier to understand and fix mistakes as we learned. GitHub Copilot was also a big enabler, helping us navigate Rust’s syntax and ownership model, suggesting idiomatic code, and letting us move quickly despite being new to the language.

Importantly, pairing Copilot with the Rust compiler created a natural check-and-balance: Copilot could suggest code rapidly, while the compiler helped catch mistakes or questionable suggestions, reducing the risk of subtle errors or hallucinations slipping through. The learning curve was still real, as Rust’s strictness around ownership and lifetimes took some getting used to. Still, the productivity boost from AI, combined with the compiler’s guidance, was undeniable.

Once the POC was ready, we ran benchmarks to compare its performance with our Scala implementation. On identical VM specs, the benchmarks showed a dramatic improvement in favor of the Rust version, to no one’s surprise. As shown in the graphs below, Rust outperformed Scala across all key metrics: requests per second, CPU utilization, and memory usage, demonstrating a clear advantage in efficiency and scalability.

[Charts: POC benchmark results comparing Rust and Scala — requests per second, CPU utilization, and memory usage]

It’s worth noting that, when building the POC, our priority was to get things working quickly, rather than fully optimizing the code. This meant there was still plenty of room for improvement in our benchmark results.

As an example of sub-optimal code from the POC stage, here’s how we retrieved the list of available feature sets on every request:

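(The original snippet was shown as an image; below is a minimal reconstruction of the pattern, with illustrative type and function names rather than our actual code.)

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Illustrative types; the real feature-set metadata is richer than this.
#[derive(Clone)]
struct FeatureSet {
    name: String,
    features: Vec<String>,
}

type FeatureSetRegistry = HashMap<String, FeatureSet>;

// The pre-existing lookup code expected an owned registry...
fn resolve_feature_sets(registry: FeatureSetRegistry, requested: &[String]) -> Vec<FeatureSet> {
    requested
        .iter()
        .filter_map(|name| registry.get(name).cloned())
        .collect()
}

// ...so on every request we deep-copied the whole map out of the Arc just to
// satisfy that signature. `as_ref().clone()` clones every entry in the map,
// not the Arc pointer.
fn handle_request(shared: &Arc<FeatureSetRegistry>, requested: &[String]) -> Vec<FeatureSet> {
    let owned_copy = shared.as_ref().clone(); // expensive deep copy per request
    resolve_feature_sets(owned_copy, requested)
}
```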

Experienced Rustaceans will notice that, while valid, this approach is highly inefficient. Instead of cheaply cloning the shared pointer, it performs a deep copy of the entire data structure, wasting resources when all we needed was shared access. This happened because our original code expected an owned data structure, but we later introduced shared ownership using Arc. Rather than refactoring the interface, we tried to make the old code work, resulting in unnecessary deep copies and defeating the purpose of using Arc in the first place.

The correct approach is to simply clone the Arc and update the code to accept and work with it directly:
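Continuing the illustrative sketch above, the fix looks roughly like this:

```rust
// (Same imports and types as the sketch above.)

// The lookup code now accepts the shared registry directly...
fn resolve_feature_sets(registry: Arc<FeatureSetRegistry>, requested: &[String]) -> Vec<FeatureSet> {
    requested
        .iter()
        .filter_map(|name| registry.get(name).cloned())
        .collect()
}

// ...and the handler clones only the Arc: a reference-count bump,
// not a deep copy of the underlying map.
fn handle_request(shared: &Arc<FeatureSetRegistry>, requested: &[String]) -> Vec<FeatureSet> {
    resolve_feature_sets(Arc::clone(shared), requested)
}
```

Cloning an Arc only bumps a reference count, so the per-request cost becomes constant no matter how many feature sets are registered.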

After fixing this, latency dropped sharply, as shown in the graph below.

[Graph: latency before and after the Arc clone fix]

Even with this inefficiency, Rust still outperformed Scala by a wide margin. This gave us the confidence to move forward with a full migration: even for a team new to Rust, the development process was manageable, the performance gains were real, and with more experience and tuning we knew we could push performance even further.

Ensuring Correctness

Migrating a critical system like Feature Store Serving isn’t just about performance; it’s essential to guarantee that the new implementation behaves exactly as expected. While we relied on unit and integration tests to catch most issues, we were aware that there was still a risk of missing edge cases or subtle behavioral differences. To address this, we built a “Shadow Testing” system to capture any discrepancies.

[Diagram: shadow testing setup]

In this setup, we deployed a special version of the Rust server that, after processing each request, would also forward the same request to the legacy Scala service. The responses from both systems were then compared in real time. To avoid impacting production, we used Istio to mirror only a small percentage of live traffic to this comparator module.
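To make the flow concrete, here is a minimal sketch of what the comparator logic can look like, assuming a Tokio runtime; the request/response types, LegacyClient, and serve_features are illustrative stand-ins rather than our production code:

```rust
// Illustrative request/response shapes; in the real system these are gRPC messages.
#[derive(Clone, PartialEq)]
struct FeatureRequest {
    entity_id: String,
}

#[derive(Clone, PartialEq)]
struct FeatureResponse {
    values: Vec<f64>,
}

// Hypothetical client for the legacy Scala service.
#[derive(Clone)]
struct LegacyClient;

impl LegacyClient {
    // In the real system this would be a gRPC call to the Scala service.
    async fn get_features(&self, _req: &FeatureRequest) -> Result<FeatureResponse, String> {
        Err("not wired up in this sketch".to_string())
    }
}

// Placeholder for the real Rust serving path.
async fn serve_features(_req: &FeatureRequest) -> FeatureResponse {
    FeatureResponse { values: Vec::new() }
}

async fn handle_with_shadow(request: FeatureRequest, legacy: LegacyClient) -> FeatureResponse {
    // Serve from the Rust implementation first; this is what the caller receives.
    let response = serve_features(&request).await;

    // Off the hot path, replay the same request against the legacy Scala service
    // and flag any divergence between the two responses.
    let expected = response.clone();
    tokio::spawn(async move {
        if let Ok(scala_response) = legacy.get_features(&request).await {
            if scala_response != expected {
                eprintln!("shadow mismatch for entity {}", request.entity_id);
            }
        }
    });

    response
}
```

Spawning the comparison as a separate task keeps the call to the legacy service off the caller’s latency path, which matters given the 10 ms P99 budget.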

This approach enabled us to validate the Rust implementation against real-world scenarios and data, identifying any discrepancies that might have been overlooked in other forms of testing. The feedback from shadow testing was invaluable. It helped us catch and fix a few subtle bugs before going live, and also confirmed that the vast majority of responses were identical. This gave us the assurance we needed to fully switch over to the Rust implementation in production, knowing we weren’t risking production stability.

After Migration

Compared to where we were a year ago, our traffic, measured as the number of entities delivered by Serving, has grown by about 5x. The first chart below shows this steady increase over time.

[Chart: entities delivered by Serving over time]

Despite this surge, our resource usage has improved dramatically. At the time of migration, CPU usage dropped to just 13% of the Scala peak, and even now, as we handle five times the traffic, it still remains at only 40% of that original level. Memory usage saw an even more dramatic improvement, initially dropping to just 1% of the Scala baseline and currently holding steady at around 15%. The following charts illustrate these trends in CPU and memory usage.

[Charts: CPU and memory usage before and after the migration]

These efficiency gains have had a direct impact on our infrastructure costs, reducing our annual compute expenses by 84%, compared to if we had continued using Scala.

As a side effect of our improved performance, we’ve started to uncover bottlenecks in deeper infrastructure layers. For example, we found that a default CPU limit in our private cloud was hampering performance across teams, and we also identified excessive logging in Istio as another source of overhead. While these issues are outside the scope of this post, they highlight new opportunities for optimization that extend beyond just our application.

Looking at the data, it’s clear we made the right move. Rust hasn’t just helped us keep up with growth; it’s put us in a position to handle even more in the future.

Key Takeaways

  • Rust delivered real, measurable gains. Migrating to Rust enabled us to handle 5x more traffic while dramatically reducing CPU and memory usage.
  • Cost savings are substantial. If we had stayed with Scala, our daily compute cost would be about 6.3x higher at today’s traffic levels.
  • AI tools and the Rust compiler made adoption feasible. Even with no prior Rust experience, Copilot and the compiler’s error messages helped us ramp up quickly and safely.
  • Testing and validation were critical. Shadow testing caught subtle bugs and gave us confidence to switch over without risking stability.
  • The migration was worth it. Rust hasn’t just kept us afloat. It’s positioned us to scale even further and tackle new challenges with confidence.