Inditex handles traffic spikes: Building a custom load test platform


Inditex Tech


For any e-commerce business, it is crucial to deliver a seamless shopping experience that leads to a frictionless checkout.

This is especially critical during those times of the year when sales are concentrated over a short period. In our case — Zara, Massimo Dutti, Pull&Bear, and the rest of the Inditex brands — our e-commerce platforms must withstand spikes of massive concurrent sessions, with traffic increasing exponentially compared to regular days.

Managing this surge requires a robust system that ensures real-time inventory updates, smooth navigation, and reliable checkouts, all without performance interruptions.

The challenge: Handling traffic spikes

One of the most effective ways to ensure a website can handle traffic increases is load testing. It simulates high traffic to measure how well a system performs under stress.

Load tests track key metrics, like requests or orders per minute, to spot issues such as slow page loads or server bottlenecks early, so they can be fixed before they have a real effect in production.

At Inditex, we initially used a commercial load-testing solution. However, as demand grew every year, it fell short on real-time observability and required long setup times. Given the scale and complexity of our global e-commerce operations, we needed a more flexible, faster, and more precise solution to simulate and prepare for extreme scenarios.

So, we developed our load-testing platform: ICaRUS (Inditex Chaos and Resilient User System).

Why build a custom load-testing platform?

Why did we opt for building ICaRUS, our custom load testing solution?

  • Reliability: We required a quick-stop feature based on thresholds that would halt tests before they impacted the live site, even during high-traffic events.
  • Scalability: ICaRUS can simulate real-world traffic, allowing us to optimize service capacity.
  • Tailored scenarios: With ICaRUS, we can design specific load scenarios to meet our exact needs.
  • Real-time monitoring: To track performance during tests and adjust as needed, we needed to send real-time logs (with data sampling to reduce volume but still capture key insights) and metrics every 15 seconds.

Architecture

To support the magnitude of our e-commerce operations, the ICaRUS architecture needed these six key components:

  1. A domain-specific language (DSL) to define the tests.
  2. An orchestrator.
  3. An operator to manage the lifecycle of a load test scenario.
  4. An autostop feature to halt the test whenever needed.
  5. A load balancer cache.
  6. Observability across the whole test lifecycle.


1. Domain-Specific language (DSL)

The reason for building ICaRUS was to be able to answer beforehand questions like whether the platform is resilient enough to handle extraordinary traffic surges, or how it would perform under high-demand in specific markets.

These questions are answered by defining likely customer actions (called scenarios) such as adding items to the cart, browsing the catalog, or registering users under stress conditions.

And these scenarios are defined, or written, by testers in a domain-specific language (DSL), a language tailored to the domain or field being tested.

Our DSL is called ICaRUS DSL and, on top of traditional DSL features, allows for test case management by adjusting the importance of each action during the test (called relative weight) and controlling the load ramp-up.

For example, if adding items to the cart happens more often than registering users, testers can assign a higher weight to the “add to cart” action. How? By defining a higher number of virtual users (VUs) for it.

Another important characteristic of ICaRUS DSL is that testers can specify the instances to use (either regular or spot), improving cost efficiency.
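The article doesn't show the actual ICaRUS DSL syntax, so the sketch below is purely hypothetical: every field name is illustrative, invented to show how relative weights (via VU counts), ramp-up, instance type, and autostop thresholds might be declared together in one scenario file.

```yaml
# Hypothetical sketch of an ICaRUS DSL scenario — the real syntax is not
# shown in the article; all field names here are illustrative only.
scenario: peak-sales-es
cluster: europe-west
instances: spot          # or "regular", for cost efficiency
rampUp: 5m
duration: 30m
testCases:
  - name: add-to-cart
    vus: 800             # higher relative weight
  - name: register-user
    vus: 200             # lower relative weight
autostop:
  maxResponseTimeMs: 5000
  maxHttp5xxRate: 0.01
```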

2. Orchestrator

The load testing platform needs an orchestrator to act as a bridge between the tester (who defines the tests in DSL) and the rest of the components (mainly the operator) to trigger or stop the tests.

Ours is called ICaRUS Server. It integrates with our internal authorization and auditing mechanisms, and validates the test scenarios to ensure all parameters (like virtual users (VUs), duration, and real-time monitoring) are properly configured before execution.

3. Operator

Since our testing platform is Kubernetes-based, we decided to use an operator to execute the load tests and oversee the test lifecycle, ensuring smooth execution in the right order and providing testers with test updates.

We use k6 Operator to run load tests in a distributed way. But, since we also needed to provide additional functionality on top of k6, we have also built ICaRUS Operator.

k6 Operator

k6 is an open-source tool and its native integration with Kubernetes made it a perfect fit for our setup. We chose k6 because of its:

  • High performance: It runs large load tests fast and efficiently, helping to spot performance issues early on and fix them before they impact users, especially during peak traffic periods.
  • Built-in metrics: k6 generates detailed performance data (such as request rates, response times, error ratios, and active virtual users) that integrates with Grafana, which we already used for monitoring. k6 checks, iterations, and VUs allow us to track system health and performance in real time.
  • Efficient scenario creation: Its intuitive scripting model mimics real-world user behavior, making it easier to create complex scenarios. This saves time and improves test accuracy.
  • Flexibility: With JavaScript, k6 allows rapid development and iteration of load-testing scripts, allowing us to quickly adapt to new needs and test scenarios.
  • Active community support: k6 is constantly updated, with a large community providing plugins and detailed documentation, ensuring it stays reliable and up-to-date with the best testing practices. As part of this ecosystem, we also contributed by releasing several open-source plugins tailored to common testing needs.
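As an illustration of how relative weights map to VU counts in plain k6 (this is our own sketch using k6's standard `scenarios` configuration, not Inditex's actual code; names and numbers are illustrative):

```javascript
// k6 "scenarios" configuration (runs under k6, not plain Node.js) giving
// "add to cart" a higher relative weight than user registration via VUs.
// Each scenario's `exec` names the exported function implementing the action.
export const options = {
  scenarios: {
    add_to_cart: {
      executor: 'constant-vus',
      vus: 80,              // higher weight: 80 VUs
      duration: '10m',
      exec: 'addToCart',
    },
    register_user: {
      executor: 'constant-vus',
      vus: 20,              // lower weight: 20 VUs
      duration: '10m',
      exec: 'registerUser',
    },
  },
};
```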

Contributions to k6 Operator

We did have to customize k6 Operator to address some scenario-specific conditions, and we contributed a couple of improvements back to the project.


ICaRUS Operator

After validating the scaling rules, ICaRUS Server (the orchestrator) triggers the load test by communicating with the operator.

Although k6 Operator is the piece that actually executes the tests, we needed to extend its functionality to fully integrate with some particularities of Inditex's development environment and infrastructure.

So, we created ICaRUS Operator, which makes all the preparations so that k6 can execute the load test, as well as managing the autostop feature and real-time observability in Grafana.

Additionally, since we use Kubernetes clusters, ICaRUS Operator relies on a Custom Resource Definition (CRD) in the cluster, which works like a formula that calculates the resources needed to execute the load test according to the defined test scenario. In our case, this CRD defines a custom API called IcarusLoadScenario that includes:

  • A complete test scenario, including VUs (virtual users), duration, frequency, and other test parameters.
  • Cluster scaling information to ensure the cluster can handle the load during the test.
  • The scripts and datasets that must be downloaded to execute the test.
  • Real-time observability setup, including logs, metrics, alerts, and stop mechanisms.
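The IcarusLoadScenario schema is internal to Inditex, so the resource below is a hypothetical sketch: the API group, version, and every field name are invented to illustrate how the four elements above might appear in one custom resource.

```yaml
# Hypothetical IcarusLoadScenario instance — the real schema is internal;
# the apiVersion group and all field names are illustrative.
apiVersion: icarus.example.com/v1
kind: IcarusLoadScenario
metadata:
  name: peak-sales-es
spec:
  scenario:                       # complete test scenario
    vus: 1000
    duration: 30m
  scaling:                        # cluster scaling information
    nodes: 12
    instanceType: spot
  artifacts:                      # scripts and datasets to download
    scriptsRepo: internal-scripts-repo
    datasetsStore: internal-dataset-store
  observability:                  # logs, metrics, alerts, stop mechanisms
    metricsIntervalSeconds: 15
    logSampling: true
    autostop:
      maxHttp5xxRate: 0.01
```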


4. Autostop

Tests may impact production environments, causing service interruptions, errors in the purchase process, or slowing down page loads. So, for us, it is crucial to have a quick-stop feature based on thresholds to prevent the test from negatively impacting our business.

Actually, this autostop feature was one of the reasons why we opted for building our own load testing solution in the first place.

Testers can set specific thresholds (for example, a maximum response time of 5 seconds), and if these limits are reached, the autostop component automatically halts the load test.

Besides stopping the test, Autostop also logs the reason why the test stopped, helping testers quickly find and fix the problem.
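ICaRUS's autostop is its own component, but the same threshold idea can be expressed directly in k6's standard options, whose thresholds support aborting a run as soon as a limit is crossed (the limits below are illustrative):

```javascript
// k6 threshold configuration: abortOnFail halts the test the moment a
// threshold is crossed. Limits here are illustrative, not Inditex's values.
export const options = {
  thresholds: {
    // Stop if the 95th-percentile response time exceeds 5 seconds.
    http_req_duration: [{ threshold: 'p(95)<5000', abortOnFail: true }],
    // Stop if more than 1% of requests fail (e.g. HTTP 5xx responses).
    http_req_failed: [{ threshold: 'rate<0.01', abortOnFail: true }],
  },
};
```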

In the picture below, you can see the dashboard that displays the test results, with clear alerts when limits are exceeded. In this instance, Autostop was triggered by too many HTTP 500 errors (highlighted in red) with detailed “stop reasons” listed.


5. Load balancer cache

Our e-commerce uses a paired cluster model with a load balancer that distributes traffic across multiple cluster instances. This allows for high availability, prevents crashes, and avoids overloading any single cluster.

So, to replicate real-world scenarios, ICaRUS also needs a load balancer to handle the traffic generated by k6.

To balance the load, we use Varnish. It has a dual role, acting as a:

  • Load balancer: to evenly distribute traffic across clusters.
  • Caching system: to implement caching mechanisms to optimize efficiency. It filters out certain test-generated requests before they reach the cluster.

We deploy a specific number of Varnish instances for each scenario, with the load-balancing configuration defined by the tester. Each instance has a custom load-balancing setup, dynamically generated using VCL (Varnish Configuration Language).
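A generated VCL file might look along these lines. This is a sketch using Varnish's standard round-robin director; the backend names and the request-filtering rule are illustrative, not Inditex's actual configuration:

```vcl
# Sketch of a dynamically generated VCL for one scenario — backend names
# and the filtering rule are illustrative.
vcl 4.1;

import directors;

backend cluster_a { .host = "cluster-a.internal"; .port = "8080"; }
backend cluster_b { .host = "cluster-b.internal"; .port = "8080"; }

sub vcl_init {
    # Round-robin load balancing across the paired clusters.
    new lb = directors.round_robin();
    lb.add_backend(cluster_a);
    lb.add_backend(cluster_b);
}

sub vcl_recv {
    set req.backend_hint = lb.backend();

    # Filter out certain test-generated requests before they reach the
    # cluster (illustrative rule).
    if (req.url ~ "^/synthetic-probe") {
        return (synth(204, "Filtered by test cache layer"));
    }
}
```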

6. Observability

Our previous load-testing solution only offered basic reports after the test had finished, but we needed…

  • Real-time log access: Instant visibility into logs to spot issues and adjust parameters during tests.
  • Quick access to metrics: Key performance stats available as the test runs (not only after completion) for proactive action.
  • Customizable metrics: Flexibility to add custom metrics for different testing scenarios and needs.

… to catch and fix issues in real time.

That’s why we built ICaRUS with a strong observability component.

To manage logs and metrics efficiently, we integrate Vector, a highly efficient tool for building observability pipelines that also offers the flexibility to process and transform data in flight. Vector allows us to:

  • Whitelist k6 metrics to focus on what matters most.
  • Reduce the cardinality of the metrics sent.
  • Have policies for sending the functional metrics generated in k6 scripts.
  • Monitor by functional groups (k6 groups) or individual monitoring by test case.

And it also offers sampling techniques, so we capture only the most relevant logs and avoid overloading the system.
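Vector's built-in `sample` transform covers this pattern; a minimal pipeline sketch is shown below. The source, sink, sample rate, and endpoint are all illustrative assumptions (the article does not say where Inditex ships its logs):

```toml
# Sketch of a Vector pipeline with log sampling — source/sink choices,
# paths, and the sample rate are illustrative.
[sources.k6_logs]
type = "file"
include = ["/var/log/k6/*.log"]

[transforms.sampled]
type = "sample"
inputs = ["k6_logs"]
rate = 10            # keep roughly 1 in 10 events

[sinks.log_store]
type = "loki"        # assumed log backend, for illustration
inputs = ["sampled"]
endpoint = "http://logs.internal:3100"
labels = { job = "icarus-load-test" }
```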

For metrics, we use the xk6-output-prometheus-remote extension, which enables real-time monitoring with just a 15-second delay.

In this setup, k6 sends the metrics to Vector, which acts as a Prometheus `remote_write` endpoint. From there, metrics flow to Prometheus, centralizing the data and reducing system overload.

This setup is integrated into Grafana so we can have live insights during tests.
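As a sketch, a k6 run wired to a remote-write endpoint looks like this. The endpoint address is illustrative; the flag and environment variable are those of recent k6 releases, where this extension became the experimental Prometheus remote-write output:

```shell
# Sketch: running a scenario with metrics pushed over Prometheus remote write.
# The Vector endpoint address is illustrative.
K6_PROMETHEUS_RW_SERVER_URL=http://vector.internal:9090/api/v1/write \
  k6 run -o experimental-prometheus-rw scenario.js
```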


Platform context

Given the business's scale and magnitude, the Inditex platform operates through multiple clusters globally to support the business expansion in all its aspects (e-commerce and physical stores, but also the associated transport and logistics).

In each cluster, we have created dedicated infrastructure for launching load tests. More specifically: ICaRUS Operator and k6 Operator.

The rest of the services (load balancer, autostop, or observability) are created when the test is launched and are deleted (if specified) once the test has finished.

When launching a load test, testers choose in which cluster to execute the test. They also choose which Azure instances to use: either regular or spot.


Launching load tests with ICaRUS

With a clear picture of the ICaRUS architecture and its main components, it's time to see what launching a load test with ICaRUS looks like.

First, the tester defines the test scenario (in ICaRUS DSL).

Then, ICaRUS Server (the orchestrator) communicates (via CRD) with ICaRUS Operator, which calculates the resources needed for the test, factors in the geographic location of the System Under Test (SUT) to ensure the test environment matches the production setup, and selects the Kubernetes cluster for execution.

This is when the test properly begins. It has four phases:

1. Scale, to set and scale test infrastructure.

2. Prepare, to load datasets and configure test resources.

3. Run/Stop, to execute the test, stopping and restarting it if needed.

4. Tear down, to remove the test infrastructure.


1. Scale: Cluster scaling for load tests

The first phase involves creating virtual machines for the test (based on the DSL-defined scenario).

To avoid interference, we keep each test scenario independent and isolate the resources for each test. We scale the Kubernetes cluster by allocating a set number of nodes based on test requirements. To automate this process, we use MachineSets, which handle node provisioning through the infrastructure provider.

In this phase, testers can choose to either manually remove nodes after the test or set a TTL (Time-To-Live) to automatically clean up resources after the test, preventing wasted resources.

Once scaling is complete, the environment is ready for the test.
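A MachineSet for dedicated load-test nodes could be sketched as follows. This assumes the OpenShift Machine API flavor of MachineSets; the names, labels, and replica count are illustrative, and the provider-specific VM settings (e.g. Azure spot instances) are left as a placeholder:

```yaml
# Sketch of a MachineSet allocating dedicated load-test nodes — names,
# labels, and replica count are illustrative; assumes OpenShift Machine API.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: icarus-loadtest-nodes
  namespace: openshift-machine-api
spec:
  replicas: 12                    # node count derived from the test scenario
  selector:
    matchLabels:
      icarus/scenario: peak-sales-es
  template:
    metadata:
      labels:
        icarus/scenario: peak-sales-es
    spec:
      providerSpec: {}            # provider-specific VM settings (e.g. Azure spot)
```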


2. Prepare: Setting up the test scenario

Once the VMs are created, the DSL specs are sent to the infrastructure, which initializes all virtual users and loads datasets into memory.

Setting the test environment with ICaRUS solves two challenges we had with our previous load-testing system:

  • Faster setup: With ICaRUS, we cut setup times by 90%, so we can run more tests in less time.
  • Handling traffic spikes: k6 handles rapid load ramp-ups much better, making it ideal for scenarios with fast increases in traffic.

We also use the k6 REST API to monitor the initialization status in real time. This ensures the test is only marked as “Ready” once all virtual users are fully set up.
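k6's REST API (served on port 6565 by default) exposes a `/v1/status` endpoint whose payload includes the current and maximum VU counts. The sketch below shows a readiness check over a payload shaped like that response; the specific readiness rule (`running` and all VUs allocated) is our own illustrative choice, not necessarily the exact check ICaRUS uses.

```javascript
// Sketch: deciding whether a k6 test is fully initialized from a
// /v1/status-shaped payload. The readiness rule is illustrative.
function isFullyInitialized(statusPayload) {
  const attrs = statusPayload.data.attributes;
  // Consider the test "Ready" only when it is running and every
  // configured VU has been allocated.
  return attrs.running && attrs.vus === attrs['vus-max'];
}

// Example payload, shaped like a k6 /v1/status response:
const sample = {
  data: {
    type: 'status',
    id: 'default',
    attributes: { paused: false, vus: 10, 'vus-max': 10, running: true },
  },
};

console.log(isFullyInitialized(sample)); // true
```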

As the final check, the Autostop mechanism is tested before launching the load test to ensure it can be stopped immediately if needed.


3. Run/Stop: Execution and monitoring

When the test environment is ready, the test is executed. Since metrics are collected every 15 seconds, testers have full visibility on the test performance. Plus, they can stop and restart the test as needed.

Testers have real-time information on:

  • Scenario overview: High-level stats, like VUs, HTTP requests, error rates, and iterations.


  • Test case overview: Metrics for each test case, such as request counts, response times, and errors.


  • Groups overview: Organizes requests into single-user actions, helping to spot issues in scripts or transactions.


  • Virtual machine monitoring: Tracks CPU, memory usage, and costs, ensuring efficient resource use throughout the test.


The Run/Stop phase can end in three ways:

  • Natural conclusion: The test finishes as planned.
  • Manual stop: The tester stops the test at any time if issues come up or objectives are met.
  • Autostop: The test automatically stops if any preset thresholds are exceeded, protecting production environments.
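For the manual-stop path, the same k6 REST API used to monitor initialization can also stop a running test; a sketch with curl, assuming the default port and instance id:

```shell
# Sketch: stopping a running k6 test via its REST API (default port 6565).
curl -X PATCH http://localhost:6565/v1/status \
  -H 'Content-Type: application/json' \
  -d '{"data":{"type":"status","id":"default","attributes":{"stopped":true}}}'
```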

4. Tear Down

Upon test completion, ICaRUS dismantles the scenario-specific infrastructure.

Conclusion: ICaRUS for cost optimization and performance excellence

ICaRUS has significantly improved our load testing efficiency and reliability. We’ve achieved:

1. Cost reduction: 40% saving on cloud expenses

By building ICaRUS in-house, we avoided third-party licensing fees and optimized cloud resources. We can now run more tests with fewer resources, reducing cloud expenses by 40%.

2. Performance improvements

ICaRUS helps us simulate massive traffic spikes by gradually increasing the load, with faster ramp-ups that prepare our systems to handle peak stress just before high-demand events, preventing overloads.

We’ve reduced test preparation times to just 2–3 minutes, and we can now run double, and sometimes triple, the number of tests, boosting productivity during peak periods.

3. Reliability in production

The Autostop feature automatically halts tests if issues arise (e.g., slow response times), helping to mitigate overload risks during peak periods.

4. Improved observability

ICaRUS provides real-time performance insights, tracking metrics like virtual users, requests, errors, and resource usage. This helps us identify and fix issues faster.

5. Better data for troubleshooting

When problems arise, we can quickly pinpoint the cause. ICaRUS speeds up troubleshooting and reduces test disruptions. We can also analyze past tests to improve future ones and prevent recurring issues, ensuring better performance during high-demand events.
