Don't Settle for Mediocre Front End Testing

You’re moments away from finishing a feature you’ve been working on for the last two weeks when you get a Slack notification that the frontend test pipeline has failed for the 824th time that year. 

It’s the same handful of flaky tests that fail whenever there’s a half-moon.

You make a note to fix these tests and get back to finishing that feature.

We were in this situation and asked ourselves whether we enjoyed building and maintaining our frontend test system. The answer was no, so we tore it down and built something we could be proud of.

Getting your frontend testing infrastructure stable is tough. Timing is tricky to get right when your tests are at the whim of network requests and browser rendering cycles. However, with the right tools and a solid foundation, you can do it, and it’s worth the effort.

A solid testing pipeline isn’t just about catching issues; it’s a force multiplier for your development team, empowering them to move faster, work with confidence, and focus on doing their best work.

In this post, we walk through why we switched from Cypress to Playwright, how we made the switch, and what the outcomes have been.

What was wrong with our old testing system anyways?

Our Canary Console frontend has decades of person-years poured into it. We sweat the small details. A while back, we settled on Cypress for running frontend integration tests, but over time, it’s become that mystery Tupperware in the office fridge — nobody wants to open it because of what you might find, but it’s still there, singing its siren song.

Our integration tests need to interact with live Consoles, and each Console has various settings and features that can be enabled or disabled (for example, we can turn SAML authentication on or off for each customer, which affects part of the frontend). We were hitting limitations in Cypress and the testbed architecture, which took too much time to work around.

Here’s an outline of how frontend testing was integrated into our CI flow:

  1. The code landed on our main branch.
  2. A GitHub Action would start a Cypress agent.
  3. The Cypress agent interacted with a single, persistent, live Canary Console.
  4. The Action would send failures to Slack.

In this design, we never reset the Console state between testing rounds. If tests don’t clean up after themselves, the next round of tests will run against an unknown state.

Since the tests shared a single Console, we couldn’t run them in parallel, so either developers were bottlenecked on the single testbed, or they tested on non-standard testbeds.

It was a mess; so much so that no one wanted to write tests, and our frontend team didn’t want to be involved with the project. So, how did we get folks excited about writing end-to-end tests?

Yearning for the ocean

If this system were perfect, what would it look like? Why not aim for that? We came up with:

  1. The tests would run on every commit, even on feature branches.
  2. The test suites would have zero flaky tests. Why can’t the tests run for months without failing due to flakiness?
  3. The test suite would execute rapidly. Prompt feedback is essential.
  4. The test system would be portable. It would run in our CI and on our developers’ local machines.
  5. Writing tests should be enjoyable; at least, we shouldn’t want to bang our heads on our desks while writing them. If the tests aren’t easy to write, how will you craft long, comprehensive test cases?

Having these points as our guiding constellation, we knew that getting close would mean raising our team’s floor to a new level. The plan became:

  1. Make a repeatable test environment in containers.
  2. Move from Cypress to Playwright for writing and running tests.
  3. Focus on flaky tests.
  4. Ensure our main UX flows were covered.

Foundation

We moved our entire stack into a collection of Docker containers we could spin up anywhere and anytime.

We could run the container system with a fixed pristine database that resets each time the stack starts up, meaning our state is always the same. The only thing that would change would be the code. Using Docker was a great start at solving our stability issue.

[Diagram: each commit triggers its own GitHub Action, which uses a setup script to spin up an isolated set of Docker containers (Nginx, Redis, our application, and a fresh copy of the pristine DB) and runs Playwright against it; on failure, a trace file is saved and the frontend Slack channel is notified.]

Our setup solved #1 and #4 of our dream list, and gave us a head start on #2.

For #1, GitHub Actions can run arbitrary Docker Compose setups. We could spin up our entire stack for each commit and run the new code against that isolated environment.

For #2, the stability afforded by being able to reset our state when we spun up the containers gave us a good base for a system that wasn’t flaky, thanks to a pristine database that never changes. There was still work to do on individual tests, which we’ll explore in more detail later.

For #4, since portable environments are what Docker is all about, the testing environment could run in GH Actions and on our local machines. Both testing environments are identical.

The foundation set us up for success, but we needed to ensure that our test cases followed suit.

Joy for devs and force multiplier (why we moved to Playwright)

As trite as it sounds, writing Cypress tests didn’t spark joy. There is a list of areas where Cypress pales in comparison to Playwright (and we’ll get into those shortly), but at a base level, we simply didn’t like using the tool.

We decided to trash all our Cypress tests and start again in Playwright. If you have a collection of Cypress tests that work well, keep those around. We didn’t have that many, and almost none of those tests were useful. After we weighed up the pros and cons, switching was a no-brainer.

Cypress Eccentricities

The Cypress test framework has many idiosyncrasies that add unneeded friction to writing and reading tests.

Cypress commands aren’t promises. They aren’t synchronous, either. Commands are added to a queue and only run after the function returns. Take a look at this example:

const username = cy.get('#username').invoke('val');
console.log(username); // This logs a "Chainable" object, not the actual value

The username variable holds a Cypress Chainable, not a resolved DOM element or value. When console.log() runs, Cypress has yet to execute the cy.get() command. To get the value, you need to use .then(), which waits for the command to resolve.
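
To make the snippet above work, you’d wrap the logging in a .then() callback; a minimal sketch, reusing the #username selector from the example:

cy.get('#username').invoke('val').then((username) => {
  console.log(username); // Now this logs the actual input value
});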

Consider a test that verifies whether the sum of a shopping cart’s item prices, VAT, and shipping cost matches the total displayed.

In Cypress, there are two ways to do this. You either need to nest .then() callbacks, each of which waits for the previous command to complete before executing:

describe('Cart total exact match validation', () => {
  it('should exactly equal item price + shipping + VAT of $10.00', () => {
    cy.visit('/cart');
    cy.get('[data-testid="cart-item"] [data-testid="item-price"]').invoke('text').then((itemPriceText) => {
      const itemPrice = parseFloat(itemPriceText.replace('$', ''));
      cy.get('[data-testid="shipping-cost"]').invoke('text').then((shippingText) => {
        const shippingCost = parseFloat(shippingText.replace('$', ''));
        cy.get('[data-testid="vat-amount"]').invoke('text').then((vatText) => {
          const vatAmount = parseFloat(vatText.replace('$', ''));
          const actualTotal = itemPrice + shippingCost + vatAmount;
          const expectedTotal = 10.00;
          expect(actualTotal).to.equal(expectedTotal);
        });
      });
    });
  });
});

Or you can use aliases, which are essentially variables you can reference in other parts of your Cypress test:

describe('Cart total exact match validation with aliases', () => {
  it('should exactly equal item price + shipping + VAT of $10.00', () => {
    cy.visit('/cart');

    // Get item price and alias it
    cy.get('[data-testid="cart-item"] [data-testid="item-price"]')
      .invoke('text')
      .then(text => parseFloat(text.replace('$', '')))
      .as('itemPrice');

    // Get shipping cost and alias it
    cy.get('[data-testid="shipping-cost"]')
      .invoke('text')
      .then(text => parseFloat(text.replace('$', '')))
      .as('shippingCost');

    // Get VAT and alias it
    cy.get('[data-testid="vat-amount"]')
      .invoke('text')
      .then(text => parseFloat(text.replace('$', '')))
      .as('vatAmount');

    // Use cy.then to access all aliases
    cy.then(function () {
      const actualTotal = this.itemPrice + this.shippingCost + this.vatAmount;
      const expectedTotal = 10.00;
      expect(actualTotal).to.equal(expectedTotal);
    });
  });
});

Playwright Sensibilities

There is only one style in Playwright, and it doesn’t require any framework-specific knowledge. If you’re familiar with writing JavaScript, you’ll understand it quickly.

Native JavaScript promises form the foundation of Playwright, which allows us to wait for the result of any async operation and store it in a variable. This style of programming aligns with how we write code daily:

test('Cart total should exactly equal item price + shipping + VAT of $10.00', async ({ page }) => {
  await page.goto('/cart');

  const itemPriceText = await page.locator('[data-testid="cart-item"] [data-testid="item-price"]').textContent();
  const itemPrice = parseFloat(itemPriceText?.replace('$', '') || '0');

  const shippingText = await page.locator('[data-testid="shipping-cost"]').textContent();
  const shippingCost = parseFloat(shippingText?.replace('$', '') || '0');

  const vatText = await page.locator('[data-testid="vat-amount"]').textContent();
  const vatAmount = parseFloat(vatText?.replace('$', '') || '0');

  const actualTotal = itemPrice + shippingCost + vatAmount;
  const expectedTotal = 10.00;
  expect(actualTotal).toBe(expectedTotal);
});

As developers, we want the ideas in our heads to flow freely into our IDEs (or terminals, if you’re a psychopath), with as little friction as possible. Framework-specific knowledge comes at a cost. If you aren’t writing Cypress tests every day, you’ll need to refresh your memory each time you need to write a test. Continuity between your tests and code means moving between the two is effortless.

Developer Experience

Playwright has great support inside VS Code. Want to step through your test code with breakpoints? You’ve got it. Just fixed a test and want to see whether it works? Rerun it directly from inside your IDE with the click of a button. It’s smooth, easy, and a treat to use.

Speed

Playwright’s performance was the feature that initially caught our attention. We planned to stay with Cypress and rework our foundation (moving our stack into Docker containers). Out of interest, we looked at which testing framework the industry used, and there was a consensus that Playwright was the new “it girl”.

We started with our longest-running Cypress test (creating one of each of our Canarytokens), which took around 5 minutes to execute with Cypress using Chrome only.

Migrated to Playwright, the identical test now takes 1 minute and 30 seconds to complete, running against Chrome, Firefox, and WebKit. For a developer, few things are as gratifying as watching performance numbers drop. It’s like pressure-washing a grime-caked driveway.

How is Playwright so much faster? Playwright runs each test in a separate browser context (essentially an incognito window). It interacts with these contexts via the DevTools protocol for Chromium and custom protocols for Firefox and WebKit. Because tests are isolated, Playwright can run multiple tests simultaneously. Cypress runs tests in-process, meaning the tests run inside the browser in the same execution context as the app you are testing. Cypress can only run one test at a time; parallelism requires Cypress Cloud.
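
To give a rough idea of what this looks like in practice, here’s a minimal playwright.config.ts sketch (the project names come from Playwright’s bundled device presets; the worker count is illustrative) that runs the suite in parallel against all three engines:

import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  fullyParallel: true, // each test gets its own isolated browser context, so tests can run side by side
  workers: 4,          // illustrative; tune to your CI runner
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox',  use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit',   use: { ...devices['Desktop Safari'] } },
  ],
});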

It all adds up

Ensuring your team isn’t wallowing in Cypress code, trying to remember framework-specific gotchas, or waiting three hours for the test suite to finish means they can focus on writing useful test cases.

One of our team members pointed out that our CI test report artifact was 700 MB, and downloading it from GH took 30 minutes. We could have shrugged it off, but we took half an hour to investigate whether we could change the tracing to only include failed test data. We could. It’s a tiny thing, but had we left it, we might never have gotten around to it, and a 30-minute wait would have become the norm.
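
In Playwright, that change can be as small as a single config option; a minimal sketch of the relevant piece of playwright.config.ts (everything else omitted):

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'retain-on-failure', // only keep trace files for failing tests, which shrinks the CI artifact
  },
});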

Concise code makes it easier to understand a test’s flow. Respecting your developers’ time means they’ll have more time to do the important stuff. Having a system people enjoy makes pushing through tough patches easier.

Fight the Flake and know your systems

It’s easy to ignore that one flaky test. “It only fails every other time, it’s fine.” Eventually, as you add more and more tests, you’ll find yourself in a situation with 20 flaky tests. Twenty flaky tests distracting your developers every time they fail.

We set a zero-flaky-test policy. Whenever we finished writing new tests, we’d have to run the suite with --repeat-each=50, which reran the new tests 50 times. If they failed even once, we’d need to figure out why and fix it.
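
The same repetition can also be pinned in a dedicated config; a minimal sketch, where the separate playwright.stress.config.ts file name is a hypothetical convention rather than anything Playwright requires:

// playwright.stress.config.ts (hypothetical file), selected with: npx playwright test --config=playwright.stress.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  repeatEach: 50, // equivalent to passing --repeat-each=50 on the CLI
  retries: 0,     // no retries, so a single flaky run counts as a failure
});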

Running the tests repeatedly might seem harsh, but it meant that these tests would be as solid as a Nokia 3310. It also forced us to get to know Playwright, our testing infrastructure, and our UI. 

Initially, running tests concurrently led to loads of inconsistent failures. As the frontend team, we could have left this and chalked it up to flakiness, but we had a no-flakiness policy, so we had to figure out why it was happening. It turned out that debug web server startup scripts aren’t built for concurrency. Our production Consoles use Nginx and uWSGI to handle communication between our frontend and backend, so we aligned our test setup with our production services. No more flaky test failures; as a bonus, our tests ran even faster.

This increase in speed also allowed us to find a race condition that occurred while deleting a Flock, which would break the UI on page load. This bug had been present for years. We could have ignored it, but that damn “no flakiness” policy! *shakes fist*

While building our user management test suite, we noticed the tests took longer and longer to run with each repetition (--repeat-each=50). Every time we deleted a user, every other admin user would get an email mentioning that somebody had removed a user. This growing pile of emails started interfering with test timings, and tests would fail. The compounding delay could mean a user might wait 10 seconds for an API request to return. This bug had also been around for a while, and we wouldn’t have found it if we had ignored the flakiness.

These things might seem unimportant for standard test runs (that don’t repeat), but as you add more tests, the flakiness will start to add up. Squash the bugs when you find them so you can return to the important stuff.

Make sure you’re testing the important stuff

Our transition from Cypress to Playwright wasn’t just a straight port. We also wanted to write new test cases that completely covered core feature functionality. We started with the five most important features our customers use.

For us, we’re all about letting customers Know. When it matters. For the tests, this meant focusing on alert management and testing our most popular Canary configuration flows to ensure folks could get those alerts. Alerting and setting up Canaries are the most critical aspects of our interface. Customers spend their time in these areas, and we want them to be bug-free.

In our Cypress test suite, we had little to no focus on what was or wasn’t an important test. For example, we had a test to assert that the text on our 404 page was correct. By itself, that test is fine, but why write a test for that when the suite had no tests for alert management, our company’s primary focus?

We set a plan that focused on the critical stuff and ruthlessly avoided busy work.

What’s a good test?

A great test behaves as a user would. It should mimic the actions a real person performs. Making sure the elements are visible isn’t enough. For example, suppose the feature is a configuration wizard to set up a device. In that case, the test should step through the entire wizard, fill in forms and tick checkboxes, and then complete the wizard with an assertion to confirm the device settings have changed. The tests should cover all permutations of the wizard, looking for bugs in every nook and cranny.
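
To make that concrete, here’s a hedged Playwright sketch of such a wizard test; the route, labels, and test IDs are hypothetical and not our actual Console selectors:

import { test, expect } from '@playwright/test';

test('setup wizard configures the device end to end', async ({ page }) => {
  await page.goto('/devices/setup'); // hypothetical route
  await page.getByLabel('Device name').fill('warehouse-canary');
  await page.getByLabel('Enable alerting').check();
  await page.getByRole('button', { name: 'Next' }).click();
  await page.getByLabel('Notification email').fill('secops@example.com');
  await page.getByRole('button', { name: 'Finish' }).click();

  // Assert on the outcome, not just visibility: the saved settings should reflect the wizard input
  await page.goto('/devices/warehouse-canary/settings');
  await expect(page.getByTestId('alerting-status')).toHaveText('Enabled');
});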

Paul wrote a suite of tests for our Windows File Share service, one of our most popular Canary services.

Users can tailor their file share through the UI by adding, removing, renaming, or uploading custom files. An optional feature also generates a file structure automatically (we walked through the feature in a recent webinar).

There are a lot of moving parts. Here’s what we tested (a sketch of the first case follows the list):

  1. Turning the service on and off
  2. Validating user input
  3. Turning System Shares on and off
  4. Turning alerting on and off for authentication attempts
  5. Enabling and disabling a File Share Instance
  6. Adding a new File Share Instance
  7. Validating user input for a File Share Instance
  8. Creating a new folder in the file tree
  9. Editing a folder name
  10. Deleting a folder
  11. Adding all 7 of our basic file types (.docx, .pdf, and so on)
  12. Turning Industry Specific Shares (our feature that generates the share based on your industry) on and off
  13. Confirming that selecting an industry generates a new share
  14. Adding miscellaneous departments to the new share
  15. Regenerating the Industry Specific Share
  16. Validating the custom upload file size limit
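
As an illustration of the first case, here’s a hedged sketch of what toggling the service on and off might look like; the route, labels, and save button are illustrative, not our real Console markup:

import { test, expect } from '@playwright/test';

test('Windows File Share service can be turned on and off', async ({ page }) => {
  await page.goto('/canary/settings/windows-file-share'); // hypothetical route
  const toggle = page.getByLabel('Enable Windows File Share'); // hypothetical label

  await toggle.check();
  await page.getByRole('button', { name: 'Save' }).click(); // hypothetical save button
  await page.reload();
  await expect(toggle).toBeChecked(); // the setting survived a reload, so it was actually persisted

  await toggle.uncheck();
  await page.getByRole('button', { name: 'Save' }).click();
  await page.reload();
  await expect(toggle).not.toBeChecked();
});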

These tests ensure that our Windows File Share works as expected, even when we’re not around. It’s important to note that we’re testing features as if we were a user. We’re testing actual functionality and not just ensuring the word “Windows File Share” appears in the DOM. Anyone making changes to this feature in the future can rest assured that we’ve got their back.

Writing tests like this is hard work, and ensuring they aren’t flaky is also hard. The thing with writing good tests upfront is that they become the blueprint for all the following tests. Solid tests provide value every time your CI runs. They hold the shape of your product and ensure you aren’t letting your customers down.

How close did we get to our ideal test system?

In the end, we got pretty close. We’ve had 20 failure alerts across 148 commits to main (around two months of work).

A handful of these were due to an upstream bug, which we’ve worked around by installing the latest Docker Compose.

A few of them were due to legitimate bugs we introduced; whoop, whoop!

Some were due to flaky tests that managed to sneak past our --repeat-each=50 rule. We put these tests on pause while we looked at their implementation. We needed to overhaul a few of these tests before re-enabling them. We decided to leave the more stubborn tests disabled because we concluded that the value they added wasn’t worth the effort. Remember, only focus on the important stuff! Low-value tests that distract your developers aren’t worth it!

As of writing, we’ve written 311 rock-solid test cases across our main feature set:

  1. Alerting and Alert Management
  2. Canary Configuration
  3. Canarytoken Configuration
  4. User management
  5. Authentication

We’ve got an excellent foundation for building even better tooling in the future.

Tear Down

One of our core principles at Thinkst is that we are a refuge from mediocrity. It’s difficult to do something well; it takes work. We put in the work. If something isn’t where we want it to be, we iterate until it gets there.

We have always known that there was coolness to find in fixing our test pipeline, but we were always busy doing more important work.

If someone had asked me a few months ago whether our Windows File Share feature worked, I wouldn’t have known unless I manually checked. Now, I know it works. If it didn’t, our test suite would have notified me.

Make time for your team to explore things like this. You’ll find a lot of great stuff here, and it might be a force multiplier for your team. You might build something you can’t imagine doing without.
