A common problem when building modern websites with search, content recommendations and personalisation is that standard unit and integration testing can fairly easily show whether the site works functionally, but it's much harder to show whether these complex content features return what the user actually expects, because that requires an understanding of the user's intent and a semantic understanding of the content that comes back.
Semantic testing is a test design technique that focuses on evaluating the meaning and logical correctness of inputs and outputs based on the intended behavior of the system. It aims to detect errors related to misunderstandings of requirements, domain rules, or data interpretations rather than just syntactic mistakes.
Some examples:
- If I search for “black dress shoes” does my store return fancy shoes, or black dresses?
- If the site does content recommendations, do these work well with the primary content, and do they match what a given user expects?
- Does the content in our articles and product pages follow the specified tone of voice?
Standard testing tools can give you an indication, but turning that into a red/green judgement would require HTML parsing, knowledge of the element structure and so on, which is why we have historically relied on human QA teams to catch these kinds of issues.
Instead we can introduce an LLM into our testing setup. Since I'm already deploying my website in a container, it's a low barrier to use Jest, Testcontainers and Docker Model Runner, which all run both on my Mac and in CI/CD (with a GPU available).
All snippets in this post are shortened for brevity, but the full versions can be found in this github repo.
Setup
First, get a local model running and ensure it follows a specific output schema that Jest can assert success or failure against, and then access it using the normal openai js library.
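Something along these lines works. This is a minimal sketch: the base URL and port assume a local Docker Model Runner endpoint, the model name is a placeholder, and the `ask` helper is just an illustrative wrapper for the rest of the post.

```typescript
import OpenAI from "openai";

// Point the standard openai js client at the local model runner.
// Base URL, port and model name are assumptions; adjust to your local setup.
const client = new OpenAI({
  baseURL: process.env.LLM_BASE_URL ?? "http://localhost:12434/engines/v1",
  apiKey: "not-needed-locally", // a local runner typically ignores the key
});

const MODEL = process.env.LLM_MODEL ?? "ai/llama3.2";

// Small helper used by the tests below: send a prompt, get the raw text back.
export async function ask(prompt: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: MODEL,
    messages: [{ role: "user", content: prompt }],
  });
  return completion.choices[0].message.content ?? "";
}
```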
Importantly, I instructed the model to always respond in JSON using a simple format that lets me grade quality and get a reason for failing tests. This is appended to every test prompt.
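Roughly like this; the exact wording and field names are illustrative, but the idea is a fixed JSON contract that Jest can assert on and that explains any failure:

```typescript
// Illustrative sketch of the response contract appended to every test prompt.
export const RESPONSE_FORMAT = `
Always respond with a single JSON object and nothing else, in this shape:
{
  "pass": true | false,   // did the page meet the expectation?
  "score": 1-10,          // how well it met it
  "reason": "short explanation, especially when pass is false"
}`;

// Helper that appends the contract to a test prompt before sending it.
export const withFormat = (prompt: string) => `${prompt}\n${RESPONSE_FORMAT}`;
```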
For the Testcontainers part, we simply reuse the existing compose file and get the hostname and port of the site.
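A sketch of that setup, assuming the site's compose file sits in the project root; the service/container name and internal port are placeholders for whatever your compose file defines:

```typescript
import { DockerComposeEnvironment, StartedDockerComposeEnvironment } from "testcontainers";

let environment: StartedDockerComposeEnvironment;
let siteUrl: string;

beforeAll(async () => {
  // Reuse the site's existing compose file and wait for it to come up.
  environment = await new DockerComposeEnvironment(".", "docker-compose.yml").up();

  // Container name ("umbraco-1") and internal port (8080) are assumptions.
  const site = environment.getContainer("umbraco-1");
  siteUrl = `http://${site.getHost()}:${site.getMappedPort(8080)}`;
}, 120_000);

afterAll(async () => {
  await environment.down();
});
```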
Order of blog articles
Determine whether all the articles linked on the page have fitting headlines, and whether they are sorted in a way that makes logical sense, either by date or alphabetically.
For this I created a small agent (narrator: this is not an agent)
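A minimal sketch of what that looks like, reusing the illustrative `ask`, `withFormat` and `siteUrl` helpers from the setup section; the `judgePage` helper name and the `/blog` path are assumptions, and the full snippet lives in the repo:

```typescript
// The "agent": fetch the page and hand the HTML plus an instruction to the model.
async function judgePage(url: string, instruction: string) {
  const html = await (await fetch(url)).text();
  const raw = await ask(withFormat(`${instruction}\n\nHTML:\n${html}`));
  return JSON.parse(raw) as { pass: boolean; score: number; reason: string };
}

test("blog overview lists articles in a sensible order", async () => {
  const verdict = await judgePage(
    `${siteUrl}/blog`,
    "Determine if all the articles linked on the page have fitting headlines " +
      "and are sorted in a way that makes logical sense, either by date or alphabetically."
  );
  expect(verdict.pass).toBe(true); // verdict.reason explains any failure
});
```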
What I personally find lovely here is the sheer readability of the test: I can see what is happening and what the test is supposed to do just by reading the prompt.
Jest runs, spins up Umbraco, performs the test and throws away the environment:
Tone of voice
Okay, let's try something else: determining whether the content on a page matches the corporate tone of voice guidelines. To do this, we provide the LLM with a styleguide from Mailchimp and then ask the test to validate the HTML on a given URL.
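A sketch of the test, assuming the styleguide is checked in as a local markdown file and reusing the illustrative helpers from above; the file path and page URL are placeholders:

```typescript
import { readFile } from "node:fs/promises";

test("article content follows the tone of voice guidelines", async () => {
  const styleguide = await readFile("./styleguide.md", "utf8");
  const html = await (await fetch(`${siteUrl}/blog/my-first-post`)).text();

  const raw = await ask(
    withFormat(
      `You are reviewing website copy against the following tone of voice guidelines:\n` +
        `${styleguide}\n\nDoes the content of this page follow them?\n\nHTML:\n${html}`
    )
  );

  const verdict = JSON.parse(raw) as { pass: boolean; score: number; reason: string };
  expect(verdict.pass).toBe(true);
});
```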
This also passes, but if I add a requirement to the styleguide that content should always be written in Danish, the test breaks:
Jest failure:
This took a couple of prompt tweaks: sometimes the model started returning its reasoning in Danish, sometimes it just ignored the requirement, so the quality of your test very much comes down to your prompt engineering ability.
Search Results
Next, let's try to validate the search results we get back against a repository of matching and non-matching content. For your specific use case you can add common edge cases, misspellings and so on, but the Umbraco out-of-the-box search is very basic, so we will keep it basic as well.
For this we need the LLM for two things: generating an article about flying cats and another about flying cars. I want to ensure our search page ranks pages correctly for the queries “flying”, “cars” and “flying cars”.
I've created an editor agent that I can give a writer persona and instructions on what to write, reusing the styleguide along with the output schema format that I want the content in.
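Roughly along these lines; the persona strings, field names and `editor` helper are illustrative, and actually publishing the generated pages to Umbraco is left out of this sketch:

```typescript
import { readFile } from "node:fs/promises";

// Editor "agent": persona + instructions + styleguide in, structured content out.
type GeneratedPage = { title: string; body: string };

async function editor(persona: string, instructions: string, styleguide: string): Promise<GeneratedPage> {
  const raw = await ask(
    `You are ${persona}. Follow this styleguide:\n${styleguide}\n\n${instructions}\n\n` +
      `Respond with a single JSON object: { "title": string, "body": string }`
  );
  return JSON.parse(raw) as GeneratedPage;
}

// Seed the two articles the search tests need.
beforeAll(async () => {
  const styleguide = await readFile("./styleguide.md", "utf8");
  const cats = await editor("an enthusiastic pet blogger", "Write a short article about flying cats.", styleguide);
  const cars = await editor("a motoring journalist", "Write a short article about flying cars.", styleguide);
  // ...create the two pages on the site from cats and cars...
});
```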
Validation is a bit crude here, and the prompts could likely be built out further, or the LLM given more context. For simplicity I used the same persona/intent validation, but search result validation could benefit from more specific input on the expected behaviour of the system.
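As a sketch, the crude version could look like this, again reusing the illustrative helpers from above; the search URL format and the expectations passed to the model are assumptions about the site:

```typescript
// For each query, fetch the search results page and ask the model whether the
// results match the stated intent. The /search?query= format is an assumption.
test.each([
  { query: "flying cars", intent: "articles about flying cars should rank above articles about flying cats" },
  { query: "cars", intent: "only car-related articles should appear" },
  { query: "flying", intent: "both the flying cats and flying cars articles may appear" },
])("search for '$query' returns sensible results", async ({ query, intent }) => {
  const html = await (await fetch(`${siteUrl}/search?query=${encodeURIComponent(query)}`)).text();
  const raw = await ask(
    withFormat(`A user searched for "${query}". Expectation: ${intent}.\n\nSearch results HTML:\n${html}`)
  );
  const verdict = JSON.parse(raw) as { pass: boolean; score: number; reason: string };
  expect(verdict.pass).toBe(true);
});
```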
This is a very basic setup meant to validate the idea, but I believe it could become a useful tool, with a number of other test scenarios:
- Clarity of CTAs
- Bias checks
- Validate metadata against page content, titles etc
- Compare content variants or translations
- Determine helpfulness of error messages/labels/etc
Again, the source code is here.