CI/CD for AI: Running Evals on Every Commit


Nobody argues against automated testing: it gives us confidence to change code, and it tells us when something breaks, fast. But that only works if we run tests constantly, ideally after every change, so we can spot exactly when a bug was introduced.

The same is true when working with LLMs. We don’t just want automated evals (evaluations); we want them running as often as possible. In this post, I’ll explore ways to make that part of the daily workflow with almost no manual effort.

This post assumes some familiarity with evals in the context of AI engineering practices. If you aren’t familiar with evals, I suggest you first read this post: “You Can’t Just Trust the Vibes”: A Deep Dive on AI Evaluations with Sarah Kainec | Focused.

Over the past few months, I've been working on a side project that, among its features, allows users to call an LLM to generate a grocery shopping list based on a recipe. From an example input 'homemade pizza', it generates a list with, say, 'tomato sauce, flour, mozzarella cheese'... pretty simple, right?

As I started tweaking the prompt, **solving issues became like playing whack-a-mole at a carnival**: every time I solved a problem, a new one would pop up. Sometimes, old issues I thought were resolved resurfaced. So naturally, I looked at evaluations to help me put an end to this.

Once I had some evals in place, I started looking into ways to run them as often as possible. As this is a personal project, my usual workflow is to commit frequently, but not push every time, because nobody else would care about the latest version except me. This is why my first approach wasn't jumping straight to a CI pipeline step, but involved a local pre-commit hook.

Getting a pre-commit hook up and running was easy. I created a dataset in Langchain with a few example runs, wrote an evaluator function that would score LLM responses from 0.1 to 1.0, and ran the evaluations in a Python script. However, it turned out to be VERY impractical. Why?

This takes WAAAY too long

A pre-commit hook needs to finish **before** committing new code to a VCS, to prevent issues from being integrated into the system. So, a useful check would have to wait for all of the experiments to finish before allowing devs to commit. My first, naive approach was to wait for the whole evaluation to finish:
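
Something like the sketch below, in other words. The dataset name, the example values, the placeholder target, and the toy evaluator are illustrative stand-ins rather than the project's real ones; the key point is that `evaluate` only returns once the whole experiment has finished.

```python
# run_evals.py -- illustrative sketch of a blocking evaluation run.
from langsmith import Client
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

client = Client()

# One-time setup: a tiny dataset of example runs (names and values made up here).
if not client.has_dataset(dataset_name="grocery-list-dataset"):
    dataset = client.create_dataset("grocery-list-dataset")
    client.create_examples(
        inputs=[{"recipe": "homemade pizza"}],
        outputs=[{"items": ["tomato sauce", "flour", "mozzarella cheese"]}],
        dataset_id=dataset.id,
    )


def generate_grocery_list(inputs: dict) -> dict:
    # Placeholder target: the real version prompts an LLM with inputs["recipe"].
    return {"items": ["tomato sauce", "flour", "mozzarella cheese"]}


def correctness(run: Run, example: Example) -> dict:
    # Toy evaluator: fraction of the expected items that made it into the output.
    expected = set(example.outputs["items"])
    produced = set(run.outputs.get("items", []))
    score = len(expected & produced) / len(expected) if expected else 0.0
    return {"key": "correctness", "score": score}


# evaluate() only returns once every example has been run and scored, so a
# pre-commit hook calling this script waits for the full experiment.
results = evaluate(
    generate_grocery_list,
    data="grocery-list-dataset",
    evaluators=[correctness],
    experiment_prefix="ci-run",
)
```

Wired into `.git/hooks/pre-commit` (or a pre-commit framework hook) as little more than `python run_evals.py`, this gives exactly that "wait for everything" behaviour.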

This could work if the size of your dataset allows the run to be relatively quick. However, in my case, it was taking more than a minute (with a really small dataset), and honestly, I've seen devs turn off pre-commit hooks for less.

I experimented with increasing the concurrency of the evaluation runs (with the `max_concurrency` param), and it led to significant improvements. However, this speed increase might not be enough for teams that work with much larger datasets than the small one I used in my experiment.
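
Concretely, that's a single extra argument to `evaluate` (the value below is arbitrary, and the placeholders come from the earlier sketch):

```python
results = evaluate(
    generate_grocery_list,
    data="grocery-list-dataset",
    evaluators=[correctness],
    experiment_prefix="ci-run",
    max_concurrency=8,  # evaluate up to 8 examples in parallel; tune to your rate limits
)
```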

Asynchronous triggering

Since blocking the user from committing for minutes until all evaluations were done was out of the picture, the other alternative to try was running them async.

I couldn't find a fire-and-forget method in the Langchain API. There is an asynchronous `aevaluate` method, but its results still need to be awaited for the evaluations to actually run. When I tried to simply await the `aevaluate` call and end the script, it only created an empty experiment with no eval runs.
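
For reference, that attempt would look something like the sketch below, assuming `aevaluate` mirrors `evaluate`'s signature and reusing the placeholders from the earlier script:

```python
import asyncio

from langsmith.evaluation import aevaluate

from run_evals import correctness, generate_grocery_list  # placeholders from the sketch above


async def agenerate_grocery_list(inputs: dict) -> dict:
    # Thin async wrapper around the synchronous placeholder target.
    return generate_grocery_list(inputs)


async def main() -> None:
    # The await returns, but ending the script here leaves an empty experiment
    # behind: the eval runs themselves never get recorded.
    await aevaluate(
        agenerate_grocery_list,
        data="grocery-list-dataset",
        evaluators=[correctness],
        experiment_prefix="ci-run",
    )


asyncio.run(main())
```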

I started playing around with the idea of spawning a subprocess that would trigger the evaluation runs, but by this point I was losing faith. A subprocess would still achieve the goal of running tests often, but if it couldn't prevent me from committing code that would decrease the quality of my application, what was even the point of using a pre-commit hook? If only there were a way to create an asynchronous process that would get triggered on code changes and would notify me of the results…
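
For completeness, the subprocess idea amounts to a fire-and-forget sketch like this (not code from the project):

```python
import subprocess
import sys

# Fire-and-forget: start the evaluation script in the background and return
# immediately, so the commit is never blocked -- but also never gated.
subprocess.Popen(
    [sys.executable, "run_evals.py"],  # hypothetical script name
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    start_new_session=True,  # detach the evaluation run from the hook process
)
```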

The good thing about ideas that end up not working is that we still learn a lot in the process. In this case, even though I abandoned my pre-commit hook, I already had everything needed to run the evaluations as part of a CI/CD process (a workflow sketch follows the list):

  • A curated dataset
  • An evaluator function
  • A script that runs the evaluations!
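
A minimal GitHub Actions workflow that ties those pieces together might look like this (the file path, script name, and secret names are assumptions rather than the project's actual setup):

```yaml
# .github/workflows/evals.yml -- illustrative sketch
name: evals
on: [push]

jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python run_evals.py
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```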

Great, this achieves the goal of running our tests often! 

There's something I purposefully left out of both approaches: what to do with experiment results. What’s the point of performing an experiment if you ignore the results?

In the case of the pre-commit hook, we aim to reject the commit if we determine that the change will decrease the quality of our application. For the CI action, we want a big, ugly red X indicating there has been a failure in the evaluations step. Sometimes, that might not be as straightforward as running a Jest unit test.

Unlike traditional automated tests, evaluation results are not always binary (Pass/Fail). I set up a correctness evaluator to test various aspects of my grocery list generator, such as recipe correctness, language (since I wanted it to work in English and Spanish, depending on the input), and conciseness. The AI-as-judge acts like a teacher grading a student’s work, going over the responses and assigning them a % grade depending on the criteria defined above. This ‘grade’ is helpful to see how well your prompt evolves in different aspects, but it’s not enough on its own to know when to approve or reject changes. Just like with exams, a passing grade has to be defined. But how? 
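
Before getting to that, here is roughly what such a judge looks like when written as a LangSmith evaluator; the model, the prompt, and the way the per-criterion grades are averaged are all illustrative, not the project's actual judge:

```python
import json

from langsmith.schemas import Example, Run
from openai import OpenAI

judge = OpenAI()

GRADING_PROMPT = """You are grading a generated grocery shopping list.
Recipe: {recipe}
Generated list: {items}
Score each criterion from 0 to 1: recipe correctness, language match, conciseness.
Reply with JSON like {{"correctness": 0.9, "language": 1.0, "conciseness": 0.8}}."""


def llm_judge(run: Run, example: Example) -> dict:
    # AI-as-judge: ask a model to grade the response against the defined criteria.
    prompt = GRADING_PROMPT.format(
        recipe=example.inputs["recipe"],
        items=run.outputs.get("items", []),
    )
    response = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    grades = json.loads(response.choices[0].message.content)
    # Collapse the per-criterion grades into a single percentage-style score.
    return {"key": "judge_score", "score": sum(grades.values()) / len(grades)}
```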

To understand what an acceptable result is, we first need to run enough evaluations to get stable results. Once we have them, we can look at averages, or results that we consider unacceptable, to set a reasonable limit. If a change scores below this limit, we will reject it. 
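
In code, gating on that limit can be as simple as averaging the scores of the experiment and failing the process when they dip below it. The unpacking below assumes the shape of the `ExperimentResults` object returned by the earlier `evaluate` sketch, and the 0.8 passing grade is purely illustrative:

```python
import sys

# `results` is the ExperimentResults object returned by evaluate() above.
scores = [
    r.score
    for row in results
    for r in row["evaluation_results"]["results"]
    if r.score is not None
]

average = sum(scores) / len(scores)
print(f"Average eval score: {average:.2f}")

PASSING_GRADE = 0.8  # derive this from your own stable runs
if average < PASSING_GRADE:
    sys.exit(1)  # reject the commit / fail the CI step
```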

Of course, there are other things we might care about in our evaluation results. 

Flawed averages

If we only assert over the average evaluation result, we might be missing outliers in big enough datasets. This is something to take into account, especially for examples that represent a whole equivalence class of use cases. To avoid that, I also added a check to verify that no result falls below a minimal individual threshold:
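
That check is just a second assertion over the same per-example scores (continuing the previous sketch; the 0.5 floor is arbitrary):

```python
MIN_INDIVIDUAL_SCORE = 0.5  # illustrative floor for any single example

# Fail even when the average looks fine, if any one example scores too low.
outliers = [s for s in scores if s < MIN_INDIVIDUAL_SCORE]
if outliers:
    print(f"{len(outliers)} example(s) scored below {MIN_INDIVIDUAL_SCORE}")
    sys.exit(1)
```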

Change over time

Instead of testing against a static target, we might be interested in how our application quality changes over time, to ensure there are no unexpected decreases.

If we ensure we use the same experiment prefix for all evaluator runs (in this example, 'ci-run'), we can use the Langchain SDK to fetch previous evaluator runs and get an average of that:
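
One way to do that is sketched below. `name_contains`, the root-run filter, and the feedback plumbing are assumptions about the LangSmith client rather than a recipe from the post, so double-check them against your SDK version:

```python
from langsmith import Client

client = Client()

# Fetch recent experiments that share the 'ci-run' prefix, newest first.
experiments = sorted(
    (p for p in client.list_projects(name_contains="ci-run") if p.start_time),
    key=lambda p: p.start_time,
    reverse=True,
)
previous = experiments[1:4]  # skip the experiment we just created, keep the three before it

past_scores = []
for experiment in previous:
    run_ids = [run.id for run in client.list_runs(project_name=experiment.name, is_root=True)]
    if run_ids:
        past_scores.extend(
            fb.score for fb in client.list_feedback(run_ids=run_ids) if fb.score is not None
        )

past_average = sum(past_scores) / len(past_scores) if past_scores else None
```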

After that, we can compare the average of the current run with the average of the past three runs to gauge whether quality has improved or declined over time.
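
Continuing the sketches above (where `average` is the current experiment's mean score and `past_average` the historical one), that comparison is a couple of lines:

```python
TOLERANCE = 0.05  # illustrative wiggle room before we call it a regression

if past_average is not None and average < past_average - TOLERANCE:
    print(f"Quality regression: {average:.2f} vs recent average {past_average:.2f}")
    sys.exit(1)  # fail the check
```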

Next steps

I now have a much clearer picture of how my AI application is performing, and whether my changes are actually improving it. In particular, I identified a few use cases that were not covered by my prompt and were obscured by the average score. Therefore, my immediate next step is to add support for these cases, guiding my development process through evaluations, TDD-style. This is a fundamental engineering technique called Eval-Driven Development, and I'm excited to explore it further.

What about your next steps? If you are curious about how we use LangSmith to debug LLM pipelines, I can't recommend this post enough: Debugging LLM Pipelines with LangSmith: Why Prompting Alone Isn’t Enough | Focused
