With the advent of LLMs, the temptation to churn out a flood of unit tests for a false veneer of productivity and protection is stronger than ever.
My colleague Matthias Doepmann recently fired a shot at AI-generated tests that don’t validate the behavior of the System Under Test (SUT) but instead create needless ceremony around internal implementations. At best, these tests give a shallow illusion of confidence in the system’s correctness while breaking at the smallest change. At worst, they remain green even when the SUT’s behavior changes.
In practice, they add maintenance overhead and drag down code reviews. The frustration in that post wasn’t about violating some abstract testing philosophy. It came from having to wade through countless implementation-checking tests churned out by LLMs across components of a real, large-scale distributed system.
I think the problem persists for three reasons:
- First, many developers have begun defaulting to LLMs for generating tests, regrettably even in critical systems. In greenfield projects with no test baseline, AI agents often go rogue and churn out these cheap implementation-checking tests. Google calls them interaction tests.
- Second, the prevalence of mocking libraries encourages this pattern. They make it too easy to write tests that assert “which function called which” instead of “what actually happened.”
- Third, once these tests exist, they create inertia and people keep piling on the same style of tests to be consistent.
Test state, not interactions
The general theme when writing unit tests should be checking the behavior of the system, not the scaffolding of its implementation. It doesn’t matter which method called which, how many times, or with what arguments.
What matters is: if you give the SUT some input, does it return the expected output? In a stateful system, does the input cause the system to mutate some persistence layer in the expected way? That persistence layer doesn’t always need to be a real database; it could be an in-memory buffer.
In scenarios where your code invokes external systems, it is more useful to test your system with canned responses from upstream calls rather than testing which method is being called.
The salient point is: test outcomes, not implementation details. The book Software Engineering at Google sums it up as “test state, not interactions”:
“With state testing, you observe the system itself to see what it looks like after invoking with it. With interaction testing, you instead check that the system took an expected sequence of actions on its collaborators in response to invoking it. Many tests will perform a combination of state and interaction validation.”
And the guidance that follows:
“By far the most important way to ensure this is to write tests that invoke the system being tested in the same way its users would; that is, make calls against its public API rather than its implementation details. If tests work the same way as the system’s users, by definition, change that breaks a test might also break a user.”
I think the first step in the right direction is to accept that LLMs can’t substitute for thought. The first few critical tests in your system shouldn’t be written by LLMs, and you must vet the tests churned out by the genie that wants to leap. Next up, you can often get away without a mocking library, and more often than not, doing so improves the quality and maintainability of your tests.
Mocking libraries often don’t help
Mocking libraries come with their own idiosyncratic syntax and workflows. On most occasions, handwritten fakes are better than mocks. I’ll use Go to make my point here because that’s what I write the most these days, but the lesson applies to other languages too.
Consider a simple UserService that depends on a DB interface. Its job is to delegate user creation to the database and return any error to the caller:
A mocking tool such as mockery can generate a mock implementation of the DB interface. The generated code records calls and arguments so that tests can later assert whether the expected interactions happened:
Using this mock, a test can be written to check that CreateUser interacts with the dependency in the expected way:
This works mechanically, but it breaks down in practice:
It checks the collaborator call, not the result
A useful test would assert that “alice” was actually added or that a duplicate error was returned. This one only verifies that InsertUser("alice") was invoked once.
It breaks on harmless refactors
If the database method is renamed while keeping the same semantics, callers see no difference but the test fails:
    // usersvc/usersvc.go (harmless refactor, behavior unchanged)
    package usersvc

    type DB interface {
        UpsertUser(name string) error // was InsertUser
        ListUsers() []string
    }

    func (s *UserService) CreateUser(name string) error {
        return s.db.UpsertUser(name) // same public behavior
    }

The mock-based test no longer compiles or needs rewiring, even though the public behavior didn’t change.
And worse, it survives real bugs
If an error is accidentally swallowed, callers get the wrong signal but the test still passes:
    // usersvc/usersvc.go (buggy refactor: behavior changed)
    package usersvc

    func (s *UserService) CreateUser(name string) error {
        _ = s.db.InsertUser(name) // ignore error by mistake
        return nil                // callers think it succeeded
    }

A real DB or an in-memory fake would raise a constraint error that should propagate. The mock test goes green anyway because it only checked the call path.
The common thread is that mocks lock tests to implementation details. They don’t protect the behavior that real users rely on.
Interface-guided design and fakes
A better approach is to keep the same interface but back it with a handwritten fake. The fake encodes the domain rules you care about, and tests can focus on outcomes instead of verifying which collaborator methods were called.
Here, we’re handwriting the fake implementation of the DB interface instead of generating it with a mock generator.
Tests with the fake read like a statement of expected behavior:
This avoids the fragility of mocks. The tests survive harmless refactors, fail when behavior changes, and stay readable without a mocking DSL.
The cost is that you have to maintain the fake as the interface evolves. In practice, that’s still easier than constantly updating brittle mock expectations and occasionally dealing with the mock library’s lengthy migration workflow.
Fakes vs real systems
Sometimes the right move is to test against a real database running in a container. That is still state testing, just at a higher fidelity. The tradeoff is speed: you get stronger confidence in behavior, but the tests run slower.
Most of the time, handwritten in-memory fakes are what you need, and most tests should stick to those. When you do need the same behavior you would see in production, tools like testcontainers let you spin up databases, queues, or caches inside containers. Your tests can then call the SUT normally, with its configuration pointing at the containerized service, just as production code would connect to a production resource.
Parting words
This is not a rally against using LLMs for tests. But the seed tests, the first handful that set the standard, need to come from you. They define what correctness means in your system and give the ensuing tests a model to follow. If you hand that job to an LLM, you give up the chance to shape how the rest of the suite grows.
This is not to disparage mocking libraries either. But I have seen people armed with overzealous LLMs and mocks wreak havoc on a test suite and then unironically ask reviewers to review the mess. Instead of validating behavior, the suite fills up with fragile interaction checks that break on refactors and stay green through real bugs.
More often than not, you can skip mocking libraries and rely on handwritten fakes that check the behavior of the SUT instead of its interactions. The next person who needs to read and extend your tests might thank you for that.