At CGFT, we help companies fine-tune custom code AI agents on & for their specific codebases. For a recent customer, we used test-coverage-guided reinforcement learning (RL) to fine-tune Qwen2.5-Coder-7B & deploy a custom unit-testing agent. Given a code diff/PR, the agent increases code coverage with meaningful tests.
The result: significantly improved code coverage and more realistic mocks and assertions, outperforming general-purpose models like o3 and o4-mini, even at just 7B params. This post dives into some of the technical work behind making that possible.
Our results highlight a key takeaway: with codebase-specific RL, we can build specialized models that surpass general frontier models on targeted tasks like unit-test creation, at a fraction of their cost and size.
We noticed that general-purpose code models often fell short on specifics like realistic mocking or context-aware assertions. This makes sense: trained mostly on public GitHub data, they lack familiarity with a project’s unique structure, patterns, and dependencies.
In contrast, a company’s internal codebase provides rich, precise supervision. This includes existing tests, usage patterns, and execution environments that let us run model-generated code. This is great for reinforcement learning, where we can reward the model for meaningful outcomes: passing tests, coverage increases, and bug-catching via mutation testing.
With RL, the model learns from product impact (i.e. code coverage), not just examples. This enables tightly aligned models that generate useful, relevant tests grounded in the actual codebase, not GitHub averages.
We first take the customer’s codebase and pair each source file with its corresponding test file. Using the Language Server Protocol (LSP), we map source functions to their associated tests.
From these mappings, we generate prompt-output examples for training. Each prompt includes:
- The full source file, with the target function highlighted via a diff,
- The corresponding test file,
- And additional context from similar tests elsewhere in the codebase.
To create a reference output, we selectively remove a subset of tests related to the target function from the test file. These omitted tests become the reference output associated with the prompt. This output is used exclusively for the chain-of-thought (CoT) warmup phase, not for reinforcement learning.
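Putting the pieces together, a single training example looks roughly like the sketch below (illustrative Python; the helpers and field names are ours, not the customer’s actual pipeline):

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    prompt: str            # what the model sees
    reference_output: str  # held-out tests, used only for the CoT warmup

def build_example(source_file: str, target_fn_diff: str,
                  test_file_tests: list[str], held_out_tests: list[str],
                  similar_tests: list[str]) -> TrainingExample:
    """Assemble one prompt/output pair.

    `held_out_tests` are the tests for the target function that we remove
    from the test file; they become the reference output.
    """
    remaining = [t for t in test_file_tests if t not in held_out_tests]
    prompt = "\n\n".join([
        "### Source file (target function shown as a diff)",
        source_file,
        target_fn_diff,
        "### Existing test file",
        "\n\n".join(remaining),
        "### Similar tests elsewhere in the codebase",
        "\n\n".join(similar_tests),
        "### Task: write new tests for the highlighted function",
    ])
    return TrainingExample(prompt=prompt,
                           reference_output="\n\n".join(held_out_tests))
```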
We divide the prompt-output pairs into train/eval sets (roughly an 80-20 split). To avoid cross-contamination, the split is done at the file level rather than uniformly across all prompts, so examples derived from the same file never end up on both sides.
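One simple way to enforce a file-level split is to hash the source-file path (a minimal sketch, assuming the ~80-20 ratio above):

```python
import hashlib

def assign_split(source_path: str, eval_fraction: float = 0.2) -> str:
    """Every prompt derived from the same source file lands in the same split."""
    bucket = int(hashlib.sha256(source_path.encode()).hexdigest(), 16) % 100
    return "eval" if bucket < eval_fraction * 100 else "train"
```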
We initialize training with the base Qwen2.5-Coder-7B model and perform a brief warmup phase to bootstrap its ability to generate thinking traces. These traces are produced by prompting Qwen2.5-Coder-32B-Instruct to generate a thinking trace given the source file diff & the new test in the output.
The following prompt was used.
The warmup phase uses standard next-token supervised fine-tuning: the model is trained to output the chain-of-thought trace plus the new test, given the prompt.
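In other words, loss is computed only over the completion tokens. A minimal sketch of the batch construction, assuming a Hugging Face tokenizer:

```python
import torch

def build_sft_batch(tokenizer, prompt: str, cot_and_test: str, max_len: int = 8192):
    """Next-token SFT: loss only on the CoT trace + new test, not on the prompt."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    target_ids = tokenizer(cot_and_test + tokenizer.eos_token,
                           add_special_tokens=False).input_ids
    input_ids = (prompt_ids + target_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + target_ids)[:max_len]  # -100 = ignored by the loss
    return {
        "input_ids": torch.tensor([input_ids]),
        "labels": torch.tensor([labels]),
    }
```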
We then run RL on the warmed-up model for 5 epochs using GRPO. Designing an effective reward function took several iterations. We started with something simple: rewarding the model when the generated test both passed and increased coverage for the diff in a given PR.
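That first reward was essentially binary (sketch below; `TestRunResult` is an illustrative stand-in for what our build/execution harness reports):

```python
from dataclasses import dataclass

@dataclass
class TestRunResult:          # reported by our (hypothetical) build/execution harness
    passed: bool
    coverage_before: float    # line coverage of the diff before adding the test
    coverage_after: float     # line coverage of the diff after adding the test

def simple_reward(result: TestRunResult) -> float:
    """v1 reward: 1.0 only if the generated test passes AND increases coverage."""
    if not result.passed:
        return 0.0
    return 1.0 if result.coverage_after > result.coverage_before else 0.0
```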
In early experiments, the model deftly hacked this reward function by generating tests with non-existent or trivial assertions (e.g. simply calling the function with wrong inputs and asserting that a generic error is thrown).
We also noticed that the model produced more whimsical, useless chain-of-thought traces when hacking rewards. Here’s a real example:
To address this hacking, we added an extra reward term that rewards useful assertions and penalizes trivial ones. Penalized assertions include bare null checks, expect.anything() checks, etc.
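A rough illustration of that term, using pattern checks over Jest-style assertions (the specific patterns and weights here are illustrative):

```python
import re

# Assertion patterns we treat as "trivial" (Jest-style; illustrative list)
TRIVIAL_PATTERNS = [
    r"expect\.anything\(\)",
    r"\.toBeDefined\(\)",
    r"\.not\.toBeNull\(\)",
    r"\.toThrow\(\)",          # asserting only that *some* error is thrown
]

def assertion_quality_reward(test_code: str) -> float:
    """Positive reward for substantive assertions, penalty for trivial ones."""
    assertions = re.findall(r"expect\([^;]*;", test_code)
    if not assertions:
        return -1.0            # a test with no assertions at all is worthless
    trivial = sum(any(re.search(p, a) for p in TRIVIAL_PATTERNS) for a in assertions)
    substantive = len(assertions) - trivial
    return (substantive - 2 * trivial) / len(assertions)
```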
Ultimately, a test’s value is in its ability to catch bugs. To model this, we incorporated mutation scoring into the reward function. Synthetic mutations are introduced into the source function being tested (e.g. swapping operators and removing conditionals), and the mutation score is the proportion of mutants the test catches.
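A simplified sketch of how such mutants and the resulting score can be computed (the mutation operators and harness interface are stand-ins for our actual setup):

```python
import re

def make_mutants(source_fn: str) -> list[str]:
    """Create simple synthetic mutants of the function under test."""
    mutants = []
    for op, repl in [("==", "!="), ("<", ">="), ("+", "-"), ("&&", "||")]:
        if op in source_fn:
            mutants.append(source_fn.replace(op, repl, 1))   # swap one operator
    # neutralize the first conditional, if any
    dropped = re.sub(r"if\s*\([^)]*\)\s*\{", "if (false) {", source_fn, count=1)
    if dropped != source_fn:
        mutants.append(dropped)
    return mutants

def mutation_score(test_passes_on_mutant: list[bool]) -> float:
    """Fraction of mutants the generated test kills (i.e. fails against).

    `test_passes_on_mutant[i]` is True if the test still passes when run
    against mutant i -- as reported by the execution harness.
    """
    if not test_passes_on_mutant:
        return 0.0
    killed = sum(1 for passed in test_passes_on_mutant if not passed)
    return killed / len(test_passes_on_mutant)
```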
We also found that keeping the KL divergence penalty in GRPO ensured the model didn’t drift catastrophically from the initial warmed-up model, which helped mitigate hacking.
In production, the unit-testing agent can generate multiple test cases per function and keep only the ones that add coverage. Training should optimize for metrics that reflect this best-of-K approach. However, vanilla GRPO maximizes the average reward across individual samples, which effectively optimizes for pass@1.
We tweak GRPO to optimize for pass@K (at least one successful test out of K attempts) and max@K coverage (the maximum coverage achieved among K samples). These metrics better align with production goals. We leverage the work from Tang et al., which changes how the per-sample advantage is computed within each group.
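As a rough illustration of the idea (not the exact estimator from Tang et al.), each sample can be credited by how much it moves the group’s best-of-K reward, e.g. via a leave-one-out baseline:

```python
import numpy as np

def max_at_k_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out credit assignment toward a max@K objective.

    rewards: shape (K,), e.g. the coverage increase of each of the K sampled
    tests. Each sample's advantage is the group's max reward minus the max of
    the *other* K-1 samples, so only samples that push the best-of-K frontier
    get positive credit. (Illustrative only; see Tang et al. for principled
    pass@K / max@K policy-gradient estimators.)
    """
    k = len(rewards)
    advantages = np.empty(k)
    for i in range(k):
        others = np.delete(rewards, i)
        advantages[i] = rewards.max() - (others.max() if len(others) else 0.0)
    return advantages
```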
Unit testing is also a creative process → the developer needs to reason about & anticipate different code paths and identify potential failure points. This is somewhat at odds with the general RL process, which tends to reduce the diversity of model outputs. Our pass@K objective somewhat mitigates this, but we also found the following to work well:
- An entropy loss term to encourage output diversity and prevent the model from becoming overly confident in specific token predictions (implemented as a KL divergence term w.r.t. a uniform distribution over the vocabulary; sketched after this list).
- Following DAPO, we used the decoupled clipping mechanism (separate lower and upper clip ranges, with a higher upper range) to allow for more exploration.
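A minimal sketch of the entropy-style term from the first bullet, written as a KL to the uniform distribution over the vocabulary (the coefficient is illustrative):

```python
import math
import torch
import torch.nn.functional as F

def uniform_kl_bonus(logits: torch.Tensor, coef: float = 1e-3) -> torch.Tensor:
    """Auxiliary loss: KL(policy || uniform) over the vocabulary.

    logits: (batch, seq_len, vocab). Minimizing this KL is equivalent
    (up to a constant) to maximizing token-level entropy, which discourages
    over-confident predictions.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # KL(p || u) = sum_i p_i * (log p_i - log(1/V)) = sum_i p_i * log p_i + log V
    kl = (probs * log_probs).sum(dim=-1) + math.log(logits.size(-1))
    return coef * kl.mean()
```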
We also tried other approaches, with less success:
- We experimented with Generative Flow Networks (GFlowNets). Rather than solely optimizing for the maximum reward, GFlowNets aim to learn a policy that samples trajectories in proportion to their reward, modeling the entire reward distribution.
- We implemented a group-level novelty reward. Given K samples from the model, we compute each generated test’s average syntax-tree distance to the other samples in the group (sketched below).
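The novelty term from the second bullet looked roughly like this (sketch; we stand in for syntax-tree distance with a normalized token-level edit distance, since the real implementation depends on the target language’s parser):

```python
import re
from difflib import SequenceMatcher

def novelty_rewards(generated_tests: list[str]) -> list[float]:
    """Group-level novelty: how different each sample is from its siblings.

    We approximate syntax-tree distance with 1 - similarity over token
    sequences; each sample's novelty is its average distance to the other
    K-1 samples in the group.
    """
    token_seqs = [re.findall(r"\w+|\S", t) for t in generated_tests]
    scores = []
    for i, a in enumerate(token_seqs):
        dists = [1.0 - SequenceMatcher(None, a, b).ratio()
                 for j, b in enumerate(token_seqs) if j != i]
        scores.append(sum(dists) / len(dists) if dists else 0.0)
    return scores
```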
We remain optimistic that we could get these approaches working with some tweaking.
RL training can be unstable, and we used a few tricks to keep things on track:
- We aggressively clipped gradient norms to 0.05
- For LoRAs, setting a high learning rate (~1e-5 to ~5e-5) tended to work best
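In (illustrative) PyTorch / PEFT terms, those two tricks amount to something like:

```python
import torch
from peft import LoraConfig

# LoRA adapters; rank / alpha / target modules here are illustrative
lora_config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"])

def optimizer_step(model: torch.nn.Module,
                   optimizer: torch.optim.Optimizer,
                   loss: torch.Tensor) -> None:
    """Backward pass with aggressive gradient-norm clipping before the update."""
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.05)
    optimizer.step()
    optimizer.zero_grad()

# e.g. optimizer = torch.optim.AdamW(lora_params, lr=3e-5)  # in the ~1e-5 to 5e-5 range
```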
Training was done on a 4xA100 node. Reward calculation was a major bottleneck, as building and running tests for all model outputs was time-consuming. So, we deployed a dedicated cluster of 15 CPU machines to handle test execution and compute reward metrics in parallel, improving throughput. Our RL pipeline is asynchronous; model sampling, reward calculation, and training steps run concurrently without blocking.
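A simplified view of the reward side of that pipeline (sketch; in production the workers are separate CPU machines behind a queue rather than local threads):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_REWARD_WORKERS = 15  # one per CPU machine in the reward cluster

def compute_reward(generated_test: str) -> float:
    """Placeholder: build the repo, run the test, measure coverage + mutation score."""
    ...

def score_rollouts(executor: ThreadPoolExecutor, generated_tests: list[str]):
    """Dispatch reward computation without blocking sampling or training.

    Returns futures; the trainer consumes them as they complete while the
    next batch of rollouts is already being sampled.
    """
    return [executor.submit(compute_reward, t) for t in generated_tests]

executor = ThreadPoolExecutor(max_workers=NUM_REWARD_WORKERS)
```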
We find substantial performance gains with the CGFT Fine-tune (7B) model compared to the o4-mini baseline, both in the amount of test coverage gained and in the number of successful coverage-increasing tests.
We specifically evaluated the effectiveness of each model through two key metrics:
- Coverage Increase Rate: The fine-tuned model achieves a rate of 0.41, compared to about 0.3 for o4-mini & o3, showing a stronger ability to generate tests that cover additional code paths. This is computed as pass@3, i.e. the likelihood that at least one of three generated tests successfully passes.
- Average Covered Lines Increase: The fine-tuned model covers an average of 7.76 additional lines, surpassing the 4.9 lines achieved by o3, reflecting greater robustness and comprehensiveness. This is computed as max@3, the maximum coverage achieved across three attempts.
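For clarity, per function under test the two metrics reduce to the following (sketch; each of the three attempts records whether it succeeded and how many lines it added):

```python
def pass_at_3(attempt_succeeded: list[bool]) -> bool:
    """Coverage Increase Rate uses pass@3: did any of the three tests succeed?"""
    return any(attempt_succeeded[:3])

def max_at_3(covered_lines_added: list[int]) -> int:
    """Average Covered Lines Increase uses max@3: best coverage gain of the three."""
    return max(covered_lines_added[:3], default=0)
```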
To ensure a fair comparison, we optimized prompts and applied rule-based heuristics to correct any output formatting errors from o4-mini.
Multi-turn Rollouts with Tools
So far, the model generates tests in a single forward pass. We’re building multi-turn setups where the model can call test runners, read error logs, and revise its output. General models tend to be overfit to the tools they were post-trained with (read_file, write_file, etc.) → fine-tuning is crucial for teaching models to use custom tools such as test runners & error logs well.
Better Reward Modeling
We’re investigating more semantically rich bug seeding methods (e.g., real-world bug injection from prior commits, synthesizing bugs with an LLM vs relying on heuristics).
Regression unit testing is just the first step. Our broader goal is to build bug-finding agents trained end-to-end with RL.
We believe end-to-end RL training on internal coding environments is crucial in pushing the frontier of coding agents. Like engineers, LLMs need onboarding. Hand-crafted prompts and rigid workflows don’t scale across projects. What works best for one codebase doesn’t necessarily transfer to another. Reinforcement learning enables models to adapt and align more robustly to each codebase.
We currently partner with a select, small group of companies to train & deploy customized coding agent experiences tailored to their internal code. Join the waitlist here: https://tally.so/r/mYKrvN