- I trained a 14B orchestrator model to better coordinate explorer & coder subagents
- I scaled this to 32x Nvidia H100s and 416x Intel Xeon Platinum 8470 CPU cores.
- Qwen3-14B achieved a 160.71% relative increase on Stanford's TerminalBench after training.
- Full training code, model weights, datasets, and documentation are released below.
This project builds upon the great prime-rl framework developed by Prime Intellect, and heavily depends upon the multi-agent architecture developed in multi-agent-coder. Please note that this code and the resulting model are meant simply as a proof of concept and building blocks for multi-agent coding RL.
For a full breakdown of this project's code structure, see here
- 💻 Distributed Training on 32x H100s
- 📈 Reward
- 🏆 Leaderboard Climb
- 🏋️‍♂️ Training & Rollout Details
- 🤗 Model Weights
- 🚀 Getting Started
- 🪜 Potential Steps Forward
- 🙏 Acknowledgements
- 📝 Citation
- 📄 License
The image below shows the Orca-Agent-RL training code pushing thirty-two Nvidia H100s to their limits.
At any one time, up to 256 distributed Docker containers were also rolling out simultaneously across the 4-node bare-metal cluster.
This training setup can be scaled from a single instance to a multi-node cluster.
The 32x H100 cluster was organised as follows:
- 16 GPUs: Model training (gradient computation and optimisation)
- 8 GPUs: Policy model inference (orchestrator model rollouts)
- 8 GPUs: Subagent model inference (tool-calling rollouts, not trained upon)
To maximise CPU utilisation across the cluster, all 256 concurrent Docker environments were automatically distributed across all 4 nodes:
Architecture:
- Main node orchestrates container placement via DOCKER_ENDPOINTS environment variable
- Worker nodes expose their Docker daemons over TCP (port 2375, firewall-restricted to main node)
This simple yet effective approach enabled 256 concurrent containers to be distributed evenly across all available nodes to balance CPU load, and it can be scaled up or down depending on compute budget. The actual code can be found here (in the other project), and more details on how to link the nodes together can be found here.
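For illustration, here is a minimal sketch of that round-robin placement idea, assuming `DOCKER_ENDPOINTS` is a comma-separated list of `tcp://<host>:2375` daemon addresses (the function and variable names are mine, not the linked implementation's):

```python
import itertools
import os

import docker  # Docker SDK for Python

# e.g. DOCKER_ENDPOINTS="tcp://10.0.0.2:2375,tcp://10.0.0.3:2375,tcp://10.0.0.4:2375"
endpoints = os.environ.get("DOCKER_ENDPOINTS", "unix://var/run/docker.sock").split(",")

# One client per node; rollout containers are handed out round-robin to balance CPU load.
clients = [docker.DockerClient(base_url=url) for url in endpoints]
placement = itertools.cycle(clients)

def launch_rollout_container(image: str, task_id: str):
    """Start an isolated rollout environment on the next node in the cycle."""
    client = next(placement)
    return client.containers.run(
        image,
        name=f"rollout-{task_id}",
        detach=True,
        auto_remove=True,
    )
```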
Below is a visualisation of the reward improvement over a single 20-hour run.
Qwen3-14B (run#3) Reward:
Training Dynamics:
- Entropy (left): Model explores diverse strategies early, then converges to confident policies
- Gradient Norm (right): Smooth decrease indicates stable, healthy optimisation
I evaluated Qwen3-14B on Stanford's TerminalBench before and after training (using Qwen3-Coder-30B-A3B as the explorer & coder subagents). The RL-trained model improved from 7.0% to 18.25%: an 11.25-point absolute increase, and 11.25 / 7.0 ≈ a 160.71% relative increase! Nice!
| Orchestrator | Subagents | TerminalBench score |
| --- | --- | --- |
| Qwen3-Coder-480B | Qwen3-Coder-480B | 19.7% |
| Orca-Agent-v0.1-14B | Qwen3-Coder-30B | 18.25% |
| Qwen3-14B | Qwen3-Coder-30B | 7.0% |
The full results can be found here (Qwen) and here (Orca), and instructions on how to reproduce them are here.
This places Orca-Agent-v0.1 (14B) + Qwen3-Coder-Flash (30B MoE) within striking distance of Qwen3-Coder-480B running the same architecture, which placed #26 on TerminalBench when it was recently published as part of my other project.
- Orchestrator (policy) model: Qwen3-14B
- Subagent (tool call) model: Qwen3-Coder-30B-A3B
- Rollouts: 64 per task
- 🐳 Each rollout has an isolated Docker environment
- Batch size: 256 (mbs=1)
- All environments distributed across 4 nodes
- Temperature: 1.0
- Learning rate: linear schedule between 1e-6 and 5e-6
- Sequence length: 18,000
- Precision: BF16
- Max turns per rollout: 14
- Rollout timeout: 1200s (20 minutes)
*I tried many runs, so the hyperparameters above are representative of the runs collectively. For the hyperparameters and my notes for each run attempt, see here.
To provide meaningful supervision during RL, rewards were simplified to unit tests only. I found that whenever I added extra "smartly crafted" reward signals, policy collapse was never far away.
- Each training datapoint included Python unit tests to verify task completion
- Tests were assigned individual weights to provide granular partial credit
- Test execution ran in the isolated Docker container in which the agent completed its work
- Weighted scoring: passed tests contributed their weight to the final test score
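As a rough sketch, the weighted scoring boils down to something like the following (the names and data structures are illustrative, not the actual reward code):

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    weight: float  # relative importance assigned to this unit test
    passed: bool   # outcome of running the test inside the rollout's container

def weighted_test_reward(results: list[TestResult]) -> float:
    """Reward in [0, 1]: each passing test contributes its weight to the total."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r.weight for r in results if r.passed) / total_weight

# Example: two of three tests pass, so reward = (2.0 + 1.0) / (2.0 + 1.0 + 3.0) = 0.5
reward = weighted_test_reward([
    TestResult("creates_output_file", 2.0, True),
    TestResult("output_is_valid_json", 1.0, True),
    TestResult("values_match_expected", 3.0, False),
])
```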
I utilised my synthetically generated training dataset, published here and created by my multi-agent synthetic data pipeline project (found here), which was also presented in this RL project.
Each training datapoint contains a task description together with the weighted Python unit tests used to verify its completion.
Before a training run, the model to be trained was first evaluated against the tasks in the train dataset. Partially completed tasks were included in the next training run; tasks that were never or always completed were excluded.
- Stage-1: Tasks where Qwen3-14B succeeded 1-2/3 times (41 tasks)
- Stage-2: Tasks where Stage-1 model succeeded 1-4/5 times (Ran out of compute budget to try more runs)
- Stage-3: ... To infinity? 😅
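A minimal sketch of this stage-selection step, assuming per-task success counts over a fixed number of evaluation attempts (the task names and counts below are made up):

```python
def select_training_tasks(success_counts: dict[str, int], attempts: int) -> list[str]:
    """Keep only partially solved tasks: tasks the model never solves or always
    solves yield near-constant rewards across rollouts and so provide little
    training signal."""
    return [task for task, wins in success_counts.items() if 0 < wins < attempts]

# Stage-1 style selection: 3 evaluation attempts per task, keep tasks solved 1-2/3 times.
stage1_tasks = select_training_tasks(
    {"fix-broken-makefile": 2, "parse-server-logs": 0, "add-unit-tests": 3, "patch-cli-flag": 1},
    attempts=3,
)
# -> ["fix-broken-makefile", "patch-cli-flag"]
```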
The trained Orca-Agent-v0.1 orchestrator model is available on HuggingFace:
This 14B parameter model was trained to coordinate explorer and coder subagents within a multi-agent-coding system, achieving a 160.71% relative improvement on Stanford's TerminalBench.
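A minimal sketch of loading the weights with Hugging Face transformers (the repository id below is a placeholder; use the actual HuggingFace repo linked above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "<your-hf-username>/Orca-Agent-v0.1-14B"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="bfloat16",  # the model was trained in BF16
    device_map="auto",
)

# The orchestrator is a chat model, so use the tokenizer's chat template.
messages = [{"role": "user", "content": "Plan the next step for the explorer subagent."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

In practice the model is meant to be hosted and driven by the multi-agent harness (see Getting Started below) rather than prompted directly, but this is the quickest way to sanity-check the weights.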
Clone the repository and install dependencies:
That's it! UV will handle all dependencies automatically.
Terminal Bench Evaluation
Follow the guide shown here, and host the models below (for help on how to host them, see here):
Training
A guide on how to setup a rented multi-node cluster can be found here.
To run evals on the train dataset, see here, complete the section titled "IF RUNNING ON TRAIN DS", and host the models as shown above.
This project is a proof of concept for multi-agent coding RL. Given the limited dataset diversity and relatively small training set (due to my resource limitations), there is a meaningful possibility of overfitting to the training distribution. While the TerminalBench results are encouraging, expanding dataset variety and scale would be essential next steps to better validate generalisation capabilities.
After completing stage-1 training, I began experiments for stage-2 (starting from the stage-1 model weights) and saw that, whilst the model learned well (reward increased), its Terminal Bench performance actually decreased. Below are some thoughts on why that might be, framed as ideas I would try given more compute budget.
- Scale up a lot.
- There is an argument that, given a lot more compute, all that is required is to prune the dataset for the highest-quality tasks (regardless of difficulty), take a big enough model, and scale the rollouts up to a dramatic number. In that case I would:
- Train GLM-4.6 as the base Orchestrator model
- Use GLM-4.6 as the subagent model too
- Heat up A LOT OF GPUs, but potentially receive a powerful artifact in return that could really climb the TerminalBench leaderboard.
- Scale up a little.
- Find a multi-turn RL training framework that has stable MoE support and switch to Qwen3-Coder-30B as the Orchestrator policy model. (Qwen3-Coder evaluated as best ~32B Orchestrator model)
- Switch to a more competent subagent (GLM-4.6 has been evaluated as the top subagent)
- Tweaks
- There is also an argument that no more scale is needed, and there are most certainly ways to improve with the current setup, including but not limited to:
- Blend run #3 and run #11's tasks together for a longer run with otherwise identical hyperparams.
- Keep batch size the same, but reduce number of rollouts to allow more tasks per step.
- Remove the efficiency penalty.
- Increase batch size from 256 -> 320 by adding a new node, leveraging a load balancer, and providing one more node for the currently bottlenecked subagent inference task.
- Speed up deployment by automating orchestration of the multi-node cluster (NFS, Docker, etc.) instead of relying on the long setup guide.
- Find an agentic-RL training framework with a stable MoE implementation (Qwen3-Coder-Flash was the best low-parameter Orchestrator agent in evaluations, but I had to use a dense model)
- Thank you to Taras for providing the compute for this project and supporting open source.
- Thank you to the incredibly smart team at Prime Intellect behind prime-rl and verifiers for making all the hard stuff work... and for putting up with my stream of requests 😅, specifically:
- Cloud providers for the GPUs, including:
- Hyperbolic, which I used for almost all my experiments and all of my training runs, with an excellent experience.
- Datacrunch, which I used for running most of my evaluations
- Hyperstack, which I used for running some experiments & some evaluations
- Alex Dimakis - for briefing me on his upcoming (now released) paper "How to Train Your Advisor" during a call on the day of my multi-agent-coder release. That short yet excellent conversation sparked the realisation for me that training the Orchestrator architecture would be far more effective than my previous single-agent approach in Terminal-Bench-RL. Thanks Alex!
This work was built and evaluated using the following tools and models:
All open-sourced items in this release, including:
- Code in this repository
- Model weights
- Training data
are released under the Apache 2.0 license.