- I trained a 14B orchestrator model to better coordinate explorer & coder subagents
- I scaled this to 32x Nvidia H100s and 416x Intel Xeon Platinum 8470 CPU cores.
- Qwen3-14B achieved a 160.71% relative increase on Stanford's TerminalBench after training.
- Full training code, model weights, datasets, and documentation are released below.
This project builds upon the great prime-rl framework developed by Prime Intellect, and heavily depends upon the multi-agent architecture developed in multi-agent-coder. Please note that this code and the resulting model are meant simply as a proof of concept and building blocks for multi-agent coding RL.
For a full breakdown of this project's code structure, see here
- 💻 Distributed Training on 32x H100s
- 📈 Reward
- 🏆 Leaderboard Climb
- 🏋️‍♂️ Training & Rollout Details
- 🤗 Model Weights
- 🚀 Getting Started
- 🪜 Potential Steps Forward
- 🙏 Acknowledgements
- 📝 Citation
- 📄 License
The image below shows the Orca-Agent-RL training code pushing thirty-two Nvidia H100s to their limits.
At any one time, up to 256 distributed Docker containers were also rolling out simultaneously across the 4-node bare-metal cluster.
This training setup can be scaled from a single instance to a multi-node cluster.
The 32x H100 cluster was organised as follows:
- 16 GPUs: Model training (gradient computation and optimisation)
- 8 GPUs: Policy model inference (orchestrator model rollouts)
- 8 GPUs: Subagent model inference (tool-calling rollouts, not trained upon)
To maximise CPU utilisation across the cluster, all 256 concurrent Docker environments were automatically distributed across all 4 nodes:
Architecture:
- Main node orchestrates container placement via DOCKER_ENDPOINTS environment variable
- Worker nodes expose their Docker daemons over TCP (port 2375, firewall-restricted to main node)
This simple yet effective approach enabled 256 concurrent containers to be distributed evenly across all available nodes to balance CPU load, and it can be scaled up or down depending on compute budget. The actual code can be found here (in the other project), and more details on how to link the nodes together can be found here.
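For illustration, here is a minimal sketch of that round-robin placement idea, assuming `DOCKER_ENDPOINTS` is a comma-separated list of `tcp://<host>:2375` daemon addresses (the function and variable names are mine, not the linked implementation's):

```python
import itertools
import os

import docker  # Docker SDK for Python

# e.g. DOCKER_ENDPOINTS="tcp://10.0.0.2:2375,tcp://10.0.0.3:2375,tcp://10.0.0.4:2375"
endpoints = os.environ.get("DOCKER_ENDPOINTS", "unix://var/run/docker.sock").split(",")

# One client per node; rollout containers are handed out round-robin to balance CPU load.
clients = [docker.DockerClient(base_url=url) for url in endpoints]
placement = itertools.cycle(clients)

def launch_rollout_container(image: str, task_id: str):
    """Start an isolated rollout environment on the next node in the cycle."""
    client = next(placement)
    return client.containers.run(
        image,
        name=f"rollout-{task_id}",
        detach=True,
        auto_remove=True,
    )
```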
Below is a visualisation of the reward improvement over a single 20-hour run.
Qwen3-14B (run#3) Reward:
Training Dynamics:
- Entropy (left): Model explores diverse strategies early, then converges to confident policies
- Gradient Norm (right): Smooth decrease indicates stable, healthy optimisation
I evaluated Qwen3-14B on Stanford's TerminalBench before and after training (using Qwen3-Coder-30B-A3B as the explorer & coder subagents). The RL-trained model improved from 7.0% to 18.25%: an 11.25-point absolute increase, and 11.25 / 7.0 ≈ a 160.71% relative increase! Nice!
| Orchestrator | Subagents | TerminalBench score |
| --- | --- | --- |
| Qwen3-Coder-480B | Qwen3-Coder-480B | 19.7% |
| Orca-Agent-v0.1-14B | Qwen3-Coder-30B | 18.25% |
| Qwen3-14B | Qwen3-Coder-30B | 7.0% |
The full results can be found here (Qwen) and here (Orca), and instructions on how to reproduce them are here.
This places Orca-Agent-v0.1 (14B) + Qwen3-Coder-Flash (30B MoE) within striking distance of Qwen3-Coder-480B running the same architecture, which placed #26 on TerminalBench when it was recently published as part of my other project.
- Orchestrator (policy) model: Qwen3-14B
- Subagent (tool call) model: Qwen3-Coder-30B-A3B
- Rollouts: 64 per task
- 🐳 Each rollout has an isolated Docker environment
- Batch size: 256 (mbs=1)
- All environments distributed across 4 nodes
- Temperature: 1.0
- Learning rate: linear schedule between 1e-6 and 5e-6
- Sequence length: 18,000
- Precision: BF16
- Max turns per rollout: 14
- Rollout timeout: 1200s (20 minutes)
*I tried many runs, so the hyperparameters above are representative of the runs collectively. For the hyperparameters and my notes for each run attempt, see here.
To provide meaningful supervision during RL, rewards were simplified to unit tests only. I found that whenever I added extra "smartly crafted" reward signals, policy collapse was never far away.
- Each training datapoint included Python unit tests to verify task completion
- Tests were assigned individual weights to provide granular partial credit
- Test execution ran in the isolated Docker container in which the agent completed its work
- Weighted scoring: passed tests contributed their weight to the final test score
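As a rough sketch, the weighted scoring boils down to something like the following (the names and data structures are illustrative, not the actual reward code):

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    weight: float  # relative importance assigned to this unit test
    passed: bool   # outcome of running the test inside the rollout's container

def weighted_test_reward(results: list[TestResult]) -> float:
    """Reward in [0, 1]: each passing test contributes its weight to the total."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r.weight for r in results if r.passed) / total_weight

# Example: two of three tests pass, so reward = (2.0 + 1.0) / (2.0 + 1.0 + 3.0) = 0.5
reward = weighted_test_reward([
    TestResult("creates_output_file", 2.0, True),
    TestResult("output_is_valid_json", 1.0, True),
    TestResult("values_match_expected", 3.0, False),
])
```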
I utilised my synthetically generated training dataset, published here and created by my multi-agent synthetic data pipeline project (found here), which was also presented in this RL project.
Each training datapoint contains a task description together with the weighted Python unit tests used to verify its completion.
Before a training run, the model to be trained was first evaluated against the tasks in the train dataset. Partially completed tasks were included in the next training run; tasks that were never or always completed were excluded.
- Stage-1: Tasks where Qwen3-14B succeeded 1-2/3 times (41 tasks)
- Stage-2: Tasks where Stage-1 model succeeded 1-4/5 times (Ran out of compute budget to try more runs)
- Stage-3: ... To infinity? 😅
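A minimal sketch of this stage-selection step, assuming per-task success counts over a fixed number of evaluation attempts (the task names and counts below are made up):

```python
def select_training_tasks(success_counts: dict[str, int], attempts: int) -> list[str]:
    """Keep only partially solved tasks: tasks the model never solves or always
    solves yield near-constant rewards across rollouts and so provide little
    training signal."""
    return [task for task, wins in success_counts.items() if 0 < wins < attempts]

# Stage-1 style selection: 3 evaluation attempts per task, keep tasks solved 1-2/3 times.
stage1_tasks = select_training_tasks(
    {"fix-broken-makefile": 2, "parse-server-logs": 0, "add-unit-tests": 3, "patch-cli-flag": 1},
    attempts=3,
)
# -> ["fix-broken-makefile", "patch-cli-flag"]
```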
The trained Orca-Agent-v0.1 orchestrator model is available on HuggingFace:
This 14B parameter model was trained to coordinate explorer and coder subagents within a multi-agent-coding system, achieving a 160.71% relative improvement on Stanford's TerminalBench.
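A minimal sketch of loading the weights with Hugging Face transformers (the repository id below is a placeholder; use the actual HuggingFace repo linked above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "<your-hf-username>/Orca-Agent-v0.1-14B"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="bfloat16",  # the model was trained in BF16
    device_map="auto",
)

# The orchestrator is a chat model, so use the tokenizer's chat template.
messages = [{"role": "user", "content": "Plan the next step for the explorer subagent."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

In practice the model is meant to be hosted and driven by the multi-agent harness (see Getting Started below) rather than prompted directly, but this is the quickest way to sanity-check the weights.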
Clone the repository and install dependencies:
That's it! UV will handle all dependencies automatically.
Terminal Bench Evaluation
Follow the guide shown here, and host the models below (for help on how to host them, see here):
Training
A guide on how to setup a rented multi-node cluster can be found here.
To run evals on the train dataset, see here, complete the section titled "IF RUNNING ON TRAIN DS", and host the models as shown above.
This project is a proof of concept for multi-agent coding RL. Given the limited dataset diversity and relatively small training set (due to my resource limitations), there is a meaningful possibility of overfitting to the training distribution. While the TerminalBench results are encouraging, expanding dataset variety and scale would be essential next steps to better validate generalisation capabilities.
After completing stage-1 training, I began experiments for stage-2 (starting from the stage-1 model weights) and saw that, whilst the model learned well (reward increased), its Terminal Bench performance actually decreased. Below are some thoughts on why that might be, framed as ideas I would try given more compute budget.
- Scale up a lot.
- There is an argument that, given a lot more compute, all that is required is to prune the dataset for the highest-quality tasks (regardless of difficulty), take a big enough model, and scale the rollouts up to a dramatic number. In that case I would:
- Train GLM-4.6 as the base Orchestrator model
- Use GLM-4.6 as the subagent model too
- Heat up A LOT OF GPUs, but potentially receive a powerful artifact in return that could really climb the TerminalBench leaderboard.
- Scale up a little.
- Find a multi-turn RL training framework that has stable MoE support and switch to Qwen3-Coder-30B as the Orchestrator policy model. (Qwen3-Coder evaluated as best ~32B Orchestrator model)
- Switch to a more competent subagent (GLM-4.6 has been evaluated as the top subagent)
- Tweaks
- There is also an argument that no more scale is needed, and there are most certainly ways to improve with the current setup, including but not limited to:
- Blend run #3 and run #11's tasks together for a longer run with otherwise identical hyperparams.
- Keep batch size the same, but reduce number of rollouts to allow more tasks per step.
- Remove the efficiency penalty.
- Increase batch size from 256 -> 320 by adding a new node, leveraging a load balancer, and providing one more node for the currently bottlenecked subagent inference task.
- Speed up deployment by automating orchestration of the multi-node cluster (NFS, Docker, etc.) instead of relying on the long setup guide.
- Find an agentic-RL training framework with a stable MoE implementation (Qwen3-Coder-Flash was the best low-parameter Orchestrator agent in evaluations, but I had to use a dense model)
- Thank you to Taras for providing the compute for this project and supporting open source.
- Thank you to the incredibly smart team at Prime Intellect behind prime-rl and verifiers for making all the hard stuff work... and for putting up with my stream of requests 😅, specifically:
- Cloud providers for the GPUs, including:
- Hyperbolic, which I used for almost all my experiments and all of my training runs, with an excellent experience.
- Datacrunch, which I used for running most of my evaluations
- Hyperstack, which I used for running some experiments & some evaluations
- Alex Dimakis - for briefing me on his upcoming (now released) paper "How to Train Your Advisor" during a call on the day of my multi-agent-coder release. That short yet excellent conversation sparked the realisation for me that training the Orchestrator architecture would be far more effective than my previous single-agent approach in Terminal-Bench-RL. Thanks Alex!
This work was built and evaluated using the following tools and models:
All open-sourced items in this release, including:
- Code in this repository
- Model weights
- Training data
are released under the Apache 2.0 license.