RoboMonkey
Scaling Test-Time Sampling and Verification
for Vision-Language-Action Models
Abstract
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as a means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on these insights, we introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision-Language Model (VLM) based verifier to select the optimal action. We propose a synthetic data generation pipeline for training such VLM-based verifiers, and demonstrate that scaling the synthetic dataset consistently improves verification and downstream accuracy. Through extensive simulated and hardware experiments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 9% on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.
Inference-time Scaling Law
We observe that action error consistently decreases as we scale the number of generated actions across multiple sampling approaches, assuming access to an oracle verifier. Repeatedly sampling actions from robot policies, applying Gaussian perturbation to a few sampled actions, and even randomly sampling action tokens all outperform single-attempt OpenVLA.
We also find that the relationship between action error and the number of samples generated through Gaussian perturbation follows an approximate power law across a range of VLA models, including CogACT, Octo, OpenVLA, and SpatialVLA.
For power-law fitting, we model the logarithm of the action error e as a linear function of the logarithm of the number of samples k: log(e) ≈ log(a) + b · log(k), or equivalently e ≈ a · k^b.
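Below is a minimal sketch of fitting such a power law by linear regression in log-log space; the (k, e) measurements are illustrative placeholders, not values reported in the paper.

import numpy as np

# Illustrative data: number of sampled actions k and the action error e of the
# best sample under an oracle verifier (placeholder values, not reported results).
k = np.array([1, 2, 4, 8, 16, 32, 64])
e = np.array([0.42, 0.33, 0.27, 0.22, 0.19, 0.16, 0.14])

# log(e) ≈ log(a) + b * log(k): a straight line in log-log coordinates.
b, log_a = np.polyfit(np.log(k), np.log(e), deg=1)
a = np.exp(log_a)
print(f"fitted scaling law: e ≈ {a:.3f} * k^({b:.3f})")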
Approach
Stage 1: Training the Action Verifier: Given an imitation learning dataset, we sample N candidate actions per state from a generalist robot policy, and apply clustering to reduce them to K representative actions. We construct synthetic action comparisons and assign preferences based on the RMSE between each sampled action and the ground-truth action. This synthetic preference dataset is then used to fine-tune a VLM-based action verifier.
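A minimal sketch of this stage is shown below; the policy.sample_actions(image, instruction) interface and the 7-DoF action layout [dx, dy, dz, droll, dpitch, dyaw, gripper] are assumptions for illustration, not the released training code.

import itertools
import numpy as np
from sklearn.cluster import KMeans

def build_comparisons(policy, image, instruction, gt_action, N=64, K=8):
    # Sample N candidate actions from the generalist policy for this state.
    actions = np.stack([policy.sample_actions(image, instruction) for _ in range(N)])  # (N, 7)

    # Cluster the N samples down to K representative actions.
    reps = KMeans(n_clusters=K, n_init=10).fit(actions).cluster_centers_  # (K, 7)

    # Score each representative by its RMSE to the ground-truth action.
    rmse = np.sqrt(((reps - gt_action) ** 2).mean(axis=1))  # (K,)

    # Build synthetic comparisons: the action with lower RMSE is preferred.
    comparisons = []
    for i, j in itertools.combinations(range(K), 2):
        win, lose = (i, j) if rmse[i] < rmse[j] else (j, i)
        comparisons.append({
            "image": image,
            "instruction": instruction,
            "chosen": reps[win],
            "rejected": reps[lose],
        })
    return comparisons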
Stage 2: Scaling Test-Time Compute: At deployment, we sample N̂ initial actions from the generalist robot policy given the task instruction and observation. We fit a Gaussian distribution to the translation and rotation components of these actions, and use majority voting to determine the gripper state. This yields an action proposal distribution from which we draw K̂ candidate actions with negligible overhead. Finally, we use the fine-tuned VLM-based verifier to evaluate these candidates and select the optimal action.
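The sketch below illustrates this deployment loop under the same hypothetical interfaces as above (policy.sample_actions and verifier.score are assumed names, and the binary gripper convention is an assumption).

import numpy as np

def select_action(policy, verifier, image, instruction, N_hat=4, K_hat=16):
    # Draw a small number of initial actions from the VLA.
    # Assumed action layout: [dx, dy, dz, droll, dpitch, dyaw, gripper].
    init = np.stack([policy.sample_actions(image, instruction) for _ in range(N_hat)])

    # Fit a Gaussian to the translation and rotation components.
    mean = init[:, :6].mean(axis=0)
    std = init[:, :6].std(axis=0) + 1e-6

    # Majority vote on the gripper state (assumed binary open/close).
    gripper = 1.0 if (init[:, 6] > 0.5).sum() * 2 >= N_hat else 0.0

    # Cheaply draw K_hat candidates from the proposal distribution.
    candidates = np.random.normal(mean, std, size=(K_hat, 6))
    candidates = np.concatenate([candidates, np.full((K_hat, 1), gripper)], axis=1)

    # The verifier scores each candidate; execute the highest-scoring action.
    scores = [verifier.score(image, instruction, a) for a in candidates]
    return candidates[int(np.argmax(scores))]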
Experiments
Example tasks across Bridge V2, SIMPLER, and LIBERO.
① Bridge V2
Scaling test-time compute leads to substantial improvements on OOD generalization tasks, achieving a 25% absolute improvement.
② SIMPLER
RoboMonkey improves the precision of generalist robot policies in the SIMPLER environment, leading to a 9% higher average success rate on in-distribution tasks.
③ LIBERO-LONG
Fine-tuning both OpenVLA and the RoboMonkey action verifier yields a 7% improvement in average success rate on LIBERO-Long compared to fine-tuning OpenVLA alone.
Real-World Case Studies
Imprecise Grasping
OpenVLA ❌
V-GPS ❌
RoboMonkey ✅
Task Progression Failure
OpenVLA ❌
V-GPS ❌
RoboMonkey ✅
Collision
OpenVLA ❌
V-GPS ❌
RoboMonkey ✅
Stall in Place
OpenVLA ❌
V-GPS ❌
RoboMonkey ✅
How does RoboMonkey enable practical deployment for test-time scaling?
① VLA Serving Engine
Repeated sampling can exploit KV cache optimizations and batch processing to achieve higher throughput than greedy decoding. We therefore extended SGLang to properly support OpenVLA. Our optimized implementation substantially outperforms the naive OpenVLA inference pipeline, achieving lower latency and significantly higher throughput across batch sizes.
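As a conceptual illustration only (the engine interface below is hypothetical, not SGLang's actual API), repeated sampling amortizes the expensive image-and-prompt prefill across all candidates and decodes the short action-token sequences as one batch:

def sample_actions_batched(engine, image, instruction, n=16, temperature=1.0):
    # All n candidates share the same image + instruction prefix, so the
    # prefill KV cache is computed once and reused for every candidate.
    prompt = engine.build_prompt(image, instruction)
    # One batched request: the engine decodes n short action-token
    # sequences in parallel instead of running n sequential forward passes.
    outputs = engine.generate(prompt, n=n, temperature=temperature)
    return [engine.decode_action_tokens(o) for o in outputs]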
② Gaussian Perturbation
Applying Gaussian perturbation to a small set of sampled actions is more efficient than naively drawing many actions from the robot policy when constructing the action proposal distribution. RoboMonkey can sample and verify 16 candidate actions in 650 ms (roughly 1.5 Hz).
How does scaling the synthetic training dataset impact downstream success rate?
Average success rates across four tasks on SIMPLER as a function of synthetic dataset size. Scaling the dataset size (number of synthetic action comparisons) consistently improves the performance of the RoboMonkey verifier, leading to higher closed-loop success rates.