InstructVLA: Vision-Language-Action Instruction Tuning


Shuai Yang2,3†*, Hao Li1,3*, Yilun Chen2, Bin Wang2,3, Yang Tian3, Tai Wang3,
Hanqing Wang3, Feng Zhao1, Yiyi Liao2, Jiangmiao Pang3

1University of Science and Technology of China, 2Zhejiang University,
3Shanghai Artificial Intelligence Laboratory
Under review

*Indicates equal contribution. †The work was completed during an internship at 3Shanghai Artificial Intelligence Laboratory.

Introduction of InstructVLA

Abstract

To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves a 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where it outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
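The abstract mentions mixture-of-experts adaptation for jointly optimizing textual reasoning and action generation. As a rough illustration only (the paper's actual architecture may differ), the PyTorch sketch below shows one common way to realize such adaptation: a frozen linear layer from a pretrained VLM augmented with low-rank adapter "experts" whose outputs are mixed by a learned router. All names here (`MoELoRALinear`, `num_experts`, `rank`) are ours for illustration, not the authors'.

```python
# Minimal sketch (not the authors' code) of mixture-of-experts low-rank adaptation:
# a frozen pretrained projection plus several LoRA-style experts mixed by a router.
import torch
import torch.nn as nn


class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 2, rank: int = 16):
        super().__init__()
        self.base = base                          # frozen pretrained weights
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.ModuleList(
            nn.Linear(base.in_features, rank, bias=False) for _ in range(num_experts)
        )
        self.up = nn.ModuleList(
            nn.Linear(rank, base.out_features, bias=False) for _ in range(num_experts)
        )
        self.router = nn.Linear(base.in_features, num_experts)  # per-token gating

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base(x)
        weights = torch.softmax(self.router(x), dim=-1)          # (..., num_experts)
        for i in range(len(self.down)):
            # Each expert contributes a low-rank update, scaled by its routing weight.
            y = y + weights[..., i : i + 1] * self.up[i](self.down[i](x))
        return y


# Usage: wrap a projection inside a frozen VLM so that, e.g., a "language" expert
# and an "action" expert share the backbone but specialize through the router.
layer = MoELoRALinear(nn.Linear(1024, 1024), num_experts=2, rank=16)
out = layer(torch.randn(2, 32, 1024))
print(out.shape)  # torch.Size([2, 32, 1024])
```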

Method of InstructVLA

Contributions

  • We propose InstructVLA, a VLA architecture and training pipeline that emphasizes the importance of language capability in VLAs by efficiently preserving pretrained vision-language knowledge from VLMs while integrating manipulation as a component of instruction following.
  • We design a practical data and evaluation pipeline for vision-language-action instruction following, supported by 650K tailored VLA-IT annotations and a manually curated benchmark suite, enabling evaluation of VLAs' instruction generalization capabilities.
  • InstructVLA achieves leading performance across robotic manipulation tasks, multimodal benchmarks, and real-world deployments, enabling intuitive and controllable manipulation.

Dataset

We curate the Vision-Language-Action Instruction Tuning (VLA-IT) dataset, consisting of 650K human-robot interactions annotated with diverse instructions, scene captions, and question-answer pairs grounded in high-quality manipulation tasks.
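To make the annotation types concrete, the snippet below shows a hypothetical record layout combining an instruction, a scene caption, and question-answer pairs grounded in a manipulation episode. It is not the released schema; the field names and values are invented for illustration.

```python
# Hypothetical VLA-IT record (illustrative only, not the released data format).
record = {
    "episode_id": "bridge_000123",               # source manipulation episode
    "instruction": "Put the spoon next to the mug so I can stir my coffee.",
    "scene_caption": "A wooden table with a mug, a spoon, and a folded towel.",
    "qa_pairs": [
        {"question": "Which object should be moved?", "answer": "The spoon."},
        {"question": "Where should it end up?", "answer": "Next to the mug."},
    ],
    "actions": "path/to/robot_trajectory.npz",   # low-level action sequence
}
```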

Simulation

Benchmark

We introduce the SimplerEnv-Instruct benchmark, a manually designed evaluation suite featuring 80 zero-shot manipulation tasks. It encompasses both closed-loop manipulation tasks and high-level instruction reasoning, involving either situated understanding or decomposition into actionable subtasks.
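For concreteness, the sketch below gives hypothetical task specifications for the two reasoning levels reported in the experiments (Instruction Aggregation and Situated Reasoning). The instructions and field names are invented for illustration and are not drawn from the benchmark release.

```python
# Hypothetical SimplerEnv-Instruct task specs (illustrative only).
tasks = [
    {
        "level": "instruction_aggregation",      # indirect or paraphrased commands
        "instruction": "I spilled some juice; hand me something to wipe it up.",
        "target_behavior": "pick up the sponge",
    },
    {
        "level": "situated_reasoning",           # requires reasoning about the scene
        "instruction": "Put away the fruit that is not yet in the basket.",
        "target_behavior": "place the apple on the table into the basket",
    },
]
```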

Real World

Experiments

Experiments include: (1) Real-world experiments: few-shot experiments on the Franka Research 3 robot and zero-shot experiments on the WidowX-250 arm. (2) Multimodal understanding performance. (3) Robotic manipulation performance: Google Robot and WidowX Robot denote the two embodiments in SimplerEnv. For SimplerEnv-Instruct, we focus on two reasoning levels, Instruction Aggregation and Situated Reasoning.

Real-world results and simulation results tables

BibTeX

@misc{yang2025instructvlavisionlanguageactioninstructiontuning,
  title={InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation},
  author={Shuai Yang and Hao Li and Yilun Chen and Bin Wang and Yang Tian and Tai Wang and Hanqing Wang and Feng Zhao and Yiyi Liao and Jiangmiao Pang},
  year={2025},
  eprint={2507.17520},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2507.17520},
}