MultiNet: A Generalist Benchmark for Multimodal Action Models


Multinet is an open science initiative with contributions from leading research teams at institutions including:

Manifold Research · MIT · Metarch.ai · Georgia Tech


Interested in collaborating? Reach out to us via email or join the working group on Discord.

Systems, Algorithms, and Research for evaluating Next Generation Action Models

Multinet is a comprehensive benchmarking initiative for evaluating generalist models across vision, language, and action. It provides:

  • Consolidation of diverse datasets and standardized protocols for assessing multimodal understanding (a minimal scoring sketch follows this list)
  • Extensive training data (800M+ image-text pairs, Trillion+ language tokens, 35TB+ control data)
  • Varied evaluation tasks including captioning, VQA, robotics, game-playing, commonsense reasoning, and simulated locomotion/manipulation
  • Open-source toolkit to standardize the process of obtaining and utilizing robotics/RL data
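
As a rough illustration of what a standardized action-evaluation protocol can look like, the sketch below scores a model's predicted continuous action against a ground-truth action with mean squared error, and a discrete action with exact-match accuracy. The function names and metric choices are illustrative assumptions, not Multinet's actual implementation.

```python
# Minimal sketch of a standardized action-scoring protocol (illustrative only;
# names and metrics are assumptions, not Multinet's actual implementation).
from typing import Sequence
import numpy as np

def score_continuous_action(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a predicted and ground-truth action vector."""
    pred_arr = np.asarray(pred, dtype=float)
    target_arr = np.asarray(target, dtype=float)
    return float(np.mean((pred_arr - target_arr) ** 2))

def score_discrete_action(pred: int, target: int) -> float:
    """Exact-match accuracy: 1.0 if the predicted action id matches, else 0.0."""
    return 1.0 if pred == target else 0.0

# Example: a 7-DoF robot-arm action (continuous) and a game button press (discrete).
print(score_continuous_action([0.1, 0.0, -0.2, 0.0, 0.0, 0.0, 1.0],
                              [0.0, 0.0, -0.2, 0.0, 0.0, 0.0, 1.0]))  # ~0.0014
print(score_discrete_action(3, 3))  # 1.0
```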

Explore our Research

Dataset Specification

Potentially the largest open-source generalist dataset, consolidating diverse modalities and tasks suitable for pre-training, fine-tuning, and evaluation


Dataset Specification

MultiNet addresses the gap in holistic evaluation by assessing both Vision-Language Model (VLM) action capabilities and Vision-Language-Action Model (VLA) multimodal understanding across vision-language, language, and control tasks:

  • As a part of this effort we consolidate a diverse set of high-quality datasets (e.g., OBELICS, COYO-700M, OpenX-Embodiment, Mujoco, Procgen, Atari, WinoGAViL, VQA-v2, etc.).
  • Carefully curated training data and corresponding evaluation data, which includes:
    • Over 800M image-text pairs for vision-language association.
    • 1.3T tokens for language understanding.
    • Over 35 TB of data for robotics and RL control tasks (see the data-format sketch below).
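
To give a concrete sense of what consolidating heterogeneous control data involves, the sketch below defines a generic timestep/episode schema that trajectories from sources such as OpenX-Embodiment, Procgen, or Atari could be mapped into. The schema and field names are illustrative assumptions, not the toolkit's actual data format.

```python
# Illustrative schema for a unified control-data record (an assumption for
# exposition, not the actual Multinet toolkit format).
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class Timestep:
    observation: Dict[str, Any]        # e.g. {"image": ndarray, "proprio": ndarray}
    action: Any                        # continuous vector or discrete action id
    reward: float = 0.0
    is_terminal: bool = False
    language_instruction: Optional[str] = None  # for language-conditioned tasks

@dataclass
class Episode:
    source_dataset: str                # e.g. "openx_embodiment/bridge", "procgen/coinrun"
    timesteps: List[Timestep] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.timesteps)

# Example: wrapping a single Atari-style transition into the unified schema.
ep = Episode(source_dataset="atari/pong")
ep.timesteps.append(Timestep(observation={"image": None}, action=3, reward=1.0))
print(len(ep))  # 1
```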

News

🏆

Multinet v0.2 released! We systematically profile how state-of-the-art VLAs and VLMs perform in procedurally generated, out-of-distribution (OOD) game environments. Read more about our recent release here.

🏅

Paper accepted at ICML 2025! Our paper detailing Multinet's open-source contributions to the AI community has been accepted at the CodeML Workshop at ICML 2025! Read our paper here.

🎉

Multinet v0.1 released! How well do state-of-the-art VLMs and VLAs perform on real-world robotics tasks? Read more on our release page.

🚀

Introducing Multinet! A new generalist benchmark to evaluate Vision-Language & Action models. Learn more here.

Research Talks & Demos

Explore our collection of presentations showcasing Multinet's vision, progress, and development journey!

Citations

Multinet v0.2 - Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments

@misc{guruprasad2025benchmarkingvisionlanguage,
  title={Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments},
  author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Harshvardhan Sikka},
  year={2025},
  eprint={2505.05540},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.05540},
}

Open-Source Evaluation Harness and Toolkit (accepted at ICML 2025)

@misc{guruprasad2025opensourcesoftwaretoolkit,
  title={An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models},
  author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Jaewoo Song and Harshvardhan Sikka},
  year={2025},
  eprint={2506.09172},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.09172},
}

Multinet v0.1 - Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

@misc{guruprasad2024benchmarkingvisionlanguage,
  title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
  author={Pranav Guruprasad and Harshvardhan Sikka and Jaewoo Song and Yangyue Wang and Paul Pu Liang},
  year={2024},
  eprint={2411.05821},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2411.05821},
}

Multinet Vision and Dataset Specification

@misc{guruprasad2024benchmarking,
  author={Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
  title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
  DOI={10.20944/preprints202411.0494.v1},
  year={2024},
}