MultiNet: A Generalist Benchmark for Multimodal Action Models


Multinet is an open science initiative with contributions from leading research teams at institutions including:

Manifold Research · MIT · Metarch.ai · Georgia Tech


Interested in collaborating? Reach out to us via email or join the working group on Discord.

Systems, Algorithms, and Research for evaluating Next Generation Action Models

Multinet is a comprehensive benchmarking initiative for evaluating generalist models across vision, language, and action. It provides:

  • Consolidation of diverse datasets and standardized protocols for assessing multimodal understanding (a minimal scoring sketch follows this list)
  • Extensive training data (800M+ image-text pairs, Trillion+ language tokens, 35TB+ control data)
  • Varied evaluation tasks including captioning, VQA, robotics, game-playing, commonsense reasoning, and simulated locomotion/manipulation
  • Open-source toolkit to standardize the process of obtaining and utilizing robotics/RL data
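
As a rough illustration of what a standardized action-evaluation protocol can look like, the sketch below scores a model's predicted continuous action against a ground-truth action with mean squared error, and a discrete action with exact-match accuracy. The function names and metric choices are illustrative assumptions, not Multinet's actual implementation.

```python
# Minimal sketch of a standardized action-scoring protocol (illustrative only;
# names and metrics are assumptions, not Multinet's actual implementation).
from typing import Sequence
import numpy as np

def score_continuous_action(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a predicted and ground-truth action vector."""
    pred_arr = np.asarray(pred, dtype=float)
    target_arr = np.asarray(target, dtype=float)
    return float(np.mean((pred_arr - target_arr) ** 2))

def score_discrete_action(pred: int, target: int) -> float:
    """Exact-match accuracy: 1.0 if the predicted action id matches, else 0.0."""
    return 1.0 if pred == target else 0.0

# Example: a 7-DoF robot-arm action (continuous) and a game button press (discrete).
print(score_continuous_action([0.1, 0.0, -0.2, 0.0, 0.0, 0.0, 1.0],
                              [0.0, 0.0, -0.2, 0.0, 0.0, 0.0, 1.0]))  # ~0.0014
print(score_discrete_action(3, 3))  # 1.0
```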

Explore our Research

Dataset Specification

Potentially the largest open-source generalist dataset, consolidating diverse modalities and tasks suitable for pre-training, fine-tuning, and evaluation


Dataset Specification

MultiNet addresses the gap in holistic evaluation by assessing both Vision-Language Model (VLM) action capabilities and Vision-Language-Action Model (VLA) multimodal understanding across vision-language, language, and control tasks:

  • As a part of this effort we consolidate a diverse set of high-quality datasets (e.g., OBELICS, COYO-700M, OpenX-Embodiment, Mujoco, Procgen, Atari, WinoGAViL, VQA-v2, etc.).
  • Carefully curated training data and corresponding evaluation data, which includes:
    • Over 800M image-text pairs for vision-language association.
    • 1.3T tokens for language understanding.
    • Over 35 TB of data for robotics and RL control tasks (see the data-format sketch below).
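
To give a concrete sense of what consolidating heterogeneous control data involves, the sketch below defines a generic timestep/episode schema that trajectories from sources such as OpenX-Embodiment, Procgen, or Atari could be mapped into. The schema and field names are illustrative assumptions, not the toolkit's actual data format.

```python
# Illustrative schema for a unified control-data record (an assumption for
# exposition, not the actual Multinet toolkit format).
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class Timestep:
    observation: Dict[str, Any]        # e.g. {"image": ndarray, "proprio": ndarray}
    action: Any                        # continuous vector or discrete action id
    reward: float = 0.0
    is_terminal: bool = False
    language_instruction: Optional[str] = None  # for language-conditioned tasks

@dataclass
class Episode:
    source_dataset: str                # e.g. "openx_embodiment/bridge", "procgen/coinrun"
    timesteps: List[Timestep] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.timesteps)

# Example: wrapping a single Atari-style transition into the unified schema.
ep = Episode(source_dataset="atari/pong")
ep.timesteps.append(Timestep(observation={"image": None}, action=3, reward=1.0))
print(len(ep))  # 1
```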

News

🏆

Multinet v0.2 released! We systematically profile how state-of-the-art VLAs and VLMs perform in procedurally generated, out-of-distribution (OOD) game environments. Read more about our recent release here.

🏅

Paper accepted at ICML 2025! Our paper detailing Multinet's open-source contributions to the AI community has been accepted at the CodeML Workshop at ICML 2025! Read our paper here.

🎉

Multinet v0.1 released! How well do state-of-the-art VLMs and VLAs perform on real-world robotics tasks? Read more on our release page.

🚀

Introducing Multinet! A new generalist benchmark to evaluate Vision-Language & Action models. Learn more here.

Research Talks & Demos

Explore our collection of presentations showcasing Multinet's vision, progress, and development journey!

Citations

Multinet v0.2 - Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments

@misc{guruprasad2025benchmarkingvisionlanguage,
  title={Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments},
  author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Harshvardhan Sikka},
  year={2025},
  eprint={2505.05540},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.05540},
}

Open-Source Evaluation Harness and Toolkit (accepted at ICML 2025)

@misc{guruprasad2025opensourcesoftwaretoolkit,
  title={An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models},
  author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Jaewoo Song and Harshvardhan Sikka},
  year={2025},
  eprint={2506.09172},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.09172},
}

Multinet v0.1 - Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

@misc{guruprasad2024benchmarkingvisionlanguage,
  title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
  author={Pranav Guruprasad and Harshvardhan Sikka and Jaewoo Song and Yangyue Wang and Paul Pu Liang},
  year={2024},
  eprint={2411.05821},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2411.05821},
}

Multinet Vision and Dataset Specification

@misc{guruprasad2024benchmarking,
  author={Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
  title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
  DOI={10.20944/preprints202411.0494.v1},
  year={2024},
}