Multinet is an open science initiative with contributions from leading research teams across multiple institutions.
Interested in Collaborating? Reach out to us or join the working group on Discord.
Systems, Algorithms, and Research for Evaluating Next-Generation Action Models
Multinet is a comprehensive benchmarking initiative for evaluating generalist models across vision, language, and action. It provides:
- Consolidation of diverse datasets and standardized protocols for assessing multimodal understanding
- Extensive training data (800M+ image-text pairs, 1T+ language tokens, 35TB+ control data)
- Varied evaluation tasks including captioning, VQA, robotics, game-playing, commonsense reasoning, and simulated locomotion/manipulation
- An open-source toolkit that standardizes how robotics/RL data is obtained and used (see the sketch after this list)
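The core idea behind the toolkit is translating heterogeneous robotics/RL datasets into one consistent episode format. Below is a minimal, self-contained sketch of what such a unified schema and converter could look like; every name in it (UnifiedStep, normalize_episode, and the source field names) is a hypothetical illustration for this page, not the toolkit's actual API.

```python
# Hypothetical sketch only: UnifiedStep, normalize_episode, and the raw
# field names below are illustrative assumptions, not Multinet's real API.
from dataclasses import dataclass
from typing import Any

@dataclass
class UnifiedStep:
    """One timestep in a source-agnostic control trajectory."""
    observation: dict[str, Any]   # e.g. {"image": ..., "proprio": ...}
    action: list[float]           # flattened continuous action vector
    reward: float = 0.0
    is_terminal: bool = False

def normalize_episode(raw_steps: list[dict]) -> list[UnifiedStep]:
    """Map a source-specific episode (here: made-up field names) into
    the unified schema, marking the final step as terminal."""
    steps = []
    for i, s in enumerate(raw_steps):
        steps.append(
            UnifiedStep(
                observation={"image": s.get("image"), "proprio": s.get("state")},
                action=list(s.get("action", [])),
                reward=float(s.get("reward", 0.0)),
                is_terminal=(i == len(raw_steps) - 1),
            )
        )
    return steps

# Toy usage: two fake steps in a source-specific layout.
raw = [
    {"image": "frame0.png", "state": [0.1, 0.2], "action": [0.0, 1.0], "reward": 0.0},
    {"image": "frame1.png", "state": [0.2, 0.3], "action": [1.0, 0.0], "reward": 1.0},
]
episode = normalize_episode(raw)
assert episode[-1].is_terminal
```

Once every source dataset is mapped into a shared step format like this, downstream evaluation and training code can treat 35TB+ of control data from different robots and simulators uniformly.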
Explore our Research
News
🏆 Multinet v0.2 released! We systematically profile how state-of-the-art VLAs and VLMs perform in procedurally generated OOD game environments. Read more about our recent release here.
🏅 Paper accepted at ICML 2025! Our paper detailing Multinet's open-source contributions to the AI community has been accepted at the CodeML Workshop at ICML 2025! Read our paper here.
🎉 Multinet v0.1 released! How well do state-of-the-art VLMs and VLAs perform on real-world robotics tasks? Read more on our release page.
🚀 Introducing Multinet! A new generalist benchmark to evaluate Vision-Language & Action models. Learn more here.
Research Talks & Demos
Explore our collection of presentations showcasing Multinet's vision, progress, and development journey!
Citations
Multinet v0.2 - Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments

@misc{guruprasad2025benchmarkingvisionlanguage,
  title={Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments},
  author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Harshvardhan Sikka},
  year={2025},
  eprint={2505.05540},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.05540},
}

Open-Source Evaluation Harness and Toolkit (accepted at ICML 2025)

@misc{guruprasad2025opensourcesoftwaretoolkit,
  title={An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models},
  author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Jaewoo Song and Harshvardhan Sikka},
  year={2025},
  eprint={2506.09172},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.09172},
}

Multinet v0.1 - Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks

@misc{guruprasad2024benchmarkingvisionlanguage,
  title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
  author={Pranav Guruprasad and Harshvardhan Sikka and Jaewoo Song and Yangyue Wang and Paul Pu Liang},
  year={2024},
  eprint={2411.05821},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2411.05821},
}

Multinet Vision and Dataset specification

@misc{guruprasad2024benchmarking,
  author={Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
  title={Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks},
  DOI={10.20944/preprints202411.0494.v1},
  year={2024},
}

