1 Manifold Research, 2 Metarch AI, 3 Georgia Tech, 4 Tufts University, 5 Northeastern University, 6 Birla Institute of Technology and Science, Pilani, 7 Institute for Research and Innovation in Intelligent Systems (IRIIS)
                          
*Indicates Equal Contribution
                        
          
 MultiNet v1.0 provides a comprehensive benchmark suite for evaluating state-of-the-art multimodal reasoning and action models across diverse domains including robotics, gameplay, and multimodal understanding tasks. 
Abstract
Multimodal reasoning and action models hold immense promise as general-purpose agents, yet the current evaluation landscape remains fragmented with domain-specific benchmarks that fail to capture true generalization capabilities. This critical gap prevents us from understanding where these sophisticated systems excel and where they fail. We introduce MultiNet v1.0, a unified benchmark suite that bridges this evaluation gap by systematically assessing state-of-the-art Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and generalist models across robotics, multi-agent gameplay, and multimodal reasoning tasks. Our evaluation suite spans 11 datasets and includes state-of-the-art models such as GPT-5, Pi0, and Magma in their respective categories.
Key contributions of MultiNet v1.0 include:
- Comprehensive Domain Coverage: Evaluation across robotics, gameplay, commonsense reasoning, spatial reasoning, visual question answering, and visual understanding tasks. These capabilities are essential for generalist models and systems.
 - Standardized Evaluation Protocols: Unified metrics and evaluation procedures for fair comparison across different model architectures
 - Model Adaptation Framework: Open-source code for adapting diverse models to various out-of-distribution task domains
 - Extensive Analysis: In-depth analysis of model capabilities, failure modes, and architectural trade-offs
 - Open-Source Toolkit: Complete evaluation harness and benchmarking tools for the research community
 
Our findings reveal significant insights into the current state of multimodal AI, highlighting both promising capabilities and critical limitations that inform future research directions. We release our complete benchmark suite, evaluation framework, and detailed analysis to accelerate progress in this field.
Dataset Coverage
MultiNet v1.0 evaluates models across six major domains using 11 diverse datasets, covering modalities and task types that range from low-level robotic control to high-level reasoning. Each dataset presents unique challenges in vision-language-action understanding, from robotic manipulation to complex reasoning tasks. For complete dataset specifications and evaluation protocols, please refer to our technical report.
Evaluation Methodology and Metrics
MultiNet v1.0 employs standardized evaluation metrics tailored to each task category, ensuring comprehensive and fair assessment across diverse model architectures. Our evaluation framework adapts metrics to the unique characteristics of each domain while maintaining consistency for cross-domain comparisons.
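To make these metrics concrete, the sketch below shows how the three headline leaderboard metrics (exact match rate, macro F1, and mean absolute error) can be computed from paired predictions and references. This is a minimal illustration under our own naming assumptions, not the actual MultiNet evaluation harness code.

```python
# Minimal, illustrative metric implementations (not the actual MultiNet harness).

def exact_match_rate(preds, labels):
    """Percentage of predictions that exactly match the reference label."""
    return 100.0 * sum(p == l for p, l in zip(preds, labels)) / len(labels)

def macro_f1(preds, labels):
    """Unweighted mean of per-class F1 scores, in percent."""
    classes = set(labels) | set(preds)
    f1_scores = []
    for c in classes:
        tp = sum(p == c and l == c for p, l in zip(preds, labels))
        fp = sum(p == c and l != c for p, l in zip(preds, labels))
        fn = sum(p != c and l == c for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall) if (precision + recall) else 0.0)
    return 100.0 * sum(f1_scores) / len(f1_scores)

def mean_absolute_error(preds, labels):
    """Average absolute deviation between predicted and reference continuous values."""
    return sum(abs(p - l) for p, l in zip(preds, labels)) / len(labels)
```

Exact match and macro F1 suit the discrete classification and control tasks, while mean absolute error suits continuous robotics action prediction.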
Results
Our evaluation across diverse domains reveals significant insights into model performance and capabilities. Below we present detailed results from our evaluation suite.
Model Performance Comparison
Evaluation across 7 diverse tasks spanning robotics, digital control, and multimodal reasoning.
Task Groups: Robotics Control, Digital Control, Spatial Reasoning, Image Classification, Tool Use
Metrics:
- EM: Exact Match Rate (%)
- F1: Macro F1 Score (%)
- MAE: Mean Absolute Error
Visual Indicators:
- 🏆 Best score per task
- Wins: Number of tasks where the model scored highest
Notes:
* We did not profile GPT-5 on BFCLv3 in this release. See the Gorilla leaderboard for BFCLv4 results.
Model Output Comparison
Compare how different models respond to the same visual input from the ODinW selfdrivingCar dataset
Input
                
What object is shown in this image from the selfdrivingCar dataset?
Option 0: biker Option 1: car ...
Output the number (0-10) of the correct option only.
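The multiple-choice formulation above can be reproduced with a small prompt-construction and answer-parsing routine, sketched below. The helper names, the shortened option list, and the parsing rule (take the first integer in the response) are illustrative assumptions rather than the exact MultiNet adaptation code.

```python
import re

def build_mc_prompt(dataset_name, options):
    """Format an image-classification query as a numbered multiple-choice prompt."""
    option_line = " ".join(f"Option {i}: {name}" for i, name in enumerate(options))
    return (
        f"What object is shown in this image from the {dataset_name} dataset?\n"
        f"{option_line}\n"
        f"Output the number (0-{len(options) - 1}) of the correct option only."
    )

def parse_option(response, num_options):
    """Return the first integer found in the response, or None if it is not a valid option index."""
    match = re.search(r"\d+", response)
    if match is None:
        return None
    idx = int(match.group())
    return idx if 0 <= idx < num_options else None

# Hypothetical usage with a truncated option list:
prompt = build_mc_prompt("selfdrivingCar", ["biker", "car", "pedestrian"])
parse_option("The answer is 1", num_options=3)  # -> 1 ("car")
```

Under such a scheme, responses that never produce a valid option index would score as incorrect, which is one way misaligned output modalities (see the Magma finding below) translate directly into low exact-match scores.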
Pi0's prediction space collapse visualized
Pi0 experiences prediction space collapse on the Overcooked dataset, centered on action 24, which maps to (Player 1: STAY, Player 2: NORTH)
Frequency of predicted action classes for Pi0 model on Overcooked dataset
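The joint-action indexing behind this collapse can be made concrete with a short sketch. Assuming the common Overcooked-AI ordering of six primitive actions per player (NORTH, SOUTH, EAST, WEST, STAY, INTERACT) flattened row-major into 36 joint actions, index 24 decodes to (Player 1: STAY, Player 2: NORTH); the code below illustrates that assumption and a simple way to surface the collapse, and is not the exact MultiNet implementation.

```python
from collections import Counter

# Assumed primitive-action ordering (Overcooked-AI convention); 6 x 6 = 36 joint actions.
PRIMITIVES = ["NORTH", "SOUTH", "EAST", "WEST", "STAY", "INTERACT"]

def decode_joint_action(index):
    """Map a flat joint-action index to a (player 1, player 2) action pair."""
    p1, p2 = divmod(index, len(PRIMITIVES))
    return PRIMITIVES[p1], PRIMITIVES[p2]

def action_histogram(predicted_indices):
    """Frequency of predicted joint actions; a collapsed model concentrates mass on a few bins."""
    return Counter(decode_joint_action(i) for i in predicted_indices)

decode_joint_action(24)  # -> ('STAY', 'NORTH')
```

A histogram like the one plotted above, dominated by a single bin, is the signature of prediction space collapse rather than genuine policy behavior.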
Key Findings and Analysis
1. Catastrophic Cross-Domain Failure: MultiNet v1.0 reveals catastrophic failure at domain boundaries for vision-language-action models. No current model achieves true cross-domain generalization: Pi0 performance drops to 0% on basic vision-language tasks, and GPT-5, while performing relatively better, still falls far short of the performance necessary for successful task completion.
2. Domain-Specific Fine-Tuning Corruption: Fine-tuning for robotics appears to systematically corrupt vision-language models. Pi0 exhibits repetitive "increa" token spam, suggesting that action-oriented training degrades linguistic capabilities through catastrophic forgetting of language generation pathways.
3. Output Modality Misalignment: Magma, designed as a generalist model, produces spatial coordinates instead of text answers when prompted with language tasks. This reveals fundamental misalignment between input processing and output generation across different task domains.
4. Limited Impact of Prompt Engineering: Our prompt engineering experiments yielded only modest gains (~20% improvement), which cannot bridge fundamental architectural incompatibilities. This suggests that current model limitations are structural rather than interface-related.
5. Need for Architectural Innovation: These results demonstrate that current training paradigms create overspecialized models with incompatible domain-specific biases. This necessitates fundamental rethinking of modular architectures and progressive training strategies for truly unified multimodal systems.
For a detailed look at our results, analysis, and methodology, please refer to our comprehensive technical report.
Looking Forward
We are exploring several near-term experiments with collaborators, as well as larger-scale research directions that build on MultiNet's findings. If you're interested in contributing to the future of multimodal AI evaluation and development, we encourage you to get involved. Join our Discord community to connect with researchers and explore opportunities to contribute as part of the Manifold Research team.
Near Term Experiments
Building on MultiNet v1.0's findings of catastrophic failure at domain boundaries, our immediate research priorities focus on understanding and mitigating the fundamental limitations of current vision-language & action models. These investigations target the core mechanisms behind knowledge degradation, architectural incompatibilities, and the emergence of failure modes that prevent true cross-domain generalization.
Investigating the Gibberish Outputs of VLAs
- Token degradation analysis: Examine how fine-tuning on action sequences corrupts language generation pathways, particularly the emergence of repetitive tokens like Pi0's "increa" spam (a minimal repetition-rate probe is sketched after this list)
 - Mitigation strategies: Develop strategies to mitigate the degradation of language generation pathways despite end-to-end fine-tuning on entirely new data distributions
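As a concrete starting point for the token degradation analysis above, degenerate repetition can be quantified with a simple probe over generated outputs. The function below is a hedged sketch; the whitespace tokenization and top-k statistic are our own assumptions, not part of MultiNet v1.0.

```python
from collections import Counter

def repeated_token_rate(outputs, top_k=1):
    """Fraction of generated tokens accounted for by the top_k most frequent tokens.

    Values near 1.0 indicate degenerate repetition such as Pi0's "increa" spam.
    """
    tokens = [tok for text in outputs for tok in text.split()]  # naive whitespace tokenization
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    top = sum(count for _, count in counts.most_common(top_k))
    return top / len(tokens)

repeated_token_rate(["increa increa increa increa the"])  # -> 0.8
```

Tracking this rate across fine-tuning checkpoints would show when, during action-oriented training, language generation begins to degrade.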
 
Investigating SoM/ToM Outputs of Magma
- Coordinate-text mapping errors: Investigate why Magma produces spatial coordinates instead of natural language responses when prompted with inputs of completely different domains, revealing fundamental misalignment in output modalities
 - Output modality alignment: Develop fine-tuning/training techniques and architectural modifications to ensure proper alignment between input domains and output modalities, preventing cross-modal contamination in multi-task models
 
Pi0.5 - SoTA Knowledge-Insulated VLA Performance on v1.0
- Architectural improvements: Evaluate next-generation knowledge insulation techniques in Pi0.5 against MultiNet v1's comprehensive benchmark suite
 - Domain transfer efficiency: Measure how effectively Pi0.5 maintains performance across vision-language tasks while adapting to robotic control
 - Failure mode comparison: Contrast Pi0.5's failure patterns with previous Vision-Language-Action models to validate architectural advances in preventing catastrophic forgetting
 
Knowledge Insulation Testing and Experiments
- Modular architecture evaluation: Test compartmentalized model designs that isolate domain-specific knowledge while maintaining shared representations for common reasoning tasks
 - Progressive fine-tuning protocols: Investigate learning approaches that gradually introduce new domains without corrupting existing capabilities
 
Long-term Research Directions
Our long-term vision extends beyond addressing current limitations to fundamentally reimagining how we evaluate, understand, and build multimodal action models. These goals represent paradigm shifts toward more robust, adaptive, and truly general AI systems that can seamlessly operate across diverse domains while maintaining coherent reasoning capabilities.
Live Benchmarks
- Dynamic evaluation frameworks: Develop continuously updating benchmarks that adapt to model capabilities, preventing overfitting to static test sets
 - Real-time performance monitoring: Create systems for ongoing assessment of deployed VLAs across diverse real-world scenarios
 - Community-driven evaluation: Build platforms for researchers to contribute new tasks and domains, ensuring benchmark relevance as the field evolves
 
World Models as Evaluators
- Causal reasoning assessment: Build and leverage world models to evaluate whether multimodal systems of the future understand causal relationships rather than just statistical correlations
 - Counterfactual analysis: Deploy world models to test the robustness of multimodal systems through systematic perturbation of environmental conditions and task parameters
 
Building the Next Generation of Multimodal Action Models
- Unified architecture design: Develop foundational architectures that natively support multiple modalities without domain-specific fine-tuning degradation
 - Compositional reasoning systems: Create models that can decompose complex tasks into modular components, enabling flexible recombination across domains
 - Meta-learning for rapid adaptation: Build systems that can quickly acquire new capabilities while preserving existing knowledge, moving beyond current catastrophic forgetting limitations
 