Provider-agnostic, open-source evaluation infrastructure for language models 🚀
OpenBench provides standardized, reproducible benchmarking for LLMs across 20+ evaluation suites spanning knowledge, reasoning, coding, and mathematics. Works with any model provider - Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, and more.
We're building in public! This is an alpha release - expect rapid iteration. The first stable release is coming soon.
- 🎯 20+ Benchmarks: MMLU, GPQA, HumanEval, SimpleQA, and competition math (AIME, HMMT)
- 🔧 Simple CLI: bench list, bench describe, bench eval
- 🏗️ Built on inspect-ai: Industry-standard evaluation framework
- 📊 Extensible: Easy to add new benchmarks and metrics
- 🤖 Provider-agnostic: Works with 15+ model providers out of the box
Prerequisite: Install uv
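A minimal quick start sketch; the package name, the GROQ_API_KEY variable, and the mmlu benchmark identifier below are assumptions rather than details taken from this README:

```bash
# Install uv (official installer)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install openbench into it
# (assumes the package is published on PyPI as "openbench"; installing from
#  GitHub with uv pip install git+https://github.com/groq/openbench.git also works)
uv venv
source .venv/bin/activate
uv pip install openbench

# Set an API key for your provider of choice, e.g. Groq
# (GROQ_API_KEY is the provider's standard variable name, assumed here)
export GROQ_API_KEY=<your_api_key>

# Smoke test: 10 samples of MMLU against the default model
# ("mmlu" is an assumed identifier; run bench list for the real names)
bench eval mmlu --limit 10
```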
| Category | Benchmarks |
|-----------|------------|
| Knowledge | MMLU (57 subjects), GPQA (graduate-level), SuperGPQA (285 disciplines), OpenBookQA |
| Coding | HumanEval (164 problems) |
| Math | AIME 2023-2025, HMMT Feb 2023-2025, BRUMO 2025 |
| Reasoning | SimpleQA (factuality), MuSR (multi-step reasoning) |
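Each of these is addressed by a short identifier on the command line. A sketch of discovering and inspecting one (the mmlu identifier is an assumption; bench list prints the authoritative names):

```bash
# Print every available evaluation, model, and flag
bench list

# Show the details of a single benchmark before running it
# ("mmlu" is an assumed identifier; use whatever bench list prints)
bench describe mmlu
```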
For a complete list of all commands and options, run: bench --help
| Command | Description |
|---------|-------------|
| bench | Show main menu with available commands |
| bench list | List available evaluations, models, and flags |
| bench eval <benchmark> | Run benchmark evaluation on a model |
| bench view | View logs from previous benchmark runs |
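For example, evaluating a model from a different provider and then reviewing the logs; the mmlu identifier is an assumption, and the model string follows the provider/model convention shown in the defaults below:

```bash
# Run a benchmark against an OpenAI-hosted model instead of the default Groq model
# ("mmlu" and "openai/gpt-4o" are illustrative; any provider/model pair should work)
bench eval mmlu --model openai/gpt-4o

# Look back over the logs from previous runs
bench view
```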
| Flag | Environment Variable | Default | Description |
|------|----------------------|---------|-------------|
| --model | BENCH_MODEL | groq/meta-llama/llama-4-scout-17b-16e-instruct | Model(s) to evaluate |
| --epochs | BENCH_EPOCHS | 1 | Number of epochs to run each evaluation |
| --max-connections | BENCH_MAX_CONNECTIONS | 10 | Maximum parallel requests to model |
| --temperature | BENCH_TEMPERATURE | 0.6 | Model temperature |
| --top-p | BENCH_TOP_P | 1.0 | Model top-p |
| --max-tokens | BENCH_MAX_TOKENS | None | Maximum tokens for model response |
| --seed | BENCH_SEED | None | Seed for deterministic generation |
| --limit | BENCH_LIMIT | None | Limit evaluated samples (number or start,end) |
| --logfile | BENCH_OUTPUT | None | Output file for results |
| --sandbox | BENCH_SANDBOX | None | Environment to run evaluation (local/docker) |
| --timeout | BENCH_TIMEOUT | 10000 | Timeout for each API request (seconds) |
| --display | BENCH_DISPLAY | None | Display type (full/conversation/rich/plain/none) |
| --reasoning-effort | BENCH_REASONING_EFFORT | None | Reasoning effort level (low/medium/high) |
| --json | None | False | Output results in JSON format |
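Each flag can also be supplied through its environment variable. A sketch of two equivalent invocations (the mmlu identifier is an assumption):

```bash
# Configure a run entirely with CLI flags...
bench eval mmlu \
  --model groq/meta-llama/llama-4-scout-17b-16e-instruct \
  --temperature 0.6 \
  --limit 100 \
  --logfile results.json

# ...or with the equivalent environment variables
export BENCH_MODEL=groq/meta-llama/llama-4-scout-17b-16e-instruct
export BENCH_TEMPERATURE=0.6
export BENCH_LIMIT=100
export BENCH_OUTPUT=results.json
bench eval mmlu
```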
OpenBench is built on Inspect AI. To create custom evaluations, check out their excellent documentation.
OpenBench provides:
- Reference implementations of 20+ major benchmarks with consistent interfaces
- Shared utilities for common patterns (math scoring, multi-language support, etc.)
- Curated scorers that work across different eval types
- CLI tooling optimized for running standardized benchmarks
Think of it as a benchmark library built on Inspect's excellent foundation.
Different tools for different needs! OpenBench focuses on:
- Shared components: Common scorers, solvers, and datasets across benchmarks reduce code duplication
- Clean implementations: Each eval is written for readability and reliability
- Developer experience: Simple CLI, consistent patterns, easy to extend
We built OpenBench because we needed evaluation code that was easy to understand, modify, and trust. It's a curated set of benchmarks built on Inspect AI's excellent foundation.
If you want bench to be available outside of uv, you can install it as a global uv tool, assuming the package is published on PyPI as openbench:
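```bash
# Put the bench CLI on your PATH, outside any single virtual environment
# (the "openbench" package name is an assumption; adjust if the published name differs)
uv tool install openbench
```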
Some evaluations require logging into Hugging Face to download their datasets. If bench prompts you to do so, or throws "gated" errors, setting a Hugging Face authentication token in your environment (shown below) should fix the issue. Full details are in the Hugging Face documentation on authentication.
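A minimal sketch, assuming the standard HF_TOKEN variable used by the huggingface_hub library is the one being read:

```bash
# Create a read token at https://huggingface.co/settings/tokens, then export it
# (HF_TOKEN is the standard Hugging Face token variable; that bench reads it is an assumption)
export HF_TOKEN=<your_huggingface_token>
```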
For development work, you'll need to clone the repository. One possible setup, assuming a uv-managed project:
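```bash
# Grab the source
git clone https://github.com/groq/openbench.git
cd openbench

# Create a development environment from the project's configuration
# (uv sync assumes the repo is uv-managed; check its contributing docs for the canonical steps)
uv sync
```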
We welcome contributions! Please open issues and PRs at github.com/groq/openbench.
As the authors of OpenBench, we strive to implement each evaluation as faithfully as possible to the original benchmark.
Even so, you may observe numerical discrepancies between OpenBench's scores and scores reported elsewhere.
These differences can arise for many reasons, including (but not limited to) minor variations in model prompts, differences in model quantization or inference stacks, and the adaptations needed to make benchmarks compatible with the packages OpenBench is built on.
As a result, OpenBench results are meant to be compared with other OpenBench results, not treated as a one-to-one match with every external report. For meaningful comparisons, make sure you are using the same version of OpenBench.
We encourage developers to flag areas for improvement, and we welcome open-source contributions to OpenBench.
This project would not be possible without:
- Inspect AI - The incredible evaluation framework that powers OpenBench
- EleutherAI's lm-evaluation-harness - Pioneering work in standardized LLM evaluation
- Hugging Face's lighteval - Excellent evaluation infrastructure
MIT
Built with ❤️ by Aarush Sah and the Groq team