Provider-agnostic, open-source evaluation infrastructure for language models 🚀
OpenBench provides standardized, reproducible benchmarking for LLMs across 20+ evaluation suites spanning knowledge, reasoning, coding, and mathematics. Works with any model provider - Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, and more.
We're building in public! This is an alpha release - expect rapid iteration. The first stable release is coming soon.
- 🎯 20+ Benchmarks: MMLU, GPQA, HumanEval, SimpleQA, and competition math (AIME, HMMT)
- 🔧 Simple CLI: bench list, bench describe, bench eval
- 🏗️ Built on inspect-ai: Industry-standard evaluation framework
- 📊 Extensible: Easy to add new benchmarks and metrics
- 🤖 Provider-agnostic: Works with 15+ model providers out of the box
Prerequisite: Install uv
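A minimal quick start sketch; the package name, the GROQ_API_KEY variable, and the mmlu benchmark identifier below are assumptions rather than details taken from this README:

```bash
# Install uv (official installer)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install openbench into it
# (assumes the package is published on PyPI as "openbench"; installing from
#  GitHub with uv pip install git+https://github.com/groq/openbench.git also works)
uv venv
source .venv/bin/activate
uv pip install openbench

# Set an API key for your provider of choice, e.g. Groq
# (GROQ_API_KEY is the provider's standard variable name, assumed here)
export GROQ_API_KEY=<your_api_key>

# Smoke test: 10 samples of MMLU against the default model
# ("mmlu" is an assumed identifier; run bench list for the real names)
bench eval mmlu --limit 10
```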
| Category | Benchmarks |
|-----------|------------|
| Knowledge | MMLU (57 subjects), GPQA (graduate-level), SuperGPQA (285 disciplines), OpenBookQA |
| Coding | HumanEval (164 problems) |
| Math | AIME 2023-2025, HMMT Feb 2023-2025, BRUMO 2025 |
| Reasoning | SimpleQA (factuality), MuSR (multi-step reasoning) |
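Each of these is addressed by a short identifier on the command line. A sketch of discovering and inspecting one (the mmlu identifier is an assumption; bench list prints the authoritative names):

```bash
# Print every available evaluation, model, and flag
bench list

# Show the details of a single benchmark before running it
# ("mmlu" is an assumed identifier; use whatever bench list prints)
bench describe mmlu
```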
For a complete list of all commands and options, run: bench --help
| Command | Description |
|---------|-------------|
| bench | Show main menu with available commands |
| bench list | List available evaluations, models, and flags |
| bench eval <benchmark> | Run benchmark evaluation on a model |
| bench view | View logs from previous benchmark runs |
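For example, evaluating a model from a different provider and then reviewing the logs; the mmlu identifier is an assumption, and the model string follows the provider/model convention shown in the defaults below:

```bash
# Run a benchmark against an OpenAI-hosted model instead of the default Groq model
# ("mmlu" and "openai/gpt-4o" are illustrative; any provider/model pair should work)
bench eval mmlu --model openai/gpt-4o

# Look back over the logs from previous runs
bench view
```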
| Flag | Environment Variable | Default | Description |
|------|----------------------|---------|-------------|
| --model | BENCH_MODEL | groq/meta-llama/llama-4-scout-17b-16e-instruct | Model(s) to evaluate |
| --epochs | BENCH_EPOCHS | 1 | Number of epochs to run each evaluation |
| --max-connections | BENCH_MAX_CONNECTIONS | 10 | Maximum parallel requests to model |
| --temperature | BENCH_TEMPERATURE | 0.6 | Model temperature |
| --top-p | BENCH_TOP_P | 1.0 | Model top-p |
| --max-tokens | BENCH_MAX_TOKENS | None | Maximum tokens for model response |
| --seed | BENCH_SEED | None | Seed for deterministic generation |
| --limit | BENCH_LIMIT | None | Limit evaluated samples (number or start,end) |
| --logfile | BENCH_OUTPUT | None | Output file for results |
| --sandbox | BENCH_SANDBOX | None | Environment to run evaluation (local/docker) |
| --timeout | BENCH_TIMEOUT | 10000 | Timeout for each API request (seconds) |
| --display | BENCH_DISPLAY | None | Display type (full/conversation/rich/plain/none) |
| --reasoning-effort | BENCH_REASONING_EFFORT | None | Reasoning effort level (low/medium/high) |
| --json | None | False | Output results in JSON format |
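Each flag can also be supplied through its environment variable. A sketch of two equivalent invocations (the mmlu identifier is an assumption):

```bash
# Configure a run entirely with CLI flags...
bench eval mmlu \
  --model groq/meta-llama/llama-4-scout-17b-16e-instruct \
  --temperature 0.6 \
  --limit 100 \
  --logfile results.json

# ...or with the equivalent environment variables
export BENCH_MODEL=groq/meta-llama/llama-4-scout-17b-16e-instruct
export BENCH_TEMPERATURE=0.6
export BENCH_LIMIT=100
export BENCH_OUTPUT=results.json
bench eval mmlu
```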
OpenBench is built on Inspect AI. To create custom evaluations, check out their excellent documentation.
OpenBench provides:
- Reference implementations of 20+ major benchmarks with consistent interfaces
- Shared utilities for common patterns (math scoring, multi-language support, etc.)
- Curated scorers that work across different eval types
- CLI tooling optimized for running standardized benchmarks
Think of it as a benchmark library built on Inspect's excellent foundation.
Different tools for different needs! OpenBench focuses on:
- Shared components: Common scorers, solvers, and datasets across benchmarks reduce code duplication
- Clean implementations: Each eval is written for readability and reliability
- Developer experience: Simple CLI, consistent patterns, easy to extend
We built OpenBench because we needed evaluation code that was easy to understand, modify, and trust. It's a curated set of benchmarks built on Inspect AI's excellent foundation.
If you want bench to be available outside of uv, you can install it as a global uv tool, assuming the package is published on PyPI as openbench:
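```bash
# Put the bench CLI on your PATH, outside any single virtual environment
# (the "openbench" package name is an assumption; adjust if the published name differs)
uv tool install openbench
```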
Some evaluations require logging into Hugging Face to download their datasets. If bench prompts you to do so, or throws "gated" errors, setting a Hugging Face authentication token in your environment (shown below) should fix the issue. Full details are in the Hugging Face documentation on authentication.
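A minimal sketch, assuming the standard HF_TOKEN variable used by the huggingface_hub library is the one being read:

```bash
# Create a read token at https://huggingface.co/settings/tokens, then export it
# (HF_TOKEN is the standard Hugging Face token variable; that bench reads it is an assumption)
export HF_TOKEN=<your_huggingface_token>
```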
For development work, you'll need to clone the repository. One possible setup, assuming a uv-managed project:
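```bash
# Grab the source
git clone https://github.com/groq/openbench.git
cd openbench

# Create a development environment from the project's configuration
# (uv sync assumes the repo is uv-managed; check its contributing docs for the canonical steps)
uv sync
```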
We welcome contributions! Please open issues and PRs at github.com/groq/openbench.
As the authors of OpenBench, we strive to implement each evaluation as faithfully as possible to the original benchmark.
Even so, you may observe numerical discrepancies between OpenBench's scores and scores reported elsewhere.
These differences can arise for many reasons, including (but not limited to) minor variations in model prompts, differences in model quantization or inference stacks, and the adaptations needed to make benchmarks compatible with the packages OpenBench is built on.
As a result, OpenBench results are meant to be compared with other OpenBench results, not treated as a one-to-one match with every external report. For meaningful comparisons, make sure you are using the same version of OpenBench.
We encourage developers to flag areas for improvement, and we welcome open-source contributions to OpenBench.
This project would not be possible without:
- Inspect AI - The incredible evaluation framework that powers OpenBench
- EleutherAI's lm-evaluation-harness - Pioneering work in standardized LLM evaluation
- Hugging Face's lighteval - Excellent evaluation infrastructure
MIT
Built with ❤️ by Aarush Sah and the Groq team