Powerful, minimal framework for LLM prompt evaluation with YAML configuration, tool execution support, and comprehensive result tracking.
Most prompt testing tools are either too academic or too bloated.
RawBench is for devs who want:
- YAML-first, CLI-native minimal workflow
- Built-in tool-call mocking with recursive support
- Dynamic variables (functions, env, time, etc.)
- Multi-model testing with latency + cost metrics
- Zero setup: just run `rawbench init && rawbench run`
- Multi-model testing with simultaneous evaluation
- YAML configuration with Docker Compose-style anchors
- Variable substitution and template system
- Metrics for latency, tokens, and costs
- CLI and Python API interfaces
- Extensible tool mocking system
- Dynamic variable injection
- Beautiful HTML reports
- Local dashboard for interactive result viewing
- Assertions
- Response caching
- AI judge
- Prompt auto-finetuning
- More LLM providers
- ...
RawBench now includes a local React dashboard for interactive result viewing:
- Interactive Results Viewer: Browse and analyze evaluation results with a modern web interface
- Real-time Updates: View results as they're generated
- Detailed Metrics: Explore latency, token usage, and cost breakdowns
- Test Case Analysis: Drill down into individual test cases and responses
- Model Comparison: Compare performance across different models side-by-side
To start the dashboard, run the RawBench dashboard command from your project directory, then open your browser to http://localhost:8000 to access it.
RawBench uses YAML files for configuration. The sections below walk through the main configuration options:
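As a rough sketch of the overall shape of such a file (the key names used here, like `models`, `prompt`, and `tests`, are illustrative assumptions rather than the exact RawBench schema; see the files under `examples/evaluations/` for authoritative layouts), an evaluation combines model definitions, prompts, and test cases, and can reuse shared settings via Docker Compose-style YAML anchors:

```yaml
# Illustrative sketch only: key names are assumptions, not the exact RawBench schema.
defaults: &model_defaults        # Docker Compose-style YAML anchor, reused below
  temperature: 0.2
  max_tokens: 512

models:
  - id: gpt-4o-mini
    <<: *model_defaults          # merge the shared settings into this model entry

prompt:
  system: "You are a concise assistant."
  user: "Summarize: {{article}}" # template variable, filled in per test case

tests:
  - name: short-summary
    vars:
      article: "RawBench is a minimal prompt evaluation framework."
```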
RawBench supports powerful tool mocking for testing agents that use function calling (see the sketch after this list):
- Recursive: Handles multiple tool calls in sequence
- Priority Resolution: Test-specific mocks override global mocks
- Loop Prevention: `max_iterations` prevents infinite loops
- Clean: Simple YAML structure
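As a minimal sketch of how a mock setup could be laid out (field names other than `max_iterations` are assumptions, not the exact RawBench schema), a global mock plus a test-specific override might look like:

```yaml
# Illustrative sketch only: key names besides max_iterations are assumptions.
tools:
  - name: get_weather
    mock:
      response: '{"temperature_c": 21, "condition": "sunny"}'      # global mock, used by default

tests:
  - name: weather-agent
    max_iterations: 5            # loop prevention: stop after 5 rounds of tool calls
    tools:
      - name: get_weather
        mock:
          response: '{"temperature_c": -3, "condition": "snow"}'   # test-specific mock overrides the global one
```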
You can compare multiple models or different configurations of the same model:
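A minimal sketch, assuming a `models` list where entries can repeat a model ID with different settings (illustrative key names, not the exact schema):

```yaml
# Illustrative sketch only: key names are assumptions.
models:
  - id: gpt-4o-mini
    temperature: 0.0
  - id: gpt-4o-mini              # same model, different configuration
    temperature: 0.9
  - id: claude-3-haiku
    temperature: 0.0
```

Each test case is then run against every entry, so latency, token, and cost metrics can be compared side by side.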
You can compare multiple prompts:
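Similarly, a sketch of a multi-prompt comparison (again with illustrative key names):

```yaml
# Illustrative sketch only: key names are assumptions.
prompts:
  - id: terse
    system: "Answer in one sentence."
  - id: detailed
    system: "Answer thoroughly and explain your reasoning."
```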
RawBench supports dynamic variables in your prompts:
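As a sketch of how that can look (the `{{...}}` placeholder syntax and the `variables` keys are assumptions for illustration; RawBench's actual template syntax may differ), a prompt might pull values from a function and from an environment variable:

```yaml
# Illustrative sketch only: key names and the {{...}} syntax are assumptions.
variables:
  current_time:
    type: function               # resolved by calling a user-defined current_time() function
  api_env:
    type: env                    # read from the API_ENV environment variable
    name: API_ENV

prompt:
  system: "The current time is {{current_time}}. Target environment: {{api_env}}."
```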
Note: you'll need to create a new file named `current_time` and define a function `current_time` in it that returns the string to substitute.
RawBench ships with ready-made example evaluations:
- Multi-Model Comparison
  - Location: `examples/evaluations/multi-model-comparison.yaml`
  - Compare responses from different models or configurations
  - Track performance metrics across models
- Complex Evaluation Criteria
  - Location: `examples/evaluations/complex-criteria.yaml`
  - Define sophisticated evaluation rules
  - Apply multiple test cases
- Variable Usage
  - Location: `examples/evaluations/variable-usage.yaml`
  - Inject dynamic content into prompts
  - Use environment variables and functions
- Tool Mocking
  - Location: `examples/evaluations/tool-mock-example.yaml`
  - Mock external tool calls
  - Test tool-using agents
- Recursive Tool Testing
  - Location: `examples/evaluations/recursive-tool-test.yaml`
  - Test agents that make multiple tool calls
  - Complex workflow testing
Requires Python ≥ 3.8.
Licensed under the MIT License.