RawBench: A minimal prompt evaluation framework


A powerful, minimal framework for LLM prompt evaluation, with YAML configuration, tool-execution support, and comprehensive result tracking.


Most prompt testing tools are either too academic or too bloated.

RawBench is for devs who want:

  • YAML-first, CLI-native minimal workflow
  • Built-in tool-call mocking with recursive support
  • Dynamic variables (functions, env, time, etc.)
  • Multi-model testing with latency + cost metrics
  • Zero setup: just run rawbench init && rawbench run

[Screenshot: terminal output]

Features:

  • Multi-model testing with simultaneous evaluation
  • YAML configuration with Docker-compose style anchors
  • Variable substitution and template system
  • Metrics for latency, tokens, and costs
  • CLI and Python API interfaces
  • Extensible tool mocking system
  • Dynamic variable injection
  • Beautiful HTML reports
  • Local dashboard for interactive result viewing

Planned:

  • Assertions
  • Response caching
  • AI judge
  • Prompt auto-finetuning
  • More LLM providers
  • ...

Installation:

```bash
git clone https://github.com/0xsomesh/rawbench.git
cd rawbench
make install
```

```bash
# Initialize rawbench
rawbench init rawbench_tests
cd rawbench_tests

# Export your OpenAI API key
export OPENAI_API_KEY="<your_key_here>"

# Run an evaluation
rawbench run tests/template.yaml --html -o template_result

# Start the local dashboard server
rawbench serve --port 8000
```

RawBench now includes a local React dashboard for interactive result viewing:

  • Interactive Results Viewer: Browse and analyze evaluation results with a modern web interface
  • Real-time Updates: View results as they're generated
  • Detailed Metrics: Explore latency, token usage, and cost breakdowns
  • Test Case Analysis: Drill down into individual test cases and responses
  • Model Comparison: Compare performance across different models side-by-side

To start the dashboard:

```bash
rawbench serve --port 8000
```

Then open your browser to http://localhost:8000 to access the dashboard.

[Dashboard screenshots: heatmap, test list, raw JSON]

RawBench uses YAML files for configuration. Here's a comprehensive guide to the configuration options:

```yaml
id: evaluation-name
description: Optional description of the evaluation

models:
  - id: model-id
    provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 1024

prompts:
  - id: prompt-id
    system: |
      System prompt text here

tests:
  - id: test-id
    messages:
      - role: user
        content: Test message content
```
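
The features list mentions Docker-compose-style anchors: since the configuration is plain YAML, standard anchor and merge-key syntax can factor out shared model settings, assuming the loader supports YAML merge keys (as PyYAML does). A minimal sketch; the anchor name is illustrative, and the fields follow the schema above:

```yaml
models:
  - &gpt4_base           # anchor a base model configuration
    id: gpt4-default
    provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 1024
  - <<: *gpt4_base       # reuse the base, overriding selected fields
    id: gpt4-hot
    temperature: 0.9
```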

RawBench supports powerful tool mocking for testing agents that use function calling:

  • Recursive: Handles multiple tool calls in sequence
  • Priority Resolution: Test-specific mocks override global mocks
  • Loop Prevention: max_iterations prevents infinite loops
  • Clean: Simple YAML structure

```yaml
tools:
  - id: search_tool
    name: search_tool
    description: Search for information
    parameters:
      type: object
      properties:
        query:
          type: string
          description: Search query
      required: [query]
    mock:
      output: '{"results": [{"title": "Example", "content": "Search result"}]}'

tests:
  - id: search-test
    tool_execution:
      mode: mock          # mock or actual
      max_iterations: 5   # Prevent infinite loops
      output:             # Test-specific mocks (overrides global)
        - id: search_tool
          output: '{"results": [{"title": "Custom", "content": "Custom result"}]}'
    messages:
      - role: user
        content: "Search for information about AI"
```
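
Conceptually, recursive mock execution behaves like the following loop. This is a simplified Python sketch of the behavior described above (mock priority resolution plus max_iterations loop prevention), not RawBench's actual implementation:

```python
# Simplified sketch of recursive tool-call mocking; illustrative only,
# not RawBench's actual code.
def run_with_mocks(llm, messages, global_mocks, test_mocks, max_iterations=5):
    # Priority resolution: test-specific mocks override global ones.
    mocks = {**global_mocks, **test_mocks}
    for _ in range(max_iterations):  # loop prevention
        response = llm(messages)
        tool_calls = getattr(response, "tool_calls", None)
        if not tool_calls:
            return response  # the model produced a final answer
        for call in tool_calls:
            # Feed each mocked output back as a tool message, then re-ask.
            messages.append({
                "role": "tool",
                "name": call.name,
                "content": mocks[call.name],
            })
    raise RuntimeError("max_iterations reached without a final answer")
```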

You can compare multiple models or different configurations of the same model:

```yaml
models:
  - id: gpt4-conservative
    provider: openai
    name: gpt-4
    temperature: 0.2
  - id: gpt4-creative
    provider: openai
    name: gpt-4
    temperature: 0.8
```

You can compare multiple prompts:

```yaml
prompts:
  - id: default_researcher
    system: |
      You are a helpful crypto research assistant.
  - id: default_teacher
    system: |
      You are a knowledgeable teacher.
```

Variables and Dynamic Content

RawBench supports dynamic variables in your prompts:

```yaml
variables:
  current_time:
    function: current_datetime   # Loads from variables/current_datetime.py

prompts:
  - id: time_aware_prompt
    system: |
      Current time is {{current_time}}
      Please consider this timestamp in your responses.
```

Note: you'll need to create variables/current_datetime.py and define a function current_datetime in it that returns the string to substitute for {{current_time}}.
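
For example, a minimal variables/current_datetime.py could look like this. The exact contract (a zero-argument function returning a string) is an assumption based on the config above:

```python
# variables/current_datetime.py
# Sketch of a dynamic-variable function; the zero-argument,
# string-returning contract is assumed rather than documented.
from datetime import datetime, timezone

def current_datetime() -> str:
    # The returned string replaces {{current_time}} in the prompt.
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
```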

  1. Multi-Model Comparison

    • Location: examples/evaluations/multi-model-comparison.yaml
    • Compare responses from different models or configurations
    • Track performance metrics across models
  2. Complex Evaluation Criteria

    • Location: examples/evaluations/complex-criteria.yaml
    • Define sophisticated evaluation rules
    • Apply multiple test cases
  3. Variable Usage

    • Location: examples/evaluations/variable-usage.yaml
    • Inject dynamic content into prompts
    • Use environment variables and functions
  4. Tool Mocking

    • Location: examples/evaluations/tool-mock-example.yaml
    • Mock external tool calls
    • Test tool-using agents
  5. Recursive Tool Testing

    • Location: examples/evaluations/recursive-tool-test.yaml
    • Test agents that make multiple tool calls
    • Complex workflow testing
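
Each example runs with the same command shown in the quickstart, for instance:

```bash
rawbench run examples/evaluations/tool-mock-example.yaml --html -o tool_mock_result
```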

Requirements:

  • Python ≥ 3.8

License: MIT
