RawBench: A minimal prompt evaluation framework


A powerful, minimal framework for LLM prompt evaluation, with YAML configuration, tool-execution support, and comprehensive result tracking.


Most prompt testing tools are either too academic or too bloated.

RawBench is for devs who want:

  • YAML-first, CLI-native minimal workflow
  • Built-in tool-call mocking with recursive support
  • Dynamic variables (functions, env, time, etc.)
  • Multi-model testing with latency + cost metrics
  • Zero setup: just run rawbench init && rawbench run

[Screenshot: terminal output]

Features:

  • Multi-model testing with simultaneous evaluation
  • YAML configuration with Docker-compose style anchors
  • Variable substitution and template system
  • Metrics for latency, tokens, and costs
  • CLI and Python API interfaces
  • Extensible tool mocking system
  • Dynamic variable injection
  • Beautiful HTML reports
  • Local dashboard for interactive result viewing

Planned:

  • Assertions
  • Response caching
  • AI judge
  • Prompt auto-finetuning
  • More LLM providers
  • ...

Installation:

```bash
git clone https://github.com/0xsomesh/rawbench.git
cd rawbench
make install
```

```bash
# Initialize rawbench
rawbench init rawbench_tests
cd rawbench_tests

# Export your OpenAI API key
export OPENAI_API_KEY="<your_key_here>"

# Run an evaluation
rawbench run tests/template.yaml --html -o template_result

# Start the local dashboard server
rawbench serve --port 8000
```

RawBench now includes a local React dashboard for interactive result viewing:

  • Interactive Results Viewer: Browse and analyze evaluation results with a modern web interface
  • Real-time Updates: View results as they're generated
  • Detailed Metrics: Explore latency, token usage, and cost breakdowns
  • Test Case Analysis: Drill down into individual test cases and responses
  • Model Comparison: Compare performance across different models side-by-side

To start the dashboard:

```bash
rawbench serve --port 8000
```

Then open your browser to http://localhost:8000 to access the dashboard.

[Dashboard screenshots: heatmap, test list, raw JSON]

RawBench uses YAML files for configuration. Here's a comprehensive guide to the configuration options:

```yaml
id: evaluation-name
description: Optional description of the evaluation

models:
  - id: model-id
    provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 1024

prompts:
  - id: prompt-id
    system: |
      System prompt text here

tests:
  - id: test-id
    messages:
      - role: user
        content: Test message content
```
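
The features list mentions Docker-compose-style anchors: since the configuration is plain YAML, standard anchor and merge-key syntax can factor out shared model settings, assuming the loader supports YAML merge keys (as PyYAML does). A minimal sketch; the anchor name is illustrative, and the fields follow the schema above:

```yaml
models:
  - &gpt4_base           # anchor a base model configuration
    id: gpt4-default
    provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 1024
  - <<: *gpt4_base       # reuse the base, overriding selected fields
    id: gpt4-hot
    temperature: 0.9
```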

RawBench supports powerful tool mocking for testing agents that use function calling:

  • Recursive: Handles multiple tool calls in sequence
  • Priority Resolution: Test-specific mocks override global mocks
  • Loop Prevention: max_iterations prevents infinite loops
  • Clean: Simple YAML structure

```yaml
tools:
  - id: search_tool
    name: search_tool
    description: Search for information
    parameters:
      type: object
      properties:
        query:
          type: string
          description: Search query
      required: [query]
    mock:
      output: '{"results": [{"title": "Example", "content": "Search result"}]}'

tests:
  - id: search-test
    tool_execution:
      mode: mock          # mock or actual
      max_iterations: 5   # Prevent infinite loops
      output:             # Test-specific mocks (overrides global)
        - id: search_tool
          output: '{"results": [{"title": "Custom", "content": "Custom result"}]}'
    messages:
      - role: user
        content: "Search for information about AI"
```
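
Conceptually, recursive mock execution behaves like the following loop. This is a simplified Python sketch of the behavior described above (mock priority resolution plus max_iterations loop prevention), not RawBench's actual implementation:

```python
# Simplified sketch of recursive tool-call mocking; illustrative only,
# not RawBench's actual code.
def run_with_mocks(llm, messages, global_mocks, test_mocks, max_iterations=5):
    # Priority resolution: test-specific mocks override global ones.
    mocks = {**global_mocks, **test_mocks}
    for _ in range(max_iterations):  # loop prevention
        response = llm(messages)
        tool_calls = getattr(response, "tool_calls", None)
        if not tool_calls:
            return response  # the model produced a final answer
        for call in tool_calls:
            # Feed each mocked output back as a tool message, then re-ask.
            messages.append({
                "role": "tool",
                "name": call.name,
                "content": mocks[call.name],
            })
    raise RuntimeError("max_iterations reached without a final answer")
```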

You can compare multiple models or different configurations of the same model:

```yaml
models:
  - id: gpt4-conservative
    provider: openai
    name: gpt-4
    temperature: 0.2
  - id: gpt4-creative
    provider: openai
    name: gpt-4
    temperature: 0.8
```

You can compare multiple prompts:

```yaml
prompts:
  - id: default_researcher
    system: |
      You are a helpful crypto research assistant.
  - id: default_teacher
    system: |
      You are a knowledgeable teacher.
```

Variables and Dynamic Content

RawBench supports dynamic variables in your prompts:

```yaml
variables:
  current_time:
    function: current_datetime   # Loads from variables/current_datetime.py

prompts:
  - id: time_aware_prompt
    system: |
      Current time is {{current_time}}
      Please consider this timestamp in your responses.
```

Note: you'll need to create variables/current_datetime.py and define a function current_datetime in it that returns the string to substitute for {{current_time}}.
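
For example, a minimal variables/current_datetime.py could look like this. The exact contract (a zero-argument function returning a string) is an assumption based on the config above:

```python
# variables/current_datetime.py
# Sketch of a dynamic-variable function; the zero-argument,
# string-returning contract is assumed rather than documented.
from datetime import datetime, timezone

def current_datetime() -> str:
    # The returned string replaces {{current_time}} in the prompt.
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
```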

  1. Multi-Model Comparison

    • Location: examples/evaluations/multi-model-comparison.yaml
    • Compare responses from different models or configurations
    • Track performance metrics across models
  2. Complex Evaluation Criteria

    • Location: examples/evaluations/complex-criteria.yaml
    • Define sophisticated evaluation rules
    • Apply multiple test cases
  3. Variable Usage

    • Location: examples/evaluations/variable-usage.yaml
    • Inject dynamic content into prompts
    • Use environment variables and functions
  4. Tool Mocking

    • Location: examples/evaluations/tool-mock-example.yaml
    • Mock external tool calls
    • Test tool-using agents
  5. Recursive Tool Testing

    • Location: examples/evaluations/recursive-tool-test.yaml
    • Test agents that make multiple tool calls
    • Complex workflow testing
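
Each example runs with the same command shown in the quickstart, for instance:

```bash
rawbench run examples/evaluations/tool-mock-example.yaml --html -o tool_mock_result
```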

Requirements:

  • Python ≥ 3.8

License: MIT
