Everywhere-Ready LLM Code Optimizer & Self-Validating Benchmark Suite
Ship "optimized by AI" code with confidence. llm-benchmark automatically generates, validates, and benchmarks LLM-optimized variants of your functions across multiple providers.
- 🤖 Multi-Provider Support - OpenAI, Anthropic, Azure, Ollama, and more
- 🌍 Polyglot - JavaScript, TypeScript, Python, Rust, Go, and growing
- ✅ Self-Validating - Ensures functional equivalence before benchmarking
- 📊 Rich Benchmarks - Ops/sec, percentiles, memory usage, cost analysis
- 🎨 Beautiful TUI - Real-time progress, results visualization
- 🔌 Extensible - Plugin architecture for languages and providers
- 📦 Zero Lock-in - Export to JSON, CSV, JUnit, HTML
# Install globally
npm install -g llm-benchmark
# Or use npx
npx llm-benchmark demo
# Optimize a function (must be exported)
llm-benchmark optimizeProcess.js
# With specific providers
llm-benchmark optimizeProcess.js --providers openai:gpt-4o anthropic:claude-3
# Named export
llm-benchmark utils.js myFunction
# CI mode (no interactive UI)
llm-benchmark optimizeProcess.js --ci
Note: Your function must be exported (as a default or a named export) so the tool can find it.
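For example, either export style is picked up (named exports are selected by passing the function name on the command line, as shown above):

// default export
export default function optimizeProcess(records) { /* ... */ }

// named export, e.g. `llm-benchmark utils.js myFunction`
export function myFunction(records) { /* ... */ }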
- Node.js ≥ 18
- API keys for your chosen providers (OpenAI, Anthropic, etc.)
Create llm-benchmark.yaml in your project:
providers:
  - openai:gpt-4o
  - anthropic:claude-3-sonnet

validation:
  mode: record-replay # or 'static' or 'property-based'
  cases: ./test-cases.json

bench:
  runs: 5000
  warmup: 20

langPlugins:
  - js
  - py
  - rust
Set up your .env:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
Given this function:
// optimizeProcess.js
export default function optimizeProcess(records) {
  const valid = records.filter((r) => r.status === 'active' && r.value > 0);
  const transformed = valid.map((r) => ({
    ...r,
    value: r.value * 1.1,
    category: r.category.toUpperCase(),
  }));
  return Object.values(
    transformed.reduce((acc, r) => {
      if (!acc[r.category]) {
        acc[r.category] = { count: 0, total: 0 };
      }
      acc[r.category].count++;
      acc[r.category].total += r.value;
      return acc;
    }, {}),
  );
}
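To make the expected behavior concrete, here is what the function returns for a small input (the totals reflect ordinary JavaScript floating-point multiplication by 1.1):

const records = [
  { status: 'active',   value: 100, category: 'electronics' },
  { status: 'inactive', value: 50,  category: 'electronics' },
  { status: 'active',   value: 200, category: 'books' },
];

console.log(optimizeProcess(records));
// → [ { count: 1, total: 110.00000000000001 },    // ELECTRONICS: 100 * 1.1
//     { count: 1, total: 220.00000000000003 } ]   // BOOKS: 200 * 1.1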
# Step 1: Navigate to the example directory
cd examples/js
# Step 2: Install dependencies (if needed)
npm install
# Step 3: Run the benchmark
llm-benchmark optimizeProcess.js
# Or run from the monorepo root
cd ../..
node packages/core/bin/llm-benchmark.js examples/js/optimizeProcess.js
🚀 LLM Benchmark
📝 Generating optimized variants...
✓ openai:gpt-4o completed
✓ anthropic:claude-3-sonnet completed
✅ Validating variants...
✓ All variants passed 100 test cases
📊 Running benchmarks...
🏆 Benchmark Results
──────────────────────────────────────────────────────────────────────
 Variant                   Ops/sec     Improvement    P95 (ms)    σ
──────────────────────────────────────────────────────────────────────
 🔥 openai.gpt_4o          125,420     +34.2%         0.045       ±2.1%
    anthropic.claude_3     118,230     +26.5%         0.048       ±1.8%
    original                93,420     baseline       0.062       ±2.3%
──────────────────────────────────────────────────────────────────────
✅ All variants passed validation (100 test cases)
💰 Total cost: $0.0234
📄 Results saved to: ./results.json
The tool will generate optimized variants like:
// optimizeProcess.openai.gpt-4o.js
export default function optimizeProcess(records) {
  const grouped = {};
  // Single pass through records
  for (let i = 0; i < records.length; i++) {
    const record = records[i];
    if (record && record.status === 'active' && record.value > 0) {
      const category = record.category.toUpperCase();
      const transformedValue = record.value * 1.1;
      if (!grouped[category]) {
        grouped[category] = { total: 0, count: 0 };
      }
      grouped[category].total += transformedValue;
      grouped[category].count++;
    }
  }
  return Object.values(grouped);
}
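Before anything is benchmarked, every variant must produce the same output as the original on every test case. Conceptually, the check boils down to something like this (a simplified sketch, not the tool's actual validator):

import assert from 'node:assert/strict';
import original from './optimizeProcess.js';
import variant from './optimizeProcess.openai.gpt-4o.js';

const testCases = [
  [{ status: 'active', value: 100, category: 'electronics' }],
  [], // edge case: empty input
];

for (const records of testCases) {
  // Functional equivalence: identical results for every test case
  assert.deepStrictEqual(variant(records), original(records));
}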
llm-benchmark
├── packages/
│   ├── core/       # CLI and orchestration
│   ├── adapters/   # Provider adapters (OpenAI, Anthropic, etc.)
│   └── plugins/    # Language plugins (JS, Python, Rust, etc.)
├── examples/       # Example projects
└── docs/           # Documentation
- ✅ JavaScript/TypeScript
- ✅ Python
- ✅ Rust
- 🚧 Go
- 🚧 Java
- 🚧 C/C++
- ✅ OpenAI (GPT-4, GPT-3.5)
- ✅ Anthropic (Claude 3)
- 🚧 Azure OpenAI
- 🚧 Google Vertex AI
- 🚧 Ollama (local models)
- 🚧 Cohere
# Generate variants only
llm-benchmark generate <file> [function]
# Validate existing variants
llm-benchmark validate <file> [function]
# Benchmark validated variants
llm-benchmark bench <file> [function]
# Preview prompts
llm-benchmark prompt <file> [function]
- --config <path> - Config file path (default: llm-benchmark.json)
- --providers <providers...> - Override configured providers
- --runs <number> - Override benchmark iterations
- --ci - CI mode (no interactive UI)
- --no-color - Disable colored output
- --debug - Enable debug logging
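For example, a non-interactive CI run that overrides the configured providers and iteration count:

# CI run with overridden providers and more benchmark iterations
llm-benchmark optimizeProcess.js --providers openai:gpt-4o --runs 10000 --ci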
Provide test cases in JSON/YAML:
{
  "cases": [
    {
      "input": [{ "status": "active", "value": 100, "category": "electronics" }],
      "output": [{ "count": 1, "total": 110 }]
    }
  ]
}
Automatically capture real execution:
validation:
  mode: record-replay
  recordingEnabled: true
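The idea behind record-replay: run the original function on real inputs, capture each (input, output) pair, and later replay those pairs against every generated variant. A rough sketch of the concept (not the tool's actual implementation):

import original from './optimizeProcess.js';

const recorded = [];

// Wrap the original so real calls are captured as test cases
export function recordingWrapper(records) {
  const output = original(records);
  recorded.push({ input: records, output });
  return output;
}

// Replay every recorded case against a candidate variant
export function replay(variant) {
  return recorded.every(
    ({ input, output }) => JSON.stringify(variant(input)) === JSON.stringify(output),
  );
}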
Generate test inputs with invariants:
validation:
  mode: property-based
  propertyTests:
    invariants:
      - 'output.total >= 0'
      - 'output.count === input.length'
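Conceptually, property-based validation generates many random inputs and asserts each declared invariant over the resulting (input, output) pairs. The sketch below illustrates the idea, assuming invariants are JavaScript expressions with input and output in scope (it is not the tool's actual implementation):

// Illustrative only: evaluate invariant expressions against generated inputs
function checkInvariants(fn, invariants, generateInput, runs = 100) {
  const checks = invariants.map((expr) => new Function('input', 'output', `return (${expr});`));
  for (let i = 0; i < runs; i++) {
    const input = generateInput();
    const output = fn(input);
    for (const [j, check] of checks.entries()) {
      if (!check(input, output)) {
        throw new Error(`Invariant violated: ${invariants[j]} (run ${i})`);
      }
    }
  }
}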
- JSON - Detailed results with metadata
- CSV - Spreadsheet-friendly format
- JUnit XML - CI integration
- HTML - Interactive report
We welcome contributions! See CONTRIBUTING.md for guidelines.
# Clone the repo
git clone https://github.com/ajaxdavis/llm-benchmark.git
cd llm-benchmark
# Install dependencies
pnpm install
# Run tests
pnpm test
# Build all packages
pnpm build
MIT © Ajax Davis
Built with:
- Commander.js - CLI framework
- Ink - React for CLIs
- Benchmark.js - Benchmarking library
Made with ❤️ by developers, for developers