Show HN: LLM-benchmark – Make LLMs fight for the fastest ops/sec on your code


Everywhere-Ready LLM Code Optimizer & Self-Validating Benchmark Suite


Ship "optimized by AI" code with confidence. llm-benchmark automatically generates, validates, and benchmarks LLM-optimized variants of your functions across multiple providers.


  • 🤖 Multi-Provider Support - OpenAI, Anthropic, Azure, Ollama, and more
  • 🌍 Polyglot - JavaScript, TypeScript, Python, Rust, Go, and growing
  • ✅ Self-Validating - Ensures functional equivalence before benchmarking
  • 📊 Rich Benchmarks - Ops/sec, percentiles, memory usage, cost analysis
  • 🎨 Beautiful TUI - Real-time progress, results visualization
  • 🔌 Extensible - Plugin architecture for languages and providers
  • 📦 Zero Lock-in - Export to JSON, CSV, JUnit, HTML

```bash
# Install globally
npm install -g llm-benchmark

# Or use npx
npx llm-benchmark demo

# Optimize a function (must be exported)
llm-benchmark optimizeProcess.js

# With specific providers
llm-benchmark optimizeProcess.js --providers openai:gpt-4o anthropic:claude-3

# Named export
llm-benchmark utils.js myFunction

# CI mode (no interactive UI)
llm-benchmark optimizeProcess.js --ci
```

Note: Your function must be exported (either as default export or named export) for the tool to find it.
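
For example, either export style works (a minimal sketch; the file and function names just mirror the CLI calls above):

```js
// utils.js (illustrative only)
export default function optimizeProcess(records) { /* ... */ } // picked up by: llm-benchmark utils.js
export function myFunction(records) { /* ... */ }              // named export: llm-benchmark utils.js myFunction
```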

Requirements:

  • Node.js ≥ 18
  • API keys for your chosen providers (OpenAI, Anthropic, etc.)

Create llm-benchmark.yaml in your project:

```yaml
providers:
  - openai:gpt-4o
  - anthropic:claude-3-sonnet

validation:
  mode: record-replay # or 'static' or 'property-based'
  cases: ./test-cases.json

bench:
  runs: 5000
  warmup: 20

langPlugins:
  - js
  - py
  - rust
```
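
If you prefer JSON (the --config option below defaults to llm-benchmark.json), the same settings would look roughly like this, assuming the JSON config mirrors the YAML schema:

```json
{
  "providers": ["openai:gpt-4o", "anthropic:claude-3-sonnet"],
  "validation": {
    "mode": "record-replay",
    "cases": "./test-cases.json"
  },
  "bench": {
    "runs": 5000,
    "warmup": 20
  },
  "langPlugins": ["js", "py", "rust"]
}
```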

Set up your .env:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

Given this function:

```js
// optimizeProcess.js
export default function optimizeProcess(records) {
  const valid = records.filter((r) => r.status === 'active' && r.value > 0);

  const transformed = valid.map((r) => ({
    ...r,
    value: r.value * 1.1,
    category: r.category.toUpperCase(),
  }));

  return Object.values(
    transformed.reduce((acc, r) => {
      if (!acc[r.category]) {
        acc[r.category] = { count: 0, total: 0 };
      }
      acc[r.category].count++;
      acc[r.category].total += r.value;
      return acc;
    }, {}),
  );
}
```
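
To make the behavior concrete, here is roughly what it computes (illustrative input, not from the repo):

```js
import optimizeProcess from './optimizeProcess.js';

const records = [
  { status: 'active', value: 100, category: 'electronics' },
  { status: 'inactive', value: 50, category: 'electronics' }, // filtered out
  { status: 'active', value: 200, category: 'books' },
];

console.log(optimizeProcess(records));
// => roughly [ { count: 1, total: 110 }, { count: 1, total: 220 } ]
//    (per-category counts and 1.1x totals; expect floating-point noise from value * 1.1)
```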

```bash
# Step 1: Navigate to the example directory
cd examples/js

# Step 2: Install dependencies (if needed)
npm install

# Step 3: Run the benchmark
llm-benchmark optimizeProcess.js

# Or run from the monorepo root
cd ../..
node packages/core/bin/llm-benchmark.js examples/js/optimizeProcess.js
```

```
🚀 LLM Benchmark

📝 Generating optimized variants...
  ✓ openai:gpt-4o completed
  ✓ anthropic:claude-3-sonnet completed

✅ Validating variants...
  ✓ All variants passed 100 test cases

📊 Running benchmarks...

🏆 Benchmark Results
──────────────────────────────────────────────────────────────
Variant                 Ops/sec     Improvement   P95 (ms)   σ
──────────────────────────────────────────────────────────────
🔥 openai.gpt_4o        125,420     +34.2%        0.045      ±2.1%
   anthropic.claude_3   118,230     +26.5%        0.048      ±1.8%
   original              93,420     baseline      0.062      ±2.3%
──────────────────────────────────────────────────────────────

✅ All variants passed validation (1,000 test cases)
💰 Total cost: $0.0234
📄 Results saved to: ./results.json
```

The tool will generate optimized variants like:

```js
// optimizeProcess.openai.gpt-4o.js
export default function optimizeProcess(records) {
  const grouped = {};

  // Single pass through records
  for (let i = 0; i < records.length; i++) {
    const record = records[i];

    if (record && record.status === 'active' && record.value > 0) {
      const category = record.category.toUpperCase();
      const transformedValue = record.value * 1.1;

      if (!grouped[category]) {
        grouped[category] = { total: 0, count: 0 };
      }

      grouped[category].total += transformedValue;
      grouped[category].count++;
    }
  }

  return Object.values(grouped);
}
```
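
This variant fuses the original filter, map, and reduce steps into a single loop, so no intermediate arrays are allocated per call; that is typically where the measured speedup comes from.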

```
llm-benchmark
├── packages/
│   ├── core/       # CLI and orchestration
│   ├── adapters/   # Provider adapters (OpenAI, Anthropic, etc.)
│   └── plugins/    # Language plugins (JS, Python, Rust, etc.)
├── examples/       # Example projects
└── docs/           # Documentation
```

Supported languages:

  • ✅ JavaScript/TypeScript
  • ✅ Python
  • ✅ Rust
  • 🚧 Go
  • 🚧 Java
  • 🚧 C/C++

Supported providers:

  • ✅ OpenAI (GPT-4, GPT-3.5)
  • ✅ Anthropic (Claude 3)
  • 🚧 Azure OpenAI
  • 🚧 Google Vertex AI
  • 🚧 Ollama (local models)
  • 🚧 Cohere

Commands:

```bash
# Generate variants only
llm-benchmark generate <file> [function]

# Validate existing variants
llm-benchmark validate <file> [function]

# Benchmark validated variants
llm-benchmark bench <file> [function]

# Preview prompts
llm-benchmark prompt <file> [function]
```

Options:

  • --config <path> - Config file path (default: llm-benchmark.json)
  • --providers <providers...> - Override configured providers
  • --runs <number> - Override benchmark iterations
  • --ci - CI mode (no interactive UI)
  • --no-color - Disable colored output
  • --debug - Enable debug logging
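
These flags can be combined; for instance (a hypothetical invocation using the options listed above):

```bash
llm-benchmark optimizeProcess.js \
  --providers openai:gpt-4o anthropic:claude-3-sonnet \
  --runs 10000 \
  --ci --no-color
```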

Provide test cases in JSON/YAML:

{ "cases": [ { "input": [{ "status": "active", "value": 100, "category": "electronics" }], "output": { "ELECTRONICS": { "count": 1, "total": 110 } } } ] }

Automatically capture real execution:

```yaml
validation:
  mode: record-replay
  recordingEnabled: true
```

Generate test inputs with invariants:

```yaml
validation:
  mode: property-based
  propertyTests:
    invariants:
      - 'output.total >= 0'
      - 'output.count === input.length'
```
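
The invariants are JavaScript-style expressions over input and output. Conceptually, checking them looks like the sketch below (illustrative only, not the tool's actual implementation):

```js
// Illustrative sketch: evaluate string invariants against (input, output).
// The real llm-benchmark implementation may differ.
function checkInvariants(invariants, input, output) {
  return invariants.every((expr) => {
    // Compile the expression into a predicate over (input, output).
    const predicate = new Function('input', 'output', `return (${expr});`);
    return predicate(input, output) === true;
  });
}

// Example:
checkInvariants(['output.total >= 0'], [], { total: 330, count: 2 }); // => true
```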

Export formats:

  • JSON - Detailed results with metadata
  • CSV - Spreadsheet-friendly format
  • JUnit XML - CI integration
  • HTML - Interactive report

We welcome contributions! See CONTRIBUTING.md for guidelines.

```bash
# Clone the repo
git clone https://github.com/ajaxdavis/llm-benchmark.git
cd llm-benchmark

# Install dependencies
pnpm install

# Run tests
pnpm test

# Build all packages
pnpm build
```

MIT © Ajax Davis

Made with ❤️ by developers, for developers
