Show HN: LLM-benchmark – Make LLMs fight for the fastest ops/sec on your code


Everywhere-Ready LLM Code Optimizer & Self-Validating Benchmark Suite


Ship "optimized by AI" code with confidence. llm-benchmark automatically generates, validates, and benchmarks LLM-optimized variants of your functions across multiple providers.


  • 🤖 Multi-Provider Support - OpenAI, Anthropic, Azure, Ollama, and more
  • 🌍 Polyglot - JavaScript, TypeScript, Python, Rust, Go, and growing
  • ✅ Self-Validating - Ensures functional equivalence before benchmarking
  • 📊 Rich Benchmarks - Ops/sec, percentiles, memory usage, cost analysis
  • 🎨 Beautiful TUI - Real-time progress, results visualization
  • 🔌 Extensible - Plugin architecture for languages and providers
  • 📦 Zero Lock-in - Export to JSON, CSV, JUnit, HTML

```bash
# Install globally
npm install -g llm-benchmark

# Or use npx
npx llm-benchmark demo

# Optimize a function (must be exported)
llm-benchmark optimizeProcess.js

# With specific providers
llm-benchmark optimizeProcess.js --providers openai:gpt-4o anthropic:claude-3

# Named export
llm-benchmark utils.js myFunction

# CI mode (no interactive UI)
llm-benchmark optimizeProcess.js --ci
```

Note: Your function must be exported (either as default export or named export) for the tool to find it.
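
For example, either export style works (a minimal sketch; the file and function names just mirror the CLI calls above):

```js
// utils.js (illustrative only)
export default function optimizeProcess(records) { /* ... */ } // picked up by: llm-benchmark utils.js
export function myFunction(records) { /* ... */ }              // named export: llm-benchmark utils.js myFunction
```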

Requirements:

  • Node.js ≥ 18
  • API keys for your chosen providers (OpenAI, Anthropic, etc.)

Create llm-benchmark.yaml in your project:

```yaml
providers:
  - openai:gpt-4o
  - anthropic:claude-3-sonnet

validation:
  mode: record-replay # or 'static' or 'property-based'
  cases: ./test-cases.json

bench:
  runs: 5000
  warmup: 20

langPlugins:
  - js
  - py
  - rust
```
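
If you prefer JSON (the --config option below defaults to llm-benchmark.json), the same settings would look roughly like this, assuming the JSON config mirrors the YAML schema:

```json
{
  "providers": ["openai:gpt-4o", "anthropic:claude-3-sonnet"],
  "validation": {
    "mode": "record-replay",
    "cases": "./test-cases.json"
  },
  "bench": {
    "runs": 5000,
    "warmup": 20
  },
  "langPlugins": ["js", "py", "rust"]
}
```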

Set up your .env:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

Given this function:

```js
// optimizeProcess.js
export default function optimizeProcess(records) {
  const valid = records.filter((r) => r.status === 'active' && r.value > 0);

  const transformed = valid.map((r) => ({
    ...r,
    value: r.value * 1.1,
    category: r.category.toUpperCase(),
  }));

  return Object.values(
    transformed.reduce((acc, r) => {
      if (!acc[r.category]) {
        acc[r.category] = { count: 0, total: 0 };
      }
      acc[r.category].count++;
      acc[r.category].total += r.value;
      return acc;
    }, {}),
  );
}
```
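
To make the behavior concrete, here is roughly what it computes (illustrative input, not from the repo):

```js
import optimizeProcess from './optimizeProcess.js';

const records = [
  { status: 'active', value: 100, category: 'electronics' },
  { status: 'inactive', value: 50, category: 'electronics' }, // filtered out
  { status: 'active', value: 200, category: 'books' },
];

console.log(optimizeProcess(records));
// => roughly [ { count: 1, total: 110 }, { count: 1, total: 220 } ]
//    (per-category counts and 1.1x totals; expect floating-point noise from value * 1.1)
```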

```bash
# Step 1: Navigate to the example directory
cd examples/js

# Step 2: Install dependencies (if needed)
npm install

# Step 3: Run the benchmark
llm-benchmark optimizeProcess.js

# Or run from the monorepo root
cd ../..
node packages/core/bin/llm-benchmark.js examples/js/optimizeProcess.js
```

```
🚀 LLM Benchmark

📝 Generating optimized variants...
  ✓ openai:gpt-4o completed
  ✓ anthropic:claude-3-sonnet completed

✅ Validating variants...
  ✓ All variants passed 100 test cases

📊 Running benchmarks...

🏆 Benchmark Results
──────────────────────────────────────────────────────────────
Variant                 Ops/sec     Improvement   P95 (ms)   σ
──────────────────────────────────────────────────────────────
🔥 openai.gpt_4o        125,420     +34.2%        0.045      ±2.1%
   anthropic.claude_3   118,230     +26.5%        0.048      ±1.8%
   original              93,420     baseline      0.062      ±2.3%
──────────────────────────────────────────────────────────────

✅ All variants passed validation (1,000 test cases)
💰 Total cost: $0.0234
📄 Results saved to: ./results.json
```

The tool will generate optimized variants like:

```js
// optimizeProcess.openai.gpt-4o.js
export default function optimizeProcess(records) {
  const grouped = {};

  // Single pass through records
  for (let i = 0; i < records.length; i++) {
    const record = records[i];

    if (record && record.status === 'active' && record.value > 0) {
      const category = record.category.toUpperCase();
      const transformedValue = record.value * 1.1;

      if (!grouped[category]) {
        grouped[category] = { total: 0, count: 0 };
      }

      grouped[category].total += transformedValue;
      grouped[category].count++;
    }
  }

  return Object.values(grouped);
}
```
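
This variant fuses the original filter, map, and reduce steps into a single loop, so no intermediate arrays are allocated per call; that is typically where the measured speedup comes from.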

```
llm-benchmark
├── packages/
│   ├── core/       # CLI and orchestration
│   ├── adapters/   # Provider adapters (OpenAI, Anthropic, etc.)
│   └── plugins/    # Language plugins (JS, Python, Rust, etc.)
├── examples/       # Example projects
└── docs/           # Documentation
```

Supported languages:

  • ✅ JavaScript/TypeScript
  • ✅ Python
  • ✅ Rust
  • 🚧 Go
  • 🚧 Java
  • 🚧 C/C++

Supported providers:

  • ✅ OpenAI (GPT-4, GPT-3.5)
  • ✅ Anthropic (Claude 3)
  • 🚧 Azure OpenAI
  • 🚧 Google Vertex AI
  • 🚧 Ollama (local models)
  • 🚧 Cohere

Commands:

```bash
# Generate variants only
llm-benchmark generate <file> [function]

# Validate existing variants
llm-benchmark validate <file> [function]

# Benchmark validated variants
llm-benchmark bench <file> [function]

# Preview prompts
llm-benchmark prompt <file> [function]
```

Options:

  • --config <path> - Config file path (default: llm-benchmark.json)
  • --providers <providers...> - Override configured providers
  • --runs <number> - Override benchmark iterations
  • --ci - CI mode (no interactive UI)
  • --no-color - Disable colored output
  • --debug - Enable debug logging
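
These flags can be combined; for instance (a hypothetical invocation using the options listed above):

```bash
llm-benchmark optimizeProcess.js \
  --providers openai:gpt-4o anthropic:claude-3-sonnet \
  --runs 10000 \
  --ci --no-color
```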

Provide test cases in JSON/YAML:

{ "cases": [ { "input": [{ "status": "active", "value": 100, "category": "electronics" }], "output": { "ELECTRONICS": { "count": 1, "total": 110 } } } ] }

Automatically capture real execution:

```yaml
validation:
  mode: record-replay
  recordingEnabled: true
```

Generate test inputs with invariants:

```yaml
validation:
  mode: property-based
  propertyTests:
    invariants:
      - 'output.total >= 0'
      - 'output.count === input.length'
```
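
The invariants are JavaScript-style expressions over input and output. Conceptually, checking them looks like the sketch below (illustrative only, not the tool's actual implementation):

```js
// Illustrative sketch: evaluate string invariants against (input, output).
// The real llm-benchmark implementation may differ.
function checkInvariants(invariants, input, output) {
  return invariants.every((expr) => {
    // Compile the expression into a predicate over (input, output).
    const predicate = new Function('input', 'output', `return (${expr});`);
    return predicate(input, output) === true;
  });
}

// Example:
checkInvariants(['output.total >= 0'], [], { total: 330, count: 2 }); // => true
```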

Export formats:

  • JSON - Detailed results with metadata
  • CSV - Spreadsheet-friendly format
  • JUnit XML - CI integration
  • HTML - Interactive report

We welcome contributions! See CONTRIBUTING.md for guidelines.

```bash
# Clone the repo
git clone https://github.com/ajaxdavis/llm-benchmark.git
cd llm-benchmark

# Install dependencies
pnpm install

# Run tests
pnpm test

# Build all packages
pnpm build
```

MIT © Ajax Davis

Made with ❤️ by developers, for developers
