A Mini‑Benchmark of 8 Web‑Enabled Models (May 31 2025)
Comparing accuracy and cost for finding knife manufacturer websites.
Every week a new LLM promises sharper reasoning, broader knowledge, or lower prices. Hype is easy; trustworthy data is harder—especially when your use‑case is as quirky as mine.
I run Knife.Day, a community where collectors catalogue every cutlery brand from Al Mar to Zladotek. To enrich those pages I need a scriptable way to fetch each maker’s official website, including tiny outfits like Actilam or Aiorosu Knives that barely register on Google. Perfect task for a quick LLM showdown:
Goal: Given a brand name, return { brand, official_url, confidence }.
Metric: accuracy per brand vs total cost.
Dataset: ten obscure knife brands.
1 · Experimental Setup
| Setup | Details |
| --- | --- |
| Brands | ABKT · 5ive Star Gear · 5.11 Tactical · Aclim8 · Acta Non Verba Knives · Actilam · AGA Campolin · Aiorosu Knives · AKC · Al Mar Knives |
| Models (via OpenRouter) | Perplexity sonar‑deep‑research · OpenAI gpt‑4o & gpt‑4o‑mini · Anthropic claude‑sonnet‑4 · Google gemini‑2.5‑pro & gemini‑2.0‑flash · Meta llama‑3.1‑70b · Alibaba qwen‑2.5‑72b |
| Prompt (one‑liner) | System: "Return ONLY JSON with keys brand, official_url, confidence." |
| Scoring | Exact official domain = ✅ · “No official site” (with justification) = ✅ |
| Costs | OpenRouter prices on 31 May 2025 (Perplexity billed separately) |
| Code + logs | https://github.com/pvijeh/find-knife-brands |
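The setup above translates into one small request per brand. The sketch below shows how such a request body might be assembled, assuming OpenRouter's OpenAI‑compatible chat‑completions payload; the `build_payload` helper and the exact model slug are illustrative, not taken from the repo:

```python
def build_payload(brand: str, model: str) -> dict:
    """Build an OpenRouter chat-completions request body for one brand lookup."""
    return {
        "model": model,
        "messages": [
            # The one-liner system prompt from the experimental setup.
            {"role": "system",
             "content": "Return ONLY JSON with keys brand, official_url, confidence."},
            {"role": "user",
             "content": f"Find the official website for the knife brand: {brand}"},
        ],
        # Nudge JSON-capable models toward structured output.
        "response_format": {"type": "json_object"},
    }

# Example: POST this to https://openrouter.ai/api/v1/chat/completions
# with an "Authorization: Bearer <OPENROUTER_API_KEY>" header.
payload = build_payload("Actilam", "openai/gpt-4o-mini")
```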
2 · Results
2.1 Leaderboard (sorted by cost per correct URL)
| # | Model | Correct (of 10) | Total cost ($) | $ per correct URL | Tokens |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini 2.0 Flash | 7 | 0.001 | 0.0001 | 4 k |
| 2 | GPT‑4o‑Mini | 9 | 0.19 | 0.02 | 24.7 k |
| 2 | Llama‑3.1‑70B | 9 | 0.19 | 0.02 | 25.1 k |
| 4 | Qwen 2.5‑72B | 8 | 0.19 | 0.02 | 27.9 k |
| 5 | GPT‑4o | 9 | 0.26 | 0.03 | 25.6 k |
| 6 | Claude Sonnet‑4 | 9 | 0.32 | 0.04 | 31.8 k |
| 7 | Gemini 2.5 Pro¹ | 5 | 0.31 | 0.06 | 36.8 k |
| 8 | Perplexity Sonar | 10 | 9.42 | 0.94 | 860 k |
¹ Gemini 2.5 Pro produced HTML tables in five cases; my JSON parser rejected them.
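A cheap guard against responses like those HTML tables is to pull the first balanced JSON object out of the raw text before parsing. A minimal sketch (the `extract_json` helper is mine, not from the benchmark repo, and it ignores braces inside string values):

```python
import json

def extract_json(text: str):
    """Return the first parseable {...} object in a model response, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # malformed span; try the next "{"
        start = text.find("{", start + 1)
    return None

# Survives markdown fences and stray prose around the JSON.
good = extract_json('```json\n{"brand": "Actilam", '
                    '"official_url": "https://actilam.fr", "confidence": 0.9}\n```')
bad = extract_json("<table><tr><td>Actilam</td></tr></table>")  # no JSON present
```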
2.2 Interpretation
- Perplexity is flawless but costs $9.42 for ten queries—mostly due to an 860 k‑token footprint.
- GPT‑4o‑Mini & Llama‑3.1‑70B reach 90 % accuracy at ~$0.02 per hit—best bang‑for‑buck.
- Gemini Flash lands 70 % at one‑tenth of a cent; with a manual QA pass it’s unbeatable on price.
- Structured output matters. Gemini 2.5 Pro’s HTML responses were unusable—well‑formed JSON is part of model quality.
- Edge cases: only Perplexity explicitly declared “no official site” for Aiorosu Knives; GPT‑4o‑Mini offered a reseller link instead (helpful, but scored as incorrect).
3 · Key Take‑Aways
- Define “good enough.” 90 % accuracy + quick human review beats 100 % accuracy at 45× the price.
- Validate on ingestion. Malformed JSON breaks pipelines—enforce a schema check.
- Watch prices. Promo rates shift; query the price API before batching thousands of calls.
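The schema check from the second take‑away can be a dozen lines of stdlib Python. A hedged sketch: the field names match the prompt's contract, but the `valid_record` helper and its exact rules (e.g. treating a null `official_url` as a legitimate "no official site" answer) are illustrative:

```python
from urllib.parse import urlparse

def valid_record(rec) -> bool:
    """Enforce the {brand, official_url, confidence} contract before ingestion."""
    if not isinstance(rec, dict):
        return False
    if not isinstance(rec.get("brand"), str) or not rec["brand"].strip():
        return False
    conf = rec.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        return False
    url = rec.get("official_url")
    if url is None:  # models may legitimately report "no official site"
        return True
    if not isinstance(url, str):
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Rejecting a record is cheaper than debugging a brand page that links to `not-a-url`.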
4 · What’s Next?
I’m wiring GPT‑4o‑Mini into Knife.Day so collectors see verified manufacturer links on every brand page. Re‑crawling ≈ 250 brands now costs under $5.
If you collect knives—or just enjoy oddball benchmarks—join the Knife.Day beta and tell me which dataset you’d automate next.
Disclaimer: model versions, accuracy, and prices are a snapshot from 31 May 2025. Future mileage—and billing—will vary.