A Mini‑Benchmark of 8 Web‑Enabled Models (May 31 2025)
Comparing accuracy and cost for finding knife manufacturer websites.
Every week a new LLM promises sharper reasoning, broader knowledge, or lower prices. Hype is easy; trustworthy data is harder—especially when your use‑case is as quirky as mine.
I run Knife.Day, a community where collectors catalogue every cutlery brand from Al Mar to Zladotek. To enrich those pages I need a scriptable way to fetch each maker’s official website, including tiny outfits like Actilam or Aiorosu Knives that barely register on Google. Perfect task for a quick LLM showdown:
Goal: Given a brand name, return { brand, official_url, confidence }.
Metric: accuracy per brand vs total cost.
Dataset: ten obscure knife brands.
1 · Experimental Setup
| Setup | Details |
| --- | --- |
| Brands | ABKT · 5ive Star Gear · 5.11 Tactical · Aclim8 · Acta Non Verba Knives · Actilam · AGA Campolin · Aiorosu Knives · AKC · Al Mar Knives |
| Models (via OpenRouter) | Perplexity sonar‑deep‑research · OpenAI gpt‑4o & gpt‑4o‑mini · Anthropic claude‑sonnet‑4 · Google gemini‑2.5‑pro & gemini‑2.0‑flash · Meta llama‑3.1‑70b · Alibaba qwen‑2.5‑72b |
| Prompt (one‑liner) | System: "Return ONLY JSON with keys brand, official_url, confidence." |
| Scoring | Exact official domain = ✅ · “No official site” (with justification) = ✅ |
| Costs | OpenRouter prices on 31 May 2025 (Perplexity billed separately) |
| Code + logs | https://github.com/pvijeh/find-knife-brands |
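The setup above translates into one small request per brand. The sketch below shows how such a request body might be assembled, assuming OpenRouter's OpenAI‑compatible chat‑completions payload; the `build_payload` helper and the exact model slug are illustrative, not taken from the repo:

```python
def build_payload(brand: str, model: str) -> dict:
    """Build an OpenRouter chat-completions request body for one brand lookup."""
    return {
        "model": model,
        "messages": [
            # The one-liner system prompt from the experimental setup.
            {"role": "system",
             "content": "Return ONLY JSON with keys brand, official_url, confidence."},
            {"role": "user",
             "content": f"Find the official website for the knife brand: {brand}"},
        ],
        # Nudge JSON-capable models toward structured output.
        "response_format": {"type": "json_object"},
    }

# Example: POST this to https://openrouter.ai/api/v1/chat/completions
# with an "Authorization: Bearer <OPENROUTER_API_KEY>" header.
payload = build_payload("Actilam", "openai/gpt-4o-mini")
```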
2 · Results
2.1 Leaderboard (sorted by cost per correct URL)
| # | Model | Correct (of 10) | Total cost ($) | $ per correct URL | Tokens |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini 2.0 Flash | 7 | 0.001 | 0.0001 | 4 k |
| 2 | GPT‑4o‑Mini | 9 | 0.19 | 0.02 | 24.7 k |
| 2 | Llama‑3.1‑70B | 9 | 0.19 | 0.02 | 25.1 k |
| 4 | Qwen 2.5‑72B | 8 | 0.19 | 0.02 | 27.9 k |
| 5 | GPT‑4o | 9 | 0.26 | 0.03 | 25.6 k |
| 6 | Claude Sonnet‑4 | 9 | 0.32 | 0.04 | 31.8 k |
| 7 | Gemini 2.5 Pro¹ | 5 | 0.31 | 0.06 | 36.8 k |
| 8 | Perplexity Sonar | 10 | 9.42 | 0.94 | 860 k |
¹ Gemini 2.5 Pro produced HTML tables in five cases; my JSON parser rejected them.
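A cheap guard against responses like those HTML tables is to pull the first balanced JSON object out of the raw text before parsing. A minimal sketch (the `extract_json` helper is mine, not from the benchmark repo, and it ignores braces inside string values):

```python
import json

def extract_json(text: str):
    """Return the first parseable {...} object in a model response, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # malformed span; try the next "{"
        start = text.find("{", start + 1)
    return None

# Survives markdown fences and stray prose around the JSON.
good = extract_json('```json\n{"brand": "Actilam", '
                    '"official_url": "https://actilam.fr", "confidence": 0.9}\n```')
bad = extract_json("<table><tr><td>Actilam</td></tr></table>")  # no JSON present
```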
2.2 Interpretation
- Perplexity is flawless but costs $9.42 for ten queries—mostly due to an 860 k‑token footprint.
- GPT‑4o‑Mini & Llama‑3.1‑70B reach 90 % accuracy at ~$0.02 per hit—best bang‑for‑buck.
- Gemini Flash lands 70 % at one‑tenth of a cent; with a manual QA pass it’s unbeatable on price.
- Structured output matters. Gemini 2.5 Pro’s HTML responses were unusable—well‑formed JSON is part of model quality.
- Edge cases: only Perplexity explicitly declared “no official site” for Aiorosu Knives; GPT‑4o‑Mini offered a reseller link instead (helpful, but scored as incorrect).
3 · Key Take‑Aways
- Define “good enough.” 90 % accuracy + quick human review beats 100 % accuracy at 45× the price.
- Validate on ingestion. Malformed JSON breaks pipelines—enforce a schema check.
- Watch prices. Promo rates shift; query the price API before batching thousands of calls.
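The schema check from the second take‑away can be a dozen lines of stdlib Python. A hedged sketch: the field names match the prompt's contract, but the `valid_record` helper and its exact rules (e.g. treating a null `official_url` as a legitimate "no official site" answer) are illustrative:

```python
from urllib.parse import urlparse

def valid_record(rec) -> bool:
    """Enforce the {brand, official_url, confidence} contract before ingestion."""
    if not isinstance(rec, dict):
        return False
    if not isinstance(rec.get("brand"), str) or not rec["brand"].strip():
        return False
    conf = rec.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        return False
    url = rec.get("official_url")
    if url is None:  # models may legitimately report "no official site"
        return True
    if not isinstance(url, str):
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Rejecting a record is cheaper than debugging a brand page that links to `not-a-url`.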
4 · What’s Next?
I’m wiring GPT‑4o‑Mini into Knife.Day so collectors see verified manufacturer links on every brand page. Re‑crawling ≈ 250 brands now costs under $5.
If you collect knives—or just enjoy oddball benchmarks—join the Knife.Day beta and tell me which dataset you’d automate next.
Disclaimer: model versions, accuracy, and prices are a snapshot from 31 May 2025. Future mileage—and billing—will vary.