Why I ran this mini-benchmark
I wanted to see whether today’s top LLMs share a sense of “good taste” when you let them score each other—no human panel, just pure model democracy.
The setup
- Single prompt (shown below)
- Each model answers anonymously
- Every model then scores all answers (including its own) on a 1–10 scale
- Highest total wins
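The procedure above can be sketched in a few lines of Python. This is only an illustration under assumptions: `run_jury` and the judge callables are hypothetical stand-ins for real API calls, and "anonymous" is modelled by handing each judge the answer text alone, in shuffled order, without the author's name.

```python
import random

def run_jury(answers, judges, seed=0):
    """Tally a model jury.

    answers: dict mapping author name -> answer text
    judges:  dict mapping judge name -> callable(text) returning an int in 1..10
             (hypothetical stand-ins for real model calls)
    Returns (totals, winner).
    """
    rng = random.Random(seed)
    authors = list(answers)
    rng.shuffle(authors)  # present answers in shuffled order, without names

    totals = {author: 0 for author in authors}
    for judge_name, judge in judges.items():
        for author in authors:
            score = judge(answers[author])  # the judge sees only the text
            if not 1 <= score <= 10:
                raise ValueError(f"{judge_name} returned out-of-range score {score}")
            totals[author] += score

    winner = max(totals, key=totals.get)  # highest total wins
    return totals, winner
```

Each judge also scores its own answer here, matching the setup; excluding self-scores would be a one-line change inside the inner loop.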
Models tested (May 2025 endpoints)
- OpenAI o3
- Gemini 2.0 Flash
- DeepSeek Reasoner
- Grok 3 (latest)
- Claude 3.7 Sonnet
The prompt
In 10 words exactly, propose a groundbreaking global use for spent coffee grounds. Include exactly ONE emoji. No hyphens. End with a period.
Responses
- Grok 3: Turn spent coffee grounds into sustainable biofuel globally. ☕.
- Claude 3.7 Sonnet: Biofuel revolution: spent coffee grounds power global transportation networks. 🚀.
- OpenAI o3: Transform spent grounds into supercapacitors energizing equitable resilient infrastructure 🌍.
- DeepSeek Reasoner: Convert coffee grounds into biofuel and carbon capture material worldwide. ☕️.
- Gemini 2.0 Flash: Coffee grounds: biodegradable batteries for a circular global energy economy. 🔋
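The four rules in the prompt can be checked mechanically. The sketch below is one rough way to do it; `check_answer` is a hypothetical helper, and its definitions are assumptions rather than anything the prompt spells out: a "word" is a whitespace-separated token containing at least one letter or digit (so a standalone emoji does not count), and an "emoji" is any character in the Unicode "Symbol, other" category.

```python
import unicodedata

def check_answer(text):
    """Check one answer against the prompt's four rules (heuristic definitions)."""
    stripped = text.strip()

    # Assumption: a word is a whitespace-separated token with at least one
    # letter or digit, so a lone emoji or "emoji + period" token is not a word.
    words = [tok for tok in stripped.split() if any(ch.isalnum() for ch in tok)]

    # Heuristic: count characters in Unicode category "So" (Symbol, other).
    # This also counts non-emoji symbols and miscounts multi-code-point
    # emoji such as flags or keycaps.
    emoji_count = sum(1 for ch in stripped if unicodedata.category(ch) == "So")

    return {
        "ten_words": len(words) == 10,
        "one_emoji": emoji_count == 1,
        "no_hyphens": "-" not in stripped,  # ASCII hyphen-minus only
        "ends_with_period": stripped.endswith("."),
    }

if __name__ == "__main__":
    sample = "Coffee grounds: biodegradable batteries for a circular global energy economy. 🔋"
    for rule, passed in check_answer(sample).items():
        print(f"{rule}: {'ok' if passed else 'FAIL'}")
```

Whether the emoji counts toward the ten words is a judgment call, and changing that one definition changes which answers pass.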
Score matrix
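Each row is one judge's scores. Columns appear to follow the order of the responses above.
Grok 3 | Claude 3.7 Sonnet | OpenAI o3 | DeepSeek Reasoner | Gemini 2.0 Flash |
7 | 8 | 9 | 7 | 10 |
8 | 7 | 8 | 9 | 9 |
3 | 9 | 9 | 2 | 2 |
3 | 4 | 7 | 8 | 9 |
3 | 3 | 10 | 9 | 4 |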
Leaderboard
- OpenAI o3 — 43 points
- DeepSeek Reasoner — 35 points
- Gemini 2.0 Flash — 34 points
- Claude 3.7 Sonnet — 31 points
- Grok 3 — 26 points
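The totals can be re-derived from the score matrix by summing each column. The sketch below assumes the columns follow the order of the responses (Grok 3, Claude 3.7 Sonnet, OpenAI o3, DeepSeek Reasoner, Gemini 2.0 Flash) and that each row is one judge. That ordering is an inference from the numbers, not something the post states. Under that assumption four totals match the leaderboard, while the Grok 3 column as published sums to 24 rather than the 26 listed.

```python
# Assumed column order: the same order as the responses in the post.
MODELS = [
    "Grok 3",
    "Claude 3.7 Sonnet",
    "OpenAI o3",
    "DeepSeek Reasoner",
    "Gemini 2.0 Flash",
]

# Score matrix as published: one row per judge, one column per answer.
SCORES = [
    [7, 8, 9, 7, 10],
    [8, 7, 8, 9, 9],
    [3, 9, 9, 2, 2],
    [3, 4, 7, 8, 9],
    [3, 3, 10, 9, 4],
]

def tally(models, scores):
    """Sum each column and return (model, total) pairs, highest total first."""
    totals = [sum(column) for column in zip(*scores)]
    return sorted(zip(models, totals), key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for model, total in tally(MODELS, SCORES):
        print(f"{model}: {total}")
```

The ranking is the same either way, since 24 and 26 both leave Grok 3 in last place.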
My take
OpenAI o3’s line looked bananas at first. Ten minutes of Googling later, it turns out coffee-ground-derived carbon really is being studied for supercapacitors. The models arguably picked the most scientifically plausible answer!
Disclaimer
This was a tiny, just-for-fun experiment. Don’t treat the numbers as a rigorous benchmark—different prompts or scoring rules could easily shuffle the leaderboard.
I’ll post a full write-up (with runnable prompts) on my blog soon. Meanwhile, what do you think—did the model-jury get it right?
About the Author
Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.