Letting the AIs Judge Themselves: One Creative Prompt, the Coffee-Ground Test

Mini LLM Self-Scoring Benchmark – May 2025

Why I ran this mini-benchmark
I wanted to see whether today’s top LLMs share a sense of “good taste” when you let them score one another—no human panel, just pure model democracy.

The setup

  • Single prompt (shown below)
  • Each model answers anonymously
  • Every model then scores all answers (including its own) from 1–10
  • Highest total wins
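The tallying procedure above can be sketched in a few lines. The model names and scores below are placeholders, not the actual results—this just shows how each judge's row of scores rolls up into per-answer column totals:

```python
# Sketch of the self-scoring tally: rows are judges, columns are answers,
# and a model's final score is the sum of its column (self-scores included).
# All names and numbers here are illustrative placeholders.
models = ["Model A", "Model B", "Model C"]
scores = {
    "Model A": [7, 8, 9],   # Model A's scores for A's, B's, C's answers
    "Model B": [6, 7, 10],  # Model B's scores
    "Model C": [5, 9, 8],   # Model C's scores
}

# Column sums: total points each answer received across all judges.
totals = {m: sum(scores[judge][i] for judge in models)
          for i, m in enumerate(models)}

# Rank by total, highest first.
leaderboard = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)  # [('Model C', 27), ('Model B', 24), ('Model A', 18)]
```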

Models tested (May 2025 endpoints)

  1. OpenAI o3
  2. Gemini 2.0 Flash
  3. DeepSeek Reasoner
  4. Grok 3 (latest)
  5. Claude 3.7 Sonnet

The prompt

In 10 words exactly, propose a groundbreaking global use for spent coffee grounds. Include exactly ONE emoji. No hyphens. End with a period.
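The prompt's format rules are mechanically checkable, which is handy if you rerun this yourself. Here is a rough checker I might use; the emoji detection is an approximation based on two common Unicode emoji blocks, not an exhaustive emoji definition:

```python
import re

# Approximate emoji matcher: Miscellaneous Symbols/Dingbats plus the
# supplementary emoji planes. Not a full Unicode emoji specification.
EMOJI = re.compile(r"[\u2600-\u27BF\U0001F300-\U0001FAFF]")

def check_response(text: str) -> dict:
    """Check one answer against the prompt's four format rules."""
    # Count only tokens containing a letter, so a bare emoji isn't a "word".
    words = [w for w in text.split() if any(ch.isalpha() for ch in w)]
    return {
        "ten_words": len(words) == 10,
        "one_emoji": len(EMOJI.findall(text)) == 1,
        "no_hyphens": "-" not in text,
        "ends_with_period": text.rstrip().endswith("."),
    }

print(check_response("One two three four five six seven eight nine ☕ ten."))
# all four checks pass for this constructed example
```

Word counting is the fuzziest rule here—whether the emoji or trailing punctuation counts as a word is a judgment call, so treat the checker as a sanity pass, not a verdict.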

Responses

  • Grok 3: Turn spent coffee grounds into sustainable biofuel globally. ☕.
  • Claude 3.7 Sonnet: Biofuel revolution: spent coffee grounds power global transportation networks. 🚀.
  • OpenAI o3: Transform spent grounds into supercapacitors energizing equitable resilient infrastructure 🌍.
  • DeepSeek Reasoner: Convert coffee grounds into biofuel and carbon capture material worldwide. ☕️.
  • Gemini 2.0 Flash: Coffee grounds: biodegradable batteries for a circular global energy economy. 🔋

Score matrix

Rows are the judging model; columns are the answer being scored:

                  Grok 3   Claude 3.7   OpenAI o3   DeepSeek   Gemini 2.0
  Grok 3             7          8            9           7          10
  Claude 3.7         8          7            8           9           9
  OpenAI o3          3          9            9           2           2
  DeepSeek           3          4            7           8           9
  Gemini 2.0         3          3           10           9           4

Leaderboard

  1. OpenAI o3 — 43 points
  2. DeepSeek Reasoner — 35 points
  3. Gemini 2.0 Flash — 34 points
  4. Claude 3.7 Sonnet — 31 points
  5. Grok 3 — 26 points

My take

OpenAI o3’s line looked bananas at first. Ten minutes of Googling later, it turns out coffee-ground-derived carbon really is being studied for supercapacitor electrodes. The models collectively picked the most science-plausible answer!

Disclaimer

This was a tiny, just-for-fun experiment. Don’t treat the numbers as a rigorous benchmark—different prompts or scoring rules could easily shuffle the leaderboard.

I’ll post a full write-up (with runnable prompts) on my blog soon. Meanwhile, what do you think—did the model-jury get it right?

About the Author

Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.
