Letting the AIs Judge Themselves: One Creative Prompt, the Coffee-Ground Test

Mini LLM Self-Scoring Benchmark – May 2025

Why I ran this mini-benchmark
I wanted to see whether today’s top LLMs share a sense of “good taste” when you let them score one another—no human panel, just pure model democracy.

The setup

  • Single prompt (shown below)
  • Each model answers anonymously
  • Every model then scores all answers (including its own) from 1–10
  • Highest total wins
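The tallying procedure above can be sketched in a few lines. The model names and scores below are placeholders, not the actual results—this just shows how each judge's row of scores rolls up into per-answer column totals:

```python
# Sketch of the self-scoring tally: rows are judges, columns are answers,
# and a model's final score is the sum of its column (self-scores included).
# All names and numbers here are illustrative placeholders.
models = ["Model A", "Model B", "Model C"]
scores = {
    "Model A": [7, 8, 9],   # Model A's scores for A's, B's, C's answers
    "Model B": [6, 7, 10],  # Model B's scores
    "Model C": [5, 9, 8],   # Model C's scores
}

# Column sums: total points each answer received across all judges.
totals = {m: sum(scores[judge][i] for judge in models)
          for i, m in enumerate(models)}

# Rank by total, highest first.
leaderboard = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)  # [('Model C', 27), ('Model B', 24), ('Model A', 18)]
```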

Models tested (May 2025 endpoints)

  1. OpenAI o3
  2. Gemini 2.0 Flash
  3. DeepSeek Reasoner
  4. Grok 3 (latest)
  5. Claude 3.7 Sonnet

The prompt

In 10 words exactly, propose a groundbreaking global use for spent coffee grounds. Include exactly ONE emoji. No hyphens. End with a period.
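The prompt's format rules are mechanically checkable, which is handy if you rerun this yourself. Here is a rough checker I might use; the emoji detection is an approximation based on two common Unicode emoji blocks, not an exhaustive emoji definition:

```python
import re

# Approximate emoji matcher: Miscellaneous Symbols/Dingbats plus the
# supplementary emoji planes. Not a full Unicode emoji specification.
EMOJI = re.compile(r"[\u2600-\u27BF\U0001F300-\U0001FAFF]")

def check_response(text: str) -> dict:
    """Check one answer against the prompt's four format rules."""
    # Count only tokens containing a letter, so a bare emoji isn't a "word".
    words = [w for w in text.split() if any(ch.isalpha() for ch in w)]
    return {
        "ten_words": len(words) == 10,
        "one_emoji": len(EMOJI.findall(text)) == 1,
        "no_hyphens": "-" not in text,
        "ends_with_period": text.rstrip().endswith("."),
    }

print(check_response("One two three four five six seven eight nine ☕ ten."))
# all four checks pass for this constructed example
```

Word counting is the fuzziest rule here—whether the emoji or trailing punctuation counts as a word is a judgment call, so treat the checker as a sanity pass, not a verdict.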

Responses

  • Grok 3: Turn spent coffee grounds into sustainable biofuel globally. ☕.
  • Claude 3.7 Sonnet: Biofuel revolution: spent coffee grounds power global transportation networks. 🚀.
  • OpenAI o3: Transform spent grounds into supercapacitors energizing equitable resilient infrastructure 🌍.
  • DeepSeek Reasoner: Convert coffee grounds into biofuel and carbon capture material worldwide. ☕️.
  • Gemini 2.0 Flash: Coffee grounds: biodegradable batteries for a circular global energy economy. 🔋

Score matrix

Rows are the judging model; columns are the answer being scored:

                  Grok 3   Claude 3.7   OpenAI o3   DeepSeek   Gemini 2.0
  Grok 3             7          8            9           7          10
  Claude 3.7         8          7            8           9           9
  OpenAI o3          3          9            9           2           2
  DeepSeek           3          4            7           8           9
  Gemini 2.0         3          3           10           9           4

Leaderboard

  1. OpenAI o3 — 43 points
  2. DeepSeek Reasoner — 35 points
  3. Gemini 2.0 Flash — 34 points
  4. Claude 3.7 Sonnet — 31 points
  5. Grok 3 — 26 points

My take

OpenAI o3’s line looked bananas at first. Ten minutes of Googling later, it turns out coffee-ground-derived carbon really is being studied for supercapacitor electrodes. The models collectively picked the most science-plausible answer!

Disclaimer

This was a tiny, just-for-fun experiment. Don’t treat the numbers as a rigorous benchmark—different prompts or scoring rules could easily shuffle the leaderboard.

I’ll post a full write-up (with runnable prompts) on my blog soon. Meanwhile, what do you think—did the model-jury get it right?

About the Author

Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.
