LLMs Are Bad Judges. So Use Our Classifier Instead


68 Pages | Posted: 8 Jul 2025 | Last revised: 8 Jul 2025

Date Written: June 30, 2025

Abstract

Large Language Models suffer from prompt variance, meaning they'll give you totally different legal answers depending on how you phrase your question. Jonathan Choi demonstrated this recently when he asked ChatGPT five legal questions, each rephrased 2,000 times, and watched the bot spit out different answers every time. When you tell somebody that AI is going to replace the judge, the lawyer, and the legal system in the next twenty years, Choi's article has become the go-to rebuttal; it's the crown jewel of the "AI bad" genre.

Choi's absolutely right that LLMs are bad judges. And if every AI were an LLM, your biglaw job would be safe. But there's another type of AI: the classifier, and we built one called Arbitrus. We put it through a mini-Choi test and it mopped the floor with the competition, delivering perfect consistency across all prompt variants with zero hallucinations. We're going to tell you how it works, why it's better than an LLM and, ultimately, why it's better than you. And we'll close with the frank assertion that a judge's highest telos, in any legal system, is amendable consistency. That means the Choi test, simple though it may seem, isn't just a "gotcha" for chatbots; it's the defining test of judicial qualification.

Keywords: Artificial Intelligence, Arbitration, Political Science, Textualism, Originalism, Formalism, Natural Law, Contracts, Corporations, Law, Legal Positivism

