AI Juries

Posted: 2025-10-28

This is part of a series I’m writing on generative AI.

Condorcet’s Jury Theorem

Imagine a panel of N experts trying to decide whether a binary (true/false) fact is true. Each expert has an independent probability p of being right. Since they are experts, assume p > 0.5: they’re at least better than a coin flip.

Let’s call j the probability that the majority of the panel is right. This “jury accuracy” increases dramatically as the panel size N grows.
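
For odd N, j is just the upper tail of a binomial distribution: the probability that more than half of the N jurors are right. Here’s a minimal Python sketch (standard library only) that computes it and reproduces the table below:

```python
from math import comb

def jury_accuracy(n: int, p: float) -> float:
    """Probability that a majority of n independent jurors, each
    correct with probability p, reaches the right verdict.
    Assumes n is odd, so ties are impossible."""
    majority = n // 2 + 1  # smallest winning majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(majority, n + 1))

print(round(jury_accuracy(25, 0.75), 3))  # 0.997
```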

The table below shows how powerful this effect is. It shows the jury accuracy j for different panel sizes N (rows) and individual expert accuracies p (columns):

N \ p    0.51    0.55    0.60    0.70    0.75    0.80    0.90    0.95
1        0.510   0.550   0.600   0.700   0.750   0.800   0.900   0.950
3        0.515   0.575   0.648   0.784   0.844   0.896   0.972   0.993
5        0.519   0.593   0.683   0.837   0.896   0.942   0.991   0.999
11       0.527   0.633   0.753   0.922   0.966   0.988   1.000   1.000
15       0.531   0.654   0.787   0.950   0.983   0.996   1.000   1.000
25       0.540   0.694   0.846   0.983   0.997   1.000   1.000   1.000
35       0.547   0.725   0.886   0.994   0.999   1.000   1.000   1.000
45       0.554   0.751   0.914   0.998   1.000   1.000   1.000   1.000
55       0.559   0.772   0.934   0.999   1.000   1.000   1.000   1.000
75       0.569   0.808   0.960   1.000   1.000   1.000   1.000   1.000
101      0.580   0.844   0.979   1.000   1.000   1.000   1.000   1.000
201      0.612   0.923   0.998   1.000   1.000   1.000   1.000   1.000
501      0.673   0.988   1.000   1.000   1.000   1.000   1.000   1.000
1001     0.737   0.999   1.000   1.000   1.000   1.000   1.000   1.000
5001     0.921   1.000   1.000   1.000   1.000   1.000   1.000   1.000

If you have a panel of N = 25 experts, each only p = 0.75 likely to be right, the probability j that the majority is right is already 99.663%!

This “wisdom of the crowds” effect works, at least in this idealized scenario where all experts are independent.

There’s an interesting flip side: if the “experts” are more likely to be wrong (p < 0.5), the panel’s performance plummets. The majority vote just amplifies the error, converging towards 0% accuracy (i.e., being reliably wrong). You can see this by swapping the definitions of “right” and “wrong” in the description above.

This model isn’t new; it dates back to the eve of the French Revolution. It was first proposed in 1785 by the Marquis de Condorcet and is known as his Jury Theorem.

Scaling AI juries

In Dumb AI and the software revolution I mentioned a key advantage of generative AI: in addition to working at light speed (answering questions in seconds, not days), it can be scaled instantly:

The infinite monkey is no longer hitting keys at random, just somewhat stupidly. But at this speed and with instant scaling, the difference is monumental.

If you can get an AI juror to answer a binary question with an accuracy p > 0.5, you can achieve arbitrarily high performance simply by scaling resources: running bigger juries.

Even if your AI is barely better than a coin flip (say, p = 0.51), you can still get a jury with j > 0.99 accuracy, though it can be quite expensive: with p = 0.51, you’d need a jury of N = 5001 just to hit j = 0.92.

But, as the table above shows, if you improve your agent’s accuracy p to, say, 0.70, with N = 25 you’re already above 98% accuracy.
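
To put numbers on that trade-off, here’s a small helper built on the jury_accuracy sketch above; it searches for the smallest odd jury that reaches a target accuracy (the exact math.comb arithmetic is only practical for moderate jury sizes):

```python
def smallest_jury(p: float, target: float, max_n: int = 501) -> int | None:
    """Smallest odd jury size whose majority-vote accuracy reaches
    target, or None if max_n isn't enough. For juries in the
    thousands, jury_accuracy should be rewritten in log-space."""
    for n in range(1, max_n + 1, 2):
        if jury_accuracy(n, p) >= target:
            return n
    return None

print(smallest_jury(0.70, 0.98))  # roughly two dozen jurors suffice
```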

And, interestingly, scaling the panel doesn’t impact latency. The jurors can run in parallel, so the total time is determined by the slowest one. I suspect hedging strategies (e.g., run N + M jurors and tally the first N responses) may be applicable, as in the sketch below.
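
A sketch of that idea with Python’s asyncio; ask_juror is a hypothetical placeholder for whatever model call you’d actually make:

```python
import asyncio
from collections import Counter

async def ask_juror(question: str) -> bool:
    """Hypothetical single juror: one LLM call returning True/False."""
    raise NotImplementedError  # wire up your model of choice here

async def hedged_verdict(question: str, n: int = 11, m: int = 2) -> bool:
    """Run n + m jurors in parallel, tally the first n answers, and
    cancel the m stragglers. Assumes n is odd so ties can't occur."""
    tasks = [asyncio.create_task(ask_juror(question)) for _ in range(n + m)]
    votes: list[bool] = []
    for finished in asyncio.as_completed(tasks):
        votes.append(await finished)
        if len(votes) == n:
            break
    for task in tasks:
        task.cancel()  # no-op for tasks that already finished

    # Majority vote over the first n responses.
    return Counter(votes).most_common(1)[0][0]

# Usage (once ask_juror is implemented):
# verdict = asyncio.run(hedged_verdict("Is this docstring accurate?"))
```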

What types of questions?

So what kinds of questions can this be useful for? I have a few practical, everyday examples in mind:

  • This unit test fails. Is the implementation correct (regarding the tested property)? I intend to use this to determine the next step: ask an agent to fix the implementation, or ask it to fix the test.

  • A variation of the above: Does this code implement a specific property (described in natural language)?

  • I’ve received a code change (possibly from AI, possibly from a human; it doesn’t matter). Does it adhere to a specific coding style guideline? The jury’s answer determines if I run another agent to fix the style issue.

  • Is this function’s documentation (docstring) still accurate for the code it describes?

  • I have written a new recipe. Does it uphold one of my principles for my recipes?
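
For all of these, the juror needs to be forced into a clean binary answer that can be tallied. A hypothetical prompt template and vote parser for the style-guideline check (both are illustrative sketches, not something from an existing tool):

```python
VERDICT_PROMPT = """\
Answer with exactly one word: YES or NO.
Question: does the following change adhere to this style guideline?

Guideline: {guideline}

Change:
{diff}
"""

def parse_verdict(reply: str) -> bool | None:
    """Map a juror's raw reply to a vote; None means an invalid
    answer, which the jury can treat as an abstention."""
    return {"YES": True, "NO": False}.get(reply.strip().upper())
```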

The practical trade-off

I’m excited to have a tool where I can create my own AI juries. Based on their observed performance (false positive/negative rates), I can tweak the contexts (to increase p) or adjust the jury size (to increase N) until I find the right balance between final accuracy and cost.
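
As a toy illustration of that tuning loop, you could tabulate accuracy against cost for a few candidate jury sizes (reusing jury_accuracy from above; the per-call cost is a made-up placeholder):

```python
def tradeoff_table(p: float, cost_per_call: float = 0.01,
                   sizes: tuple[int, ...] = (1, 5, 15, 25, 51)) -> None:
    """Print jury accuracy vs. dollar cost for candidate jury sizes."""
    for n in sizes:
        print(f"N={n:3d}  j={jury_accuracy(n, p):.4f}  "
              f"cost=${n * cost_per_call:.2f}")

tradeoff_table(0.75)
```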

Every time the jury gets things wrong, I get involved and steer the process. Every time it gets things right, it saves me from analyzing things myself.

This creates a very concrete trade-off: spend money (bigger juries) to save time (less manual intervention).
