Ken Jennings is the most successful Jeopardy! contestant of all time, with total winnings of more than $2.5 million. But even Jennings didn’t get every answer right. Over his 75-game run, he answered 2,694 questions correctly and 264 incorrectly, an accuracy rate of 91 percent.
Leading AI models, which are trained on millions of articles, have much higher accuracy rates than Jennings (no surprise there). But even AI models often get things wrong, leading people to make decisions that cause real-world harm. The Chicago Sun-Times recently published a “Summer Reading List,” for example, that included several books that don’t exist.
At Rostra, we’ve built a tool that drastically reduces factual errors in AI-generated responses. We benchmarked our tool’s factual accuracy against leading AI models operating on their own, including GPT-4.1 and Claude 3.5 Haiku. We randomly selected 250 Jeopardy! questions from Season 40 and submitted each one, via API, to every model and to Rostra. Each response was categorized as “Correct,” “Incorrect,” or “Partially Correct,” and the final percentages were tabulated.
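The harness itself is simple. The sketch below is illustrative rather than our production code: the model list, the API wrapper, and the grading helper are placeholders you would swap for real clients and a real judging step.

```python
from collections import Counter

# Placeholder model list: swap in real API clients for each system under test.
MODELS = ["rostra", "gpt-4.1", "claude-3.5-haiku"]


def ask_model(model: str, clue: str) -> str:
    """Send one Jeopardy! clue to one model and return its answer text.

    Placeholder: in practice this wraps the relevant provider's API call.
    """
    raise NotImplementedError


def grade(answer: str, accepted: str) -> str:
    """Return "Correct", "Partially Correct", or "Incorrect".

    Placeholder: grading can be done by hand or by a separate judge model.
    """
    raise NotImplementedError


def run_benchmark(clues: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Tabulate per-model accuracy over (clue, accepted_answer) pairs."""
    tallies = {model: Counter() for model in MODELS}
    for clue, accepted in clues:
        for model in MODELS:
            tallies[model][grade(ask_model(model, clue), accepted)] += 1

    summary = {}
    for model, counts in tallies.items():
        total = sum(counts.values())
        # Error rate = Incorrect + Partially Correct, as a share of all responses.
        errors = counts["Incorrect"] + counts["Partially Correct"]
        summary[model] = {
            "correct_pct": 100 * counts["Correct"] / total,
            "error_pct": 100 * errors / total,
        }
    return summary
```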
Rostra outperformed all other models in factual accuracy, with a total error rate (Incorrect + Partially Correct) of 4.2 percent. OpenAI’s GPT-4.1 had an error rate of 6.7 percent. Claude 3.5 Haiku had an error rate of more than 16 percent.
Our tool works by assembling a “quorum” of up to 24 different AI models: both frontier LLMs and smaller, task-specific models. Each model is prompted with the question and independently formulates an answer, citing sources and explaining its reasoning when possible. A retrieval-augmented generation (RAG) step checks key assertions against external datasets or internal documents, and our tool then checks semantic alignment across the answers. Each model’s answer is weighted according to its prior performance in that question’s technical domain, such as chemistry, history, or law. Outlier answers get flagged, and a consensus agent synthesizes these inputs into one confidence-weighted answer.
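In code, the aggregation step looks roughly like the sketch below. This is a simplified stand-in rather than our production implementation: the similarity function, the alignment threshold, and the outlier rule are illustrative defaults, and the RAG fact checks and per-model reasoning traces that run before this step are omitted.

```python
def quorum_answer(
    question: str,
    domain: str,
    model_answers: dict[str, str],
    domain_accuracy: dict[str, dict[str, float]],
    similarity,
    alignment_threshold: float = 0.8,
    outlier_ratio: float = 0.5,
) -> dict:
    """Combine independent model answers into one confidence-weighted answer.

    model_answers   : model name -> that model's independent answer
    domain_accuracy : model name -> {domain: historical accuracy, ...}
    similarity      : callable scoring semantic alignment of two answers (0..1)
    """
    # 1. Cluster answers that are semantically aligned (same claim, different wording).
    clusters: list[list[str]] = []
    for model, answer in model_answers.items():
        for cluster in clusters:
            representative = model_answers[cluster[0]]
            if similarity(representative, answer) >= alignment_threshold:
                cluster.append(model)
                break
        else:
            clusters.append([model])

    # 2. Weight each cluster by its members' prior accuracy in this question's domain.
    def weight(model: str) -> float:
        return domain_accuracy.get(model, {}).get(domain, 0.5)  # 0.5 = no prior data

    cluster_weights = [sum(weight(m) for m in c) for c in clusters]
    total_weight = sum(cluster_weights)

    # 3. Flag outlier clusters whose support is far below the leading cluster's.
    best = max(range(len(clusters)), key=lambda i: cluster_weights[i])
    flagged = [
        clusters[i]
        for i in range(len(clusters))
        if i != best and cluster_weights[i] < outlier_ratio * cluster_weights[best]
    ]

    # 4. Consensus: the heaviest cluster's answer, with confidence equal to
    #    that cluster's share of the total weight.
    return {
        "question": question,
        "answer": model_answers[clusters[best][0]],
        "confidence": cluster_weights[best] / total_weight,
        "flagged_outliers": flagged,
    }
```

The effect of the domain weighting is that, on a history question, a model with a strong track record in history contributes more to the final answer than one that has historically struggled in that domain.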
Rostra outperforms leading frontier models even when its “quorum of experts” consists solely of last-generation models. In other words, we can beat modern AI models by orchestrating a collaboration among only older, more error-prone models. Rostra’s accuracy rate climbs even higher when the quorum consists of newer LLMs.
There are caveats to our work, of course. For one, these data cover only 250 Jeopardy! questions, so the error rates reported here speak only to the models’ shorter, “recall”-style abilities. We’re currently testing Rostra’s accuracy on questions that require deeper reasoning and benchmarking its hallucination rates on long-form responses. We’ll publish those data as soon as they become available.
Join our waitlist to receive future updates from Rostra.