Every AI model makes mistakes. Those mistakes can be costly.
In February 2024, a British Columbia tribunal forced Air Canada to refund a passenger C$650 after the airline’s customer-service chatbot made up a refund policy. Lawyers in New York and Vancouver have faced sanctions for citing cases invented by ChatGPT, and a Stanford study shows legal-research bots still get at least one in six answers wrong on benchmark queries.
“The insidious thing about hallucinations is not that the model is getting five percent wrong,” Snowflake CEO Sridhar Ramaswamy has said. “It’s that you don’t know which five percent.” We think the best way to solve this problem is to build better guardrails, boost transparency, and force models to “test their thinking” in the open, in dialogue with other AI models.
Rostra’s Best Answers tool does exactly that. Rather than trusting answers generated by a single AI model, Best Answers collects opinions from a quorum of AI models and then returns only what multiple engines—as well as internal and external evidence—confirm to be true. Best Answers cuts hallucinations by up to 80 percent, based on internal benchmarks.
Here’s how it works: A user starts by asking a question on Rostra’s web interface, which looks much like ChatGPT’s. The query is sent simultaneously to seven different models, a mix of frontier LLMs and task-specific tools chosen according to the question’s topic. The Best Answers engine connects to more than two dozen models so far, including the latest LLMs from OpenAI, Anthropic, Google’s Gemini family, and Mistral. Because the quorum spans a diverse set of closed and open-source models, a single bad response is unlikely to dominate the set.
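In practice, the fan-out step looks something like the sketch below. Rostra has not published its internal API, so the model identifiers and the query_model helper are placeholders; the point is simply that the same question reaches every model in the quorum concurrently.

```python
import asyncio

# Hypothetical model identifiers; the real quorum is chosen per topic
# from the 25+ engines Best Answers connects to.
QUORUM = [
    "frontier-llm-a", "frontier-llm-b", "frontier-llm-c", "frontier-llm-d",
    "open-weight-llm-a", "open-weight-llm-b", "task-specific-tool-a",
]

async def query_model(model_id: str, question: str) -> dict:
    """Placeholder for a single provider call; a real implementation
    would hit that provider's API and normalize the response."""
    await asyncio.sleep(0)  # stand-in for network latency
    return {"model": model_id, "answer": f"<answer from {model_id}>"}

async def fan_out(question: str) -> list[dict]:
    # Send the same question to every model in the quorum at once,
    # so total latency is bounded by the slowest provider, not the sum.
    return await asyncio.gather(*(query_model(m, question) for m in QUORUM))

if __name__ == "__main__":
    answers = asyncio.run(fan_out("What is the refund policy for cancelled flights?"))
    print(len(answers), "candidate answers collected")
```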
As each model returns an answer, a consensus agent gets to work. The agent compares statements line by line, flags contradictions, and calls a retrieval-augmented checker that pulls trusted documents (public datasets or your own files) to fact-check each model’s output. The agent then distills the overlapping claims into a single reply, weighting each claim by how many models assert it and how much evidence supports it. Outlier claims are down-ranked or tossed out entirely.
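To make the weighting concrete, here is a minimal sketch of how overlapping claims might be scored. The claim sets, the evidence scores, and the even split between model agreement and retrieved evidence are illustrative assumptions, not Rostra’s actual consensus logic, which has not been published.

```python
from collections import defaultdict

# Toy claim sets, as if a prior extraction step had already split each
# model's answer into atomic statements (that step is not shown here).
model_claims = {
    "frontier-llm-a": {"Refunds are issued within 30 days", "Bereavement fares exist"},
    "frontier-llm-b": {"Refunds are issued within 30 days"},
    "open-weight-llm-a": {"Refunds are issued within 30 days", "Refunds take 90 days"},
}

# Hypothetical scores from the retrieval-augmented checker:
# 1.0 = directly supported by a trusted document, 0.0 = no support found.
evidence_score = {
    "Refunds are issued within 30 days": 1.0,
    "Bereavement fares exist": 0.8,
    "Refunds take 90 days": 0.0,
}

def rank_claims(model_claims, evidence_score, min_score=0.5):
    support = defaultdict(int)
    for claims in model_claims.values():
        for claim in claims:
            support[claim] += 1

    ranked = []
    for claim, votes in support.items():
        agreement = votes / len(model_claims)           # share of models asserting the claim
        score = 0.5 * agreement + 0.5 * evidence_score.get(claim, 0.0)
        if score >= min_score:                          # outliers fall below the cutoff
            ranked.append((score, claim))
    return sorted(ranked, reverse=True)

for score, claim in rank_claims(model_claims, evidence_score):
    print(f"{score:.2f}  {claim}")
```

In this toy run, the claim every model asserts and the evidence confirms scores 1.0, the weakly supported claim squeaks past the cutoff, and the unsupported outlier is dropped.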
Every single answer returned to the user is backed by a real source: an internal or external document. In healthcare, workflows built on similar RAG guardrails have driven hallucination rates down to 1.47 percent; legal RAG services, by contrast, still miss the mark 17–34 percent of the time.
Each “Best Answer” response also goes through a 45-point quality audit that looks for unsupported claims, biased phrasing, policy violations, personal-data leaks, and more. If a check raises a red flag, the answer is revised or withdrawn. The full chain of reasoning, from the initial question through the raw model outputs, the consensus logic, and the audit results, is recorded in a ledger so compliance teams can trace their business’s AI use.
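As a rough illustration, the sketch below shows what a couple of those audit checks and a hash-chained ledger record could look like. The has_personal_data and has_unsupported_claims helpers and the ledger format are hypothetical; Rostra has not published the 45 audit points or its ledger schema.

```python
import hashlib
import json
import re
import time

def has_personal_data(text: str) -> bool:
    # Naive email / phone patterns as a stand-in for real PII detection.
    return bool(re.search(r"[\w.]+@[\w.]+|\b\d{3}[-.]\d{3}[-.]\d{4}\b", text))

def has_unsupported_claims(claims: list[dict]) -> bool:
    # Flag any claim the retrieval checker could not back with a document.
    return any(c["evidence_score"] == 0.0 for c in claims)

def audit(answer: str, claims: list[dict]) -> list[str]:
    flags = []
    if has_personal_data(answer):
        flags.append("personal-data-leak")
    if has_unsupported_claims(claims):
        flags.append("unsupported-claim")
    return flags  # a non-empty list would trigger revision or withdrawal

def ledger_entry(question, raw_outputs, final_answer, flags, prev_hash=""):
    # Each record hashes its predecessor, making the audit trail tamper-evident.
    record = {
        "ts": time.time(),
        "question": question,
        "raw_outputs": raw_outputs,
        "final_answer": final_answer,
        "audit_flags": flags,
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record
```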
Even if future LLMs hallucinate only rarely, say less than one percent of the time, a small chance of error still compounds. A one-percent chance of making a mistake at each step of a 100-step workflow implies a 63 percent chance that the workflow fails somewhere along the way. Databricks CEO Ali Ghodsi, whose platform builds corporate AI agents, warns that people “underestimate how hard it is to completely automate a task.” Humans will remain supervisors of AI agents for the foreseeable future, Ghodsi says, in part because it is so difficult to make sure each part of an automated task is solved correctly by AI tools.
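The arithmetic behind that 63 percent figure is straightforward, assuming each step can fail independently:

```python
p_step = 0.01                        # 1% chance of error at any single step
steps = 100
p_failure = 1 - (1 - p_step) ** steps
print(f"{p_failure:.0%}")            # -> 63%
```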
Best Answers not only drastically cuts hallucination rates but also audits and records every interaction on a ledger. It is compliant, transparent, and robust. We’re excited for you to try it.
Join our waitlist to learn more.