How to Identify and Avoid Overpriced AI Models


A common issue I see in the AI app projects I've been involved with is overpriced models. Whether I'm coming in on a new project or analyzing the cost-performance efficiency of an existing one, clients paying too much (and frequently for underperforming models) comes up often enough that it warrants a dedicated post.

In this post, I'll lay out a framework AI practitioners can use to compare models on price vis-a-vis performance.

Start with Leaderboards

Whereas traditional machine learning models stay pretty consistent in terms of pricing and performance (many are open source and freely available), AI models can vary wildly in both performance and pricing, even from task to task. (You can use my free Machine Learning Model Picker to determine the best ML model for a task.)

To help users decide on an AI model, benchmark providers have published a myriad of task-specific leaderboards. So, if you're building an app with a chat function, you'll want to use a leaderboard like HELM Lite for safety, Google's FACTS Grounding for factually responsible responses, Artificial Analysis for cost vs. performance, or any of the other leaderboards associated with chatbot performance, as shown in this screenshot from my free AI Strategy app. (Leaderboards are indicated with a green node, and selecting one opens a leaderboard card with copious notes to guide your foray into the wild wild west of leaderboards.)


To include cost benchmarks, be sure to include cost in your initial request. Whatever benchmark categories you select (e.g., quality, cost, latency, etc.) will appear in the network graph, provided the app includes a leaderboard that covers that benchmark.


Note: I ordered tasks alphabetically, but I ordered benchmarks by frequency: every leaderboard includes some kind of quality benchmark, while the other benchmarks appear far less often.

Evaluate Performance vs Cost

Scatterplots Are Your Friend

If the task you're interested in is covered by the Artificial Analysis (AA) family of leaderboards, you're in luck: AA not only includes cost benchmarks, it also provides scatterplots that compare cost and performance for you. You can see an example of this in their Speech to Text AI Model & Provider Leaderboard.


One thing I love about AA's leaderboards is that their scatterplots identify the ideal quadrant with a green background and the most underperforming quadrant with a gray background. I call them the Hall of Fame and the Hall of Shame, respectively.

So not only is Amazon Transcribe the most expensive of the 11 models AA surfaces by default (more on that in a minute), it also has a word error rate of 11.2% at the time of writing (i.e., the percentage of incorrect words in the transcription). The three models in the green quadrant are Scribe, Enhanced, and Universal-2.
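
If you want to reproduce that quadrant split for your own shortlist, the logic is easy to script. Below is a minimal sketch in Python that divides models at the average cost and average word error rate (AA's actual cut lines may differ); aside from Amazon Transcribe's 11.2% error rate cited above, the figures are made-up placeholders you would swap for the leaderboard's actual numbers.

```python
# A minimal sketch of the quadrant view AA's scatterplots encode: split models at
# the average cost and average word error rate (AA's actual cut lines may differ),
# then flag the "Hall of Fame" (cheap + accurate) and "Hall of Shame"
# (expensive + inaccurate) corners. All figures except Amazon Transcribe's
# 11.2% WER are illustrative placeholders.
from statistics import mean

models = {
    # model: (cost in $ per 1,000 minutes of audio, word error rate %)
    "Amazon Transcribe": (24.0, 11.2),
    "Universal-2":       (6.2,  8.5),
    "Scribe":            (6.7,  8.2),
    "Enhanced":          (7.0,  8.9),
}

cost_cut = mean(cost for cost, _ in models.values())
wer_cut = mean(wer for _, wer in models.values())

for name, (cost, wer) in models.items():
    if cost <= cost_cut and wer <= wer_cut:
        quadrant = "Hall of Fame (cheap + accurate)"
    elif cost > cost_cut and wer > wer_cut:
        quadrant = "Hall of Shame (expensive + inaccurate)"
    else:
        quadrant = "mixed trade-off"
    print(f"{name:18} ${cost:5.2f}  WER {wer:4.1f}%  -> {quadrant}")
```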

What About Speed?

So it's clear from this chart that Amazon Transcribe can't justify being a cost outlier based on quality (i.e., word error rate, in this instance). But what if it's lightning fast? Would that be worth the loss of quality and the higher cost? Let's keep scrolling and compare the models' speed.


Well, this is awkward. Of the 11 models AA surfaces by default, Amazon Transcribe is the second slowest. And this isn’t a blip on the radar. I highlighted it by selecting it in the legend, and it’s consistently been slow.

I can't tell you how common it is for bigger brands to have a pricing structure that just isn't justifiable. We'll look at more examples toward the end of this post.

If I were consulting on this project, I'd gravitate toward recommending Universal-2, which is 4x as fast as Amazon Transcribe at a quarter of the cost. Three models were significantly faster, but there's an apparent point of diminishing returns where the extra speed comes with significantly higher error rates.

But keep in mind there are more than 11 models; at the time of writing, there are 28 to choose from. So, to give you an idea of the steps I'd take IRL, I'd select all of them and then start removing the clear under-performers, as indicated by the green x's in the screenshot below.
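
If you'd rather do that culling step programmatically than by eyeballing the chart, a Pareto-dominance filter captures the idea: drop any model that another model beats on cost, speed, and error rate at the same time. The sketch below uses assumed metric values, not AA's actual numbers; in practice you'd paste in the figures from the leaderboard.

```python
# A rough sketch of the culling step: drop any model that is dominated on every
# axis, i.e., some other model is cheaper, faster, and more accurate.
# All metric values here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost: float   # $ per unit of audio (lower is better)
    speed: float  # real-time factor (higher is better)
    wer: float    # word error rate, % (lower is better)

def dominated(a: Model, b: Model) -> bool:
    """True if b is at least as good as a on every axis and strictly better on one."""
    at_least = b.cost <= a.cost and b.speed >= a.speed and b.wer <= a.wer
    strictly = b.cost < a.cost or b.speed > a.speed or b.wer < a.wer
    return at_least and strictly

candidates = [
    Model("Amazon Transcribe", 24.0,  30.0, 11.2),
    Model("Universal-2",        6.2, 120.0,  8.5),
    Model("Wizper (L, v3)",     4.0, 150.0,  9.4),
]

# Keep only the models that no other candidate dominates outright.
shortlist = [m for m in candidates if not any(dominated(m, other) for other in candidates)]
print([m.name for m in shortlist])
```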


Final Recommendation

After opening the floor to all the models, I'd probably recommend digging deeper into Universal-1 or fal.ai's Wizper (L, v3), if the slightly higher error rate would be acceptable to the client.


Once you have a good idea of pricing, you can use some of the other speech-to-text leaderboards to see how your top choices perform across other benchmarks.
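
One lightweight way to do that cross-leaderboard check is to normalize each leaderboard's scores and average them across your shortlist. The sketch below assumes the scores have been copied into a dict by hand and oriented so that higher is better; the leaderboard names and numbers are placeholders, not real values.

```python
# A rough sketch of combining several speech-to-text leaderboards into one view:
# min-max normalize each leaderboard's scores (assumed higher = better), then average.
# Leaderboard names and scores are made-up placeholders for illustration.
scores = {
    # leaderboard: {model: score}
    "AA quality":          {"Universal-2": 91.5, "Universal-1": 90.2, "Wizper (L, v3)": 89.0},
    "Other benchmark (A)": {"Universal-2": 0.82, "Universal-1": 0.85, "Wizper (L, v3)": 0.80},
}

def normalize(board: dict[str, float]) -> dict[str, float]:
    lo, hi = min(board.values()), max(board.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {model: (score - lo) / span for model, score in board.items()}

normalized = [normalize(board) for board in scores.values()]
all_models = {model for board in scores.values() for model in board}

combined = {
    model: sum(board.get(model, 0.0) for board in normalized) / len(normalized)
    for model in all_models
}

for model, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(f"{model:16} combined score {score:.2f}")
```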

Other Examples of Overpriced Models

Image Generation

As is quite typical, OpenAI's model is an outlier in terms of pricing. If I were deciding on a text-to-image model, I'd want to dig into Recraft V3 since, according to AA, it's less than a quarter of the price of 4o and 8x faster, with only a small hit in quality (i.e., a 1111 vs. 1165 Quality ELO score).
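
To make that trade-off concrete, here's a rough back-of-the-envelope comparison. Only the two ELO scores come from the AA chart; the per-image price and speed figures are assumptions standing in for whatever the leaderboard shows when you read it.

```python
# Back-of-the-envelope check of the Recraft V3 vs GPT-4o image-generation numbers.
# Only the ELO scores (1111 vs 1165) are from the AA chart; price and speed below
# are assumed placeholders.
recraft = {"price_per_image": 0.04, "seconds_per_image": 4.0,  "elo": 1111}  # assumed price/speed
gpt4o   = {"price_per_image": 0.17, "seconds_per_image": 32.0, "elo": 1165}  # assumed price/speed

price_ratio = recraft["price_per_image"] / gpt4o["price_per_image"]
speed_ratio = gpt4o["seconds_per_image"] / recraft["seconds_per_image"]
elo_gap = gpt4o["elo"] - recraft["elo"]

print(f"Recraft V3 is {price_ratio:.0%} of 4o's price, {speed_ratio:.0f}x faster, "
      f"and only {elo_gap} ELO points behind.")
```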


Agentic Models

With agentic models, OpenAI's models have consistently been overpriced, according to the Galileo leaderboard. They're so consistently overpriced that they require a log10 x-axis. You don't want to be the reason some analyst has to use a log10 axis for cost: it means the pricing differential is so large that, on a standard axis, the data points would cluster between $0 and $10, followed by a giant gap, two OpenAI models, and then gpt-4.5-preview on the far right.


To wit, at the time of writing, the gemini-2.0-flash-001 model received a higher performance score and is 0.3% the cost of gpt-4.5-preview-2025-02-27. Anthropic’s claude-3-7-sonnet-20250219 scored the highest for performance and is still just 6% the cost of gpt-4.5.
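
To see why that spread forces a log axis, here's a quick sanity check of the ratios. The per-token prices below are placeholders chosen only to roughly match the cited percentages, not Galileo's actual figures.

```python
# Quick check of the cost ratios cited above. The per-token prices are assumed
# placeholders picked to roughly match the cited ~0.3% and ~6% ratios.
prices = {  # $ per 1M output tokens (hypothetical figures for illustration)
    "gpt-4.5-preview-2025-02-27": 150.00,
    "claude-3-7-sonnet-20250219":   9.00,
    "gemini-2.0-flash-001":         0.40,
}

base = prices["gpt-4.5-preview-2025-02-27"]
for model, price in prices.items():
    print(f"{model:30} {price / base:6.1%} of gpt-4.5-preview's cost")

# A spread of roughly three orders of magnitude is exactly why the chart needs a
# log10 cost axis: on a linear axis everything but gpt-4.5 collapses toward zero.
```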

You can check out more leaderboards that evaluate agentic models from this filtered view of the AI Strategy App.

Be Vigilant

The common thread I see when I compare these AI models is that some of the bigger brands, especially OpenAI and Amazon, gravitate toward pricing that is hard to justify. It pays to learn how to leverage these leaderboards to ensure you're getting the most value from the models you choose for the tasks you want your app to perform.

Image credit: Jp Valery