The Irony of the LLM Treadmill


A strange new burden has crept into software teams: the LLM treadmill. Many models retire within months, so developers now continuously migrate features they only just shipped.

Our team feels this pain sharply. We support many LLM-based features that together process massive token volumes each month, though I suspect even teams with a small LLM dependency feel the same frustration.

Example: A migration affected by the “jagged frontier”

You can treat these migrations like any other software version bump. But users dislike adapting to change that is only mostly better, and since LLMs are weird, and lately their upgrades jagged, simply bumping the version can be quite messy.

Consider a common scenario: a feature in your product is powered by a clever, “vibe-based” prompt. It worked surprisingly well on a popular model, so you shipped it and iterated on it when users gave feedback.

Then came the model’s deprecation notice. Time to migrate. When you migrated this same feature last year, the version bump was a clear and easy win. Hopefully this one will be too.

Only this time the new model makes the feature feel different. It’s sometimes better, sometimes worse. The prior model had a special knack for the task. You worry about forcing your users to adapt to the change.

This pushes you to graduate from vibe-based prompting. You formalize the task, annotate high-quality examples, and fine-tune a replacement model. You now have a more robust solution with much-improved quality, all because the treadmill forced you to build it right.
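To make that concrete, here is a minimal sketch of what “formalize and annotate” can boil down to: collecting annotator-approved input/output pairs into a JSONL file in the chat-style fine-tuning format most vendors accept. The task, file name, and examples below are hypothetical; check your vendor’s fine-tuning docs for the exact schema.

```python
# Minimal sketch (hypothetical task and data): turn annotator-approved
# input/output pairs into a chat-style fine-tuning dataset in JSONL.
import json

SYSTEM_PROMPT = "Summarize the support ticket in two sentences."  # the formalized task

# Hypothetical annotated examples: (user input, output your annotators approved)
annotated_examples = [
    ("Summarize this ticket: ...", "Customer reports login failures since ..."),
    ("Summarize this ticket: ...", "Billing page times out when ..."),
]

with open("train.jsonl", "w") as f:
    for user_input, approved_output in annotated_examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_input},
                {"role": "assistant", "content": approved_output},
            ]
        }
        f.write(json.dumps(record) + "\n")
```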

Was all that necessary?

Getting migrations right is important.

A recent example is ChatGPT’s move to GPT-5. The chat experience became smarter, but it lacked GPT-4o’s personality. Many users were unhappy and wanted it rolled back. It’ll take another migration to fix properly.

So what should you do when the model behind your vibe-prompted feature is about to be sunset?

  • If a newer model makes a mediocre feature feel great, take it quickly.

  • Otherwise, move beyond feel. Break down exactly what people like about your feature.

And this takes serious effort … just to migrate.

But it pays for itself. Once the nuances of “good” are measurable, you can make your feature even better. In the scenario above, the new, smarter model is often also 10x cheaper and 2x faster. And the next migration will be easier, since you already have your nicely annotated dataset.

My team does this often. We ship a v1 with prompting. A model gets deprecated. We nail down “good” → measure it → kick off an optimization loop.
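As a rough illustration of the “measure it” step, here is a minimal sketch that scores candidate model/prompt setups against the annotated dataset and picks the best one. The candidate names are made up, and the call and scoring functions are placeholders for your own model calls and quality rubric.

```python
# Minimal sketch (placeholder models and rubric): compare candidate
# model/prompt setups against the annotated dataset from the sketch above.
from statistics import mean

def call_candidate(candidate: dict, user_input: str) -> str:
    # Placeholder: in practice this would call the candidate's API with its prompt.
    return f"[{candidate['model']}] response to: {user_input}"

def score(output: str, reference: str) -> float:
    # Placeholder rubric: crude word overlap. A real rubric encodes the
    # specific nuances of "good" you nailed down for this feature.
    out, ref = set(output.lower().split()), set(reference.lower().split())
    return len(out & ref) / max(len(ref), 1)

def evaluate(candidate: dict, dataset: list[tuple[str, str]]) -> float:
    # Average quality of a candidate across the annotated examples.
    return mean(score(call_candidate(candidate, x), ref) for x, ref in dataset)

# Same shape as the annotated (input, approved output) pairs above.
dataset = [
    ("Summarize this ticket: ...", "Customer reports login failures since ..."),
    ("Summarize this ticket: ...", "Billing page times out when ..."),
]

candidates = [
    {"model": "vendor-new-model", "prompt": "v1 prompt"},
    {"model": "our-fine-tune", "prompt": "v2 prompt"},
]

best = max(candidates, key=lambda c: evaluate(c, dataset))
print("Best candidate:", best["model"])
```

From there, “kick off an optimization loop” is just rerunning this comparison each time a new prompt, model, or fine-tune enters the candidate list.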

We try new prompts, alternative models, and sometimes fine-tune our own. We usually end up with something faster, cheaper, sturdier, and more consistently high quality than the vendor’s version bump that forced the whole process.

Awkwardly, that process often churns spend away from the vendor.

And that’s the irony of the LLM treadmill: short model lifespans force even happy API customers to keep reconsidering their vendor. And the better the customer (the more features they’ve built and maintain), the stronger this push away becomes.

Seems like a hard way to do business.

Each big AI lab forces a different “treadmill”, with OpenAI’s so far offering the most self-determination.

Google’s Gemini models retire one year from release. New releases are often priced quite differently (↕).

Anthropic’s retirements can occur with just 60 days’ notice once a model is a year past release. Pricing has been flat since Claude 3’s release (except for Haiku ↑).

And OpenAI’s treadmill is more developer-friendly:

  • Models are supported for longer (still supporting even GPT-3.5-Turbo and Davinci).

  • Upgrades often arrive with a lower price ↓.

The contrast between OpenAI and Anthropic becomes clearer when you look at how they position their models.

At their recent DevDay, OpenAI showcased a long list of top customers. What struck me is how varied that list is: all kinds of unicorns, spanning consumer, business platforms, productivity, developer tools, and more, seem to be heavily using OpenAI’s API.

This was consistent with how OpenAI positioned GPT-5 upon release: as a model intended to tackle a broad range of tasks.

Anthropic, in contrast, appears to be specializing. Their Claude 3 announcement touted a wide array of uses; by the Claude Sonnet 4.5 release, they positioned Claude more narrowly as the best “coding model”. And coding tools are reportedly an increasingly large part of their revenue.

I think it’s no coincidence that the vendor with the friendlier treadmill has kept a wider base of software built on its models. I also wonder whether this is self-reinforcing in how the big labs iterate on their product-market fit.

I think software teams will keep following their incentives. If model migrations cost more than they deliver, those teams will grow tired of them. They’ll reclaim control over quality and roadmap prioritization by either self-hosting models or moving to labs with friendlier policies.

That said, I’m also optimistic that the big AI labs will see this and fix the underlying driver. I hope they’ll commit to long-term support of their models. The pain today might just be a growing pain of a new industry. The LLM treadmill may, in time, disappear.

The exception to all of this is coding tools. That crowd sees pure upside from each new model, so Anthropic’s focused bet on code will likely continue to compound. But for the rest of us building AI-powered app features, navigating the treadmill has become a very real and pressing problem.

Find this work interesting? We’re hiring!

You can also follow me on X.
