We shipped GPT-5 support before lunch


Shipping GPT-5 support the very morning it dropped wasn’t a lucky break—it was the payoff from months of infrastructure and eval investment. OpenAI unveiled the model at 10 AM PT; by 12 PM it had cleared our evaluation harness, flowed through our provider-agnostic router, and appeared as a “GPT-5” toggle in every customer dashboard, all before lunch. The playbook below breaks down the repeatable systems that made that two-hour turnaround possible.

Ruthless, layered evaluations

We’ve spent many months building up our evaluation datasets, and that investment is the primary reason we can roll out new models so quickly.

Domain-specific suites

We split our evaluations by task and maintain separate datasets for voice (focused on latency and instruction following), chat (focused on tone of voice and response accuracy), and long-horizon email workflows (focused on agentic, multi-turn goal completion).

We’ve curated our evaluation datasets using expert responses from our customers as well as our own internal examples. Each case carries a short note on why it matters, so that context survives model migrations.
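To make that concrete, a golden example in one of these suites might be stored roughly like the sketch below. The field names and the sample case are illustrative, not our actual schema; the point is that every case travels with its task, its expert reference, and the note explaining why it exists.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One golden example in a task-specific suite (schema is illustrative)."""
    suite: str       # "voice", "chat", or "email_workflows"
    prompt: str      # input sent to the model under test
    expected: str    # expert response curated from customers or internal examples
    rationale: str   # why this case matters, so context survives model migrations

VOICE_SUITE = [
    EvalCase(
        suite="voice",
        prompt="Caller asks to reschedule an appointment; reply in under two sentences.",
        expected="Sure, I can move that for you. What day works best?",
        rationale="Voice replies must stay short and follow instructions to keep latency low.",
    ),
]
```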

LLM-as-a-judge

We make heavy use of LLM-as-a-judge to speed up evaluation, but for every model upgrade we also run a set of human evaluations to get a sense of how the model actually behaves. We’ve invested heavily in tuning our LLM-as-a-judge setup to balance speed with accuracy.
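A minimal judge can be a single call to a fast model that grades a candidate reply against the expert reference. The sketch below uses the OpenAI chat completions API; the judge model, rubric, and 1-to-5 scale are assumptions for illustration, not our production prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a support reply against an expert reference.
Score it from 1 to 5 for accuracy and tone, then respond with only the number."""

def judge(candidate: str, reference: str, judge_model: str = "gpt-4o-mini") -> int:
    """Have a fast judge model score a candidate answer; humans still spot-check upgrades."""
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Reference:\n{reference}\n\nCandidate:\n{candidate}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```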

A model-agnostic orchestration layer

Abstract once, swap forever

Our earliest outages taught us that routing everything to one model was a single-point-of-failure strategy. We now funnel every inference call through a provider-agnostic interface. Vendor, model name, and temperature are just environment variables—no downstream service code changes are required.
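A stripped-down version of that interface could look like the sketch below, with a single OpenAI provider standing in for the full set; the environment variable names and defaults are assumptions.

```python
import os
from openai import OpenAI

class OpenAIProvider:
    """One concrete provider; other vendors expose the same complete() shape."""
    def __init__(self):
        self._client = OpenAI()

    def complete(self, prompt: str, model: str, temperature: float) -> str:
        resp = self._client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

PROVIDERS = {"openai": OpenAIProvider}

def infer(prompt: str) -> str:
    """Downstream services only ever call infer(); the rest is configuration."""
    vendor = os.environ.get("LLM_VENDOR", "openai")
    model = os.environ.get("LLM_MODEL", "gpt-5")
    temperature = float(os.environ.get("LLM_TEMPERATURE", "0.2"))
    return PROVIDERS[vendor]().complete(prompt, model=model, temperature=temperature)
```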

Capability-tier routing

Borrowing ideas from multi-LLM routing providers, we tag each model as Fast, General, Reasoning, and so on. That makes it easy to plug a new model into an entire tier and swap out large swaths of our infrastructure at once, or to target a specific use case. Either way, it’s one configuration flip.
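In configuration terms the idea is roughly this: tiers map to models and workflows map to tiers, so an upgrade never touches workflow code. The tier names and model assignments below are illustrative.

```python
# Capability tiers map to concrete models; changing a tier's model is a one-line edit.
MODEL_TIERS = {
    "fast": "gpt-4o-mini",    # low-latency voice paths
    "general": "gpt-4o",      # default chat workflows
    "reasoning": "gpt-5",     # long-horizon, multi-step email workflows
}

# Workflows declare the tier they need, never a model name.
WORKFLOW_TIERS = {
    "voice_reply": "fast",
    "chat_reply": "general",
    "email_agent": "reasoning",
}

def model_for(workflow: str) -> str:
    return MODEL_TIERS[WORKFLOW_TIERS[workflow]]
```

Pointing the reasoning tier at a new model upgrades every workflow on that tier at once, while editing a single workflow entry targets one use case.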

Central governance & cost controls

A centralized layer also enforces rate-limit backoffs and PII redaction, which means we don’t have to rebuild any infrastructure whenever a new model comes in.
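As a rough sketch of what that shared layer enforces, the redaction patterns and retry policy below are illustrative placeholders rather than our actual rules.

```python
import re
import time

from openai import RateLimitError

PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),            # illustrative US SSN pattern
    (re.compile(r"\b[\w.+-]+@[\w-]+\.\w[\w.]*\b"), "[EMAIL]"),  # illustrative email pattern
]

def redact(text: str) -> str:
    """Strip PII before a prompt ever reaches a provider."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

def call_with_backoff(fn, *args, max_retries: int = 5, **kwargs):
    """Apply exponential backoff on rate limits once, for every provider."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("rate-limit retries exhausted")
```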

Product-level model picker

Most “AI-for-support” vendors hide the model behind the curtain. At Assembled, the model is an ordinary setting in the product, which lets our customers self-serve their own upgrades.

Even though we spend a lot of time and effort running evaluations, our customers generally know their business better than we do, so we give them the tools to choose the underlying models they want to use.
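Concretely, that means the model ends up as a plain per-customer setting, roughly like the sketch below; the field names and default are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AssistantSettings:
    """Per-customer configuration; the model is an ordinary, editable field."""
    customer_id: str
    model: Optional[str] = None  # None means "use the recommended default"

def resolve_model(settings: AssistantSettings, recommended_default: str = "gpt-5") -> str:
    # A customer's explicit choice always wins; otherwise they get the default our evals picked.
    return settings.model or recommended_default
```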

Predict-and-prep release tactics

We treat major model launches like on-call incidents you can see coming. By monitoring OpenAI’s prior announcement patterns and community chatter, we had an extremely accurate guess of the launch window a week or two before GPT-5 was scheduled to launch.

When the model type appeared in the API documentation, it was relatively easy for us to put our model upgrade process in motion.
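One way to stage that ahead of time is to keep the upgrade configuration dark and probe the provider’s model list until the new ID resolves; the config shape and flag below are assumptions, and evals still gate the rollout.

```python
from openai import OpenAI

client = OpenAI()

# Staged before launch day; flipping "enabled" is the only change needed on release morning.
STAGED_UPGRADE = {
    "model": "gpt-5",
    "tier": "reasoning",
    "enabled": False,
}

def model_is_live(model_id: str) -> bool:
    """Check whether the provider is already serving the new model ID."""
    return any(m.id == model_id for m in client.models.list())

if model_is_live(STAGED_UPGRADE["model"]):
    STAGED_UPGRADE["enabled"] = True
```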

Takeaways

  1. Invest in evals early—compounding returns are real. Your first golden dataset feels slow; your tenth lets you green-light a model during the keynote.
  2. Abstract the model boundary. Business logic shouldn’t care how many trillions of parameters answered the question.
  3. Expose choice to customers. They’ll teach you surprising use cases—and shoulder the switch when they’re ready.
  4. Treat releases like predictable incidents. Gather intel, stage configs, and rehearse the playbook before the API docs drop.
