On the surface, using AI feels simple. You type a prompt into ChatGPT, Claude, or Gemini, and out comes an answer. No waiting around, and no complicated setup. It all feels like magic.
But I've been thinking a lot about this "magic" lately, and I'm starting to realize that it runs on something far more tangible: tokens. A hidden meter is running in the background. And I have a feeling that meter is spinning faster than almost anyone realizes.
What is a token?
Let me try to break it down simply. Think of a token as a chunk of text. It could be a whole word like 'cat,' a piece of a word like 'ing' in 'running,' or even just punctuation. When you give an AI a prompt, it breaks your text down into these tokens before it can process anything. Every word you send and every word it generates back counts as tokens.
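Real tokenizers (like the byte-pair encodings OpenAI and Anthropic use) learn their splits from data, but a toy sketch gets the idea across. Everything below is a simplification for illustration, not how any production tokenizer actually works:

```python
import re

def toy_tokenize(text):
    """Very rough stand-in for a real tokenizer: split on words and
    punctuation, then chop long words into smaller chunks, mimicking
    how BPE tokenizers break rare words into sub-word pieces."""
    pieces = re.findall(r"\w+|[^\w\s]", text)
    tokens = []
    for piece in pieces:
        # Arbitrary 4-character cutoff; real tokenizers learn merge
        # rules from data instead of using a fixed length.
        while len(piece) > 4:
            tokens.append(piece[:4])
            piece = piece[4:]
        tokens.append(piece)
    return tokens

print(toy_tokenize("The cat is running!"))
# ['The', 'cat', 'is', 'runn', 'ing', '!']
```

Note how "running" splits into two pieces while "cat" stays whole: token counts depend on the words you use, not just how many of them there are.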
To be precise, tokens are the unit of measurement and the unit of currency for AI. Every time you use an AI model through an API, you're being charged for tokens. Both your input (what you ask) and the output (what it answers) count separately. It's like fuel in your car: you pay for every gallon you pump and every mile you drive. The more tokens that flow in both directions, the more it costs.
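Because billing is just arithmetic over token counts, the meter is easy to sketch. The per-million prices below are illustrative round numbers, not any vendor's actual rates:

```python
def request_cost(input_tokens, output_tokens,
                 price_in_per_m=3.00, price_out_per_m=15.00):
    """Dollar cost of one API call. Prices are per million tokens;
    these defaults are made up for illustration, with output priced
    higher than input as is typical."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# One modest exchange costs about a penny...
print(f"${request_cost(1_000, 500):.4f}")    # $0.0105
# ...but a million such exchanges a day adds up fast.
print(f"${request_cost(1_000, 500) * 1_000_000:,.2f}")   # $10,500.00
```

The asymmetry matters: at these sample rates, 500 output tokens cost more than twice as much as 1,000 input tokens.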
The old way with chatbots
In the early days of ChatGPT, it was a simple back-and-forth. You ask a question, it gives an answer. The cost structure here seems straightforward.
Take a real-world example: asking a lawyer to review a contract. On day one, you bring a five-page document. On day two, you return with the same document plus five more pages. By day three, you want the lawyer to remember everything from before and check new clauses. The bill grows with each visit, because the lawyer's time (just like tokens) stacks up.
Now let's apply this to AI.
Your First Question: Summarize the plot of Shakespeare's Hamlet. (Input tokens: ~10)
AI's Answer: Shakespeare's Hamlet is about a Danish prince seeking revenge for his father's murder. His uncle Claudius killed the king and married Hamlet's mother. Hamlet feigns madness, stages a play to catch his uncle's guilt, accidentally kills Polonius, and ultimately dies in a duel but not before finally avenging his father. (Output tokens: ~75)
Total for Round 1: ~85 tokens.
It's easy to miss that each response costs you twice: once for the input tokens you send, and once for the output tokens the AI generates. They're usually priced differently too, with output tokens often costing several times more than input tokens.
That seems simple enough. But there's another layer to this:
Your Follow-up: Now, compare Hamlet's indecisiveness to a modern CEO making tough decisions.
What the AI actually sees is the entire conversation. It took me a while to wrap my head around this. It sees your first question, its own summary of Hamlet, and then your new question. Everything gets fed back into the model's context window.
So that second query isn't just 20 new tokens. The AI has to process your original question (~10 tokens), its own previous response (~75 tokens), plus your new question (~20 tokens). That's ~105 tokens of input just to answer your follow-up. Then it generates maybe another 100 tokens in response. You're now at ~290 tokens total for just two exchanges. Each new turn in the conversation compounds. By the time you've had a 10-message conversation, you could easily be processing thousands of tokens per response.
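That compounding is easy to simulate. The per-message sizes below are made-up averages, but the growth pattern is real: every turn re-sends the whole history as input.

```python
def conversation_tokens(turns, question_len=50, answer_len=150):
    """Total tokens billed across a conversation, assuming every turn
    re-sends the full history as input. Message sizes are invented
    averages for illustration."""
    history = 0          # tokens accumulated in the conversation so far
    total_billed = 0     # tokens you actually pay to process
    for _ in range(turns):
        history += question_len      # your new message joins the history
        total_billed += history      # the model re-reads everything
        history += answer_len        # its answer joins the history too
        total_billed += answer_len   # ...and you pay to generate it
    return total_billed

for n in (1, 5, 10):
    print(n, conversation_tokens(n))
# 1 200
# 5 3000
# 10 11000
```

A 10-turn chat bills 11,000 tokens even though the messages themselves only add up to 2,000. The overhead grows quadratically with conversation length, which is why long chats get expensive fast.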
The new way with AI agents
This is where, I believe, the costs really start piling up. Modern AI tools are upgrading from simple chatbots to "agents." ChatGPT and Claude are both moderately agentic now, depending on the options you pick. AI agents are systems that can reason, use tools (like web browsers), and make decisions. You give a single, high-level command, and the agent goes to work, spinning that token meter like crazy.
Say you're a product manager and you prompt an AI Agent:
"Research the competitive landscape for smart gardening systems and create a detailed report with a SWOT analysis."

This seems like one prompt. But I can only imagine what's happening behind the scenes. It probably looks something like this:
It "Thinks" (Plans): The agent first has to prompt itself, something like: "Okay, to do this, I need to identify top competitors, find recent customer reviews, and analyze their marketing strategies." (Maybe 1,000 tokens right there).
It Uses Tools: It executes a web search for "best smart gardening systems 2025." It then has to read and process the top five articles it finds. (Could be 5,000 tokens).
It "Thinks" Again: It analyzes the results and decides its next move: "I need to specifically compare the features of Gardyn, Rise, and AeroGarden." (Another 500 tokens).
It Uses More Tools: It runs another search for "Gardyn vs AeroGarden customer reviews" and processes that new data. (Let's guess 4,000 tokens).
It Compiles the Data: It finally structures all this information into your neat, requested SWOT analysis.
You see one prompt and one clean report (maybe 2,000 tokens). But behind the scenes, the agent might have consumed 50,000 tokens on its internal planning, searching, and thinking.
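Tallying up that hypothetical run makes the point vivid. Every step name and token count below is invented for illustration, matching the guesses above:

```python
# Hypothetical token breakdown of one agentic request.
steps = [
    ("plan the research",        1_000),
    ("read 5 search results",    5_000),
    ("decide on a comparison",     500),
    ("read review data",         4_000),
    ("write the final report",   2_000),
]

visible_to_user = 2_000   # only the final report ever reaches you
total = sum(tokens for _, tokens in steps)

print(f"Tokens billed: {total:,}")                    # Tokens billed: 12,500
print(f"Hidden overhead: {total - visible_to_user:,} "
      f"({(total - visible_to_user) / total:.0%})")   # Hidden overhead: 10,500 (84%)
```

Even in this modest made-up run, over four-fifths of the tokens you pay for never appear on screen, and a real agent that loops, retries, and re-reads could multiply that several times over.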
The token supply problem
The fundamental constraint: GPUs are the engines that process tokens. They're the expensive, specialized chips inside data centers that do the actual computational work to transform input tokens into output tokens. These chips can only process so many tokens per second. That processing capacity is both the technical bottleneck and the economic one.
You can't just spin up more GPUs overnight. They're expensive to manufacture, expensive to run, and in incredibly short supply. The number of tokens that can be processed globally at any given moment is essentially fixed by how many GPUs exist and how fast they can work. So in a very real sense, there's a finite supply of tokens. Or more precisely, a finite capacity for processing them.
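To make the scarcity concrete, here's a back-of-the-envelope capacity sketch. The fleet size and per-GPU throughput are made-up round numbers, not real figures for any provider:

```python
def fleet_capacity_tokens_per_day(num_gpus, tokens_per_sec_per_gpu):
    """Upper bound on daily token throughput for a GPU fleet.
    Both inputs are illustrative guesses."""
    return num_gpus * tokens_per_sec_per_gpu * 86_400  # seconds per day

# Say a provider runs 100,000 GPUs, each serving ~1,000 tokens/sec
# across batched requests (invented round numbers):
capacity = fleet_capacity_tokens_per_day(100_000, 1_000)
print(f"{capacity:,} tokens/day")   # 8,640,000,000,000 tokens/day

# If 4 billion daily prompts each consume ~2,000 tokens end to end...
demand = 4_000_000_000 * 2_000
print(f"Fleet utilization: {demand / capacity:.0%}")   # Fleet utilization: 93%
```

Under these toy assumptions the fleet is already nearly saturated, and the only levers are buying more GPUs, making each one faster, or charging more per token.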
Industrial GPUs aren't cheap, either: an 8x Nvidia B200 node will set you back around $300k.
Meanwhile, demand is exploding. With over 4 billion prompts issued daily across major platforms, it starts to look like a classic scarcity problem: more demand, fixed supply. Think of it like a city with only so many taxis. As more riders flood the streets, the price of a ride goes up.
Who's paying the real bill?
This is the part that's a bit scary. Many analysts suggest that the powerful AI services we use today are being sold at a loss. Even the "expensive" $200 per month for a ChatGPT Pro or Claude Max subscription might not cover the actual compute costs of heavy users.
This was confirmed by OpenAI CEO Sam Altman, who admitted:
"People are using ChatGPT so much that OpenAI is even losing money on its $200-per-month Pro subscription"

The scale of these losses is massive. Reports suggest that OpenAI has raised projected losses to $115 billion through 2029, with running ChatGPT expected to cost around $7 billion in 2024 alone. That includes $4 billion just to rent server capacity from Microsoft.
Tech analyst Edward Zitron pushes this point even further. His take might be a bit aggressive, but it's hard to ignore his argument:
"Every single company offering any kind of generative AI service … is, from every report I can find, losing money … every single one of these apps only loses money, is actively harmful to their respective investors or owners."

I guess it's a bit like an "all-you-can-eat" buffet for $20, while some customers are eating $200 worth of food. A restaurant can only survive that business model for so long.
Maybe that's why Sam Altman has been doing the billionaire equivalent of panhandling. In January 2025, OpenAI announced the Stargate Project, a joint venture with SoftBank, Oracle, and UAE's MGX to invest $500 billion in AI infrastructure over four years. Then in March, OpenAI closed a $40 billion funding round, the largest private tech fundraise in history, valuing the company at $300 billion. By September, Altman was announcing $850 billion in planned buildouts with Oracle, Nvidia, and SoftBank.
OpenAI is making massive infrastructure commitments it can't possibly afford on its own. The company lost $5 billion in 2024 and projects only $13 billion in revenue for 2025. Yet it's promising hundreds of billions to Oracle, pledging $10 billion to Broadcom for custom chips, and committing to build data centers that cost $50-60 billion per gigawatt. The math doesn't add up unless you understand it as a desperate scramble to secure the GPU capacity needed to process all those tokens. When your entire business model depends on hardware you don't own and can't afford to buy, you make deals with anyone who has money. Governments, sovereign wealth funds, tech giants, anyone.
The risk for businesses
So why does this matter? Well, I think about the countless startups and businesses building their entire workflows on AI. Here's something that should worry them: some research suggests that 73% of companies underestimate their actual API expenses by 40-60% due to hidden costs and inefficient usage patterns.
If the true costs suddenly get passed on, I see a couple of possibilities:
- Prices could skyrocket overnight.
- Access could get restricted or cut back.
Let's just imagine a giant like Netflix relying on AI to auto-generate subtitles and personalized recommendations. If its AI costs suddenly spike 5x, its entire streaming model becomes more expensive overnight. If a giant like that could be rattled, think about the fragility of smaller startups. An unexpected surge in costs could wipe out their margins instantly.
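A quick back-of-the-envelope sketch, with entirely hypothetical figures, shows how fast a cost spike like that eats margins:

```python
def margin_after_spike(revenue, ai_cost, other_cost, spike=5.0):
    """Profit margin before and after AI costs multiply by `spike`.
    All figures are hypothetical."""
    before = (revenue - ai_cost - other_cost) / revenue
    after = (revenue - ai_cost * spike - other_cost) / revenue
    return before, after

# A startup with $1M revenue, $100k AI spend, $700k other costs:
before, after = margin_after_spike(1_000_000, 100_000, 700_000)
print(f"{before:.0%} margin -> {after:.0%}")   # 20% margin -> -20%
```

A business spending just 10% of revenue on AI goes from a healthy 20% margin to losing money outright if those costs quintuple, with nothing else changing.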
Bottom line
The magic of AI is real (within its many limitations), and its potential is great. But as I watch us push for more capable, agentic systems, I can't help but feel we must be more conscious of the hidden economy powering it all. The conversation is usually about what AI can do. Maybe it's time we start talking about what we can afford for AI to do.