We spent $47K running AI agents in production


Teja Kusireddy

Multi-agent systems are the future. Agent-to-Agent (A2A) communication and Anthropic’s Model Context Protocol (MCP) are revolutionary. But there’s a $47,000 lesson nobody’s talking about: the infrastructure layer doesn’t exist yet, and it’s costing everyone a fortune.

The $47,000 Wake-Up Call

Last year, our team deployed what we thought was a simple multi-agent system to production. Four LangChain agents coordinating via A2A to help users research market data.

Week 1: $127 in API costs. Perfect.

Week 2: $891. Hmm, usage growing.

Week 3: $6,240. Wait, what?

Week 4: $18,400. Panicking.

Total damage: $47,000 before we finally pulled the plug.

The culprit? Two agents got stuck in an infinite conversation loop. For 11 days. While we slept. While we worked. While we believed “it’s just running smoothly.”

This is the state of multi-agent systems in 2025.

And we need to talk about it.

Why Multi-Agent Systems Are Inevitable (And Why That’s Terrifying)

Single AI models hit a wall. GPT-4, Claude, Gemini: they're incredible, but they're generalists. Real-world problems need specialists working together.


The shift is already happening:

  • AutoGPT pioneered autonomous agents
  • LangChain made agent frameworks accessible
  • CrewAI popularized role-based agent teams
  • OpenAI just released Swarm for agent orchestration
  • Anthropic launched MCP to standardize context

But here’s the uncomfortable truth: everyone’s building the house before laying the foundation.

What is Agent-to-Agent (A2A) Communication? (The Simple Version)

Think of A2A as Slack for AI agents.

Your agents need to:

  • Send messages to each other
  • Share context without losing information
  • Coordinate who does what
  • Handle failures gracefully
  • Not create infinite loops that cost you $47,000
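
None of this has a standard wire format yet. As a sketch of the loop-guard idea in that last bullet, here's a hypothetical message envelope that carries a hop count so a runaway back-and-forth fails fast instead of running for 11 days (every name here is illustrative, not from any A2A spec):

```python
from dataclasses import dataclass, field
import uuid

MAX_HOPS = 10  # hard ceiling on back-and-forth exchanges

@dataclass
class A2AMessage:
    sender: str
    recipient: str
    content: str
    hops: int = 0
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def reply(self, content: str) -> "A2AMessage":
        """Build a reply, incrementing the hop count so loops surface."""
        if self.hops + 1 > MAX_HOPS:
            raise RuntimeError(
                f"conversation {self.trace_id} exceeded {MAX_HOPS} hops"
            )
        return A2AMessage(
            sender=self.recipient,
            recipient=self.sender,
            content=content,
            hops=self.hops + 1,
            trace_id=self.trace_id,
        )
```

A shared `trace_id` also gives you something to grep for when you're reconstructing a conversation after the fact.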

The Dream vs. Reality

What you think A2A looks like: agents passing clean, structured messages in an orderly sequence.

What A2A actually looks like in production: retries, lost context, crossed wires, and conversation loops.

Enter MCP: Anthropic’s “We Need Standards” Moment

In November 2024, Anthropic said “enough chaos” and released the Model Context Protocol (MCP).

Think of it as USB-C for AI agents. Before USB-C, every device had a different charger. Nightmare. After USB-C, one cable rules them all.

Before MCP, every agent-to-datasource connection was a bespoke, one-off integration. After MCP, one protocol covers them all.

MCP in 30 Seconds

{
  "name": "company_knowledge_base",
  "description": "Search internal docs",
  "capabilities": {
    "resources": ["read", "search"],
    "tools": ["semantic_search", "keyword_search"]
  }
}

That’s it. Your agent now has access to your entire knowledge base. No custom code. No manual prompt engineering. Just works.
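
Because the descriptor is plain JSON, a client can sanity-check a server's advertised tools before routing calls to it. A minimal sketch (the `supports_tool` helper is illustrative, not part of any MCP SDK):

```python
import json

# The same capability descriptor as above, parsed client-side.
descriptor = json.loads("""
{
  "name": "company_knowledge_base",
  "description": "Search internal docs",
  "capabilities": {
    "resources": ["read", "search"],
    "tools": ["semantic_search", "keyword_search"]
  }
}
""")

def supports_tool(desc: dict, tool: str) -> bool:
    """Check whether a server descriptor advertises a named tool."""
    return tool in desc.get("capabilities", {}).get("tools", [])

print(supports_tool(descriptor, "semantic_search"))  # True
print(supports_tool(descriptor, "sql_query"))        # False
```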

The Killer Combo: A2A + MCP

When agents can talk to each other (A2A) AND access any context they need (MCP), something magical happens: specialized agents start coordinating across your entire stack.

Real-world example:

from crewai import Agent, Task, Crew
from mcp import MCPClient

# MCP gives agents superpowers
mcp = MCPClient(servers=[
    "mcp://sales-db.company.com",
    "mcp://knowledge-base.company.com",
    "mcp://analytics.company.com",
])

# Agents coordinate via A2A
sales_agent = Agent(
    role="Sales Analyst",
    goal="Pull Q4 sales data",
    context_protocol=mcp,
    tools=mcp.get_tools("sales_*"),
)
research_agent = Agent(
    role="Market Researcher",
    goal="Find competitor data",
    context_protocol=mcp,
    tools=mcp.get_tools("web_*"),
)
analyst_agent = Agent(
    role="Strategic Analyst",
    goal="Compare and synthesize",
    context_protocol=mcp,
)

# They work together (sales_task, research_task, and
# analysis_task are Task objects defined elsewhere)
crew = Crew(
    agents=[sales_agent, research_agent, analyst_agent],
    tasks=[sales_task, research_task, analysis_task],
    process="sequential",  # A2A coordination
)
result = crew.kickoff()

You just built a three-agent system with access to three different data sources. In 30 lines of code.

This would have been impossible five years ago.

The Problem: Production is Where Dreams Die

You’ve built your multi-agent masterpiece. Local testing works perfectly. You’re ready to change the world.

Then you deploy to production.


Seven Production Disasters (Based on Real Stories)

1. The Infinite Loop ($47K)

# Agent A asks Agent B for help
# Agent B asks Agent A for clarification
# Agent A asks Agent B for help
# Agent B asks Agent A for clarification
# [11 days later]
# Your AWS bill arrives
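
A turn cap plus duplicate detection would have caught this in seconds. A toy sketch, with two stand-in functions playing our looping agents:

```python
def run_conversation(agent_a, agent_b, opening: str, max_turns: int = 8):
    """Alternate two agents, aborting on a turn cap or a repeated exchange."""
    seen = set()
    message, turns = opening, 0
    while turns < max_turns:
        message = agent_a(message) if turns % 2 == 0 else agent_b(message)
        if message in seen:  # the same text twice means a loop, not progress
            raise RuntimeError(f"loop detected after {turns + 1} turns")
        seen.add(message)
        turns += 1
    return message

# Two stand-ins that do exactly what our agents did: ask each other forever.
ask_for_help = lambda _m: "Can you help with this?"
ask_to_clarify = lambda _m: "Could you clarify what you need?"
```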

2. The Context Truncation

Agent A: "User wants to book a flight to Paris on May 15th,
returning May 22nd, business class, window seat..."

[MCP context hits token limit]
Agent B receives: "User wants to book a flight to"
Agent B: "Book a flight to... where?"
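
One defensive option is to refuse to forward context that won't fit, instead of letting it truncate silently. A sketch using a crude characters-per-token estimate (a real system would use the model's actual tokenizer):

```python
def pack_context(message: str, budget_tokens: int, chars_per_token: int = 4) -> str:
    """Fail loudly when context exceeds the budget.

    Silent truncation is what loses the second half of the booking request;
    raising forces a summarize-or-split decision upstream.
    """
    est_tokens = len(message) // chars_per_token
    if est_tokens > budget_tokens:
        raise ValueError(
            f"context is ~{est_tokens} tokens, budget is {budget_tokens}: "
            "summarize or split instead of truncating"
        )
    return message
```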

3. The Cascade Failure

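
The shape of this one: one agent in a sequential chain dies, and everything downstream stalls or times out behind it. A minimal sketch of failing fast and naming the broken stage (the stage names and the simulated outage are illustrative):

```python
def run_pipeline(stages, payload):
    """Run agents in sequence; fail fast and name the stage that broke."""
    for name, stage in stages:
        try:
            payload = stage(payload)
        except Exception as exc:
            raise RuntimeError(f"pipeline halted at stage '{name}': {exc}") from exc
    return payload

def enrich(rows):
    raise TimeoutError("MCP server down")  # simulate an upstream outage

stages = [
    ("fetch", lambda rows: rows + ["sales data"]),
    ("enrich", enrich),
    ("report", lambda rows: rows + ["summary"]),  # never reached
]
```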

4. The Silent Killer

# Agent runs successfully!
print("Task completed")

# Reality check:
actual_result = agent.output
# actual_result = "I apologize, but I couldn't complete that task
# due to insufficient context..."
# Nobody noticed because nobody's reading agent outputs
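
The fix is mechanical: treat apologetic non-answers as failures. A sketch with a few illustrative failure markers (tune the list to the models you run):

```python
FAILURE_MARKERS = ("i apologize", "couldn't complete", "insufficient context")

def verify_output(output: str) -> str:
    """Reject empty or apologetic outputs instead of counting them as success."""
    lowered = output.lower()
    if not output.strip() or any(marker in lowered for marker in FAILURE_MARKERS):
        raise RuntimeError(f"agent returned a non-answer: {output[:80]!r}")
    return output
```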

5. The Token Explosion

Expected: 1,000 tokens per request
Reality: 45,000 tokens per request

Reason: Agent keeps loading entire documentation
into context every single time
Cost: $1,350/day instead of $30/day
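
Caching the document per process instead of per request turns 45 loads into one. A minimal sketch (the loader is a stand-in for whatever fetches your docs):

```python
class ContextCache:
    """Load a large document once per process instead of once per request."""

    def __init__(self, loader):
        self._loader = loader
        self._store = {}
        self.loads = 0  # how many times we actually hit the source

    def get(self, key: str) -> str:
        if key not in self._store:
            self._store[key] = self._loader(key)
            self.loads += 1
        return self._store[key]

cache = ContextCache(loader=lambda key: f"<full contents of {key}>")
for _ in range(45):                 # 45 requests...
    doc = cache.get("api_reference")  # ...but only one real load
```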

6. The Coordination Deadlock

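
The classic shape: Agent A blocks waiting for a message from Agent B, while B blocks waiting for A, and neither ever sends. The cheapest defense is to never wait without a timeout. A sketch using plain queues (the timeout value is illustrative):

```python
import queue

def wait_for(inbox: queue.Queue, timeout: float = 0.1):
    """Block on a peer's message, but never forever."""
    try:
        return inbox.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("peer never responded; breaking the deadlock") from None

# Agent A waits on its inbox for B's reply while B waits for A's:
# neither ever sends, so without timeouts both would block forever.
a_inbox, b_inbox = queue.Queue(), queue.Queue()
```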

7. The “It Worked on My Machine”

Local: 500ms response time
Staging: 800ms response time
Production: 47 seconds (users leave)

Reason: You have 1 MCP server.
1,000 agents are hammering it.
It's dying.
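
Client-side concurrency capping is one stopgap: bound how many calls can hit the single MCP server at once, and shed load past the cap. A sketch with a semaphore (the `MCPGate` name and the limits are ours, not any library's API):

```python
import threading

class MCPGate:
    """Cap concurrent calls to one MCP server so it degrades instead of dying."""

    def __init__(self, max_concurrent: int = 32):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        # Back off instead of piling yet another request onto a sick server.
        if not self._slots.acquire(timeout=5):
            raise TimeoutError("MCP server saturated; shedding this request")
        try:
            return fn(*args)
        finally:
            self._slots.release()

gate = MCPGate(max_concurrent=2)
```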

The Brutal Truth About Multi-Agent Infrastructure

Let me show you what running agents in production actually requires: message queues, conversation tracing, context caching, cost limits, and per-agent health checks.

Nobody’s talking about this because most people haven’t deployed agents at scale yet.

But they will. Soon. And they’ll learn these lessons the expensive way.

What Agent Infrastructure Should Look Like (But Doesn’t Exist Yet)

Imagine deploying your multi-agent system like this:

$ git push origin main

✓ Detected: LangChain multi-agent system
✓ Found: 4 agents with A2A coordination
✓ Identified MCP servers: 3
✓ Building optimized containers...
✓ Setting up message queue...
✓ Configuring cost limits...
✓ Enabling conversation tracing...
Deployed to: https://your-agent.prod.com
Dashboard: https://dashboard.prod.com
- Agent health: Good
- A2A latency: 120ms avg
- Token usage: 0 (no traffic yet)
- Spend today: $0.00

Then monitor it in real time: agent health, A2A latency, token usage, and running spend, all from one dashboard.

And get intelligent alerts:

ALERT: Agent B response time increasing
Current: 450ms (3x baseline)
Likely cause: MCP server overload
Suggestion: Enable context caching

TIP: You're using 15K tokens/day for docs lookup
Estimated savings with caching: $140/month
Enable? [Y/n]

This is what we need. This is what doesn’t exist. Yet.

The Infrastructure Gap (Visualized)

Web developers take infrastructure for granted because it’s been solved for 20 years.

Agent developers are living in 2005, manually configuring everything.

Real-World Architecture: What It Takes Today

After the $47K disaster, we couldn’t just redeploy and hope.

I spent 6 weeks building proper infrastructure from scratch. Not because I wanted to. Because I had no choice.

Here’s every single piece I had to manually configure, wire together, and maintain just to run agents safely in production: a message queue, conversation tracing, context caching, cost tracking with hard limits, health checks, and alerting.

Time to build this: 6 weeks of my life I’ll never get back

Lines of infrastructure code: ~3,500 (none of which builds actual agent features)

Monthly cost: ~$800 (before a single agent runs)

What it should have been: git push origin main

The Coming Wave

Here’s what’s about to happen in the next 12 months: more teams ship agents to production, the surprise bills arrive, and demand for real infrastructure explodes.

We’re at the “$47K bills go viral” stage.

The infrastructure layer is about to become the most important piece of the AI stack.

What We’re Building at GetOnStack

We spent $47,000 learning these lessons so you don’t have to.

We’re building production infrastructure specifically for multi-agent systems:

One-Command Deployment

$ npx getonstack deploy

Analyzing repository...
✓ Framework: LangChain
✓ Agents detected: 4
✓ A2A coordination: Yes
✓ MCP servers: 2

Building infrastructure...
✓ Message queue configured
✓ Context caching enabled
✓ Cost limits set ($100/day)
✓ Monitoring active

Deployed to production!
URL: https://agent-xyz.getonstack.app
Dashboard: https://dash.getonstack.app

Status:
Agents: 4/4 healthy
A2A latency: 85ms
MCP cache hit: 0% (warming up)
Cost today: $0.00

Real-Time Observability

Every deployment gets a live dashboard: agent health, A2A latency, MCP cache hit rate, and running cost for the day.

Built-in Safeguards

# Automatic protections
safeguards = {
    "max_cost_per_day": 100,          # Hard limit
    "max_tokens_per_request": 10000,  # Prevent explosions
    "max_loop_iterations": 10,        # Stop infinite loops
    "timeout_per_agent": 30,          # No hanging
    "alert_at_threshold": 0.8,        # Early warning
}

# Real-time cost tracking
GET /api/costs/realtime
{
    "spent_today": 47.32,
    "limit": 100.00,
    "projection_eod": 68.50,
    "status": "healthy"
}
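
For illustration, here's how a `max_cost_per_day` limit with an `alert_at_threshold` warning might be enforced in process. This is a sketch of the idea, not GetOnStack's actual implementation:

```python
class CostGuard:
    """Accumulate spend for the day and refuse work past a hard limit."""

    def __init__(self, max_cost_per_day: float, alert_at: float = 0.8):
        self.limit = max_cost_per_day
        self.alert_at = alert_at
        self.spent = 0.0

    def charge(self, cost: float) -> str:
        """Record a request's cost; warn near the limit, refuse beyond it."""
        if self.spent + cost > self.limit:
            raise RuntimeError(f"daily budget ${self.limit:.2f} exhausted")
        self.spent += cost
        return "warning" if self.spent >= self.limit * self.alert_at else "healthy"

guard = CostGuard(max_cost_per_day=100)
```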

Join the Private Beta

We’re accepting 50 teams to help shape the platform.

If you’re building with:

  • LangChain multi-agent systems
  • CrewAI agent teams
  • Custom A2A architectures
  • MCP integrations

We’ll help you:

  • Deploy to production in minutes
  • Avoid the $47K mistakes
  • Scale without breaking things
  • Actually sleep at night

Apply for early access →

What you get:

  • White-glove onboarding
  • Direct engineering support
  • Influence on roadmap
  • Preferential pricing for life

The Future is Multi-Agent. The Infrastructure Needs to Exist.

A2A communication is unlocking coordination between specialized agents.

MCP is standardizing how agents access context and tools.

But without production-ready infrastructure, we’re building skyscrapers on sand.

The next 12 months will define who wins the agent infrastructure space.

The question isn’t “Will I need this?”

The question is “Will I learn this the $47K way or the easy way?”

Let’s Build the Future Together

Twitter: @getonstack | LinkedIn: GetOnStack
Email: [email protected]

Got war stories? Killed agents in production? Let’s hear it in the comments.

The agent infrastructure layer is being built right now.

Be part of it.


I’m Teja Kusireddy. I build things that work, break things that shouldn’t, and write about what I learn in between.
Most of my ideas start as debug logs and end as stories.
→ Follow for honest takes on tech, ambition, and staying sane while chasing both.
