Building the cheapest AI voice agent possible ($0.28 per hour)

1 hour ago 2

Welcome to the cheapest, lowest-latency, and best performing AI voice agent possible today.

This stack achieves:

Total costs as low as $0.28 per hour ($0.0046 per minute)
Latency between 600-800ms from end of speech to first audio frame
State-of-the-art voice performance thanks to inworld.ai

From a cost perspective, the Hypercheap stack is:

32x cheaper than OpenAI Realtime
20x cheaper than Elevenlabs Voice Agents
10x cheaper than Vapi

Stack: Fennec (Realtime ASR) → Baseten (LLM via OpenAI-compatible API) → Inworld (streamed TTS)

videodemo.mp4

1) Create accounts & grab keys

A. Fennec ASR (Realtime speech-to-text)

Go to Fennec and create a free account (10 hours included): https://fennec-asr.com
Create your first API key in the dashboard.

You’ll paste that key into your .env as FENNEC_API_KEY.

B. Baseten (LLM — OpenAI-compatible)

Sign up for Baseten (1 dollar of inference included): https://app.baseten.co
Click "Model APIs" and "Add Model API" and create one for "Qwen3 235B A22B"
After creating, click "API Endpoint" and generate an API key.
This setup calls Baseten via the OpenAI-compatible endpoint. The default base URL in this repo is https://inference.baseten.co/v1 and the default model is Qwen/Qwen3-235B-A22B-Instruct-2507.

You’ll paste the API key as BASETEN_API_KEY into your .env. Keep the provided base URL and model (or swap to another more performant Baseten model if you like).

C. Inworld (Text‑to‑Speech)

Create an Inworld account and open the TTS page: https://inworld.ai/tts
In the Portal, generate an API key (Base64) and copy the Base64 value: https://portal.inworld.ai
(Optional) Choose a voice and set your defaults (model inworld-tts-1, 48 kHz, etc.). You can also clone voices with the inworld platform at no extra cost.

The backend expects the Base64 form for Basic auth. In the portal there’s a “Copy base64” button—use that.

Paste the Base64 API key as INWORLD_API_KEY. You can also set INWORLD_VOICE_ID (e.g. Olivia).

Create voice_backend/.env (or copy from voice_backend/.env.example) and fill the values you just collected:

# Fennec ASR FENNEC_API_KEY=... FENNEC_SAMPLE_RATE=16000 FENNEC_CHANNELS=1 # Baseten (OpenAI-compatible) BASETEN_API_KEY=... BASETEN_BASE_URL=https://inference.baseten.co/v1 BASETEN_MODEL=Qwen/Qwen3-235B-A22B-Instruct-2507 # Inworld TTS INWORLD_API_KEY=... INWORLD_MODEL_ID=inworld-tts-1 INWORLD_VOICE_ID=Olivia INWORLD_SAMPLE_RATE=48000

For the frontend, create voice_frontend/.env.local and point to your backend WebSocket:

VITE_AGENT_WS_URL=ws://localhost:8000/ws/agent

Backend

cd voice_backend pip install -r requirements.txt uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Frontend

cd voice_frontend npm install npm run dev

Open http://localhost:5173 and click the mic button to start chatting.

Build the container and run it with your .env:

docker build -t hypercheap-agent:latest -f voice_backend/Dockerfile . docker run --env-file voice_backend/.env -p 8000:8000 hypercheap-agent:latest

If you also want the built UI served by FastAPI, run npm run build in voice_frontend first — it outputs to voice_backend/app/static.

5) Cost Breakdown (how it’s ~$0.28/hr)

ASR (Fennec, streaming): as low as $0.11/hr on scale tier (or $0.16/hr starter), with a generous free trial
LLM (Baseten Qwen3-235B-A22B): $0.22 / 1M input tokens and $0.80 / 1M output tokens
TTS (Inworld): $5.00 / 1M characters, which they estimate as ≈$0.25 per audio‑hour of generated speech.

Example: In a typical chat, the AI speaks ~40–60% of the time.

• Fennec ASR: ~$0.11/hr • Inworld TTS: $0.25 × 0.5 = $0.125/hr (assumes 30 min of AI speech per session hour) • Baseten LLM tokens: usually ~$0.01–$0.03/hr at short replies

Total: ~$0.25–$0.35 per session hour

Actual costs vary with ASR plan, talk ratio, and how verbose the model is. The defaults in this repo (short replies, low max tokens) are tuned to keep costs as low as possible.

Swap voices (Inworld) or LLM models (Baseten) by changing the env vars.
Tune VAD in voice_backend/app/agent/fennec_ws.py for faster/longer turns. It is extremely aggressive by default, which can cut off slow speakers.
Swap LLMs in Baseten for better intelligence at the price of increased cost and higher latency
Add in the audio markups into the LLM prompt, and switch the model to the Inworld inworld-tts-1-max model for increased realism (at double the cost and ~50% increased latency).
Adjust history length in voice_backend/session.py by altering this: self._max_history_msgs. This will increase costs.

Read Entire Article