.... .- .. --.. . .-.. .- -... ... ✨🎤 pip install spoken 🎤✨ .. - .----. ... / .- / -... .- -.. / -.. .- -.--
spoken provides a single abstraction for a variety of audio foundation models. It is primarily designed for large-scale evaluation/benchmarking of realtime speech-to-speech models, but it can also be used as a drop-in inference library.
Large audio models operate on audio tokens rather than transcribed text. This enables low-latency, streaming conversational audio agents that generate audio directly, end-to-end. Promising as they are, these models require non-trivial configuration and state management, because the major providers' interfaces differ significantly.
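To make the idea of a single abstraction over differing provider interfaces concrete, here is a minimal sketch. It is not spoken's actual API: `SpeechToSpeechModel`, `EchoModel`, `collect`, and `run_demo` are hypothetical names, and the echo adapter merely stands in for a real provider connection.

```python
import asyncio
from typing import AsyncIterator, Protocol


class SpeechToSpeechModel(Protocol):
    """Hypothetical common interface (an assumption, not spoken's real API)."""

    def stream(self, audio_chunks: list[bytes]) -> AsyncIterator[bytes]:
        ...


class EchoModel:
    """Stand-in provider adapter that simply echoes the input audio back."""

    async def stream(self, audio_chunks: list[bytes]) -> AsyncIterator[bytes]:
        for chunk in audio_chunks:
            await asyncio.sleep(0)  # yield control, as a network round trip would
            yield chunk


async def collect(model: SpeechToSpeechModel, audio_chunks: list[bytes]) -> bytes:
    """Drive any adapter through the same call, regardless of provider."""
    out = bytearray()
    async for chunk in model.stream(audio_chunks):
        out.extend(chunk)
    return bytes(out)


def run_demo() -> bytes:
    """Run the echo adapter on two small dummy audio chunks."""
    return asyncio.run(collect(EchoModel(), [b"\x00\x01", b"\x02\x03"]))
```

In this sketch, each provider would get its own adapter class that hides its session setup and state handling behind the same `stream` method, so evaluation code only ever calls `collect`.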
As far as we know, spoken supports every provider's speech-to-speech models:
- OpenAI Realtime
  - gpt-4o-realtime-preview-2024-12-17
  - gpt-4o-mini-audio-preview-2024-12-17 [coming soon, not part of realtime API]
- Gemini Multimodal Live
  - gemini-2.5-flash-preview-native-audio-dialog
  - gemini-2.5-flash-exp-native-audio-thinking-dialog
- Amazon Nova Sonic (pip install spoken[nova])
  - amazon.nova-sonic-v1:0
- Benchmarking TTFT (Time-To-First-Token) Latency
- OpenAI System Prompt
- more interesting things coming soon...
- Simply run pip install spoken
- For Amazon Nova Sonic support: Python 3.12+ is required, along with pip install spoken[nova] and the PortAudio headers (portaudio.h); on OS X, run brew install portaudio
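The TTFT (time-to-first-token) benchmark listed above comes down to timing how long a streaming response takes to produce its first chunk. The sketch below shows that measurement in isolation. It does not use spoken's API: `fake_model_stream` is a hypothetical stand-in for a provider's audio stream, and `measure_ttft` and `run_ttft_demo` are illustrative names.

```python
import asyncio
import time
from typing import AsyncGenerator


async def fake_model_stream(
    delay_s: float, n_chunks: int = 3
) -> AsyncGenerator[bytes, None]:
    """Hypothetical stand-in for a provider's streaming audio response."""
    await asyncio.sleep(delay_s)  # simulated wait before the first chunk
    for _ in range(n_chunks):
        yield b"\x00" * 320  # dummy audio payload
        await asyncio.sleep(0)


async def measure_ttft(stream: AsyncGenerator[bytes, None]) -> float:
    """Return the seconds elapsed until the stream yields its first chunk."""
    start = time.perf_counter()
    try:
        await stream.__anext__()  # wait for the first chunk only
    except StopAsyncIteration:
        raise RuntimeError("stream ended without producing any output") from None
    finally:
        await stream.aclose()  # release the stream; later chunks are not needed
    return time.perf_counter() - start


def run_ttft_demo(delay_s: float = 0.05) -> float:
    """Measure TTFT against the simulated stream."""
    return asyncio.run(measure_ttft(fake_model_stream(delay_s)))
```

For a real benchmark, the clock would start when the final input audio is sent rather than when the stream object is first polled, and the measurement would be repeated many times to report a distribution instead of a single value.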