LiteLLM for Native Audio Models


              .... .- .. --.. . .-.. .- -... ...       ✨🎤 pip install spoken 🎤✨       .. - .----. ... / .- / -... .- -.. / -.. .- -.--


spoken provides a single abstraction for a variety of audio foundation models. It is primarily designed for large-scale evaluation/benchmarking of realtime speech-to-speech models, but it can also be used as a drop-in inference library.

```python
# os.environ['LOG_LEVEL'] = 'DEBUG'  # detailed client/server state management logging
import spoken

model = spoken("gpt-4o-realtime-preview-2024-12-17", "examples/scooby.wav")
input_asr, output_asr, output_audio = await model.run()

output_asr                 # "That's quite the story..."
len(output_audio)          # 8549ms
model.output_audio_tokens  # 254
```

Large audio models operate on audio tokens rather than transcribed text. This enables low-latency, end-to-end streaming conversational audio agents that generate audio directly. Although promising and exciting, using these models requires non-trivial configuration and state management, because the major providers differ significantly in their interfaces.
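To make the interface differences concrete: realtime speech-to-speech APIs exchange raw audio chunks rather than text, and each provider has its own expected encoding (OpenAI's Realtime API, for instance, uses 16-bit mono PCM at 24 kHz). A minimal sketch of the bookkeeping involved, assuming that PCM16/24 kHz format (the helper below is illustrative, not part of spoken's API):

```python
# Assumed format: 16-bit (2-byte) mono PCM at 24 kHz, as used by
# OpenAI's Realtime API; other providers may differ.
SAMPLE_RATE = 24_000
BYTES_PER_SAMPLE = 2

def pcm16_duration_ms(audio: bytes, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of a mono PCM16 byte buffer, in milliseconds."""
    return len(audio) / (sample_rate * BYTES_PER_SAMPLE) * 1000

# One second of silence: 24,000 samples * 2 bytes each.
chunk = b"\x00\x00" * SAMPLE_RATE
pcm16_duration_ms(chunk)  # 1000.0
```

spoken hides this kind of per-provider encoding and chunking detail behind its single abstraction.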

As far as we know, spoken supports the speech-to-speech models of every major provider:

  • OpenAI Realtime
    • gpt-4o-realtime-preview-2024-12-17
    • gpt-4o-mini-audio-preview-2024-12-17 [coming soon, not part of realtime API]
  • Gemini Multimodal Live
    • gemini-2.5-flash-preview-native-audio-dialog
    • gemini-2.5-flash-exp-native-audio-thinking-dialog
  • Amazon Nova Sonic (pip install spoken[nova])
    • amazon.nova-sonic-v1:0

  • Simply run pip install spoken
    • Python 3.12+ is required
    • For Amazon Nova Sonic support, run pip install spoken[nova]; it also requires portaudio.h (on OS X: brew install portaudio)