FluidAudio is a Swift framework for on-device speaker diarization and audio processing, designed to maximize performance per watt by leveraging CoreML models exclusively. Optimized for Apple's Neural Engine, it delivers faster and more efficient processing than CPU or GPU alternatives.
Built to address the need for an open-source solution capable of real-time workloads on iOS and older macOS devices, FluidAudio fills a gap where existing solutions either rely on CPU-only models or remain closed-source behind paid licenses. Since speaker diarization and identification are among the most popular features for voice AI applications, we believe these capabilities should be freely available.
Our testing shows that the CoreML versions run inference significantly more efficiently than their ONNX counterparts, which makes them suitable for real-time transcription use cases.
- State-of-the-Art Diarization: Research-competitive speaker separation with optimal speaker mapping
- Apple Neural Engine Optimized: Models run efficiently on Apple's ANE for maximum performance with minimal power consumption
- Speaker Embedding Extraction: Generate speaker embeddings for voice comparison and clustering, which you can also use for speaker identification
- CoreML Models: Native Apple CoreML backend with custom-converted models optimized for Apple Silicon
- Open-Source Models: All models are publicly available on HuggingFace under permissive licenses, converted and optimized by our team
- Real-time Processing: Designed for real-time workloads but also works for offline processing
- Cross-platform: Full support for macOS 13.0+, iOS 16.0+, and any Apple Silicon device
Add FluidAudio to your project using Swift Package Manager:
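A minimal `Package.swift` sketch is given here under stated assumptions. The repository URL is inferred from the project's DeepWiki path, the product name `FluidAudio` is assumed to match the framework name, and both `MyApp` and the `branch: "main"` requirement are placeholders. Check the repository for the current URL and pin to a tagged release before relying on it.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp", // hypothetical package name
    // Minimum OS versions taken from the feature list (macOS 13.0+, iOS 16.0+).
    platforms: [.macOS(.v13), .iOS(.v16)],
    dependencies: [
        // Assumed repository URL; the branch requirement is a placeholder,
        // prefer a tagged release once you know which one you need.
        .package(url: "https://github.com/FluidInference/FluidAudio.git", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            // Assumes the library product is named "FluidAudio".
            dependencies: [.product(name: "FluidAudio", package: "FluidAudio")]
        ),
    ]
)
```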
See the public DeepWiki docs: https://deepwiki.com/FluidInference/FluidAudio
The repo is indexed by DeepWiki, so the DeepWiki MCP server already gives your coding tool access to the docs.
For most clients:
For Claude Code:
Coming Soon:
- Voice Activity Detection (VAD): Voice activity detection capabilities
- ASR Models: Support for open-source ASR models
- System Audio Access: Tap into system audio via CoreAudio
AMI Benchmark Results (Single Distant Microphone) using a subset of the files:
- DER: 17.7% - Competitive with Powerset BCE 2023 (18.5%)
- JER: 28.0% - Better than x-vector clustering (28.7%); EEND 2019 reports 25.3% (lower is better)
- RTF: 0.02x - Processes audio about 50x faster than real time
- Efficient Computing: Runs on Apple Neural Engine with zero performance trade-offs
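To make the metrics above concrete, the sketch below shows the standard arithmetic behind RTF and DER. It is not FluidAudio code: the function names are hypothetical, and the input numbers are illustrative values, not measurements from the benchmark.

```swift
/// Real-time factor: processing time divided by audio duration.
/// Values below 1.0 mean the audio is processed faster than real time.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audio duration must be positive")
    return processingSeconds / audioSeconds
}

/// Speedup over real time is the reciprocal of the RTF.
func speedup(fromRTF rtf: Double) -> Double {
    precondition(rtf > 0, "RTF must be positive")
    return 1.0 / rtf
}

/// Diarization error rate: missed speech, false-alarm speech, and
/// speaker-confusion time, divided by total reference speech time.
func diarizationErrorRate(missed: Double,
                          falseAlarm: Double,
                          confusion: Double,
                          referenceSpeech: Double) -> Double {
    precondition(referenceSpeech > 0, "reference speech time must be positive")
    return (missed + falseAlarm + confusion) / referenceSpeech
}

// Illustrative numbers only: 12 s of compute for a 600 s recording.
let rtf = realTimeFactor(processingSeconds: 12.0, audioSeconds: 600.0)
print("RTF:", rtf, "speedup:", speedup(fromRTF: rtf))

// Illustrative numbers only: error durations in seconds per 100 s of reference speech.
let der = diarizationErrorRate(missed: 6, falseAlarm: 4, confusion: 5, referenceSpeech: 100)
print("DER:", der)
```

An RTF of 0.02 therefore corresponds to the 50x figure quoted above, since 1 / 0.02 = 50.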
FluidAudio powers production applications including:
- Slipbox: Privacy-first meeting assistant for real-time conversation intelligence
- Whisper Mate: Transcribes movies and audio to text locally. Records and transcribes in real time from speakers or system apps. 🔒 All processing runs locally on your Mac with the Whisper AI model.
Make a PR if you want to add your app!
Customize behavior with DiarizerConfig:
FluidAudio includes a powerful command-line interface for benchmarking and audio processing:
- DiarizerManager: Main diarization class
- performCompleteDiarization(_:sampleRate:): Process audio and return speaker segments
- compareSpeakers(audio1:audio2:): Compare similarity between two audio samples
- validateAudio(_:): Validate audio quality and characteristics
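This section does not say how `compareSpeakers(audio1:audio2:)` scores similarity. A common way to compare two speaker embeddings is cosine similarity, and the sketch below illustrates that general technique only. The function name `cosineSimilarity` and the example vectors are hypothetical and are not part of the FluidAudio API.

```swift
/// Cosine similarity between two equal-length embedding vectors.
/// Returns a value in [-1, 1], or nil if the vectors are empty,
/// differ in length, or either has zero magnitude.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float? {
    guard a.count == b.count, !a.isEmpty else { return nil }

    var dot: Float = 0
    var squaredNormA: Float = 0
    var squaredNormB: Float = 0
    for (x, y) in zip(a, b) {
        dot += x * y
        squaredNormA += x * x
        squaredNormB += y * y
    }

    // Product of the two vector magnitudes.
    let denominator = (squaredNormA * squaredNormB).squareRoot()
    guard denominator > 0 else { return nil }
    return dot / denominator
}

// Hypothetical toy embeddings; real speaker embeddings have many more dimensions.
let speakerA: [Float] = [1, 2]
let speakerB: [Float] = [-1, -2]
if let score = cosineSimilarity(speakerA, speakerB) {
    print("similarity:", score)
}
```

Scores near 1 suggest the same speaker and lower scores suggest different speakers. Any decision threshold depends on the embedding model and needs to be tuned on your own data.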
Apache 2.0 - see LICENSE for details.
This project builds upon the excellent work of the sherpa-onnx project for speaker diarization algorithms and techniques. We extend our gratitude to the sherpa-onnx contributors for their foundational work in on-device speech processing.
Pyannote: https://github.com/pyannote/pyannote-audio
WeSpeaker: https://github.com/wenet-e2e/wespeaker


