Show HN: Lightning-SimulWhisper: Real-Time ASR for Apple Silicon


The fastest, most power-efficient real-time local transcription on your Apple Silicon devices ✨

Zero PyTorch dependencies ⛔

18x speedup on encoding, 15x speedup on decoding ⚡

Lightning-SimulWhisper implements the Whisper model for simultaneous transcription using MLX (Apple's machine learning framework) and CoreML for optimal performance on Apple Silicon devices. It uses the AlignAtt policy for streaming speech recognition.

Using the original SimulStreaming project, I could barely run the base model in real time. Now I can run the medium and even the large-v3-turbo models in real time on my M2 MacBook Pro.

The MLX-only version consumes way too much power, so using the CoreML encoder is recommended.

CoreML Encoder: While the encoder speedup is dramatic (up to 18x faster), the overall inference time improvement is more modest because the decoder still runs on MLX
MLX Decoder: MLX provides up to 15x decoder speedup compared to PyTorch implementations, demonstrating excellent Apple Silicon optimization
Power Efficiency: CoreML acceleration uses significantly less power than MLX-only implementations, though exact power measurements weren't captured in this benchmark
Decoder Performance: MLX decoder performance remains consistent across implementations, showing the stability of the MLX framework
Speed Gains: You can achieve up to 18x encoder speed increase and 15x decoder speed increase with optimal CoreML configuration
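
To see why a dramatic encoder speedup turns into a more modest end-to-end gain, here is a rough Amdahl's-law calculation. The 60/40 split between encoder and decoder time is a made-up illustration, not a measurement from this benchmark, and `overall_speedup` is a hypothetical helper.

```python
def overall_speedup(encoder_fraction: float,
                    encoder_speedup: float,
                    decoder_speedup: float = 1.0) -> float:
    """End-to-end speedup when encoder and decoder are accelerated separately.

    encoder_fraction is the share of baseline time spent in the encoder;
    the remainder is attributed to the decoder.
    """
    if not 0.0 <= encoder_fraction <= 1.0:
        raise ValueError("encoder_fraction must be between 0 and 1")
    decoder_fraction = 1.0 - encoder_fraction
    # New total time, relative to a baseline time of 1.0.
    new_time = (encoder_fraction / encoder_speedup
                + decoder_fraction / decoder_speedup)
    return 1.0 / new_time


if __name__ == "__main__":
    # Assumption: the encoder takes 60% of the baseline time per chunk,
    # and only the encoder gets faster (decoder stays on MLX as before).
    s = overall_speedup(encoder_fraction=0.6, encoder_speedup=18.0)
    print(f"overall speedup: {s:.2f}x")
```

Under that assumed split, an 18x encoder speedup gives only about a 2.3x overall speedup, because the unchanged decoder time dominates what is left.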

Note: I don't know how to measure power consumption for a specific process. Any contributions or suggestions for accurate power measurement on Apple Silicon would be greatly appreciated!

MLX Implementation: Native Apple Silicon optimization with MLX framework (up to 15x decoder speedup)
CoreML Encoder: Up to 18x faster encoding using Apple's Neural Engine
AlignAtt Policy: State-of-the-art simultaneous decoding strategy
Multiple Model Support: tiny, base, small, medium, large-v1, large-v2, large-v3
Beam Search: Configurable beam search decoding
Real-time Streaming: Both file simulation and live microphone input
Power Efficient: Low power consumption with CoreML acceleration

pip install -r requirements.txt

CoreML Acceleration (Recommended)

For optimal performance on Apple Silicon, install CoreML dependencies:

pip install coremltools ane_transformers

Generate CoreML encoder models for faster inference:

# Clone whisper.cpp for CoreML model generation
git clone https://github.com/ggml-org/whisper.cpp.git

# Generate CoreML encoder for your preferred model
./scripts/generate_coreml_encoder.sh base.en

Available models: tiny.en, tiny, base.en, base, small.en, small, medium.en, medium, large-v1, large-v2, large-v3, large-v3-turbo

For minimal installation, remove torchaudio from requirements.txt. This disables the Silero VAD controller (--vac option).

Quick Start with CoreML (Recommended)

# Basic usage with CoreML acceleration
python3 simulstreaming_whisper.py audio.wav \
  --model_name base.en \
  --model_path mlx_base \
  --use_coreml \
  --language en

# With beam search and CIF model
python3 simulstreaming_whisper.py audio.wav \
  --model_name medium \
  --model_path mlx_medium \
  --use_coreml \
  --beams 3 \
  --cif_ckpt_path cif_model/medium.npz \
  --language en

Real-time Simulation from Audio File

usage: simulstreaming_whisper.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--lan LAN] [--vac]
                                 [--vac-chunk-size VAC_CHUNK_SIZE] [--vad]
                                 [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                                 [--model_path MODEL_PATH] [--model_name MODEL_NAME]
                                 [--beams BEAMS] [--decoder DECODER]
                                 [--audio_max_len AUDIO_MAX_LEN] [--audio_min_len AUDIO_MIN_LEN]
                                 [--frame_threshold FRAME_THRESHOLD]
                                 [--cif_ckpt_path CIF_CKPT_PATH] [--never_fire | --no-never_fire]
                                 [--init_prompt INIT_PROMPT]
                                 [--static_init_prompt STATIC_INIT_PROMPT]
                                 [--max_context_tokens MAX_CONTEXT_TOKENS]
                                 [--start_at START_AT] [--comp_unaware] [--use_coreml]
                                 [--coreml_encoder_path COREML_ENCODER_PATH]
                                 [--coreml_compute_units {ALL,CPU_AND_NE,CPU_ONLY}]
                                 audio_path

CoreML Options:
  --use_coreml
      Enable CoreML encoder acceleration (up to 18x faster, lower power)
  --coreml_encoder_path COREML_ENCODER_PATH
      Path to CoreML encoder .mlpackage directory (auto-detected if not provided)
  --coreml_compute_units {ALL,CPU_AND_NE,CPU_ONLY}
      CoreML compute units: ALL (default), CPU_AND_NE (recommended), CPU_ONLY

Model Options:
  --model_path MODEL_PATH
      Path to MLX model directory or HuggingFace repo
  --model_name MODEL_NAME
      Model name: tiny, base.en, small, medium, large-v1, large-v2, large-v3
  --beams BEAMS, -b BEAMS
      Number of beams for beam search decoding (1 = greedy)
  --decoder DECODER
      Override automatic decoder selection

Audio Processing:
  --min-chunk-size MIN_CHUNK_SIZE
      Minimum audio chunk size in seconds
  --audio_max_len AUDIO_MAX_LEN
      Max length of audio buffer in seconds (default: 30.0)
  --audio_min_len AUDIO_MIN_LEN
      Skip processing if audio buffer is shorter than this length
  --frame_threshold FRAME_THRESHOLD
      AlignAtt threshold in frames (default: 4)

Language:
  --lan LAN, --language LAN
      Source language code (en, de, cs, etc.) or 'auto' for detection

CIF Model (End-of-Word Detection):
  --cif_ckpt_path CIF_CKPT_PATH
      Path to CIF model checkpoint for word boundary detection
  --never_fire, --no-never_fire
      Override CIF model behavior (default: False)

Context and Prompts:
  --init_prompt INIT_PROMPT
      Initial prompt for the model (in target language)
  --static_init_prompt STATIC_INIT_PROMPT
      Static prompt that doesn't scroll (terminology, etc.)
  --max_context_tokens MAX_CONTEXT_TOKENS
      Maximum context tokens (default: model's max)

Simulation Options:
  --start_at START_AT
      Start processing audio at this time
  --comp_unaware
      Computationally unaware simulation
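
As a rough illustration of what the --audio_max_len and --audio_min_len options describe, here is a minimal sketch of a rolling audio buffer. It is not the project's code: the `RollingAudioBuffer` class, its method names, and the 0.0 default for the minimum length are assumptions made for the example. Only the 30-second default and the 16 kHz sample rate come from the text.

```python
SAMPLE_RATE = 16_000  # the pipeline takes 16 kHz mono audio


class RollingAudioBuffer:
    """Hypothetical buffer that keeps at most audio_max_len seconds of audio."""

    def __init__(self, audio_max_len: float = 30.0, audio_min_len: float = 0.0):
        self.max_samples = int(audio_max_len * SAMPLE_RATE)
        self.min_samples = int(audio_min_len * SAMPLE_RATE)
        self.samples = []

    def append(self, chunk) -> None:
        """Add new samples and drop the oldest ones beyond the maximum length."""
        self.samples.extend(chunk)
        overflow = len(self.samples) - self.max_samples
        if overflow > 0:
            del self.samples[:overflow]

    def ready(self) -> bool:
        """True once the buffer is non-empty and at least audio_min_len long."""
        return len(self.samples) > 0 and len(self.samples) >= self.min_samples

    def duration(self) -> float:
        """Buffered audio length in seconds."""
        return len(self.samples) / SAMPLE_RATE
```

A streaming loop would call `append` for each incoming chunk and run the model only when `ready()` returns True.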

# Basic MLX implementation
python simulstreaming_whisper.py test.mp3 \
  --language ko \
  --vac \
  --vad_silence_ms 1000 \
  --beams 3 \
  -l CRITICAL \
  --cif_ckpt_path cif_model/medium.npz \
  --model_name medium \
  --model_path mlx_medium

# With CoreML encoder acceleration (up to 18x faster, lower power)
python simulstreaming_whisper.py test.mp3 \
  --language ko \
  --vac \
  --vad_silence_ms 1000 \
  --beams 3 \
  -l CRITICAL \
  --cif_ckpt_path cif_model/medium.npz \
  --model_name medium \
  --model_path mlx_medium \
  --use_coreml

# CoreML with Neural Engine (best power efficiency)
python simulstreaming_whisper.py test.mp3 \
  --language ko \
  --vac \
  --vad_silence_ms 1000 \
  --beams 3 \
  -l CRITICAL \
  --cif_ckpt_path cif_model/medium.npz \
  --model_name medium \
  --model_path mlx_medium \
  --use_coreml \
  --coreml_compute_units CPU_AND_NE

Lightning-SimulWhisper uses a hybrid architecture combining MLX and CoreML:

Audio Input (16kHz mono)
         ↓
Mel Spectrogram (MLX)
         ↓
┌─────────────────────┐
│   CoreML Encoder    │ ← Apple Neural Engine (up to 18x faster)
│   (whisper.cpp)     │
└─────────────────────┘
         ↓
Encoder Features (convert to MLX)
         ↓
┌─────────────────────┐
│    MLX Decoder      │ ← Full control, beam search, AlignAtt
│  (Simul-Whisper)    │
└─────────────────────┘
         ↓
Transcription Output
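
The hand-off between the CoreML encoder and the MLX decoder could look roughly like the sketch below. This is an illustration under assumptions, not the project's implementation: the function names are hypothetical, the "logmel_data" input name is assumed from whisper.cpp's CoreML conversion script, and the encoder package is assumed to expose a single output. The coremltools and mlx imports are done lazily so the option-checking helper can be used on its own.

```python
VALID_COMPUTE_UNITS = ("ALL", "CPU_AND_NE", "CPU_ONLY")


def resolve_compute_units(cli_value: str = "ALL") -> str:
    """Validate a --coreml_compute_units value and return its canonical name."""
    name = cli_value.strip().upper()
    if name not in VALID_COMPUTE_UNITS:
        raise ValueError(f"unsupported compute units: {cli_value!r}")
    return name


def load_coreml_encoder(mlpackage_path: str, cli_value: str = "ALL"):
    """Load a CoreML encoder package on the requested compute units."""
    import coremltools as ct  # lazy import: only needed when CoreML is used

    units = getattr(ct.ComputeUnit, resolve_compute_units(cli_value))
    return ct.models.MLModel(mlpackage_path, compute_units=units)


def encode_to_mlx(encoder, mel, input_name: str = "logmel_data"):
    """Run the CoreML encoder and hand its features to MLX.

    Assumption: the model takes one float32 input called input_name and
    returns a dict with a single output array.
    """
    import mlx.core as mx
    import numpy as np

    outputs = encoder.predict({input_name: np.asarray(mel, dtype=np.float32)})
    features = next(iter(outputs.values()))  # take the only output
    return mx.array(features)
```

Choosing CPU_AND_NE keeps the encoder off the GPU, which matches the "best power efficiency" configuration described above.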

Key Components:

MLX Framework: Apple's optimized ML framework for Apple Silicon (up to 15x decoder speedup)
CoreML Encoder: Neural Engine acceleration for the encoder (up to 18x speedup, most compute-intensive part)
MLX Decoder: Flexible decoding with AlignAtt policy, beam search, and streaming
AlignAtt Policy: State-of-the-art simultaneous decoding strategy
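
For readers unfamiliar with AlignAtt, the core decision rule can be sketched in a few lines: a candidate token is emitted only if the audio frame it attends to most is not among the last few frames received so far. The function below is a simplified illustration with a hypothetical name and plain Python lists; the actual decoder works on MLX cross-attention tensors and adds beam search and CIF-based word boundaries on top.

```python
def alignatt_emit(tokens, cross_attention, frame_threshold=4):
    """Return the prefix of tokens that the AlignAtt rule would emit.

    tokens: candidate tokens in decoding order.
    cross_attention: one list of attention weights per token, each over
        the audio frames received so far.
    frame_threshold: number of trailing frames treated as "too recent".
    """
    emitted = []
    for token, weights in zip(tokens, cross_attention):
        num_frames = len(weights)
        # Index of the audio frame this token attends to most strongly.
        most_attended = max(range(num_frames), key=weights.__getitem__)
        # If the token leans on the newest frames, the audio supporting it
        # may still be incomplete: stop and wait for more input.
        if most_attended >= num_frames - frame_threshold:
            break
        emitted.append(token)
    return emitted
```

A smaller `frame_threshold` lowers latency because fewer tokens are held back, while a larger one waits for more audio context before committing to a token.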
