Show HN: Local audio transcription and speaker ID for Apple Silicon


A Python project that combines MLX Whisper for fast local transcription with pyannote.audio for speaker diarization, optimized for Apple Silicon.

Features:

  • Fast local transcription using MLX Whisper (Apple Silicon optimized)
  • Speaker diarization with pyannote.audio to identify different speakers
  • Multiple output formats: TXT, SRT subtitles, JSON
  • Privacy-focused: all processing is done locally (only model downloads require internet)
  • Robust error handling with fallback methods

Requirements:

  • Python 3.8+
  • Apple Silicon Mac (for MLX optimization)
  • HuggingFace account and token with gated repository access
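
The Python and hardware requirements above can be checked before installing anything. The following is only a sketch: `check_requirements` is a hypothetical helper, not part of this project, and it assumes that an Apple Silicon Mac reports `Darwin` and `arm64` through the standard `platform` module.

```python
import platform
import sys


def check_requirements(system, machine, version):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration; not part of the project.
    """
    problems = []
    if version < (3, 8):
        problems.append("Python 3.8+ is required")
    if system != "Darwin" or machine != "arm64":
        problems.append("MLX needs an Apple Silicon Mac (Darwin/arm64)")
    return problems


if __name__ == "__main__":
    found = check_requirements(
        platform.system(), platform.machine(), sys.version_info[:2]
    )
    for problem in found:
        print("warning:", problem)
```

A Python interpreter running under Rosetta reports `x86_64`, so a warning on an Apple Silicon machine may simply mean that a native arm64 Python is needed.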

Setup:

  1. Install dependencies:

pip install mlx-whisper pyannote.audio torch torchaudio

  2. Get HuggingFace token:

  3. Create .env file:

HF_TOKEN=your_huggingface_token_here
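
The source does not show how the scripts read the `.env` file (the usage examples pass the token on the command line), so the following is only a sketch of one way to pick up `HF_TOKEN` without extra dependencies. `load_env_file` and `get_hf_token` are hypothetical names; a library such as python-dotenv is a common alternative.

```python
import os


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Hypothetical helper; the project may load its .env differently.
    """
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    return os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
```

Checking the environment first lets a token exported in the shell override the file, which keeps the `.env` file optional.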

Usage:

Basic transcription with speaker diarization (the optional third argument selects the output format, srt or json):

python speech-to-text-fixed.py audio_file.mp3 your_hf_token
python speech-to-text-fixed.py audio_file.mp3 your_hf_token srt
python speech-to-text-fixed.py audio_file.mp3 your_hf_token json
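
Conceptually, combining transcription with diarization means attaching a speaker label to each transcribed segment. The sketch below shows one common approach, picking the speaker whose turns overlap a segment the most. It is not taken from speech-to-text-fixed.py, which may work differently; the segment and turn shapes, the function names, and the `UNKNOWN` fallback label are all assumptions made for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each transcript segment with the speaker who overlaps it most.

    segments: dicts with "start", "end", "text" (assumed shape)
    turns: (start, end, speaker) tuples from diarization (assumed shape)
    """
    labeled = []
    for seg in segments:
        totals = {}
        for t_start, t_end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], t_start, t_end)
            if shared > 0:
                # Sum overlap per speaker in case a speaker has several turns.
                totals[speaker] = totals.get(speaker, 0.0) + shared
        best = max(totals, key=totals.get) if totals else unknown
        labeled.append({**seg, "speaker": best})
    return labeled


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 2.0, "text": "Hello"},
        {"start": 2.0, "end": 5.0, "text": "Hi there"},
    ]
    turns = [(0.0, 2.5, "SPEAKER_00"), (2.5, 6.0, "SPEAKER_01")]
    for seg in assign_speakers(segments, turns):
        print(seg["speaker"], seg["text"])
```

Segments that overlap no diarization turn keep the fallback label rather than being dropped, so no transcribed text is lost when the two models disagree on timing.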

Transcription only (no speaker diarization):

python transcribe_only.py audio_file.mp3

Output formats:

  • TXT: Clean text with speaker labels
  • SRT: Subtitle file with timestamps and speaker identification
  • JSON: Full structured data with segments, timestamps, and metadata

Files:

  • speech-to-text-fixed.py - Main script with speaker diarization
  • transcribe_only.py - Simple transcription without speaker identification
  • debug_pyannote.py - Debugging tool for pyannote issues
  • speech-to-text.py - Original script (may have tensor size issues)
  • test-mlx.py - MLX Whisper testing script
  • CLAUDE.md - Development guidance for Claude Code
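
The SRT output described above pairs timestamps with speaker identification. As a rough illustration of what producing such a file involves, the sketch below renders speaker-labeled segments as SRT cues. The segment shape, the function names, and the `[SPEAKER] text` layout are assumptions; the project's actual output may be laid out differently.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render segments (dicts with start, end, text, speaker) as SRT text.

    The "[SPEAKER] text" layout is an assumption made for this sketch.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"[{seg['speaker']}] {seg['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.0, "text": " Hello", "speaker": "SPEAKER_00"},
        {"start": 2.0, "end": 5.25, "text": " Hi there", "speaker": "SPEAKER_01"},
    ]
    print(to_srt(demo))
```

Working in integer milliseconds avoids floating-point drift when splitting a time into hours, minutes, and seconds.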

Common issues and solutions are documented in CLAUDE.md. For debugging pyannote issues, use:

python debug_pyannote.py your_hf_token audio_file.mp3