
AudioPipe combines denoising, speaker diarization, and transcription in a single streamlined process. It is well suited to transcribing podcasts, interviews, or any other multi-speaker audio, provided the recording is reasonably clear. The output is a JSON file containing the transcript, speaker labels, and timestamps.
Key features:
- Audio Source Separation: Extract vocals from background music/noise
- Speaker Diarization: Identify and separate different speakers
- Transcription: Convert speech to text with timestamps
- Post-processing: Consolidate transcripts for readability
- Clean Display: Real-time progress updates without cluttering the console
- Step Skipping: Start the pipeline from any step
- Cross-platform: Supports Windows, macOS, and Linux
Requirements:
- Python 3.8+
- FFmpeg (for audio processing)
- CUDA-compatible GPU (recommended, but CPU mode available)
- Hugging Face token (optional, for enhanced speaker diarization accuracy)
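As a quick sanity check before running the pipeline, you can verify the FFmpeg binary, the optional GPU, and the Hugging Face token from Python. This is a minimal sketch, assuming PyTorch is already installed as part of the ML dependencies and that the token lives in the HUGGING_FACE_TOKEN environment variable mentioned in the testing notes below:

```python
import os
import shutil

import torch  # assumed to be installed alongside the pipeline's ML dependencies

# FFmpeg must be on PATH for audio extraction and processing.
print("ffmpeg found:", shutil.which("ffmpeg") is not None)

# CUDA is recommended but optional; CPU mode works too.
print("CUDA available:", torch.cuda.is_available())

# The token is optional and only improves diarization accuracy.
print("HF token set:", bool(os.environ.get("HUGGING_FACE_TOKEN")))
```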
The process consists of three main steps that can be run together or separately:
- Separation (Step 1): Extracts vocals from the background using Demucs
  - Input: any audio/video file
  - Output: output/combined_vocals.wav
  - Note: files under 60MB are processed as a single unit; larger files are chunked automatically
- Diarization (Step 2): Identifies the different speakers
  - Input: output/combined_vocals.wav
  - Output: output/combined_vocals_diarized.json
  - Tip: use --num-speakers for better results when the speaker count is known
- Transcription (Step 3): Converts the complete audio to text, then maps each segment to a speaker
  - Input: output/combined_vocals.wav and the diarization data
  - Output: output/final_transcription.json
  - Architecture: complete-audio transcription → speaker mapping, with no chunking (see the sketch after this list)
  - Tip: specify the language code with --language for improved accuracy
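The speaker-mapping part of Step 3 is described only at a high level above. The sketch below shows one plausible way to assign each transcribed segment to the diarized speaker with the greatest time overlap; the function and variable names are illustrative and are not the pipeline's actual internals:

```python
import json

def overlap(a_start, a_end, b_start, b_end):
    """Length of the time overlap between two intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def map_speakers(transcript_segments, diarization):
    """Attach a speaker label to each transcribed segment.

    transcript_segments: [{"text", "start", "end"}, ...] from the transcription step
    diarization: contents of combined_vocals_diarized.json
    """
    labeled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = None, 0.0
        for turn in diarization["segments"]:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

# Example usage with the documented file layout:
# with open("output/combined_vocals_diarized.json") as f:
#     diarization = json.load(f)
# final = {"segments": map_speakers(transcribed_segments, diarization)}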
The pipeline creates several files during processing, all stored in the output/ directory:
- combined_vocals.wav: Extracted voices/speech from the input
- combined_background.wav: Background music/noise separated from the input
- speakers/SPEAKER_XX/*.wav: Individual audio segments for each speaker
- combined_vocals_diarized.json: Speaker diarization results showing who speaks when

      {
        "speakers": ["SPEAKER_01", "SPEAKER_02", ...],
        "segments": [
          {"speaker": "SPEAKER_01", "start": 0.0, "end": 2.5},
          {"speaker": "SPEAKER_02", "start": 2.7, "end": 5.1},
          ...
        ]
      }

- final_transcription.json: Complete transcription with speaker attribution in chronological order

      {
        "segments": [
          {"text": "Complete sentence or phrase", "start": 0.1, "end": 2.5, "speaker": "SPEAKER_01"},
          {"text": "Another speaker's response", "start": 2.7, "end": 5.1, "speaker": "SPEAKER_02"},
          {"text": "Continuing conversation", "start": 5.3, "end": 8.0, "speaker": "SPEAKER_01"},
          ...
        ]
      }
- separated/: Intermediate files from audio separation (preserved for resuming)
- chunks/: Audio chunks when using --chop mode (preserved for debugging)
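Because the final output is plain JSON, it is easy to post-process. For example, a small script (illustrative, not part of the pipeline) can turn final_transcription.json into a readable speaker-labelled transcript:

```python
import json

# Load the documented final output file.
with open("output/final_transcription.json") as f:
    data = json.load(f)

# Print one line per segment: [start-end] SPEAKER_XX: text
for seg in data["segments"]:
    print(f'[{seg["start"]:7.1f}-{seg["end"]:7.1f}] {seg["speaker"]}: {seg["text"]}')
```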
The presence of these files allows the pipeline to resume from different steps:
- If combined_vocals.wav exists, audio separation can be skipped (step 1)
- If combined_vocals_diarized.json exists, diarization can be skipped (step 2)
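The resume logic therefore reduces to simple file-existence checks. A minimal sketch (the function name and the way steps are numbered here are illustrative):

```python
from pathlib import Path

def first_step_to_run(output_dir="output"):
    """Return the earliest pipeline step whose output is still missing."""
    out = Path(output_dir)
    if not (out / "combined_vocals.wav").exists():
        return 1  # separation still needed
    if not (out / "combined_vocals_diarized.json").exists():
        return 2  # diarization still needed
    return 3      # only transcription remains

print("Starting from step", first_step_to_run())
```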
AudioPipe includes tools to visualize transcripts and generate interactive reports.
For best results:
- Use the HTML report for interactive exploration of longer content
- For very long audio (>1 hour), use --chop mode for processing
Supported input formats:
- Audio: .mp3, .wav, .m4a, .flac, .ogg
- Video (audio is extracted automatically): .mp4, .mov, .avi, .mkv
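Video inputs have their audio track extracted with FFmpeg before processing. If you want to do this manually, something like the sketch below works; the file names are placeholders, and 16 kHz mono WAV is just a common choice for speech models, not necessarily what the pipeline produces internally:

```python
import subprocess

# Drop the video stream (-vn) and write 16 kHz mono PCM WAV.
subprocess.run(
    ["ffmpeg", "-y", "-i", "interview.mp4",
     "-vn", "-ac", "1", "-ar", "16000", "input_audio.wav"],
    check=True,
)
```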
For macOS users there are two operation modes: CPU mode for Macs without a dedicated NVIDIA GPU, and Metal Performance Shaders (MPS) acceleration for Apple Silicon (M1/M2/M3) Macs.
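Device selection in PyTorch-based tooling typically looks like the sketch below. Note that only --device cpu is documented in this README; whether the pipeline accepts an mps device value is an assumption:

```python
import torch

# Prefer CUDA, fall back to Apple's Metal backend (MPS), then CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():  # Apple Silicon Macs
    device = "mps"
else:
    device = "cpu"

print("Using device:", device)
```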
The final output is final_transcription.json, a JSON file with chronological segments in the format shown above.
- Audio Processing:
  - Standard mode processes complete audio files for best quality
  - For very long files (>1 hour), use --chop to split the audio into 15-minute chunks (see the sketch after this list)
  - If you run into memory errors, try --device cpu, which uses less memory
- Transcription Accuracy:
  - Specify the language with --language for better results
  - Transcribing the complete audio provides better context than chunking
  - Accuracy is best on clear audio with minimal background noise
- Speaker Identification:
  - If speakers are not correctly identified, try setting --num-speakers
  - Results are better when speakers have distinct voices and do not talk over each other
  - A Hugging Face token improves diarization accuracy but is not required
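For reference, splitting audio into 15-minute chunks (as --chop does) can be reproduced with FFmpeg's segment muxer. This is only a sketch of the idea, not the pipeline's actual implementation, and the chunk naming pattern is hypothetical:

```python
import subprocess
from pathlib import Path

Path("output/chunks").mkdir(parents=True, exist_ok=True)

# Cut combined_vocals.wav into 15-minute (900 s) pieces.
subprocess.run(
    ["ffmpeg", "-y", "-i", "output/combined_vocals.wav",
     "-f", "segment", "-segment_time", "900", "-c:a", "pcm_s16le",
     "output/chunks/chunk_%03d.wav"],
    check=True,
)
```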
The project includes a test suite for validating the pipeline functionality:
- Full Pipeline Test: Use --runslow to run the complete pipeline test
- Hugging Face Token: For full testing, provide your token with --hf-token or set the HUGGING_FACE_TOKEN environment variable
For more details on testing, see README.test.md.
Known issues:
- Transcript search stopped working at some point
- Some buttons on the visualization page are not working