VoxScribe is a lightweight, unified platform for testing and comparing multiple open-source speech-to-text (STT) models through a single interface. Born from real-world enterprise challenges where proprietary STT solutions become prohibitively expensive at scale, VoxScribe democratizes access to cutting-edge open-source alternatives.
Startups transcribing speech at scale face a common dilemma: cost vs. control. A contact center processing 100,000 hours of calls monthly can easily spend $150,000+ on transcription alone (at typical cloud STT rates of roughly $0.025 per audio minute, that works out to $1.50 per hour). While open-source STT models like Whisper, Voxtral, Parakeet, and Canary-Qwen now rival proprietary solutions in accuracy, evaluating them has been a nightmare:
- Dependency Hell 🔥: Incompatible library versions between models (e.g., Voxtral and the NeMo models require conflicting transformers versions)
- Inconsistent APIs 🔄: Each model requires different integration approaches
- Complex Setup ⚙️: Hours or days managing CUDA drivers, Python environments, and debugging
- Limited Comparison 📊: No unified way to test multiple models against your specific use cases
✅ Unified Interface: Test 5+ open-source STT models through a single FastAPI backend and clean web UI
✅ Dependency Management: Handles version conflicts and library incompatibilities automatically
✅ Side-by-Side Comparison: Upload audio and compare transcriptions across multiple models
✅ Model Caching: Intelligent caching for faster subsequent runs
✅ Clean API: RESTful endpoints for easy integration into existing workflows
✅ Cost Control: Self-hosted solution puts you in control of transcription costs
Currently supported models:
- OpenAI Whisper - Industry standard baseline
- Mistral Voxtral - Latest transformer-based approach
- NVIDIA Parakeet - Enterprise-grade accuracy
- Canary-Qwen-2.5B - Multilingual capabilities
- And growing... - Easy to add new models
Backend:
- RESTful API for all STT operations
- Unified model management for Whisper, Voxtral, Parakeet, Canary
- Automatic dependency handling with version conflict resolution
- File upload and processing with background tasks (see the sketch after this list)
- Model comparison endpoint for side-by-side evaluation
- Dependency installation endpoints with subprocess management
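For a flavor of the upload flow mentioned above, here is a minimal sketch of a FastAPI endpoint that accepts a file and hands transcription off to a background task. It is illustrative only: the helper `run_transcription` and the response shape are made up, not VoxScribe's actual code.

```python
# Minimal sketch of an upload endpoint with background processing.
# run_transcription is a hypothetical helper, not VoxScribe's actual function.
from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()

def run_transcription(path: str, engine: str, model: str) -> None:
    ...  # load the requested model, transcribe the file, store the result

@app.post("/api/transcribe")
async def transcribe(
    file: UploadFile, engine: str, model: str, background_tasks: BackgroundTasks
):
    # Persist the upload, then return immediately while inference runs later
    path = f"/tmp/{file.filename}"
    with open(path, "wb") as out:
        out.write(await file.read())
    background_tasks.add_task(run_transcription, path, engine, model)
    return {"status": "processing", "filename": file.filename}
```

Background tasks keep the HTTP request fast while the (potentially slow) model inference runs after the response is sent.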
Frontend:
- Modern responsive design with dark/light theme toggle
- Drag & drop file upload with audio preview
- Real-time status updates for dependencies and models
- Single model transcription with engine/model selection
- Multi-model comparison with checkbox selection
- Progress tracking and result visualization
- Download options for CSV and text formats
Prerequisites:
- AWS EC2 g6.xlarge instance with Amazon Linux 2023
- NVIDIA GPU drivers installed
Installation:

1. Install NVIDIA GRID drivers
    ```bash
    # Follow AWS documentation for GRID driver installation
    # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-GRID-driver
    ```
2. Verify CUDA installation
    ```bash
    nvidia-smi
    ```
3. Install system dependencies
    ```bash
    sudo dnf update -y
    sudo dnf install git -y
    ```
4. Install Miniconda
    ```bash
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    ```
    - Accept the license agreement (type yes)
    - Confirm the installation location (the default is fine)
    - Initialize Conda (type yes when prompted)
5. Restart your shell or source your bashrc
    ```bash
    source ~/.bashrc
    ```
6. Create and activate the conda environment
    ```bash
    conda create -n voxscribe python=3.12 -y
    conda activate voxscribe
    ```
7. Install ffmpeg in the conda env
    ```bash
    conda install -c conda-forge ffmpeg -y
    ```
8. Clone the repository
    ```bash
    git clone https://github.com/Fraser27/VoxScribe.git
    cd VoxScribe
    ```
9. Install Python dependencies
    ```bash
    pip install -r requirements.txt
    ```
10. Start the application
    ```bash
    python run.py
    ```
11. Open your browser at the address shown in the terminal (typically http://localhost:8000)
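Once the server is running, a quick smoke test is to hit the status endpoint from Python (the snippet assumes the default local address; adjust the URL to match what the terminal prints):

```python
# Smoke test: confirm the VoxScribe API is up.
# Assumes the server listens at http://localhost:8000 (adjust if yours differs).
import requests

resp = requests.get("http://localhost:8000/api/status")
resp.raise_for_status()
print(resp.json())  # system and dependency status
```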
API endpoints:
- GET /api/status - Get system and dependency status
- GET /api/models - Get available models and cache status
- POST /api/transcribe - Single model transcription
- POST /api/compare - Multi-model comparison
- POST /api/install-dependency - Install missing dependencies
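The endpoints can also be exercised from code. The snippet below is a rough sketch: the field names (`file`, `engine`, `model`, `models`) are assumptions based on the UI described above, not a confirmed request schema; check the interactive docs at `/docs` for the real one.

```python
# Illustrative client for the transcription endpoints.
# Field names ("engine", "model", "models") are assumptions, not a documented schema.
import requests

BASE = "http://localhost:8000"

# Single-model transcription
with open("sample.wav", "rb") as f:
    r = requests.post(
        f"{BASE}/api/transcribe",
        files={"file": f},
        data={"engine": "whisper", "model": "base"},
    )
print(r.json())

# Multi-model comparison against the same clip
with open("sample.wav", "rb") as f:
    r = requests.post(
        f"{BASE}/api/compare",
        files={"file": f},
        data={"models": "whisper,parakeet"},
    )
print(r.json())
```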
Supported engines and their dependencies:

| Engine | Models | Dependencies | Features |
|--------|--------|--------------|----------|
| Whisper | tiny, base, small, medium, large, large-v2, large-v3 | ✅ Built-in | Detailed timestamps, multiple sizes |
| Voxtral | Mini-3B, Small-24B | transformers 4.56.0+ | Advanced audio understanding, multilingual |
| Parakeet | TDT-0.6B-V2 | NeMo toolkit | NVIDIA optimized, fast inference |
| Canary | Qwen-2.5B | NeMo toolkit | State-of-the-art English ASR |
The system automatically handles version conflicts between:
- Voxtral: Requires transformers 4.56.0+
- NeMo models: Require transformers 4.51.3
Installation buttons are provided in the UI for missing dependencies.
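One common way to implement this kind of switching is to shell out to pip inside the running interpreter's environment. The sketch below mirrors the version pins listed above, but it is not VoxScribe's exact implementation:

```python
# Sketch of subprocess-based dependency switching between the two
# transformers pins mentioned above. Not VoxScribe's exact code.
import subprocess
import sys

PINS = {
    "voxtral": "transformers>=4.56.0",
    "nemo": "transformers==4.51.3",
}

def install_transformers_for(engine: str) -> None:
    """Reinstall transformers at the version the given engine needs."""
    spec = PINS[engine]
    subprocess.check_call([sys.executable, "-m", "pip", "install", spec])

install_transformers_for("voxtral")
```

Using `sys.executable -m pip` ensures the package lands in the same environment the server itself is running in.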
Supported audio formats: WAV, MP3, FLAC, M4A, OGG
Static files are served from the `public/` directory. Changes to HTML, CSS, or JS files are reflected immediately.
To add a new model:
- Update `MODEL_REGISTRY` in `backend.py`
- Add loading logic in the `load_model()` function
- Add transcription logic in the `transcribe_audio()` function (see the sketch below)
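For illustration, a new registry entry and the two hooks might look roughly like this; the actual structure of `MODEL_REGISTRY` in `backend.py` may differ, so treat the keys below as placeholders:

```python
# Hypothetical MODEL_REGISTRY entry; the real structure in backend.py may differ.
MODEL_REGISTRY = {
    "my-new-model": {
        "engine": "mynewengine",
        "hf_id": "org/my-new-model",        # Hugging Face model id (placeholder)
        "requires": ["transformers>=4.56.0"],
    },
}

def load_model(name: str):
    """Look up the registry entry and branch on its engine."""
    entry = MODEL_REGISTRY[name]
    ...  # e.g., if entry["engine"] == "mynewengine": load and return the model

def transcribe_audio(model, audio_path: str) -> str:
    """Add the new engine's inference call here."""
    ...
```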
Why the FastAPI architecture:
- No ScriptRunContext warnings - Separating the backend from the UI eliminates Streamlit's context issues
- Better performance - FastAPI is faster and more efficient
- Modern UI - Custom HTML/CSS/JS with better UX
- API-first design - Can be integrated with other applications
- Easier deployment - Standard web application deployment
- Better error handling - Proper HTTP status codes and error responses
- Scalability - Can handle multiple concurrent requests
Troubleshooting:
- Missing dependencies: Use the install buttons in the UI
- Model download failures: Check internet connection and disk space
- Audio processing errors: Ensure ffmpeg is installed
- CUDA issues: Check PyTorch CUDA installation
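For the CUDA issue in particular, a quick check that PyTorch can see the GPU:

```python
# Verify that PyTorch sees the GPU
import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```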
Server logs are displayed in the terminal where you run `python run.py`.
Development:
- Backend changes: Modify `backend.py`
- Frontend changes: Modify files in `public/`
- New features: Add API endpoints and corresponding UI elements
- Testing: Use the built-in FastAPI docs at `/docs`
