VoxScribe is a lightweight, unified platform for testing and comparing multiple open-source speech-to-text (STT) models through a single interface. Born from real-world enterprise challenges where proprietary STT solutions become prohibitively expensive at scale, VoxScribe democratizes access to cutting-edge open-source alternatives.
Startups transcribing speech at scale face a common dilemma: cost vs. control. A contact center processing 100,000 hours of calls monthly can easily spend $150,000+ on transcription alone (at typical cloud STT rates of roughly $0.025 per audio minute, that works out to $1.50 per hour). While open-source STT models like Whisper, Voxtral, Parakeet, and Canary-Qwen now rival proprietary solutions in accuracy, evaluating them has been a nightmare:
- Dependency Hell 🔥: Incompatible library versions between models (e.g., Voxtral and the NeMo models require conflicting transformers versions)
- Inconsistent APIs 🔄: Each model requires different integration approaches
- Complex Setup ⚙️: Hours or days managing CUDA drivers, Python environments, and debugging
- Limited Comparison 📊: No unified way to test multiple models against your specific use cases
✅ Unified Interface: Test 5+ open-source STT models through a single FastAPI backend and clean web UI
✅ Dependency Management: Handles version conflicts and library incompatibilities automatically
✅ Side-by-Side Comparison: Upload audio and compare transcriptions across multiple models
✅ Model Caching: Intelligent caching for faster subsequent runs
✅ Clean API: RESTful endpoints for easy integration into existing workflows
✅ Cost Control: Self-hosted solution puts you in control of transcription costs
Currently supported models:
- OpenAI Whisper - Industry standard baseline
- Mistral Voxtral - Latest transformer-based approach
- NVIDIA Parakeet - Enterprise-grade accuracy
- Canary-Qwen-2.5B - Multilingual capabilities
- And growing... - Easy to add new models
Backend:
- RESTful API for all STT operations
- Unified model management for Whisper, Voxtral, Parakeet, Canary
- Automatic dependency handling with version conflict resolution
- File upload and processing with background tasks (see the sketch after this list)
- Model comparison endpoint for side-by-side evaluation
- Dependency installation endpoints with subprocess management
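For a flavor of the upload flow mentioned above, here is a minimal sketch of a FastAPI endpoint that accepts a file and hands transcription off to a background task. It is illustrative only: the helper `run_transcription` and the response shape are made up, not VoxScribe's actual code.

```python
# Minimal sketch of an upload endpoint with background processing.
# run_transcription is a hypothetical helper, not VoxScribe's actual function.
from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()

def run_transcription(path: str, engine: str, model: str) -> None:
    ...  # load the requested model, transcribe the file, store the result

@app.post("/api/transcribe")
async def transcribe(
    file: UploadFile, engine: str, model: str, background_tasks: BackgroundTasks
):
    # Persist the upload, then return immediately while inference runs later
    path = f"/tmp/{file.filename}"
    with open(path, "wb") as out:
        out.write(await file.read())
    background_tasks.add_task(run_transcription, path, engine, model)
    return {"status": "processing", "filename": file.filename}
```

Background tasks keep the HTTP request fast while the (potentially slow) model inference runs after the response is sent.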
Frontend:
- Modern responsive design with dark/light theme toggle
- Drag & drop file upload with audio preview
- Real-time status updates for dependencies and models
- Single model transcription with engine/model selection
- Multi-model comparison with checkbox selection
- Progress tracking and result visualization
- Download options for CSV and text formats
Prerequisites:
- AWS EC2 g6.xlarge instance with Amazon Linux 2023
- NVIDIA GPU drivers installed
Installation:

1. Install NVIDIA GRID drivers
    ```bash
    # Follow AWS documentation for GRID driver installation
    # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-GRID-driver
    ```
2. Verify CUDA installation
    ```bash
    nvidia-smi
    ```
3. Install system dependencies
    ```bash
    sudo dnf update -y
    sudo dnf install git -y
    ```
4. Install Miniconda
    ```bash
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    ```
    - Accept the license agreement (type yes)
    - Confirm the installation location (the default is fine)
    - Initialize Conda (type yes when prompted)
5. Restart your shell or source your bashrc
    ```bash
    source ~/.bashrc
    ```
6. Create and activate the conda environment
    ```bash
    conda create -n voxscribe python=3.12 -y
    conda activate voxscribe
    ```
7. Install ffmpeg in the conda env
    ```bash
    conda install -c conda-forge ffmpeg -y
    ```
8. Clone the repository
    ```bash
    git clone https://github.com/Fraser27/VoxScribe.git
    cd VoxScribe
    ```
9. Install Python dependencies
    ```bash
    pip install -r requirements.txt
    ```
10. Start the application
    ```bash
    python run.py
    ```
11. Open your browser at the address shown in the terminal (typically http://localhost:8000)
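Once the server is running, a quick smoke test is to hit the status endpoint from Python (the snippet assumes the default local address; adjust the URL to match what the terminal prints):

```python
# Smoke test: confirm the VoxScribe API is up.
# Assumes the server listens at http://localhost:8000 (adjust if yours differs).
import requests

resp = requests.get("http://localhost:8000/api/status")
resp.raise_for_status()
print(resp.json())  # system and dependency status
```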
API endpoints:
- GET /api/status - Get system and dependency status
- GET /api/models - Get available models and cache status
- POST /api/transcribe - Single model transcription
- POST /api/compare - Multi-model comparison
- POST /api/install-dependency - Install missing dependencies
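The endpoints can also be exercised from code. The snippet below is a rough sketch: the field names (`file`, `engine`, `model`, `models`) are assumptions based on the UI described above, not a confirmed request schema; check the interactive docs at `/docs` for the real one.

```python
# Illustrative client for the transcription endpoints.
# Field names ("engine", "model", "models") are assumptions, not a documented schema.
import requests

BASE = "http://localhost:8000"

# Single-model transcription
with open("sample.wav", "rb") as f:
    r = requests.post(
        f"{BASE}/api/transcribe",
        files={"file": f},
        data={"engine": "whisper", "model": "base"},
    )
print(r.json())

# Multi-model comparison against the same clip
with open("sample.wav", "rb") as f:
    r = requests.post(
        f"{BASE}/api/compare",
        files={"file": f},
        data={"models": "whisper,parakeet"},
    )
print(r.json())
```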
Supported engines and their dependencies:

| Engine | Models | Dependencies | Features |
|--------|--------|--------------|----------|
| Whisper | tiny, base, small, medium, large, large-v2, large-v3 | ✅ Built-in | Detailed timestamps, multiple sizes |
| Voxtral | Mini-3B, Small-24B | transformers 4.56.0+ | Advanced audio understanding, multilingual |
| Parakeet | TDT-0.6B-V2 | NeMo toolkit | NVIDIA optimized, fast inference |
| Canary | Qwen-2.5B | NeMo toolkit | State-of-the-art English ASR |
The system automatically handles version conflicts between:
- Voxtral: Requires transformers 4.56.0+
- NeMo models: Require transformers 4.51.3
Installation buttons are provided in the UI for missing dependencies.
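One common way to implement this kind of switching is to shell out to pip inside the running interpreter's environment. The sketch below mirrors the version pins listed above, but it is not VoxScribe's exact implementation:

```python
# Sketch of subprocess-based dependency switching between the two
# transformers pins mentioned above. Not VoxScribe's exact code.
import subprocess
import sys

PINS = {
    "voxtral": "transformers>=4.56.0",
    "nemo": "transformers==4.51.3",
}

def install_transformers_for(engine: str) -> None:
    """Reinstall transformers at the version the given engine needs."""
    spec = PINS[engine]
    subprocess.check_call([sys.executable, "-m", "pip", "install", spec])

install_transformers_for("voxtral")
```

Using `sys.executable -m pip` ensures the package lands in the same environment the server itself is running in.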
Supported audio formats: WAV, MP3, FLAC, M4A, OGG
Static files are served from the `public/` directory. Changes to HTML, CSS, or JS files are reflected immediately.
To add a new model:
- Update `MODEL_REGISTRY` in `backend.py`
- Add loading logic in the `load_model()` function
- Add transcription logic in the `transcribe_audio()` function (see the sketch below)
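For illustration, a new registry entry and the two hooks might look roughly like this; the actual structure of `MODEL_REGISTRY` in `backend.py` may differ, so treat the keys below as placeholders:

```python
# Hypothetical MODEL_REGISTRY entry; the real structure in backend.py may differ.
MODEL_REGISTRY = {
    "my-new-model": {
        "engine": "mynewengine",
        "hf_id": "org/my-new-model",        # Hugging Face model id (placeholder)
        "requires": ["transformers>=4.56.0"],
    },
}

def load_model(name: str):
    """Look up the registry entry and branch on its engine."""
    entry = MODEL_REGISTRY[name]
    ...  # e.g., if entry["engine"] == "mynewengine": load and return the model

def transcribe_audio(model, audio_path: str) -> str:
    """Add the new engine's inference call here."""
    ...
```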
Why the FastAPI architecture:
- No ScriptRunContext warnings - Separating the backend from the UI eliminates Streamlit's context issues
- Better performance - FastAPI is faster and more efficient
- Modern UI - Custom HTML/CSS/JS with better UX
- API-first design - Can be integrated with other applications
- Easier deployment - Standard web application deployment
- Better error handling - Proper HTTP status codes and error responses
- Scalability - Can handle multiple concurrent requests
Troubleshooting:
- Missing dependencies: Use the install buttons in the UI
- Model download failures: Check internet connection and disk space
- Audio processing errors: Ensure ffmpeg is installed
- CUDA issues: Check PyTorch CUDA installation
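For the CUDA issue in particular, a quick check that PyTorch can see the GPU:

```python
# Verify that PyTorch sees the GPU
import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```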
Server logs are displayed in the terminal where you run `python run.py`.
Development:
- Backend changes: Modify `backend.py`
- Frontend changes: Modify files in `public/`
- New features: Add API endpoints and corresponding UI elements
- Testing: Use the built-in FastAPI docs at `/docs`
