This repository contains the training and modeling code used to create the Inworld TTS-1 and TTS-1-Max models.
You can use this code to pre-train, fine-tune, or RL-align any SpeechLM-based TTS model, whether you're on a single-GPU machine or a multi-GPU cluster.
- Modeling: SpeechLM and 1D audio-codecs
- Distributed Training: DDP, DeepSpeed, and FSDP for training arbitrary SpeechLMs
- Data Pipeline: Ready-to-use scripts to vectorize and prepare your audio data for training
The code has only been tested on Ubuntu 22.04.
| Dependency | Version | Notes |
|---|---|---|
| Python | 3.10 | Required for all features |
| CUDA | 12.4 or 12.8 | Auto-detected |
| PyTorch | 2.6 (CUDA 12.4) or 2.7 (CUDA 12.8) | Auto-installed |
This project depends on Python 3.10 and uv for package management.
Install uv for fast Python package management:
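For example, using uv's official standalone installer:

```bash
# Install uv via its official installer script
curl -LsSf https://astral.sh/uv/install.sh | sh
```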
The default setup targets CUDA 12.8 + PyTorch 2.7; you can also select the CUDA 12.4 + PyTorch 2.6 build instead. A hedged sketch of both invocations follows.
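The exact setup entry point in this repository may differ; as a minimal sketch with uv, assuming hypothetical optional dependency groups named `cu128` and `cu124`:

```bash
# Default setup (CUDA 12.8 + PyTorch 2.7) -- the extra name "cu128" is an assumption
uv sync --extra cu128

# Or pin the older CUDA build (CUDA 12.4 + PyTorch 2.6) -- "cu124" is likewise an assumption
uv sync --extra cu124
```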
This automatically:
- Creates Python 3.10 virtual environment
- Installs CUDA-optimized PyTorch with the proper flash attention implementation
- Sets up all project dependencies
To train a SpeechLM, you first need to vectorize your audio data into audio codes. Paired with the transcripts, these codes let the model learn to generate audio conditioned on the text. The example below shows how to get started with simple SFT training.
Process your raw audio dataset into a JSONL file where each line contains a sample with the following fields (an illustrative line is shown after the list):
Required fields:
- transcript: Text transcription of the audio
- language: Language code (e.g., "en" for English)
- wav_path: Absolute path to the audio file
- duration: Audio duration in seconds
- sample_rate: Audio sample rate in Hz
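For illustration only, a single sample line could look like this; the file name `my_dataset.jsonl`, the wav path, and all values are placeholders, not files from the repository:

```bash
# Write one illustrative sample line (all values are placeholders)
cat > ./my_dataset.jsonl <<'EOF'
{"transcript": "Hello there, how are you today?", "language": "en", "wav_path": "/abs/path/to/wavs/sample_0001.wav", "duration": 3.2, "sample_rate": 24000}
EOF
```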
Example dataset: You can reference the LibriTTS dataset, which contains ~585 hours of English speech from 2,456 speakers at a 24 kHz sampling rate.
Sample files: We provide real example data from LibriTTS in this repository:
- ./example/configs/samples.jsonl - 100 real LibriTTS samples with proper JSONL format
- ./example/wavs/ - Corresponding audio files (audio_1.wav through audio_100.wav)
This gives you a working example to test the data vectorization and training pipeline with actual audio data.
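To take a quick look at one of the provided sample records:

```bash
# Pretty-print the first sample from the bundled example dataset
head -n 1 ./example/configs/samples.jsonl | python -m json.tool
```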
Vectorize the audio data using the codec's encoder. (The codec is also compatible with xcodec2, so you can use the publicly available checkpoint if you prefer not to train the codec from scratch.)
You can test with the provided samples or run on your own dataset; a hedged sketch of both invocations is shown below.
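The script name and flags below are assumptions for illustration; check the repository's scripts and their `--help` output for the actual entry point:

```bash
# Hypothetical vectorization invocations -- script name and flags are assumptions

# Test with the provided samples
python scripts/vectorize_audio.py \
  --input_jsonl ./example/configs/samples.jsonl \
  --codec_checkpoint /path/to/codec.ckpt \
  --output_dir ./vectorized/samples

# With your own dataset
python scripts/vectorize_audio.py \
  --input_jsonl ./my_dataset.jsonl \
  --codec_checkpoint /path/to/codec.ckpt \
  --output_dir /path/to/vectorized_output
```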
After vectorization completes, your output directory will contain multiple shard files.
Feel free to customize the filtering logic to match your own data-quality criteria.
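As one simple, hypothetical example of pre-filtering (not the repository's built-in logic), you could drop very short or very long samples from the source JSONL with jq before vectorizing:

```bash
# Hypothetical pre-filter: keep samples between 1 and 20 seconds (requires jq)
jq -c 'select(.duration >= 1.0 and .duration <= 20.0)' ./my_dataset.jsonl > ./my_dataset.filtered.jsonl
```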
Combine vectorized shards into unified dataset to save space:
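A hypothetical merge invocation; the script name and flags are assumptions:

```bash
# Hypothetical shard-merging invocation -- script name and flags are assumptions
python scripts/merge_shards.py \
  --shards_dir /path/to/vectorized_output \
  --output_dir /path/to/merged_dataset
```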
After merging, your dataset folder will contain the unified dataset files.
Create a training config (./example/configs/sft.json). The sketch below illustrates the key configuration sections; see ./example/configs/sft.json for the complete file with all available options:
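This excerpt is illustrative only: apart from the dataset paths, the checkpointing section, save_steps, and checkpoint_file_to_resume_from mentioned in this README, the key names, structure, and the output file name are assumptions; consult ./example/configs/sft.json for the real layout.

```bash
# Illustrative config excerpt -- key names and file name are assumptions
cat > ./my_sft_config.json <<'EOF'
{
  "dataset": {
    "data_dir": "/path/to/merged_dataset"
  },
  "checkpointing": {
    "save_steps": 1000,
    "checkpoint_file_to_resume_from": null
  }
}
EOF
```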
Important:
- Update dataset paths to point to your vectorized data directory
- This shows only key parameters - refer to ./example/configs/sft.json for the complete configuration with all available options
- To resume from a checkpoint: Add "checkpoint_file_to_resume_from": "/path/to/your/checkpoint.pt" to the checkpointing section
SFT training:
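A hypothetical launch command; the training entry-point path is an assumption, while the optional flags are the ones documented under "Additional options" below:

```bash
# Hypothetical SFT launch -- the training entry point is an assumption
python scripts/train.py --config ./example/configs/sft.json

# Optional flags (documented below):
#   --dry_run         test the pipeline without training
#   --compile_model   enable torch.compile optimization
```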
After training completes, you'll find the trained model at ./experiments/my_tts_training/final_model.pt along with model and tokenizer configuration files.
Additional options:
- --dry_run: Test pipeline without training
- --compile_model: Enable torch.compile optimization (works best when all samples in a batch have the same length)
Track progress via:
- Weights & Biases: Loss curves and training metrics
- Checkpoints: Saved every save_steps iterations
- Console logs: Real-time training information
Once you have a trained model, you can use it for inference to generate speech from text and an audio prompt.
Use the provided inference script for easy speech generation:
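A hypothetical invocation wiring together the components listed below; the script name and flag names are assumptions:

```bash
# Hypothetical inference invocation -- script name and flags are assumptions
python scripts/inference.py \
  --model_checkpoint ./experiments/my_tts_training/final_model.pt \
  --encoder_checkpoint /path/to/codec_encoder.ckpt \
  --decoder_checkpoint /path/to/codec_decoder.ckpt \
  --prompt_wav /path/to/prompt.wav \
  --prompt_transcript "Text spoken in the audio prompt." \
  --text "Hello! This speech was generated by my fine-tuned model." \
  --output_wav ./generated.wav
```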
Required components:
- Trained model checkpoint (.pt file from training)
- Audio encoder checkpoint (codec .pt/.ckpt file)
- Audio decoder checkpoint (codec .pt/.ckpt file + model_config.json in same directory)
- Audio prompt file (.wav format)
- Prompt transcription (text of what's spoken in the audio prompt)
Note: If you don't want to retrain the decoder, you can use the same xcodec2 checkpoint for both the encoder and decoder paths. We provide an xcodec2-compatible model_config.json file in ./example/codec.
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Run tests: make test
- Run linting: make lint-fix
- Commit changes: git commit -m 'Add amazing feature'
- Push to branch: git push origin feature/amazing-feature
- Open a Pull Request
- The Meta AI team for open-sourcing the LLaMA LLMs
- The PyTorch and Hugging Face communities
- Codec architecture inspired by Llasa
- Bug Reports: GitHub Issues
- General Questions: For general inquiries and support, please email us