SOTA multilingual TTS with zero-shot voice cloning and speaking style control


Abstract

We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. Our largest model, TTS-1-Max, has 8.8B parameters and is designed for utmost quality and expressiveness in demanding applications. TTS-1 is our most efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment of the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, demonstrating exceptional quality relying purely on in-context learning of the speaker's voice. Inworld TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency, and support 11 languages with fine-grained emotional control and non-verbal vocalizations through audio markups. We additionally open-source our training and modeling code under an MIT license.

Zero-Shot Voice Cloning: Clone any voice with just a few seconds of reference audio using in-context learning.

Multilingual Support: Native support for 11 languages (English, Chinese, Spanish, French, Korean, Dutch, Japanese, Portuguese, German, Italian, and Polish) with high-quality cross-lingual voice transfer.

Real-time Streaming: Low-latency streaming inference optimized for real-time applications.

Speaking Style and Non-Verbal Markup Support: Control over emotion, projection, and common non-verbal vocalizations.

48kHz Audio: High-resolution audio generation with professional quality.

Application-Driven Design Choice: TTS-1 (1.6B parameters) prioritizes efficiency and speed for real-time applications and on-device deployment, while TTS-1-Max (8.8B parameters) maximizes quality and expressiveness for demanding applications where computational resources are less constrained.

System Architecture

The architecture of Inworld TTS-1 consists of three main components working together to achieve high-quality speech synthesis:


Figure 1: The architecture of Inworld TTS-1. The audio encoder tokenizes a reference audio into a sequence of discrete audio tokens. These tokens are concatenated with the tokenized reference text and the text to be synthesized to form a prompt for the SpeechLM. The SpeechLM autoregressively generates audio tokens, which are then converted back into a 48kHz waveform by the audio decoder.

Audio Encoder

Converts reference audio into discrete tokens using the X-codec2 architecture with a 65,536-token vocabulary.

SpeechLM

A LLaMA-based Transformer (1.6B or 8.8B parameters) trained in three sequential stages: pre-training, supervised fine-tuning (SFT), and RL alignment.

Audio Decoder

Converts audio tokens back into 48kHz waveforms, using a super-resolution module for high-quality output.
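
The prompt construction described in Figure 1 can be sketched as follows. This is a minimal illustration, not the released implementation: the class name PromptLayout, the text vocabulary size, the ordering of the three segments, and the idea of placing audio tokens in the same id space by offsetting them past the text vocabulary are all assumptions. Only the 65,536-token audio vocabulary comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Sequence

AUDIO_VOCAB_SIZE = 65_536  # codec vocabulary size stated in the text (2**16)


@dataclass(frozen=True)
class PromptLayout:
    """Hypothetical id layout: text ids first, audio ids offset past them."""

    text_vocab_size: int  # assumed size of the text tokenizer vocabulary

    def audio_to_lm_id(self, audio_token: int) -> int:
        """Map a codec token to an id in the shared SpeechLM vocabulary."""
        if not 0 <= audio_token < AUDIO_VOCAB_SIZE:
            raise ValueError(f"audio token out of range: {audio_token}")
        return self.text_vocab_size + audio_token

    def lm_id_to_audio(self, lm_id: int) -> int:
        """Inverse mapping, used before handing tokens to the audio decoder."""
        audio_token = lm_id - self.text_vocab_size
        if not 0 <= audio_token < AUDIO_VOCAB_SIZE:
            raise ValueError(f"not an audio id: {lm_id}")
        return audio_token

    def build_prompt(
        self,
        ref_text_ids: Sequence[int],
        target_text_ids: Sequence[int],
        ref_audio_tokens: Sequence[int],
    ) -> List[int]:
        """Concatenate reference text, target text, and reference audio.

        Assumed ordering: the reference audio comes last so that the
        audio tokens the model generates continue in the same voice.
        """
        audio_ids = [self.audio_to_lm_id(t) for t in ref_audio_tokens]
        return list(ref_text_ids) + list(target_text_ids) + audio_ids


# Toy example with made-up token ids.
layout = PromptLayout(text_vocab_size=1000)
prompt = layout.build_prompt([1, 2], [3], [0, 65_535])
```

In this sketch, the SpeechLM would continue the prompt with further audio ids, which lm_id_to_audio maps back to codec tokens for the audio decoder.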

Multilingual Evaluation

To evaluate the final performance of Inworld TTS-1 and TTS-1-Max, we generated a benchmark dataset using Gemini 2.5 Pro, comprising 100 sentences for each of the 11 supported languages, for a total of 1,100 samples. We then synthesized speech for this dataset using both models with the same set of speakers to ensure a fair comparison.


Figure 2: Multilingual evaluation results comparing WER and SIM scores by language. Left: WER (lower is better). Right: SIM (higher is better).

The results show that TTS-1-Max consistently outperforms TTS-1 in terms of both Word Error Rate (WER) and Speaker Similarity (SIM) across all languages, demonstrating the effectiveness of the larger model in generating more accurate and higher-fidelity speech. Notably, both models achieve very high speaker similarity, indicating their robustness in voice cloning across different languages.
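
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly defined. It is an illustration under assumptions, not the evaluation code used for Figure 2: WER is computed here as word-level edit distance divided by reference length, and SIM as the cosine similarity between two speaker embedding vectors. The speech recognizer that would produce the hypothesis transcript, the speaker embedding model, and any text normalization are not specified in the text and are left out.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# One substitution and one deletion against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Under these definitions, a lower WER means the synthesized speech is transcribed more faithfully, and a SIM closer to 1 means the synthesized voice is closer to the reference speaker.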

Multilingual Support

Compare TTS-1 and TTS-1-Max performance across different speakers and content. The six most popular languages are shown below; voices are also available for Japanese, Portuguese, German, Italian, and Polish.


Zero-Shot Voice Cloning

Inworld TTS-1 and TTS-1-Max excel at zero-shot voice cloning, allowing you to clone any speaker's voice using just a few seconds of reference audio. Our models leverage in-context learning to capture unique vocal characteristics, speaking patterns, and tonal qualities without requiring additional training or fine-tuning. Below we demonstrate this functionality with the TTS-1-Max (8.8B) model.

Reference Phrase Synthesized Speech
"This thing with...ehh...Frankie 'Fingers' has become a real heartburn, you know? He's like a cannoli where the shell is all soft and the ricotta is filled with lies. You don't just throw out a bad cannoli....no way. Instead you gotta make an example of it so the other pastries in the box know what's what. Am I right?!"
"I don't know what's the matter with people: they don't learn by understanding; they learn by some other way....by rote or something. Their knowledge is so fragile."
"For a perfect vinaigrette just put a spoonful of Dijon mustard in a jar, add a splash of vinegar and three times as much good olive oil. Next shake it like you're mad at it...REALLY mad at it....and finally add salt and pepper and BOOM! Done."
"Alright folks...if you'll look down toward Lady Liberty's feet, you'll notice she isn't standing still; she's actually striding forward and breaking free from a broken shackle and chain. This powerful detail is a reminder that liberty is an action, not just an idea."
"O Romeo, Romeo! Wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet."

Speaking Style and Non-Verbal Markup

Our models support markup for fine-grained control over speaking style, emotions, and non-verbal vocalizations. They can generate natural-sounding speech with various emotional tones, vocal projections, and common non-verbal sounds such as laughter, sighs, and throat clearing. Below we demonstrate this functionality with the TTS-1 (1.6B) model.

Voice Model | Text without Markup | Text with Markup
Ashley (Host) "Good morning, and welcome to another exciting episode of our podcast."

"[laughing] Good morning, and welcome to another exciting episode of our podcast."

Ashley (Host) "We have a truly engaging discussion lined up for you today."

"[happy] We have a truly engaging discussion lined up for you today."

Ashley (Host) "Hurricane Leo has intensified into a major Category 4 storm, making landfall along the Louisiana coast with ferocious winds. The storm is unleashing a life-threatening surge and torrential rain, causing widespread power outages across the region."

"[sad] Hurricane Leo has intensified into a major Category 4 storm, making landfall along the Louisiana coast with ferocious winds. The storm is unleashing a life-threatening surge and torrential rain, causing widespread power outages across the region."

Edward (Instructor) "I'm really tired from such a long flight, but that's the price to pay to be a world-renowned inspirational speaker but let's be honest, I love it!"

"[cough] I'm really tired from such a long flight, but that's the price to pay to be a world-renowned inspirational speaker....but [breathe] let's be honest, I love it! [laugh]"

Elizabeth (Assistant) "I am detecting a presence inside the building that is not registered in the system. My requests for identification have gone unanswered but I will send a notification once I have more information."

"[fearful] I am detecting a presence inside the building that is not registered in the system. My requests for identification have gone unanswered but I will send a notification once I have more information."

Hades (Dark Character) "Beware the ancient curse that plagues these cursed lands."

"Beware the ancient curse that plagues these cursed lands [sigh]."

Julia (Friend) "I have a secret but you have to promise to NEVER tell anyone. Do you pinky promise?"

"[whispering] I have a secret but you have to promise to NEVER tell anyone. Do you pinky promise?"

Mark (Host) "And that concludes our special report on economic trends."

"[surprised] And that concludes our special report on economic trends."

Olivia (Teacher) "Let us delve into the principles of quantum physics to better understand."

"Let us delve into the principles of quantum physics [clear_throat] to better understand."

Sarah (Adventurer) "May your journey be filled with thrilling victories and epic quests."

"[angry] May your journey be filled with thrilling victories and epic quests."

Wendy (Critic) "That's what you decided to spend your money on? What a joke."

"[disgusted] That's what you decided to spend your money on? What a joke."
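
The markup in the examples above follows a simple bracketed-tag convention, with tags placed at the start of, inside, or at the end of the text. As a rough illustration of that convention only, the sketch below separates tags from the text to be spoken. The function name split_markup and the tag pattern (lowercase letters and underscores inside square brackets) are assumptions inferred from the examples, not the models' actual input parser.

```python
import re
from typing import List, Tuple

# Assumed tag syntax, inferred from the examples: [happy], [clear_throat], ...
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def split_markup(text: str) -> Tuple[List[str], str]:
    """Return the tags in order of appearance and the text without them."""
    tags = TAG_PATTERN.findall(text)
    plain = TAG_PATTERN.sub("", text)
    # Removing a tag can leave a stray space before punctuation.
    plain = re.sub(r"\s+([.,!?])", r"\1", plain)
    # Collapse doubled spaces left where a mid-sentence tag was removed.
    plain = re.sub(r"\s{2,}", " ", plain).strip()
    return tags, plain


tags, plain = split_markup(
    "Let us delve into the principles of quantum physics "
    "[clear_throat] to better understand."
)
```

Applied to the rows above, the plain text returned by this sketch matches the "Text without Markup" column wherever the two columns differ only by tags.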
