Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing


Model Architecture of Ming-UniAudio

[Figure: model architecture of Ming-UniAudio]

Overall Framework of the Unified Continuous Speech Tokenizer: MingTok-Audio

[Figure: overall framework of MingTok-Audio]

The two figures above show the architecture of MingTok-Audio and its three training stages.

Editing Tasks: Video Demos

Instruction-Guided Free-Form Speech Editing

Semantic Editing - Insert

Semantic Editing - Substitute

Semantic Editing - Delete

Acoustic Editing - Dialect Conversion

Acoustic Editing - Volume

Acoustic Editing - Denoise

Acoustic Editing - Background Music

Acoustic Editing - Emotion Conversion

Reference

@article{Mingomni2025,
  title   = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
  author  = {Inclusion AI, Ant Group},
  journal = {Technical Report},
  year    = {2025}
}