Model Architecture of Ming-UniAudio
Overall Framework of the Unified Continuous Speech Tokenizer: MingTok-Audio

The two figures above show the architecture of MingTok-Audio and its three training stages.
Editing Tasks Video Demos
Instruction-Guided Free-Form Speech Editing
Semantic Editing - Insert
Semantic Editing - Substitute
Semantic Editing - Delete
Acoustic Editing - Dialect Conversion
Acoustic Editing - Volume
Acoustic Editing - Denoise
Acoustic Editing - Background Music
Acoustic Editing - Emotion Conversion
Reference
@article{Mingomni2025,
title = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
author = {Inclusion AI, Ant Group},
journal = {Technical Report},
year = {2025}
}