Supports Text+Image (TI), Text+Audio (TA), and Text+Image+Audio (TIA) collaborative conditioning with strong subject consistency, text following, and audio‑visual sync.
Subject Consistency · A/V Sync · Multi‑Modal · Text‑Controllable
Collaboration: Tsinghua University · ByteDance Intelligent Creation Team
Three Generation Modes
TI / TA / TIA cover core needs for subject consistency, semantic alignment, and precise A/V sync.
TI
Text + Image (TI)
Generate videos that follow the text prompt while preserving the subject from a reference image.
- Example: a man in a black suit gracefully putting on brown leather gloves; a woman sleeping with headphones beside a Chihuahua.
- Example: a young witch with a red bow flying with a black kitten through a sun‑dappled forest.
TA
Text + Audio (TA)
Generate videos with precise audio‑visual sync; lip motion and facial expressions align with the speech signal.
- Examples: a torch‑bearing warrior speaking in a cave; an elderly sailor narrating on deck with a cat curled beside him.
- Example: a scientist discussing a vial of glowing liquid in a high‑tech lab.
TIA
Text + Image + Audio (TIA)
Tri‑modal conditioning that balances text alignment, subject consistency, and A/V synchronization for complex, human‑driven scenes.
- Examples: a flight attendant speaking on a corded phone in the cabin; an astronaut delivering lines against a Mars backdrop.
- Examples: a man playing with a Labrador in a yard; a cyberpunk heroine moving through a neon corridor.
Text Control / Edit
Keep the same subject identity while changing appearance (outfits, hairstyle, accessories) and scene via different text prompts.
Same person: swap glasses, hats, suits for casual wear, etc.
Baby example: outfit and hairstyle changes while identity remains stable.
Female example: hair color from platinum‑blonde with aqua tips to deep chestnut with a floral headband.
Compared to other methods, HuMo shows strong subject preservation and audio‑visual synchronization.
Subject Preservation
A young witch, adorned with a large red bow on her head, wearing a black top and a white apron, takes flight on a broomstick. Accompanying her is a black kitten with a red bow around its neck. They soar through the gaps between lush, green trees, where sunlight filters through the leaves. Above them is a clear blue sky dotted with fluffy white clouds.
Audio-Visual Sync
A man in a checkered shirt and headphones sings, plays a silver guitar, and speaks to the camera in a recording studio. A static front shot captures his rhythmic movements and deeply focused, emotionally engaged expression against a lit, card-decorated black wall.
Typical Use Cases
Discover how HuMo AI transforms industries with human-centric video generation
Film / Short Drama
Quickly generate character shots and reduce production costs.
Virtual Humans
E‑commerce presenters, brand ambassadors, virtual hosts, and support agents.
Advertising
Rapid creative prototyping and on‑brand short videos.
Education & Training
Virtual instructors and scenario‑based language learning.
Social & Entertainment
Personalized avatars and interactive short‑form content.
E‑commerce Showcases
Dynamic try‑ons for apparel and accessories to boost conversion.
Frequently Asked Questions
Everything you need to know about HuMo AI
What is HuMo AI?
HuMo AI is a video generation system that takes text, images, and audio as input to create videos with consistent subject identity, accurate prompt following, and natural audio‑visual sync.
Is HuMo AI open source?
The research paper and reference code are publicly available for learning and experimentation.
How can I improve audio‑visual sync?
Use clean speech audio and adjust the audio guidance scale; removing background noise also helps.
How long can the videos be?
By default, it generates around 4 seconds of video (97 frames at 25 FPS). Longer videos are possible, but quality may degrade.
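The frame count and frame rate quoted above determine the clip length; a quick sanity check of the default:

```python
# Default clip length: 97 frames at 25 FPS (values from the FAQ above).
frames = 97
fps = 25
duration_s = frames / fps
print(f"{duration_s:.2f} s")  # 3.88 s, i.e. roughly 4 seconds
```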
Can it run on multiple GPUs?
Yes, the reference setup supports multi-GPU inference.
What resolutions are supported?
480p and 720p. 720p gives better detail.
What inputs are supported?
Text + Image (TI)
Text + Audio (TA)
Text + Image + Audio (TIA)
Reference images help keep the subject consistent.
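The three modes map directly to which conditioning inputs are present alongside the text prompt. A minimal sketch of that mapping (the helper `select_mode` is illustrative, not part of HuMo's actual API):

```python
def select_mode(has_image: bool, has_audio: bool) -> str:
    """Pick the HuMo generation mode from the available inputs.

    Text is always required; the mode name reflects the extra modalities.
    Illustrative helper only, not part of HuMo's real interface.
    """
    if has_image and has_audio:
        return "TIA"  # Text + Image + Audio
    if has_audio:
        return "TA"   # Text + Audio
    if has_image:
        return "TI"   # Text + Image
    raise ValueError("expected at least a reference image or an audio clip")

print(select_mode(has_image=True, has_audio=False))  # TI
```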
Paper & Code
Explore our research and implementation
Quick Start
Get started in just 4 simple steps
1. Prepare a text prompt, a reference image, and/or an audio clip.
2. Select a generation mode: TI / TA / TIA.
3. Set resolution and duration, then submit the job.
4. Preview and download the result.