CineDub
Scaling End-to-End Video Dubbing
to Multi-Speaker Dialogues with Coherent Sound Effects

Anonymous Authors

⏳ On first visit, please allow 1–2 minutes for all videos to preload.

A Unified End-to-End Framework for Cinematic Audio

Automatic video dubbing in the wild remains fundamentally limited by two competing constraints: hierarchical methods depend on brittle, multi-stage preprocessing pipelines, while holistic approaches suffer from weak temporal alignment and speaker-utterance ambiguity in multi-speaker settings. We propose CineDub, a unified diffusion-based model that achieves precise multi-speaker dialogue dubbing directly from uncropped videos — without face cropping or speaker diarization. Central to our approach is the Implicitly-Coupled Holistic Conditioning (ICHC) paradigm, which resolves speaker ambiguity through cross-modal coupling of holistic visual features and semantic-bundled transcriptions, enabling multi-turn dialogue dubbing at scale. Building on these unified temporal cues, CineDub further extends to joint speech and coherent audio generation in a single pass — eliminating the ghost speech artifacts and acoustic incoherence inherent to cascaded pipelines.

🎯

Implicitly-Coupled Holistic Conditioning (ICHC)

SynchFormer features exhibit emergent attention-switching, dynamically tracking the active speaker without diarization. A semantic-bundled transcription format resolves speaker-utterance ambiguity and can be generated directly by MLLMs.
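As a concrete illustration, a semantic-bundled transcription might bundle each utterance with a natural-language speaker description so the model can ground who says what without diarization. The schema below is a minimal sketch under that assumption; the exact format CineDub uses (and what Gemini 2.5 Pro emits for it) is not specified here.

```python
# Hypothetical "semantic-bundled" transcription: each segment couples an
# utterance with a speaker description, grounding speaker identity in text
# rather than in a diarization pipeline. Field names are illustrative.
segments = [
    {"speaker": "an older man, calm and weary", "utterance": "You should have called first."},
    {"speaker": "a young woman, defensive",     "utterance": "I tried. Your line was busy."},
]

def bundle(segments):
    """Serialize segments into a single caption string an MLLM could emit."""
    return " ".join(f"[{s['speaker']}] {s['utterance']}" for s in segments)

print(bundle(segments))
```

The point of the bundling is that speaker attribution travels inside the text condition itself, so the diffusion model can couple it with the holistic visual features during training instead of relying on a separate diarization stage.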

🔗

Unified Speech & Audio Joint Generation

A single end-to-end model simultaneously generates dubbed speech and coherent sound effects from uncropped video, bypassing cascaded pipelines that introduce ghost speech artifacts and acoustic incoherence.

📚

Ambient-to-Linguistic Curriculum Learning

Inspired by human linguistic evolution, the model first learns general audio then specializes toward speech, resolving the optimization conflict between tasks. A decoupled textual branch with learnable meta-tokens further routes speech and audio prompts independently, eliminating cross-prompt interference and achieving expert-mode inference performance.
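The branch-routing idea above can be sketched in a few lines: each prompt type flows through its own branch, and when one branch has no prompt at inference time, a learned meta-token stands in for it. This is a toy NumPy sketch under assumed shapes and names (`route_prompts`, the meta-token vectors); it is not the paper's implementation.

```python
import numpy as np

D = 8  # toy embedding dimension
rng = np.random.default_rng(0)

# Learnable meta-tokens (hypothetical): trained placeholders that replace an
# inactive prompt branch during single-task inference.
speech_meta = rng.normal(size=(1, D))
audio_meta = rng.normal(size=(1, D))

def route_prompts(speech_emb=None, audio_emb=None):
    """Route each prompt through its own decoupled branch; substitute the
    meta-token when that branch has no prompt, so the streams never mix."""
    speech_branch = speech_emb if speech_emb is not None else speech_meta
    audio_branch = audio_emb if audio_emb is not None else audio_meta
    # Branches are only joined at the conditioning interface.
    return np.concatenate([speech_branch, audio_branch], axis=0)

# Speech-only inference: the audio branch falls back to its meta-token.
cond = route_prompts(speech_emb=rng.normal(size=(3, D)))
print(cond.shape)  # (4, 8): 3 speech prompt tokens + 1 audio meta-token
```

Because the inactive branch is filled by a token learned for exactly that situation, single-task inference behaves like a dedicated expert model rather than a multi-task model receiving an empty or dummy prompt.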

🏆

Two New In-the-Wild Benchmarks

CineDub-Multi for multi-speaker dialogue dubbing and CineDub-SA for V2SA evaluation — enabling realistic assessment beyond single-speaker assumptions.

Overview: Paradigm Comparison

Paradigm comparison figure
Fig. 1

(a) Existing video dubbing methods rely on complex preprocessing — speaker diarization, face & lip detection, and forced alignment — limiting scalability and practicality in the wild. (b) Generating accompanying sound effects requires a separate V2A model, which inevitably introduces ghost speech artifacts — spurious speech-like sounds from the V2A model that corrupt the final mix — and acoustic incoherence between independently generated speech and audio. (c) CineDub jointly generates multi-speaker dialogue with coherent sound effects through a single unified end-to-end model, requiring only an uncropped video and an MLLM-generated semantic-bundled transcription.

CineDub ICHC framework
Fig. 2

The ICHC paradigm of CineDub. The holistic visual condition extracted by SynchFormer encodes both event-level audio-visual correspondences and fine-grained lip-sync alignment. The semantic-bundled transcription provides per-segment speaker-utterance grounding cues, implicitly coupling with visual features via multi-conditional training to resolve speaker ambiguity. Speech and audio prompts are routed through decoupled textual branches to prevent cross-prompt interference, with learnable meta-tokens replacing inactive branches during single-task inference to achieve expert-mode performance.

CineDub vs. Prior Work

CineDub is the first holistic model that simultaneously handles multi-speaker dialogue, precise lip sync, and end-to-end joint speech+audio generation — no preprocessing required.

Method          Paradigm        Notes
Automatic Video Dubbing
HPMDubbing      Lip/Face Crop
EmoDubber       Lip/Face Crop
AlignDiT        Lip/Face Crop
DeepDubber      Holistic
DeepAudio       Holistic        cascade
FunCineForge    Lip/Face Crop   AR
Video-to-Speech-and-Audio (V2SA)
DualDub         Holistic        AR
BVS             Holistic        AR
AudioGen-Omni   Holistic
VSSFlow         Lip/Face Crop
Ours
CineDub         Holistic        checkpoint: Under Review

Demos

All samples generated by a single unified end-to-end DiT model — raw video input only, single-pass generation, no cherry-picking.

Automatic Video Dubbing

Input: silent video + transcription caption → Output: dubbed speech with coherent sound effects.
Transcription captions are directly generated by Gemini 2.5 Pro without any manual annotation. CineDub processes everything in a single unified end-to-end DiT model — no face cropping, no speaker diarization, no preprocessing.

Comparison with FunCineForge — All FunCineForge samples are generated via the official demo interface at modelscope.cn/studios/FunAudioLLM. We ran multiple rounds and selected relatively good results for comparison.
Audio Prompt Control: CineDub accepts an optional reference audio to customize the output voice timbre and speaking style. Each group shows a reference audio and two corresponding video output samples generated using that reference.
Text Prompt Control: By modifying the speaker description in the transcription caption at the clause level, we can fine-tune a speaker's state, tone, and emotion — enabling fine-grained prosody control.
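The clause-level control described above can be illustrated with two caption variants: only the speaker-state clause changes while the utterance stays identical, steering delivery without altering content. The bracketed caption format is an illustrative assumption, not the paper's exact schema.

```python
# Hypothetical clause-level edit: the utterance is unchanged; only the
# speaker-description clause differs, cueing a different prosody.
base  = "[a middle-aged detective, speaking flatly] Where were you last night?"
angry = "[a middle-aged detective, voice rising in anger] Where were you last night?"

# Same words, different delivery cue.
assert base.split("] ")[1] == angry.split("] ")[1]
print(base)
print(angry)
```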

Video to Speech and Audio

Input: silent video + transcription caption + audio caption (optional) → Output: speech and audio.

Out-of-Distribution: CineDub generalizes to AI-generated video.
Baselines: AlignDiT+MMAudio | DeepAudio+MMAudio | CineDub (Ours).

Video to Audio Generation

Input: silent video + audio prompt (optional) → Output: audio.
CineDub's unified visual conditioning enables high-quality video-to-audio generation with optional text prompt for fine-grained control.

4-Way Comparison: AudioX | HunyuanVideo-Foley | MMAudio | CineDub (Ours).