CineDub
Scaling End-to-End Video Dubbing
to Multi-Speaker Dialogues with Coherent Sound Effects

Anonymous Authors

⏳ On first visit, please allow 1–2 minutes for all videos to preload.

A Unified End-to-End Framework for Cinematic Audio

Automatic video dubbing in the wild remains fundamentally limited by two competing constraints: hierarchical methods depend on brittle, multi-stage preprocessing pipelines, while holistic approaches suffer from weak temporal alignment and speaker-utterance ambiguity in multi-speaker settings. We propose CineDub, a unified diffusion-based model that achieves precise multi-speaker dialogue dubbing directly from uncropped videos — without face cropping or speaker diarization. Central to our approach is the Implicitly-Coupled Holistic Conditioning (ICHC) paradigm, which resolves speaker ambiguity through cross-modal coupling of holistic visual features and semantic-bundled transcriptions, enabling multi-turn dialogue dubbing at scale. Building on these unified temporal cues, CineDub further extends to joint speech and coherent audio generation in a single pass — eliminating the ghost speech artifacts and acoustic incoherence inherent to cascaded pipelines.

🎯

Implicitly-Coupled Holistic Conditioning (ICHC)

SynchFormer features exhibit emergent attention-switching, dynamically tracking the active speaker without diarization. A semantic-bundled transcription format resolves speaker-utterance ambiguity and can be generated directly by MLLMs.
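As a concrete illustration, a semantic-bundled transcription might bundle each utterance with a natural-language speaker description so the model can ground who says what without diarization. The schema below is a minimal sketch under that assumption; the exact format CineDub uses (and what Gemini 2.5 Pro emits for it) is not specified here.

```python
# Hypothetical "semantic-bundled" transcription: each segment couples an
# utterance with a speaker description, grounding speaker identity in text
# rather than in a diarization pipeline. Field names are illustrative.
segments = [
    {"speaker": "an older man, calm and weary", "utterance": "You should have called first."},
    {"speaker": "a young woman, defensive",     "utterance": "I tried. Your line was busy."},
]

def bundle(segments):
    """Serialize segments into a single caption string an MLLM could emit."""
    return " ".join(f"[{s['speaker']}] {s['utterance']}" for s in segments)

print(bundle(segments))
```

The point of the bundling is that speaker attribution travels inside the text condition itself, so the diffusion model can couple it with the holistic visual features during training instead of relying on a separate diarization stage.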

🔗

Unified Speech & Audio Joint Generation

A single end-to-end model simultaneously generates dubbed speech and coherent sound effects from uncropped video, bypassing cascaded pipelines that introduce ghost speech artifacts and acoustic incoherence.

📚

Ambient-to-Linguistic Curriculum Learning

Inspired by human linguistic evolution, the model first learns general audio then specializes toward speech, resolving the optimization conflict between tasks. A decoupled textual branch with learnable meta-tokens further routes speech and audio prompts independently, eliminating cross-prompt interference and achieving expert-mode inference performance.
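The branch-routing idea above can be sketched in a few lines: each prompt type flows through its own branch, and when one branch has no prompt at inference time, a learned meta-token stands in for it. This is a toy NumPy sketch under assumed shapes and names (`route_prompts`, the meta-token vectors); it is not the paper's implementation.

```python
import numpy as np

D = 8  # toy embedding dimension
rng = np.random.default_rng(0)

# Learnable meta-tokens (hypothetical): trained placeholders that replace an
# inactive prompt branch during single-task inference.
speech_meta = rng.normal(size=(1, D))
audio_meta = rng.normal(size=(1, D))

def route_prompts(speech_emb=None, audio_emb=None):
    """Route each prompt through its own decoupled branch; substitute the
    meta-token when that branch has no prompt, so the streams never mix."""
    speech_branch = speech_emb if speech_emb is not None else speech_meta
    audio_branch = audio_emb if audio_emb is not None else audio_meta
    # Branches are only joined at the conditioning interface.
    return np.concatenate([speech_branch, audio_branch], axis=0)

# Speech-only inference: the audio branch falls back to its meta-token.
cond = route_prompts(speech_emb=rng.normal(size=(3, D)))
print(cond.shape)  # (4, 8): 3 speech prompt tokens + 1 audio meta-token
```

Because the inactive branch is filled by a token learned for exactly that situation, single-task inference behaves like a dedicated expert model rather than a multi-task model receiving an empty or dummy prompt.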

🏆

Two New In-the-Wild Benchmarks

CineDub-Multi for multi-speaker dialogue dubbing and CineDub-SA for V2SA evaluation — enabling realistic assessment beyond single-speaker assumptions.

Overview: Paradigm Comparison

Paradigm comparison figure
Fig. 1

(a) Existing video dubbing methods rely on complex preprocessing — speaker diarization, face & lip detection, and forced alignment — limiting scalability and practicality in the wild. (b) Generating accompanying sound effects requires a separate V2A model, which inevitably introduces ghost speech artifacts — spurious speech-like sounds from the V2A model that corrupt the final mix — and acoustic incoherence between independently generated speech and audio. (c) CineDub jointly generates multi-speaker dialogue with coherent sound effects through a single unified end-to-end model, requiring only an uncropped video and an MLLM-generated semantic-bundled transcription.

CineDub ICHC framework
Fig. 2

The ICHC paradigm of CineDub. The holistic visual condition extracted by SynchFormer encodes both event-level audio-visual correspondences and fine-grained lip-sync alignment. The semantic-bundled transcription provides per-segment speaker-utterance grounding cues, implicitly coupling with visual features via multi-conditional training to resolve speaker ambiguity. Speech and audio prompts are routed through decoupled textual branches to prevent cross-prompt interference, with learnable meta-tokens replacing inactive branches during single-task inference to achieve expert-mode performance.

CineDub vs. Prior Work

CineDub is the first holistic model that simultaneously handles multi-speaker dialogue, precise lip sync, and end-to-end joint speech+audio generation — no preprocessing required.

Method          Paradigm        Notes
Automatic Video Dubbing
HPMDubbing      Lip/Face Crop
EmoDubber       Lip/Face Crop
AlignDiT        Lip/Face Crop
DeepDubber      Holistic
DeepAudio       Holistic        cascade
FunCineForge    Lip/Face Crop   AR
Video-to-Speech-and-Audio (V2SA)
DualDub         Holistic        AR
BVS             Holistic        AR
AudioGen-Omni   Holistic
VSSFlow         Lip/Face Crop
Ours
CineDub         Holistic        checkpoint: Under Review

Demos

All samples generated by a single unified end-to-end DiT model — raw video input only, single-pass generation, no cherry-picking.

Automatic Video Dubbing

Input: silent video + transcription caption → Output: dubbed speech with coherent sound effects.
Transcription captions are directly generated by Gemini 2.5 Pro without any manual annotation. CineDub processes everything in a single unified end-to-end DiT model — no face cropping, no speaker diarization, no preprocessing.

Comparison with FunCineForge — All FunCineForge samples are generated via the official demo interface at modelscope.cn/studios/FunAudioLLM. We ran multiple rounds and selected relatively good results for comparison.
Audio Prompt Control: CineDub accepts an optional reference audio to customize the output voice timbre and speaking style. Each group shows a reference audio and two corresponding video output samples generated using that reference.
Text Prompt Control: By modifying the speaker description in the transcription caption at the clause level, we can fine-tune a speaker's state, tone, and emotion — enabling fine-grained prosody control.
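The clause-level control described above can be illustrated with two caption variants: only the speaker-state clause changes while the utterance stays identical, steering delivery without altering content. The bracketed caption format is an illustrative assumption, not the paper's exact schema.

```python
# Hypothetical clause-level edit: the utterance is unchanged; only the
# speaker-description clause differs, cueing a different prosody.
base  = "[a middle-aged detective, speaking flatly] Where were you last night?"
angry = "[a middle-aged detective, voice rising in anger] Where were you last night?"

# Same words, different delivery cue.
assert base.split("] ")[1] == angry.split("] ")[1]
print(base)
print(angry)
```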

Video to Speech and Audio

Input: silent video + transcription caption + audio caption (optional) → Output: speech and audio.

Out-of-Distribution: CineDub generalizes to AI-generated video.
Baselines: AlignDiT+MMAudio | DeepAudio+MMAudio | CineDub (Ours).

Video to Audio Generation

Input: silent video + audio prompt (optional) → Output: audio.
CineDub's unified visual conditioning enables high-quality video-to-audio generation with optional text prompt for fine-grained control.

4-Way Comparison: AudioX | HunyuanVideo-Foley | MMAudio | CineDub (Ours).