Seedance 2.0 Omni-Reference — Multimodal Input Guide
Seedance 2.0 features an Omni-Reference system — a unified multimodal pipeline that lets you combine text with up to 9 images, 3 video clips, and 3 audio tracks in one request (subject to platform limits). According to the official ByteDance Seed blog (Feb 2026), the model can reference composition, motion, camera, effects, and sound from these inputs. This section summarizes the public description of the Omni-Reference system.
Source basis and reading boundary
These guides are written as third-party reference summaries, not official product documentation or support content.
Source basis
- ByteDance official launch blog: Seedance 2.0 (2026-03-27)
- ByteDance Seedance 2.0 project page (2026-03-27)
Supported inputs
- Text: a natural-language prompt.
- Images: up to 9 per request (commonly ~30 MB each, per public docs).
- Video: up to 3 clips, typically 2–15 s total, ~50 MB per clip.
- Audio: up to 3 files, typically ≤15 s total, ~15 MB each.
The combined cap is 12 reference files per request, so the per-type maximums cannot all be used at once. The model uses these references for layout, motion, camera, style, and sound, as directed by your prompt and @ tags.
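The limits above lend themselves to a simple client-side pre-flight check before submitting a request. A minimal sketch in Python; the numeric limits mirror the public docs summarized here, while the function and constant names are illustrative, not an official SDK:

```python
# Pre-flight check of reference files against the publicly described limits.
# Per-type counts and sizes follow the docs above; all names here are
# illustrative assumptions, not an official Seedance API.
LIMITS = {
    "image": {"max_count": 9, "max_bytes": 30 * 1024**2},
    "video": {"max_count": 3, "max_bytes": 50 * 1024**2},
    "audio": {"max_count": 3, "max_bytes": 15 * 1024**2},
}
TOTAL_MAX = 12  # combined cap across all reference types

def validate_references(files):
    """files: list of (kind, size_bytes) tuples. Returns a list of problems."""
    problems = []
    if len(files) > TOTAL_MAX:
        problems.append(f"{len(files)} files exceeds total cap of {TOTAL_MAX}")
    counts = {}
    for kind, size in files:
        counts[kind] = counts.get(kind, 0) + 1
        limit = LIMITS.get(kind)
        if limit is None:
            problems.append(f"unknown reference type: {kind}")
        elif size > limit["max_bytes"]:
            problems.append(f"{kind} file of {size} bytes exceeds per-file limit")
    for kind, n in counts.items():
        if kind in LIMITS and n > LIMITS[kind]["max_count"]:
            problems.append(f"{n} {kind} files exceeds max of {LIMITS[kind]['max_count']}")
    return problems
```

An empty return value means the request is within the described limits; otherwise each string names one violation, so you can trim or resize assets before uploading.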
@ tag reference system
You can refer to uploaded assets in the prompt with @ tags (e.g. @Image1, @Video1, @Audio1). Examples from public docs: “@Image1 as the first frame,” “Reference @Video1 for camera movement,” “Use @Audio1 for background music.” This gives precise control over which image drives character, which video drives motion, and which audio drives music or dialogue.
@ reference practical examples
Common @ tag patterns for Omni-Reference:
1. First-frame lock: "@Image1 as the opening frame, character walks toward camera" — pins the starting composition.
2. Character consistency: "Same character as @Image1, wearing the same outfit as @Image2" — locks identity across shots.
3. Camera replication: "Replicate the camera movement from @Video1, apply to new scene with @Image1 as subject" — transfers the motion path.
4. Audio-driven: "Use @Audio1 as background music, lip-sync dialogue to @Audio2" — separates the music and voice tracks.
5. Multi-reference combo: "@Image1 as character, @Image2 as background, reference @Video1 for camera motion, @Audio1 for ambient sound" — full scene assembly using 4 references.
Always state each asset's role explicitly in the prompt; unnamed references may be ignored or misinterpreted.
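Since unnamed references may be ignored, it is worth checking that every uploaded asset is actually mentioned in the prompt before submitting. A minimal sketch, assuming the @Image1/@Video1/@Audio1 naming convention shown above; the helper names are illustrative:

```python
import re

# Find @Image/@Video/@Audio tags in a prompt and report uploaded assets the
# prompt never assigns a role to. The tag pattern follows the @Image1 /
# @Video1 / @Audio1 convention described in the public docs; the function
# names are an illustrative sketch, not an official SDK.
TAG_RE = re.compile(r"@(Image|Video|Audio)(\d+)")

def referenced_tags(prompt):
    """Return the set of tag names (e.g. 'Image1') mentioned in the prompt."""
    return {kind + num for kind, num in TAG_RE.findall(prompt)}

def unreferenced_assets(prompt, uploaded):
    """uploaded: iterable of tag names like 'Image1'. Returns the sorted
    list of uploads the prompt never mentions, so you can either give them
    a role in the prompt or drop them from the request."""
    return sorted(set(uploaded) - referenced_tags(prompt))
```

For example, with the multi-reference combo prompt from pattern (5), `unreferenced_assets(prompt, ["Image1", "Image2", "Image3", "Video1", "Audio1"])` flags `Image3` as uploaded but never referenced.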
Native audio-video generation
Seedance 2.0 generates video and audio in a single joint process (not post-dubbing). It supports stereo output, lip-sync (including multiple languages, per public reports), and alignment of music and sound effects with the picture. This is useful for ads, music videos, and dialogue-heavy clips.
Frequently asked questions
How many reference images can I use?
According to public documentation, up to 9 images in one request, plus 3 videos and 3 audio files. Check your platform's current limits and file size rules.
What inputs does Seedance 2.0 multimodal support?
According to public reports, Seedance 2.0 supports a natural-language text prompt plus up to 9 images, 3 video clips, and 3 audio tracks, with a combined cap of 12 reference files per request. See our tutorial for the full workflow.
How does audio input affect video output?
According to public documentation, audio input can drive background music, dialogue, or sound effects. The model jointly generates picture and audio, with sound aligned to the visuals. Supports multi-language lip-sync. See our tutorial for more.
Can I combine image and video references?
Yes. Per public documentation, you can combine up to 9 images and 3 video clips in one request. Use @ tags in the prompt to assign each asset's role. See our image-to-video guide for details.
Related guides
Seedance 2.0 Tutorial — How to Use Text-to-Video & Image-to-Video (Step by Step)
Step-by-step Seedance 2.0 tutorial for beginners: text-to-video, image-to-video, prompt structure, settings, and your first generation on Dreamina. Updated April 2026.
Seedance 2.0 Technical Architecture — How the Model Works Under the Hood
Technical overview of Seedance 2.0: dual-branch diffusion transformer, multimodal inputs (9 images, 3 videos, 3 audio), 2K output, 4–15 s, native audio-video joint generation.
Seedance 2.0 Prompt Writing Tips — How to Write Better Video Prompts
Write better Seedance 2.0 prompts: subject + action + camera + style formulas, @ reference tags, and practical before/after tips for text-to-video and image-to-video workflows.