Seedance 2.0 Glossary

This glossary translates the most common Seedance 2.0 terms into practical language so you can read public guides faster, write prompts more clearly, and debug workflow issues without guessing what each phrase means.

Source basis and reading boundary

These guides are written as third-party reference summaries, not official product documentation or support content.

Consistency, references, and identity lock

When public guides mention character consistency, they usually mean keeping the same face, outfit, silhouette, and overall identity across multiple shots. In practice, that depends on reference hygiene: use a small number of clear images, keep those images visually close to the result you want, and repeat the same identity constraints in each prompt. If consistency breaks, the model is often being asked to carry too many visual variables at once.

Shot planning, extension, and continuity

Video extension means continuing an existing clip from its last frames rather than generating a brand-new scene. Multi-shot storytelling means designing several shots that still feel like one sequence. These ideas are related but not identical: extension is mostly about temporal continuity, while multi-shot planning is about story structure, pacing, camera changes, and visual anchors that survive from one shot to the next.

Multimodal control, native audio, and lip-sync

Multimodal input means combining text with images, video, and sometimes audio cues in one request. Native audio means the output can include generated sound instead of relying only on silent video. Lip-sync refers to mouth motion aligning with spoken dialogue. These terms matter because they change how you brief the model: a text-only prompt needs more descriptive detail, while a multimodal prompt can delegate part of the control to references, source footage, and audio timing.

2K output and resolution tiers

2K output refers to the maximum video resolution available in Seedance 2.0, roughly 2048 pixels on the long edge. Public materials describe this as a significant step up from earlier model versions that often capped at 720p or 1080p. In practice, 2K matters most when output will be viewed on larger screens or cropped in post-production. Not all access surfaces may expose the full 2K option at every pricing tier, so verify what your specific platform offers.

Camera motion replication

Camera motion replication means uploading a reference video clip and telling the model to copy its camera movement — pans, tilts, tracking shots, dolly zooms, orbits, or handheld shake — rather than describing the motion in text. In Seedance 2.0, you tag the reference clip with @video1 and specify its role as camera guidance. This is especially useful when you want a precise movement style that is hard to describe in words, such as a specific Steadicam feel or a particular drone flight path.

Image-to-video vs text-to-video

Text-to-video (T2V) generates a clip entirely from a written description. Image-to-video (I2V) starts from one or more uploaded images, so the model inherits composition, subject appearance, and color from the source. Most production workflows mix both: use I2V when you need visual fidelity to existing assets (product shots, storyboard frames, character designs) and T2V when you want the model to generate the full scene from imagination. Seedance 2.0 also supports hybrid workflows where images supply the look and text supplies the motion.

Reference images, reference videos, and @mention tags

References are the source files you upload alongside a text prompt to guide the generation. In Seedance 2.0's All-round Reference mode, each file is tagged with an @mention (e.g., @image1, @video1, @audio1) to specify its role: first frame, character face lock, style palette, camera motion source, or music sync track. Reference hygiene — using a small number of clear, visually consistent files — is one of the biggest factors in output quality. Conflicting references (different lighting, mismatched faces, clashing styles) are a common cause of drift and artifacts.
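
As a rough illustration of the tagging idea above, the sketch below numbers uploaded files per media kind and builds a short role preamble. The helper names `tag_references` and `role_preamble` are hypothetical, and numbering tags in upload order per kind is an assumption, not documented Seedance 2.0 behavior.

```python
from collections import defaultdict

ALLOWED_KINDS = ("image", "video", "audio")


def tag_references(files):
    """Assign @image1, @video1, ... tags to (kind, role) pairs.

    Assumption: tags are numbered per media kind, in upload order.
    """
    counters = defaultdict(int)
    tagged = []
    for kind, role in files:
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unsupported reference kind: {kind}")
        counters[kind] += 1
        tagged.append((f"@{kind}{counters[kind]}", role))
    return tagged


def role_preamble(tagged):
    """Turn (tag, role) pairs into one sentence per reference."""
    return " ".join(f"{tag} as {role}." for tag, role in tagged)


refs = tag_references([
    ("image", "first frame"),
    ("image", "character face lock"),
    ("video", "camera motion source"),
])
print(role_preamble(refs))
```

Writing the roles out explicitly like this makes it easier to spot when two references have been given overlapping jobs.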

Identity drift

Identity drift is the visible change in a character's appearance — face shape, hairstyle, clothing details, or body proportions — between frames or across shots within the same project. It typically occurs when the model receives overloaded references, conflicting style anchors, or prompts that introduce too many visual variables at once. The most reliable countermeasure is reference simplification: fewer images, consistent lighting across references, and repeating the same identity-locking phrase in every generation request.

Reference-first architecture

Reference-first architecture is a workflow philosophy where uploaded images and videos drive the generation more than the text description. Instead of writing a 200-word paragraph describing every visual detail, you supply clear reference files and use the text prompt mainly for motion, timing, and context that the images cannot convey. This approach tends to produce more visually consistent results because the model inherits composition, color, and identity directly from the source material rather than interpreting ambiguous language.

Neural asset

A neural asset is a reusable AI representation of a character, product, or scene that has been encoded via reference images so the model can reproduce it consistently across multiple generations. Unlike traditional 3D assets, a neural asset lives in the model's latent space and is invoked by uploading the same reference set with consistent tagging. Building a reliable neural asset means curating 2–5 high-quality reference images that agree on lighting, angle, and costume, then testing them across several generations before committing to a production run.

Canonical character sheet

A canonical character sheet is a multi-angle reference collage — typically front, side, and three-quarter views — used as a persistent identity anchor for a character across shots. The concept borrows from traditional animation model sheets. In Seedance 2.0 workflows, uploading a well-made character sheet as a reference image gives the model a richer spatial understanding of the subject than a single photo can. Best results come from sheets where all angles share the same lighting, background, and outfit.

Prompt leakage

Prompt leakage happens when style elements, color palettes, or visual motifs from one generation bleed into an unrelated generation, often within the same session or when conflicting references remain active. It is distinct from identity drift because the contamination comes from a previous prompt context rather than from the current reference set. Mitigation strategies include starting a fresh session for unrelated projects, clearing uploaded references between runs, and keeping style anchor images separate per project.

Keyframe injection

Keyframe injection is the practice of re-uploading reference images at specific moments in a multi-shot workflow to reset the model's visual memory and prevent gradual drift. For example, if you are generating a 5-shot sequence, you might re-inject the original character reference at shot 3 and shot 5 instead of relying on the model to carry identity forward from shot 1 alone. This technique is especially important for longer sequences where cumulative drift would otherwise make the final shot look noticeably different from the first.
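
The re-injection schedule can be sketched as a small planning helper. `injection_shots` and its `interval` parameter are hypothetical names for illustration; the default interval of 2 simply reproduces the 5-shot example above and is not a recommended setting from any official source.

```python
def injection_shots(total_shots, interval=2):
    """Return the 1-based shot numbers that should carry the original references.

    Shot 1 always includes the references; after that they are
    re-uploaded every `interval` shots.
    """
    if total_shots < 1:
        raise ValueError("total_shots must be at least 1")
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return list(range(1, total_shots + 1, interval))


# For a 5-shot sequence this yields shots 1, 3, and 5.
print(injection_shots(5))
```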

Anchor-and-master method

The anchor-and-master method is a production workflow where one master shot establishes the definitive look — lighting, color grade, composition, and character appearance — and all subsequent shots are generated using that master as the primary reference. This creates a visual anchor that keeps the entire sequence cohesive. The method works best when the master shot is generated first, reviewed carefully, and only promoted to anchor status after it meets quality standards. Changing the master mid-project usually requires re-generating downstream shots.

SCELA framework

SCELA stands for Subject + Context + Effect + Lighting + Action. It is a structured prompt formula designed to produce consistent, predictable results by ensuring every generation request covers the five key dimensions the model needs. Subject defines who or what is in the scene. Context sets the environment and mood. Effect specifies stylistic treatment. Lighting controls the visual atmosphere. Action describes what happens and how the camera moves. Following SCELA reduces the chance of the model filling in gaps with unwanted defaults.
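
One way to make the five dimensions hard to skip is to hold them in a small structure that refuses to render when a field is empty. This is only a sketch: the `ScelaPrompt` class and the comma-joined output format are assumptions for illustration, not a format the model requires.

```python
from dataclasses import dataclass, fields


@dataclass
class ScelaPrompt:
    subject: str   # who or what is in the scene
    context: str   # environment and mood
    effect: str    # stylistic treatment
    lighting: str  # visual atmosphere
    action: str    # what happens and how the camera moves

    def render(self):
        """Join the five parts in SCELA order, rejecting empty fields."""
        parts = []
        for field in fields(self):
            value = getattr(self, field.name).strip()
            if not value:
                raise ValueError(f"SCELA field '{field.name}' is empty")
            parts.append(value)
        return ", ".join(parts)


prompt = ScelaPrompt(
    subject="woman in a cream trench coat with short black hair",
    context="rainy neon street at night",
    effect="cinematic film look",
    lighting="neon signs reflecting on wet pavement",
    action="she walks forward, slow tracking shot",
)
print(prompt.render())
```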

Shot-script format

Shot-script format is a timecoded prompt structure where each segment of a video gets its own instruction block — for example, '0–3s: wide establishing shot of city skyline at dusk, 4–8s: cut to medium shot of character walking, 9–13s: close-up on hands opening a letter.' This format is designed for 13–15 second Seedance 2.0 generations and gives the model explicit timing cues for transitions, camera changes, and action beats. It works best when each segment has one clear action and one camera setup.
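
A shot script can be assembled and sanity-checked before it is pasted into a prompt. The sketch below assumes whole-second timestamps and a 15-second ceiling; `render_shot_script` is a hypothetical helper, and the output simply mirrors the example format above.

```python
def render_shot_script(segments, max_seconds=15):
    """Render (start, end, description) tuples as a timecoded prompt.

    Checks that segments are in order, do not overlap, and fit
    inside `max_seconds`.
    """
    blocks = []
    previous_end = 0
    for start, end, description in segments:
        if start < previous_end:
            raise ValueError(f"segment starting at {start}s overlaps the previous one")
        if end <= start:
            raise ValueError(f"segment {start}-{end}s has no duration")
        if end > max_seconds:
            raise ValueError(f"segment ends at {end}s, past the {max_seconds}s limit")
        blocks.append(f"{start}\u2013{end}s: {description}")
        previous_end = end
    return ", ".join(blocks)


print(render_shot_script([
    (0, 3, "wide establishing shot of city skyline at dusk"),
    (4, 8, "cut to medium shot of character walking"),
    (9, 13, "close-up on hands opening a letter"),
]))
```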

Content decay

Content decay describes the rate at which AI-generated video content becomes visually or technically obsolete as newer models improve. A clip that looked impressive when generated by an earlier model may appear stiff or low-fidelity compared to outputs from the next generation. This matters for production planning: teams building libraries of AI video content should factor in re-generation cycles, version their assets, and avoid over-investing in polish for content that will be superseded within months.

Negative prompting (unsupported in Seedance 2.0)

Negative prompting — writing instructions like 'no blur,' 'don't show hands,' or 'avoid text' — is a technique from image generation models that does not work reliably in Seedance 2.0. The model may parse the descriptive noun ('blur,' 'hands,' 'text') and actually increase the presence of the unwanted element. The recommended alternative is positive phrasing: instead of 'no blur,' write 'sharp, in-focus, high clarity.' Instead of 'don't show hands,' frame the shot so hands are naturally out of view. This inversion habit is one of the most impactful prompt-writing adjustments for users coming from Stable Diffusion or Midjourney backgrounds.
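
A simple pre-flight check can catch negative phrasing before a prompt is submitted. The sketch below is a hypothetical lint, not part of any Seedance 2.0 tooling: the word list is an assumption and will miss less common negations.

```python
import re

# Assumed list of negation cues; extend it for your own prompts.
NEGATION_PATTERN = re.compile(
    r"\b(?:do not|don['’]t|no|not|without|avoid|never)\b",
    flags=re.IGNORECASE,
)


def find_negations(prompt):
    """Return the negation cues found in the prompt, lowercased, in order."""
    return [match.group(0).lower() for match in NEGATION_PATTERN.finditer(prompt)]


print(find_negations("No blur, don't show hands, avoid text"))
print(find_negations("sharp, in-focus, high clarity"))
```

Any hit is a prompt to rewrite the clause in positive terms, as described above, rather than a reason to delete it blindly.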

First/last frame mode vs all-round reference mode

These are two distinct reference strategies in Seedance 2.0. First/last frame mode pins the opening and closing frames of a video to specific uploaded images, and the model interpolates the motion between them. All-round reference mode uses multiple reference files — images, video clips, audio — as persistent context throughout the generation without restricting specific frames. First/last frame mode gives tighter control over start and end states, making it ideal for product reveals and transitions. All-round reference mode offers more flexibility for complex scenes where identity, style, and motion references all need to coexist.

Video as software

Video as software is a mental model that treats AI-generated video outputs as updateable, versionable, and iteratable assets rather than one-shot renders. Under this paradigm, a generated clip is not a final deliverable but a draft that can be re-prompted, extended, re-referenced, and remixed — much like source code that gets committed, reviewed, and revised. This shifts production thinking from 'get it right in one take' to 'build, test, iterate,' which better matches the probabilistic nature of generative models and enables faster creative cycles.

Three-layer lighting structure

A structured approach to describing lighting in video prompts that replaces vague instructions like 'add a light' with precise, reproducible descriptions the model can parse reliably. The three layers are: (1) Source Layer — what creates the light, such as sunset backlight, neon signs, candles, or rim light; (2) Behavior Layer — how light interacts with the scene, including dust diffusion, specular highlights, volumetric fog, and lens flare; (3) Color Tone Layer — the overall palette, for example warm gold base with cool blue highlights, or teal-and-orange contrast. A complete lighting instruction might read: 'sunset backlight dark gold plus core ice-blue self-illumination (source), dust diffusion softening contours plus corroded metal specular highlights (behavior), dark gold warm base plus ice-blue cold-warm clash (tone).' This structure originated in the woodfantasy/Seedance2.0-ShotDesign-Skills project and is designed to give consistent, reproducible lighting across shots.
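
The three layers can be kept as separate lists and joined into one instruction in the style of the example above. `render_lighting` is a hypothetical helper, and the "plus" joiner and the layer labels in parentheses simply copy that example; they are not a syntax the model is known to require.

```python
def render_lighting(source, behavior, tone):
    """Join source, behavior, and tone entries into one labeled instruction."""
    layers = [("source", source), ("behavior", behavior), ("tone", tone)]
    parts = []
    for label, entries in layers:
        if not entries:
            raise ValueError(f"lighting layer '{label}' has no entries")
        parts.append(" plus ".join(entries) + f" ({label})")
    return ", ".join(parts)


print(render_lighting(
    source=["sunset backlight", "neon signs"],
    behavior=["dust diffusion", "specular highlights"],
    tone=["warm gold base with cool blue highlights"],
))
```

Reusing the same three lists across shots is one way to keep the lighting description identical from segment to segment.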

Six-element precision assembly

A production-grade prompt formula that ensures every professional video prompt covers six structural dimensions: (1) Subject and Appearance Details — who or what is in the scene and how they look; (2) Action and Physics Continuity — what happens and how motion stays physically plausible; (3) Scene Environment — the setting, time of day, weather, and spatial context; (4) Visual Style and Physical Lighting — the aesthetic treatment and three-layer lighting setup; (5) Physical Focal Length and Camera Movement — lens choice and camera motion (dolly, crane, orbit, etc.); (6) Native Sound Effects — material-specific sounds and spatial acoustic modifiers. This framework extends the five-element SCELA formula by treating sound as a first-class structural element rather than an afterthought. Professional workflows use six-element assembly to ensure no generation dimension is left to model defaults.

Director style presets

Parameterized visual recipes based on known cinematographic signatures that serve as style anchors for video prompt generation. Using a director preset helps the model converge on a consistent visual language faster than listing individual style attributes. Examples include: Nolan (temporal paradox, IMAX scale, practical effects, deep-focus composites), Villeneuve (vast desolation, geometric framing, minimal dialogue, desaturated palette), Wong Kar-wai (melancholic beauty, step-printed slow motion, saturated color, handheld intimacy), and Kurosawa (compositional power, weather as character, wipe transitions, deep staging). The woodfantasy/Seedance2.0-ShotDesign-Skills project catalogs 28+ presets across Hollywood, Asian cinema, genre styles, and social/commercial categories. You can also combine presets — for example, 'Villeneuve composition with Wong Kar-wai color' — to create hybrid visual identities.

Smart multi-segment storyboard

Smart multi-segment storyboard is an automatic splitting strategy for videos that exceed the 15-second generation limit. Each segment gets its own self-contained prompt with timestamps starting from 0, a unified style preamble, consistent three-layer lighting across segments, stable handoff frames at segment boundaries (freeze, slow push, or fade), and a shared forbidden-items list. The splitting rule is ⌈total_duration / 15⌉ segments, with the constraint that the final segment must be at least 8 seconds long. For example, a 40-second video becomes three segments (15s + 15s + 10s), and the 10-second tail already satisfies the 8-second minimum. The narrative arc is distributed across the segments: opening, development, climax, and resolution. This approach keeps the look and tone consistent even though the model generates each segment independently.
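
The splitting rule can be sketched in a few lines. The text above does not say how to handle a tail shorter than 8 seconds, so the rebalancing step here (borrowing seconds from the second-to-last segment) is an assumption, as are the function name `split_segments` and the use of whole seconds.

```python
import math


def split_segments(total_seconds, max_len=15, min_tail=8):
    """Split a duration into ceil(total / max_len) segment lengths.

    Assumption: if the final segment would be shorter than `min_tail`,
    the missing seconds are taken from the segment before it.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    count = math.ceil(total_seconds / max_len)
    tail = total_seconds - max_len * (count - 1)
    durations = [max_len] * (count - 1) + [tail]
    if count > 1 and tail < min_tail:
        durations[-2] -= min_tail - tail
        durations[-1] = min_tail
    return durations


print(split_segments(40))  # [15, 15, 10]
print(split_segments(35))  # [15, 12, 8]
```

A clip shorter than 8 seconds stays as a single segment, since there is no earlier segment to borrow from.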

Quality anchors and anti-filler

A production philosophy that replaces generic filler words — 'masterpiece,' '4K,' '8K,' 'ultra HD,' 'ultra-clear' — with specific physical material descriptions that give the model concrete visual targets. Instead of 'ultra-clear,' write 'Kodak 5219 film stock warmth, brushed aluminum reflections, wet concrete texture.' Instead of 'high quality,' describe the actual surface properties and imperfections you want: film grain type (Fuji Eterna cool shadow), material texture (aged leather, corroded metal), or organic artifacts (lens dust, subtle focus breathing, micro camera shake). Filler words are abstract quality claims the model cannot act on; quality anchors are physical descriptions the model can render. This distinction is one of the most impactful improvements users can make to their prompt-writing practice.
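
Filler terms are easy to flag mechanically. The sketch below is a hypothetical check using the filler words named above; it only detects them and leaves the choice of a concrete quality anchor to the writer.

```python
import re

# Filler terms taken from the examples above; the list is not exhaustive.
FILLER_TERMS = ["masterpiece", "4K", "8K", "ultra HD", "ultra-clear", "high quality"]


def find_filler(prompt):
    """Return the filler terms present in the prompt, in list order."""
    found = []
    for term in FILLER_TERMS:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            found.append(term)
    return found


print(find_filler("masterpiece, 4k, wet concrete texture"))
print(find_filler("Kodak 5219 film stock warmth, brushed aluminum reflections"))
```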

Copyright-safe IP fallback

Copyright-safe IP fallback is a three-tier progressive strategy for handling recognizable intellectual property in video prompts without triggering platform content filters. Level 1 (Name Replacement): swap the recognizable name for an original descriptive nickname (for example, 'Iron Man' becomes 'Alloy Sentinel'). Level 2 (Feature Modification): replace iconic visual traits such as signature color schemes, costume silhouettes, or weapon designs with original alternatives. Level 3 (Category Abstraction): fully abstract the concept so only the narrative role remains, with no visual or nominal connection to the original IP. Each level adds more creative distance from the source material. Always add explicit forbidden items for potential trigger words. Use Level 1 when the character archetype is common enough to avoid filters, and escalate to Level 2 or 3 when platform filters still flag the content.

Camera term disambiguation

A safety practice for avoiding platform content filter false positives when using camera terminology in video prompts. Bare words like 'Dolly,' 'Crane,' 'Pan,' and 'Dutch' can be misinterpreted by safety filters as proper nouns — person names, brand names, or geographic references — rather than cinematographic instructions. The fix is straightforward: in English prompts, always use full compound phrases such as 'dolly tracking shot,' 'crane shot,' 'Dutch angle tilt,' or 'pan left across the skyline' instead of single capitalized words. In Chinese prompts, use the Chinese camera terms exclusively (推轨镜头, 摇臂镜头, 荷兰角) to avoid English-language filter ambiguity entirely. This practice eliminates a common source of unexpected content flags in otherwise professional prompts.
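
For comma-separated prompts, bare camera terms can be expanded automatically. This is a minimal sketch under narrow assumptions: it only rewrites a clause that consists of the bare term alone, the `SAFE_PHRASES` mapping is illustrative, and 'Pan' is left out because it needs a direction that a script cannot guess.

```python
# Illustrative mapping from bare terms to the compound phrases suggested above.
SAFE_PHRASES = {
    "dolly": "dolly tracking shot",
    "crane": "crane shot",
    "dutch": "Dutch angle tilt",
}


def expand_bare_camera_terms(prompt):
    """Replace clauses that are only a bare camera term with a full phrase.

    Clauses are split on commas and re-joined with ", ", so spacing
    around commas is normalized as a side effect.
    """
    clauses = [clause.strip() for clause in prompt.split(",")]
    expanded = [SAFE_PHRASES.get(clause.lower(), clause) for clause in clauses]
    return ", ".join(expanded)


print(expand_bare_camera_terms("Dolly, rainy neon street, Crane"))
```

Terms embedded in longer clauses, such as "Dolly in on the subject", pass through unchanged and still need a manual rewrite.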

Physics-based sound design vocabulary

A structured approach to describing audio in video prompts using material-specific onomatopoeia and spatial acoustic modifiers rather than generic music terms. Sound descriptions are organized into categories: ambient (wind through corridors, rain on metal roof, distant thunder), action (blade cutting air, footsteps on gravel, glass shattering), vocal (whispered dialogue, breath condensing in cold, crowd murmur), and material-based onomatopoeia (silk rustling, metal scraping, wood creaking, water dripping on stone). Spatial modifiers describe the acoustic environment: cathedral reverb, tight room dampening, outdoor open-air, underground echo. This vocabulary works with Seedance 2.0's native audio generation, giving the model physical sound targets instead of vague music requests. Combining material-based sounds with spatial modifiers produces more convincing and immersive audio tracks.

Examples & sources

Mini pattern: keep one hero character stable

Use one to three strong reference images, repeat the same identity phrase in every shot, and avoid adding unrelated style changes until the base look is stable.

Same woman as reference, same cream trench coat, same short black hair, stable face and proportions, walking through a rainy neon street, slow tracking shot, cinematic night lighting.

Mini pattern: extend a good shot instead of rewriting it

If the first clip already has the framing and motion you want, extend it with a short continuation prompt instead of rebuilding the whole scene from scratch.

Continue from the final frame, same woman and same street, camera keeps moving forward slowly, umbrella lifts in the wind, preserve lighting and color palette.

Mini pattern: use multimodal input when text alone is too vague

If the prompt needs both a specific product look and a specific camera mood, combine a clean product image with a short text instruction instead of writing a very long generic paragraph.

Reference image for product shape and branding. Text prompt: premium studio ad, slow arc camera move, soft rim light, shallow depth of field, subtle electronic sound bed.

Mini pattern: replicate a camera move from a reference clip

When you need a specific camera feel that is hard to describe in text (e.g., a particular handheld style or drone path), upload a reference video and tag it as camera guidance.

@video1 as camera reference. Same product as @image1, match the slow push-in and subtle parallax from the reference clip, studio lighting, clean background.

Frequently asked questions

What does character consistency mean in practice?

It means the same person, outfit, and visual identity stay stable across shots. Public workflows usually achieve that with cleaner references, fewer visual variables, and repeated identity instructions instead of one overloaded prompt.

When should I use video extension instead of a new generation?

Use extension when the first clip already has the right framing, motion, or mood and you only need more seconds. Start a new generation when you need a different composition, subject setup, or camera language.

What does multimodal input change compared with text-only prompting?

It lets references carry part of the control. Instead of describing every visual trait in text, you can use images, video, or audio cues to lock style, motion, and timing more reliably.

Why do public tutorials keep talking about reference hygiene?

Because mixed or low-quality references often cause drift. If your source images disagree on face shape, wardrobe, lighting, or camera angle, the result becomes less predictable even with a strong prompt.

What is a good next page after reading the glossary?

If you are trying to write better prompts, move to the prompt library or prompt writing guide. If you are debugging unstable output, move to the troubleshooting page.

What does 2K output mean and when does it matter?

2K refers to the maximum resolution tier in Seedance 2.0, approximately 2048 pixels on the long edge. It matters most for large-screen playback, detail-heavy scenes, or footage that will be cropped in post-production. Not all access surfaces or pricing tiers may expose the full 2K option — verify availability on the platform you use.

How does camera motion replication differ from text-described camera moves?

Text-described camera moves rely on phrases like 'slow dolly in' or 'orbit shot'. Camera motion replication uses an uploaded reference video tagged with @video1 so the model copies the actual movement pattern. Replication is more precise for complex or nuanced motions that are hard to describe in words.

When should I choose image-to-video over text-to-video?

Use image-to-video when you have existing assets (product photos, storyboard frames, character art) and need the output to match that look. Use text-to-video when you want the model to create the full scene from a description. Many production workflows combine both: images lock the visual identity while text controls the motion.

What are @image and @video reference tags used for?

In All-round Reference mode, @mention tags assign a role to each uploaded file. For instance, @image1 can set the first frame, @image2 can lock a character face, @video1 can supply camera motion, and @audio1 can provide music for synchronization. Using clear, consistent references with explicit tags produces more predictable results than relying on text alone.

What is identity drift and how do I prevent it?

Identity drift is the gradual change in a character's appearance across frames or shots. It usually happens when references conflict or the prompt carries too many visual variables. Prevent it by using fewer, higher-quality references, repeating identity-locking phrases, and re-injecting reference images at regular intervals in multi-shot workflows.

Why does Seedance 2.0 not support negative prompts like 'no blur'?

Seedance 2.0 does not have a dedicated negative prompt mechanism. Writing 'no blur' or 'don't show X' can cause the model to focus on the very element you want to avoid. Use positive phrasing instead — for example, 'sharp, in-focus, high clarity' rather than 'no blur.' This is one of the most common adjustments for users migrating from image generation tools.

What is the SCELA framework and when should I use it?

SCELA stands for Subject + Context + Effect + Lighting + Action. It is a structured prompt template that ensures every generation request covers the five dimensions the model needs most. Use it whenever your prompts feel inconsistent or when you want a repeatable formula for production work.

What is the difference between first/last frame mode and all-round reference mode?

First/last frame mode pins specific images to the opening and closing frames and lets the model interpolate between them. All-round reference mode uses multiple files as persistent context without locking specific frames. Choose first/last frame for controlled transitions and product reveals; choose all-round reference for complex scenes with multiple identity, style, and motion requirements.

What does 'video as software' mean for my production workflow?

It means treating AI-generated clips as iteratable drafts rather than final renders. Version your outputs, plan for re-generation cycles as models improve, and design workflows around build-test-iterate loops instead of one-shot perfection. This mindset better matches how generative models actually work.

How does keyframe injection help with long multi-shot sequences?

In long sequences, the model can gradually lose track of the original character appearance. Keyframe injection means re-uploading the original reference images at regular intervals — for example, every 2–3 shots — to reset the model's visual memory. This is especially important for sequences longer than 3 shots where cumulative drift would otherwise become noticeable.

What is three-layer lighting and how does it improve my prompts?

Three-layer lighting replaces vague instructions like 'add dramatic light' with a structured description covering three dimensions: the Source Layer (what creates the light — sunset backlight, neon signs, rim light), the Behavior Layer (how light interacts with the scene — dust diffusion, specular highlights, volumetric fog), and the Color Tone Layer (the overall palette — warm gold base with cool blue highlights). By addressing all three layers, you give the model precise visual targets for lighting instead of leaving it to guess.

What are quality anchors and why should I avoid filler words like '4K' or 'masterpiece'?

Filler words like 'masterpiece,' '4K,' 'ultra HD,' and 'ultra-clear' are abstract quality claims the model cannot act on — they do not describe anything the model can render. Quality anchors replace them with specific physical material descriptions: 'Kodak 5219 film stock warmth,' 'brushed aluminum reflections,' 'wet concrete texture.' These concrete descriptions give the model actual visual targets, producing more consistent and professional-looking results.

How does the copyright-safe IP fallback strategy work?

It is a three-tier system for handling recognizable intellectual property. Level 1 replaces the name only (Iron Man → Alloy Sentinel). Level 2 modifies iconic visual traits like signature colors or costume silhouettes. Level 3 fully abstracts the concept so only the narrative role remains. Each level adds more creative distance from the original IP. Start at Level 1 and escalate if the platform content filter still flags the prompt.

Why do camera terms like 'Dolly' or 'Crane' sometimes trigger content filters?

Platform safety filters can misinterpret bare capitalized camera terms as proper nouns — person names, brand names, or geographic references. The fix is to always use full compound phrases in English prompts: 'dolly tracking shot' instead of 'Dolly,' 'crane shot' instead of 'Crane,' 'Dutch angle tilt' instead of 'Dutch.' In Chinese prompts, using the Chinese camera terms (推轨镜头, 摇臂镜头) avoids this ambiguity entirely.
