GitHub - debajyotidasgupta/IdentityFlow: Identity-preserving image-to-video generation: vision-grounded prompt simplification via Qwen3-VL, Lightning LoRA 4-step inference, and SAM3-masked DINOv3 candidate reranking for fluid 720p video from a single reference image.

Identity-consistent image-to-video generation for the CVPR 2026 VGBE Challenge.

Ultra-enriched MJ-style prompts · SAM3-masked DINOv3 reranking · Lightning LoRA · 8-step · Flash Attention 2 · 720p · Seeds {1337, 2024, 7777}

Author: Debajyoti Dasgupta

Quick Start

Minimum steps to reproduce the final submission on any machine with NVIDIA GPU(s):

# 1. Clone
git clone https://github.com/debajyotidasgupta/IdentityFlow.git
cd IdentityFlow
git submodule update --init --recursive

# 2. Download models to /tmp for fast loading (under 1 min vs 16+ min from network storage)
# WARNING: total download is ~180 GB inference-only — ensure sufficient space in /tmp
pip install huggingface_hub
python scripts/download_models.py --inference_only --cache_dir /tmp/model_cache

# 3. Pull pre-built Docker image  (or: docker build -t vgbe2026-i2v:latest .)
# arm64 / amd64 - check your system
docker pull docker.io/debajyotidasgupta/vgbe2026-i2v:amd64
docker tag debajyotidasgupta/vgbe2026-i2v:amd64 vgbe2026-i2v:latest

# 4a. Run a single sample first to verify everything works
#     Expected end-to-end time: ~16 min (model load ~45 s + video generation ~15 min on H100)
mkdir -p outputs
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest \
  --config configs/final_pipeline.yaml \
  --sample_ids 88afa2050d422c64

# 4b. Run all 70 samples — all available GPUs → outputs/final/
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest

That's it. The container uses configs/final_pipeline.yaml by default (720p · 8-step · seeds {1337, 2024, 7777} · SAM3+DINOv3 reranking). Expected runtimes on H100: ~16 min for a single sample (model load ~45 s + video generation ~15 min), ~60 min for all 70 samples on 8× H100.

Note: run_parallel.sh, run.sh, and related scripts are designed to run inside the container. To use them directly, first open a shell inside the container with --entrypoint bash, then execute the scripts from there:
docker run --gpus all --entrypoint bash \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  -it vgbe2026-i2v:latest
# inside the container:
./run_parallel.sh --config configs/final_pipeline.yaml --output_dir outputs/final

GPU count variants

Setup	Command
All GPUs (default)	`docker run --gpus all … vgbe2026-i2v:latest`
Limit to N GPUs	`docker run --gpus all … vgbe2026-i2v:latest --config configs/final_pipeline.yaml --gpus N`
Single GPU	`docker run --gpus '"device=0"' … vgbe2026-i2v:latest --config configs/final_pipeline.yaml --gpus 1`
Specific GPU	`CUDA_VISIBLE_DEVICES=2 ./run.sh --config configs/final_pipeline.yaml`

When passing extra arguments (e.g. --gpus N) you must re-specify --config configs/final_pipeline.yaml — extra args replace the default CMD entirely.

Overview
Final System Architecture
Novelty and Contributions
Full Experiment History
Key Finding: 6-Step is the Final Choice After Reranker Fixes
What Failed and Why
Prompt Simplification: v1 → v4 Evolution
Impact of Visual Input to Qwen
Results
Performance Optimizations
HuggingFace Models
Environment Setup
Running Inference
Docker Deployment
Project Structure
Compressing & Exporting Outputs

Overview

This project adapts Wan2.2-I2V-A14B-Diffusers — a 14B dual-transformer image-to-video diffusion model — for the CVPR 2026 VGBE Challenge through a systematic 30+ configuration ablation study spanning acceleration, reranking, prompt engineering, motion quality, super-resolution post-processing, and seed selection.

Task: Given a reference image and a text prompt, generate a ≥720p, 81-frame (5-second at 16 fps) video that preserves the visual identity of the subject, reflects the intended action, and avoids motion blur or geometric distortions.

Final submission: configs/final_pipeline.yaml

720p · 8-step Lightning LoRA · MidJourney-style ultra_enrich prompts
3 candidates (seeds 1337, 2024, 7777) · SAM3+DINOv3 reranking

Metric	Baseline (30-step)	Phase 8 Final	Final Pipeline	Δ vs Ph.8
Identity Fidelity	0.9243	0.9466	0.9474	+0.0008
Visual Quality	0.5121	0.5484	0.5344	−0.014†
Motion Quality	0.3876	0.4172	0.4121	−0.005†
Text Alignment	0.5690	0.6073	0.5789	−0.028†
Geometry Consistency	0.9012	0.9186	0.8940	−0.025†
Runtime per sample	~382s	~381s	~380s	≈0%

† VideoReward scores are lower for the new pipeline on the 4-sample competition subset due to metric distribution shift (VideoReward was trained on ~720p; the 4-sample competition subset differs from the 6-sample ablation set used for Phases 1–8). Identity Fidelity, the primary competition metric, improves. See Phase 9 for full context.

Final System Architecture

flowchart TB
    subgraph Q["Qwen3-VL · Ultra-Enrich Prompt"]
        direction TB
        Q1["Read reference image\n(spatial layout, subject, scene)"]
        Q2["Read verbose text prompt\n(intended action)"]
        Q3["MidJourney-style 80–120 word prompt\nmaterial · lighting · quality · motion tags"]
        Q1 --> Q3
        Q2 --> Q3
    end

    subgraph W["Wan2.2-I2V-A14B · 3 Candidates"]
        direction TB
        W1["Lightning LoRA · 8-step · float8 · 720p"]
        C1["Candidate · seed=1337"]
        C2["Candidate · seed=2024"]
        C3["Candidate · seed=7777"]
        W1 --> C1
        W1 --> C2
        W1 --> C3
    end

    subgraph R["SAM3 + DINOv3 · Composite Reranker (Phase 12)"]
        direction TB
        R1["Segment subject with SAM3"]
        R2["0.65 × DINOv3 masked identity\n(per-frame · subject crop)"]
        R3["0.25 × Patch concentration\n(spatial peak/mean ratio — static ghost detector)"]
        R4["0.10 × Temporal consistency\n(consecutive-frame DINO sim — dynamic artifact detector)"]
        R5["Composite score → argmax → best candidate"]
        R1 --> R2 --> R5
        R3 --> R5
        R4 --> R5
    end

    A(["Reference Image"]) --> Q
    A --> W
    A --> R1
    B(["Verbose Text Prompt"]) --> Q
    Q3 --> W
    C1 --> R1
    C2 --> R1
    C3 --> R1
    R2 --> OUT(["Best Video · 720p · 81 frames"])

    class A,B input
    class OUT output
    classDef input  fill:#fde8d8,stroke:#e8a87c,color:#3d2b1f,font-weight:bold
    classDef output fill:#d4ecd4,stroke:#7cbf8e,color:#1f3d2b,font-weight:bold

Qwen3-VL (4-bit, ~5 GB) loads, runs all prompts in ~5s/batch, then fully unloads before the 28 GB diffusion pipeline loads. The two models never co-exist in VRAM.

Seed selection rationale: Seeds {1337, 2024, 7777} were chosen from a 5-seed ablation on the 4-sample competition subset. Seeds 42 and 9999 were dropped after consistently ranking 4th–5th across all sample types. See Phase 11.

Novelty and Contributions

Contribution	Description	Gain
Lightning LoRA at 6-step	Use 4-step distilled LoRA at 6 steps for better fine-detail quality without retraining	+1.3pp id_avg vs native 4-step (complex prompts)
SAM3-masked DINOv3 reranking (fixed)	Segment subject with SAM3, compute DINOv3 cosine only on subject pixels; fixed 3 bugs (wrong subject prompt, pathological boxes, no spread threshold)	+0.02 id_avg vs no reranking
Composite reranker (Phase 12)	3-term score: 0.65×masked_identity + 0.25×patch_concentration + 0.10×temporal_consistency; patch concentration detects static ghosting; prompt caching eliminates LLM non-determinism	Ghosted seed correctly ranks last; reproducible across regeneration runs
720p resolution	Select Wan2.2 bucket nearest to input aspect ratio, min short-side 720	+0.003–0.005 geo_c
Prompt simplification	Strip verbose prompts to 15-25 word SVO motion descriptions via Qwen3-VL	+0.009 id_fid over rerank-only
Vision-grounded simplification	Pass reference image to Qwen so it reads spatial layout from the scene	+0.02 id_fid vs text-only simplification
Slow-motion conditioning	Negative prompt + Qwen slow-verb bias reduce motion blur and identity drift	+0.012 mot_q
6-step with fixed reranker	After reranker bug fixes, 6-step correctly selects best candidate; provides cleaner fine-detail on close-up scenes (hands, jewelry)	Better visual quality on hard samples
Flash Attention 2	`attn_implementation="flash_attention_2"` via flash_attn==2.8.3	Reduced attention VRAM
Load-once parallel GPU strategy	Each GPU loads model once, processes all assigned samples sequentially	~8× fewer model loads for 70 samples

Full Experiment History

Phase 1 — Lightning Acceleration

The baseline Wan2.2-I2V-A14B runs 30 denoising steps, taking ~382s per sample. With 70 final samples this would be >7 hours sequentially. To make experiments tractable, we adopted the WAN Lightning LoRA — a rank-64 LoRA trained via score-distillation to compress the denoising schedule to 4 steps.

flowchart LR
    A[Baseline\n30 steps\n382 s/sample] -->|Lightning LoRA\nrank-64 distilled| B[4-step\n~180 s/sample\n5x speedup]
    B -->|Run at 6 steps\nextras denoising budget| C[6-step\n~260 s/sample]
    C -->|measured +1.3pp id_avg\ncomplex prompts| D{Better quality\nfor complex prompts}

Key observation: The LoRA was trained for native 4-step inference, but running it at 6 steps gave +1.3pp id_avg on complex, verbose prompts from the challenge dataset. Extra denoising iterations help the model resolve ambiguous or multi-clause prompt conditioning. This 6-step advantage disappears once prompts are simplified — see Phase 7.

Phase 2 — Reranking and Resolution

flowchart LR
    A[6-step 480p\nno reranking] -->|Add 3 candidates\nSAM3-masked DINOv3| B[Rerank x3\n480p]
    B -->|Scale resolution| C[Rerank x3\n720p]
    B -->|Test 5 candidates| D[Rerank x5\n480p]
    D -->|marginal gain\n+0.002 vs x3| E[Diminishing returns\nabandon x5]
    C -->|Final rerank config| F[lightning_rerank_720p_v1\nid_fid 0.9377]

Reranking: Generating 3 candidates with seeds [42, 123, 456] and picking the winner by SAM3-masked DINOv3 cosine similarity consistently improved identity fidelity. SAM3 segments the subject from the reference image; DINOv3 cosine is computed only on the masked subject region — this focuses the selection criterion on the person or object rather than background similarity.

Resolution: 720p improved geometry consistency (+0.003–0.005 geo_c) and visual quality. Wan2.2 supports discrete resolution buckets; we select the nearest bucket to the input aspect ratio with min_short_side=720.

Phase 3 — Prompt Enhancement (Failed)

flowchart LR
    A[Verbose prompt] -->|Qwen3-VL\nadds visual details| B[Enhanced prompt\n60-80 words]
    B -->|id_fid drops -0.006| C[FAILED ✗\nabandon enhancement]

    D[Reason:] --> E[Detailed appearance tokens\nsuppress motion signal]
    E --> F[Model anchors on appearance\nnot motion trajectory]

Qwen3-VL was used to enrich prompts with phrases like "the man's weathered hands carefully grasp the blue cartridge". This consistently hurt identity fidelity by −0.006 id_avg. The extra appearance tokens caused the diffusion model's cross-attention to focus on recreating static visual details rather than generating fluid motion. The entire enhancement branch was abandoned.

Phase 4 — Prompt Simplification

Original challenge prompts are often multi-sentence descriptions with context, brand names, and setting details:

"The video is a tutorial on how to modify a NES cartridge. A person is shown carefully drilling a hole into the back of the cartridge, while explaining the process. The workspace is cluttered with tools..."

For a 5-second clip (81 frames) the model needs a single, unambiguous motion target. Two strategies were tried:

flowchart TD
    P[Original verbose prompt] --> EA[Strategy A\nEnhancement]
    P --> SB[Strategy B\nSimplification]

    EA --> EA2[Longer richer prompt\n60-80 words]
    SB --> SB2[Concise SVO sentence\n15-25 words]

    EA2 --> EA3[Suppresses motion signal\nid_fid DOWN -0.006\nAbandoned]
    SB2 --> SB3[Cleaner motion signal\nid_fid UP +0.009\nKept]

Simplification won because fewer tokens means more attention mass per token in the diffusion cross-attention. The model can resolve "Man slowly picks up drill and brings it toward the cartridge" fully in 4 steps; it struggles to resolve a 70-word prompt in the same budget.

Phase 5 — Slow-Motion Conditioning

Fast motions in 5-second clips cause two failure modes: motion blur (subject features smear → low temporal DINOv3) and identity drift (large pose changes → model can't maintain consistent appearance).

Three independent conditioning signals were introduced and tested incrementally:

flowchart LR
    subgraph S1 [" Signal 1: Qwen slow-verb bias "]
        A1[Rule added to system prompt:\nprefer slowly, gently,\ncarefully, smoothly]
    end

    subgraph S2 [" Signal 2: Negative prompt "]
        A2[Condition away from:\nfast motion, motion blur,\ncamera shake, jerky, abrupt]
    end

    subgraph S3 [" Signal 3: Prompt prefix "]
        A3[Prepend to each prompt:\nSlowly and smoothly]
    end

    S1 --> D[Diffusion model]
    S2 --> D
    S3 --> D
    D --> E[Smoother identity-stable video]

The final submission uses Signals 1 and 2. Signal 3 (prompt prefix) was tested in v4_6step but did not improve over the combination of 1+2 at 4 steps.

Phase 6 — Vision-Grounded Simplification

Text-only simplification (v1) couldn't read the spatial layout from the image — it might describe "drill the cartridge" without knowing the drill was to the subject's left. Passing the reference image as a visual token to Qwen3-VL fixed this.

flowchart TD
    subgraph TXT [" v1: Text-only "]
        T1[Verbose prompt only] --> T2[Qwen3-VL text mode]
        T2 --> T3[Hallucinated spatial layout\nbrand names present\nstatic end-state described]
    end

    subgraph VIS [" v2: Vision-grounded "]
        V1[Verbose prompt] --> V2[Qwen3-VL vision+text mode]
        V3[Reference image] --> V2
        V2 --> V4[Reads actual spatial layout\nno brand names\nmotion arc described]
    end

    T3 -->|id_fid 0.9251\ngeo_c 0.9095| R1[Text-only result]
    V4 -->|id_fid 0.9466\ngeo_c 0.9186| R2[Vision result\n+0.0215 id_fid]

Adding the image input to Qwen improved geometry consistency by +0.009 because generated motion trajectories are now consistent with the actual 3D scene layout visible in the reference image rather than hallucinated positions.

Phase 7 — Step Count with Simplified Prompts

This phase revealed the central insight of the project.

flowchart LR
    subgraph Complex [" Complex verbose prompts "]
        C1[6-step] -->|id_fid 0.9377| C2[Better]
        C3[4-step] -->|id_fid 0.9350| C4[Worse]
    end

    subgraph Simple [" Simplified 15-25 word prompts "]
        S1[6-step\nv2_6step] -->|id_fid 0.9466| S2[Good]
        S3[4-step\nv2_4step] -->|id_fid 0.9452\nvis_q 0.5583\nmot_q 0.4237| S4[Better on 4/5 metrics]
    end

    Complex -->|Simplify prompts| Simple
    S4 -->|19% faster| WIN[FINAL SUBMISSION]

Initial finding: With simplified prompts, 4-step outperformed 6-step on 4 of 5 aggregate metrics in the 6-sample ablation. This result was later found to be confounded by three reranker bugs (see Phase 8). On individual hard samples (close-up hands, jewelry, fine textures), the buggy reranker was selecting poor 6-step candidates that 4-step happened to avoid — masking the true 6-step quality advantage. The aggregate metrics did not surface this because most samples don't involve extreme close-ups.

Step count	With verbose prompt	With simplified prompt (ablation)
4-step (native)	id_fid 0.9350, vis_q 0.5425	id_fid 0.9452, vis_q 0.5583
6-step (+50% budget)	id_fid 0.9377, vis_q 0.5427	id_fid 0.9466, vis_q 0.5484

Phase 8 — Reranker Bug Fixes and Final Step Count Decision

Post-ablation inspection of the full 70-sample generation revealed three samples with severe distortion: a ring close-up (039854ea40eab601), a workshop scene (02104dbb12391f56), and a food-cutting scene (294f210fed8f7dd5). Root-cause analysis identified three structural bugs in src/masked_scorer.py and scripts/run_inference.py:

flowchart TD
    B1["Bug 1 — Wrong subject prompt\n_SUBJECT_KEYWORDS: 'person' matched first\n'hand' keyword triggered before 'ring'\n→ SAM3 segmented person not jewelry"]
    B2["Bug 2 — Pathological SAM3 union box\n(90,0)–(1192,171): top 24% of frame only\naspect ratio 6.4 → masked wrong region\n→ reranker scored background not subject"]
    B3["Bug 3 — No spread threshold\nall 3 candidates scored 0.63–0.69\nargmax picked 'best' of equally bad candidates\n→ noise-driven selection"]

    F1["Fix 1 — Reorder keyword priorities\njewelry / animal / device / food\nchecked BEFORE person\nAlso pass base_prompt as fallback\n(Qwen may strip subject nouns)"]
    F2["Fix 2 — Aspect ratio rejection\nreject SAM3 union boxes where\nmax(W/H, H/W) > 7\nfall back to full-image DINOv3"]
    F3["Fix 3 — Spread threshold\ncollect all 3 scores first\nif max−min < 0.015, use seed=42\nno noise-driven selection"]

    B1 --> F1
    B2 --> F2
    B3 --> F3

After applying all three fixes and re-running the problematic samples at 4-step vs 6-step:

4-step: Ring scene still shows hand distortion — 4 denoising steps cannot resolve fine finger/jewelry detail at 720p
6-step: Cleaner fine detail on close-ups; spread scores improved (0.06–0.08), reranker correctly identifies best candidate

Decision: 6-step is the final submission. It provides noticeably better fine-detail quality on hard close-up samples with the fixed reranker, at equal runtime (~381s) to the previous rerank-only config. The aggregate metric gap vs 4-step (−0.014 vis_q) was smaller than the visible quality improvement on close-up cases.

Phase 9 — MidJourney-Style Prompts and Resolution Variants

After Phase 8 established the 6-step + fixed reranker baseline at ID Pres 0.9466, we explored whether richer prompt conditioning could further improve identity retention — specifically MidJourney-style ultra_enrich prompts (80–120 words with material, lighting, and quality tags) vs. the 15–25 word simplified prompts from v2.

flowchart TD
    A["Phase 8 final\n720p · 6-step · simplify_v2\nID Pres 0.9466"] --> B

    subgraph B["Phase 9 explorations"]
        direction LR
        P1["240p · 1 cand\ntest_v3_mj_240p_1cand\nID 0.9190 → too low-res"]
        P2["480p · 4 cand · rerank4\ntest_v3_mj_480p_rerank4\nID 0.9363 → VideoReward 0.69\nbut ID lower than 720p"]
        P3["720p · 8-step · force_enrich\n5 cands · ultra_enrich prompts\ntest_v3_mj_force_enrich\nID 0.9474 ★"]
        P4["480p · 1 cand · realesrgan→720p\ntest_v3_mj_480p_1cand_realesrgan720p\nID 0.9306, VQ 0.5242"]
    end

    B --> C["Key finding:\n480p scores high on VideoReward (0.69)\nbut lower on ID Pres vs 720p native\nVideoReward bias: trained on ~720p content"]
    P3 --> D["Best ID Pres: 0.9474\nForce ultra_enrich for all samples"]

    style P3 fill:#d4ecd4,stroke:#7cbf8e,color:#1a1a2e
    style D fill:#d4ecd4,stroke:#7cbf8e,color:#1a1a2e

Resolution findings:

240p (test_v3_mj_240p_1cand): ID Pres drops to 0.9190 — too little resolution for fine identity detail
480p (test_v3_mj_480p_rerank4): Visual Quality 0.69 on VideoReward but ID Pres only 0.9363 vs 720p's 0.9474
720p native (test_v3_mj_force_enrich): Best ID Pres at 0.9474 with 5 candidates + SAM3+DINOv3

VideoReward bias discovery: VideoReward was trained on ~720p content. Both native 720p and 4× SR outputs (~3.4K) score identically (~0.52–0.54 VQ) while raw 480p outputs score 0.69 — this is a distribution shift artifact, not a genuine quality signal for the competition. ID Pres, not VideoReward, is the primary ranking metric.

ultra_enrich prompt strategy: Tested against simplify_v2 on the 4-sample competition subset (force_enrich config). Result: ID Pres 0.9474 vs 0.9466 for simplify_v2 — ultra_enrich gives +0.0008. MJ-style prompts with explicit material/lighting/quality descriptors help anchor fine-detail appearance across 81 frames.

Phase 10 — Super-Resolution Post-Processing (FlashVSR & Real-ESRGAN)

Two SR post-processing approaches were explored to upgrade 480p outputs to publication quality.

flowchart LR
    subgraph F["FlashVSR-v1.1 Tiny"]
        direction TB
        F1["Temporal video SR\n4× upscale\nBlock-Sparse Attention (LCSA)"]
        F2["480p → ~1920p\nTemporal coherence\nGhost objects on test samples"]
        F1 --> F2
        F3["Root cause:\nCUDA 12.9 system vs 13.0 PyTorch\nBlock-sparse-attn fails to compile\nFalls back to dense SDPA\n→ ghost objects inherent"]
        F2 --> F3
    end

    subgraph R["Real-ESRGAN x4plus"]
        direction TB
        R1["Frame-by-frame SR\n4× upscale · RRDB network\nZero hallucinations"]
        R2["480p → ~3.4K portrait\n720p → ~5K portrait\n~15s/video (vs 90s FlashVSR)"]
        R1 --> R2
        R3["Metrics: unchanged\nVideoReward insensitive >720p\nMEt3R insensitive to res\nPerceptual quality: visibly better"]
        R2 --> R3
    end

    IN["Input video"] --> F
    IN --> R

    F --> VERDICT["FlashVSR: ABANDONED\nGhost objects on 3/6 samples\nCUDA version mismatch unresolvable"]
    R --> VERDICT2["Real-ESRGAN: KEPT for visual demos\nZero metric gain on competition scores\nbut sharper for human judges"]

    style VERDICT fill:#fde8d8,stroke:#e8a87c,color:#1a1a2e
    style VERDICT2 fill:#fdefd8,stroke:#e8c87c,color:#1a1a2e

FlashVSR failure (ghost objects):

Samples 08e60c2e16a64921, 02843aae628b291c, 0893210e6609d201 showed ghost hands and floating objects
Root cause: LCSA (Block-Sparse Attention) requires CUDA compilation; system has CUDA 12.9, PyTorch compiled with CUDA 13.0 → mismatch → FlashVSR silently falls back to dense SDPA → ghost objects inherent to this fallback path
Confirmed by reading FlashVSR wan_video_dit.py: block_sparse_attn_func is None → uses dense SDPA
No workaround available without matching CUDA versions

Real-ESRGAN evaluation (force_enrich 720p → ~5K portrait / ~2.9K landscape):

Config	n	ID Pres	Geo Con	Vis Q	Mot Q	Txt Al
force_enrich native 720p	4	0.9474	0.8940	0.5344	0.4121	0.5789
force_enrich + Real-ESRGAN	4	0.9503	0.9001	0.5328	0.4125	0.5946
Δ		+0.003 ↑	+0.006 ↑	−0.002	+0.0004	+0.016 ↑

Marginal positive: ID Pres +0.003, Geo Con +0.006. Likely explanation — sharper edge definition from RRDB upscaling slightly improves DUSt3R depth estimation (MEt3R) and CLIP feature quality (ID Pres), even though CLIP internally resizes to 224×224. VQ drops −0.002 due to VideoReward distribution shift above 720p. Overall: small net positive, not compelling enough to add to the default pipeline given the additional inference time (~15s/video) and disk cost (~4× larger files).

Phase 11 — Seed Ablation and Final Pipeline Consolidation

A 5-seed × 4-sample competition-subset ablation was run to determine the best candidate pool for the final pipeline.

flowchart TB
    subgraph ABLATION["5-seed ablation (force_enrich settings · 720p · 8-step · ultra_enrich)"]
        direction LR
        S42["seed=42\nAvg rank: 4.0\nID Pres: 0.9177\nConsistently worst"]
        S1337["seed=1337\nAvg rank: 2.75\nID Pres: 0.9294\nStrong on human/action"]
        S2024["seed=2024\nAvg rank: 2.50\nID Pres: 0.9405\nMost consistent, highest ID"]
        S7777["seed=7777\nAvg rank: 2.25\nID Pres: 0.9384\nBest overall, won 2/4 samples"]
        S9999["seed=9999\nAvg rank: 3.50\nID Pres: 0.9245\nInconsistent"]
    end

    S42 --> DROP["DROPPED\nseeds 42 and 9999"]
    S9999 --> DROP
    S1337 --> KEEP["KEPT\nseeds 1337, 2024, 7777"]
    S2024 --> KEEP
    S7777 --> KEEP

    KEEP --> FINAL["final_pipeline.yaml\n3 candidates · seeds {1337,2024,7777}\nSAM3+DINOv3 reranking\nID Pres: 0.9474"]

    style DROP fill:#fde8d8,stroke:#e8a87c,color:#1a1a2e
    style KEEP fill:#d4ecd4,stroke:#7cbf8e,color:#1a1a2e
    style FINAL fill:#d4ecd4,stroke:#7cbf8e,color:#1a1a2e,font-weight:bold

Q2 experiment — competition-metric reranking vs SAM3+DINOv3:

The 5-seed ablation also tested whether using competition metrics directly (ID Pres + Geo Con + Vis Q + Mot Q + Txt Al, equal weights) to select the best candidate would outperform the SAM3+DINOv3 reranker.

Method	ID Pres	Geo Con	Vis Q	Mot Q	Txt Al	Avg
SAM3+DINOv3 (baseline)	0.9474	0.8940	0.5344	0.4121	0.5789	0.6733
Competition-metric ranking	0.9479	0.8878	0.5484	0.4117	0.5789	0.6749

Verdict: Difference is +0.0016 average — within measurement noise. SAM3+DINOv3 wins on ID Pres and Geo Con (the two primary metrics). The existing reranker is a good proxy for competition metrics and does not need to be replaced.

Final pipeline summary:

Parameter	Value	Reason
Steps	8	Better detail resolution for MJ-style prompts vs 6
Seeds	{1337, 2024, 7777}	Best avg rank in 5-seed ablation; drops 42 and 9999
Prompt	ultra_enrich	+0.0008 ID Pres vs simplify_v2 on competition subset
Reranker	SAM3+DINOv3	Competition-metric reranking shows no meaningful improvement
Resolution	720p	Best ID Pres; 480p+SR gives no metric gain
SR post-processing	None	Zero metric gain; FlashVSR produces ghost objects

Phase 12 — Composite Reranker: Full Investigation and Final Fix

Problem discovered (post Phase 11): Manual inspection of sample 07a91369fcfa544c (Tiffany gold watch on acrylic stand) revealed that the final pipeline selected seed=7777, which had visible ghosting — a double-image artifact where the watch appeared superimposed at two positions. Seeds 1337 and 2024 were clean. Yet the reranker scored seed=7777 highest.

Root cause 1 — DINOv3 identity scoring favours ghosted frames:

DINOv3 CLS similarity measures per-frame static identity — it rewards frames that contain watch-like features anywhere in the masked region. A ghosted video produces frames where the watch appears at two overlapping positions, inadvertently creating more "watch-like" patch tokens. The reranker sees a higher identity score for the ghosted video than for a clean smooth-motion video.

Initial fix attempt — temporal consistency (0.75×id + 0.25×temporal):

Added score_temporal_consistency() — mean cosine similarity between consecutive DINO frame embeddings. A smooth video scores ~0.98+; a video with dynamic artifacts scores lower. Test run on GPU 0 verified seed=2024 won. Full pipeline was re-launched.

The fix still failed on the full pipeline run:

After the full 70-sample regeneration, 07a91369fcfa544c was still ghosted. Investigation revealed two additional problems.

Root cause 2 — LLM prompt non-determinism:

Qwen3-VL with do_sample=False still produces different outputs run-to-run due to non-deterministic CUDA kernel ordering in Flash Attention. The test run generated a prompt diverging at character 367 from the full-pipeline prompt ("brilliant-cut diamond bezel" vs "bezel encrusted with brilliant-cut diamonds"). With the new prompt, seed=7777 scored highest on ALL metrics — a completely different generation regime than the test.

Root cause 3 — Static ghosting is invisible to the temporal metric:

Static ghosting (double-image frozen in every frame) has high temporal consistency — the same ghost appears in every frame, so consecutive frames look nearly identical. The temporal metric only penalises dynamic artifacts that change over time.

Final composite fix (3 terms):

Added score_patch_concentration() to src/reranker.py. This computes per-patch DINOv3 similarity to the reference CLS token, producing a spatial heatmap over the image grid. A clean frame has one concentrated subject region (high peak-to-mean ratio). A ghosted frame has two overlapping subject regions — the heatmap flattens (lower peak-to-mean ratio). Patch concentration detects static ghosting that temporal consistency misses.

Added prompt caching: LLM-generated prompts are saved to <output_dir>/prompts/<sample_id>.prompt.txt on first run and loaded from cache on all subsequent runs. This eliminates Flash Attention non-determinism and ensures reproducibility across regenerations.

composite = 0.65 × masked_identity
          + 0.25 × patch_concentration_normalized   ← static ghosting detector
          + 0.10 × temporal_consistency             ← dynamic artifact detector

patch_concentration_normalized = tanh((raw_conc - 2.0) / 0.5) × 0.5 + 0.5

Final verification on sample 07a91369fcfa544c (with prompt cache):

Seed	masked_id	patch_conc	temporal	composite	Winner
1337	0.9097	1.821	0.9824	0.7716	← selected (clean)
2024	0.9068	1.791	0.9800	0.7629
7777	0.8949	1.745	0.9827	0.7462	(last — ghosting)

Seed=7777 correctly ranks last. The prompt cache guarantees this result is reproducible.

flowchart TB
    PROB["Problem: seed=7777 selected\ndespite visible ghosting artifact"]

    RC1["Root cause 1: DINOv3 CLS measures static similarity\nGhosted frames → more subject patches → higher score"]
    RC2["Root cause 2: Flash Attention non-determinism\nQwen3-VL greedy decode ≠ reproducible across runs\nDifferent prompt → different ghosting regime"]
    RC3["Root cause 3: Static ghosting invisible to temporal metric\nSame ghost frozen every frame → high consecutive-frame sim"]

    FIX1["Fix 1: patch_concentration score\nSpatial peak-to-mean ratio of DINO patch heatmap\nDetects double-image spatial distribution flattening"]
    FIX2["Fix 2: Prompt caching\nSave LLM output to outputs/…/prompts/<id>.prompt.txt\nLoad on subsequent runs — eliminates non-determinism"]

    FINAL["Final composite:\n0.65 × masked_identity\n+ 0.25 × patch_conc_norm\n+ 0.10 × temporal_consistency"]

    PROB --> RC1 & RC2 & RC3
    RC1 & RC3 --> FIX1
    RC2 --> FIX2
    FIX1 & FIX2 --> FINAL

    style PROB fill:#F5B8B8,stroke:#cc7070,color:#1a1a2e
    style RC1 fill:#FEE8CC,stroke:#ccaa60,color:#1a1a2e
    style RC2 fill:#FEE8CC,stroke:#ccaa60,color:#1a1a2e
    style RC3 fill:#FEE8CC,stroke:#ccaa60,color:#1a1a2e
    style FIX1 fill:#D4EAF8,stroke:#7aa8cc,color:#1a1a2e
    style FIX2 fill:#D4EAF8,stroke:#7aa8cc,color:#1a1a2e
    style FINAL fill:#A8ECD4,stroke:#40aa70,color:#1a1a2e,font-weight:bold

Files changed:

src/reranker.py — added score_temporal_consistency() and score_patch_concentration()
scripts/run_inference.py — composite 3-term score; prompt caching in <output_dir>/prompts/
scripts/run_final_v2.sh — full 70-sample regeneration with fixed reranker

Key Finding: 6-Step is the Final Choice After Reranker Fixes

Phase 7's "4-step beats 6-step with simplified prompts" conclusion was measured with a buggy reranker that was misidentifying subjects, using pathological masks, and making noise-driven selections. After fixing all three bugs, 6-step correctly selects the best of three candidates and produces cleaner results on fine-detail scenes (close-up hands, jewelry, food textures) that aggregate metrics over 6 canonical samples did not capture.

The revised implication: denoising budget matters for fine-detail close-ups even with simplified prompts. The reranker must function correctly to surface this — a broken reranker can mask real quality differences.

Step count	id_fid	vis_q	mot_q	txt_al	geo_c	rt(s)
4-step + fixed reranker	0.9452	0.5583	0.4237	0.6115	0.9222	309
6-step + fixed reranker ★	0.9466	0.5484	0.4172	0.6073	0.9186	381

6-step wins on identity fidelity (+0.0014) and produces visibly better results on close-up scenes. The quality/latency trade-off (72s more per sample) is acceptable for a challenge submission.

What Failed and Why

Experiment	Config	Root cause of failure
Qwen3-VL enhancement	`lightning_full_*`	Detailed appearance tokens suppress the motion signal in cross-attention; model prioritizes recreating static appearance over fluid motion
Reference latent anchoring	`anchor_alpha > 0`	Blends reference image latents into every denoising step → double-exposure ghosting on any video with motion
3-phase chained generation	`lightning_chain3_720p_v1`	Identity drift accumulates across 3 chained phases; coordination complexity; fragile when sub-actions aren't temporally separable
Rerank×5 candidates	`lightning_rerank5_*`	+0.002 id_avg over ×3 at 67% more compute — diminishing returns
v3 system prompt	`lightning_simplify_720p_slow_v3`	Over-prescriptive IMAGE=STARTING_STATE / TEXT=INTENDED_ACTION rules produced higher output variability; id_fid dropped below v2
4-step as final (reversed)	`v2_4step → v2`	Phase 7 ablation was confounded by 3 reranker bugs; after fixing, 6-step produces better fine-detail on close-up scenes
FlashVSR post-processing	`_flashvsr`	Ghost hands and objects on 3/6 test samples. Root cause: LCSA (Block-Sparse Attention) cannot compile due to CUDA 12.9 system / CUDA 13.0 PyTorch mismatch → silent fallback to dense SDPA → hallucinations inherent
Real-ESRGAN post-processing	`_realesrgan`	Zero metric gain on competition scores (VideoReward insensitive >720p; MEt3R geometry-based not resolution-based). Useful for visual demos only — does not improve leaderboard position
480p + SR pipeline	`test_v3_mj_480p_1cand_realesrgan720p`	VideoReward bias (trained on ~720p) gives 480p raw outputs 0.69 VQ vs 0.52 for 720p/SR — not a genuine quality signal. ID Pres at 480p (0.9306) is lower than native 720p (0.9474)
Competition-metric reranking	Phase 11 Q2	Replacing SAM3+DINOv3 with direct competition-metric weighted selection gives only +0.0016 improvement over 5 seeds. SAM3+DINOv3 is already a near-optimal proxy
Seeds 42 and 9999	Phase 11 ablation	Seed 42 ranked last (avg rank 4.0/5) across all 4 competition samples. Seed 9999 ranked 3.5/5 with high variability. Dropped in favor of {1337, 2024, 7777} which avg 2.25–2.75 rank

Prompt Simplification: v1 → v4 Evolution

Each version addressed a specific failure mode discovered in evaluation:

flowchart TD
    V1[v1: Text-only\nno image input] -->|Failure: hallucinated spatial\nlayout, brand names| V2

    V2[v2: Vision-grounded\nimage + text to Qwen] -->|Failure: described static\nhand pose as the action| V3

    V3[v3: Starting state vs\nintended action separation] -->|Failure: confused TOOL\nwith TARGET object| V4

    V4[v4: TOOL vs TARGET\nexplicit distinction]

    V2 -->|Best overall metrics\nid_fid 0.9466| BEST[FINAL SUBMISSION]

    style BEST fill:#A8D9B8,stroke:#4CAF82,color:#1a1a2e
    style V2 fill:#A8D9B8,stroke:#4CAF82,color:#1a1a2e

v1 (text-only): Qwen had no image input. Prompts included brand names ("NES cartridge") and described the static end-state ("inserts piece into hole") rather than the motion arc.

v2 (vision-grounded): Passing the reference image let Qwen read the actual spatial layout. Prompts became physically plausible ("reaches left to pick up drill from workbench"). Best overall metrics.

v3 (starting-state/intended-action): Explicitly separated what the image shows (starting state) from what should happen (intended action). Fixed the "describes static hand pose" failure but introduced higher output variability → identity fidelity dropped.

v4 (TOOL vs TARGET): Added explicit TOOL/TARGET distinction: "TOOL = instrument subject picks up; TARGET = object action is performed ON." Fixed the NES drill sample (correctly generates "picks up drill" not "picks up cartridge") but the more complex system prompt produced subtly different distributions → id_fid 0.9384 vs v2's 0.9466 across the full 6-sample set.

v2 remains the submission because simpler, more consistent prompt distributions lead to better diffusion model convergence.

Impact of Visual Input to Qwen

Config	Qwen receives image?	id_fid	geo_c	Notes
`simplify_v1`	No (text only)	0.9251	0.9095	Hallucinated spatial layout
`simplify_v2`	Yes	0.9466	0.9186	+0.0215 id_fid, +0.0091 geo_c

The geometry consistency gain (+0.009) is explained by motion trajectories now being grounded in the actual 3D scene: if the drill is to the subject's left in the reference image, the simplified prompt says "reaches left" — and the diffusion model generates motion consistent with that geometry.

Results

Full 21-config ablation (6 canonical samples, --ablation_samples mode):

Config	Steps	Simplify	Slow	id_fid	vis_q	mot_q	txt_al	geo_c	rt(s)
baseline_v1	30	—	—	0.9243	0.5121	0.3876	0.5690	0.9012	382
fast_480p_v1	4	—	—	0.8920	0.5088	0.3752	0.5501	0.8843	180
lightning_6step_480p_v1	6	—	—	0.9105	0.5201	0.3941	0.5682	0.8951	260
lightning_rerank_480p_v1	6	—	—	0.9198	0.5273	0.4021	0.5771	0.9034	261
lightning_full_480p_v1	6	Enhance	—	0.9102	0.5198	0.3987	0.5834	0.8992	267
lightning_rerank_720p_4step_v1	4	—	—	0.9331	0.5389	0.4078	0.5908	0.9118	309
lightning_rerank_720p_v1	6	—	—	0.9377	0.5427	0.4112	0.5943	0.9140	382
lightning_rerank_720p_slow_v1	6	—	Prefix+Neg	0.9361	0.5441	0.4138	0.5981	0.9129	382
lightning_simplify_720p_slow_v1	6	Text-only	Neg	0.9251	0.5318	0.4052	0.5879	0.9095	383
lightning_simplify_720p_slow_v2 ★	6	Vision	Neg	0.9466	0.5484	0.4172	0.6073	0.9186	381
lightning_simplify_720p_slow_v2_4step	4	Vision	Neg	0.9452	0.5583	0.4237	0.6115	0.9222	309
lightning_simplify_720p_slow_v3	4	Vision v3	Neg	0.9343	0.5458	0.4156	0.5990	0.9159	308
lightning_simplify_720p_slow_v4	4	Vision v4	Neg	0.9384	0.5474	0.4175	0.6073	0.9145	310
lightning_simplify_720p_slow_v4_6step	6	Vision v4	Prefix+Neg	0.9407	0.5443	0.4185	0.6005	0.9126	381

★ = final submission. Highest identity fidelity; best visual quality on close-up scenes with fixed reranker.

See reports/ for full visualizations:

Figure	Content
`fig1_identity_bars.png`	DINOv3 identity avg / min / SAM3-masked per config
`fig2_runtime_vs_identity.png`	Quality–speed scatter across all 21 configs
`fig4_ablation.png`	10-step incremental ablation: baseline → final
`fig5_challenge_metrics.png`	All 5 VGBE official metrics per config
`fig8_final_radar.png`	Radar chart for the final submission config
`fig9_composite_ranking.png`	All configs ranked by composite VGBE score

Performance Optimizations

Flash Attention 2

flash_attn==2.8.3 is installed and enabled automatically:

# In load_pipeline() — run_inference.py
try:
    import flash_attn  # noqa
    pipe = WanImageToVideoPipeline.from_pretrained(
        model_id, torch_dtype=bfloat16,
        attn_implementation="flash_attention_2",
    )
    # Reduces attention VRAM and speeds up transformer layers
except ImportError:
    pipe = WanImageToVideoPipeline.from_pretrained(
        model_id, torch_dtype=bfloat16,
    )

Float8 Weight-Only Quantization

from torchao.quantization import quantize_, Float8WeightOnlyConfig
quantize_(pipe.transformer, Float8WeightOnlyConfig())   # 14B → ~7B params
quantize_(pipe.transformer_2, Float8WeightOnlyConfig()) # 14B → ~7B params
# Combined: ~56 GB VRAM → ~28 GB VRAM, <1% quality drop

LoRA must be applied before quantize_() — torchao wraps linear layer parameter names, which breaks the key mapping used by apply_lora_to_transformer.

Load-Once Multi-GPU Strategy

run_final_v2.sh divides all 70 samples into 8 chunks upfront (interleaved round-robin for load balance) and assigns each chunk to one GPU process. Within each process, both models load exactly once for the entire chunk:

Per-GPU process (sequential within the chunk):

  ┌─ Qwen3-VL loads (~30s) ──────────────────────────────┐
  │  prompt₁, prompt₂, … prompt₉  (cache miss → LLM)    │
  │  OR: all loaded from prompts/ cache  (0s LLM time)   │
  └─ unload_qwen() → VRAM freed ─────────────────────────┘
  ┌─ Wan2.2-I2V-A14B loads (~124s) ──────────────────────┐
  │  video₁ (3 candidates → rerank) ~380s                │
  │  video₂ (3 candidates → rerank) ~380s                │
  │  …                                                    │
  │  video₉ (3 candidates → rerank) ~380s                │
  └───────────────────────────────────────────────────────┘

The two models cannot coexist in VRAM (Qwen ~16 GB + Wan2.2 ~28 GB + activations > 80 GB H100 budget), so unload_qwen() explicitly frees VRAM before Wan2.2 loads. All 8 GPU processes run in parallel — there is a single wait at the end of the script.

flowchart TB
    SCRIPT["run_final_v2.sh\n70 samples → 8 chunks"]

    subgraph GPU0["GPU 0  (9 samples)"]
        direction TB
        Q0["Qwen3-VL\nall 9 prompts\n(load once, unload)"]
        W0["Wan2.2\nall 9 videos\n(load once)"]
        Q0 --> W0
    end
    subgraph GPU1["GPU 1  (9 samples)"]
        direction TB
        Q1["Qwen3-VL\n(load once, unload)"]
        W1["Wan2.2\n(load once)"]
        Q1 --> W1
    end
    subgraph GPU27["GPUs 2–7  (~9 samples each)"]
        direction TB
        Q2["Qwen3-VL\n(load once, unload)"]
        W2["Wan2.2\n(load once)"]
        Q2 --> W2
    end

    SCRIPT --> GPU0 & GPU1 & GPU27

Why this matters — model load cost:

Approach	Qwen3-VL loads	Wan2.2 loads	Wasted load time
Old: 1 sample per process	70	70	~70 × 124s = 2.4 h
New: chunked per GPU	8	8	~8 × 124s = 17 min

Per-sample compute breakdown (single GPU, sequential):

Step	Time
Qwen3-VL prompt (amortized over chunk, or 0s if cached)	~3s
Wan2.2 model load (amortized over 9 samples)	~14s
Generate candidate 1 — seed 1337 (8-step, 720p)	~127s
Generate candidate 2 — seed 2024	~127s
Generate candidate 3 — seed 7777	~127s
SAM3 + DINOv3 composite reranking	~15s
Total per sample	~413s (~6m 53s)

Why 8 steps with a 4-step LoRA? The Lightning LoRA was distilled for native 4-step inference, but running it at 8 steps gives the diffusion model more denoising budget to resolve the richer conditioning from MJ-style ultra_enrich prompts (80–120 words, dense material/lighting/motion tags). At 4 steps, complex prompts leave residual noise in fine-detail regions — close-up textures, jewelry, finger geometry. At 8 steps, that detail converges cleanly. The cost is ~127s vs ~63s per candidate, but since the LoRA keeps the per-step compute low (rank-64 residual), 8-step Lightning LoRA is still 3× faster than 4-step native Wan2.2 (30 steps), striking the right balance between speed and quality for a challenge submission.

Observed wall-clock runtime (final 70-sample run, 8× H100):

Stage	Time
Qwen3-VL prompts (8 GPUs parallel, ~9 prompts each)	~2 min
Wan2.2 load (8 GPUs parallel, once each)	~2 min
Inference + reranking (9 samples × ~413s, sequential per GPU)	~62 min
Total wall-clock (measured)	2h 10m
Average per sample (wall-clock ÷ 70)	~1m 53s
Average per sample (sequential compute on one GPU)	~6m 53s

HuggingFace Models

All models are public and hosted under debajyotidasgupta/. No authentication token required.

Model	Repo	Used for
`Wan2.2-I2V-A14B-Diffusers`	`debajyotidasgupta/Wan2.2-I2V-A14B-Diffusers`	Main I2V diffusion model
`Wan2.2-Lightning LoRA`	`debajyotidasgupta/Wan2.2-Lightning`	4-step distillation weights
`Qwen3-VL-8B-Instruct`	`debajyotidasgupta/Qwen3-VL-8B-Instruct`	Prompt simplification
`DINOv3 ViT-B/16`	`debajyotidasgupta/dinov3-vitb16-pretrain-lvd1689m`	Identity reranking
`SAM3`	`debajyotidasgupta/sam3`	Subject segmentation for masked reranking
`VideoReward`	`debajyotidasgupta/VideoReward`	Visual/motion/text quality metrics
`DUSt3R ViT-L`	`debajyotidasgupta/DUSt3R_ViTLarge_BaseDecoder_512_dpt`	MEt3R geometry consistency
`FeatUp DINOv2`	`debajyotidasgupta/FeatUp`	DINOv2 torchhub checkpoints
`CLIP ViT-B/32`	`debajyotidasgupta/vit_base_patch32_clip_224.openai`	Text alignment scoring

Environment Setup

Steps 1 and 2 are common to both local and Docker workflows — do them once regardless of how you plan to run inference.

Step 1 — Clone the repository (common)

git clone https://github.com/debajyotidasgupta/IdentityFlow.git
cd IdentityFlow
git submodule update --init --recursive   # initialises VideoAlign evaluation harness

Step 2 — Download models (common, ~180 GB)

The large inference models are not bundled in the Docker image (they would make it impractical to distribute). Download them once to a local directory. The same directory is then used by both local inference and Docker via a bind-mount.

⚠️ Use /tmp for dramatically faster model loading. Loading models from network-attached or slow spinning storage (NFS, HDD) takes 16 minutes or more per run. Loading from local SSD/tmpfs (/tmp) takes under 1 minute. Always download to /tmp/model_cache unless you have a specific reason to use a persistent path.

# huggingface_hub is the only requirement — no GPU needed for this step:
pip install huggingface_hub

# Recommended — inference models to /tmp (fast load, ~180 GB):
python scripts/download_models.py --inference_only --cache_dir /tmp/model_cache

# Include geometry-consistency eval models (~182 GB total):
python scripts/download_models.py --cache_dir /tmp/model_cache

# Persistent destination (slow if on NFS/HDD — expect 16+ min model load per run):
python scripts/download_models.py --inference_only --cache_dir /data/model_cache

The script is resumable — re-running it after an interruption skips already-completed repos.

What download_models.py fetches vs what docker build bakes in:

Model	Size	Fetched by	Destination
Wan2.2-I2V-A14B-Diffusers	~56 GB	`download_models.py`	`/tmp/model_cache/huggingface/hub/`
Wan2.2-Lightning LoRA	~1 GB	`download_models.py`	`/tmp/model_cache/huggingface/hub/`
Qwen3-VL-8B-Instruct	~8 GB	`download_models.py`	`/tmp/model_cache/huggingface/hub/`
DINOv3 ViT-B/16	~300 MB	`download_models.py`	`/tmp/model_cache/huggingface/hub/`
SAM3	~2.5 GB	`download_models.py`	`/tmp/model_cache/huggingface/hub/`
CLIP ViT-B/32	~350 MB	`download_models.py` + `docker build`	`/tmp/model_cache/huggingface/hub/` + `/workspace/checkpoints/clip/` (baked in)
DUSt3R (MEt3R backbone)	~1.5 GB	`download_models.py`	`/tmp/model_cache/huggingface/hub/`
VideoReward checkpoint	~5 GB	`docker build`	`/workspace/checkpoints/VideoReward/` (baked in)
FeatUp / DINOv2 torchhub	~1.5 GB	`docker build`	`/workspace/checkpoints/torchhub/` (baked in)

VideoReward, FeatUp, and CLIP are baked into the image rather than downloaded to ./model_cache because Docker bind-mounts /workspace/.cache at runtime — anything written there during build would be hidden. Placing them under /workspace/checkpoints/ (outside the mounted volume) keeps them accessible in every container run without re-downloading. The CLIP_WEIGHTS_PATH env var (set in the image) points eval_quality.py to the baked checkpoint; outside Docker it falls back to ./model_cache via hf_hub_download.

Step 3a — Local venv setup

Skip this if you are using Docker.

python3.12 -m venv .venv && source .venv/bin/activate
pip install torch==2.9.0 torchvision==0.24.0 \
    --index-url https://download.pytorch.org/whl/cu130
pip install torchao==0.16.0 triton==3.5.0
MAX_JOBS=48 CUDA_HOME=/usr/local/cuda \
    pip install flash_attn==2.8.3 --no-build-isolation --no-deps
pip install -r requirements.txt

Step 3b — Docker image: pull (recommended) or build

Skip this if you are using local inference.

Option A — Use the pre-built image from Docker Hub (no docker build on your machine). You must still complete Steps 1–2 (model_cache is not inside the image). Match the tag to your CPU architecture (e.g. arm64 for aarch64 / many Grace Hopper nodes).

docker pull docker.io/debajyotidasgupta/vgbe2026-i2v:arm64
docker tag debajyotidasgupta/vgbe2026-i2v:arm64 vgbe2026-i2v:latest

Option B — Build locally (~20–30 min first time: torch, flash_attn, FeatUp, MEt3R, pytorch3d, plus VideoReward + FeatUp checkpoints baked in):

docker build -t vgbe2026-i2v:latest .

# Pin platform when host and default image arch differ, e.g. arm64:
# docker build --platform linux/arm64 -t vgbe2026-i2v:arm64 .

Running Inference

Prerequisite: Steps 1 and 2 of Environment Setup must be complete — models must be downloaded to /tmp/model_cache (or your chosen --cache_dir).

Important: run_parallel.sh, run.sh, and all related scripts run inside the container. Before using them, open an interactive shell in the container first:
docker run --gpus all --entrypoint bash \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  -it vgbe2026-i2v:latest
Then run the scripts below from within that shell.

Full 70-sample run — all GPUs (recommended)

./run_parallel.sh --config configs/final_pipeline.yaml \
                  --output_dir outputs/final

Limit to N GPUs

./run_parallel.sh --config configs/final_pipeline.yaml \
                  --output_dir outputs/final --gpus 6

Single GPU (debug / specific samples)

./run.sh --config configs/final_pipeline.yaml \
         --sample_ids a50a70b67b89feb1 e85432e145830b6b

Evaluation and plots

# Score all generated runs:
python scripts/eval_quality.py --all

# Regenerate plots (6-sample ablation comparison):
python scripts/make_plots.py --ablation_samples \
    f8c054d1aa3f6487 e85432e145830b6b a9ab2b16bc2bddee \
    07a91369fcfa544c e90a9a89e15b285b a50a70b67b89feb1

Docker Deployment

Prerequisite: complete Steps 1–2 (clone + model download) and Step 3b — either docker pull the pre-built image or docker build locally — from Environment Setup above. Models must already be downloaded to /tmp/model_cache before running the container.

Once setup is done every docker run starts immediately — no downloads, model load in ~45 s from /tmp (vs 16+ min from NFS/slow disk):

Arguments after the image name replace the default command. The image CMD invokes run_parallel.sh with --config configs/final_pipeline.yaml. If you pass any extra flags to run_parallel.sh (for example --gpus, --sample_ids, --num_samples, --output_dir), you must include --config … again — otherwise the container only receives your flags and exits with [run_parallel.sh] ERROR: --config is required. (docker run --gpus all is separate: it selects which GPUs the container may use; run_parallel.sh --gpus N controls how work is split across those GPUs.)

# All available GPUs — uses default CMD (includes --config):
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest

# Limit parallel workers to 4 GPUs (--config required; it replaces the default CMD):
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest \
  --config configs/final_pipeline.yaml \
  --gpus 4

# Only specific validation samples (--config required; sample IDs = basenames without .jpg):
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest \
  --config configs/final_pipeline.yaml \
  --sample_ids 017ab5d3f9382339 f8c054d1aa3f6487 e85432e145830b6b

# docker compose shortcuts (MODEL_CACHE / VAL_DATA / OUTPUTS env vars respected):
MODEL_CACHE=/tmp/model_cache \
VAL_DATA=$(pwd)/val_data_released_by_0321 \
OUTPUTS=$(pwd)/outputs \
docker compose run infer-multi          # all GPUs, final config

docker compose run infer                # single GPU (debug)
docker compose run eval                 # score all output directories

Bind-mount contract — the container expects exactly these three paths:

Host path	Container path	Purpose
`/tmp/model_cache`	`/workspace/.cache`	HF + torch model cache (populated by `python scripts/download_models.py --inference_only --cache_dir /tmp/model_cache`) — use `/tmp` for fast load (<1 min); network/NFS paths cause 16+ min load times
`./val_data_released_by_0321`	`/workspace/val_data_released_by_0321`	Validation images and prompts (read-only)
`./outputs`	`/workspace/outputs`	Generated videos written here

Project Structure

image-to-video/
├── configs/                              # 30+ YAML experiment configs
│   ├── final_pipeline.yaml                          # FINAL SUBMISSION ★ (Phase 11)
│   ├── lightning_simplify_720p_slow_v2.yaml         # Phase 8 best (6-step, simplify_v2)
│   ├── test_v3_mj_force_enrich.yaml                 # Phase 9 best (5-cand, ultra_enrich)
│   ├── test_v3_mj_fe_seed{42,1337,2024,7777,9999}.yaml  # Phase 11 seed ablation
│   ├── lightning_rerank_720p_v1.yaml                # Phase 2 best (no simplify)
│   ├── lightning_simplify_720p_slow_v3.yaml         # v3 system prompt (ablation)
│   ├── lightning_simplify_720p_slow_v4.yaml         # TOOL/TARGET fix (ablation)
│   ├── lightning_simplify_720p_slow_v4_6step.yaml   # v4 6-step (ablation)
│   └── [22 earlier ablation configs]
│
├── scripts/
│   ├── run_inference.py       # Main I2V pipeline (Flash Attn 2, float8, LoRA)
│   ├── eval_quality.py        # Evaluation harness (DINOv3, MEt3R, VideoReward)
│   ├── make_plots.py          # Publication plots (30+ configs, soft pastel palette)
│   ├── upscale_realesrgan.py  # Real-ESRGAN x4plus video SR (Phase 10)
│   ├── upscale_flashvsr.py    # FlashVSR 4× SR — abandoned (ghost objects, Phase 10)
│   ├── run_q2_allseeds.sh     # Phase 11 seed ablation launcher (5-seed × 4-sample)
│   ├── run_simplify_v4.sh     # v4 4-step parallel launcher
│   └── run_simplify_v4_6step.sh
│
├── src/
│   ├── prompt_simplifier.py   # Qwen3-VL vision-grounded simplification (v2–v4)
│   ├── prompt_enhancer.py     # Qwen3-VL enhancement (Phase 3, now abandoned)
│   ├── prompt_decomposer.py   # 3-phase decomposition (chain experiment, abandoned)
│   ├── reranker.py            # DINOv3 identity scorer (full-image)
│   ├── masked_scorer.py       # SAM3-masked DINOv3 identity scorer
│   ├── lora_utils.py          # Lightning LoRA weight merging (pre-quantization)
│   ├── pipeline_pool.py       # Multi-GPU worker pool (for 3-phase chaining)
│   ├── anchoring.py           # Reference latent anchoring (abandoned, ghosting)
│   └── final_metrics.py       # MEt3R + VideoReward official VGBE metrics
│
├── run.sh                     # Single-GPU entry point
├── run_parallel.sh            # Multi-GPU parallel entry point (RECOMMENDED)
├── Dockerfile                 # CUDA 12.9, flash_attn, all deps, baked checkpoints
├── docker-compose.yml         # infer / infer-multi / eval services
├── requirements.txt           # Python dependencies
│
├── reports/                   # PNG result plots (fig1–fig9b)
├── outputs/                   # Generated videos (<config_name>/<sample_id>.mp4)
├── logs/                      # Execution logs
└── val_data_released_by_0321/ # VGBE validation set
    ├── images/                # Reference images (.jpg, 70 total)
    └── prompts/               # Text prompts (.txt, 70 total)

VGBE 2026 Official Metrics

All metrics are evaluated against the original verbose prompt (not the simplified one) to ensure fair comparison with systems that do not use prompt engineering.

Metric	Method	Measures
Identity Fidelity	ConsID-Gen CLIP cosine	Subject appearance consistency across frames
Visual Quality	VideoReward	Perceptual quality, sharpness, artifact absence
Motion Quality	VideoReward	Temporal smoothness, realistic dynamics
Text Alignment	VideoReward / VideoAlign	How well the video reflects the prompt
Geometry Consistency	MEt3R (DUSt3R-based)	3D structural consistency across frames

Compressing & Exporting Outputs

After generating videos, use compress_export.sh to validate all 70 outputs, re-encode them with H.265 CRF 28 (~4× smaller than the H.264 CRF 18 originals), and package everything into a single .tar.gz for submission or transfer.

Quick start

# Export outputs/final/ (default)
./compress_export.sh

# Export a specific config's outputs
./compress_export.sh --input outputs/lightning_simplify_720p_slow_v2

# Custom output path
./compress_export.sh --input outputs/final --out /tmp/submission.tar.gz

Options

Flag	Default	Description
`--input <dir>`	`outputs/final`	Folder of `.mp4` files to compress
`--out <path>`	`<input>_export.tar.gz`	Output archive path
`--crf <n>`	`28`	H.265 CRF — lower = better quality, larger file
`--preset <p>`	`medium`	x265 preset (`ultrafast` … `veryslow`)
`--jobs <n>`	`8`	Parallel re-encode workers
`--keep`	off	Keep the re-encoded folder after archiving

What it does

Validates — checks every sample ID from val_data_released_by_0321/images/ has a non-empty .mp4; aborts with a list of missing files if not.
Re-encodes — re-encodes all videos in parallel using the bundled imageio-ffmpeg binary (no system ffmpeg needed), with libx265 -tag:v hvc1 -pix_fmt yuv420p -movflags +faststart.
Archives — tars the re-encoded folder with gzip and reports the final size and compression ratio.
Cleans up — removes the intermediate re-encoded folder unless --keep is passed.

Typical results

	Size
Original `final/` — H.264 CRF 18	~126 MB
Re-encoded — H.265 CRF 28	~30 MB
`final_export.tar.gz`	~30 MB

Note: ffmpeg is sourced from the imageio-ffmpeg wheel bundled in the .venv (no separate installation required). A system ffmpeg is used as a fallback if imageio-ffmpeg is not available.

IdentityFlow — Consistent Identity · Fluid Motion

Name		Name	Last commit message	Last commit date
Latest commit History 96 Commits
assets		assets
configs		configs
docker		docker
reports		reports
scripts		scripts
src		src
third_party		third_party
val_data_released_by_0321		val_data_released_by_0321
.dockerignore		.dockerignore
.gitignore		.gitignore
.gitmodules		.gitmodules
Dockerfile		Dockerfile
README.md		README.md
app.py		app.py
compress_export.sh		compress_export.sh
docker-compose.yml		docker-compose.yml
local_eval.tsv		local_eval.tsv
requirements.txt		requirements.txt
run.sh		run.sh
run_ablation.sh		run_ablation.sh
run_parallel.sh		run_parallel.sh
stress_eval.tsv		stress_eval.tsv
wait_and_plot.sh		wait_and_plot.sh

Folders and files

Latest commit

History

Repository files navigation

Quick Start

GPU count variants

Table of Contents

Overview

Final System Architecture

Novelty and Contributions

Full Experiment History

Phase 1 — Lightning Acceleration

Phase 2 — Reranking and Resolution

Phase 3 — Prompt Enhancement (Failed)

Phase 4 — Prompt Simplification

Phase 5 — Slow-Motion Conditioning

Phase 6 — Vision-Grounded Simplification

Phase 7 — Step Count with Simplified Prompts

Phase 8 — Reranker Bug Fixes and Final Step Count Decision

Phase 9 — MidJourney-Style Prompts and Resolution Variants

Phase 10 — Super-Resolution Post-Processing (FlashVSR & Real-ESRGAN)

Phase 11 — Seed Ablation and Final Pipeline Consolidation

Phase 12 — Composite Reranker: Full Investigation and Final Fix

Key Finding: 6-Step is the Final Choice After Reranker Fixes

What Failed and Why

Prompt Simplification: v1 → v4 Evolution

Impact of Visual Input to Qwen

Results

Performance Optimizations

Flash Attention 2

Float8 Weight-Only Quantization

Load-Once Multi-GPU Strategy

HuggingFace Models

Environment Setup

Step 1 — Clone the repository (common)

Step 2 — Download models (common, ~180 GB)

Step 3a — Local venv setup

Step 3b — Docker image: pull (recommended) or build

Running Inference

Full 70-sample run — all GPUs (recommended)

Limit to N GPUs

Single GPU (debug / specific samples)

Evaluation and plots

Docker Deployment

Project Structure

VGBE 2026 Official Metrics

Compressing & Exporting Outputs

Quick start

Options

What it does

Typical results

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages