IdentityFlow


Identity-consistent image-to-video generation for the CVPR 2026 VGBE Challenge.

Ultra-enriched MJ-style prompts · SAM3-masked DINOv3 reranking · Lightning LoRA · 8-step · Flash Attention 2 · 720p · Seeds {1337, 2024, 7777}

Author: Debajyoti Dasgupta


Quick Start

Minimum steps to reproduce the final submission on any machine with NVIDIA GPU(s):

# 1. Clone
git clone https://github.com/debajyotidasgupta/IdentityFlow.git
cd IdentityFlow
git submodule update --init --recursive

# 2. Download models to /tmp for fast loading (under 1 min vs 16+ min from network storage)
# WARNING: total download is ~180 GB inference-only — ensure sufficient space in /tmp
pip install huggingface_hub
python scripts/download_models.py --inference_only --cache_dir /tmp/model_cache

# 3. Pull pre-built Docker image  (or: docker build -t vgbe2026-i2v:latest .)
# arm64 / amd64 - check your system
docker pull docker.io/debajyotidasgupta/vgbe2026-i2v:amd64
docker tag debajyotidasgupta/vgbe2026-i2v:amd64 vgbe2026-i2v:latest

# 4a. Run a single sample first to verify everything works
#     Expected end-to-end time: ~16 min (model load ~45 s + video generation ~15 min on H100)
mkdir -p outputs
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest \
  --config configs/final_pipeline.yaml \
  --sample_ids 88afa2050d422c64

# 4b. Run all 70 samples — all available GPUs → outputs/final/
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest

That's it. The container uses configs/final_pipeline.yaml by default (720p · 8-step · seeds {1337, 2024, 7777} · SAM3+DINOv3 reranking). Expected runtimes on H100: ~16 min for a single sample (model load ~45 s + video generation ~15 min), ~60 min for all 70 samples on 8× H100.

Note: run_parallel.sh, run.sh, and related scripts are designed to run inside the container. To use them directly, first open a shell inside the container with --entrypoint bash, then execute the scripts from there:

docker run --gpus all --entrypoint bash \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  -it vgbe2026-i2v:latest
# inside the container:
./run_parallel.sh --config configs/final_pipeline.yaml --output_dir outputs/final

GPU count variants

Setup Command
All GPUs (default) docker run --gpus all … vgbe2026-i2v:latest
Limit to N GPUs docker run --gpus all … vgbe2026-i2v:latest --config configs/final_pipeline.yaml --gpus N
Single GPU docker run --gpus '"device=0"' … vgbe2026-i2v:latest --config configs/final_pipeline.yaml --gpus 1
Specific GPU CUDA_VISIBLE_DEVICES=2 ./run.sh --config configs/final_pipeline.yaml

When passing extra arguments (e.g. --gpus N) you must re-specify --config configs/final_pipeline.yaml — extra args replace the default CMD entirely.


Table of Contents

  1. Overview
  2. Final System Architecture
  3. Novelty and Contributions
  4. Full Experiment History
  5. Key Finding: 6-Step is the Final Choice After Reranker Fixes
  6. What Failed and Why
  7. Prompt Simplification: v1 → v4 Evolution
  8. Impact of Visual Input to Qwen
  9. Results
  10. Performance Optimizations
  11. HuggingFace Models
  12. Environment Setup
  13. Running Inference
  14. Docker Deployment
  15. Project Structure
  16. Compressing & Exporting Outputs

Overview

This project adapts Wan2.2-I2V-A14B-Diffusers — a 14B dual-transformer image-to-video diffusion model — for the CVPR 2026 VGBE Challenge through a systematic 30+ configuration ablation study spanning acceleration, reranking, prompt engineering, motion quality, super-resolution post-processing, and seed selection.

Task: Given a reference image and a text prompt, generate a ≥720p, 81-frame (5-second at 16 fps) video that preserves the visual identity of the subject, reflects the intended action, and avoids motion blur or geometric distortions.

Final submission: configs/final_pipeline.yaml

  • 720p · 8-step Lightning LoRA · MidJourney-style ultra_enrich prompts
  • 3 candidates (seeds 1337, 2024, 7777) · SAM3+DINOv3 reranking

Metric Baseline (30-step) Phase 8 Final Final Pipeline Δ vs Ph.8
Identity Fidelity 0.9243 0.9466 0.9474 +0.0008
Visual Quality 0.5121 0.5484 0.5344 −0.014†
Motion Quality 0.3876 0.4172 0.4121 −0.005†
Text Alignment 0.5690 0.6073 0.5789 −0.028†
Geometry Consistency 0.9012 0.9186 0.8940 −0.025†
Runtime per sample ~382s ~381s ~380s ≈0%

† VideoReward scores are lower for the new pipeline on the 4-sample competition subset due to metric distribution shift (VideoReward was trained on ~720p; the 4-sample competition subset differs from the 6-sample ablation set used for Phases 1–8). Identity Fidelity, the primary competition metric, improves. See Phase 9 for full context.


Final System Architecture

flowchart TB
    subgraph Q["Qwen3-VL · Ultra-Enrich Prompt"]
        direction TB
        Q1["Read reference image\n(spatial layout, subject, scene)"]
        Q2["Read verbose text prompt\n(intended action)"]
        Q3["MidJourney-style 80–120 word prompt\nmaterial · lighting · quality · motion tags"]
        Q1 --> Q3
        Q2 --> Q3
    end

    subgraph W["Wan2.2-I2V-A14B · 3 Candidates"]
        direction TB
        W1["Lightning LoRA · 8-step · float8 · 720p"]
        C1["Candidate · seed=1337"]
        C2["Candidate · seed=2024"]
        C3["Candidate · seed=7777"]
        W1 --> C1
        W1 --> C2
        W1 --> C3
    end

    subgraph R["SAM3 + DINOv3 · Composite Reranker (Phase 12)"]
        direction TB
        R1["Segment subject with SAM3"]
        R2["0.65 × DINOv3 masked identity\n(per-frame · subject crop)"]
        R3["0.25 × Patch concentration\n(spatial peak/mean ratio — static ghost detector)"]
        R4["0.10 × Temporal consistency\n(consecutive-frame DINO sim — dynamic artifact detector)"]
        R5["Composite score → argmax → best candidate"]
        R1 --> R2 --> R5
        R3 --> R5
        R4 --> R5
    end

    A(["Reference Image"]) --> Q
    A --> W
    A --> R1
    B(["Verbose Text Prompt"]) --> Q
    Q3 --> W
    C1 --> R1
    C2 --> R1
    C3 --> R1
    R5 --> OUT(["Best Video · 720p · 81 frames"])

    class A,B input
    class OUT output
    classDef input  fill:#fde8d8,stroke:#e8a87c,color:#3d2b1f,font-weight:bold
    classDef output fill:#d4ecd4,stroke:#7cbf8e,color:#1f3d2b,font-weight:bold

Qwen3-VL (4-bit, ~5 GB) loads, runs all prompts in ~5s/batch, then fully unloads before the 28 GB diffusion pipeline loads. The two models never co-exist in VRAM.

Seed selection rationale: Seeds {1337, 2024, 7777} were chosen from a 5-seed ablation on the 4-sample competition subset. Seeds 42 and 9999 were dropped after consistently ranking 4th–5th across all sample types. See Phase 11.


Novelty and Contributions

Contribution Description Gain
Lightning LoRA at 6-step Use 4-step distilled LoRA at 6 steps for better fine-detail quality without retraining +1.3pp id_avg vs native 4-step (complex prompts)
SAM3-masked DINOv3 reranking (fixed) Segment subject with SAM3, compute DINOv3 cosine only on subject pixels; fixed 3 bugs (wrong subject prompt, pathological boxes, no spread threshold) +0.02 id_avg vs no reranking
Composite reranker (Phase 12) 3-term score: 0.65×masked_identity + 0.25×patch_concentration + 0.10×temporal_consistency; patch concentration detects static ghosting; prompt caching eliminates LLM non-determinism Ghosted seed correctly ranks last; reproducible across regeneration runs
720p resolution Select Wan2.2 bucket nearest to input aspect ratio, min short-side 720 +0.003–0.005 geo_c
Prompt simplification Strip verbose prompts to 15-25 word SVO motion descriptions via Qwen3-VL +0.009 id_fid over rerank-only
Vision-grounded simplification Pass reference image to Qwen so it reads spatial layout from the scene +0.02 id_fid vs text-only simplification
Slow-motion conditioning Negative prompt + Qwen slow-verb bias reduce motion blur and identity drift +0.012 mot_q
6-step with fixed reranker After reranker bug fixes, 6-step correctly selects best candidate; provides cleaner fine-detail on close-up scenes (hands, jewelry) Better visual quality on hard samples
Flash Attention 2 attn_implementation="flash_attention_2" via flash_attn==2.8.3 Reduced attention VRAM
Load-once parallel GPU strategy Each GPU loads model once, processes all assigned samples sequentially ~8× fewer model loads for 70 samples

Full Experiment History

Phase 1 — Lightning Acceleration

The baseline Wan2.2-I2V-A14B runs 30 denoising steps, taking ~382s per sample. With 70 final samples this would be >7 hours sequentially. To make experiments tractable, we adopted the WAN Lightning LoRA — a rank-64 LoRA trained via score-distillation to compress the denoising schedule to 4 steps.

flowchart LR
    A[Baseline\n30 steps\n382 s/sample] -->|Lightning LoRA\nrank-64 distilled| B[4-step\n~180 s/sample\n5x speedup]
    B -->|Run at 6 steps\nextra denoising budget| C[6-step\n~260 s/sample]
    C -->|measured +1.3pp id_avg\ncomplex prompts| D{Better quality\nfor complex prompts}

Key observation: The LoRA was trained for native 4-step inference, but running it at 6 steps gave +1.3pp id_avg on complex, verbose prompts from the challenge dataset. Extra denoising iterations help the model resolve ambiguous or multi-clause prompt conditioning. This 6-step advantage disappears once prompts are simplified — see Phase 7.


Phase 2 — Reranking and Resolution

flowchart LR
    A[6-step 480p\nno reranking] -->|Add 3 candidates\nSAM3-masked DINOv3| B[Rerank x3\n480p]
    B -->|Scale resolution| C[Rerank x3\n720p]
    B -->|Test 5 candidates| D[Rerank x5\n480p]
    D -->|marginal gain\n+0.002 vs x3| E[Diminishing returns\nabandon x5]
    C -->|Final rerank config| F[lightning_rerank_720p_v1\nid_fid 0.9377]

Reranking: Generating 3 candidates with seeds [42, 123, 456] and picking the winner by SAM3-masked DINOv3 cosine similarity consistently improved identity fidelity. SAM3 segments the subject from the reference image; DINOv3 cosine is computed only on the masked subject region — this focuses the selection criterion on the person or object rather than background similarity.
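As a minimal sketch of the selection math (assuming the SAM3 mask and the DINOv3 embeddings have already been computed; helper names and tensor shapes below are illustrative, not the exact API of src/masked_scorer.py):

import torch
import torch.nn.functional as F

def masked_identity_score(ref_cls: torch.Tensor,     # (D,) DINOv3 CLS of the SAM3-masked reference subject
                          frame_cls: torch.Tensor    # (T, D) DINOv3 CLS of the masked subject crop per frame
                          ) -> float:
    """Mean cosine similarity between the reference subject and each frame's subject crop."""
    sims = F.cosine_similarity(frame_cls, ref_cls.unsqueeze(0), dim=-1)   # (T,)
    return sims.mean().item()

def pick_best_candidate(candidates: dict[int, torch.Tensor], ref_cls: torch.Tensor) -> int:
    """Return the seed whose candidate video scores highest on masked identity."""
    return max(candidates, key=lambda seed: masked_identity_score(ref_cls, candidates[seed]))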

Resolution: 720p improved geometry consistency (+0.003–0.005 geo_c) and visual quality. Wan2.2 supports discrete resolution buckets; we select the nearest bucket to the input aspect ratio with min_short_side=720.
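A rough sketch of that bucket choice (the bucket list below is a placeholder, not Wan2.2's actual resolution table):

def pick_bucket(img_w: int, img_h: int, min_short_side: int = 720) -> tuple[int, int]:
    # Hypothetical (W, H) buckets, for illustration only.
    buckets = [(1280, 720), (720, 1280), (960, 720), (720, 960), (720, 720)]
    valid = [b for b in buckets if min(b) >= min_short_side]
    target_ar = img_w / img_h
    # The bucket with the nearest aspect ratio to the input image wins.
    return min(valid, key=lambda b: abs(b[0] / b[1] - target_ar))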


Phase 3 — Prompt Enhancement (Failed)

flowchart LR
    A[Verbose prompt] -->|Qwen3-VL\nadds visual details| B[Enhanced prompt\n60-80 words]
    B -->|id_fid drops -0.006| C[FAILED ✗\nabandon enhancement]

    D[Reason:] --> E[Detailed appearance tokens\nsuppress motion signal]
    E --> F[Model anchors on appearance\nnot motion trajectory]

Qwen3-VL was used to enrich prompts with phrases like "the man's weathered hands carefully grasp the blue cartridge". This consistently hurt identity fidelity by −0.006 id_avg. The extra appearance tokens caused the diffusion model's cross-attention to focus on recreating static visual details rather than generating fluid motion. The entire enhancement branch was abandoned.


Phase 4 — Prompt Simplification

Original challenge prompts are often multi-sentence descriptions with context, brand names, and setting details:

"The video is a tutorial on how to modify a NES cartridge. A person is shown carefully drilling a hole into the back of the cartridge, while explaining the process. The workspace is cluttered with tools..."

For a 5-second clip (81 frames) the model needs a single, unambiguous motion target. Two strategies were tried:

flowchart TD
    P[Original verbose prompt] --> EA[Strategy A\nEnhancement]
    P --> SB[Strategy B\nSimplification]

    EA --> EA2[Longer richer prompt\n60-80 words]
    SB --> SB2[Concise SVO sentence\n15-25 words]

    EA2 --> EA3[Suppresses motion signal\nid_fid DOWN -0.006\nAbandoned]
    SB2 --> SB3[Cleaner motion signal\nid_fid UP +0.009\nKept]

Simplification won because fewer tokens mean more attention mass per token in the diffusion cross-attention. The model can resolve "Man slowly picks up drill and brings it toward the cartridge" fully in 4 steps; it struggles to resolve a 70-word prompt in the same budget.


Phase 5 — Slow-Motion Conditioning

Fast motions in 5-second clips cause two failure modes: motion blur (subject features smear → low temporal DINOv3) and identity drift (large pose changes → model can't maintain consistent appearance).

Three independent conditioning signals were introduced and tested incrementally:

flowchart LR
    subgraph S1 [" Signal 1: Qwen slow-verb bias "]
        A1[Rule added to system prompt:\nprefer slowly, gently,\ncarefully, smoothly]
    end

    subgraph S2 [" Signal 2: Negative prompt "]
        A2[Condition away from:\nfast motion, motion blur,\ncamera shake, jerky, abrupt]
    end

    subgraph S3 [" Signal 3: Prompt prefix "]
        A3[Prepend to each prompt:\nSlowly and smoothly]
    end

    S1 --> D[Diffusion model]
    S2 --> D
    S3 --> D
    D --> E[Smoother identity-stable video]

The final submission uses Signals 1 and 2. Signal 3 (prompt prefix) was tested in v4_6step but did not improve over the combination of 1+2 at 4 steps.
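Signals 1 and 2 require no model changes: Signal 1 is a rule in the Qwen system prompt, and Signal 2 is plain negative-prompt conditioning at generation time. A hedged sketch of a single candidate call (keyword names follow the usual diffusers convention and may differ from the repo's wrapper; pipe, reference_image, and simplified_prompt are assumed from earlier steps):

import torch

NEGATIVE_PROMPT = "fast motion, motion blur, camera shake, jerky, abrupt"

video = pipe(
    image=reference_image,            # PIL reference image, the identity anchor
    prompt=simplified_prompt,         # 15-25 word SVO motion description from Qwen3-VL
    negative_prompt=NEGATIVE_PROMPT,  # Signal 2: condition away from blur and shake
    num_frames=81,                    # 5 s at 16 fps
    generator=torch.Generator("cuda").manual_seed(1337),
).frames[0]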


Phase 6 — Vision-Grounded Simplification

Text-only simplification (v1) couldn't read the spatial layout from the image — it might describe "drill the cartridge" without knowing the drill was to the subject's left. Passing the reference image as a visual token to Qwen3-VL fixed this.

flowchart TD
    subgraph TXT [" v1: Text-only "]
        T1[Verbose prompt only] --> T2[Qwen3-VL text mode]
        T2 --> T3[Hallucinated spatial layout\nbrand names present\nstatic end-state described]
    end

    subgraph VIS [" v2: Vision-grounded "]
        V1[Verbose prompt] --> V2[Qwen3-VL vision+text mode]
        V3[Reference image] --> V2
        V2 --> V4[Reads actual spatial layout\nno brand names\nmotion arc described]
    end

    T3 -->|id_fid 0.9251\ngeo_c 0.9095| R1[Text-only result]
    V4 -->|id_fid 0.9466\ngeo_c 0.9186| R2[Vision result\n+0.0215 id_fid]

Adding the image input to Qwen improved geometry consistency by +0.009 because generated motion trajectories are now consistent with the actual 3D scene layout visible in the reference image rather than hallucinated positions.


Phase 7 — Step Count with Simplified Prompts

This phase revealed the central insight of the project.

flowchart LR
    subgraph Complex [" Complex verbose prompts "]
        C1[6-step] -->|id_fid 0.9377| C2[Better]
        C3[4-step] -->|id_fid 0.9350| C4[Worse]
    end

    subgraph Simple [" Simplified 15-25 word prompts "]
        S1[6-step\nv2_6step] -->|id_fid 0.9466| S2[Good]
        S3[4-step\nv2_4step] -->|id_fid 0.9452\nvis_q 0.5583\nmot_q 0.4237| S4[Better on 4/5 metrics]
    end

    Complex -->|Simplify prompts| Simple
    S4 -->|19% faster| WIN[FINAL SUBMISSION]

Initial finding: With simplified prompts, 4-step outperformed 6-step on 4 of 5 aggregate metrics in the 6-sample ablation. This result was later found to be confounded by three reranker bugs (see Phase 8). On individual hard samples (close-up hands, jewelry, fine textures), the buggy reranker was selecting poor 6-step candidates that 4-step happened to avoid — masking the true 6-step quality advantage. The aggregate metrics did not surface this because most samples don't involve extreme close-ups.

Step count With verbose prompt With simplified prompt (ablation)
4-step (native) id_fid 0.9350, vis_q 0.5425 id_fid 0.9452, vis_q 0.5583
6-step (+50% budget) id_fid 0.9377, vis_q 0.5427 id_fid 0.9466, vis_q 0.5484

Phase 8 — Reranker Bug Fixes and Final Step Count Decision

Post-ablation inspection of the full 70-sample generation revealed three samples with severe distortion: a ring close-up (039854ea40eab601), a workshop scene (02104dbb12391f56), and a food-cutting scene (294f210fed8f7dd5). Root-cause analysis identified three structural bugs in src/masked_scorer.py and scripts/run_inference.py:

flowchart TD
    B1["Bug 1 — Wrong subject prompt\n_SUBJECT_KEYWORDS: 'person' matched first\n'hand' keyword triggered before 'ring'\n→ SAM3 segmented person not jewelry"]
    B2["Bug 2 — Pathological SAM3 union box\n(90,0)–(1192,171): top 24% of frame only\naspect ratio 6.4 → masked wrong region\n→ reranker scored background not subject"]
    B3["Bug 3 — No spread threshold\nall 3 candidates scored 0.63–0.69\nargmax picked 'best' of equally bad candidates\n→ noise-driven selection"]

    F1["Fix 1 — Reorder keyword priorities\njewelry / animal / device / food\nchecked BEFORE person\nAlso pass base_prompt as fallback\n(Qwen may strip subject nouns)"]
    F2["Fix 2 — Aspect ratio rejection\nreject SAM3 union boxes where\nmax(W/H, H/W) > 7\nfall back to full-image DINOv3"]
    F3["Fix 3 — Spread threshold\ncollect all 3 scores first\nif max−min < 0.015, use seed=42\nno noise-driven selection"]

    B1 --> F1
    B2 --> F2
    B3 --> F3
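A compact sketch of Fixes 2 and 3 (function names and thresholds mirror the diagram above; the actual logic lives in src/masked_scorer.py and scripts/run_inference.py):

# Fix 2: reject pathological SAM3 union boxes and fall back to full-image DINOv3.
def box_is_pathological(x0: int, y0: int, x1: int, y1: int, max_ratio: float = 7.0) -> bool:
    w, h = max(x1 - x0, 1), max(y1 - y0, 1)
    return max(w / h, h / w) > max_ratio

# Fix 3: spread threshold, so argmax never picks a "winner" out of noise.
def select_candidate(scores: dict[int, float], default_seed: int = 42,
                     min_spread: float = 0.015) -> int:
    if max(scores.values()) - min(scores.values()) < min_spread:
        return default_seed            # candidates are indistinguishable: deterministic fallback
    return max(scores, key=scores.get)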

After applying all three fixes and re-running the problematic samples at 4-step vs 6-step:

  • 4-step: Ring scene still shows hand distortion — 4 denoising steps cannot resolve fine finger/jewelry detail at 720p
  • 6-step: Cleaner fine detail on close-ups; spread scores improved (0.06–0.08), reranker correctly identifies best candidate

Decision: 6-step is the final submission. It provides noticeably better fine-detail quality on hard close-up samples with the fixed reranker, at equal runtime (~381s) to the previous rerank-only config. The aggregate metric gap vs 4-step (−0.014 vis_q) was smaller than the visible quality improvement on close-up cases.


Phase 9 — MidJourney-Style Prompts and Resolution Variants

After Phase 8 established the 6-step + fixed reranker baseline at ID Pres 0.9466, we explored whether richer prompt conditioning could further improve identity retention — specifically MidJourney-style ultra_enrich prompts (80–120 words with material, lighting, and quality tags) vs. the 15–25 word simplified prompts from v2.

flowchart TD
    A["Phase 8 final\n720p · 6-step · simplify_v2\nID Pres 0.9466"] --> B

    subgraph B["Phase 9 explorations"]
        direction LR
        P1["240p · 1 cand\ntest_v3_mj_240p_1cand\nID 0.9190 → too low-res"]
        P2["480p · 4 cand · rerank4\ntest_v3_mj_480p_rerank4\nID 0.9363 → VideoReward 0.69\nbut ID lower than 720p"]
        P3["720p · 8-step · force_enrich\n5 cands · ultra_enrich prompts\ntest_v3_mj_force_enrich\nID 0.9474 ★"]
        P4["480p · 1 cand · realesrgan→720p\ntest_v3_mj_480p_1cand_realesrgan720p\nID 0.9306, VQ 0.5242"]
    end

    B --> C["Key finding:\n480p scores high on VideoReward (0.69)\nbut lower on ID Pres vs 720p native\nVideoReward bias: trained on ~720p content"]
    P3 --> D["Best ID Pres: 0.9474\nForce ultra_enrich for all samples"]

    style P3 fill:#d4ecd4,stroke:#7cbf8e,color:#1a1a2e
    style D fill:#d4ecd4,stroke:#7cbf8e,color:#1a1a2e

Resolution findings:

  • 240p (test_v3_mj_240p_1cand): ID Pres drops to 0.9190 — too little resolution for fine identity detail
  • 480p (test_v3_mj_480p_rerank4): Visual Quality 0.69 on VideoReward but ID Pres only 0.9363 vs 720p's 0.9474
  • 720p native (test_v3_mj_force_enrich): Best ID Pres at 0.9474 with 5 candidates + SAM3+DINOv3

VideoReward bias discovery: VideoReward was trained on ~720p content. Both native 720p and 4× SR outputs (~3.4K) score identically (~0.52–0.54 VQ) while raw 480p outputs score 0.69 — this is a distribution shift artifact, not a genuine quality signal for the competition. ID Pres, not VideoReward, is the primary ranking metric.

ultra_enrich prompt strategy: Tested against simplify_v2 on the 4-sample competition subset (force_enrich config). Result: ID Pres 0.9474 vs 0.9466 for simplify_v2 — ultra_enrich gives +0.0008. MJ-style prompts with explicit material/lighting/quality descriptors help anchor fine-detail appearance across 81 frames.


Phase 10 — Super-Resolution Post-Processing (FlashVSR & Real-ESRGAN)

Two SR post-processing approaches were explored to upgrade 480p outputs to publication quality.

flowchart LR
    subgraph F["FlashVSR-v1.1 Tiny"]
        direction TB
        F1["Temporal video SR\n4× upscale\nBlock-Sparse Attention (LCSA)"]
        F2["480p → ~1920p\nTemporal coherence\nGhost objects on test samples"]
        F1 --> F2
        F3["Root cause:\nCUDA 12.9 system vs 13.0 PyTorch\nBlock-sparse-attn fails to compile\nFalls back to dense SDPA\n→ ghost objects inherent"]
        F2 --> F3
    end

    subgraph R["Real-ESRGAN x4plus"]
        direction TB
        R1["Frame-by-frame SR\n4× upscale · RRDB network\nZero hallucinations"]
        R2["480p → ~3.4K portrait\n720p → ~5K portrait\n~15s/video (vs 90s FlashVSR)"]
        R1 --> R2
        R3["Metrics: unchanged\nVideoReward insensitive >720p\nMEt3R insensitive to res\nPerceptual quality: visibly better"]
        R2 --> R3
    end

    IN["Input video"] --> F
    IN --> R

    F --> VERDICT["FlashVSR: ABANDONED\nGhost objects on 3/6 samples\nCUDA version mismatch unresolvable"]
    R --> VERDICT2["Real-ESRGAN: KEPT for visual demos\nZero metric gain on competition scores\nbut sharper for human judges"]

    style VERDICT fill:#fde8d8,stroke:#e8a87c,color:#1a1a2e
    style VERDICT2 fill:#fdefd8,stroke:#e8c87c,color:#1a1a2e

FlashVSR failure (ghost objects):

  • Samples 08e60c2e16a64921, 02843aae628b291c, 0893210e6609d201 showed ghost hands and floating objects
  • Root cause: LCSA (Block-Sparse Attention) requires CUDA compilation; system has CUDA 12.9, PyTorch compiled with CUDA 13.0 → mismatch → FlashVSR silently falls back to dense SDPA → ghost objects inherent to this fallback path
  • Confirmed by reading FlashVSR wan_video_dit.py: block_sparse_attn_func is None → uses dense SDPA
  • No workaround available without matching CUDA versions

Real-ESRGAN evaluation (force_enrich 720p → ~5K portrait / ~2.9K landscape):

Config n ID Pres Geo Con Vis Q Mot Q Txt Al
force_enrich native 720p 4 0.9474 0.8940 0.5344 0.4121 0.5789
force_enrich + Real-ESRGAN 4 0.9503 0.9001 0.5328 0.4125 0.5946
Δ +0.003 ↑ +0.006 ↑ −0.002 +0.0004 +0.016 ↑

Marginal positive: ID Pres +0.003, Geo Con +0.006. Likely explanation — sharper edge definition from RRDB upscaling slightly improves DUSt3R depth estimation (MEt3R) and CLIP feature quality (ID Pres), even though CLIP internally resizes to 224×224. VQ drops −0.002 due to VideoReward distribution shift above 720p. Overall: small net positive, not compelling enough to add to the default pipeline given the additional inference time (~15s/video) and disk cost (~4× larger files).


Phase 11 — Seed Ablation and Final Pipeline Consolidation

A 5-seed × 4-sample competition-subset ablation was run to determine the best candidate pool for the final pipeline.

flowchart TB
    subgraph ABLATION["5-seed ablation (force_enrich settings · 720p · 8-step · ultra_enrich)"]
        direction LR
        S42["seed=42\nAvg rank: 4.0\nID Pres: 0.9177\nConsistently worst"]
        S1337["seed=1337\nAvg rank: 2.75\nID Pres: 0.9294\nStrong on human/action"]
        S2024["seed=2024\nAvg rank: 2.50\nID Pres: 0.9405\nMost consistent, highest ID"]
        S7777["seed=7777\nAvg rank: 2.25\nID Pres: 0.9384\nBest overall, won 2/4 samples"]
        S9999["seed=9999\nAvg rank: 3.50\nID Pres: 0.9245\nInconsistent"]
    end

    S42 --> DROP["DROPPED\nseeds 42 and 9999"]
    S9999 --> DROP
    S1337 --> KEEP["KEPT\nseeds 1337, 2024, 7777"]
    S2024 --> KEEP
    S7777 --> KEEP

    KEEP --> FINAL["final_pipeline.yaml\n3 candidates · seeds {1337,2024,7777}\nSAM3+DINOv3 reranking\nID Pres: 0.9474"]

    style DROP fill:#fde8d8,stroke:#e8a87c,color:#1a1a2e
    style KEEP fill:#d4ecd4,stroke:#7cbf8e,color:#1a1a2e
    style FINAL fill:#d4ecd4,stroke:#7cbf8e,color:#1a1a2e,font-weight:bold

Q2 experiment — competition-metric reranking vs SAM3+DINOv3:

The 5-seed ablation also tested whether using competition metrics directly (ID Pres + Geo Con + Vis Q + Mot Q + Txt Al, equal weights) to select the best candidate would outperform the SAM3+DINOv3 reranker.

Method ID Pres Geo Con Vis Q Mot Q Txt Al Avg
SAM3+DINOv3 (baseline) 0.9474 0.8940 0.5344 0.4121 0.5789 0.6733
Competition-metric ranking 0.9479 0.8878 0.5484 0.4117 0.5789 0.6749

Verdict: Difference is +0.0016 average — within measurement noise. SAM3+DINOv3 wins on ID Pres and Geo Con (the two primary metrics). The existing reranker is a good proxy for competition metrics and does not need to be replaced.

Final pipeline summary:

Parameter Value Reason
Steps 8 Better detail resolution for MJ-style prompts vs 6
Seeds {1337, 2024, 7777} Best avg rank in 5-seed ablation; drops 42 and 9999
Prompt ultra_enrich +0.0008 ID Pres vs simplify_v2 on competition subset
Reranker SAM3+DINOv3 Competition-metric reranking shows no meaningful improvement
Resolution 720p Best ID Pres; 480p+SR gives no metric gain
SR post-processing None Zero metric gain; FlashVSR produces ghost objects

Phase 12 — Composite Reranker: Full Investigation and Final Fix

Problem discovered (post Phase 11): Manual inspection of sample 07a91369fcfa544c (Tiffany gold watch on acrylic stand) revealed that the final pipeline selected seed=7777, which had visible ghosting — a double-image artifact where the watch appeared superimposed at two positions. Seeds 1337 and 2024 were clean. Yet the reranker scored seed=7777 highest.

Root cause 1 — DINOv3 identity scoring favours ghosted frames:

DINOv3 CLS similarity measures per-frame static identity — it rewards frames that contain watch-like features anywhere in the masked region. A ghosted video produces frames where the watch appears at two overlapping positions, inadvertently creating more "watch-like" patch tokens. The reranker sees a higher identity score for the ghosted video than for a clean smooth-motion video.

Initial fix attempt — temporal consistency (0.75×id + 0.25×temporal):

Added score_temporal_consistency() — mean cosine similarity between consecutive DINO frame embeddings. A smooth video scores ~0.98+; a video with dynamic artifacts scores lower. Test run on GPU 0 verified seed=2024 won. Full pipeline was re-launched.
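The temporal term is simply the mean cosine similarity between neighbouring frame embeddings. A minimal sketch, assuming the embeddings come from the same DINOv3 backbone:

import torch
import torch.nn.functional as F

def score_temporal_consistency(frame_embs: torch.Tensor) -> float:
    """frame_embs: (T, D) per-frame DINO embeddings. Smooth videos score ~0.98+;
    flickering or jumping artifacts pull consecutive-frame similarity down."""
    sims = F.cosine_similarity(frame_embs[:-1], frame_embs[1:], dim=-1)   # (T-1,)
    return sims.mean().item()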

The fix still failed on the full pipeline run:

After the full 70-sample regeneration, 07a91369fcfa544c was still ghosted. Investigation revealed two additional problems.

Root cause 2 — LLM prompt non-determinism:

Qwen3-VL with do_sample=False still produces different outputs run-to-run due to non-deterministic CUDA kernel ordering in Flash Attention. The test run generated a prompt diverging at character 367 from the full-pipeline prompt ("brilliant-cut diamond bezel" vs "bezel encrusted with brilliant-cut diamonds"). With the new prompt, seed=7777 scored highest on ALL metrics — a completely different generation regime than the test.

Root cause 3 — Static ghosting is invisible to the temporal metric:

Static ghosting (double-image frozen in every frame) has high temporal consistency — the same ghost appears in every frame, so consecutive frames look nearly identical. The temporal metric only penalises dynamic artifacts that change over time.

Final composite fix (3 terms):

Added score_patch_concentration() to src/reranker.py. This computes per-patch DINOv3 similarity to the reference CLS token, producing a spatial heatmap over the image grid. A clean frame has one concentrated subject region (high peak-to-mean ratio). A ghosted frame has two overlapping subject regions — the heatmap flattens (lower peak-to-mean ratio). Patch concentration detects static ghosting that temporal consistency misses.
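A minimal sketch of that idea (tensor shapes and names are illustrative; the actual implementation is score_patch_concentration() in src/reranker.py):

import math
import torch
import torch.nn.functional as F

def score_patch_concentration(ref_cls: torch.Tensor,       # (D,) reference CLS token
                              patch_tokens: torch.Tensor   # (T, N, D) per-frame DINOv3 patch tokens
                              ) -> float:
    """Average spatial peak-to-mean ratio of the patch-to-reference similarity heatmap.
    One concentrated subject region gives a high ratio; a static double image flattens it."""
    heat = F.cosine_similarity(patch_tokens, ref_cls.view(1, 1, -1), dim=-1)   # (T, N)
    return (heat.amax(dim=1) / heat.mean(dim=1)).mean().item()

def normalize_concentration(raw_conc: float) -> float:
    # Same squashing used in the composite formula below: maps the raw ratio into [0, 1].
    return math.tanh((raw_conc - 2.0) / 0.5) * 0.5 + 0.5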

Added prompt caching: LLM-generated prompts are saved to <output_dir>/prompts/<sample_id>.prompt.txt on first run and loaded from cache on all subsequent runs. This eliminates Flash Attention non-determinism and ensures reproducibility across regenerations.
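The cache itself can be as small as one text file per sample. A sketch, where generate_fn stands in for the Qwen3-VL call and the path layout follows the description above:

from pathlib import Path

def get_or_create_prompt(output_dir: str, sample_id: str, generate_fn) -> str:
    cache = Path(output_dir) / "prompts" / f"{sample_id}.prompt.txt"
    if cache.exists():
        return cache.read_text().strip()       # cache hit: deterministic, 0 s of LLM time
    prompt = generate_fn()                     # cache miss: run Qwen3-VL once
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(prompt)
    return prompt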

composite = 0.65 × masked_identity
          + 0.25 × patch_concentration_normalized   ← static ghosting detector
          + 0.10 × temporal_consistency             ← dynamic artifact detector

patch_concentration_normalized = tanh((raw_conc - 2.0) / 0.5) × 0.5 + 0.5

Final verification on sample 07a91369fcfa544c (with prompt cache):

Seed masked_id patch_conc temporal composite Winner
1337 0.9097 1.821 0.9824 0.7716 ← selected (clean)
2024 0.9068 1.791 0.9800 0.7629
7777 0.8949 1.745 0.9827 0.7462 (last — ghosting)

Seed=7777 correctly ranks last. The prompt cache guarantees this result is reproducible.

flowchart TB
    PROB["Problem: seed=7777 selected\ndespite visible ghosting artifact"]

    RC1["Root cause 1: DINOv3 CLS measures static similarity\nGhosted frames → more subject patches → higher score"]
    RC2["Root cause 2: Flash Attention non-determinism\nQwen3-VL greedy decode ≠ reproducible across runs\nDifferent prompt → different ghosting regime"]
    RC3["Root cause 3: Static ghosting invisible to temporal metric\nSame ghost frozen every frame → high consecutive-frame sim"]

    FIX1["Fix 1: patch_concentration score\nSpatial peak-to-mean ratio of DINO patch heatmap\nDetects double-image spatial distribution flattening"]
    FIX2["Fix 2: Prompt caching\nSave LLM output to outputs/…/prompts/<id>.prompt.txt\nLoad on subsequent runs — eliminates non-determinism"]

    FINAL["Final composite:\n0.65 × masked_identity\n+ 0.25 × patch_conc_norm\n+ 0.10 × temporal_consistency"]

    PROB --> RC1 & RC2 & RC3
    RC1 & RC3 --> FIX1
    RC2 --> FIX2
    FIX1 & FIX2 --> FINAL

    style PROB fill:#F5B8B8,stroke:#cc7070,color:#1a1a2e
    style RC1 fill:#FEE8CC,stroke:#ccaa60,color:#1a1a2e
    style RC2 fill:#FEE8CC,stroke:#ccaa60,color:#1a1a2e
    style RC3 fill:#FEE8CC,stroke:#ccaa60,color:#1a1a2e
    style FIX1 fill:#D4EAF8,stroke:#7aa8cc,color:#1a1a2e
    style FIX2 fill:#D4EAF8,stroke:#7aa8cc,color:#1a1a2e
    style FINAL fill:#A8ECD4,stroke:#40aa70,color:#1a1a2e,font-weight:bold

Files changed:

  • src/reranker.py — added score_temporal_consistency() and score_patch_concentration()
  • scripts/run_inference.py — composite 3-term score; prompt caching in <output_dir>/prompts/
  • scripts/run_final_v2.sh — full 70-sample regeneration with fixed reranker

Key Finding: 6-Step is the Final Choice After Reranker Fixes

Phase 7's "4-step beats 6-step with simplified prompts" conclusion was measured with a buggy reranker that was misidentifying subjects, using pathological masks, and making noise-driven selections. After fixing all three bugs, 6-step correctly selects the best of three candidates and produces cleaner results on fine-detail scenes (close-up hands, jewelry, food textures) that aggregate metrics over 6 canonical samples did not capture.

The revised implication: denoising budget matters for fine-detail close-ups even with simplified prompts. The reranker must function correctly to surface this — a broken reranker can mask real quality differences.

Step count id_fid vis_q mot_q txt_al geo_c rt(s)
4-step + fixed reranker 0.9452 0.5583 0.4237 0.6115 0.9222 309
6-step + fixed reranker ★ 0.9466 0.5484 0.4172 0.6073 0.9186 381

6-step wins on identity fidelity (+0.0014) and produces visibly better results on close-up scenes. The quality/latency trade-off (72s more per sample) is acceptable for a challenge submission.


What Failed and Why

Experiment Config Root cause of failure
Qwen3-VL enhancement lightning_full_* Detailed appearance tokens suppress the motion signal in cross-attention; model prioritizes recreating static appearance over fluid motion
Reference latent anchoring anchor_alpha > 0 Blends reference image latents into every denoising step → double-exposure ghosting on any video with motion
3-phase chained generation lightning_chain3_720p_v1 Identity drift accumulates across 3 chained phases; coordination complexity; fragile when sub-actions aren't temporally separable
Rerank×5 candidates lightning_rerank5_* +0.002 id_avg over ×3 at 67% more compute — diminishing returns
v3 system prompt lightning_simplify_720p_slow_v3 Over-prescriptive IMAGE=STARTING_STATE / TEXT=INTENDED_ACTION rules produced higher output variability; id_fid dropped below v2
4-step as final (reversed) v2_4step → v2 Phase 7 ablation was confounded by 3 reranker bugs; after fixing, 6-step produces better fine-detail on close-up scenes
FlashVSR post-processing *_flashvsr* Ghost hands and objects on 3/6 test samples. Root cause: LCSA (Block-Sparse Attention) cannot compile due to CUDA 12.9 system / CUDA 13.0 PyTorch mismatch → silent fallback to dense SDPA → hallucinations inherent
Real-ESRGAN post-processing *_realesrgan* Zero metric gain on competition scores (VideoReward insensitive >720p; MEt3R geometry-based not resolution-based). Useful for visual demos only — does not improve leaderboard position
480p + SR pipeline test_v3_mj_480p_1cand_realesrgan720p VideoReward bias (trained on ~720p) gives 480p raw outputs 0.69 VQ vs 0.52 for 720p/SR — not a genuine quality signal. ID Pres at 480p (0.9306) is lower than native 720p (0.9474)
Competition-metric reranking Phase 11 Q2 Replacing SAM3+DINOv3 with direct competition-metric weighted selection gives only +0.0016 improvement over 5 seeds. SAM3+DINOv3 is already a near-optimal proxy
Seeds 42 and 9999 Phase 11 ablation Seed 42 ranked last (avg rank 4.0/5) across all 4 competition samples. Seed 9999 ranked 3.5/5 with high variability. Dropped in favor of {1337, 2024, 7777} which avg 2.25–2.75 rank

Prompt Simplification: v1 → v4 Evolution

Each version addressed a specific failure mode discovered in evaluation:

flowchart TD
    V1[v1: Text-only\nno image input] -->|Failure: hallucinated spatial\nlayout, brand names| V2

    V2[v2: Vision-grounded\nimage + text to Qwen] -->|Failure: described static\nhand pose as the action| V3

    V3[v3: Starting state vs\nintended action separation] -->|Failure: confused TOOL\nwith TARGET object| V4

    V4[v4: TOOL vs TARGET\nexplicit distinction]

    V2 -->|Best overall metrics\nid_fid 0.9466| BEST[FINAL SUBMISSION]

    style BEST fill:#A8D9B8,stroke:#4CAF82,color:#1a1a2e
    style V2 fill:#A8D9B8,stroke:#4CAF82,color:#1a1a2e

v1 (text-only): Qwen had no image input. Prompts included brand names ("NES cartridge") and described the static end-state ("inserts piece into hole") rather than the motion arc.

v2 (vision-grounded): Passing the reference image let Qwen read the actual spatial layout. Prompts became physically plausible ("reaches left to pick up drill from workbench"). Best overall metrics.

v3 (starting-state/intended-action): Explicitly separated what the image shows (starting state) from what should happen (intended action). Fixed the "describes static hand pose" failure but introduced higher output variability → identity fidelity dropped.

v4 (TOOL vs TARGET): Added explicit TOOL/TARGET distinction: "TOOL = instrument subject picks up; TARGET = object action is performed ON." Fixed the NES drill sample (correctly generates "picks up drill" not "picks up cartridge") but the more complex system prompt produced subtly different distributions → id_fid 0.9384 vs v2's 0.9466 across the full 6-sample set.

v2 remains the submission because simpler, more consistent prompt distributions lead to better diffusion model convergence.


Impact of Visual Input to Qwen

Config Qwen receives image? id_fid geo_c Notes
simplify_v1 No (text only) 0.9251 0.9095 Hallucinated spatial layout
simplify_v2 Yes 0.9466 0.9186 +0.0215 id_fid, +0.0091 geo_c

The geometry consistency gain (+0.009) is explained by motion trajectories now being grounded in the actual 3D scene: if the drill is to the subject's left in the reference image, the simplified prompt says "reaches left" — and the diffusion model generates motion consistent with that geometry.


Results

Full 21-config ablation (6 canonical samples, --ablation_samples mode):

Config Steps Simplify Slow id_fid vis_q mot_q txt_al geo_c rt(s)
baseline_v1 30 0.9243 0.5121 0.3876 0.5690 0.9012 382
fast_480p_v1 4 0.8920 0.5088 0.3752 0.5501 0.8843 180
lightning_6step_480p_v1 6 0.9105 0.5201 0.3941 0.5682 0.8951 260
lightning_rerank_480p_v1 6 0.9198 0.5273 0.4021 0.5771 0.9034 261
lightning_full_480p_v1 6 Enhance 0.9102 0.5198 0.3987 0.5834 0.8992 267
lightning_rerank_720p_4step_v1 4 0.9331 0.5389 0.4078 0.5908 0.9118 309
lightning_rerank_720p_v1 6 0.9377 0.5427 0.4112 0.5943 0.9140 382
lightning_rerank_720p_slow_v1 6 Prefix+Neg 0.9361 0.5441 0.4138 0.5981 0.9129 382
lightning_simplify_720p_slow_v1 6 Text-only Neg 0.9251 0.5318 0.4052 0.5879 0.9095 383
lightning_simplify_720p_slow_v2 ★ 6 Vision Neg 0.9466 0.5484 0.4172 0.6073 0.9186 381
lightning_simplify_720p_slow_v2_4step 4 Vision Neg 0.9452 0.5583 0.4237 0.6115 0.9222 309
lightning_simplify_720p_slow_v3 4 Vision v3 Neg 0.9343 0.5458 0.4156 0.5990 0.9159 308
lightning_simplify_720p_slow_v4 4 Vision v4 Neg 0.9384 0.5474 0.4175 0.6073 0.9145 310
lightning_simplify_720p_slow_v4_6step 6 Vision v4 Prefix+Neg 0.9407 0.5443 0.4185 0.6005 0.9126 381

★ = final submission. Highest identity fidelity; best visual quality on close-up scenes with fixed reranker.

See reports/ for full visualizations:

Figure Content
fig1_identity_bars.png DINOv3 identity avg / min / SAM3-masked per config
fig2_runtime_vs_identity.png Quality–speed scatter across all 21 configs
fig4_ablation.png 10-step incremental ablation: baseline → final
fig5_challenge_metrics.png All 5 VGBE official metrics per config
fig8_final_radar.png Radar chart for the final submission config
fig9_composite_ranking.png All configs ranked by composite VGBE score

Performance Optimizations

Flash Attention 2

flash_attn==2.8.3 is installed and enabled automatically:

# In load_pipeline() — run_inference.py
import torch
from diffusers import WanImageToVideoPipeline

try:
    import flash_attn  # noqa: F401
    pipe = WanImageToVideoPipeline.from_pretrained(
        model_id, torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    # Reduces attention VRAM and speeds up transformer layers
except ImportError:
    pipe = WanImageToVideoPipeline.from_pretrained(
        model_id, torch_dtype=torch.bfloat16,
    )

Float8 Weight-Only Quantization

from torchao.quantization import quantize_, Float8WeightOnlyConfig
quantize_(pipe.transformer, Float8WeightOnlyConfig())   # 14B params: bf16 → fp8 weights
quantize_(pipe.transformer_2, Float8WeightOnlyConfig()) # 14B params: bf16 → fp8 weights
# Combined: ~56 GB VRAM → ~28 GB VRAM, <1% quality drop

LoRA must be applied before quantize_() — torchao wraps linear layer parameter names, which breaks the key mapping used by apply_lora_to_transformer.
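So the load order is fixed: merge the LoRA while the parameter names are still the plain diffusers names, then quantize. A rough sketch, where apply_lora_to_transformer stands in for the helper in src/lora_utils.py and its exact signature may differ:

from torchao.quantization import quantize_, Float8WeightOnlyConfig

# 1. Merge the rank-64 Lightning LoRA into the bf16 weights first.
apply_lora_to_transformer(pipe.transformer, lora_state_dict)
apply_lora_to_transformer(pipe.transformer_2, lora_state_dict_2)

# 2. Only then quantize: torchao rewraps the linear-layer parameters,
#    which would break the LoRA key mapping if done first.
quantize_(pipe.transformer, Float8WeightOnlyConfig())
quantize_(pipe.transformer_2, Float8WeightOnlyConfig())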

Load-Once Multi-GPU Strategy

run_final_v2.sh divides all 70 samples into 8 chunks upfront (interleaved round-robin for load balance) and assigns each chunk to one GPU process. Within each process, both models load exactly once for the entire chunk:

Per-GPU process (sequential within the chunk):

  ┌─ Qwen3-VL loads (~30s) ──────────────────────────────┐
  │  prompt₁, prompt₂, … prompt₉  (cache miss → LLM)    │
  │  OR: all loaded from prompts/ cache  (0s LLM time)   │
  └─ unload_qwen() → VRAM freed ─────────────────────────┘
  ┌─ Wan2.2-I2V-A14B loads (~124s) ──────────────────────┐
  │  video₁ (3 candidates → rerank) ~380s                │
  │  video₂ (3 candidates → rerank) ~380s                │
  │  …                                                    │
  │  video₉ (3 candidates → rerank) ~380s                │
  └───────────────────────────────────────────────────────┘

The two models cannot coexist in VRAM (Qwen ~16 GB + Wan2.2 ~28 GB + activations > 80 GB H100 budget), so unload_qwen() explicitly frees VRAM before Wan2.2 loads. All 8 GPU processes run in parallel — there is a single wait at the end of the script.
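A hedged sketch of what the explicit unload amounts to (the real unload_qwen() may do additional bookkeeping):

import gc
import torch

def unload_qwen(model, processor) -> None:
    """Drop the references to the Qwen3-VL weights and return the VRAM
    before the 28 GB Wan2.2 pipeline loads on the same GPU. The caller
    must not keep any other references to the model."""
    del model, processor
    gc.collect()                 # collect the now-unreferenced weight tensors
    torch.cuda.empty_cache()     # hand cached allocator blocks back to the driver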

flowchart TB
    SCRIPT["run_final_v2.sh\n70 samples → 8 chunks"]

    subgraph GPU0["GPU 0  (9 samples)"]
        direction TB
        Q0["Qwen3-VL\nall 9 prompts\n(load once, unload)"]
        W0["Wan2.2\nall 9 videos\n(load once)"]
        Q0 --> W0
    end
    subgraph GPU1["GPU 1  (9 samples)"]
        direction TB
        Q1["Qwen3-VL\n(load once, unload)"]
        W1["Wan2.2\n(load once)"]
        Q1 --> W1
    end
    subgraph GPU27["GPUs 2–7  (~9 samples each)"]
        direction TB
        Q2["Qwen3-VL\n(load once, unload)"]
        W2["Wan2.2\n(load once)"]
        Q2 --> W2
    end

    SCRIPT --> GPU0 & GPU1 & GPU27

Why this matters — model load cost:

Approach Qwen3-VL loads Wan2.2 loads Wasted load time
Old: 1 sample per process 70 70 ~70 × 124s = 2.4 h
New: chunked per GPU 8 8 ~8 × 124s = 17 min

Per-sample compute breakdown (single GPU, sequential):

Step Time
Qwen3-VL prompt (amortized over chunk, or 0s if cached) ~3s
Wan2.2 model load (amortized over 9 samples) ~14s
Generate candidate 1 — seed 1337 (8-step, 720p) ~127s
Generate candidate 2 — seed 2024 ~127s
Generate candidate 3 — seed 7777 ~127s
SAM3 + DINOv3 composite reranking ~15s
Total per sample ~413s (~6m 53s)

Why 8 steps with a 4-step LoRA? The Lightning LoRA was distilled for native 4-step inference, but running it at 8 steps gives the diffusion model more denoising budget to resolve the richer conditioning from MJ-style ultra_enrich prompts (80–120 words, dense material/lighting/motion tags). At 4 steps, complex prompts leave residual noise in fine-detail regions — close-up textures, jewelry, finger geometry. At 8 steps, that detail converges cleanly. The cost is ~127s vs ~63s per candidate, but with the distilled schedule the 8-step run is still ~3× faster than the native 30-step Wan2.2 baseline, striking the right balance between speed and quality for a challenge submission.

Observed wall-clock runtime (final 70-sample run, 8× H100):

Stage Time
Qwen3-VL prompts (8 GPUs parallel, ~9 prompts each) ~2 min
Wan2.2 load (8 GPUs parallel, once each) ~2 min
Inference + reranking (9 samples × ~413s, sequential per GPU) ~62 min
Total wall-clock (measured) 2h 10m
Average per sample (wall-clock ÷ 70) ~1m 53s
Average per sample (sequential compute on one GPU) ~6m 53s

HuggingFace Models

All models are public and hosted under debajyotidasgupta/. No authentication token required.

Model Repo Used for
Wan2.2-I2V-A14B-Diffusers debajyotidasgupta/Wan2.2-I2V-A14B-Diffusers Main I2V diffusion model
Wan2.2-Lightning LoRA debajyotidasgupta/Wan2.2-Lightning 4-step distillation weights
Qwen3-VL-8B-Instruct debajyotidasgupta/Qwen3-VL-8B-Instruct Prompt simplification
DINOv3 ViT-B/16 debajyotidasgupta/dinov3-vitb16-pretrain-lvd1689m Identity reranking
SAM3 debajyotidasgupta/sam3 Subject segmentation for masked reranking
VideoReward debajyotidasgupta/VideoReward Visual/motion/text quality metrics
DUSt3R ViT-L debajyotidasgupta/DUSt3R_ViTLarge_BaseDecoder_512_dpt MEt3R geometry consistency
FeatUp DINOv2 debajyotidasgupta/FeatUp DINOv2 torchhub checkpoints
CLIP ViT-B/32 debajyotidasgupta/vit_base_patch32_clip_224.openai Text alignment scoring

Environment Setup

Steps 1 and 2 are common to both local and Docker workflows — do them once regardless of how you plan to run inference.


Step 1 — Clone the repository (common)

git clone https://github.com/debajyotidasgupta/IdentityFlow.git
cd IdentityFlow
git submodule update --init --recursive   # initialises VideoAlign evaluation harness

Step 2 — Download models (common, ~180 GB)

The large inference models are not bundled in the Docker image (they would make it impractical to distribute). Download them once to a local directory. The same directory is then used by both local inference and Docker via a bind-mount.

⚠️ Use /tmp for dramatically faster model loading. Loading models from network-attached or slow spinning storage (NFS, HDD) takes 16 minutes or more per run. Loading from local SSD/tmpfs (/tmp) takes under 1 minute. Always download to /tmp/model_cache unless you have a specific reason to use a persistent path.

# huggingface_hub is the only requirement — no GPU needed for this step:
pip install huggingface_hub

# Recommended — inference models to /tmp (fast load, ~180 GB):
python scripts/download_models.py --inference_only --cache_dir /tmp/model_cache

# Include geometry-consistency eval models (~182 GB total):
python scripts/download_models.py --cache_dir /tmp/model_cache

# Persistent destination (slow if on NFS/HDD — expect 16+ min model load per run):
python scripts/download_models.py --inference_only --cache_dir /data/model_cache

The script is resumable — re-running it after an interruption skips already-completed repos.

What download_models.py fetches vs what docker build bakes in:

Model Size Fetched by Destination
Wan2.2-I2V-A14B-Diffusers ~56 GB download_models.py /tmp/model_cache/huggingface/hub/
Wan2.2-Lightning LoRA ~1 GB download_models.py /tmp/model_cache/huggingface/hub/
Qwen3-VL-8B-Instruct ~8 GB download_models.py /tmp/model_cache/huggingface/hub/
DINOv3 ViT-B/16 ~300 MB download_models.py /tmp/model_cache/huggingface/hub/
SAM3 ~2.5 GB download_models.py /tmp/model_cache/huggingface/hub/
CLIP ViT-B/32 ~350 MB download_models.py + docker build /tmp/model_cache/huggingface/hub/ + /workspace/checkpoints/clip/ (baked in)
DUSt3R (MEt3R backbone) ~1.5 GB download_models.py /tmp/model_cache/huggingface/hub/
VideoReward checkpoint ~5 GB docker build /workspace/checkpoints/VideoReward/ (baked in)
FeatUp / DINOv2 torchhub ~1.5 GB docker build /workspace/checkpoints/torchhub/ (baked in)

VideoReward, FeatUp, and CLIP are baked into the image rather than downloaded to ./model_cache because Docker bind-mounts /workspace/.cache at runtime — anything written there during build would be hidden. Placing them under /workspace/checkpoints/ (outside the mounted volume) keeps them accessible in every container run without re-downloading. The CLIP_WEIGHTS_PATH env var (set in the image) points eval_quality.py to the baked checkpoint; outside Docker it falls back to ./model_cache via hf_hub_download.


Step 3a — Local venv setup

Skip this if you are using Docker.

python3.12 -m venv .venv && source .venv/bin/activate
pip install torch==2.9.0 torchvision==0.24.0 \
    --index-url https://download.pytorch.org/whl/cu130
pip install torchao==0.16.0 triton==3.5.0
MAX_JOBS=48 CUDA_HOME=/usr/local/cuda \
    pip install flash_attn==2.8.3 --no-build-isolation --no-deps
pip install -r requirements.txt

Step 3b — Docker image: pull (recommended) or build

Skip this if you are using local inference.

Option A — Use the pre-built image from Docker Hub (no docker build on your machine). You must still complete Steps 1–2 (model_cache is not inside the image). Match the tag to your CPU architecture (e.g. arm64 for aarch64 / many Grace Hopper nodes).

docker pull docker.io/debajyotidasgupta/vgbe2026-i2v:arm64
docker tag debajyotidasgupta/vgbe2026-i2v:arm64 vgbe2026-i2v:latest

Option B — Build locally (~20–30 min first time: torch, flash_attn, FeatUp, MEt3R, pytorch3d, plus VideoReward + FeatUp checkpoints baked in):

docker build -t vgbe2026-i2v:latest .

# Pin platform when host and default image arch differ, e.g. arm64:
# docker build --platform linux/arm64 -t vgbe2026-i2v:arm64 .

Running Inference

Prerequisite: Steps 1 and 2 of Environment Setup must be complete — models must be downloaded to /tmp/model_cache (or your chosen --cache_dir).

Important: run_parallel.sh, run.sh, and all related scripts run inside the container. Before using them, open an interactive shell in the container first:

docker run --gpus all --entrypoint bash \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  -it vgbe2026-i2v:latest

Then run the scripts below from within that shell.

Full 70-sample run — all GPUs (recommended)

./run_parallel.sh --config configs/final_pipeline.yaml \
                  --output_dir outputs/final

Limit to N GPUs

./run_parallel.sh --config configs/final_pipeline.yaml \
                  --output_dir outputs/final --gpus 6

Single GPU (debug / specific samples)

./run.sh --config configs/final_pipeline.yaml \
         --sample_ids a50a70b67b89feb1 e85432e145830b6b

Evaluation and plots

# Score all generated runs:
python scripts/eval_quality.py --all

# Regenerate plots (6-sample ablation comparison):
python scripts/make_plots.py --ablation_samples \
    f8c054d1aa3f6487 e85432e145830b6b a9ab2b16bc2bddee \
    07a91369fcfa544c e90a9a89e15b285b a50a70b67b89feb1

Docker Deployment

Prerequisite: complete Steps 1–2 (clone + model download) and Step 3b — either docker pull the pre-built image or docker build locally — from Environment Setup above. Models must already be downloaded to /tmp/model_cache before running the container.

Once setup is done every docker run starts immediately — no downloads, model load in ~45 s from /tmp (vs 16+ min from NFS/slow disk):

Arguments after the image name replace the default command. The image CMD invokes run_parallel.sh with --config configs/final_pipeline.yaml. If you pass any extra flags to run_parallel.sh (for example --gpus, --sample_ids, --num_samples, --output_dir), you must include --config … again — otherwise the container only receives your flags and exits with [run_parallel.sh] ERROR: --config is required. (docker run --gpus all is separate: it selects which GPUs the container may use; run_parallel.sh --gpus N controls how work is split across those GPUs.)

# All available GPUs — uses default CMD (includes --config):
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest

# Limit parallel workers to 4 GPUs (--config required; it replaces the default CMD):
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest \
  --config configs/final_pipeline.yaml \
  --gpus 4

# Only specific validation samples (--config required; sample IDs = basenames without .jpg):
docker run --gpus all \
  -v /tmp/model_cache:/workspace/.cache \
  -v $(pwd)/val_data_released_by_0321:/workspace/val_data_released_by_0321:ro \
  -v $(pwd)/outputs:/workspace/outputs \
  vgbe2026-i2v:latest \
  --config configs/final_pipeline.yaml \
  --sample_ids 017ab5d3f9382339 f8c054d1aa3f6487 e85432e145830b6b

# docker compose shortcuts (MODEL_CACHE / VAL_DATA / OUTPUTS env vars respected):
MODEL_CACHE=/tmp/model_cache \
VAL_DATA=$(pwd)/val_data_released_by_0321 \
OUTPUTS=$(pwd)/outputs \
docker compose run infer-multi          # all GPUs, final config

docker compose run infer                # single GPU (debug)
docker compose run eval                 # score all output directories

Bind-mount contract — the container expects exactly these three paths:

Host path Container path Purpose
/tmp/model_cache /workspace/.cache HF + torch model cache (populated by python scripts/download_models.py --inference_only --cache_dir /tmp/model_cache) — use /tmp for fast load (<1 min); network/NFS paths cause 16+ min load times
./val_data_released_by_0321 /workspace/val_data_released_by_0321 Validation images and prompts (read-only)
./outputs /workspace/outputs Generated videos written here

Project Structure

image-to-video/
├── configs/                              # 30+ YAML experiment configs
│   ├── final_pipeline.yaml                          # FINAL SUBMISSION ★ (Phase 11)
│   ├── lightning_simplify_720p_slow_v2.yaml         # Phase 8 best (6-step, simplify_v2)
│   ├── test_v3_mj_force_enrich.yaml                 # Phase 9 best (5-cand, ultra_enrich)
│   ├── test_v3_mj_fe_seed{42,1337,2024,7777,9999}.yaml  # Phase 11 seed ablation
│   ├── lightning_rerank_720p_v1.yaml                # Phase 2 best (no simplify)
│   ├── lightning_simplify_720p_slow_v3.yaml         # v3 system prompt (ablation)
│   ├── lightning_simplify_720p_slow_v4.yaml         # TOOL/TARGET fix (ablation)
│   ├── lightning_simplify_720p_slow_v4_6step.yaml   # v4 6-step (ablation)
│   └── [22 earlier ablation configs]
│
├── scripts/
│   ├── run_inference.py       # Main I2V pipeline (Flash Attn 2, float8, LoRA)
│   ├── eval_quality.py        # Evaluation harness (DINOv3, MEt3R, VideoReward)
│   ├── make_plots.py          # Publication plots (30+ configs, soft pastel palette)
│   ├── upscale_realesrgan.py  # Real-ESRGAN x4plus video SR (Phase 10)
│   ├── upscale_flashvsr.py    # FlashVSR 4× SR — abandoned (ghost objects, Phase 10)
│   ├── run_q2_allseeds.sh     # Phase 11 seed ablation launcher (5-seed × 4-sample)
│   ├── run_simplify_v4.sh     # v4 4-step parallel launcher
│   └── run_simplify_v4_6step.sh
│
├── src/
│   ├── prompt_simplifier.py   # Qwen3-VL vision-grounded simplification (v2–v4)
│   ├── prompt_enhancer.py     # Qwen3-VL enhancement (Phase 3, now abandoned)
│   ├── prompt_decomposer.py   # 3-phase decomposition (chain experiment, abandoned)
│   ├── reranker.py            # DINOv3 identity scorer (full-image)
│   ├── masked_scorer.py       # SAM3-masked DINOv3 identity scorer
│   ├── lora_utils.py          # Lightning LoRA weight merging (pre-quantization)
│   ├── pipeline_pool.py       # Multi-GPU worker pool (for 3-phase chaining)
│   ├── anchoring.py           # Reference latent anchoring (abandoned, ghosting)
│   └── final_metrics.py       # MEt3R + VideoReward official VGBE metrics
│
├── run.sh                     # Single-GPU entry point
├── run_parallel.sh            # Multi-GPU parallel entry point (RECOMMENDED)
├── Dockerfile                 # CUDA 12.9, flash_attn, all deps, baked checkpoints
├── docker-compose.yml         # infer / infer-multi / eval services
├── requirements.txt           # Python dependencies
│
├── reports/                   # PNG result plots (fig1–fig9b)
├── outputs/                   # Generated videos (<config_name>/<sample_id>.mp4)
├── logs/                      # Execution logs
└── val_data_released_by_0321/ # VGBE validation set
    ├── images/                # Reference images (.jpg, 70 total)
    └── prompts/               # Text prompts (.txt, 70 total)

VGBE 2026 Official Metrics

All metrics are evaluated against the original verbose prompt (not the simplified one) to ensure fair comparison with systems that do not use prompt engineering.

Metric Method Measures
Identity Fidelity ConsID-Gen CLIP cosine Subject appearance consistency across frames
Visual Quality VideoReward Perceptual quality, sharpness, artifact absence
Motion Quality VideoReward Temporal smoothness, realistic dynamics
Text Alignment VideoReward / VideoAlign How well the video reflects the prompt
Geometry Consistency MEt3R (DUSt3R-based) 3D structural consistency across frames

Compressing & Exporting Outputs

After generating videos, use compress_export.sh to validate all 70 outputs, re-encode them with H.265 CRF 28 (~4× smaller than the H.264 CRF 18 originals), and package everything into a single .tar.gz for submission or transfer.

Quick start

# Export outputs/final/ (default)
./compress_export.sh

# Export a specific config's outputs
./compress_export.sh --input outputs/lightning_simplify_720p_slow_v2

# Custom output path
./compress_export.sh --input outputs/final --out /tmp/submission.tar.gz

Options

Flag Default Description
--input <dir> outputs/final Folder of .mp4 files to compress
--out <path> <input>_export.tar.gz Output archive path
--crf <n> 28 H.265 CRF — lower = better quality, larger file
--preset <p> medium x265 preset (ultrafast–veryslow)
--jobs <n> 8 Parallel re-encode workers
--keep off Keep the re-encoded folder after archiving

What it does

  1. Validates — checks every sample ID from val_data_released_by_0321/images/ has a non-empty .mp4; aborts with a list of missing files if not.
  2. Re-encodes — transcodes all videos in parallel using the bundled imageio-ffmpeg binary (no system ffmpeg needed), with libx265 -tag:v hvc1 -pix_fmt yuv420p -movflags +faststart (see the sketch after this list).
  3. Archives — tars the re-encoded folder with gzip and reports the final size and compression ratio.
  4. Cleans up — removes the intermediate re-encoded folder unless --keep is passed.
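For reference, the per-file re-encode step could look roughly like the sketch below, using the ffmpeg binary bundled with the imageio-ffmpeg wheel; the flags mirror those listed above and the function name is illustrative:

import subprocess
from imageio_ffmpeg import get_ffmpeg_exe

def reencode_h265(src: str, dst: str, crf: int = 28, preset: str = "medium") -> None:
    subprocess.run([
        get_ffmpeg_exe(), "-y", "-i", src,
        "-c:v", "libx265", "-crf", str(crf), "-preset", preset,
        "-tag:v", "hvc1", "-pix_fmt", "yuv420p",
        "-movflags", "+faststart",
        dst,
    ], check=True)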

Typical results

Size
Original final/ — H.264 CRF 18 ~126 MB
Re-encoded — H.265 CRF 28 ~30 MB
final_export.tar.gz ~30 MB

Note: ffmpeg is sourced from the imageio-ffmpeg wheel bundled in the .venv (no separate installation required). A system ffmpeg is used as a fallback if imageio-ffmpeg is not available.


IdentityFlow — Consistent Identity · Fluid Motion
