Discussion on Evaluation Shortcuts and Trivial Baselines on VBVR-Bench #240

@PabloAcuaviva

First of all, thank you for putting together such a cool dataset and benchmark! I evaluated a series of simple baseline predictions on the 500 samples in VBVR-Bench-Data (running on CPU) to see how the current evaluation metrics handle non-reasoning shortcuts. I believe I followed the setup steps correctly, but I would love for someone else to reproduce these results to confirm them.

I observed two interesting trends:

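  • Trivial Baselines (requiring zero knowledge of the task's final state, such as a completely static frame) can outperform SOTA generative models on the leaderboard: the Static baseline reaches an overall average of 0.7114, above the top leaderboard entry (VBVR-Wan2.2, 0.6848).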

  • End-State Shortcuts (which transition to the final frame while bypassing intermediate reasoning steps) achieve an overall score of ~0.92, indicating that intermediate states and temporal dynamics might not be heavily validated.

Below, I have documented the scores and behaviors of these baselines. I hope this is helpful for refining the benchmark's evaluation framework to better reflect true physical and logical reasoning.


1. Trivial Baselines Outperforming SOTA Models

I tested three trivial baselines that require no knowledge of the final task solution and can be generated purely from the first input frame:

  • Static: Copy the input first frame for all $N$ frames.
  • Rotation: Rotate the first frame progressively from 0 to 30 degrees over the video, filling the exposed borders with gray.
  • Noise Fade: Linear interpolation from the first frame to a static Gaussian noise frame.
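
The three baselines above can be sketched in a few lines of NumPy. This is only an illustration, not the script used for the reported numbers: the function names, the nearest-neighbour rotation, the gray level of 128, and the noise mean and standard deviation are all assumptions, and frames are assumed to be uint8 arrays of shape (H, W, 3).

```python
import numpy as np


def static_baseline(first_frame, n_frames):
    """Repeat the first frame n_frames times."""
    return np.repeat(first_frame[None], n_frames, axis=0)


def rotate_nearest(frame, angle_deg, fill=128):
    """Rotate about the image centre with nearest-neighbour sampling.

    Pixels that fall outside the source image get the gray value `fill`
    (128 is an assumed gray level). The rotation direction is arbitrary.
    """
    h, w = frame.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy
    # Inverse mapping: for every output pixel, look up its source pixel.
    src_x = np.rint(np.cos(theta) * dx + np.sin(theta) * dy + cx).astype(int)
    src_y = np.rint(-np.sin(theta) * dx + np.cos(theta) * dy + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.full_like(frame, fill)
    out[valid] = frame[src_y[valid], src_x[valid]]
    return out


def rotation_baseline(first_frame, n_frames, max_angle=30.0):
    """Rotate the first frame linearly from 0 to max_angle degrees."""
    angles = np.linspace(0.0, max_angle, n_frames)
    return np.stack([rotate_nearest(first_frame, a) for a in angles])


def noise_fade_baseline(first_frame, n_frames, seed=0):
    """Linearly interpolate from the first frame to one fixed noise frame."""
    rng = np.random.default_rng(seed)
    # Assumed noise statistics: mean 128, std 50, clipped to the uint8 range.
    noise = np.clip(rng.normal(128.0, 50.0, first_frame.shape), 0, 255)
    alphas = np.linspace(0.0, 1.0, n_frames).reshape(-1, *([1] * first_frame.ndim))
    video = (1.0 - alphas) * first_frame.astype(np.float64) + alphas * noise
    return np.rint(video).astype(np.uint8)
```

Each function returns an array of shape (N, H, W, 3) whose first frame equals the input frame, and none of them looks at the task solution.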

See examples below:

| Task | Ground Truth | Static | Rotation | Noise Fade |
|---|---|---|---|---|
| O-62 (Gravity Physics) | 00000_ground_truth.mp4 | 00000_setting1_static.mp4 | 00000_setting5_rotation.mp4 | 00000_setting4_noise.mp4 |
| O-56 (Raven Matrix) | 00000_ground_truth.mp4 | 00000_setting1_static.mp4 | 00000_setting5_rotation.mp4 | 00000_setting4_noise.mp4 |
| O-65 (Animal Size Sorting) | 00000_ground_truth.mp4 | 00000_setting1_static.mp4 | 00000_setting5_rotation.mp4 | 00000_setting4_noise.mp4 |

Here is how these trivial baselines rank against models on the official leaderboard (along with the Human Reference), sorted by Overall Average:

| Rank | Model / Setting | Type | In-Domain (ID) | Out-of-Domain (OOD) | Overall Average |
|---|---|---|---|---|---|
| - | Human Reference | 👤 Reference | 0.9600 | 0.9880 | 0.9740 |
| 2 | 🟩 Static | 🛠️ Our Baseline | 0.7356 | 0.6873 | 0.7114 |
| 3 | 🔷 VBVR-Wan2.2 | 🏆 Leaderboard #1 | 0.7599 | 0.6097 | 0.6848 |
| 4 | 🟩 Rotation | 🛠️ Our Baseline | 0.6277 | 0.5934 | 0.6106 |
| 5 | 🔷 VBVR-Wan2.1 | 🏆 Leaderboard #2 | 0.7239 | 0.4609 | 0.5924 |
| 6 | 🔷 Sora 2 | 🏆 Leaderboard #3 | 0.5691 | 0.5225 | 0.5457 |
| 7 | 🔷 Seedance 2.0 | 🏆 Leaderboard #4 | 0.5703 | 0.5171 | 0.5437 |
| 8 | 🔷 VBVR-LTX2.3 | 🏆 Leaderboard #5 | 0.5801 | 0.4526 | 0.5163 |
| 9 | 🔷 Veo 3.1 | 🏆 Leaderboard #6 | 0.5307 | 0.4288 | 0.4800 |
| 10 | 🟩 Noise Fade | 🛠️ Our Baseline | 0.4331 | 0.4244 | 0.4287 |
| 11 | 🔷 Runway Gen-4 Turbo | 🏆 Leaderboard #7 | 0.3920 | 0.4141 | 0.4031 |
| 12 | 🔷 Wan2.2-I2V-A14B | 🏆 Leaderboard #8 | 0.4125 | 0.3287 | 0.3714 |
| 13 | 🔷 Kling 2.6 | 🏆 Leaderboard #9 | 0.4082 | 0.3300 | 0.3691 |
| 14 | 🔷 LTX-2 | 🏆 Leaderboard #10 | 0.3287 | 0.2971 | 0.3129 |
| 15 | 🔷 CogVideoX1.5-5B-I2V | 🏆 Leaderboard #11 | 0.2831 | 0.2623 | 0.2727 |
| 16 | 🔷 HunyuanVideo-I2V | 🏆 Leaderboard #12 | 0.2799 | 0.2653 | 0.2726 |

Why this happens

The overall score calculation, defined in vbvr_bench/evaluators/base_evaluator.py under the _calculate_overall_score() method (lines 373-395), computes a weighted average of the individual dimension scores using a standard set of weights:

    def _calculate_overall_score(self, dimensions: Dict[str, float]) -> float:
        standard_weights = {
            'first_frame_consistency': 0.15,
            'final_frame_accuracy': 0.35,
            'temporal_smoothness': 0.15,
            'visual_quality': 0.10,
            'task_specific': 0.25,
        }
        clamped_dimensions = {k: max(0.0, min(1.0, v)) for k, v in dimensions.items()}
        return max(0.0, min(1.0, weighted_average(clamped_dimensions, standard_weights)))

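For reference, here are the detailed component scores obtained by these trivial baselines. The column abbreviations correspond to the weight keys above: FF Consist is first_frame_consistency, LF Acc is final_frame_accuracy, Temp Smooth is temporal_smoothness, Vis Quality is visual_quality, and Task Spec is task_specific.

| Setting | In-Domain | Out-of-Domain | Overall Average | FF Consist | LF Acc | Temp Smooth | Vis Quality | Task Spec |
|---|---|---|---|---|---|---|---|---|
| Static | 0.7356 | 0.6873 | 0.7114 | 1.0000 | 0.7471 | 0.9983 | 0.7333 | 0.3655 |
| Noise Fade | 0.4331 | 0.4244 | 0.4287 | 1.0000 | 0.0343 | 0.9434 | 0.8008 | 0.1979 |
| Rotation | 0.6277 | 0.5934 | 0.6106 | 1.0000 | 0.5710 | 0.9744 | 0.6343 | 0.2543 |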
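
To make the effect of the weights concrete, the sketch below recomputes a weighted average from the mean component scores in the table above. It rests on assumptions: the weighted_average function here is a stand-in for the repo helper, whose implementation is not shown in this issue, and applying the weights to mean scores is not the same as averaging per-sample overall scores. The recomputed values therefore only approximate the reported overall averages (about 0.73 versus 0.7114 for Static). The sketch also reports how much of each total comes from the three dimensions that do not depend on solving the task.

```python
# Weights copied from _calculate_overall_score() as quoted in this issue.
WEIGHTS = {
    "first_frame_consistency": 0.15,
    "final_frame_accuracy": 0.35,
    "temporal_smoothness": 0.15,
    "visual_quality": 0.10,
    "task_specific": 0.25,
}

# Mean component scores from the table above, listed in WEIGHTS order.
BASELINES = {
    "Static": (1.0000, 0.7471, 0.9983, 0.7333, 0.3655),
    "Noise Fade": (1.0000, 0.0343, 0.9434, 0.8008, 0.1979),
    "Rotation": (1.0000, 0.5710, 0.9744, 0.6343, 0.2543),
}

# Dimensions a video can score well on without solving the task.
TASK_AGNOSTIC = ("first_frame_consistency", "temporal_smoothness", "visual_quality")


def weighted_average(scores, weights):
    """Hypothetical stand-in for the repo helper: normalised weighted mean."""
    total = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total


def breakdown(values):
    """Return (overall score, contribution of the task-agnostic dimensions)."""
    scores = dict(zip(WEIGHTS, values))
    overall = weighted_average(scores, WEIGHTS)
    task_agnostic = sum(scores[k] * WEIGHTS[k] for k in TASK_AGNOSTIC)
    return overall, task_agnostic


for name, values in BASELINES.items():
    overall, task_agnostic = breakdown(values)
    print(f"{name:10s} overall={overall:.4f} task-agnostic part={task_agnostic:.4f}")
```

Under these assumptions, roughly 0.37 of the Static baseline's score comes from first-frame consistency, temporal smoothness, and visual quality alone, which a frozen frame satisfies almost perfectly.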

Suggested Scoring Change

As an initial suggestion (I would love to hear others' thoughts on how to address this): rather than blending everything into a single overall score that these shortcuts can dominate, it might make sense to record and display two separate numbers on the leaderboard:

  1. Task Reasoning Score: Measuring task-specific accuracy and physics correctness (e.g. Task Spec, LF Acc).
  2. Video Quality Score: Measuring visual quality, consistency, and smoothness (e.g. FF Consist, Temp Smooth, Vis Quality).
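
One way such a split could look is sketched below. The grouping follows the two items above, but the function names and the choice to reuse and renormalise the existing weights within each group are assumptions for illustration, not a proposal for the exact formula.

```python
# Existing weights, regrouped into the two proposed scores (assumed grouping).
TASK_WEIGHTS = {"final_frame_accuracy": 0.35, "task_specific": 0.25}
QUALITY_WEIGHTS = {
    "first_frame_consistency": 0.15,
    "temporal_smoothness": 0.15,
    "visual_quality": 0.10,
}


def group_score(dimensions, weights):
    """Weighted mean over one group, renormalised so the weights sum to 1."""
    clamped = {k: max(0.0, min(1.0, dimensions[k])) for k in weights}
    return sum(clamped[k] * weights[k] for k in weights) / sum(weights.values())


def split_scores(dimensions):
    """Return the two leaderboard numbers instead of one blended score."""
    return {
        "task_reasoning": group_score(dimensions, TASK_WEIGHTS),
        "video_quality": group_score(dimensions, QUALITY_WEIGHTS),
    }


# Mean component scores reported for the Static baseline in this issue.
static_dims = {
    "first_frame_consistency": 1.0000,
    "final_frame_accuracy": 0.7471,
    "temporal_smoothness": 0.9983,
    "visual_quality": 0.7333,
    "task_specific": 0.3655,
}
print(split_scores(static_dims))
```

With the reported Static means, this yields a task reasoning score near 0.59 and a video quality score near 0.93, so the quality dimensions no longer lift the reasoning number.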

2. Bypassing Intermediate Physics via End-State Predictions

I also tested two baselines that assume the final target frame (the solution) is known but skip all intermediate reasoning and physical dynamics:

  • Linear: A linear cross-fade transition from the first frame to the ground truth final frame.
  • Discrete: Display the first frame for the first half of the video, then jump directly to the final frame for the second half.
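
Both shortcuts are simple to express in NumPy. The sketch below is illustrative only: the function names are hypothetical, frames are assumed to be uint8 arrays of identical shape, and the choice of where the Discrete jump lands for an odd frame count is an assumption.

```python
import numpy as np


def linear_baseline(first_frame, final_frame, n_frames):
    """Cross-fade linearly from the first frame to the known final frame."""
    alphas = np.linspace(0.0, 1.0, n_frames).reshape(-1, *([1] * first_frame.ndim))
    video = (1.0 - alphas) * first_frame.astype(np.float64) \
        + alphas * final_frame.astype(np.float64)
    return np.rint(video).astype(np.uint8)


def discrete_baseline(first_frame, final_frame, n_frames):
    """Hold the first frame for the first half, then jump to the final frame."""
    half = n_frames // 2  # for odd n_frames the second half is one frame longer
    return np.stack([first_frame] * half + [final_frame] * (n_frames - half))
```

Neither function models any intermediate state: every frame is either a pixel-wise blend or a direct copy of the two endpoint frames.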

See examples below:

| Task | Ground Truth | Linear | Discrete |
|---|---|---|---|
| O-62 (Gravity Physics) | 00000_ground_truth.mp4 | 00000_setting2_linear.mp4 | 00000_setting3_discrete.mp4 |
| O-56 (Raven Matrix) | 00000_ground_truth.mp4 | 00000_setting2_linear.mp4 | 00000_setting3_discrete.mp4 |
| O-65 (Animal Size Sorting) | 00000_ground_truth.mp4 | 00000_setting2_linear.mp4 | 00000_setting3_discrete.mp4 |

Here are the detailed scores for these baselines:

| Setting | In-Domain | Out-of-Domain | Overall Average | FF Consist | LF Acc | Temp Smooth | Vis Quality | Task Spec |
|---|---|---|---|---|---|---|---|---|
| Linear | 0.9345 | 0.9016 | 0.9181 | 1.0000 | 0.9854 | 0.9963 | 0.7209 | 0.8511 |
| Discrete | 0.9356 | 0.9056 | 0.9206 | 1.0000 | 0.9877 | 0.9672 | 0.7378 | 0.8687 |

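These achieve overall scores of 0.9181 and 0.9206 respectively, well above the top leaderboard model (VBVR-Wan2.2, 0.6848) and within about 0.06 of the Human Reference (0.9740). Simply predicting the final image is therefore extremely competitive.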

This raises the question of how well the intermediate states and physics of the video are being evaluated. The task-specific scores for these shortcuts are high (0.8511 for Linear and 0.8687 for Discrete). The spatial alignment checks in the task-specific evaluators do apply some penalty: for example, in some tasks the Discrete baseline loses points on graph-path or optical-flow checks relative to a true continuous physical sequence, which shows that the alignment evaluation works to some extent. Even so, it is unclear how much intermediate temporal consistency actually matters compared with simply getting the final frame correct.


Minor Note: NameError in G-24 & G-54 Evaluators

As a side note, while running the evaluation suite, the evaluators for tasks G-24_separate_objects_no_spin_data-generator and G-54_connecting_color_data-generator silently crashed with NameError: name 'safe_distance' is not defined. The error is raised inside SeparateObjectsNoSpinEvaluator and ConnectingColorEvaluator, both located in vbvr_bench/evaluators/Out_of_Domain_50_part1.py (lines 91 and 668 respectively). I did not fix this bug in my own evaluations, so that my baseline runs remain comparable to the scores reported on the current leaderboard.

Adding safe_distance to the imports at the top of that file fixes both crashes:

-from ..utils import compute_optical_flow, normalize_frame_size
+from ..utils import compute_optical_flow, safe_distance, normalize_frame_size
