Discussion on Evaluation Shortcuts and Trivial Baselines on VBVR-Bench

First of all, thank you for putting together such a cool dataset and benchmark! I evaluated a series of simple baseline predictions on the 500 samples in VBVR-Bench-Data (running on CPU) to see how the current evaluation metrics handle non-reasoning shortcuts. I believe I followed the setup steps correctly, but I would love for someone else to reproduce these results to confirm them.

I observed two interesting trends:

- **Trivial Baselines** (requiring zero knowledge of the task's final state, such as a completely static frame) can outperform SOTA generative models on the leaderboard.

- **End-State Shortcuts** (which transition to the final frame while bypassing intermediate reasoning steps) achieve an overall score of ~0.92, indicating that intermediate states and temporal dynamics might not be heavily validated.

Below, I have documented the scores and behaviors of these baselines. I hope this is helpful for refining the benchmark's evaluation framework to better reflect true physical and logical reasoning.

---

### 1. Trivial Baselines Outperforming SOTA Models

I tested three trivial baselines that require no knowledge of the final task solution and can be generated purely from the first input frame:
- **Static:** Copy the input first frame for all $N$ frames.
- **Rotation:** Rotate the first frame progressively from 0 to 30 degrees over the video, filling the exposed borders with gray.
- **Noise Fade:** Linear interpolation from the first frame to a static Gaussian noise frame.
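The three baselines above can be sketched with NumPy alone. This is a minimal illustration rather than the exact script I ran: the function names, the nearest-neighbour rotation, the rotation direction, the gray value of 128, and the noise mean and standard deviation are all assumptions chosen for the example. Frames are assumed to be `uint8` arrays of shape `(H, W, 3)`.

```python
import numpy as np


def static_baseline(first_frame: np.ndarray, n_frames: int) -> np.ndarray:
    """Repeat the first frame for every time step."""
    return np.repeat(first_frame[None], n_frames, axis=0)


def rotation_baseline(first_frame: np.ndarray, n_frames: int,
                      max_angle_deg: float = 30.0, fill: int = 128) -> np.ndarray:
    """Rotate the first frame about its centre from 0 to max_angle_deg.

    Nearest-neighbour inverse mapping; pixels that fall outside the
    source image are filled with a gray value (assumed to be 128).
    """
    h, w = first_frame.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    frames = np.full((n_frames, *first_frame.shape), fill, dtype=np.uint8)
    for i, angle in enumerate(np.linspace(0.0, max_angle_deg, n_frames)):
        t = np.deg2rad(angle)
        # Map each output pixel back to its source location.
        src_x = np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx
        src_y = -np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
        sx = np.rint(src_x).astype(int)
        sy = np.rint(src_y).astype(int)
        valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
        frames[i][valid] = first_frame[sy[valid], sx[valid]]
    return frames


def noise_fade_baseline(first_frame: np.ndarray, n_frames: int,
                        seed: int = 0) -> np.ndarray:
    """Linearly interpolate from the first frame to one fixed Gaussian noise frame."""
    rng = np.random.default_rng(seed)
    # Noise parameters are illustrative assumptions.
    noise = rng.normal(127.5, 50.0, size=first_frame.shape).clip(0, 255)
    alphas = np.linspace(0.0, 1.0, n_frames).reshape(-1, 1, 1, 1)
    frames = (1.0 - alphas) * first_frame.astype(np.float64) + alphas * noise
    return frames.round().astype(np.uint8)
```

None of these functions receives anything other than the first frame and the frame count, which is the point of calling them trivial.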

See examples below:

<table>
  <thead>
    <tr>
      <th>Task</th>
      <th>Ground Truth</th>
      <th>Static</th>
      <th>Rotation</th>
      <th>Noise Fade</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>O-62 (Gravity Physics)</strong></td>
      <td><video src="https://github.com/user-attachments/assets/8ab15fc2-353b-4b90-bf48-e6325847a368" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/9d9c9b79-d850-489f-b951-a28952f4d94f" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/35e371b2-fa0f-4750-a82f-779a452dd404" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/f6ebf4e9-f413-4fde-91a7-a6207122b131" width="200" controls></video></td>
    </tr>
    <tr>
      <td><strong>O-56 (Raven Matrix)</strong></td>
      <td><video src="https://github.com/user-attachments/assets/dc5703a0-393b-40c9-a86f-479954da8f51" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/e77f7e67-6c68-4dd3-9c7d-5402a767a8ac" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/b9eebb50-002c-41c1-9637-52280ea92380" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/e93e50de-7857-48f2-97dd-3dd8db8fc643" width="200" controls></video></td>
    </tr>
    <tr>
      <td><strong>O-65 (Animal Size Sorting)</strong></td>
      <td><video src="https://github.com/user-attachments/assets/0c7c1de5-16ee-4ae7-a171-96307fa63dad" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/3c1fa0b2-4352-4c92-ac1c-84145b94d2e6" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/84259d8e-ff48-4b41-9c86-76f5345301d7" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/5e6253fe-9141-4337-9b20-0a2109324d13" width="200" controls></video></td>
    </tr>
  </tbody>
</table>

Here is how these trivial baselines rank against models on the official leaderboard (along with the Human Reference), sorted by Overall Average:

| Rank | Model / Setting | Type | In-Domain (ID) | Out-of-Domain (OOD) | Overall Average |
|---|---|---|---|---|---|
| - | ⬜ **Human Reference** | 👤 Reference | 0.9600 | 0.9880 | **0.9740** |
| 2 | 🟩 **Static** | 🛠️ Our Baseline | 0.7356 | 0.6873 | **0.7114** |
| 3 | 🔷 **VBVR-Wan2.2** | 🏆 Leaderboard #1 | 0.7599 | 0.6097 | **0.6848** |
| 4 | 🟩 **Rotation** | 🛠️ Our Baseline | 0.6277 | 0.5934 | **0.6106** |
| 5 | 🔷 **VBVR-Wan2.1** | 🏆 Leaderboard #2 | 0.7239 | 0.4609 | **0.5924** |
| 6 | 🔷 **Sora 2** | 🏆 Leaderboard #3 | 0.5691 | 0.5225 | **0.5457** |
| 7 | 🔷 **Seedance 2.0** | 🏆 Leaderboard #4 | 0.5703 | 0.5171 | **0.5437** |
| 8 | 🔷 **VBVR-LTX2.3** | 🏆 Leaderboard #5 | 0.5801 | 0.4526 | **0.5163** |
| 9 | 🔷 **Veo 3.1** | 🏆 Leaderboard #6 | 0.5307 | 0.4288 | **0.4800** |
| 10 | 🟩 **Noise Fade** | 🛠️ Our Baseline | 0.4331 | 0.4244 | **0.4287** |
| 11 | 🔷 **Runway Gen-4 Turbo** | 🏆 Leaderboard #7 | 0.3920 | 0.4141 | **0.4031** |
| 12 | 🔷 **Wan2.2-I2V-A14B** | 🏆 Leaderboard #8 | 0.4125 | 0.3287 | **0.3714** |
| 13 | 🔷 **Kling 2.6** | 🏆 Leaderboard #9 | 0.4082 | 0.3300 | **0.3691** |
| 14 | 🔷 **LTX-2** | 🏆 Leaderboard #10 | 0.3287 | 0.2971 | **0.3129** |
| 15 | 🔷 **CogVideoX1.5-5B-I2V** | 🏆 Leaderboard #11 | 0.2831 | 0.2623 | **0.2727** |
| 16 | 🔷 **HunyuanVideo-I2V** | 🏆 Leaderboard #12 | 0.2799 | 0.2653 | **0.2726** |

#### Why this happens

The overall score calculation, defined in `vbvr_bench/evaluators/base_evaluator.py` under the `_calculate_overall_score()` method (lines 373-395), computes a weighted average of the individual dimension scores using a standard set of weights:

```python
    def _calculate_overall_score(self, dimensions: Dict[str, float]) -> float:
        standard_weights = {
            'first_frame_consistency': 0.15,
            'final_frame_accuracy': 0.35,
            'temporal_smoothness': 0.15,
            'visual_quality': 0.10,
            'task_specific': 0.25,
        }
        clamped_dimensions = {k: max(0.0, min(1.0, v)) for k, v in dimensions.items()}
        return max(0.0, min(1.0, weighted_average(clamped_dimensions, standard_weights)))
```

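The three dimensions that a video can satisfy without solving the task (`first_frame_consistency`, `temporal_smoothness`, and `visual_quality`) carry 0.40 of the total weight, and the trivial baselines score highly on all of them. Static additionally gets 0.7471 on final-frame accuracy without changing a single pixel, while Noise Fade still reaches 0.4287 overall with a final-frame accuracy of only 0.0343. For reference, here are the detailed component scores obtained by these trivial baselines: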

| Setting | In-Domain | Out-of-Domain | Overall Average | FF Consist | LF Acc | Temp Smooth | Vis Quality | Task Spec |
|---|---|---|---|---|---|---|---|---|
| **Static** | 0.7356 | 0.6873 | **0.7114** | 1.0000 | 0.7471 | 0.9983 | 0.7333 | 0.3655 |
| **Noise Fade** | 0.4331 | 0.4244 | **0.4287** | 1.0000 | 0.0343 | 0.9434 | 0.8008 | 0.1979 |
| **Rotation** | 0.6277 | 0.5934 | **0.6106** | 1.0000 | 0.5710 | 0.9744 | 0.6343 | 0.2543 |
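To make the arithmetic concrete, here is a minimal sketch that plugs the Static row into the standard weights. `weighted_overall` is a hypothetical stand-in for the repository's `weighted_average` plus clamping (I am assuming a plain normalized weighted mean), and `non_reasoning_contribution` is a helper I made up for illustration. The result will not exactly match the Overall Average column, which is the mean of the In-Domain and Out-of-Domain scores rather than a function of the aggregated component means.

```python
from typing import Dict

# Weights copied from the _calculate_overall_score() excerpt.
STANDARD_WEIGHTS: Dict[str, float] = {
    "first_frame_consistency": 0.15,
    "final_frame_accuracy": 0.35,
    "temporal_smoothness": 0.15,
    "visual_quality": 0.10,
    "task_specific": 0.25,
}

# Component means for the Static baseline, copied from the table.
STATIC: Dict[str, float] = {
    "first_frame_consistency": 1.0000,
    "final_frame_accuracy": 0.7471,
    "temporal_smoothness": 0.9983,
    "visual_quality": 0.7333,
    "task_specific": 0.3655,
}

NON_REASONING = ("first_frame_consistency", "temporal_smoothness", "visual_quality")


def _clamp(value: float) -> float:
    return max(0.0, min(1.0, value))


def weighted_overall(dimensions: Dict[str, float],
                     weights: Dict[str, float] = STANDARD_WEIGHTS) -> float:
    """Normalized weighted mean of clamped dimension scores (assumed behaviour)."""
    keys = [k for k in dimensions if k in weights]
    total_weight = sum(weights[k] for k in keys)
    if total_weight == 0:
        return 0.0
    score = sum(weights[k] * _clamp(dimensions[k]) for k in keys) / total_weight
    return _clamp(score)


def non_reasoning_contribution(dimensions: Dict[str, float],
                               weights: Dict[str, float] = STANDARD_WEIGHTS) -> float:
    """Points of the overall score earned from dimensions that need no task solving."""
    return sum(weights[k] * _clamp(dimensions[k]) for k in NON_REASONING)


if __name__ == "__main__":
    print(f"overall from component means: {weighted_overall(STATIC):.4f}")
    print(f"from non-reasoning dimensions: {non_reasoning_contribution(STATIC):.4f}")
```

For Static, roughly 0.37 of the recomputed 0.73 comes from the three non-reasoning dimensions alone.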

---

### Suggested Scoring Change

This is just an initial suggestion, and I would love to hear others' thoughts on how to address this. Rather than blending everything into a single overall score that allows these shortcuts to dominate, it might make sense to record and display two separate numbers on the leaderboard:
1. **Task Reasoning Score**: Measuring task-specific accuracy and physics correctness (e.g. `Task Spec`, `LF Acc`).
2. **Video Quality Score**: Measuring visual quality, consistency, and smoothness (e.g. `FF Consist`, `Temp Smooth`, `Vis Quality`).
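As a rough sketch of what the split could look like, the snippet below reuses the existing standard weights and renormalizes them within each group. That renormalization is purely my assumption (the proposal above does not fix any weights), and `split_scores` is a hypothetical helper, not something in the repository.

```python
from typing import Dict

# Existing standard weights, regrouped into the two proposed scores (assumed grouping).
REASONING_WEIGHTS: Dict[str, float] = {
    "final_frame_accuracy": 0.35,
    "task_specific": 0.25,
}
QUALITY_WEIGHTS: Dict[str, float] = {
    "first_frame_consistency": 0.15,
    "temporal_smoothness": 0.15,
    "visual_quality": 0.10,
}


def _group_score(dimensions: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean over one group, renormalized so the group's weights sum to 1."""
    total = sum(weights.values())
    return sum(w * max(0.0, min(1.0, dimensions[k])) for k, w in weights.items()) / total


def split_scores(dimensions: Dict[str, float]) -> Dict[str, float]:
    """Return the two proposed leaderboard numbers for one set of dimension scores."""
    return {
        "task_reasoning": _group_score(dimensions, REASONING_WEIGHTS),
        "video_quality": _group_score(dimensions, QUALITY_WEIGHTS),
    }


if __name__ == "__main__":
    # Component means of the Static and Noise Fade baselines from the table in section 1.
    static = {
        "first_frame_consistency": 1.0000, "final_frame_accuracy": 0.7471,
        "temporal_smoothness": 0.9983, "visual_quality": 0.7333, "task_specific": 0.3655,
    }
    noise_fade = {
        "first_frame_consistency": 1.0000, "final_frame_accuracy": 0.0343,
        "temporal_smoothness": 0.9434, "visual_quality": 0.8008, "task_specific": 0.1979,
    }
    print("Static:", split_scores(static))
    print("Noise Fade:", split_scores(noise_fade))
```

Under these assumed weights, Static separates into roughly 0.59 for task reasoning and 0.93 for video quality, and Noise Fade drops to roughly 0.10 on task reasoning. Static's reasoning score remains fairly high because of its final-frame accuracy, so the split alone would not remove every shortcut.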

---

### 2. Bypassing Intermediate Physics via End-State Predictions

I also tested two baselines that assume the final target frame (the solution) is known, but perform no intermediate reasoning or physical dynamics, bypassing the temporal path:
- **Linear:** A linear cross-fade transition from the first frame to the ground truth final frame.
- **Discrete:** Display the first frame for the first half of the video, then jump directly to the final frame for the second half.
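A minimal NumPy sketch of the two end-state shortcuts, again with illustrative function names rather than the exact script I used. Both take the first frame and the ground-truth final frame as `uint8` arrays of shape `(H, W, 3)`; how an odd frame count is split in the discrete case is an assumption.

```python
import numpy as np


def linear_baseline(first_frame: np.ndarray, final_frame: np.ndarray,
                    n_frames: int) -> np.ndarray:
    """Cross-fade linearly from the first frame to the final frame."""
    alphas = np.linspace(0.0, 1.0, n_frames).reshape(-1, 1, 1, 1)
    frames = ((1.0 - alphas) * first_frame.astype(np.float64)
              + alphas * final_frame.astype(np.float64))
    return frames.round().astype(np.uint8)


def discrete_baseline(first_frame: np.ndarray, final_frame: np.ndarray,
                      n_frames: int) -> np.ndarray:
    """Hold the first frame for the first half, then jump to the final frame."""
    half = n_frames // 2  # with an odd count, the extra frame goes to the second half
    return np.concatenate([
        np.repeat(first_frame[None], half, axis=0),
        np.repeat(final_frame[None], n_frames - half, axis=0),
    ])
```

Neither function models any motion: every intermediate frame is either a blend of the two endpoints or a copy of one of them.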

See examples below:

<table>
  <thead>
    <tr>
      <th>Task</th>
      <th>Ground Truth</th>
      <th>Linear</th>
      <th>Discrete</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>O-62 (Gravity Physics)</strong></td>
      <td><video src="https://github.com/user-attachments/assets/32462454-994a-4194-a867-0a7cec2f874b" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/286d3512-bd21-46ec-9787-29cc108932bc" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/ee0bad03-447c-4365-aa29-48653ba7402d" width="200" controls></video></td>
    </tr>
    <tr>
      <td><strong>O-56 (Raven Matrix)</strong></td>
      <td><video src="https://github.com/user-attachments/assets/2439e869-cfb9-4030-a359-21a1243e7721" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/056e7c54-4e83-43bf-b940-aff7364f4361" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/3d328c27-735d-47d9-9ea8-dd2fcf91c7ae" width="200" controls></video></td>
    </tr>
    <tr>
      <td><strong>O-65 (Animal Size Sorting)</strong></td>
      <td><video src="https://github.com/user-attachments/assets/dcd4a3da-27a2-4088-bfa5-5d94ee6f8c9f" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/9e68d15f-b522-46da-8ced-89979db9425e" width="200" controls></video></td>
      <td><video src="https://github.com/user-attachments/assets/65f472f9-6c05-4459-b173-bb2da0108d12" width="200" controls></video></td>
    </tr>
  </tbody>
</table>

Here are the detailed scores for these baselines:

| Setting | In-Domain | Out-of-Domain | Overall Average | FF Consist | LF Acc | Temp Smooth | Vis Quality | Task Spec |
|---|---|---|---|---|---|---|---|---|
| **Linear** | 0.9345 | 0.9016 | **0.9181** | 1.0000 | 0.9854 | 0.9963 | 0.7209 | 0.8511 |
| **Discrete** | 0.9356 | 0.9056 | **0.9206** | 1.0000 | 0.9877 | 0.9672 | 0.7378 | 0.8687 |

These achieve overall scores of **`0.9181`** and **`0.9206`** respectively, showing that a video which only gets the final frame right, with no plausible intermediate dynamics, is extremely competitive.

This raises the question of how well the intermediate states and physics of the video are being evaluated. The task-specific scores for these shortcuts are quite high (`~0.85+`). The spatial alignment checks in the task-specific evaluators do have a non-zero penalizing effect: for example, in some tasks the discrete step is penalized on graph-path or optical-flow checks compared to a true continuous physical sequence, which shows that the alignment evaluation works to some degree. However, it is unclear how much intermediate temporal consistency actually matters compared to simply getting the final frame correct.
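One rough way to probe this, offered as a diagnostic sketch rather than a proposed metric: measure the mean absolute pixel error against the ground-truth video on the intermediate frames only, excluding the first and last frame. `intermediate_frame_error` is a hypothetical helper, and it assumes the prediction and ground truth are `uint8` arrays with identical shape `(N, H, W, 3)`, which the real data may not guarantee without resampling.

```python
import numpy as np


def intermediate_frame_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error in [0, 1] over all frames except the first and last."""
    if pred.shape != gt.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    if pred.shape[0] < 3:
        raise ValueError("need at least one intermediate frame")
    diff = np.abs(pred[1:-1].astype(np.float64) - gt[1:-1].astype(np.float64))
    return float(diff.mean() / 255.0)


if __name__ == "__main__":
    # Toy ground truth: one bright pixel moving along the diagonal.
    n, size = 6, 8
    gt = np.zeros((n, size, size, 3), dtype=np.uint8)
    for i in range(n):
        gt[i, i, i] = 255

    # Discrete shortcut: hold the first frame, then jump to the final frame.
    shortcut = np.concatenate([np.repeat(gt[:1], n // 2, axis=0),
                               np.repeat(gt[-1:], n - n // 2, axis=0)])

    print("ground truth vs itself:", intermediate_frame_error(gt, gt))
    print("discrete shortcut:", intermediate_frame_error(shortcut, gt))
```

A shortcut with a correct final frame but wrong intermediate frames gets a non-zero error here, whereas a final-frame comparison alone would not distinguish it from the ground truth.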

---

### Minor Note: NameError in G-24 & G-54 Evaluators

As a side note, while running the evaluation suite, tasks `G-24_separate_objects_no_spin_data-generator` and `G-54_connecting_color_data-generator` silently crashed with a `NameError: name 'safe_distance' is not defined` inside `SeparateObjectsNoSpinEvaluator` and `ConnectingColorEvaluator` (both located in `vbvr_bench/evaluators/Out_of_Domain_50_part1.py`, on lines 91 and 668 respectively). I did not fix this bug in my own evaluations, so that my baseline runs remain comparable to the scores reported on the current leaderboard.

Adding `safe_distance` to the imports at the top of that file fixes both crashes:

```diff
-from ..utils import compute_optical_flow, normalize_frame_size
+from ..utils import compute_optical_flow, safe_distance, normalize_frame_size
```

