First of all, thank you for putting together such a cool dataset and benchmark! I evaluated a series of simple baseline predictions on the 500 samples in VBVR-Bench-Data (running on CPU) to see how the current evaluation metrics handle non-reasoning shortcuts. I believe I followed the setup steps correctly, but I would love for someone else to reproduce these results just to ensure everything is right.
I observed two interesting trends:
Trivial Baselines (requiring zero knowledge of the task's final state, such as a completely static frame) can outperform SOTA generative models on the leaderboard.
End-State Shortcuts (which transition to the final frame while bypassing intermediate reasoning steps) achieve an overall score of ~0.92, indicating that intermediate states and temporal dynamics might not be heavily validated.
Below, I have documented the scores and behaviors of these baselines. I hope this is helpful for refining the benchmark's evaluation framework to better reflect true physical and logical reasoning.
1. Trivial Baselines Outperforming SOTA Models
I tested three trivial baselines that require no knowledge of the final task solution and can be generated purely from the first input frame:
Static: Copy the input first frame for all $N$ frames.
Rotation: Rotate the first frame progressively from 0 to 30 degrees, filling the exposed borders with gray.
Noise Fade: Linear interpolation from the first frame to a static Gaussian noise frame.
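For anyone trying to reproduce this, the three baselines can be sketched roughly as below. This is a minimal NumPy sketch, not the script used for the numbers in this issue: the function names, the nearest-neighbour rotation, the gray value of 128, and the noise mean and standard deviation are all assumptions, and frames are assumed to be uint8 arrays of shape (H, W, C).

```python
import numpy as np


def static_video(first_frame: np.ndarray, n_frames: int) -> np.ndarray:
    """Repeat the first frame n_frames times -> (n_frames, H, W, C)."""
    return np.repeat(first_frame[None], n_frames, axis=0)


def rotate_frame(frame: np.ndarray, angle_deg: float, fill: int = 128) -> np.ndarray:
    """Rotate about the image centre (nearest neighbour); uncovered pixels become gray."""
    h, w = frame.shape[:2]
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Inverse mapping: for each output pixel, look up the source pixel it comes from.
    src_x = np.rint(np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx).astype(int)
    src_y = np.rint(-np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.full_like(frame, fill)  # gray (assumed value) where no source pixel exists
    out[inside] = frame[src_y[inside], src_x[inside]]
    return out


def rotation_video(first_frame: np.ndarray, n_frames: int, max_angle: float = 30.0) -> np.ndarray:
    """Rotate the first frame from 0 to max_angle degrees across the video."""
    angles = np.linspace(0.0, max_angle, n_frames)
    return np.stack([rotate_frame(first_frame, a) for a in angles])


def noise_fade_video(first_frame: np.ndarray, n_frames: int, seed: int = 0) -> np.ndarray:
    """Linearly interpolate from the first frame to one fixed Gaussian noise frame."""
    rng = np.random.default_rng(seed)
    # Noise mean/std are arbitrary choices for this sketch.
    noise = np.clip(rng.normal(128.0, 50.0, first_frame.shape), 0, 255)
    alphas = np.linspace(0.0, 1.0, n_frames).reshape(-1, 1, 1, 1)
    video = (1.0 - alphas) * first_frame[None].astype(np.float64) + alphas * noise[None]
    return np.rint(video).astype(np.uint8)
```

None of these functions looks at the ground truth, which is the point: each one needs only the first input frame and the frame count.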
See examples below:
| Task | Ground Truth | Static | Rotation | Noise Fade |
| --- | --- | --- | --- | --- |
| O-62 (Gravity Physics) | 00000_ground_truth.mp4 | 00000_setting1_static.mp4 | 00000_setting5_rotation.mp4 | 00000_setting4_noise.mp4 |
| O-56 (Raven Matrix) | 00000_ground_truth.mp4 | 00000_setting1_static.mp4 | 00000_setting5_rotation.mp4 | 00000_setting4_noise.mp4 |
| O-65 (Animal Size Sorting) | 00000_ground_truth.mp4 | 00000_setting1_static.mp4 | 00000_setting5_rotation.mp4 | 00000_setting4_noise.mp4 |
Here is how these trivial baselines rank against models on the official leaderboard (along with the Human Reference), sorted by Overall Average:
Why this happens
The overall score calculation, defined in vbvr_bench/evaluators/base_evaluator.py under the _calculate_overall_score() method (lines 373-395), computes a weighted average of the individual dimension scores using a standard set of weights:
For reference, here are the detailed component scores obtained by these trivial baselines:
| Setting | In-Domain | Out-of-Domain | Overall Average | FF Consist | LF Acc | Temp Smooth | Vis Quality | Task Spec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Static | 0.7356 | 0.6873 | 0.7114 | 1.0000 | 0.7471 | 0.9983 | 0.7333 | 0.3655 |
| Noise Fade | 0.4331 | 0.4244 | 0.4287 | 1.0000 | 0.0343 | 0.9434 | 0.8008 | 0.1979 |
| Rotation | 0.6277 | 0.5934 | 0.6106 | 1.0000 | 0.5710 | 0.9744 | 0.6343 | 0.2543 |
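To make the weighted-average mechanism concrete, here is a small sketch using the Static row. The weights are placeholders (equal weights), not the values in _calculate_overall_score(), and the helper names are made up for illustration. With equal weights the Static row averages to about 0.769 rather than the reported 0.7114, so the real weights clearly differ; the sketch only shows how much of a weighted average can come from dimensions that a frozen video satisfies trivially.

```python
# Component scores for the Static baseline, copied from the table above.
static_scores = {
    "ff_consist": 1.0000,
    "lf_acc": 0.7471,
    "temp_smooth": 0.9983,
    "vis_quality": 0.7333,
    "task_spec": 0.3655,
}

# PLACEHOLDER weights (equal). The real weights are defined in
# _calculate_overall_score() and are NOT reproduced here.
placeholder_weights = {name: 0.2 for name in static_scores}


def weighted_overall(scores: dict, weights: dict) -> float:
    """Weighted average of the dimension scores, normalised by the weight sum."""
    total_weight = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total_weight


def share_from(scores: dict, weights: dict, dims: list) -> float:
    """Fraction of the overall score contributed by the dimensions in `dims`."""
    total_weight = sum(weights.values())
    part = sum(weights[k] * scores[k] for k in dims) / total_weight
    return part / weighted_overall(scores, weights)


overall = weighted_overall(static_scores, placeholder_weights)
quality_share = share_from(
    static_scores, placeholder_weights, ["ff_consist", "temp_smooth", "vis_quality"]
)
print(f"overall={overall:.4f}, share from non-task dimensions={quality_share:.2%}")
```

Swapping in the actual weights from base_evaluator.py would show the real split between task-related and presentation-related contributions.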
Suggested Scoring Change
Rather than blending everything into a single overall score that allows these shortcuts to dominate, it might make sense to record and display two separate numbers on the leaderboard. Of course, this is just an initial suggestion and I would love to hear others' thoughts on how to address this:
Task-oriented score: Measuring whether the task itself is solved (e.g. Task Spec, LF Acc).
Video Quality Score: Measuring visual quality, consistency, and smoothness (e.g. FF Consist, Temp Smooth, Vis Quality).
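As a rough illustration of the two-number idea, the sketch below groups FF Consist, Temp Smooth, and Vis Quality into a video-quality number and the remaining two dimensions (LF Acc, Task Spec) into a task-oriented number. The key names and the plain unweighted mean inside each group are assumptions made for this example, not a proposal for the exact formula.

```python
from statistics import mean

# (FF Consist, LF Acc, Temp Smooth, Vis Quality, Task Spec) for the trivial baselines,
# copied from the component-score table in section 1.
ROWS = {
    "Static": (1.0000, 0.7471, 0.9983, 0.7333, 0.3655),
    "Noise Fade": (1.0000, 0.0343, 0.9434, 0.8008, 0.1979),
    "Rotation": (1.0000, 0.5710, 0.9744, 0.6343, 0.2543),
}


def split_scores(row: tuple) -> dict:
    """Return two separate numbers instead of one blended overall score."""
    ff_consist, lf_acc, temp_smooth, vis_quality, task_spec = row
    return {
        # Unweighted means are an assumption; any within-group weighting could be used.
        "video_quality": mean([ff_consist, temp_smooth, vis_quality]),
        "task": mean([task_spec, lf_acc]),
    }


for name, row in ROWS.items():
    s = split_scores(row)
    print(f"{name:<10}  video_quality={s['video_quality']:.4f}  task={s['task']:.4f}")
```

Under this grouping the three trivial baselines stay at roughly 0.87 to 0.91 on video quality but drop to roughly 0.12 to 0.56 on the task-oriented number, so they would no longer look competitive on the axis that is supposed to reflect reasoning.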
2. Bypassing Intermediate Physics via End-State Predictions
I also tested two baselines that assume the final target frame (the solution) is known, but perform no intermediate reasoning or physical dynamics, bypassing the temporal path:
Linear: A linear cross-fade transition from the first frame to the ground truth final frame.
Discrete: Display the first frame for the first half of the video, then jump directly to the final frame for the second half.
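These two shortcuts can be sketched as follows. Again this is a minimal NumPy sketch under assumptions rather than the exact script behind the reported scores: the function names are hypothetical, frames are assumed to be uint8 arrays of shape (H, W, C), and for an odd frame count the extra frame is given to the second half.

```python
import numpy as np


def linear_video(first_frame: np.ndarray, final_frame: np.ndarray, n_frames: int) -> np.ndarray:
    """Cross-fade pixel-wise from the first frame to the ground-truth final frame."""
    alphas = np.linspace(0.0, 1.0, n_frames).reshape(-1, 1, 1, 1)
    start = first_frame[None].astype(np.float64)
    end = final_frame[None].astype(np.float64)
    return np.rint((1.0 - alphas) * start + alphas * end).astype(np.uint8)


def discrete_video(first_frame: np.ndarray, final_frame: np.ndarray, n_frames: int) -> np.ndarray:
    """Hold the first frame for the first half, then jump to the final frame."""
    half = n_frames // 2  # assumption: with odd n_frames the second half is one frame longer
    return np.stack([first_frame] * half + [final_frame] * (n_frames - half))
```

Neither function models any motion or intermediate state; the only task knowledge they use is the final frame.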
See examples below:
| Task | Ground Truth | Linear | Discrete |
| --- | --- | --- | --- |
| O-62 (Gravity Physics) | 00000_ground_truth.mp4 | 00000_setting2_linear.mp4 | 00000_setting3_discrete.mp4 |
| O-56 (Raven Matrix) | 00000_ground_truth.mp4 | 00000_setting2_linear.mp4 | 00000_setting3_discrete.mp4 |
| O-65 (Animal Size Sorting) | 00000_ground_truth.mp4 | 00000_setting2_linear.mp4 | 00000_setting3_discrete.mp4 |
Here are the detailed scores for these baselines:
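| Setting | In-Domain | Out-of-Domain | Overall Average | FF Consist | LF Acc | Temp Smooth | Vis Quality | Task Spec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | 0.9345 | 0.9016 | 0.9181 | 1.0000 | 0.9854 | 0.9963 | 0.7209 | 0.8511 |
| Discrete | 0.9356 | 0.9056 | 0.9206 | 1.0000 | 0.9877 | 0.9672 | 0.7378 | 0.8687 |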
These achieve overall scores of 0.9181 and 0.9206 respectively, showing that simply predicting the final image is extremely competitive.
This raises the question of how well the intermediate states and physics of the video are being evaluated. The task-specific scores for these shortcuts are quite high (~0.85+). The spatial alignment checks in the task-specific evaluators do have a non-zero penalizing effect: for example, in some tasks the discrete jump is penalized on graph-path or optical-flow checks relative to a true continuous physical sequence, which shows that the alignment evaluation is working to some extent. However, it is unclear how much intermediate temporal consistency actually matters compared to simply getting the final frame correct.
Minor Note: NameError in G-24 & G-54 Evaluators
As a side note, while running the evaluation suite, tasks G-24_separate_objects_no_spin_data-generator and G-54_connecting_color_data-generator silently crashed with a NameError: name 'safe_distance' is not defined inside SeparateObjectsNoSpinEvaluator and ConnectingColorEvaluator (both located in vbvr_bench/evaluators/Out_of_Domain_50_part1.py on lines 91 and 668 respectively). Note that I did not fix this bug during my own evaluations to make sure my baseline runs remain comparable to the scores reported on the current leaderboard.
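For context on why this kind of bug is easy to miss: in Python, a name that was never imported only raises NameError when the line using it actually runs, not when the module is imported. The sketch below is a standalone illustration, not code from vbvr_bench; separation_score and evaluate_task are hypothetical names, and the error-swallowing wrapper is just one assumed way a crash could go unnoticed.

```python
def separation_score(a: float, b: float) -> float:
    # `safe_distance` is deliberately never imported or defined in this module,
    # yet defining this function raises no error.
    return safe_distance(a, b)


def evaluate_task(a: float, b: float):
    """Hypothetical wrapper that catches every exception and reports it as text."""
    try:
        return separation_score(a, b)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"


# The error only appears once the function body actually executes.
print(evaluate_task(0.0, 1.0))
```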
Adding safe_distance to the imports at the top of that file fixes both crashes: