Hi authors,
First of all, thanks for the amazing work and for open-sourcing the code!
I am currently trying to reproduce the VBench metrics for the Wan base model (3x extrapolation) as reported in Table 1. However, I noticed that the trend for the subject_consistency metric in my local tests doesn't perfectly align with the paper's results, and the gap between PE and UltraViCo is quite large.
Here are the results from my two comparative experiments:
📊 My Experimental Results
| Experiment | Method | subject_consistency | dynamic_degree | imaging_quality | overall_consistency |
|---|---|---|---|---|---|
| 1 | UltraViCo | 0.9141 | 32.0 | 63.957 | 19.4289 |
| 1 | PE | 0.9759 | 9.0 | 58.7393 | 16.7371 |
| 2 | UltraViCo | 0.9147 | 32.0 | 65.1345 | 21.4484 |
| 2 | PE | 0.9821 | 10.0 | 58.4083 | 18.5526 |
💻 My Evaluation Code
For reference, here is the VBench evaluation script I am currently using to calculate the metrics:
```python
import os
import json
import argparse
import re

from vbench import VBench


def main():
    parser = argparse.ArgumentParser(description="VBench Evaluation for UltraViCo Wan2.1")
    parser.add_argument("--prompt_file", type=str, default="/data/proj/zbx/DiT-Extrapolation/assets/all_dimension_100_set2.txt")
    parser.add_argument("--video_dir", type=str, default="/data/proj/zbx/DiT-Extrapolation/set2/output/vbench_results_wan_ultra_3x")
    parser.add_argument("--output_path", type=str, default="set2/output/results_ultra/vbench_wan_metrics.json")
    parser.add_argument("--vbench_info", type=str, default="/data/proj/zbx/VBench/vbench/VBench_full_info.json")
    args = parser.parse_args()

    output_dir = os.path.dirname(args.output_path)
    if not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)

    run_name = "wan2.1_extrapolation"
    dimension_list = ["subject_consistency", "dynamic_degree", "imaging_quality", "overall_consistency"]

    # 1. Read prompts (one prompt per non-empty line)
    with open(args.prompt_file, 'r', encoding='utf-8') as f:
        prompts_list = [line.strip() for line in f.read().splitlines() if line.strip()]
    print(f">>> Successfully read prompt file, {len(prompts_list)} valid prompts in total.")

    # 2. Build video -> prompt mapping (videos are named by 1-based prompt index, e.g. 001.mp4 -> prompt 0)
    video_prompt_mapping = {}
    abs_video_dir = os.path.abspath(args.video_dir)
    video_files = [f for f in os.listdir(abs_video_dir) if f.endswith(".mp4")]
    for filename in video_files:
        try:
            match = re.search(r'\d+', filename)
            if not match:
                continue
            video_id = int(match.group())
            index = video_id - 1  # 001.mp4 -> index 0
            if 0 <= index < len(prompts_list):
                real_prompt = prompts_list[index]
                video_key = os.path.join(abs_video_dir, filename)
                video_prompt_mapping[video_key] = real_prompt
            else:
                print(f"Warning: Video {filename} (ID: {video_id}) index out of bounds. Total prompts: {len(prompts_list)}")
        except Exception as e:
            print(f"Error processing file {filename}: {e}")
    print(f">>> Mapping complete, {len(video_prompt_mapping)} videos matched with prompts.")

    # 3. Initialize VBench
    my_VBench = VBench(device='cuda', full_info_dir=args.vbench_info, output_path=os.path.abspath(output_dir))

    # 4. Evaluate
    my_VBench.evaluate(
        videos_path=abs_video_dir,
        name=run_name,
        dimension_list=dimension_list,
        mode='custom_input',
        prompt_list=video_prompt_mapping,
    )

    # 5. Summarize results from the per-run JSON written by VBench
    print("\n" + "=" * 50)
    eval_results_file = os.path.join(output_dir, f"{run_name}_eval_results.json")
    if os.path.exists(eval_results_file):
        with open(eval_results_file, 'r') as f:
            all_data = json.load(f)
        summary = {}
        for dim in dimension_list:
            # Each dimension is stored as [overall_score, per_video_results]
            val = float(all_data[dim][0] if isinstance(all_data[dim], list) else all_data[dim])
            if "subject_consistency" in dim:
                summary[dim] = round(val, 4)
            else:
                summary[dim] = round(val * 100, 4)  # report the other dimensions as percentages
        for k, v in summary.items():
            print(f"{k:25}: {v:.4f}")
        with open(args.output_path, 'w') as f:
            json.dump(summary, f, indent=4)
        print(f"Final results exported to: {args.output_path}")


if __name__ == "__main__":
    main()
```
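As a related sanity check (relevant to my first question below), I use something like the following to see whether PE's high `subject_consistency` is driven by the clips that `dynamic_degree` flags as static. This is only a rough sketch: I'm assuming the per-video layout I see in my local `*_eval_results.json` (each dimension stored as `[overall_score, per_video_list]` with `video_path` / `video_results` fields), and the PE results path is just a placeholder for my local directory layout.

```python
import json

def load_per_video(eval_results_path, dimension):
    """Return {video_path: per-video result} for one VBench dimension.

    Assumes each dimension is stored as [overall_score, per_video_list],
    as in my local *_eval_results.json files.
    """
    with open(eval_results_path, "r") as f:
        data = json.load(f)
    return {entry["video_path"]: entry["video_results"] for entry in data[dimension][1]}

results_file = "set2/output/results_pe/wan2.1_extrapolation_eval_results.json"  # placeholder path
consistency = load_per_video(results_file, "subject_consistency")
is_dynamic = load_per_video(results_file, "dynamic_degree")  # per-video boolean in my output

static_clips = [v for v in consistency if not is_dynamic.get(v, True)]
dynamic_clips = [v for v in consistency if is_dynamic.get(v, True)]
for name, clips in [("static", static_clips), ("dynamic", dynamic_clips)]:
    if clips:
        mean_sc = sum(consistency[v] for v in clips) / len(clips)
        print(f"{name:>7} clips: {len(clips):3d}, mean subject_consistency = {mean_sc:.4f}")
```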
🤔 My Questions:
- Metric Collapse on PE? Is the extremely high `subject_consistency` (~0.98) for PE expected, given that the generated videos are nearly static (as indicated by the very low `dynamic_degree` of ~9.0)? Did you observe a similar "metric collapse" for PE during your evaluations? (The per-video check above is how I've been trying to verify this locally.)
- Aligning UltraViCo Consistency: My UltraViCo `subject_consistency` is around ~0.91, a bit lower than the ~0.94 reported in the paper. I noticed that Appendix C.2 mentions: "For UltraViCo, the first frame's decay factor is set negative to fix its blurring." Could you give some guidance on how to modify the generation script (or my evaluation code) to implement this correctly and align with the paper's results? My current guess is sketched right after this list.
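To make the second question concrete, here is my current (possibly wrong) reading of the Appendix C.2 sentence: wherever the per-frame decay factors are constructed in the generation script, the entry for frame 0 is replaced with a negative value before the factors are applied. Everything below is a hypothetical sketch with made-up names (`build_decay_factors`, `base_decay`, `first_frame_decay`), not identifiers from the released code.

```python
import torch

def build_decay_factors(num_frames: int, base_decay: float,
                        first_frame_decay: float = -0.5) -> torch.Tensor:
    # Hypothetical sketch of my reading of Appendix C.2: keep the usual decay
    # factor for frames 1..N-1, but set the first frame's factor to a negative
    # value to fix its blurring. The names and the -0.5 value are guesses.
    decay = torch.full((num_frames,), base_decay)
    decay[0] = first_frame_decay
    return decay
```

If this is off-base, a pointer to the relevant part of the generation script would already be very helpful.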
Looking forward to your insights. Thanks again for the great contribution!