Evaluation code for CRONOS-Benchmark. Given a generated video and a reference sequence from the dataset, this repo runs all predictions and computes the six benchmark metrics.
CRONOS-Benchmark/
├── config.py # model paths and hyperparameters
├── requirements-cronos.txt # Python deps for the cronos environment
├── constraints-cronos.txt # torch version pins (prevents pip from upgrading)
├── setup_cronos.sh # sets up the 'cronos' conda environment
├── setup_sam3d.sh # sets up the 'cronos-sam3d' conda environment
├── scripts/
│   ├── run_predictions.sh # orchestrates all 7 prediction steps
│   ├── run_sam.py # SAM3 video segmentation
│   ├── run_cotracker.py # CoTracker point tracking
│   ├── run_sam3d.py # SAM3D per-object 3D reconstruction
│   ├── run_dino.py # DINOv2 temporal embeddings
│   ├── run_dismo.py # DisMo motion embeddings
│   └── run_qwen.py # Qwen VLM task-performance evaluation
└── metrics/
    ├── metrics.py # compute and aggregate all metrics
    └── utils.py # shared loading and scaling utilities
| Model | Role | Source |
|---|---|---|
| SAM3 | Video object segmentation (prompt → per-frame masks) | Clone locally; set SAM3_MODEL_PATH in config.py |
| SAM3D-Objects | Per-object 3D mesh reconstruction from video | Clone locally for source code (SAM3D_OBJECTS_DIR); checkpoints downloaded automatically from HF Hub (facebook/sam-3d-objects) |
| CoTracker3 | Dense point tracking across frames | Loaded via PyTorch Hub (facebookresearch/co-tracker) |
| DisMo | Motion representation embeddings | Loaded via PyTorch Hub (CompVis/DisMo) |
| DINOv2 | Visual feature embeddings for object consistency | Loaded via PyTorch Hub (facebookresearch/dinov2) |
| Qwen3-VL | VLM task-performance evaluation | HuggingFace model ID or local path; set QWEN_MODEL_PATH in config.py |
| CLIP | Visual features used by SAM3D pipeline | Download locally; set CLIP_MODEL_PATH in config.py |
- Linux x86_64 with an NVIDIA GPU (≥24 GB VRAM for Qwen3-VL-32B; ≥16 GB for all other models)
- CUDA 12.4 driver (nvidia-smi must work)
- Miniconda or Mamba
- nvcc (CUDA 12.4 toolkit) in PATH; required only for the SAM3D environment
- ~80 GB of free disk space (model weights + environments)
Clone the following repos to a common directory (e.g. models/) and update the paths in config.py:
MODELS_DIR=/path/to/models
# SAM3 (video segmentation)
git clone https://github.com/facebookresearch/sam3 "$MODELS_DIR/sam3"
# SAM3D-Objects (3D reconstruction — source code only; checkpoints are fetched from HF Hub automatically)
git clone https://github.com/facebookresearch/sam-3d-objects "$MODELS_DIR/sam-3d-objects"
# Qwen3-VL (VLM evaluation — includes qwen-vl-utils)
git clone https://github.com/QwenLM/Qwen3-VL "$MODELS_DIR/Qwen3-VL"

Edit config.py so the path variables point to your clones:
| Variable | Value |
|---|---|
| SAM3_MODEL_PATH | $MODELS_DIR/sam3 |
| SAM3D_OBJECTS_DIR | $MODELS_DIR/sam-3d-objects |
| SAM3D_HF_CACHE | Local directory where SAM3D-Objects checkpoints are cached |
| CLIP_MODEL_PATH | Path to a local CLIP checkpoint (used by SAM3D) |
| QWEN_MODEL_PATH | "Qwen/Qwen3-VL-32B-Instruct" (HuggingFace) or a local path |
CoTracker, DisMo, and DINOv2 are downloaded automatically via PyTorch Hub at first run.
SAM3D-Objects checkpoints are downloaded automatically from facebook/sam-3d-objects on HuggingFace Hub at first run and cached at SAM3D_HF_CACHE.
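As a rough illustration, the path section of config.py might look like the sketch below. The five variable names come from the table above. Everything else is an assumption made for this example and is not taken from the repository: the MODELS_DIR environment lookup, the hf-cache and clip directory names, and the missing_paths helper.

```python
import os
from pathlib import Path

# Assumption: all clones live under one directory, as in the git clone commands above.
MODELS_DIR = Path(os.environ.get("MODELS_DIR", "/path/to/models"))

SAM3_MODEL_PATH = str(MODELS_DIR / "sam3")
SAM3D_OBJECTS_DIR = str(MODELS_DIR / "sam-3d-objects")
SAM3D_HF_CACHE = str(MODELS_DIR / "hf-cache")    # hypothetical cache location
CLIP_MODEL_PATH = str(MODELS_DIR / "clip")       # hypothetical checkpoint location
QWEN_MODEL_PATH = "Qwen/Qwen3-VL-32B-Instruct"   # HuggingFace ID or a local path


def missing_paths(paths):
    """Return the names whose configured local path does not exist (hypothetical helper)."""
    return [name for name, p in paths.items() if not Path(p).exists()]


if __name__ == "__main__":
    local = {
        "SAM3_MODEL_PATH": SAM3_MODEL_PATH,
        "SAM3D_OBJECTS_DIR": SAM3D_OBJECTS_DIR,
        "CLIP_MODEL_PATH": CLIP_MODEL_PATH,
    }
    print("Missing local paths:", missing_paths(local))
```

Running a check like this before the first prediction run catches a mistyped clone path early, rather than partway through the pipeline.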
This environment covers run_sam.py, run_cotracker.py, run_dino.py, run_dismo.py, run_qwen.py, and metrics/.
cd CRONOS-Benchmark
bash setup_cronos.sh \
--sam3-path /path/to/models/sam3 \
--qwen-utils-path /path/to/models/Qwen3-VL/qwen-vl-utils
conda activate cronos

setup_cronos.sh installs PyTorch 2.5.1+cu124, all requirements from requirements-cronos.txt, and editable installs of SAM3 and qwen-vl-utils.
SAM3D-Objects requires older versions of timm, transformers, and bitsandbytes that conflict with the cronos environment. It also needs several packages that must be compiled from source (pytorch3d, gsplat).
Note: nvcc (CUDA 12.4 toolkit) must be available before running setup_sam3d.sh.
cd CRONOS-Benchmark
bash setup_sam3d.sh --sam3d-path /path/to/models/sam-3d-objects
# Record the Python binary path printed at the end:
# Python binary path: /path/to/envs/cronos-sam3d/bin/python
pytorch3d and gsplat are compiled from source and may take 10–20 minutes.
setup_sam3d.sh automatically removes the dataclasses backport package from site-packages after installation. This backport is pulled in as a transitive dependency and shadows the Python 3.11 stdlib version, breaking imageio at import time.
Edit config.py to point to your local model paths before running anything.
The CoTracker, DisMo, and DINOv2 models are loaded from PyTorch Hub at runtime using the identifiers in config.py. SAM3D-Objects checkpoints are fetched from HuggingFace Hub automatically by run_sam3d.py at first run.
run_predictions.sh runs all seven steps in order for a single sequence.
Because the SAM3D step runs in a separate environment, set PYTHON_SAM3D to the Python binary of the cronos-sam3d conda environment before calling the script:
# Set PYTHON_SAM3D to the path printed at the end of setup_sam3d.sh, then:
conda activate cronos
PYTHON_SAM3D=/path/to/envs/cronos-sam3d/bin/python bash scripts/run_predictions.sh \
--video <generated>.mp4 \
--ref <ref_sequence_dir> \
--output <output_dir>

- --video: the generated .mp4 file to evaluate
- --ref: a sequence directory from the CRONOS-Benchmark dataset; must contain metadata.json, mask/frame_0000.jpg, and movies/complete.mp4
- --output: directory where all prediction artefacts are written
If PYTHON_SAM3D is not set, the script falls back to python and assumes a single environment that satisfies all dependencies.
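When evaluating many videos it can be convenient to launch run_predictions.sh from Python. The sketch below only assembles the command line and environment described above. Both helper names are hypothetical, and sam3d_interpreter merely mirrors the documented fallback; it is not the script's actual logic.

```python
import os


def build_prediction_command(video, ref, output, python_sam3d=None, base_env=None):
    """Return (argv, env) for one run_predictions.sh call (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    if python_sam3d is not None:
        # Point the SAM3D step at the cronos-sam3d interpreter.
        env["PYTHON_SAM3D"] = python_sam3d
    argv = ["bash", "scripts/run_predictions.sh",
            "--video", video, "--ref", ref, "--output", output]
    return argv, env


def sam3d_interpreter(env):
    """Mirror the documented fallback: plain 'python' when PYTHON_SAM3D is unset."""
    return env.get("PYTHON_SAM3D", "python")
```

The returned pair can be passed to subprocess.run(argv, env=env, check=True) from the repository root, with the cronos environment active.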
| File | Produced by |
|---|---|
| segmentation_masks.npz | run_sam.py |
| tracks.npz | run_cotracker.py |
| sam3D/obj{i}/sam3D/mesh/*.ply | run_sam3d.py |
| dino_embeddings.npz | run_dino.py |
| dismo_embeddings.npz | run_dismo.py (generated video) |
| gt_dismo_embeddings.npz | run_dismo.py (reference video) |
| vlm_predictions.json | run_qwen.py |
Individual scripts can also be run standalone; each accepts --help.
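Before computing metrics, it may help to confirm that every prediction step left its artefact behind. The following is a minimal sketch that uses only the file names from the table above. The missing_artefacts helper is hypothetical, and the mesh check assumes that at least one .ply file under any obj{i} directory is enough.

```python
from pathlib import Path

# Fixed-name artefacts listed in the table above.
EXPECTED_FILES = (
    "segmentation_masks.npz",
    "tracks.npz",
    "dino_embeddings.npz",
    "dismo_embeddings.npz",
    "gt_dismo_embeddings.npz",
    "vlm_predictions.json",
)
# Glob form of sam3D/obj{i}/sam3D/mesh/*.ply, with obj{i} widened to obj*.
MESH_PATTERN = "sam3D/obj*/sam3D/mesh/*.ply"


def missing_artefacts(output_dir):
    """Return the expected artefacts that are absent from output_dir."""
    root = Path(output_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Assumption: one mesh file anywhere under the pattern counts as present.
    if not any(root.glob(MESH_PATTERN)):
        missing.append(MESH_PATTERN)
    return missing
```

An empty return value means all seven artefact kinds are present. Any listed name points to the step that should be rerun standalone.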
python metrics/metrics.py \
--video <generated>.mp4 \
--ref <dataset>/{event}/{scene}/{object}/{appearance}/{view} \
--output <output_dir> \
--metrics <output_dir>/metrics.json # optional, defaults to output/metrics.json

--output must point to the same directory used with run_predictions.sh.
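The --ref template above ends in five named components: event, scene, object, appearance, and view. When aggregating results over many sequences, those components can be recovered from the path. The sketch below assumes that layout holds for every sequence. The parse_ref_path helper is hypothetical and is not part of the repository.

```python
from pathlib import PurePosixPath

# Trailing components of <dataset>/{event}/{scene}/{object}/{appearance}/{view}.
REF_FIELDS = ("event", "scene", "object", "appearance", "view")


def parse_ref_path(ref):
    """Split a reference path into its five trailing components (hypothetical helper)."""
    parts = PurePosixPath(ref).parts
    if len(parts) < len(REF_FIELDS):
        raise ValueError(f"expected at least {len(REF_FIELDS)} path components: {ref}")
    # Take the last five components so the dataset root can be any depth.
    return dict(zip(REF_FIELDS, parts[-len(REF_FIELDS):]))
```

The resulting dictionary can be stored next to each metrics.json entry, so that results can be grouped by event or view later.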
| Metric | Description | Pass threshold |
|---|---|---|
| bg_mse | Background MSE vs frame 0; detects unwanted scene changes | ≤ 0.25 |
| motion_similarity | DisMo cosine similarity between generated and reference motion | ≥ 0.70 |
| object_consistency | DINOv2 temporal consistency of tracked objects | ≥ 0.50 |
| mean_chamfer_distance | SAM3D mesh Chamfer distance across frames; measures shape stability | ≤ 0.80 |
| vlm_positive_fraction | Fraction of Qwen VLM task-specific questions answered positively | ≥ 0.35 |
| success | True only when every metric above meets its threshold simultaneously and no disappearance is detected | n/a |
Raw values, linearly- or exponentially-scaled values, and the success flag are all written to the output JSON.
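The pass/fail rule from the table can be sketched as follows. The thresholds and comparison directions are copied from the table. The function name, the disappearance_detected argument, and the shape of the input dictionary are assumptions. The text does not say whether thresholds apply to raw or scaled values, so the sketch simply compares whatever values it is given.

```python
import operator

# (comparison, threshold) per metric, copied from the table above.
THRESHOLDS = {
    "bg_mse": (operator.le, 0.25),
    "motion_similarity": (operator.ge, 0.70),
    "object_consistency": (operator.ge, 0.50),
    "mean_chamfer_distance": (operator.le, 0.80),
    "vlm_positive_fraction": (operator.ge, 0.35),
}


def is_success(values, disappearance_detected=False):
    """True only if every metric passes and no disappearance was detected."""
    if disappearance_detected:
        return False
    return all(compare(values[name], limit)
               for name, (compare, limit) in THRESHOLDS.items())
```

Because the rule is a conjunction, a single failing metric is enough to set success to false, however well the others score.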
CRONOS builds on the following models and codebases:
- SAM3 — Carion et al., SAM 3: Segment Anything with Concepts, Meta Superintelligence Labs.
- SAM3D-Objects — Chen et al., SAM 3D: 3Dfy Anything in Images, Meta AI Research.
- CoTracker3 — Karaev et al., CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos, Meta AI Research.
- DisMo — Ressler-Antal et al., DisMo: Disentangled Motion Representations for Open-World Motion Transfer, CompVis.
- DINOv2 — Oquab et al., DINOv2: Learning Robust Visual Features without Supervision, Meta AI Research.
- Qwen3-VL — Qwen Team, Qwen3-VL, Alibaba Cloud.
- CLIP — Radford et al., Learning Transferable Visual Models From Natural Language Supervision, OpenAI.
The benchmark code in this repository is released under the MIT License. Third-party models used by CRONOS (SAM3, SAM3D-Objects, CoTracker3, DisMo, DINOv2, Qwen3-VL, CLIP) are subject to their own licenses; please review them before use.
If you use CRONOS-Benchmark in your research, please cite:
@misc{begiristain2026cronos,
title={CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models},
author={Le{\'o}n Begiristain and Olaf D{\"u}nkel and Adam Kortylewski},
year={2026},
eprint={2605.23699},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.23699},
}