CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

Evaluation code for CRONOS-Benchmark. Given a generated video and a reference sequence from the dataset, this repo runs all predictions and computes the six benchmark metrics.

Repository layout

CRONOS-Benchmark/
├── config.py                   # model paths and hyperparameters
├── requirements-cronos.txt     # Python deps for the cronos environment
├── constraints-cronos.txt      # torch version pins (prevents pip from upgrading)
├── setup_cronos.sh             # sets up the 'cronos' conda environment
├── setup_sam3d.sh              # sets up the 'cronos-sam3d' conda environment
├── scripts/
│   ├── run_predictions.sh  # orchestrates all 7 prediction steps
│   ├── run_sam.py          # SAM3 video segmentation
│   ├── run_cotracker.py    # CoTracker point tracking
│   ├── run_sam3d.py        # SAM3D per-object 3D reconstruction
│   ├── run_dino.py         # DINOv2 temporal embeddings
│   ├── run_dismo.py        # DisMo motion embeddings
│   └── run_qwen.py         # Qwen VLM task-performance evaluation
└── metrics/
    ├── metrics.py          # compute and aggregate all metrics
    └── utils.py            # shared loading and scaling utilities

Models

| Model | Role | Source |
| --- | --- | --- |
| SAM3 | Video object segmentation (prompt → per-frame masks) | Clone locally; set `SAM3_MODEL_PATH` in `config.py` |
| SAM3D-Objects | Per-object 3D mesh reconstruction from video | Clone locally for source code (`SAM3D_OBJECTS_DIR`); checkpoints downloaded automatically from HF Hub (`facebook/sam-3d-objects`) |
| CoTracker3 | Dense point tracking across frames | Loaded via PyTorch Hub (`facebookresearch/co-tracker`) |
| DisMo | Motion representation embeddings | Loaded via PyTorch Hub (`CompVis/DisMo`) |
| DINOv2 | Visual feature embeddings for object consistency | Loaded via PyTorch Hub (`facebookresearch/dinov2`) |
| Qwen3-VL | VLM task-performance evaluation | HuggingFace model ID or local path; set `QWEN_MODEL_PATH` in `config.py` |
| CLIP | Visual features used by SAM3D pipeline | Download locally; set `CLIP_MODEL_PATH` in `config.py` |

Installation

Prerequisites

Linux x86_64 with an NVIDIA GPU (≥24 GB VRAM for Qwen3-VL-32B; ≥16 GB for all other models)
CUDA 12.4 driver (nvidia-smi must work)
Miniconda or Mamba
nvcc (CUDA 12.4 toolkit) in PATH — required only for the SAM3D environment
~80 GB of free disk space (model weights + environments)
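
The prerequisites above can be checked before running any setup script. The following is a minimal sketch, not part of the repository: `preflight` is an illustrative helper, the 80 GB default mirrors the estimate above, and it only checks that `nvidia-smi` and `nvcc` are on PATH and that the target disk has enough free space. It does not check VRAM or the conda installation.

```python
import shutil


def preflight(path=".", min_free_gb=80):
    """Report which prerequisites appear to be satisfied (hypothetical helper)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
        "nvcc": shutil.which("nvcc") is not None,  # only needed for cronos-sam3d
        "disk": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Pass the directory that will hold the models and environments as `path` so the free-space check looks at the right disk.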

Step 1 — Clone model repositories

Clone the following repos to a common directory (e.g. models/) and update the paths in config.py:

MODELS_DIR=/path/to/models

# SAM3 (video segmentation)
git clone https://github.com/facebookresearch/sam3 "$MODELS_DIR/sam3"

# SAM3D-Objects (3D reconstruction — source code only; checkpoints are fetched from HF Hub automatically)
git clone https://github.com/facebookresearch/sam-3d-objects "$MODELS_DIR/sam-3d-objects"

# Qwen3-VL (VLM evaluation — includes qwen-vl-utils)
git clone https://github.com/QwenLM/Qwen3-VL "$MODELS_DIR/Qwen3-VL"

Edit config.py so the path variables point to your clones:

| Variable | Value |
| --- | --- |
| `SAM3_MODEL_PATH` | `$MODELS_DIR/sam3` |
| `SAM3D_OBJECTS_DIR` | `$MODELS_DIR/sam-3d-objects` |
| `SAM3D_HF_CACHE` | Local directory where SAM3D-Objects checkpoints are cached |
| `CLIP_MODEL_PATH` | Path to a local CLIP checkpoint (used by SAM3D) |
| `QWEN_MODEL_PATH` | `"Qwen/Qwen3-VL-32B-Instruct"` (HuggingFace) or a local path |

CoTracker, DisMo, and DINOv2 are downloaded automatically via PyTorch Hub at first run.
SAM3D-Objects checkpoints are downloaded automatically from facebook/sam-3d-objects on HuggingFace Hub at first run and cached at SAM3D_HF_CACHE.
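
As an illustration, the path section of config.py might look like the sketch below. Only the variable names come from the table above. Every value is a placeholder, the locations chosen for `SAM3D_HF_CACHE` and `CLIP_MODEL_PATH` are assumptions, and `missing_paths` is a hypothetical helper that is not part of the repository.

```python
import os

# Placeholder values; replace with your own locations.
MODELS_DIR = "/path/to/models"
SAM3_MODEL_PATH = os.path.join(MODELS_DIR, "sam3")
SAM3D_OBJECTS_DIR = os.path.join(MODELS_DIR, "sam-3d-objects")
SAM3D_HF_CACHE = os.path.join(MODELS_DIR, "sam3d-hf-cache")  # assumed location
CLIP_MODEL_PATH = os.path.join(MODELS_DIR, "clip")           # assumed location
QWEN_MODEL_PATH = "Qwen/Qwen3-VL-32B-Instruct"               # HF model ID or local path


def missing_paths(paths):
    """Return the names whose path does not exist on disk (hypothetical helper)."""
    return [name for name, p in paths.items() if not os.path.exists(p)]


if __name__ == "__main__":
    local = {
        "SAM3_MODEL_PATH": SAM3_MODEL_PATH,
        "SAM3D_OBJECTS_DIR": SAM3D_OBJECTS_DIR,
        "CLIP_MODEL_PATH": CLIP_MODEL_PATH,
    }
    print("missing:", missing_paths(local))
```

`QWEN_MODEL_PATH` is left out of the existence check because it may be a HuggingFace model ID rather than a directory.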

Step 2 — `cronos` environment (all steps except SAM3D)

This environment covers run_sam.py, run_cotracker.py, run_dino.py, run_dismo.py, run_qwen.py, and metrics/.

cd CRONOS-Benchmark

bash setup_cronos.sh \
    --sam3-path       /path/to/models/sam3 \
    --qwen-utils-path /path/to/models/Qwen3-VL/qwen-vl-utils

conda activate cronos

setup_cronos.sh installs PyTorch 2.5.1+cu124, all requirements from requirements-cronos.txt, and editable installs of SAM3 and qwen-vl-utils.
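
A quick sanity check after activating the environment is to confirm that the interpreter sees the pinned build. This is a minimal sketch, assuming the installed wheel reports its version as `2.5.1+cu124`; `is_expected_torch` and `report_installed_torch` are illustrative helpers, not part of the repository.

```python
def is_expected_torch(version, expected="2.5.1", cuda_tag="cu124"):
    """Check a torch version string such as '2.5.1+cu124' (hypothetical helper)."""
    base, _, local = version.partition("+")
    return base == expected and local == cuda_tag


def report_installed_torch():
    """Print the installed torch build; call this inside the 'cronos' environment."""
    import torch  # imported lazily so the check above works without torch

    print(torch.__version__, is_expected_torch(torch.__version__))
    print("CUDA available:", torch.cuda.is_available())
```

Calling `report_installed_torch()` from a Python session in the activated environment prints the version, whether it matches the pin, and whether a GPU is visible.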

Step 3 — `cronos-sam3d` environment (`run_sam3d.py` only)

SAM3D-Objects requires older versions of timm, transformers, and bitsandbytes that conflict with the cronos environment. It also needs several packages that must be compiled from source (pytorch3d, gsplat).

Note: nvcc (CUDA 12.4 toolkit) must be available before running setup_sam3d.sh.

cd CRONOS-Benchmark

bash setup_sam3d.sh --sam3d-path /path/to/models/sam-3d-objects

# Record the Python binary path printed at the end:
#   Python binary path:  /path/to/envs/cronos-sam3d/bin/python

pytorch3d and gsplat are compiled from source and may take 10–20 minutes.

setup_sam3d.sh automatically removes the dataclasses backport package from site-packages after installation. This backport is pulled in as a transitive dependency and shadows the Python 3.11 stdlib version, breaking imageio at import time.
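
If imageio still fails at import time, one way to check whether a backport is shadowing the stdlib module is to look at where `dataclasses` is imported from. The sketch below is an assumption about how to diagnose this, not something the setup script does; `is_shadowed_by_site_packages` is a hypothetical helper.

```python
import dataclasses


def is_shadowed_by_site_packages(module):
    """True if the module was imported from site-packages or dist-packages."""
    path = getattr(module, "__file__", "") or ""
    parts = path.replace("\\", "/").split("/")
    return "site-packages" in parts or "dist-packages" in parts


if __name__ == "__main__":
    print(dataclasses.__file__)
    print("shadowed:", is_shadowed_by_site_packages(dataclasses))
```

Run it with the Python binary of the cronos-sam3d environment; a result of `shadowed: True` would suggest the backport is still installed.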

Step 4 — Configure `config.py`

Edit config.py to point to your local model paths before running anything.

The CoTracker, DisMo, and DINOv2 models are loaded from PyTorch Hub at runtime using the identifiers in config.py. SAM3D-Objects checkpoints are fetched from HuggingFace Hub automatically by run_sam3d.py at first run.
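
For orientation, loading through PyTorch Hub generally follows the pattern below. The repository identifiers are the ones listed in the Models table. The entry-point names (`cotracker3_offline`, `dinov2_vitb14`) are assumptions taken from those projects' public hub configurations and may differ from what config.py actually uses. DisMo is omitted because its entry-point name is not stated in this README.

```python
# Hub repo IDs come from the Models table; entry points are assumptions.
HUB_MODELS = {
    "cotracker": ("facebookresearch/co-tracker", "cotracker3_offline"),
    "dinov2": ("facebookresearch/dinov2", "dinov2_vitb14"),
}


def load_hub_model(name):
    """Download (on first call) and return a model from PyTorch Hub."""
    import torch  # imported lazily so HUB_MODELS can be inspected without torch

    repo, entry_point = HUB_MODELS[name]
    model = torch.hub.load(repo, entry_point)
    return model.eval()
```

The first call to `load_hub_model` needs network access; later calls reuse the local PyTorch Hub cache.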

Running predictions

run_predictions.sh runs all seven steps in order for a single sequence.
Because the SAM3D step runs in a separate environment, set PYTHON_SAM3D to the Python binary of the cronos-sam3d conda environment before calling the script:

# Set PYTHON_SAM3D to the path printed at the end of setup_sam3d.sh, then:
conda activate cronos
PYTHON_SAM3D=/path/to/envs/cronos-sam3d/bin/python bash scripts/run_predictions.sh \
    --video  <generated>.mp4 \
    --ref    <ref_sequence_dir> \
    --output <output_dir>

--video: the generated .mp4 file to evaluate
--ref: a sequence directory from the CRONOS-Benchmark dataset; must contain metadata.json, mask/frame_0000.jpg, and movies/complete.mp4
--output: directory where all prediction artefacts are written
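
Before starting a long run it can help to confirm that the reference directory has the files listed for `--ref`. A minimal sketch, assuming nothing beyond the three required paths named above; `missing_ref_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Required files, as listed for --ref above.
REQUIRED_REF_FILES = ("metadata.json", "mask/frame_0000.jpg", "movies/complete.mp4")


def missing_ref_files(ref_dir):
    """Return the required files that are absent from a reference sequence directory."""
    root = Path(ref_dir)
    return [rel for rel in REQUIRED_REF_FILES if not (root / rel).is_file()]
```

An empty list means the directory has everything the option description asks for.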

If PYTHON_SAM3D is not set, the script falls back to python and assumes a single environment that satisfies all dependencies.
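
When evaluating many sequences, the same call can be assembled from Python. The sketch below only builds the command and environment using the flags and the `PYTHON_SAM3D` variable described above; `build_predictions_call` is a hypothetical helper, and running it from the repository root with the cronos environment active is an assumption.

```python
import os


def build_predictions_call(video, ref, output, sam3d_python=None, base_env=None):
    """Return (command, environment) for scripts/run_predictions.sh."""
    env = dict(os.environ if base_env is None else base_env)
    if sam3d_python:
        # Without this variable the script falls back to `python`.
        env["PYTHON_SAM3D"] = sam3d_python
    cmd = [
        "bash", "scripts/run_predictions.sh",
        "--video", video,
        "--ref", ref,
        "--output", output,
    ]
    return cmd, env
```

The returned pair can be passed to `subprocess.run(cmd, env=env, check=True)`.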

Output files

| File | Produced by |
| --- | --- |
| `segmentation_masks.npz` | `run_sam.py` |
| `tracks.npz` | `run_cotracker.py` |
| `sam3D/obj{i}/sam3D/mesh/*.ply` | `run_sam3d.py` |
| `dino_embeddings.npz` | `run_dino.py` |
| `dismo_embeddings.npz` | `run_dismo.py` (generated video) |
| `gt_dismo_embeddings.npz` | `run_dismo.py` (reference video) |
| `vlm_predictions.json` | `run_qwen.py` |

Individual scripts can also be run standalone; each accepts --help.
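
To see which prediction step still needs to run for a sequence, the output directory can be compared against the files listed above. A minimal sketch: `missing_outputs` is a hypothetical helper, and treating "at least one `.ply` mesh" as the success condition for the SAM3D step is an assumption.

```python
from pathlib import Path

EXPECTED_FILES = (
    "segmentation_masks.npz",
    "tracks.npz",
    "dino_embeddings.npz",
    "dismo_embeddings.npz",
    "gt_dismo_embeddings.npz",
    "vlm_predictions.json",
)
MESH_PATTERN = "sam3D/obj*/sam3D/mesh/*.ply"


def missing_outputs(output_dir):
    """Return the expected prediction artefacts that are not present yet."""
    root = Path(output_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Assumption: a finished SAM3D step leaves at least one mesh file.
    if not any(root.glob(MESH_PATTERN)):
        missing.append("sam3D/obj{i}/sam3D/mesh/*.ply")
    return missing
```

Each missing entry maps back to the script named in the "Produced by" column, which can then be rerun standalone.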

Computing metrics

python metrics/metrics.py \
    --video   <generated>.mp4 \
    --ref     <dataset>/{event}/{scene}/{object}/{appearance}/{view} \
    --output  <output_dir> \
    --metrics <output_dir>/metrics.json   # optional, defaults to output/metrics.json

--output must point to the same directory used with run_predictions.sh.
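
The `--ref` path encodes five fields in its last five components. When aggregating results over many sequences, they can be recovered as in the sketch below; `parse_ref_path` is a hypothetical helper, and the example values in the usage note are made up for illustration.

```python
from pathlib import PurePosixPath

# Field order follows <dataset>/{event}/{scene}/{object}/{appearance}/{view}.
FIELDS = ("event", "scene", "object", "appearance", "view")


def parse_ref_path(ref):
    """Map the last five path components of a reference directory to field names."""
    parts = PurePosixPath(ref).parts
    if len(parts) < len(FIELDS):
        raise ValueError(f"expected at least {len(FIELDS)} path components: {ref!r}")
    return dict(zip(FIELDS, parts[-len(FIELDS):]))
```

For example, `parse_ref_path("/data/cronos/collision/kitchen/ball/red/front")` yields `event="collision"`, `scene="kitchen"`, `object="ball"`, `appearance="red"`, and `view="front"`.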

Metrics

| Metric | Description | Pass threshold |
| --- | --- | --- |
| `bg_mse` | Background MSE vs. frame 0; detects unwanted scene changes | ≤ 0.25 |
| `motion_similarity` | DisMo cosine similarity between generated and reference motion | ≥ 0.70 |
| `object_consistency` | DINOv2 temporal consistency of tracked objects | ≥ 0.50 |
| `mean_chamfer_distance` | SAM3D mesh Chamfer distance across frames; measures shape stability | ≤ 0.80 |
| `vlm_positive_fraction` | Fraction of Qwen VLM task-specific questions answered positively | ≥ 0.35 |
| `success` | `True` only when every metric above meets its threshold simultaneously and no disappearance is detected | n/a |

Raw values, linearly or exponentially scaled values, and the success flag are all written to the output JSON.
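
The pass/fail logic in the table can be sketched as follows. The thresholds and comparison directions are copied from the table; whether metrics.py applies them to raw or scaled values is not stated here, so the sketch simply compares whatever values it is given. `is_success` and its `disappearance_detected` argument are illustrative names, not the repository's API.

```python
import operator

# (comparison, threshold) per metric, copied from the table above.
THRESHOLDS = {
    "bg_mse": (operator.le, 0.25),
    "motion_similarity": (operator.ge, 0.70),
    "object_consistency": (operator.ge, 0.50),
    "mean_chamfer_distance": (operator.le, 0.80),
    "vlm_positive_fraction": (operator.ge, 0.35),
}


def is_success(values, disappearance_detected=False):
    """True only if every metric meets its threshold and nothing disappeared."""
    if disappearance_detected:
        return False
    return all(
        compare(values[name], limit)
        for name, (compare, limit) in THRESHOLDS.items()
    )
```

Because the checks are combined with `all`, a single metric outside its threshold is enough to mark the sequence as failed.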

Acknowledgments

CRONOS builds on the following models and codebases:

SAM3 — Carion et al., SAM 3: Segment Anything with Concepts, Meta Superintelligence Labs.
SAM3D-Objects — Chen et al., SAM 3D: 3Dfy Anything in Images, Meta AI Research.
CoTracker3 — Karaev et al., CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos, Meta AI Research.
DisMo — Ressler-Antal et al., DisMo: Disentangled Motion Representations for Open-World Motion Transfer, CompVis.
DINOv2 — Oquab et al., DINOv2: Learning Robust Visual Features without Supervision, Meta AI Research.
Qwen3-VL — Qwen Team, Qwen3-VL, Alibaba Cloud.
CLIP — Radford et al., Learning Transferable Visual Models From Natural Language Supervision, OpenAI.

License

The benchmark code in this repository is released under the MIT License. Third-party models used by CRONOS (SAM3, SAM3D-Objects, CoTracker3, DisMo, DINOv2, Qwen3-VL, CLIP) are subject to their own licenses; please review them before use.

Citation

If you use CRONOS-Benchmark in your research, please cite:

@misc{begiristain2026cronos,
      title={CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models}, 
      author={Le{\'o}n Begiristain and Olaf D{\"u}nkel and Adam Kortylewski},
      year={2026},
      eprint={2605.23699},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.23699}, 
}
