This repository contains the official implementation and models for OVIE (One View Is Enough! Monocular Training for In-the-Wild Novel View Generation).
OVIE is a framework for monocular novel view synthesis that does not require multi-view image pairs for supervision. Instead, it is trained entirely on unpaired internet images.
- Installation
- Model Weights
- Inference
- Data Preprocessing
- Training
- Evaluation
- Contributing
- Acknowledgments
- Citation
## Installation

We use uv by Astral to manage the Python environment and dependencies. It is a drastically faster drop-in replacement for standard Python packaging tools.
1. Install uv:

   For macOS/Linux:

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

   Alternatively, you can install it via macOS Homebrew (`brew install uv`), or refer to the official documentation for Windows instructions.
2. Clone this repository and sync dependencies:

   Once uv is installed, clone the project and run `uv sync`. This will automatically resolve the required Python version (3.10.9) and install all dependencies from `uv.lock`.

   ```bash
   git clone https://github.com/AdrienRR/ovie.git
   cd ovie
   uv sync
   ```

Prefix all commands with `uv run` to ensure they run inside the managed environment.
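For example, any command can be routed through the managed environment (`python --version` here is just an illustration):

```bash
uv run python --version  # should print the Python version pinned by uv (3.10.9)
```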
## Model Weights

Pretrained weights are hosted on the Hugging Face Hub at `kyutai/ovie` and are downloaded automatically when using `from_pretrained` (see Inference below).
For evaluation and training, local checkpoint files are also required. Download them from the Releases page and place them inside the `assets/` folder:

- `ovie.pt`: main checkpoint used for evaluation (contains EMA weights).
- `dino_vit_small_patch8_224.pth`: used only for training; the same checkpoint as in RAE.
```
OVIE/
├── assets/
│   ├── ovie.pt                        # evaluation
│   ├── dino_vit_small_patch8_224.pth  # training only
│   └── sample_image.jpg
├── configs/
│   └── config_ovie.yaml
├── models/
└── ...
```
## Inference

We provide two Jupyter notebooks to get started quickly:
| Notebook | Weights source |
|---|---|
| `inference_huggingface.ipynb` | Downloaded automatically from `kyutai/ovie` |
| `inference_local.ipynb` | Loaded from a local `assets/ovie.pt` checkpoint |
Launch a notebook with:

```bash
uv run jupyter notebook inference_huggingface.ipynb
```

Loading from the Hugging Face Hub (recommended):
```python
import torch
from models.models import OVIEModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = OVIEModel.from_pretrained("kyutai/ovie", revision="v1.0").to(device)
model.eval()

image_size = model.image_size  # 256, read from the saved config
```

Loading from a local checkpoint:
```python
import yaml
import torch
from models.models import OVIE_models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Build the model from the training config, then load the EMA weights.
with open("./configs/config_ovie.yaml") as f:
    config = yaml.safe_load(f)
model_cfg = config["model"]
image_size = config["data"]["image_size"]

model = OVIE_models[model_cfg["model_type"]](
    image_size=image_size,
    vit_use_qknorm=model_cfg.get("use_qknorm", False),
    vit_use_swiglu=model_cfg.get("use_swiglu", True),
    vit_use_rope=model_cfg.get("use_rope", False),
    vit_use_rmsnorm=model_cfg.get("use_rmsnorm", True),
    vit_wo_shift=model_cfg.get("wo_shift", False),
    vit_use_checkpoint=model_cfg.get("use_checkpoint", False),
).to(device)

ckpt = torch.load("./assets/ovie.pt", map_location="cpu")
model.load_state_dict(ckpt["ema"])
model.eval()
```

Running inference:
```python
import torch
from torchvision.transforms import ToTensor
from PIL import Image
from utils.pose_enc import extri_intri_to_pose_encoding

# Load and resize the input view.
img_pil = Image.open("./assets/sample_image.jpg").convert("RGB").resize((image_size, image_size))
img_tensor = ToTensor()(img_pil).unsqueeze(0).to(device)

# Target camera: a 3x4 [R|t] extrinsic matrix for the novel view.
extrinsics = torch.tensor([[[1.0, 0.0, 0.0, -1.25],
                            [0.0, 1.0, 0.0, 0.5],
                            [0.0, 0.0, 1.0, -2.0]]], device=device)
dummy_intrinsics = torch.zeros(1, 1, 3, 3, device=device)

# Convert extrinsics/intrinsics into the model's camera pose encoding.
camera = extri_intri_to_pose_encoding(
    extrinsics=extrinsics.unsqueeze(0),
    intrinsics=dummy_intrinsics,
    image_size_hw=(image_size, image_size),
)
cam_token = camera[..., :7].squeeze(0)  # first 7 dims of the pose encoding

with torch.no_grad():
    pred_tensor = model(x=img_tensor, cam_params=cam_token)
```
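To inspect the result, you can write the prediction back to disk. A minimal sketch, assuming the model returns a batch of images with values in `[0, 1]` (clamped here just in case):

```python
from torchvision.utils import save_image

# Assumes pred_tensor has shape (1, 3, H, W) with values roughly in [0, 1].
save_image(pred_tensor.clamp(0, 1), "novel_view.png")
```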
## Data Preprocessing

Before training or evaluating on specific datasets, raw images must be preprocessed. We provide scripts for both in-the-wild training data and DL3DV evaluation data.

For in-the-wild training images:

```bash
uv run python data_preparation/preprocess_in_the_wild_images.py \
    --data_path /PATH/TO/RAW/DATASET \
    --output_path /PATH/TO/PREPROCESSED/DATASET
```

Point the resulting directories to the `data_path` lists in `configs/config_ovie.yaml`.
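For reference, a hypothetical excerpt of what that section of the config might look like (the exact structure of `configs/config_ovie.yaml` may differ; only the `data_path` list and `data.image_size` are referenced elsewhere in this README):

```yaml
data:
  image_size: 256
  data_path:
    - /PATH/TO/PREPROCESSED/DATASET
    - /PATH/TO/ANOTHER/PREPROCESSED/DATASET
```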
For DL3DV evaluation data:

```bash
uv run python data_preparation/format_dl3dv.py \
    --root_dir /PATH/TO/DL3DV \
    --output_dir /PATH/TO/PROCESSED/DL3DV
```

DL3DV can be downloaded from the official dataset repository.
## Training

Once data is preprocessed and paths are set in the config, launch distributed training with `torchrun`:

```bash
uv run torchrun --nproc_per_node <number_of_gpus> train.py --config configs/config_ovie.yaml
```

## Evaluation

Use `evaluate.py` to run evaluation on benchmark datasets. This requires a local `assets/ovie.pt` checkpoint (see Model Weights above).
Evaluating on RealEstate10K (RE10K): the pre-processed RE10K dataset is available on Hugging Face at `chenchenshi/re10k-sc`.
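One way to fetch it locally (a sketch assuming the `huggingface_hub` CLI is installed; adjust the target directory to taste):

```bash
huggingface-cli download chenchenshi/re10k-sc --repo-type dataset --local-dir /PATH/TO/EVAL/DATASET
```

Then run the evaluation script: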
```bash
uv run python evaluate.py \
    --dataset_path /PATH/TO/EVAL/DATASET \
    --config_path configs/config_ovie.yaml \
    --checkpoint_path assets/ovie.pt
```

## Contributing

This project uses pre-commit hooks to enforce code style (ruff format + lint) and keep the lockfile in sync. CI runs the same checks on every push and pull request.
Install the hooks:

```bash
uv run pre-commit install
```

After this, `ruff format`, `ruff check`, and `uv lock --check` run automatically on every `git commit`. You can also run them manually across all files:

```bash
uv run pre-commit run --all-files
```

The `uv.lock` file is committed to the repository; do not remove it from version control.
## Acknowledgments

This project relies on fantastic open-source tools and models, including:
- DINOv2
- DINOv3
- MoGe Depth Estimator
- Visual Geometry Grounded Transformer (VGGT)
- Representation Autoencoders (RAE)
## Citation

If you find our work useful in your research, please consider citing:
```bibtex
@misc{ovie2026,
  title={One View Is Enough! Monocular Training for In-the-Wild Novel View Generation},
  author={Adrien Ramanana Rahary and Nicolas Dufour and Patrick Perez and David Picard},
  year={2026},
  eprint={2603.23488},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.23488},
}
```