# WildRayZer
This repository hosts the checkpoint of WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments (CVPR 2026, Highlight).
Paper · Project page · Dataset · Code
## Model summary
WildRayZer is a self-supervised feed-forward framework for novel view synthesis (NVS) in dynamic in-the-wild videos where both the camera and scene objects move. It extends the static NVS model RayZer to dynamic environments by adding:
- a learned motion mask estimator that flags dynamic regions per input view, trained by distilling pseudo-masks from the residual between a static renderer and the observed frames (DINOv3 + SSIM + co-segmentation + GrabCut);
- a masked 3D scene encoder that replaces dynamic image tokens with a learnable noise embedding before scene aggregation (MAE-style token masking).
Training is fully self-supervised: no ground-truth depth, camera poses, or motion masks are used. Given a set of unposed, uncalibrated dynamic images, the model predicts camera parameters and motion masks and renders novel static views in a single feed-forward pass.
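The MAE-style token masking in the scene encoder can be illustrated with a minimal PyTorch sketch. The function name and tensor layout below are assumptions for illustration, not the repo's actual API: tokens flagged as dynamic are swapped for a single learnable noise embedding before scene aggregation.

```python
import torch

def mask_dynamic_tokens(tokens: torch.Tensor,
                        motion_mask: torch.Tensor,
                        noise_embed: torch.Tensor) -> torch.Tensor:
    """MAE-style token masking (illustrative sketch).

    tokens:      (B, N, D) per-view image tokens
    motion_mask: (B, N) bool, True where the motion estimator flags dynamics
    noise_embed: (D,) learnable embedding shared by all masked positions
    """
    out = tokens.clone()
    out[motion_mask] = noise_embed.to(tokens.dtype)  # broadcast over masked rows
    return out
```

The scene encoder then aggregates only the static evidence, with the shared learnable embedding marking regions to ignore.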
## This checkpoint
| Property | Value |
|---|---|
| File | wildrayzer_2view.pt (3.9 GB, fp32 state_dict) |
| Input resolution | 256 × 256 |
| Input / target views | 2 input → 6 target |
| Base dataset | Dynamic-RE10K (train split) + RealEstate10K (static mix-in) |
| Backbone | RayZer (28 transformer layers) + DINOv3 ViT-7B features |
| Framework | PyTorch ≥ 2.1, xFormers, transformers |
The K=2 configuration matches the sparse-view setting used in the paper's main D-RE10K and D-RE10K-iPhone benchmarks. 3- and 4-input-view variants can be reproduced by retraining with the same pipeline — see training details.
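For orientation, the tensor shapes implied by this configuration can be sketched as follows. The exact layout is an assumption, not the repo's documented interface:

```python
import torch

B, K_in, K_tgt, H, W = 1, 2, 6, 256, 256   # batch, input views, target views, resolution
inputs = torch.zeros(B, K_in, 3, H, W)     # unposed, uncalibrated RGB input frames
# the model predicts per-view cameras and motion masks,
# then renders K_tgt novel static views in one feed-forward pass
targets = torch.zeros(B, K_tgt, 3, H, W)
```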
## How to use
Download the checkpoint and run the reference demo:
```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="uva-cv-lab/wildrayzer-2view",
    filename="wildrayzer_2view.pt",
)
# Pass ckpt_path to the WildRayZerDemo class or to inference.py
# via --config configs/wildrayzer_inference.yaml.
```
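Since the file is a plain fp32 `state_dict`, it can also be loaded and inspected directly. A minimal sketch (the helper name is hypothetical):

```python
import torch

def load_state_dict_cpu(ckpt_path: str) -> dict:
    # Load on CPU first (~3.9 GB of fp32 weights) to avoid GPU memory
    # fragmentation; load into the model class from the companion repo
    # and move it to CUDA afterwards.
    return torch.load(ckpt_path, map_location="cpu")
```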
The full inference pipeline, Gradio demo, and training code live in the companion repo. A ready-to-deploy Space layout is provided under `demo/` in that repo.
Hardware requirements: a CUDA GPU with ≥ 40 GB of VRAM. The motion-mask predictor fuses DINOv3 ViT-7B patch features with the image/Plücker tokens at inference time, so the 7B backbone is a hard dependency rather than an optional component; the authors plan to provide a lighter alternative soon.
## Citation
```bibtex
@inproceedings{chen2026wildrayzer,
  title     = {WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments},
  author    = {Chen, Xuweiyi and Zhou, Wentao and Cheng, Zezhou},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  note      = {Highlight},
  year      = {2026},
}
```
## License
Released under CC BY-NC 4.0 — free for research and non-commercial use, attribution required. For commercial licensing, contact the authors.
## Acknowledgements
This work was supported by the MathWorks Research Gift, Adobe Research Gift, the University of Virginia Research Computing and Data Analytics Center, the AMD AI & HPC Cluster Program, the ACCESS program, and the NAIRR Pilot. Computation was run on the Anvil supercomputer (NSF OAC-2005632) at Purdue and on Delta / DeltaAI (NSF OAC-2005572).