WildRayZer

This repository hosts the checkpoint of WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments (CVPR 2026, Highlight).

Paper · Project page · Dataset · Code

Model summary

WildRayZer is a self-supervised feed-forward framework for novel view synthesis (NVS) in dynamic in-the-wild videos where both the camera and scene objects move. It extends the static NVS model RayZer to dynamic environments by adding:

  1. a learned motion-mask estimator that flags dynamic regions per input view, trained by distilling pseudo-masks from the residual between a static renderer's output and the observed frames (DINOv3 + SSIM + co-segmentation + GrabCut);
  2. a masked 3D scene encoder that replaces dynamic image tokens with a learnable noise embedding before scene aggregation (MAE-style token masking); both steps are sketched below.
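
A minimal sketch of both mechanisms, with hypothetical shapes and module names; the pseudo-mask here is reduced to its photometric core, whereas the full pipeline also refines it with DINOv3 features, SSIM, co-segmentation, and GrabCut:

```python
import torch
import torch.nn as nn

def pseudo_motion_mask(static_render, observed, thresh=0.1):
    """Flag pixels where a static render disagrees with the observation."""
    residual = (static_render - observed).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    return (residual > thresh).float()

class MaskedTokenReplacer(nn.Module):
    """Swap tokens in dynamic regions for one learnable noise embedding."""
    def __init__(self, dim):
        super().__init__()
        self.noise_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)

    def forward(self, tokens, token_mask):
        # tokens: (B, N, D); token_mask: (B, N), 1 = dynamic patch.
        m = token_mask.unsqueeze(-1)
        return tokens * (1 - m) + self.noise_token * m

# Toy usage: 16 x 16 patches over a 256 x 256 image -> 256 tokens per view.
tokens = torch.randn(2, 256, 768)
token_mask = (torch.rand(2, 256) > 0.9).float()
clean_tokens = MaskedTokenReplacer(768)(tokens, token_mask)
```

Replacing tokens rather than dropping them keeps the sequence length fixed, so the downstream scene aggregator always sees a full token grid.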

Training is fully self-supervised: no ground-truth depth, camera poses, or motion masks are used. Given a set of unposed, uncalibrated images of a dynamic scene, the model predicts camera parameters and motion masks and renders novel static views in a single feed-forward pass.
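
To make the input/output contract concrete, here is a hypothetical call shape; the names and tensor layouts below are illustrative assumptions, not the repo's actual API:

```python
import torch

images = torch.rand(1, 2, 3, 256, 256)  # (batch, K=2 unposed input views, C, H, W)
# cameras, motion_masks, novel_views = model(images)
# cameras:      predicted per-view camera parameters (no poses are given as input)
# motion_masks: per-input-view dynamic-region masks
# novel_views:  static renders at queried target views, e.g. (1, 6, 3, 256, 256)
```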

This checkpoint

| Property | Value |
| --- | --- |
| File | wildrayzer_2view.pt (3.9 GB, fp32 state_dict) |
| Input resolution | 256 × 256 |
| Input / target views | 2 input → 6 target |
| Base dataset | Dynamic-RE10K (train split) + RealEstate10K (static mix-in) |
| Backbone | RayZer (28 transformer layers) + DINOv3 ViT-7B features |
| Framework | PyTorch ≥ 2.1, xFormers, transformers |

The K=2 configuration matches the sparse-view setting used in the paper's main D-RE10K and D-RE10K-iPhone benchmarks. 3- and 4-input-view variants can be reproduced by retraining with the same pipeline — see training details.
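
The checkpoint expects 256 × 256 inputs, so frames must be resized before inference. A minimal preprocessing sketch, assuming a standard torchvision transform (the repo's exact pipeline may differ):

```python
# Resize-and-crop preprocessing for 256 x 256 inputs (an assumption;
# see the companion repo for the pipeline actually used in training).
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),  # (3, 256, 256), float32 in [0, 1]
])
# frames = torch.stack([preprocess(img) for img in two_input_views])  # (2, 3, 256, 256)
```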

How to use

Download the checkpoint and run the reference demo:

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="uva-cv-lab/wildrayzer-2view",
    filename="wildrayzer_2view.pt",
)
# Pass ckpt_path to the WildRayZerDemo class or to inference.py
# via --config configs/wildrayzer_inference.yaml.
```

The full inference pipeline, Gradio demo, and training code live in the companion repo. A ready-to-deploy Space layout is provided under demo/ in that repo.
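
Once downloaded, the checkpoint can be sanity-checked by loading the state_dict on CPU; this sketch makes no assumption about the repo's key names:

```python
import torch

state = torch.load(ckpt_path, map_location="cpu")  # fp32 state_dict, ~3.9 GB
sd = state.get("state_dict", state)                # unwrap if nested
tensors = [v for v in sd.values() if torch.is_tensor(v)]
n_params = sum(t.numel() for t in tensors)
print(f"{len(tensors)} tensors, {n_params / 1e9:.2f}B parameters, dtype {tensors[0].dtype}")
```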

Hardware requirements: a CUDA GPU with ≥ 40 GB VRAM. The motion-mask predictor fuses DINOv3 ViT-7B patch features with image/Plücker tokens at inference time, so the 7B backbone is a hard dependency, not an optional extra. The authors plan to provide an alternative soon.
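
A quick pre-flight check for this requirement, as a generic PyTorch sketch:

```python
# Check available VRAM before loading the model (generic sketch).
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb < 40:
        print(f"Only {total_gb:.0f} GB VRAM; the DINOv3 ViT-7B fusion "
              "step will likely not fit on this GPU.")
else:
    print("No CUDA GPU detected; this checkpoint requires one.")
```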

Citation

```bibtex
@inproceedings{chen2026wildrayzer,
  title     = {WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments},
  author    = {Chen, Xuweiyi and Zhou, Wentao and Cheng, Zezhou},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  note      = {Highlight},
  year      = {2026},
}
```

License

Released under CC BY-NC 4.0 — free for research and non-commercial use, attribution required. For commercial licensing, contact the authors.

Acknowledgements

This work was supported by the MathWorks Research Gift, Adobe Research Gift, the University of Virginia Research Computing and Data Analytics Center, the AMD AI & HPC Cluster Program, the ACCESS program, and the NAIRR Pilot. Computation was run on the Anvil supercomputer (NSF OAC-2005632) at Purdue and on Delta / DeltaAI (NSF OAC-2005572).
