by Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari
Accepted to 3DV 2026!
Update: [March 20, 2026] We have released the pre-trained M2SVid weights!
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner without iterative diffusion steps by minimizing image-space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, being ranked best 2.6× more often than the second-place method in a user study, while being 6× faster.
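The depth-based reprojection described above can be sketched as a disparity-driven forward warp. Below is a minimal nearest-pixel version in NumPy, for illustration only (it ignores depth ordering and sub-pixel splatting; the repo's `warping.py` is the reference implementation):

```python
import numpy as np

def warp_left_to_right(left, disparity):
    """Forward-warp a left-view frame to the right view using a per-pixel
    disparity map (in pixels). Returns the warped frame and a disocclusion
    mask that is True where no source pixel landed (pixels to be inpainted).

    Minimal nearest-pixel sketch: ignores depth ordering and sub-pixel
    splatting that a production warper would handle.
    left: (H, W, C) array, disparity: (H, W) array.
    """
    H, W, _ = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    xs = np.arange(W)
    for y in range(H):
        # The right camera sees scene content shifted left by the disparity.
        xr = np.round(xs - disparity[y]).astype(int)
        valid = (xr >= 0) & (xr < W)
        right[y, xr[valid]] = left[y, valid]
        filled[y, xr[valid]] = True
    return right, ~filled  # ~filled marks disoccluded pixels
```

The disocclusion mask produced this way is exactly the region the model is asked to inpaint, while the filled pixels are only refined.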
- Download `ckpts.zip` from the Hi3D repo and unzip it (follow step "2. Download checkpoints here and unzip."). Our model follows the Hi3D implementation and uses the same OpenCLIP model.
- Download the M2SVid weights (8.5 GB) and extract them into the `ckpts` folder: `unzip m2svid_weights.zip -d ckpts/`. We provide two model variants: one with full attention for disoccluded tokens (`m2svid_weights.pt`, 4.64 GB) and one without full attention (`m2svid_no_full_atten_weights.pt`, 4.6 GB).
- Optional (for training only): download the stable-video-diffusion-img2vid-xt checkpoint and put it in `ckpts/`.
- Create conda env `depthcrafter` following the DepthCrafter instructions.
- Create conda env `sgm`. We used CUDA 11.8, `python=3.10.6`, `torch==2.0.1`, `torchvision==0.15.2`. We tested model training/inference on A100 and H100 GPUs.

conda env create -f environment.yml -n sgm

Run inference on the demo video:

bash inference.sh

See example outputs in the `demo` folder.
Note 1: The width/height of the video should be divisible by 64.
Note 2: The model was trained at a resolution of 512x512. For inference on higher-resolution videos, please follow the tiling approach described in the StereoCrafter paper. Our released models support temporal and spatial stitching.
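The divisibility constraint from Note 1 amounts to rounding each dimension up to the next multiple of 64. A small helper for illustration (how you actually resize or pad the frames, e.g. with ffmpeg, is up to you):

```python
def next_multiple_of_64(x: int) -> int:
    """Round x up to the nearest multiple of 64."""
    return ((x + 63) // 64) * 64

def pad_to_multiple_of_64(width: int, height: int):
    """Smallest (width, height) >= the input with both sides divisible
    by 64, matching the model's input requirement."""
    return next_multiple_of_64(width), next_multiple_of_64(height)
```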
- Depth prediction and depth-based warping
source /opt/conda/bin/activate ""
conda activate depthcrafter
PYTHONPATH="third_party/DepthCrafter/::${PYTHONPATH}" python third_party/DepthCrafter/run.py \
--video-path demo/input.mp4 --save_folder outputs/depthcrafter --save_npz True --num_inference_steps 25 --max_res 1024
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python warping.py \
--video_path demo/input.mp4 \
--depth_path outputs/depthcrafter/input.npz \
--output_path_reprojected outputs/reprojected/input_reprojected.mp4 \
--output_path_mask outputs/reprojected/input_reprojected_mask.mp4 \
--disparity_perc 0.05

- Inpainting and refinement with M2SVid
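For intuition on the `--disparity_perc` flag: we read it as capping the maximum disparity at a fraction of the frame width (0.05 → 5% of the width). A hypothetical sketch of that scaling, under this assumption; `warping.py` defines the actual behavior:

```python
import numpy as np

def scale_disparity(disparity, frame_width, disparity_perc=0.05):
    """Rescale a relative disparity map so its maximum equals
    disparity_perc * frame_width pixels. Hypothetical sketch of what we
    assume --disparity_perc controls; see warping.py for the real logic."""
    d = np.asarray(disparity, dtype=np.float32)
    return d / max(float(d.max()), 1e-8) * (disparity_perc * frame_width)
```

Larger values produce a stronger stereo effect but also larger disoccluded regions for the model to inpaint.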
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python inpaint_and_refine.py \
--mask_antialias 0 \
--model_config configs/m2svid.yaml \
--ckpt ckpts/m2svid_weights.pt \
--video_path demo/input.mp4 \
--reprojected_path outputs/reprojected/input_reprojected.mp4 \
--reprojected_mask_path outputs/reprojected/input_reprojected_mask.mp4 \
--output_folder outputs/m2svid

Note: If you are using the version without full attention, make sure to use the m2svid_no_fullatten.yaml config instead:
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python inpaint_and_refine.py \
--mask_antialias 0 \
--model_config configs/m2svid_no_fullatten.yaml \
--ckpt ckpts/m2svid_no_full_atten_weights.pt \
--video_path demo/input.mp4 \
--reprojected_path outputs/reprojected/input_reprojected.mp4 \
--reprojected_mask_path outputs/reprojected/input_reprojected_mask.mp4 \
--output_folder outputs/m2svid_no_full_atten

We used the Ego4D and Stereo4D datasets for model training and evaluation.
- Download and preprocess the Stereo4D dataset into the folder `datasets/stereo4d` by following the official instructions. You only need to perform the rectification and stereo matching steps. Then, you can warp all videos using our `warping.py` script. At the end, you should have the following folders: `left_rectified`, `right_rectified`, `reprojected`, and `reprojected_mask`. We provide the train/val split in `datasets/stereo4d/subsets`.
- For Ego4D, we use only videos with the attribute `is_stereo=True`, resulting in 263 videos in total. Download the videos into `datasets/ego4d` by following the official instructions. We rectify the videos, split them into 150-frame clips, and apply the BiDAStereo model to estimate disparities. Check the Ego4D preprocessing README for more details. At the end, you should have the following folders: `cropped_videos` (side-by-side rectified and cropped left and right videos), `reprojected`, and `reprojected_mask`. We provide the train/val split in `datasets/ego4d/subsets`.
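As a small illustration of the `cropped_videos` layout (assuming left|right side-by-side frames), recovering the two views from a frame is a column split:

```python
import numpy as np

def split_side_by_side(frame):
    """Split a side-by-side stereo frame into (left, right) halves.
    Assumes the left view occupies columns [0, W//2); check the actual
    layout of cropped_videos before relying on this.
    frame: (H, W, C) array with W even."""
    w = frame.shape[1] // 2
    return frame[:, :w], frame[:, w:2 * w]
```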
- Download the stable-video-diffusion-img2vid-xt checkpoint and put it in `ckpts/`.
- Run `make_m2svid_init.py` to adapt the SVD model weights to our M2SVid configuration with left-view, warped-view, and mask conditioning.
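For context, a common way to add new conditioning inputs to a pretrained diffusion UNet is to widen its input convolution with zero-initialized channels, which leaves the pretrained behavior unchanged at initialization. A sketch with NumPy arrays; the actual channel layout and operations in `make_m2svid_init.py` may differ:

```python
import numpy as np

def expand_conv_in(weight, extra_in_channels):
    """Widen a conv weight tensor (out_ch, in_ch, kH, kW) with
    zero-initialized input channels for new conditioning inputs.
    Zeros ensure the extra inputs contribute nothing at initialization,
    so the pretrained mapping is preserved. Hypothetical sketch only."""
    out_ch, in_ch, kh, kw = weight.shape
    new_w = np.zeros((out_ch, in_ch + extra_in_channels, kh, kw),
                     dtype=weight.dtype)
    new_w[:, :in_ch] = weight  # keep the pretrained weights intact
    return new_w
```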
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python make_m2svid_init.py

- Run training
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/training/m2svid_train.yaml \
--no-test True \
--train True \
--logdir outputs/training/m2svid

Evaluation on Stereo4D:
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/training/m2svid_train.yaml \
--dataset_base configs/testing/stereo4d.yaml \
--no-test False \
--train False \
--logdir outputs/training/m2svid \
--resume /home/jupyter/outputs_m2svid/training/m2svid/checkpoints/epoch=000120.ckpt

Evaluation on Ego4D:
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/training/m2svid_train.yaml \
--dataset_base configs/testing/ego4d.yaml \
--no-test False \
--train False \
--logdir outputs/training/m2svid \
--resume /home/jupyter/outputs_m2svid/training/m2svid/checkpoints/epoch=000000.ckpt

To reproduce the paper's results on Stereo4D and Ego4D using our released weights:
source /opt/conda/bin/activate ""
conda activate sgm
# Evaluate on Stereo4D
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/testing/pretrained_m2svid.yaml \
--dataset_base configs/testing/stereo4d.yaml \
--no-test False \
--train False \
--logdir outputs/training/m2svid
# Evaluate on Ego4D
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/testing/pretrained_m2svid.yaml \
--dataset_base configs/testing/ego4d.yaml \
--no-test False \
--train False \
--logdir outputs/training/m2svid

@article{shvetsova2026m2svid,
title={M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion},
author={Shvetsova, Nina and Bhat, Goutam and Truong, Prune and Kuehne, Hilde and Tombari, Federico},
journal={3DV},
year={2026}
}