by Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari
Accepted to 3DV 2026!
Update: [March 20, 2026] We have released the pre-trained M2SVid weights!
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner without iterative diffusion steps by minimizing image-space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, being ranked best 2.6× more often than the second-place method in a user study, while being 6× faster.
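The depth-based reprojection described above can be sketched as a disparity-driven forward warp. Below is a minimal nearest-pixel version in NumPy, for illustration only (it ignores depth ordering and sub-pixel splatting; the repo's `warping.py` is the reference implementation):

```python
import numpy as np

def warp_left_to_right(left, disparity):
    """Forward-warp a left-view frame to the right view using a per-pixel
    disparity map (in pixels). Returns the warped frame and a disocclusion
    mask that is True where no source pixel landed (pixels to be inpainted).

    Minimal nearest-pixel sketch: ignores depth ordering and sub-pixel
    splatting that a production warper would handle.
    left: (H, W, C) array, disparity: (H, W) array.
    """
    H, W, _ = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    xs = np.arange(W)
    for y in range(H):
        # The right camera sees scene content shifted left by the disparity.
        xr = np.round(xs - disparity[y]).astype(int)
        valid = (xr >= 0) & (xr < W)
        right[y, xr[valid]] = left[y, valid]
        filled[y, xr[valid]] = True
    return right, ~filled  # ~filled marks disoccluded pixels
```

The disocclusion mask produced this way is exactly the region the model is asked to inpaint, while the filled pixels are only refined.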
- Download `ckpts.zip` from the Hi3D repo and unzip it (follow step "2. Download checkpoints here and unzip."). Our model follows the Hi3D implementation and uses the same OpenCLIP model.
- Download the M2SVid weights (8.5 GB) and extract them into the `ckpts` folder: `unzip m2svid_weights.zip -d ckpts/`. We provide two model variants: one with full attention for disoccluded tokens (`m2svid_weights.pt`, 4.64 GB) and one without full attention (`m2svid_no_full_atten_weights.pt`, 4.6 GB).
- Optional (for training only): download the stable-video-diffusion-img2vid-xt checkpoint and put it in `ckpts/`.
- Create conda env `depthcrafter` following the DepthCrafter instructions.
- Create conda env `sgm`. We used CUDA 11.8, `python=3.10.6`, `torch==2.0.1`, `torchvision==0.15.2`. We tested model training/inference on A100 and H100 GPUs.

conda env create -f environment.yml -n sgm

Run inference on the demo video:

bash inference.sh

See example outputs in the `demo` folder.
Note 1: The width/height of the video should be divisible by 64.
Note 2: The model was trained at a resolution of 512x512. For inference on higher-resolution videos, please follow the tiling approach described in the StereoCrafter paper. Our released models support temporal and spatial stitching.
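The divisibility constraint from Note 1 amounts to rounding each dimension up to the next multiple of 64. A small helper for illustration (how you actually resize or pad the frames, e.g. with ffmpeg, is up to you):

```python
def next_multiple_of_64(x: int) -> int:
    """Round x up to the nearest multiple of 64."""
    return ((x + 63) // 64) * 64

def pad_to_multiple_of_64(width: int, height: int):
    """Smallest (width, height) >= the input with both sides divisible
    by 64, matching the model's input requirement."""
    return next_multiple_of_64(width), next_multiple_of_64(height)
```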
- Depth prediction and depth-based warping
source /opt/conda/bin/activate ""
conda activate depthcrafter
PYTHONPATH="third_party/DepthCrafter/::${PYTHONPATH}" python third_party/DepthCrafter/run.py \
--video-path demo/input.mp4 --save_folder outputs/depthcrafter --save_npz True --num_inference_steps 25 --max_res 1024
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python warping.py \
--video_path demo/input.mp4 \
--depth_path outputs/depthcrafter/input.npz \
--output_path_reprojected outputs/reprojected/input_reprojected.mp4 \
--output_path_mask outputs/reprojected/input_reprojected_mask.mp4 \
--disparity_perc 0.05

- Inpainting and refinement with M2SVid
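For intuition on the `--disparity_perc` flag: we read it as capping the maximum disparity at a fraction of the frame width (0.05 → 5% of the width). A hypothetical sketch of that scaling, under this assumption; `warping.py` defines the actual behavior:

```python
import numpy as np

def scale_disparity(disparity, frame_width, disparity_perc=0.05):
    """Rescale a relative disparity map so its maximum equals
    disparity_perc * frame_width pixels. Hypothetical sketch of what we
    assume --disparity_perc controls; see warping.py for the real logic."""
    d = np.asarray(disparity, dtype=np.float32)
    return d / max(float(d.max()), 1e-8) * (disparity_perc * frame_width)
```

Larger values produce a stronger stereo effect but also larger disoccluded regions for the model to inpaint.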
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python inpaint_and_refine.py \
--mask_antialias 0 \
--model_config configs/m2svid.yaml \
--ckpt ckpts/m2svid_weights.pt \
--video_path demo/input.mp4 \
--reprojected_path outputs/reprojected/input_reprojected.mp4 \
--reprojected_mask_path outputs/reprojected/input_reprojected_mask.mp4 \
--output_folder outputs/m2svid

Note: If you are using the version without full attention, make sure to use the m2svid_no_fullatten.yaml config instead:
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python inpaint_and_refine.py \
--mask_antialias 0 \
--model_config configs/m2svid_no_fullatten.yaml \
--ckpt ckpts/m2svid_no_full_atten_weights.pt \
--video_path demo/input.mp4 \
--reprojected_path outputs/reprojected/input_reprojected.mp4 \
--reprojected_mask_path outputs/reprojected/input_reprojected_mask.mp4 \
--output_folder outputs/m2svid_no_full_atten

We used the Ego4D and Stereo4D datasets for model training and evaluation.
- Download and preprocess the Stereo4D dataset into the folder `datasets/stereo4d` by following the official instructions. You only need to perform the rectification and stereo matching steps. Then, you can warp all videos using our `warping.py` script. At the end, you should have the following folders: `left_rectified`, `right_rectified`, `reprojected`, and `reprojected_mask`. We provide the train/val split in `datasets/stereo4d/subsets`.
- For Ego4D, we use only videos with the attribute `is_stereo=True`, resulting in 263 videos in total. Download the videos into `datasets/ego4d` by following the official instructions. We rectify the videos, split them into 150-frame clips, and apply the BiDAStereo model to estimate disparities. Check the Ego4D preprocessing README for more details. At the end, you should have the following folders: `cropped_videos` (side-by-side rectified and cropped left and right videos), `reprojected`, and `reprojected_mask`. We provide the train/val split in `datasets/ego4d/subsets`.
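As a small illustration of the `cropped_videos` layout (assuming left|right side-by-side frames), recovering the two views from a frame is a column split:

```python
import numpy as np

def split_side_by_side(frame):
    """Split a side-by-side stereo frame into (left, right) halves.
    Assumes the left view occupies columns [0, W//2); check the actual
    layout of cropped_videos before relying on this.
    frame: (H, W, C) array with W even."""
    w = frame.shape[1] // 2
    return frame[:, :w], frame[:, w:2 * w]
```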
- Download the stable-video-diffusion-img2vid-xt checkpoint and put it in `ckpts/`.
- Run `make_m2svid_init.py` to adapt the SVD model weights to our M2SVid configuration with left-view, warped-view, and mask conditioning.
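For context, a common way to add new conditioning inputs to a pretrained diffusion UNet is to widen its input convolution with zero-initialized channels, which leaves the pretrained behavior unchanged at initialization. A sketch with NumPy arrays; the actual channel layout and operations in `make_m2svid_init.py` may differ:

```python
import numpy as np

def expand_conv_in(weight, extra_in_channels):
    """Widen a conv weight tensor (out_ch, in_ch, kH, kW) with
    zero-initialized input channels for new conditioning inputs.
    Zeros ensure the extra inputs contribute nothing at initialization,
    so the pretrained mapping is preserved. Hypothetical sketch only."""
    out_ch, in_ch, kh, kw = weight.shape
    new_w = np.zeros((out_ch, in_ch + extra_in_channels, kh, kw),
                     dtype=weight.dtype)
    new_w[:, :in_ch] = weight  # keep the pretrained weights intact
    return new_w
```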
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python make_m2svid_init.py

- Run training
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/training/m2svid_train.yaml \
--no-test True \
--train True \
--logdir outputs/training/m2svid

Evaluation on Stereo4D:
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/training/m2svid_train.yaml \
--dataset_base configs/testing/stereo4d.yaml \
--no-test False \
--train False \
--logdir outputs/training/m2svid \
--resume /home/jupyter/outputs_m2svid/training/m2svid/checkpoints/epoch=000120.ckpt

Evaluation on Ego4D:
source /opt/conda/bin/activate ""
conda activate sgm
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/training/m2svid_train.yaml \
--dataset_base configs/testing/ego4d.yaml \
--no-test False \
--train False \
--logdir outputs/training/m2svid \
--resume /home/jupyter/outputs_m2svid/training/m2svid/checkpoints/epoch=000000.ckpt

To reproduce the paper's results on Stereo4D and Ego4D using our released weights:
source /opt/conda/bin/activate ""
conda activate sgm
# Evaluate on Stereo4D
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/testing/pretrained_m2svid.yaml \
--dataset_base configs/testing/stereo4d.yaml \
--no-test False \
--train False \
--logdir outputs/training/m2svid
# Evaluate on Ego4D
PYTHONPATH="./:./third_party/Hi3D-Official/:./third_party/pytorch-msssim/:${PYTHONPATH}" python third_party/Hi3D-Official/train_test_updated.py \
--base configs/testing/pretrained_m2svid.yaml \
--dataset_base configs/testing/ego4d.yaml \
--no-test False \
--train False \
--logdir outputs/training/m2svid

@article{shvetsova2026m2svid,
title={M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion},
author={Shvetsova, Nina and Bhat, Goutam and Truong, Prune and Kuehne, Hilde and Tombari, Federico},
journal={3DV},
year={2026}
}