SM\(^4\)Depth: Seamless Monocular Metric Depth Estimation
across Multiple Cameras and Scenes by One Model

Yihao Liu *1       Feng Xue *1,2        Anlong Ming1
1Beijing University of Posts and Telecommunications   2University of Trento
(* Equal Contribution)

SM\(^4\)Depth predicts metric depth in previously unseen scenes while being trained on only 150K RGB-D pairs.

Visualization Comparison between SM\(^4\)Depth and ZoeDepth

Depth and Distribution Visualizations on Real-world Videos

Note: The video was recorded with a Huawei Mate40 (\(4K@30FPS\)), with focal lengths \(f_x=3278.24\) and \(f_y=3278.13\). We use the officially provided code of all methods in this comparison. Because ground-truth depth is unavailable, this qualitative comparison is for reference only.
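The focal lengths in the note can be related to the field of view (FOV), the quantity that SM\(^4\)Depth keeps consistent across cameras. The following is a minimal sketch under a pinhole camera model. It assumes the focal lengths are given in pixels and that "4K" means a 3840x2160 frame; neither is stated in the note, and the helper name `fov_degrees` is hypothetical.

```python
import math


def fov_degrees(focal_px: float, size_px: int) -> float:
    """Field of view (in degrees) along one image axis for a pinhole camera."""
    return math.degrees(2.0 * math.atan(size_px / (2.0 * focal_px)))


# Assumption: the 4K frame is 3840x2160 pixels (not stated in the note).
WIDTH, HEIGHT = 3840, 2160
# Focal lengths from the note, assumed to be expressed in pixels.
FX, FY = 3278.24, 3278.13

hfov = fov_degrees(FX, WIDTH)   # horizontal FOV
vfov = fov_degrees(FY, HEIGHT)  # vertical FOV
print(f"horizontal FOV: {hfov:.1f} deg, vertical FOV: {vfov:.1f} deg")
```

Under these assumptions the horizontal FOV is roughly 60 degrees and the vertical FOV roughly 36 degrees.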


Lower Error in Scale and Shift

0.15M vs 800M Training Pairs

Consistent Results Indoors and Outdoors

Real-world scenes vary widely in depth, so models tend to specialize in specific scenes, which causes inconsistent accuracy across scenes. Previous works suffered from large accuracy fluctuations and high average errors, while our method reduces the errors.

Compared to Metric3D, SM\(^4\)Depth performs better on most datasets (i.e., SUN RGB-D, ETH3D, DIODE, and DDAD) and similarly on iBims-1, even though it is trained on only 150K images, which demonstrates the effectiveness of SM\(^4\)Depth.

All the improvements are designed to ensure that SM\(^4\)Depth can obtain stable metric scales in multi-scene seamless videos. The results show that SM\(^4\)Depth can achieve consistent metric depth accuracy across both indoor and outdoor scenes.


The generalization of monocular metric depth estimation (MMDE) has been a longstanding challenge. Recent methods have made progress by combining relative and metric depth or by aligning the input image focal length. However, they are still beset by challenges at the camera, scene, and data levels: (1) sensitivity to different cameras; (2) inconsistent accuracy across scenes; (3) reliance on massive training data. This paper proposes SM\(^4\)Depth, a seamless MMDE method, to address all the issues above within a single network. First, we reveal that a consistent field of view (FOV) is the key to resolving "metric ambiguity" across cameras, which guides us to propose a more straightforward preprocessing unit. Second, to achieve consistently high accuracy across scenes, we explicitly model the metric scale determination as discretizing the depth interval into bins and propose variation-based unnormalized depth bins. This method bridges the depth gap between diverse scenes by reducing the ambiguity of the conventional metric bin. Third, to reduce the reliance on massive training data, we propose a "divide and conquer" solution. Instead of estimating directly from the vast solution space, the correct metric bins are estimated from multiple solution sub-spaces to reduce complexity. Finally, with just 150K RGB-D pairs and a consumer-grade GPU for training, SM\(^4\)Depth achieves state-of-the-art performance on most previously unseen datasets, especially surpassing ZoeDepth and Metric3D on mRI\(_\theta\).
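To make "discretizing the depth interval into bins" concrete, here is a minimal sketch of the conventional metric-bin formulation that the abstract contrasts against. It is not the variation-based unnormalized bins proposed by SM\(^4\)Depth. The function names, the depth range, and the toy numbers are all hypothetical and chosen only for illustration.

```python
def bin_centers(widths, d_min, d_max):
    """Turn relative bin widths into bin centers covering [d_min, d_max]."""
    total = sum(widths)
    edges = [d_min]
    for w in widths:
        # Each bin takes a share of the depth range proportional to its width.
        edges.append(edges[-1] + (d_max - d_min) * w / total)
    return [(lo + hi) / 2.0 for lo, hi in zip(edges[:-1], edges[1:])]


def expected_depth(probs, centers):
    """Per-pixel depth as the probability-weighted average of bin centers."""
    norm = sum(probs)
    return sum(p * c for p, c in zip(probs, centers)) / norm


# Toy example: three bins over an assumed 0-8 m depth range.
centers = bin_centers([1.0, 1.0, 2.0], d_min=0.0, d_max=8.0)
# Hypothetical per-pixel probabilities over the three bins.
depth = expected_depth([0.2, 0.5, 0.3], centers)
print(centers, depth)
```

In this conventional formulation, the fixed range `[d_min, d_max]` has to cover every scene the model may encounter, which is one way to read the "ambiguity of the conventional metric bin" mentioned above.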

Zero-shot Metric Depth Estimation

BUPT Depth dataset (cut-free indoor-outdoor video)

The BUPT Depth dataset is proposed to evaluate the consistency of accuracy across indoor and outdoor scenes, including streets, restaurants, classrooms, lounges, etc. It consists of 14,932 continuous RGB-D pairs captured on the campus of Beijing University of Posts and Telecommunications (BUPT) by a ZED2 stereo camera. In addition, we provide the re-generated depth maps from CREStereo and the sky segmentations from ViT-Adapter. The color and depth streams are captured with intrinsics of \(1091.517\) and a baseline of \(120.034\) mm.
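For readers who want to relate the stereo parameters above to metric depth, the standard rectified-stereo relation \(Z = f \cdot B / d\) can be sketched as follows. This assumes the stated intrinsics value \(1091.517\) is the focal length in pixels, which the text does not state explicitly. The function name and the example disparity are hypothetical, and this is not the dataset's actual depth-generation pipeline.

```python
def disparity_to_depth(disparity_px: float,
                       focal_px: float = 1091.517,    # assumed focal length in pixels
                       baseline_m: float = 0.120034   # 120.034 mm baseline in meters
                       ) -> float:
    """Metric depth (meters) from disparity (pixels) for a rectified stereo pair."""
    if disparity_px <= 0:
        # Zero or negative disparity carries no finite depth (e.g. sky regions).
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical example: a pixel with a disparity of 10 px.
z = disparity_to_depth(10.0)
print(f"depth at 10 px disparity: {z:.2f} m")
```

Because depth is inversely proportional to disparity, small disparity errors at long range translate into large depth errors, which is one plausible reason re-generated depth maps and sky segmentations are provided alongside the raw stream.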

Zero-shot seamless results for BUPT Depth dataset

Our method is marked in red, ZoeDepth in blue, and Metric3D in green.



                title={SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model},
                author={Liu, Yihao and Xue, Feng and Ming, Anlong},