TY - JOUR
T1 - FS-Depth
T2 - Focal-and-Scale Depth Estimation From a Single Image in Unseen Indoor Scene
AU - Wei, Chengrui
AU - Yang, Meng
AU - He, Lei
AU - Zheng, Nanning
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Predicting absolute depth maps from single images in unseen scenes has long been an ill-posed problem. We observe that this is due not only to the scale-ambiguity problem but, more importantly, to the focal-ambiguity problem, which degrades the generalization ability of monocular depth estimation: images may be captured by cameras with different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model that learns absolute depth maps from single images in unseen indoor scenes. First, a relative depth estimation network learns relative depths from single images at diverse scales. Second, multi-scale features are generated by mapping a single focal length value to focal-length features and concatenating them with intermediate features of different scales from relative depth estimation. Finally, the relative depths and multi-scale features are jointly fed into an absolute depth estimation network. A dual-directional alignment strategy enables the model to be trained effectively on either a single dataset or a mixed dataset with diverse focal lengths and scene scales. In addition, a new pipeline is developed to augment the diversity of focal lengths in public datasets, which are often captured with cameras of the same or similar focal lengths. Experiments verify that our model trained on NYUDv2 significantly improves the generalization ability of monocular depth estimation by 32%/14% (RMSE) on three unseen datasets with/without data augmentation compared with state-of-the-art (SOTA) baselines, and effectively alleviates the deformation of depth maps in 3D view. Generalization improves by a further 16% when the model is trained on a mixture of NYUDv2 and SUNRGBD. Our model also maintains SOTA accuracy when trained and tested on NYUDv2, as existing models are. The code is released at https://github.com/wcrwcrwcr/FS-Depth-v1.
KW - Monocular depth estimation
KW - focal length
KW - generalization
KW - indoor scene
KW - scene scale
UR - https://www.scopus.com/pages/publications/85196100543
DO - 10.1109/TCSVT.2024.3411688
M3 - Article
AN - SCOPUS:85196100543
SN - 1051-8215
VL - 34
SP - 10604
EP - 10617
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 11
ER -