TY - GEN
T1 - VaF-LangSplat
T2 - 33rd ACM International Conference on Multimedia, MM 2025
AU - Li, Changzhou
AU - Yang, Xinyu
AU - Yang, Weiguo
AU - Li, Xinyi
N1 - Publisher Copyright: © 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
AB - Efficient and precise open-vocabulary 3D scene segmentation remains a critical challenge in computer vision. While current leading methods encode CLIP language features into 3D Gaussians to achieve high segmentation accuracy and fast inference speeds, they suffer from point ambiguity issues caused by separately training on multi-level 2D semantic masks. This approach not only compromises time and space efficiency but also degrades accuracy when selecting optimal semantic levels. To overcome these limitations, we propose Voxel-Aware Fusion Language Gaussian Splatting (VaF-LangSplat), a novel framework that jointly optimizes geometric and semantic representations. Our approach first voxelizes 3D Gaussians using sparse point clouds and lightweight MLP decoders, effectively disentangling language features from geometric attributes. This enables simultaneous training across arbitrary semantic levels with minimal overhead. Crucially, we introduce Fusion Language Splatting, which aligns geometric and multi-level semantic distributions to sharpen boundary definitions while eliminating redundant Gaussian expansions. The voxel-aware representation further enhances robustness against motion blur and lighting variations. Experiments on open-vocabulary 3D localization and segmentation tasks demonstrate that VaF-LangSplat outperforms LangSplat (the prior state-of-the-art) with significant improvements in both segmentation/localization accuracy and efficiency: 4X faster training and 15X reduced storage requirements.
KW - 3d gaussians
KW - fusion language splatting
KW - open-vocabulary segmentation
KW - point ambiguity issue
KW - voxel-aware
UR - https://www.scopus.com/pages/publications/105024073003
U2 - 10.1145/3746027.3755693
DO - 10.1145/3746027.3755693
M3 - Conference contribution
AN - SCOPUS:105024073003
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 4952
EP - 4961
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
Y2 - 27 October 2025 through 31 October 2025
ER -