TY - JOUR
T1 - Swin-6D
T2 - 6D Pose Estimation via 3D Keypoints Voting with Swin Transformer
AU - Huang, Zijian
AU - Shi, Xiaojun
AU - Ju, Zeli
AU - Yun, Xialun
AU - Mei, Xuesong
AU - Guo, Lunan
N1 - Publisher Copyright:
© The Author(s) under exclusive licence to The Korean Institute of Electrical Engineers 2025.
PY - 2026/1
Y1 - 2026/1
N2 - 6D pose estimation using RGB-D data is essential for various computer vision applications. Extracting relevant information from depth and color image data, and effectively integrating them, remains a significant technical challenge. Previous approaches primarily depend on convolutional networks for feature extraction and neglect to fully integrate features across the entire pipeline. This limitation undermines the robustness of existing 6D pose estimation methods, particularly in scenarios involving significant occlusions and clutter. To address these issues, we propose Swin-6D, a novel two-branch network designed specifically for 6D pose estimation with RGB-D input. The proposed model leverages the Swin Transformer in the RGB branch, which excels in handling occlusions. Moreover, the network incorporates a feature fusion module that facilitates the seamless integration of RGB and point cloud features. We conduct experiments on the LineMOD and Occlusion-LineMOD datasets, demonstrating that Swin-6D achieves ADD(−S) scores of 99.7 and 75.5, respectively. We further validate our method on the YCB-Video dataset, demonstrating competitive performance in cluttered real-world scenes. These results highlight that Swin-6D outperforms state-of-the-art methods, especially in scenarios with significant occlusions. We also deploy our method on the AUBO i5 robot for grasping experiments, where it achieves robust performance even in cluttered and partially occluded scenes, demonstrating its practical deployment potential.
AB - 6D pose estimation using RGB-D data is essential for various computer vision applications. Extracting relevant information from depth and color image data, and effectively integrating them, remains a significant technical challenge. Previous approaches primarily depend on convolutional networks for feature extraction and neglect to fully integrate features across the entire pipeline. This limitation undermines the robustness of existing 6D pose estimation methods, particularly in scenarios involving significant occlusions and clutter. To address these issues, we propose Swin-6D, a novel two-branch network designed specifically for 6D pose estimation with RGB-D input. The proposed model leverages the Swin Transformer in the RGB branch, which excels in handling occlusions. Moreover, the network incorporates a feature fusion module that facilitates the seamless integration of RGB and point cloud features. We conduct experiments on the LineMOD and Occlusion-LineMOD datasets, demonstrating that Swin-6D achieves ADD(−S) scores of 99.7 and 75.5, respectively. We further validate our method on the YCB-Video dataset, demonstrating competitive performance in cluttered real-world scenes. These results highlight that Swin-6D outperforms state-of-the-art methods, especially in scenarios with significant occlusions. We also deploy our method on the AUBO i5 robot for grasping experiments, where it achieves robust performance even in cluttered and partially occluded scenes, demonstrating its practical deployment potential.
KW - 6D pose estimation
KW - Keypoint-based pose estimation
KW - Pixel-point cloud fusion
KW - RGB-D images
UR - https://www.scopus.com/pages/publications/105017396537
U2 - 10.1007/s42835-025-02455-4
DO - 10.1007/s42835-025-02455-4
M3 - Article
AN - SCOPUS:105017396537
SN - 1975-0102
VL - 21
SP - 1055
EP - 1071
JO - Journal of Electrical Engineering and Technology
JF - Journal of Electrical Engineering and Technology
IS - 1
ER -