Swin-6D: 6D Pose Estimation via 3D Keypoints Voting with Swin Transformer

Research output: Contribution to journal › Article › peer-review

Abstract

6D pose estimation from RGB-D data is essential for many computer vision applications. Extracting relevant information from depth and color images and integrating it effectively remains a significant technical challenge. Previous approaches rely primarily on convolutional networks for feature extraction and do not fully integrate features across the entire pipeline. This limitation undermines the robustness of existing 6D pose estimation methods, particularly in scenes with significant occlusion and clutter. To address these issues, we propose Swin-6D, a novel two-branch network designed specifically for 6D pose estimation with RGB-D input. The model uses the Swin Transformer in the RGB branch, which excels at handling occlusions. In addition, the network incorporates a feature fusion module that integrates RGB and point cloud features. Experiments on the LineMOD and Occlusion-LineMOD datasets show that Swin-6D achieves ADD(−S) scores of 99.7 and 75.5, respectively. We further validate our method on the YCB-Video dataset, where it shows competitive performance in cluttered real-world scenes. These results indicate that Swin-6D outperforms state-of-the-art methods, especially in scenarios with significant occlusions. We also deployed our method on the AUBO i5 robot for grasping experiments, where it performs robustly even in cluttered and partially occluded scenes, demonstrating its potential for practical deployment.
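
For readers unfamiliar with the ADD(−S) metric quoted above, the following is a minimal sketch of how ADD and ADD-S are commonly computed. It is not the authors' evaluation code. The function names, the use of NumPy, and the 10%-of-diameter acceptance threshold are assumptions based on the usual convention for LineMOD-style benchmarks. The abstract does not state which threshold was used.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (R: 3x3 rotation, t: 3-vector) to N x 3 points."""
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points under the two poses."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return float(np.linalg.norm(gt - pred, axis=1).mean())

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its nearest predicted point.

    Used for symmetric objects, where point correspondences are ambiguous.
    """
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    # Dense N x N pairwise distances; adequate for a sketch, but a KD-tree
    # would be preferable for large model point sets.
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

def is_correct(error, diameter, threshold=0.1):
    """Assumed convention: a pose counts as correct if error < 10% of object diameter."""
    return error < threshold * diameter

# Hypothetical example: a cube centered at the origin, rotated 180 degrees about z.
cube = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], dtype=float)
identity = np.eye(3)
half_turn_z = np.diag([-1.0, -1.0, 1.0])
zero = np.zeros(3)

add_err = add_metric(cube, identity, zero, half_turn_z, zero)
add_s_err = add_s_metric(cube, identity, zero, half_turn_z, zero)
```

In this example the cube maps onto itself under the half turn, so ADD-S is zero while ADD is large. This is why the symmetric variant is reported for symmetric objects.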

Original language: English
Pages (from-to): 1055-1071
Number of pages: 17
Journal: Journal of Electrical Engineering and Technology
Volume: 21
Issue number: 1
DOIs
State: Published - Jan 2026

Keywords

  • 6D pose estimation
  • Keypoint-based pose estimation
  • Pixel-point cloud fusion
  • RGB-D images
