TY - GEN
T1 - RefDetector
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
AU - Wang, Yabing
AU - Tian, Zhuotao
AU - Qin, Zheng
AU - Zhou, Sanping
AU - Wang, Le
N1 - Publisher Copyright:
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2025/4/11
Y1 - 2025/4/11
N2 - Despite rapid and substantial advancements in object detection, it remains limited by pre-defined category sets. Current methods for visual grounding primarily focus on better leveraging the visual backbone to generate text-tailored visual features, which may require adjusting the parameters of the entire model. Besides, some early methods, i.e., matching-based methods, build upon and extend existing object detectors by enabling them to localize an object based on free-form linguistic expressions, and thus have good application potential. However, the potential of the matching-based approach has not been fully realized due to inadequate exploration. In this paper, we first analyze the limitations of current matching-based methods (i.e., the mismatch problem and complicated fusion mechanisms), then present a simple yet effective matching-based method, namely RefDetector. To tackle these issues, we devise a simple heuristic rule to generate proposals with improved referent recall. Additionally, we introduce a straightforward vision-language interaction module that eliminates the need for intricate manually-designed mechanisms. Moreover, we explore visual grounding based on the modern detector DETR and achieve significant performance improvements. Extensive experiments on three REC benchmark datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg, validate the effectiveness of the proposed method.
AB - Despite rapid and substantial advancements in object detection, it remains limited by pre-defined category sets. Current methods for visual grounding primarily focus on better leveraging the visual backbone to generate text-tailored visual features, which may require adjusting the parameters of the entire model. Besides, some early methods, i.e., matching-based methods, build upon and extend existing object detectors by enabling them to localize an object based on free-form linguistic expressions, and thus have good application potential. However, the potential of the matching-based approach has not been fully realized due to inadequate exploration. In this paper, we first analyze the limitations of current matching-based methods (i.e., the mismatch problem and complicated fusion mechanisms), then present a simple yet effective matching-based method, namely RefDetector. To tackle these issues, we devise a simple heuristic rule to generate proposals with improved referent recall. Additionally, we introduce a straightforward vision-language interaction module that eliminates the need for intricate manually-designed mechanisms. Moreover, we explore visual grounding based on the modern detector DETR and achieve significant performance improvements. Extensive experiments on three REC benchmark datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg, validate the effectiveness of the proposed method.
UR - https://www.scopus.com/pages/publications/105004290133
U2 - 10.1609/aaai.v39i8.32866
DO - 10.1609/aaai.v39i8.32866
M3 - Conference contribution
AN - SCOPUS:105004290133
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 8033
EP - 8041
BT - Special Track on AI Alignment
A2 - Walsh, Toby
A2 - Shah, Julie
A2 - Kolter, Zico
PB - Association for the Advancement of Artificial Intelligence
Y2 - 25 February 2025 through 4 March 2025
ER -