TY - GEN
T1 - Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill
AU - Cai, Wenzhe
AU - Huang, Siyuan
AU - Cheng, Guangran
AU - Long, Yuxing
AU - Gao, Peng
AU - Sun, Changyin
AU - Dong, Hao
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Zero-shot object navigation is a challenging task for home-assistance robots. This task emphasizes visual grounding, commonsense inference, and locomotion abilities; the first two are inherent in foundation models, but for locomotion most works still depend on map-based planning approaches. The gap between RGB space and map space makes it difficult to directly transfer knowledge from foundation models to navigation tasks. In this work, we propose a Pixel-guided Navigation skill (PixNav), which bridges the gap between foundation models and the embodied navigation task. It is straightforward for recent foundation models to indicate an object by pixels, and with pixels as the goal specification, our method becomes a versatile navigation policy toward all kinds of objects. Moreover, PixNav is a pure RGB-based policy, which can reduce the cost of home-assistance robots. Experiments demonstrate the robustness of PixNav, which achieves an 80+% success rate in the local path-planning task. To perform long-horizon object navigation, we design an LLM-based planner that exploits commonsense knowledge about the relationships between objects and rooms to select the best waypoint. Evaluations in both photorealistic indoor simulators and real-world environments validate the effectiveness of our proposed navigation strategy. More details are available on our project website: https://sites.google.com/view/pixnav/.
UR - https://www.scopus.com/pages/publications/85202431282
U2 - 10.1109/ICRA57147.2024.10610499
DO - 10.1109/ICRA57147.2024.10610499
M3 - Conference contribution
AN - SCOPUS:85202431282
T3 - Proceedings - IEEE International Conference on Robotics and Automation
SP - 5228
EP - 5234
BT - 2024 IEEE International Conference on Robotics and Automation, ICRA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Robotics and Automation, ICRA 2024
Y2 - 13 May 2024 through 17 May 2024
ER -