Language Guided Robotic Grasping with Fine-grained Instructions

Cited by: 7
Authors
Sun, Qiang [1 ,6 ]
Lin, Haitao [1 ]
Fu, Ying [5 ]
Fu, Yanwei [1 ,2 ,3 ,4 ]
Xue, Xiangyang [1 ]
Affiliations
[1] Fudan Univ, Shanghai, Peoples R China
[2] Fudan Univ, Sch Data Sci, Shanghai, Peoples R China
[3] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Shanghai, Peoples R China
[4] Zhejiang Normal Univ, Fudan-ISTBI-ZJNU Algorithm Ctr for Brain-Inspired Intelligence, Jinhua, Zhejiang, Peoples R China
[5] Beijing Inst Technol, Beijing, Peoples R China
[6] Shanghai Univ Int Business & Econ, Sch Stat & Informat, Shanghai, Peoples R China
Source
2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS | 2023
Keywords
DOI
10.1109/IROS55552.2023.10342331
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given a single RGB image and attribute-rich language instructions, this paper investigates the novel problem of using Fine-grained instructions for Language guided robotic Grasping (FLarG). The problem is challenging because the target object must be grounded from fine-grained language descriptions. Recent work grounds objects visually using only a few coarse attributes [1]; however, these methods perform poorly because they do not align the multi-modal features well and do not take full advantage of recent powerful large pre-trained vision-and-language models such as CLIP. To this end, this paper proposes a FLarG pipeline consisting of CLIP-guided object localization followed by 6-DoF category-level object pose estimation for grasping. Specifically, we first take the CLIP-based segmentation model CRIS as the backbone and propose an end-to-end DyCRIS model that uses a novel dynamic mask strategy to fuse the multi-level language and vision features. Then, the well-trained instance segmentation backbone Mask R-CNN is adopted to further refine the mask predicted by DyCRIS. Finally, the target object pose is inferred for robotic grasping using a recent 6-DoF object pose estimation method. To validate our CLIP-enhanced pipeline, we also construct a validation dataset for the FLarG task, named RefNOCS. Extensive results on RefNOCS demonstrate the utility and effectiveness of our proposed method. The project homepage is available at https://sunqiang85.github.io/FLarG/.
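The abstract outlines a three-stage pipeline: CLIP-guided referring segmentation with DyCRIS (built on the CRIS backbone), mask refinement with Mask R-CNN, and category-level 6-DoF object pose estimation for grasping. The Python sketch below only illustrates that data flow; every function in it is a hypothetical stub written for this summary, not the authors' released code (see the project homepage for the actual implementation).

# Minimal, hypothetical sketch of the three-stage FLarG pipeline described in the abstract.
# All functions are illustrative stubs, not the authors' code.
import numpy as np

def dycris_segment(rgb: np.ndarray, instruction: str) -> np.ndarray:
    # Stage 1 stub: CLIP-guided referring segmentation (DyCRIS on a CRIS backbone)
    # would fuse multi-level vision and language features via dynamic masks.
    return np.zeros(rgb.shape[:2], dtype=np.float32)  # placeholder soft mask

def refine_with_maskrcnn(rgb: np.ndarray, coarse_mask: np.ndarray) -> np.ndarray:
    # Stage 2 stub: a real system would select the Mask R-CNN instance mask
    # that best overlaps the coarse referring mask.
    return (coarse_mask > 0.5).astype(np.uint8)  # placeholder binary mask

def estimate_6dof_pose(rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Stage 3 stub: category-level 6-DoF pose estimation on the masked object.
    return np.eye(4, dtype=np.float32)  # placeholder object-to-camera transform

def flarg_grasp(rgb: np.ndarray, instruction: str) -> np.ndarray:
    # Chain the three stages; a grasp pose would be derived from the object pose.
    coarse = dycris_segment(rgb, instruction)
    refined = refine_with_maskrcnn(rgb, coarse)
    return estimate_6dof_pose(rgb, refined)

# Example call on a dummy image and instruction:
grasp_pose = flarg_grasp(np.zeros((480, 640, 3), dtype=np.uint8), "the red mug on the left")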
Pages: 1319-1326
Number of pages: 8
References
38 entries in total
[1] [Anonymous], 2022, CVPR, DOI 10.1109/CVPR52688.2022.01716
[2] Bisk Y., 2016, Proc. 15th Annu. Conf. of the North American Chapter of the ACL, p. 751
[3] Cheang C., 2022, arXiv:2205.04028
[4] Chen Y.-W., 2019, BMVC
[5] Chen Y., Xu R., Lin Y., Vela P. A., "A Joint Network for Grasp Detection Conditioned on Natural Language Commands," ICRA 2021, pp. 4576-4582
[6] Ding H., 2022, IEEE TPAMI
[7] Feng G., Hu Z., Zhang L., Lu H., "Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation," CVPR 2021, pp. 15501-15510
[8] Hatori J., 2018, ICRA
[9] He K. M., 2020, IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, p. 386, DOI 10.1109/TPAMI.2018.2844175; 10.1109/ICCV.2017.322
[10] Hu R., Rohrbach M., Darrell T., "Segmentation from Natural Language Expressions," ECCV 2016, Part I, LNCS 9905, pp. 108-124