Language Guided Robotic Grasping with Fine-grained Instructions

Cited by: 7
Authors
Sun, Qiang [1 ,6 ]
Lin, Haitao [1 ]
Fu, Ying [5 ]
Fu, Yanwei [1 ,2 ,3 ,4 ]
Xue, Xiangyang [1 ]
Affiliations
[1] Fudan Univ, Shanghai, Peoples R China
[2] Fudan Univ, Sch Data Sci, Shanghai, Peoples R China
[3] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Shanghai, Peoples R China
[4] Zhejiang Normal Univ, Fudan ISTBI-ZJNU Algorithm Ctr Brain-Inspired Intel, Jinhua, Zhejiang, Peoples R China
[5] Beijing Inst Technol, Beijing, Peoples R China
[6] Shanghai Univ Int Business & Econ, Sch Stat & Informat, Shanghai, Peoples R China
Source
2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS | 2023
DOI
10.1109/IROS55552.2023.10342331
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Given a single RGB image and attribute-rich language instructions, this paper investigates the novel problem of using Fine-grained instructions for Language guided robotic Grasping (FLarG). The problem is challenging because fine-grained language descriptions must be learned to ground target objects. Recent advances ground objects visually using only a few coarse attributes [1]. However, these methods perform poorly because they cannot align multi-modal features well and do not exploit recent powerful large pre-trained vision-and-language models, e.g., CLIP. To this end, this paper proposes a FLarG pipeline comprising CLIP-guided object localization and 6-DoF category-level object pose estimation for grasping. Specifically, we first take the CLIP-based segmentation model CRIS as the backbone and propose an end-to-end DyCRIS model that uses a novel dynamic mask strategy to fuse multi-level language and vision features. Then the well-trained instance segmentation backbone Mask R-CNN is adopted to further improve the mask predicted by our DyCRIS. Finally, the target object pose is inferred for robotic grasping using a recent 6-DoF object pose estimation method. To validate our CLIP-enhanced pipeline, we also construct a validation dataset for the FLarG task, named RefNOCS. Extensive results on RefNOCS show the utility and effectiveness of the proposed method. The project homepage is available at https://sunqiang85.github.io/FLarG/.
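The abstract's mask-refinement stage (replacing the language-grounded mask with the best-overlapping instance mask) can be sketched roughly as follows. This is an illustrative toy, not the authors' released code: masks are modeled as sets of pixel coordinates, `refine_grounded_mask` stands in for the DyCRIS-to-Mask-R-CNN handoff, and all names are hypothetical.

```python
# Toy sketch of the refinement idea: the mask grounded by a CRIS-like model
# is replaced by the instance-segmentation mask (e.g. from Mask R-CNN) that
# overlaps it most, measured by intersection-over-union (IoU).

def iou(a: set, b: set) -> float:
    """IoU of two masks represented as sets of (row, col) pixels."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def refine_grounded_mask(grounded: set, instance_masks: list) -> set:
    """Pick the instance mask with the highest IoU against the grounded mask."""
    return max(instance_masks, key=lambda m: iou(grounded, m))

# Toy example: a slightly noisy grounded mask vs. two candidate instances.
grounded = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)}
instances = [
    {(0, 0), (0, 1), (1, 0), (1, 1)},  # mostly overlaps the grounded mask
    {(5, 5), (5, 6)},                  # an unrelated object
]
best = refine_grounded_mask(grounded, instances)  # selects the first instance
```

In the real pipeline the selected instance mask would then feed the 6-DoF pose estimator; here the point is only the IoU-based matching between the two mask sources.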
Pages: 1319-1326
Page count: 8
References
38 in total
[11] Hu Zhiwei, 2020, CVPR
[12] Huang, Shaofei; Hui, Tianrui; Liu, Si; Li, Guanbin; Wei, Yunchao; Han, Jizhong; Liu, Luoqi; Li, Bo. Referring Image Segmentation via Cross-Modal Progressive Comprehension. CVPR 2020: 10485-10494
[13] Hui, Tianrui; Liu, Si; Huang, Shaofei; Li, Guanbin; Yu, Sansi; Zhang, Faxi; Han, Jizhong. Linguistic Structure Guided Context Modeling for Referring Image Segmentation. ECCV 2020, Pt X, LNCS 12355: 59-75
[14] Jing Yongcheng, 2021, CVPR
[15] Kazemzadeh S., 2014, Proc. 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP): 787
[16] Li RC, 2020, IEEE Position Location and Navigation Symposium (PLANS): 798, DOI 10.1109/PLANS46316.2020.9109908
[17] Lin HB, 2022, Proceedings of the Fifth Fact Extraction and Verification Workshop (FEVER 2022): 6
[18] Liu, Chenxi; Lin, Zhe; Shen, Xiaohui; Yang, Jimei; Lu, Xin; Yuille, Alan. Recurrent Multimodal Interaction for Referring Image Segmentation. ICCV 2017: 1280-1289
[19] Liu, Daqing; Zhang, Hanwang; Wu, Feng; Zha, Zheng-Jun. Learning to Assemble Neural Module Tree Networks for Visual Grounding. ICCV 2019: 4672-4681
[20] Liu, Wei; Anguelov, Dragomir; Erhan, Dumitru; Szegedy, Christian; Reed, Scott; Fu, Cheng-Yang; Berg, Alexander C. SSD: Single Shot MultiBox Detector. ECCV 2016, Pt I, LNCS 9905: 21-37