M3IL: Multi-Modal Meta-Imitation Learning

Cited by: 0
Authors
Zhang X. [1 ]
Matsushima T. [1 ]
Matsuo Y. [1 ]
Iwasawa Y. [1 ]
Affiliations
[1] The University of Tokyo, Japan
Funding
Japan Society for the Promotion of Science;
Keywords
deep learning; imitation learning; multi-modal; robot learning;
DOI
10.1527/tjsai.38-2_A-LB3
Abstract
Imitation Learning (IL) is expected to enable intelligent robots, since it allows users to teach robots various tasks easily. In particular, Few-Shot Imitation Learning (FSIL) aims to infer and quickly adapt to unseen tasks from a small amount of data. Although FSIL requires only a few demonstrations, the high cost of collecting demonstrations in IL remains a critical problem: whenever we want to teach the robot a new task, we must execute the task ourselves to specify it. Inspired by the fact that humans specify tasks with language instructions rather than by executing them, we propose a multi-modal FSIL setting. The model leverages both image and language information during training, and at test time uses either image and language together or language alone. We also propose Multi-Modal Meta-Imitation Learning (M3IL), which can infer the task from image or language information alone. M3IL outperforms the baseline in both the standard and the proposed settings. Our results show the effectiveness of M3IL and the importance of language instructions in the FSIL setting. © 2023, Japanese Society for Artificial Intelligence. All rights reserved.
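To make the multi-modal conditioning described in the abstract concrete, the following is a minimal sketch of a task-conditioned policy that can be driven by an image-based task embedding, a language-instruction embedding, or both. It is not the authors' implementation: the framework (PyTorch), layer sizes, the averaging fusion scheme, and all names such as MultiModalPolicy are illustrative assumptions.

```python
# Minimal sketch (illustrative only) of a policy conditioned on image and/or
# language task embeddings, as in the multi-modal FSIL setting described above.
import torch
import torch.nn as nn


class MultiModalPolicy(nn.Module):
    def __init__(self, img_dim=64, lang_dim=768, task_dim=32, action_dim=7):
        super().__init__()
        # Separate encoders map the image-based task embedding (e.g. from a
        # demonstration frame) and the language-instruction embedding (e.g.
        # from a pretrained sentence encoder) into a shared task space.
        self.img_task_enc = nn.Linear(img_dim, task_dim)
        self.lang_task_enc = nn.Linear(lang_dim, task_dim)
        # The policy head maps current observation features plus the task
        # embedding to an action.
        self.policy = nn.Sequential(
            nn.Linear(img_dim + task_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, obs_feat, img_task=None, lang_task=None):
        # At test time the task embedding can come from the image branch,
        # the language branch, or (averaged) from both.
        assert img_task is not None or lang_task is not None
        embs = []
        if img_task is not None:
            embs.append(self.img_task_enc(img_task))
        if lang_task is not None:
            embs.append(self.lang_task_enc(lang_task))
        task_emb = torch.stack(embs).mean(dim=0)
        return self.policy(torch.cat([obs_feat, task_emb], dim=-1))


# Usage: condition on language only, i.e. no demonstration executed at test time.
policy = MultiModalPolicy()
obs = torch.randn(1, 64)       # current observation features
instr = torch.randn(1, 768)    # language-instruction embedding
action = policy(obs, lang_task=instr)
```

The point of the sketch is only the interface: training can supervise both branches, while inference may drop the image-based task input and rely on the language instruction alone.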