ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Cited by: 204
Authors
Shridhar, Mohit [1 ]
Thomason, Jesse [1 ]
Gordon, Daniel [1 ]
Bisk, Yonatan [1 ,2 ,3 ]
Han, Winson [3 ]
Mottaghi, Roozbeh [1 ,3 ]
Zettlemoyer, Luke [1 ]
Fox, Dieter [1 ,4 ]
Affiliations
[1] Univ Washington, Paul G Allen Sch Comp Sci & Engn, Seattle, WA 98195 USA
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA USA
[3] Allen Inst AI, Seattle, WA USA
[4] NVIDIA, Santa Clara, CA USA
Source
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020
DOI
10.1109/CVPR42600.2020.01075
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
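As the abstract describes, each ALFRED demonstration pairs a high-level goal and step-by-step low-level instructions with an expert action sequence in an interactive environment. A minimal sketch of such a record is below; the field names (`goal`, `instructions`, `actions`) and the action strings are illustrative assumptions, not ALFRED's actual annotation schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Directive:
    """One language annotation paired with an expert demonstration.
    Field names are hypothetical, chosen only to mirror the abstract."""
    goal: str                # high-level goal, e.g. "Rinse off a mug ..."
    instructions: List[str]  # low-level step-by-step instructions
    actions: List[str]       # expert action sequence in the simulator

example = Directive(
    goal="Rinse off a mug and place it in the coffee maker.",
    instructions=[
        "Walk to the coffee maker on the right.",
        "Pick up the mug from the counter.",
    ],
    actions=["MoveAhead", "RotateRight", "PickupObject", "PutObject"],
)

# A model for this benchmark maps (goal, instructions, egocentric frames)
# to an action sequence; here we just inspect the expert trace.
print(len(example.actions))
```

The sketch highlights why the benchmark is hard: the model must ground both levels of language into a long, ordered action sequence, and some actions (e.g. placing an object) cause non-reversible state changes.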
Pages: 10737 - 10746 (10 pages)