Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments

Cited by: 7
Authors
Gao, Difei [1 ,2 ]
Wang, Ruiping [1 ,2 ,3 ]
Bai, Ziyi [1 ,2 ]
Chen, Xilin [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, CAS, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Beijing Acad Artificial Intelligence, Beijing 100084, Peoples R China
Source
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021
Funding
National Key Research and Development Program of China;
DOI
10.1109/ICCV48922.2021.00170
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual understanding goes well beyond the study of images or videos on the web. To achieve complex tasks in volatile situations, humans can deeply understand the environment, quickly perceive events happening around them, and continuously track objects' state changes, all of which remain challenging for current AI systems. To equip AI systems with the ability to understand dynamic ENVironments, we build a video Question Answering dataset named Env-QA. Env-QA contains 23K egocentric videos, where each video is composed of a series of events about exploring and interacting in the environment. It also provides 85K questions to evaluate the ability to understand the composition, layout, and state changes of the environment presented by the events in videos. Moreover, we propose a video QA model, the Temporal Segmentation and Event Attention network (TSEA), which introduces event-level video representation and corresponding attention mechanisms to better extract environment information and answer questions. Comprehensive experiments demonstrate the effectiveness of our framework and show the formidable challenges of Env-QA in terms of long-term state tracking, multi-event temporal reasoning, event counting, etc.
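The record gives no architectural details of TSEA beyond "event-level video representation and corresponding attention mechanisms". As a hedged illustration only, question-guided attention over per-event features might look like the following minimal sketch; the function names, the mean-pooled event features, and the dot-product scoring are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def event_attention(event_feats, question_vec):
    """Question-guided attention pooling over event-level features.

    event_feats:  (num_events, d) array, one pooled feature per event segment
    question_vec: (d,) array, encoded question
    Returns a single (d,) video representation weighted by question relevance.
    """
    scores = event_feats @ question_vec   # dot-product relevance of each event
    weights = softmax(scores)             # normalize to attention weights
    return weights @ event_feats          # relevance-weighted sum over events

# Toy usage: 4 events with 8-dimensional features.
rng = np.random.default_rng(0)
events = rng.normal(size=(4, 8))
question = rng.normal(size=8)
video_repr = event_attention(events, question)
assert video_repr.shape == (8,)
```

In such a scheme, questions about a particular state change (e.g. "what was opened after the fridge?") would up-weight the events whose features align with the question encoding before answer prediction.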
Pages: 1655-1665 (11 pages)