Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering

Cited by: 77
Authors
Mallya, Arun [1 ]
Lazebnik, Svetlana [1 ]
Affiliations
[1] Univ Illinois, Champaign, IL 61801 USA
Source
COMPUTER VISION - ECCV 2016, PT I | 2016 / Vol. 9905
Keywords
Activity prediction; Deep networks; Visual Question Answering;
DOI
10.1007/978-3-319-46448-0_25
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision at the level of individual person instances, and a weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple-choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions, on person activity and person-object relationships, and show improvements over generic features trained on the ImageNet classification task.
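The two training devices named in the abstract, multiple instance learning over person instances and a class-weighted loss for unbalanced labels, can be sketched as follows. This is a minimal NumPy illustration under assumed shapes and names (`mil_pool`, `weighted_bce_loss`, `pos_weight` are illustrative choices, not the paper's actual implementation):

```python
import numpy as np

def mil_pool(instance_scores):
    """MIL pooling: an image-level label is treated as positive if at least
    one person instance supports it, so each label's image-level score is the
    max over per-instance scores. Shape: (num_instances, num_labels)."""
    return instance_scores.max(axis=0)

def weighted_bce_loss(image_scores, labels, pos_weight):
    """Class-weighted binary cross-entropy on sigmoid scores; rare positive
    labels are up-weighted by `pos_weight` to counter class imbalance.
    The exact weighting scheme here is an assumption for illustration."""
    p = 1.0 / (1.0 + np.exp(-image_scores))  # sigmoid probabilities
    eps = 1e-12                              # numerical safety for log
    loss = -(pos_weight * labels * np.log(p + eps)
             + (1.0 - labels) * np.log(1.0 - p + eps))
    return loss.mean()

# Example: 3 detected persons, 4 candidate activity labels.
scores = np.array([[ 2.0, -1.0, -3.0, 0.5],
                   [-1.0,  3.0, -2.0, 0.0],
                   [ 0.0, -0.5, -1.0, 1.5]])
image_scores = mil_pool(scores)            # max over the 3 person instances
labels = np.array([1.0, 1.0, 0.0, 0.0])    # image-level ground truth
loss = weighted_bce_loss(image_scores, labels, pos_weight=5.0)
```

Because the max is taken before the loss, the gradient flows only through the highest-scoring person instance for each label, which is how MIL sidesteps missing per-instance supervision.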
Pages: 414-428
Number of pages: 15
Related References (38 total)
[1] Agrawal P, 2014, LECT NOTES COMPUT SC, V8695, P329, DOI 10.1007/978-3-319-10584-0_22
[2] Andreas J., 2015, CoRR
[3] [Anonymous], 2015, arXiv:1511.05234
[4] [Anonymous], ACM ICMI
[5] [Anonymous], Simple baseline for visual question answering
[6] [Anonymous], 2015, Advances in Neural Information Processing Systems
[7] [Anonymous], 2014, Advances in Neural Information Processing Systems
[8] [Anonymous], CVPR
[9] Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D. VQA: Visual Question Answering. 2015 IEEE International Conference on Computer Vision (ICCV), 2015, p. 2425-2433
[10] Bell S., 2015, CoRR