YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition

Cited by: 314
Authors
Guadarrama, Sergio [1]
Krishnamoorthy, Niveda [2]
Malkarnenkar, Girish [2]
Venugopalan, Subhashini [2]
Mooney, Raymond [2]
Darrell, Trevor [3]
Saenko, Kate [4]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] UT Austin, Austin, TX USA
[3] Univ Calif Berkeley, ICSI, Berkeley, CA 94720 USA
[4] UMass Lowell, Lowell, MA USA
Source
2013 IEEE International Conference on Computer Vision (ICCV) | 2013
DOI
10.1109/ICCV.2013.337
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to "fill in" novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
Pages: 2712-2719
Page count: 8