YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition

Cited by: 298
Authors
Guadarrama, Sergio [1]
Krishnamoorthy, Niveda [2]
Malkarnenkar, Girish [2]
Venugopalan, Subhashini [2]
Mooney, Raymond [2]
Darrell, Trevor [3]
Saenko, Kate [4]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] UT Austin, Austin, TX USA
[3] Univ Calif Berkeley, ICSI, Berkeley, CA 94720 USA
[4] UMass Lowell, Lowell, MA USA
Source
2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) | 2013
DOI
10.1109/ICCV.2013.337
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to "fill in" novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
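The back-off behaviour described in the abstract (falling back to a less specific but still plausible answer) can be illustrated with a minimal sketch. This is not the authors' implementation: the hand-written hierarchy, the class names, the leaf scores, the rule that an internal node's confidence is the sum of its leaf descendants' scores, and the threshold are all assumptions made for illustration. The paper learns its hierarchies from data.

```python
from typing import Dict, Optional

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "guitar": "instrument",
    "piano": "instrument",
    "ball": "object",
    "instrument": "object",
    "object": None,
}


def node_confidence(node: str, leaf_probs: Dict[str, float],
                    parent: Dict[str, Optional[str]]) -> float:
    """Sum the scores of every leaf that is `node` or lies beneath it."""
    total = 0.0
    for leaf, prob in leaf_probs.items():
        current: Optional[str] = leaf
        while current is not None:
            if current == node:
                total += prob
                break
            current = parent[current]
    return total


def back_off(leaf_probs: Dict[str, float],
             parent: Dict[str, Optional[str]],
             threshold: float) -> str:
    """Start at the best-scoring leaf and climb until confident enough."""
    node = max(leaf_probs, key=leaf_probs.get)
    while (parent[node] is not None
           and node_confidence(node, leaf_probs, parent) < threshold):
        node = parent[node]
    return node


# Assumed classifier scores for the object slot of one clip.
scores = {"guitar": 0.45, "piano": 0.40, "ball": 0.15}
chosen = back_off(scores, PARENT, threshold=0.8)
```

With these assumed scores, neither "guitar" nor "piano" is confident enough on its own, so the sketch settles on their shared parent "instrument". A lower threshold keeps the specific leaf, and a higher one climbs to the root.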
Pages: 2712-2719
Number of pages: 8