YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition

Cited by: 314

Authors:

Guadarrama, Sergio [1]

Krishnamoorthy, Niveda [2]

Malkarnenkar, Girish [2]

Venugopalan, Subhashini [2]

Mooney, Raymond [2]

Darrell, Trevor [3]

Saenko, Kate [4]

Affiliations:

[1] Univ Calif Berkeley, Berkeley, CA 94720 USA

[2] UT Austin, Austin, TX USA

[3] Univ Calif Berkeley, ICSI, Berkeley, CA 94720 USA

[4] UMass Lowell, Lowell, MA USA

Source:

2013 IEEE International Conference on Computer Vision (ICCV) | 2013

DOI:

10.1109/ICCV.2013.337

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to "fill in" novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
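The idea of falling back to "a less specific answer" through a semantic hierarchy can be illustrated with a minimal sketch. This is not the authors' implementation: the toy verb hierarchy, the function names (`ancestors`, `node_confidence`, `back_off`), the use of summed leaf probabilities as a node's confidence, and the threshold rule are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "slice": "cut",
    "chop": "cut",
    "cut": "prepare",
    "stir": "prepare",
    "prepare": None,
}


def ancestors(node: str) -> List[str]:
    """Return the path from `node` up to the root, starting with `node`."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def node_confidence(node: str, leaf_probs: Dict[str, float]) -> float:
    """Sum the probability of every predicted leaf at or below `node`."""
    return sum(p for leaf, p in leaf_probs.items() if node in ancestors(leaf))


def back_off(leaf_probs: Dict[str, float], threshold: float) -> str:
    """Pick the most specific label whose confidence reaches `threshold`.

    Starts at the top-scoring leaf and climbs toward the root (assumed rule).
    """
    best_leaf = max(leaf_probs, key=leaf_probs.get)
    path = ancestors(best_leaf)
    for node in path:
        if node_confidence(node, leaf_probs) >= threshold:
            return node
    return path[-1]  # nothing is confident enough: return the root


if __name__ == "__main__":
    probs = {"slice": 0.45, "chop": 0.40, "stir": 0.15}
    print(back_off(probs, threshold=0.8))
```

In this toy setup, neither "slice" nor "chop" is confident enough on its own at a threshold of 0.8, but their shared parent "cut" is, so the more general verb is chosen. Lowering the threshold yields a more specific answer; raising it pushes the answer toward the root.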

Citation

Pages: 2712-2719

Page count: 8
