Video Classification via Weakly Supervised Sequence Modeling

Cited by: 10
Authors
Liu, Jingjing [1]
Chen, Chao [2]
Zhu, Yan [1]
Liu, Wei [3]
Metaxas, Dimitris N. [1]
Affiliations
[1] Rutgers, Dept Comp Sci, Piscataway Township, NJ 08854 USA
[2] CUNY Queens Coll, Dept Comp Sci, Flushing, NY 11367 USA
[3] Didi Res, Beijing 100085, Peoples R China
Keywords
Video classification; Gesture; Action; Weakly supervised; Sequence modeling; Multiple-instance learning (MIL); Conditional Random Fields (CRFs); INSTANCE; CATEGORIZATION; SEGMENTATION;
DOI
10.1016/j.cviu.2015.10.012
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional approaches to video classification treat the entire video clip as one data instance. They extract visual features from video frames, which are then quantized (e.g., by K-means) and pooled (e.g., by average pooling) to produce a single feature vector. Such holistic representations of videos are then used as inputs to a classifier. Despite its efficiency, a global, aggregate feature representation unavoidably brings in redundant and noisy information from the background and unrelated video frames, which sometimes overwhelms the targeted visual patterns. In addition, temporal correlations between consecutive video frames, which may be the key indicator of an action or event, are ignored in both training and testing. To address these issues, we propose Weakly Supervised Sequence Modeling (WSSM), a novel framework that seamlessly combines multiple-instance learning (MIL) and a Conditional Random Field (CRF) model. Our model treats each entire video as a bag and each video segment as an instance. In our framework, the salient local patterns for different video categories are explored by MIL, and intrinsic temporal dependencies between instances are explicitly exploited using the powerful chain CRF model. In the training stage, we design a novel conditional likelihood formulation that requires annotation only at the video level. This likelihood can be maximized using an alternating optimization method. The training algorithm is guaranteed to converge and is very efficient. In the testing stage, videos are classified by the learned CRF model. The proposed WSSM algorithm outperforms other MIL-based approaches in both accuracy and efficiency on synthetic data and realistic videos for gesture and action classification. (C) 2015 Elsevier Inc. All rights reserved.
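Illustrative sketch (not taken from the paper): the abstract contrasts a holistic representation, where quantized frame features are pooled into one vector per video, with a bag-of-instances view, where each video segment is an instance and a chain model links neighbouring instances. The Python below shows both representations and a generic Viterbi decoder over a chain of instance labels. All function names, the toy codebook, and the score matrices are assumptions made for illustration; the paper's actual likelihood and training procedure are not reproduced here.

```python
import numpy as np

def quantize(frames, codebook):
    """Assign each frame feature to its nearest codeword (e.g. K-means centres)."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def pooled_histogram(frames, codebook):
    """Average-pool quantized frames into one normalized histogram."""
    words = quantize(frames, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def video_to_bag(frames, codebook, segment_len):
    """Split a video into fixed-length segments; each segment is one instance."""
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames), segment_len)]
    return np.stack([pooled_histogram(s, codebook) for s in segments])

def viterbi(unary, pairwise):
    """Highest-scoring label sequence for a chain model.

    unary[t, k]    : score of label k at instance t (hypothetical values)
    pairwise[i, j] : score of moving from label i to label j
    """
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + pairwise      # cand[i, j]: previous i, current j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + unary[t]
    labels = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t][labels[-1]]))
    return labels[::-1]

# Toy data: six 1-D frame features and a two-word codebook (assumed values).
codebook = np.array([[0.0], [10.0]])
frames = np.array([[0.1], [0.2], [9.8], [9.9], [0.0], [10.1]])

holistic = pooled_histogram(frames, codebook)          # one vector per video
bag = video_to_bag(frames, codebook, segment_len=2)    # one row per segment

unary = np.array([[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
smooth = np.array([[1.0, -1.0], [-1.0, 1.0]])          # rewards staying in a label
labels_smooth = viterbi(unary, smooth)
labels_independent = viterbi(unary, np.zeros((2, 2)))
```

In this toy case the holistic histogram averages away the fact that the second segment looks different from the first, while the bag keeps one row per segment. The pairwise term shows how temporal dependencies can override a weak per-instance preference.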
Pages: 79-87
Page count: 9