Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models

Cited by: 403
Authors
Wang, Xiaogang [1]
Ma, Xiaoxu [1]
Grimson, W. Eric L. [1]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
Keywords
Hierarchical Bayesian model; visual surveillance; activity analysis; abnormality detection; video segmentation; motion segmentation; clustering; Dirichlet process; Gibbs sampling; variational inference; VIDEO; TRACKING
DOI
10.1109/TPAMI.2008.87
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose a novel unsupervised learning framework to model activities and interactions in crowded and complicated scenes. Under our framework, hierarchical Bayesian models are used to connect three elements in visual surveillance: low-level visual features, simple "atomic" activities, and interactions. Atomic activities are modeled as distributions over low-level visual features, and multiagent interactions are modeled as distributions over atomic activities. These models are learned in an unsupervised way. Given a long video sequence, moving pixels are clustered into different atomic activities and short video clips are clustered into different interactions. In this paper, we propose three hierarchical Bayesian models: the Latent Dirichlet Allocation (LDA) mixture model, the Hierarchical Dirichlet Processes (HDP) mixture model, and the Dual Hierarchical Dirichlet Processes (Dual-HDP) model. They advance existing topic models, such as LDA [1] and HDP [2]. Directly using existing LDA and HDP models under our framework, only moving pixels can be clustered into atomic activities. Our models can cluster both moving pixels and video clips into atomic activities and into interactions. The LDA mixture model assumes that it is already known how many different types of atomic activities and interactions occur in the scene. The HDP mixture model automatically decides the number of categories of atomic activities. The Dual-HDP automatically decides the numbers of categories of both atomic activities and interactions. Our data sets are challenging video sequences from crowded traffic scenes and train station scenes with many kinds of activities co-occurring. 
Without tracking and human labeling effort, our framework completes many challenging visual surveillance tasks of broad interest such as: 1) discovering and providing a summary of typical atomic activities and interactions occurring in the scene, 2) segmenting long video sequences into different interactions, 3) segmenting motions into different activities, 4) detecting abnormality, and 5) supporting high-level queries on activities and interactions. In our work, these surveillance problems are formulated in a transparent, clean, and probabilistic way compared with the ad hoc nature of many existing approaches.
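The two-level structure described in the abstract (interactions as distributions over atomic activities, atomic activities as distributions over low-level visual features) can be illustrated with a small generative sketch. This is not the authors' implementation: it assumes an LDA-mixture-style process in which each video clip first picks an interaction cluster, then draws activity proportions from that cluster's Dirichlet prior, then emits quantized "visual words" for its moving pixels. All names (`generate_clip`, `cluster_alphas`, `topics`) and all numeric values are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    # Draw from a Dirichlet by normalizing independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    # Return an index drawn with the given probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_clip(n_words, cluster_weights, cluster_alphas, topics, rng):
    # 1) Pick the interaction (clip-level cluster).
    c = sample_categorical(cluster_weights, rng)
    # 2) Draw this clip's mixture over atomic activities from the
    #    Dirichlet prior attached to the chosen interaction.
    pi = sample_dirichlet(cluster_alphas[c], rng)
    assignments, words = [], []
    for _ in range(n_words):
        # 3) Each moving pixel picks an atomic activity ...
        z = sample_categorical(pi, rng)
        # 4) ... and emits a visual word from that activity.
        x = sample_categorical(topics[z], rng)
        assignments.append(z)
        words.append(x)
    return c, pi, assignments, words


rng = random.Random(0)
V, K, L = 12, 3, 2  # assumed: vocabulary size, activities, interactions
# Each atomic activity is a distribution over the V visual words.
topics = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]
cluster_weights = [0.5, 0.5]
# Interaction 0 favors activities 0 and 1; interaction 1 favors activity 2.
cluster_alphas = [[5.0, 5.0, 0.1], [0.1, 0.1, 5.0]]

interaction, proportions, activities, words = generate_clip(
    50, cluster_weights, cluster_alphas, topics, rng
)
print("interaction:", interaction)
print("first words:", words[:10])
```

Learning runs this process in reverse: given only the observed words per clip, inference (the keywords mention Gibbs sampling and variational inference) recovers the activity assignment of each moving pixel and the interaction label of each clip.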
Pages: 539-555
Page count: 17
Related papers
43 in total
[1] [Anonymous], P INT C COMP VIS
[2] [Anonymous], 2005, 2005 IEEE COMP SOC C, DOI 10.1109/CVPR.2005.16
[3] [Anonymous], 2006, J AM STAT ASS
[4] [Anonymous], 2005, P INT C COMP VIS
[5] Blei, DM; Ng, AY; Jordan, MI. Latent Dirichlet allocation [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5): 993-1022
[6] Brand, M; Kettnaker, V. Discovery and segmentation of activities in video [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2000, 22 (08): 844-851
[7] Davis J., 1997, P IEEE INT C COMP VI
[8] Dhillon, IS; Modha, DS. Concept decompositions for large sparse text data using clustering [J]. MACHINE LEARNING, 2001, 42 (1-2): 143-175
[9] DHILLON IS, 2001, P ACM SPEC INT GROUP
[10] FERGUSON, TS. BAYESIAN ANALYSIS OF SOME NONPARAMETRIC PROBLEMS [J]. ANNALS OF STATISTICS, 1973, 1 (02): 209-230