Human emotion recognition from videos using spatio-temporal and audio features

Times cited: 37
Authors
Rashid, Munaf [1 ,2 ]
Abu-Bakar, S. A. R. [1 ]
Mokji, Musa [1 ]
Affiliations
[1] Univ Teknol Malaysia, Fac Elect Engn, Comp Vis Video & Image Proc Lab CVVIP, Skudai 81310, Johor Bahru, Malaysia
[2] Karachi Inst Econ & Technol, Coll Engn COE, Karachi 75190, Pakistan
Keywords
Human computer interface (HCI); Multimodal system; Human emotions; Support vector machines (SVM); Spatio-temporal features
DOI
10.1007/s00371-012-0768-y
Chinese Library Classification (CLC)
TP31 [Computer software]
Discipline codes
081202; 0835
Abstract
In this paper, we present a human emotion recognition system based on audio and spatio-temporal visual features. The system was tested on an audio-visual emotion data set containing subjects of both genders. Mel-frequency cepstral coefficient (MFCC) and prosodic features are first extracted from the emotional speech, while spatio-temporal features are extracted from the visual stream to capture facial expressions. Principal component analysis (PCA) is applied to reduce the dimensionality of the visual features while retaining 97% of the variance. Codebooks are then constructed for the audio and visual features in Euclidean space, and the resulting codeword-occurrence histograms are fed to an SVM classifier for each modality. The judgments of the individual classifiers are combined with the Bayes sum rule (BSR) as the final decision step. On a public data set, visual features alone yield an average recognition accuracy of 74.15%, audio features alone yield 67.39%, and combining both audio and visual features improves the overall accuracy to 80.27%.
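The fusion pipeline described in the abstract can be illustrated with a minimal sketch (not the authors' code): local descriptors for each modality are quantized against Euclidean k-means codebooks, the codeword-occurrence histograms train one SVM per modality, and the per-class posteriors are summed under the Bayes sum rule. The codebook size (64), the RBF kernel, and the toy data below are assumptions made only for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def build_histograms(descriptor_sets, codebook):
    # Quantize each clip's descriptors and return a normalized
    # codeword-occurrence histogram per clip.
    k = codebook.n_clusters
    hists = []
    for descriptors in descriptor_sets:
        words = codebook.predict(descriptors)
        hist = np.bincount(words, minlength=k).astype(float)
        hists.append(hist / max(hist.sum(), 1.0))
    return np.vstack(hists)

rng = np.random.default_rng(0)
n_clips, n_classes = 60, 6                    # six emotion classes (assumed)
labels = rng.integers(0, n_classes, n_clips)

# Toy per-clip local descriptors standing in for spatio-temporal visual
# features and MFCC/prosodic audio frames.
visual = [rng.normal(labels[i], 1.0, (30, 100)) for i in range(n_clips)]
audio = [rng.normal(labels[i], 1.0, (40, 13)) for i in range(n_clips)]

# PCA on the pooled visual descriptors, retaining 97% of the variance.
pca = PCA(n_components=0.97).fit(np.vstack(visual))
visual = [pca.transform(v) for v in visual]

# Euclidean-space codebooks (k-means) for each modality; k = 64 is assumed.
vis_codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(np.vstack(visual))
aud_codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(np.vstack(audio))

X_vis = build_histograms(visual, vis_codebook)
X_aud = build_histograms(audio, aud_codebook)

# One SVM per modality; probability outputs are needed for the sum rule.
svm_vis = SVC(kernel="rbf", probability=True, random_state=0).fit(X_vis, labels)
svm_aud = SVC(kernel="rbf", probability=True, random_state=0).fit(X_aud, labels)

# Bayes sum rule: add the per-modality class posteriors and take the arg max.
posteriors = svm_vis.predict_proba(X_vis) + svm_aud.predict_proba(X_aud)
fused = svm_vis.classes_[np.argmax(posteriors, axis=1)]
print("Fused accuracy on the toy data:", (fused == labels).mean())

Summing (or equivalently averaging) the posteriors treats both modalities as equally reliable; any weighting of the audio and visual classifiers would be a further design choice not specified in the abstract.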
Pages: 1269-1275
Page count: 7