Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module

Cited by: 27
Authors
Liu, Yihe [1 ,2 ]
Yuan, Ziqi [1 ,3 ]
Mao, Huisheng [1 ,3 ]
Liang, Zhiyun [1 ,4 ]
Yang, Wanqiuyue [1 ,5 ]
Qiu, Yuanzhe [1 ,2 ]
Cheng, Tie [1 ,5 ]
Li, Xiaoteng [1 ,2 ]
Xu, Hua [1 ]
Gao, Kai [2 ]
Affiliations
[1] Tsinghua Univ, State Key Lab Intelligent Technol & Syst, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Hebei Univ Sci & Technol, Sch Informat Sci & Engn, Shijiazhuang, Hebei, Peoples R China
[3] Beijing Natl Res Ctr Informat Sci & Technol (BNRist), Beijing, Peoples R China
[4] China Agr Univ, Beijing, Peoples R China
[5] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022 | 2022
Funding
National Natural Science Foundation of China
Keywords
multimodal sentiment analysis; dataset; semi-supervised machine learning; modality mixup;
DOI
10.1145/3536221.3556630
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Multimodal sentiment analysis (MSA), which aims to improve text-based sentiment analysis with associated acoustic and visual modalities, is an emerging research area due to its potential applications in Human-Computer Interaction (HCI). However, existing research observes that the acoustic and visual modalities contribute much less than the textual modality, a phenomenon termed text predominance. In this work, we therefore emphasize making non-verbal cues matter for the MSA task. First, from the resource perspective, we present the CH-SIMS v2.0 dataset, an extension and enhancement of CH-SIMS. Compared with the original dataset, CH-SIMS v2.0 doubles its size with another 2121 refined video segments carrying both unimodal and multimodal annotations, and collects 10161 unlabelled raw video segments with rich acoustic and visual emotion-bearing context to highlight non-verbal cues for sentiment prediction. Second, from the model perspective, benefiting from the unimodal annotations and the unlabelled data in CH-SIMS v2.0, the Acoustic Visual Mixup Consistent (AV-MC) framework is proposed. Its modality mixup module can be regarded as an augmentation strategy that mixes the acoustic and visual modalities from different videos. By pairing the text with non-verbal contexts it has never observed, the model learns to attend to different non-verbal cues for sentiment prediction. Our evaluations demonstrate that both CH-SIMS v2.0 and the AV-MC framework enable further research on discovering emotion-bearing acoustic and visual cues and pave the path to interpretable end-to-end HCI applications in real-world scenarios. The full dataset and code are available at https://github.com/thuiar/ch-sims-v2.
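To make the modality mixup idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' released implementation): acoustic and visual feature sequences in a batch are interpolated with those of randomly permuted partner samples using a Beta-distributed coefficient, while the text is left untouched. The feature shapes, the Beta prior alpha, and the label-interpolation rule are illustrative assumptions.

```python
# Minimal sketch of modality mixup: blend each sample's acoustic and visual
# features with another sample's (a random permutation of the batch) so the
# same text is seen with unobserved non-verbal contexts. All shapes and the
# label-mixing rule below are assumptions for illustration only.
import torch


def av_mixup(text, audio, vision, labels, alpha: float = 0.5):
    """Mix acoustic/visual modalities across samples; keep text fixed.

    text, audio, vision: (batch, seq_len, dim) feature tensors
    labels: (batch,) sentiment scores
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(audio.size(0))  # partner samples to mix with

    mixed_audio = lam * audio + (1.0 - lam) * audio[perm]
    mixed_vision = lam * vision + (1.0 - lam) * vision[perm]
    # One plausible target rule (assumption): interpolate sentiment scores.
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]

    return text, mixed_audio, mixed_vision, mixed_labels


if __name__ == "__main__":
    # Random tensors standing in for extracted text/acoustic/visual features.
    t = torch.randn(8, 39, 768)   # e.g. BERT token features
    a = torch.randn(8, 400, 25)   # e.g. acoustic low-level descriptors
    v = torch.randn(8, 55, 177)   # e.g. facial features
    y = torch.rand(8) * 2 - 1     # sentiment scores in [-1, 1]
    _, a_mix, v_mix, y_mix = av_mixup(t, a, v, y)
    print(a_mix.shape, v_mix.shape, y_mix.shape)
```

Keeping the text fixed while perturbing only the non-verbal streams is what pushes the model to rely on acoustic and visual evidence rather than the dominant textual signal.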
Pages: 247-258
Page count: 12