BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion

Cited by: 0
Authors
Deb, Ahana [1 ]
Nag, Sayan [2 ]
Mahapatra, Ayan [1 ]
Chattopadhyay, Soumitri [1 ]
Marik, Aritra [1 ]
Gayen, Pijush Kanti [1 ]
Sanyal, Shankha [1 ]
Banerjee, Archi [3 ]
Karmakar, Samir [1 ]
Affiliations
[1] Jadavpur Univ, Kolkata, India
[2] Univ Toronto, Toronto, ON, Canada
[3] IIT Kharagpur, Kharagpur, W Bengal, India
Source
INTERSPEECH 2023 | 2023
Keywords
speech act; multimodal fusion; transformer; low-resource language; emotion; expression; features
DOI
10.21437/Interspeech.2023-1146
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Spoken languages often utilise intonation, rhythm, intensity, and structure to communicate intention, which can be interpreted differently depending on the rhythm of the utterance. These speech acts form the foundation of communication and are unique in expression to each language. Recent attention-based models, which have demonstrated an ability to learn powerful representations from multilingual datasets, perform well in speech tasks and are well suited to modelling specific tasks in low-resource languages. Here, we develop a novel multimodal approach that combines two models, wav2vec2.0 for audio and MarianMT for text translation, through multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model, BeAts (Bengali speech acts recognition using Multimodal Attention Fusion), significantly outperforms both a unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data. Project page: https://soumitri2001.github.io/BeAts
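The abstract describes fusing a wav2vec2.0 audio encoder with a MarianMT text encoder via attention before classifying speech acts. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: the checkpoint names, the cross-attention-then-mean-pool fusion, and the number of speech-act classes are all assumptions for demonstration.

```python
# Illustrative sketch of multimodal attention fusion (NOT the BeAts code):
# audio features from wav2vec2.0 and text features from a MarianMT encoder
# are fused with cross-attention, then pooled for speech-act classification.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, MarianMTModel

class AttentionFusionClassifier(nn.Module):
    def __init__(self, num_acts: int = 4, dim: int = 512):  # num_acts is hypothetical
        super().__init__()
        # Checkpoint names are assumptions, not those used in the paper.
        self.audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        marian = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-bn-en")
        self.text_enc = marian.get_encoder()  # keep only the MarianMT encoder
        # Project both modalities into a shared fusion space.
        self.audio_proj = nn.Linear(self.audio_enc.config.hidden_size, dim)
        self.text_proj = nn.Linear(self.text_enc.config.d_model, dim)
        # Cross-attention: audio frames (queries) attend over text tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, num_acts)

    def forward(self, input_values, input_ids, attention_mask):
        a = self.audio_proj(self.audio_enc(input_values).last_hidden_state)
        t = self.text_proj(self.text_enc(input_ids=input_ids,
                                         attention_mask=attention_mask).last_hidden_state)
        fused, _ = self.cross_attn(query=a, key=t, value=t,
                                   key_padding_mask=~attention_mask.bool())
        # Mean-pool the fused audio frames and predict the speech act.
        return self.classifier(fused.mean(dim=1))
```

Using the MarianMT encoder over translated text is one plausible way to realise the paper's text branch; the exact fusion and pooling choices in BeAts may differ.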
Pages: 3392-3396
Number of pages: 5