Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention

Cited by: 0
Authors
Huang, Qiang [1 ]
Hain, Thomas [1 ]
Affiliations
[1] Univ Sheffield, Dept Comp Sci, Sheffield, S Yorkshire, England
Source
INTERSPEECH 2019 | 2019
Keywords
mismatch detection; deep learning; attention;
DOI
10.21437/Interspeech.2019-2125
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104; 100213;
Abstract
In this paper, we propose to detect mismatches between speech and transcriptions using deep neural networks. Although many speech-related applications assume that transcriptions contain no mismatches, such errors are hard to avoid in practice, and training a model on mismatched data can degrade its performance. In our work, instead of detecting errors by computing the distance between manual transcriptions and text strings obtained from a speech recogniser, we treat mismatch detection as a classification task and merge speech and transcription features using deep neural networks. To enhance detection ability, we employ a cross-modal attention mechanism that learns the relevance between the features obtained from the two modalities. To evaluate the effectiveness of our approach, we test it on Factored WSJCAM0 by randomly introducing three kinds of mismatch: word deletion, insertion, and substitution. To test its robustness, we train our models on a small number of samples and detect mismatches with different numbers of words removed, inserted, or substituted. In our experiments, the results show that detection performance with our approach is close to 80% on insertion and deletion and outperforms the baseline.
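As a rough illustration of the kind of cross-modal attention classifier the abstract describes, the sketch below lets transcription token embeddings attend over speech frames and classifies the pooled result into mismatch categories. This is not the authors' implementation; all layer names, dimensions, and the four-way label set (match, deletion, insertion, substitution) are assumptions for illustration only.

```python
# Minimal sketch (assumed architecture, not the paper's code): cross-modal
# attention between speech frames and transcription tokens, followed by a
# mismatch classifier. Sizes and label set are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalMismatchDetector(nn.Module):
    def __init__(self, speech_dim=40, text_dim=300, hidden=256, num_classes=4):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.speech_proj = nn.Linear(speech_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # Cross-modal attention: transcription tokens attend over speech frames.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Classifier over the pooled, attended representation.
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, speech_feats, text_feats):
        # speech_feats: (batch, frames, speech_dim), e.g. filterbank features
        # text_feats:   (batch, tokens, text_dim),  e.g. word embeddings
        s = self.speech_proj(speech_feats)
        t = self.text_proj(text_feats)
        # Each transcription token queries the speech frames it should align to.
        attended, _ = self.cross_attn(query=t, key=s, value=s)
        # Mean-pool over tokens and predict match / deletion / insertion / substitution.
        return self.classifier(attended.mean(dim=1))


# Toy usage with random tensors.
model = CrossModalMismatchDetector()
logits = model(torch.randn(2, 500, 40), torch.randn(2, 20, 300))
print(logits.shape)  # torch.Size([2, 4])
```

The key design point this sketch tries to capture is that the text side acts as the query while the speech side provides keys and values, so the attention weights express how well each transcribed word is supported by the audio; a word with no matching acoustic evidence (or audio with no matching word) is what the classifier is meant to flag.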
Pages: 584-588
Page count: 5