Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention

Cited: 0
Authors
Huang, Qiang [1 ]
Hain, Thomas [1 ]
Affiliation
[1] Univ Sheffield, Dept Comp Sci, Sheffield, S Yorkshire, England
Source
INTERSPEECH 2019 | 2019
Keywords
mismatch detection; deep learning; attention
DOI
10.21437/Interspeech.2019-2125
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
In this paper, we propose to detect mismatches between speech and its transcription using deep neural networks. Although many speech-related applications assume that speech and transcription match, such errors are hard to avoid in practice, and training a model on mismatched data is likely to degrade performance. Instead of detecting errors by computing the distance between manual transcriptions and the text produced by a speech recogniser, we treat mismatch detection as a classification task and merge speech and transcription features with deep neural networks. To improve detection, we apply a cross-modal attention mechanism that learns the relevance between the features of the two modalities. To evaluate the effectiveness of our approach, we test it on Factored WSJCAM0 by randomly introducing three kinds of mismatch: word deletion, insertion, and substitution. To test its robustness, we train our models on a small number of samples and detect mismatches with varying numbers of words removed, inserted, or substituted. In our experiments, the results show that the detection performance of our approach is close to 80% on insertions and deletions and outperforms the baseline.
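The paper itself provides no implementation details here, but the abstract's core idea, attention computed across speech features and transcription embeddings followed by a match/mismatch classifier, can be illustrated with a minimal sketch. The following PyTorch example is hypothetical: all module names, dimensions, pooling choices, and the use of multi-head attention are assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' code) of cross-modal attention fusion
# for speech/transcription mismatch detection, assuming pre-extracted
# speech frame features and transcription token embeddings.
import torch
import torch.nn as nn


class CrossModalMismatchDetector(nn.Module):
    def __init__(self, speech_dim=40, text_dim=128, hidden=256, n_classes=2):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # Cross-modal attention: transcription tokens attend over speech frames.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, speech_feats, text_embeds):
        # speech_feats: (B, T_frames, speech_dim); text_embeds: (B, T_tokens, text_dim)
        s = self.speech_proj(speech_feats)
        t = self.text_proj(text_embeds)
        # Each transcription token gathers the speech frames most relevant to it.
        attended, _ = self.attn(query=t, key=s, value=s)
        # Pool both streams and classify match vs. mismatch.
        fused = torch.cat([attended.mean(dim=1), s.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Example: batch of 8 utterances, 200 frames of 40-dim acoustic features,
# 20 transcription tokens with 128-dim embeddings.
model = CrossModalMismatchDetector()
logits = model(torch.randn(8, 200, 40), torch.randn(8, 20, 128))
print(logits.shape)  # torch.Size([8, 2])
```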
Pages: 584-588
Page count: 5