Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention

Cited by: 0
Authors:
Huang, Qiang [1]
Hain, Thomas [1]
Affiliations:
[1] Univ Sheffield, Dept Comp Sci, Sheffield, S Yorkshire, England
Source:
INTERSPEECH 2019 | 2019
Keywords:
mismatch detection; deep learning; attention
DOI:
10.21437/Interspeech.2019-2125
Chinese Library Classification (CLC):
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes:
100104; 100213
Abstract:
In this paper, we propose to detect mismatches between speech and its transcription using deep neural networks. Although many speech-related applications assume that speech and transcriptions match, transcription errors are hard to avoid in practice, and training a model on mismatched data can degrade its performance. Instead of detecting errors by computing the distance between manual transcriptions and text strings obtained from a speech recogniser, we view mismatch detection as a classification task and merge speech and transcription features using deep neural networks. To enhance detection ability, we employ a cross-modal attention mechanism that learns the relevance between the features obtained from the two modalities. To evaluate the effectiveness of our approach, we test it on Factored WSJCAM0, into which we randomly inject three kinds of mismatch: word deletion, insertion, and substitution. To test its robustness, we train our models on a small number of samples and detect mismatches with varying numbers of words removed, inserted, or substituted. In our experiments, the results show that our approach achieves detection performance close to 80% on insertion and deletion and outperforms the baseline.
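The abstract describes fusing speech features with transcription features through cross-modal attention and treating mismatch detection as a binary classification task. The following PyTorch sketch illustrates that general idea only; the encoder types, feature dimensions, dot-product attention form, pooling, and all names (e.g. CrossModalMismatchDetector) are illustrative assumptions, not the paper's exact architecture.

    # Minimal sketch of cross-modal attention for speech/transcription
    # mismatch detection. Speech frames attend over transcription word
    # embeddings; the attended summary is fused with the speech
    # representation and a classifier predicts match vs. mismatch.
    # All dimensions and layer choices are assumed, not from the paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalMismatchDetector(nn.Module):
        def __init__(self, speech_dim=40, text_dim=128, hidden_dim=128):
            super().__init__()
            # Encode each modality into a shared hidden space (assumed GRUs).
            self.speech_enc = nn.GRU(speech_dim, hidden_dim, batch_first=True)
            self.text_enc = nn.GRU(text_dim, hidden_dim, batch_first=True)
            # Projections for scaled dot-product cross-modal attention.
            self.query = nn.Linear(hidden_dim, hidden_dim)
            self.key = nn.Linear(hidden_dim, hidden_dim)
            self.value = nn.Linear(hidden_dim, hidden_dim)
            # Classifier over the fused representation: match vs. mismatch.
            self.classifier = nn.Linear(2 * hidden_dim, 2)

        def forward(self, speech, text):
            # speech: (B, T_s, speech_dim); text: (B, T_t, text_dim)
            s, _ = self.speech_enc(speech)   # (B, T_s, H)
            t, _ = self.text_enc(text)       # (B, T_t, H)
            # Each speech frame queries the transcription tokens.
            q, k, v = self.query(s), self.key(t), self.value(t)
            scores = torch.bmm(q, k.transpose(1, 2)) / (q.size(-1) ** 0.5)
            attn = F.softmax(scores, dim=-1)           # (B, T_s, T_t)
            attended = torch.bmm(attn, v)              # (B, T_s, H)
            # Fuse speech features with their attended text summary, then pool.
            fused = torch.cat([s, attended], dim=-1).mean(dim=1)  # (B, 2H)
            return self.classifier(fused)              # logits: (B, 2)

    # Usage with random tensors standing in for acoustic features and
    # word embeddings (shapes are illustrative).
    model = CrossModalMismatchDetector()
    speech = torch.randn(4, 200, 40)   # 4 utterances, 200 frames, 40-dim
    text = torch.randn(4, 12, 128)     # 4 transcriptions, 12 embedded words
    logits = model(speech, text)       # (4, 2): match vs. mismatch scores

The attention matrix here lets every speech frame weight the transcription words by relevance, which is one plausible reading of how learning cross-modal relevance could expose inserted, deleted, or substituted words.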
Pages: 584-588
Number of pages: 5