Conversational Speech Recognition by Learning Audio-Textual Cross-Modal Contextual Representation

Cited by: 2
Authors
Wei, Kun [1 ]
Li, Bei [2 ]
Lv, Hang [1 ]
Lu, Quan [3 ]
Jiang, Ning [3 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp, Xian 710072, Peoples R China
[2] Northeastern Univ, Sch Comp Sci & Engn, Shenyang 110167, Peoples R China
[3] Mashang Consumer Finance Co Ltd, Chongqing 401121, Peoples R China
Keywords
Speech recognition; Feature extraction; Decoding; Context modeling; Training; Oral communication; Data mining; Conversational ASR; Cross-modal representation; Context; Conformer; Latent variational; ASR
DOI
10.1109/TASLP.2024.3389630
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational-level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains longer context without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
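The abstract's "modal-level mask input" can be illustrated with a minimal sketch: during training, one entire modality's historical context (speech or text) is randomly zeroed before fusion, so the extractor learns to recover conversational context from either modality alone. The function name, masking probability, and tensor shapes below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def modal_level_mask(speech_ctx, text_ctx, p_mask=0.3, rng=rng):
    """Hypothetical sketch of a modal-level mask: concatenate the
    historical speech and text context embeddings, then with
    probability p_mask zero out one whole modality so the downstream
    extractor cannot rely on it."""
    fused = np.concatenate([speech_ctx, text_ctx], axis=0)
    r = rng.random()
    if r < p_mask:                        # mask the speech modality
        fused[: len(speech_ctx)] = 0.0
    elif r < 2 * p_mask:                  # mask the text modality
        fused[len(speech_ctx):] = 0.0
    return fused                          # else: both modalities kept

# Toy historical context: 4 speech-frame embeddings, 3 text-token
# embeddings, each of dimension 8.
speech = rng.standard_normal((4, 8))
text = rng.standard_normal((3, 8))
fused = modal_level_mask(speech, text)
print(fused.shape)  # (7, 8)
```

In this reading, using the (pre-trained) speech representations of previous turns, rather than their decoded transcripts, is what avoids the explicit error propagation the abstract mentions.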
Pages: 2432-2444
Page count: 13