Multi-channel multi-speaker transformer for speech recognition

Cited by: 1
Authors
Guo Yifan [1 ]
Tian Yao [1 ]
Suo Hongbin [1 ]
Wan Yulong [1 ]
Affiliations
[1] OPPO, Data & AI Engineering Systems, Beijing, People's Republic of China
Source
INTERSPEECH 2023 | 2023
Keywords
multi-channel ASR; multi-speaker ASR; transformer;
DOI
10.21437/Interspeech.2023-257
Chinese Library Classification
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
With the development of teleconferencing and in-vehicle voice assistants, far-field multi-speaker speech recognition has become a hot research topic. Recently, the multi-channel transformer (MCT) was proposed, demonstrating the transformer's ability to model far-field acoustic environments. However, MCT cannot encode high-dimensional acoustic features for each speaker from mixed input audio because of the interference between speakers. Motivated by this limitation, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR in this paper. Experiments on the SMS-WSJ benchmark show that M2Former outperforms the neural beamformer, MCT, dual-path RNN with transform-average-concatenate, and multi-channel deep clustering based end-to-end systems by 9.2%, 14.3%, 24.9%, and 52.2%, respectively, in terms of relative word error rate reduction.
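The abstract reports its gains as relative word error rate reduction (WERR). As a minimal sketch of how that metric is computed (the WER values below are hypothetical, chosen only to reproduce the 9.2% margin reported over the neural beamformer; they are not taken from the paper):

```python
def relative_werr(baseline_wer: float, system_wer: float) -> float:
    """Relative word error rate reduction (%) of a system over a baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Hypothetical illustration: a baseline at 10.0% WER and a system at 9.08% WER
# correspond to a 9.2% relative reduction.
print(round(relative_werr(10.0, 9.08), 1))  # → 9.2
```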
Pages: 4918-4922
Number of pages: 5
References
32 items in total
[1]  
Bolla M., 1991, Relations between spectral and classification properties of multigraphs (Technical Report No. DIMACS-91-27)
[2]  
Chang F.-J., 2021, MULTICHANNEL TRANSFO
[3] Chang, Feng-Ju; Radfar, Martin; Mouchtaris, Athanasios; King, Brian; Kunzmann, Siegfried. End-to-End Multi-Channel Transformer for Speech Recognition [J]. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 5884-5888.
[4]  
Chang XK, 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), p. 237. DOI: 10.1109/ASRU46091.2019.9003986
[5]  
Dong LH, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 5884. DOI: 10.1109/ICASSP.2018.8462506
[6]  
Drude L., 2019, SMS WSJ DATABASE PER
[7] Drude, Lukas; Haeb-Umbach, Reinhold. Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings [J]. 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), Vols 1-6: Situated Interaction, 2017: 2650-2654.
[8] Fu, Szu-Wei; Wang, Tao-Wei; Tsao, Yu; Lu, Xugang; Kawai, Hisashi. End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(09): 1570-1584.
[9]  
Graves A., 2006, Proceedings of the 23rd International Conference on Machine Learning, p. 369. DOI: 10.1145/1143844.1143891
[10]  
Gu RZ, 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 7319. DOI: 10.1109/ICASSP40776.2020.9053092