Multi-channel multi-speaker transformer for speech recognition

Cited by: 1
Authors
Guo Yifan [1 ]
Tian Yao [1 ]
Suo Hongbin [1 ]
Wan Yulong [1 ]
Affiliations
[1] OPPO, Data & AI Engineering Systems, Beijing, People's Republic of China
Source
INTERSPEECH 2023 | 2023
Keywords
multi-channel ASR; multi-speaker ASR; transformer;
DOI
10.21437/Interspeech.2023-257
Chinese Library Classification
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
With the development of teleconferencing and in-vehicle voice assistants, far-field multi-speaker speech recognition has become a hot research topic. Recently, the multi-channel transformer (MCT) was proposed, demonstrating the transformer's ability to model far-field acoustic environments. However, MCT cannot encode high-dimensional acoustic features for each speaker from mixed input audio because of the interference between speakers. Motivated by this limitation, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR in this paper. Experiments on the SMS-WSJ benchmark show that M2Former outperforms the neural beamformer, MCT, dual-path RNN with transform-average-concatenate, and multi-channel deep clustering based end-to-end systems by 9.2%, 14.3%, 24.9%, and 52.2%, respectively, in terms of relative word error rate reduction.
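The abstract reports its gains as relative word error rate reduction (WERR). As a minimal sketch of how that metric is computed (the WER values below are hypothetical, chosen only to reproduce the 9.2% margin reported over the neural beamformer; they are not taken from the paper):

```python
def relative_werr(baseline_wer: float, system_wer: float) -> float:
    """Relative word error rate reduction (%) of a system over a baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Hypothetical illustration: a baseline at 10.0% WER and a system at 9.08% WER
# correspond to a 9.2% relative reduction.
print(round(relative_werr(10.0, 9.08), 1))  # → 9.2
```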
Pages: 4918-4922
Number of pages: 5
References
32 items in total
[1]  
Bolla M., 1991, Relations between spectral and classification properties of multigraphs (Technical Report No. DIMACS-91-27)
[2]  
Chang F.-J., 2021, MULTICHANNEL TRANSFO
[3] Chang, Feng-Ju; Radfar, Martin; Mouchtaris, Athanasios; King, Brian; Kunzmann, Siegfried. End-to-End Multi-Channel Transformer for Speech Recognition [J]. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 5884-5888.
[4]  
Chang XK, 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), p. 237. DOI: 10.1109/ASRU46091.2019.9003986
[5]  
Dong LH, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 5884. DOI: 10.1109/ICASSP.2018.8462506
[6]  
Drude L., 2019, SMS WSJ DATABASE PER
[7] Drude, Lukas; Haeb-Umbach, Reinhold. Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings [J]. 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), Vols 1-6: Situated Interaction, 2017: 2650-2654.
[8] Fu, Szu-Wei; Wang, Tao-Wei; Tsao, Yu; Lu, Xugang; Kawai, Hisashi. End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(09): 1570-1584.
[9]  
Graves A., 2006, Proceedings of the 23rd International Conference on Machine Learning, p. 369. DOI: 10.1145/1143844.1143891
[10]  
Gu RZ, 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 7319. DOI: 10.1109/ICASSP40776.2020.9053092