MULTI-DIALECT SPEECH RECOGNITION IN ENGLISH USING ATTENTION ON ENSEMBLE OF EXPERTS

Cited by: 13
Authors
Das, Amit [1 ]
Kumar, Kshitiz [1 ]
Wu, Jian [1 ]
Affiliations
[1] Microsoft Speech & Language Group, Redmond, WA 98052 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
multi-dialect; attention; mixture of experts; acoustic modeling; speech recognition; deep neural network
DOI
10.1109/ICASSP39728.2021.9413952
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
In the presence of a wide variety of dialects, training dialect-specific models for each dialect is a demanding task. Previous studies have explored training a single model that is robust across multiple dialects. These studies have used either multi-condition training, multi-task learning, end-to-end modeling, or ensemble modeling. In this study, we further explore a single model for multi-dialect speech recognition based on ensemble modeling. First, we build an ensemble of dialect-specific models (or experts). Then we linearly combine the outputs of the experts using attention weights generated by a long short-term memory (LSTM) network. For comparison, we train one model that jointly learns to recognize and classify dialects using multi-task learning and a second model using multi-condition training. We train all of these models on about 60,000 hours of speech data collected in American English, Canadian English, British English, and Australian English. Experimental results reveal that our best proposed model achieved an average 4.74% word error rate reduction (WERR) compared to a strong baseline model.
Pages: 6244-6248
Number of pages: 5
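
To illustrate the attention-on-ensemble-of-experts scheme summarized in the abstract, below is a minimal PyTorch sketch: dialect-specific experts score each frame, an LSTM produces per-frame attention weights over the experts, and the final posteriors are the attention-weighted linear combination of the expert outputs. All class names, layer choices (in particular the single-linear-layer experts), and dimensions are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class AttentionEnsemble(nn.Module):
    """Linearly combine dialect-expert outputs with per-frame attention
    weights produced by an LSTM (illustrative sketch, not the paper's code)."""

    def __init__(self, feat_dim, num_outputs, num_experts, lstm_dim=128):
        super().__init__()
        # Stand-in dialect experts; the paper uses pre-trained dialect-specific
        # acoustic models (here each expert is just a linear layer for brevity).
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, num_outputs) for _ in range(num_experts)]
        )
        # LSTM that maps acoustic features to one attention weight per expert
        # at every frame.
        self.attention_lstm = nn.LSTM(feat_dim, lstm_dim, batch_first=True)
        self.attention_proj = nn.Linear(lstm_dim, num_experts)

    def forward(self, feats):
        # feats: (batch, time, feat_dim)
        # Stack expert scores: (batch, time, num_outputs, num_experts)
        expert_out = torch.stack([expert(feats) for expert in self.experts], dim=-1)

        # Per-frame attention weights over the experts, normalized with softmax.
        h, _ = self.attention_lstm(feats)                      # (B, T, lstm_dim)
        attn = torch.softmax(self.attention_proj(h), dim=-1)   # (B, T, num_experts)

        # Attention-weighted linear combination of the expert outputs.
        return (expert_out * attn.unsqueeze(2)).sum(dim=-1)    # (B, T, num_outputs)


if __name__ == "__main__":
    model = AttentionEnsemble(feat_dim=80, num_outputs=9000, num_experts=4)
    feats = torch.randn(2, 50, 80)   # 2 utterances, 50 frames, 80-dim features
    print(model(feats).shape)        # torch.Size([2, 50, 9000])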