Multi-Task Transformer with Adaptive Cross-Entropy Loss for Multi-Dialect Speech Recognition

Cited by: 11
Authors
Dan, Zhengjia [1]
Zhao, Yue [1]
Bi, Xiaojun [1]
Wu, Licheng [1]
Ji, Qiang [2]
Affiliations
[1] Minzu University of China, School of Information Engineering, Beijing 100081, People's Republic of China
[2] Rensselaer Polytechnic Institute, Department of Electrical, Computer, and Systems Engineering, Troy, NY 12180, USA
Funding
National Natural Science Foundation of China
Keywords
adaptive cross-entropy loss; multi-task Transformer; multi-dialect speech recognition; deep neural networks
DOI
10.3390/e24101429
Chinese Library Classification number
O4 [Physics]
Discipline classification code
0702
Abstract
At present, most multi-dialect speech recognition models are based on a hard-parameter-sharing multi-task structure, which makes it difficult to reveal how one task contributes to others. In addition, in order to balance multi-task learning, the weights of the multi-task objective function need to be manually adjusted. This makes multi-task learning very difficult and costly because it requires constantly trying various combinations of weights to determine the optimal task weights. In this paper, we propose a multi-dialect acoustic model that combines soft-parameter-sharing multi-task learning with Transformer, and introduce several auxiliary cross-attentions to enable the auxiliary task (dialect ID recognition) to provide dialect information for the multi-dialect speech recognition task. Furthermore, we use the adaptive cross-entropy loss function as the multi-task objective function, which automatically balances the learning of the multi-task model according to the loss proportion of each task during the training process. Therefore, the optimal weight combination can be found without any manual intervention. Finally, for the two tasks of multi-dialect (including low-resource dialect) speech recognition and dialect ID recognition, the experimental results show that, compared with single-dialect Transformer, single-task multi-dialect Transformer, and multi-task Transformer with hard parameter sharing, our method significantly reduces the average syllable error rate of Tibetan multi-dialect speech recognition and the character error rate of Chinese multi-dialect speech recognition.
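The abstract does not give the formula for the adaptive cross-entropy loss. As a minimal sketch only, the following assumes that each task's weight is its share of the summed task losses at the current training step; this weighting rule, the function names, and the task names are illustrative assumptions, not the definition used in the paper.

```python
def loss_proportion_weights(task_losses):
    """Hypothetical helper: weight each task by its share of the total loss.

    task_losses maps a task name to its current cross-entropy loss value.
    ASSUMPTION: weight_i = loss_i / sum(losses); the paper may define it differently.
    """
    total = sum(task_losses.values())
    if total == 0:
        # Fall back to equal weights when every loss is zero.
        n = len(task_losses)
        return {name: 1.0 / n for name in task_losses}
    return {name: loss / total for name, loss in task_losses.items()}


def combined_loss(task_losses):
    """Weighted sum of task losses using the proportion-based weights above."""
    weights = loss_proportion_weights(task_losses)
    return sum(weights[name] * loss for name, loss in task_losses.items())


# Illustrative values only: one main task and one auxiliary task.
losses = {"speech_recognition": 3.0, "dialect_id": 1.0}
weights = loss_proportion_weights(losses)
total = combined_loss(losses)
```

Under this assumed rule the weights are recomputed at every step from the current losses, so no weight combination has to be tuned by hand, which is the property the abstract describes.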
Pages: 12