Tibetan-Mandarin Bilingual Speech Recognition Based on End-to-End Framework

Cited by: 0
Authors
Wang, Qingnan [1]
Guo, Wu [1]
Chen, Peixin [1]
Song, Yan [1]
Affiliation
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
Source
2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC 2017) | 2017
Keywords
DOI
Not available
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
This paper addresses Tibetan-Mandarin bilingual speech recognition. Because the phoneme sets of the two languages differ greatly, it is difficult to find a universal phoneme set for a bilingual acoustic model (AM) in the conventional hidden Markov model (HMM) framework. An end-to-end framework based on the connectionist temporal classification (CTC) loss function is proposed to solve this problem by using the character, rather than the phoneme, as the modeling unit. However, sparseness of the modeling units is a difficult and unavoidable problem in CTC model training, particularly under low-resource conditions. This paper explores two methods to address it. First, different modeling units are selected: Tibetan characters and Mandarin non-tonal syllables are used as the CTC output units. Second, a noise-addition algorithm is applied to the bilingual part of the training corpus to augment the Mandarin speech. Experiments are carried out on the hybrid IFLYTEK Tibetan-Mandarin corpus, and the proposed methods yield clear improvements.
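The abstract does not say which noise type or signal-to-noise ratio the noise-addition step uses. As a minimal sketch of the general idea only, the following assumes white Gaussian noise mixed into a waveform at a chosen SNR; the function name `add_noise_at_snr` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def add_noise_at_snr(speech, snr_db, rng=None):
    """Return a copy of `speech` with white Gaussian noise added at `snr_db` dB.

    Assumption: white Gaussian noise and a fixed target SNR. The paper's
    actual noise source and mixing levels are not given in the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)

    signal_power = np.mean(speech ** 2)
    if signal_power == 0.0:
        # Silent input: an SNR is undefined, so return it unchanged.
        return speech.copy()

    noise = rng.standard_normal(speech.shape)
    noise_power = np.mean(noise ** 2)

    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise power.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise *= np.sqrt(target_noise_power / noise_power)
    return speech + noise
```

Applying such a function to each Mandarin utterance at a few different SNR values would produce additional noisy copies for training; which values are appropriate is a tuning choice that this record does not specify.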
Pages: 1214-1217
Number of pages: 4