Effective Exploitation of Posterior Information for Attention-Based Speech Recognition

Cited by: 1
Authors
Tang, Jian [1]
Hou, Junfeng [1]
Song, Yan [1]
Dai, Li-Rong [1]
McLoughlin, Ian [1,2]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei 230026, Peoples R China
[2] Univ Kent, Sch Comp, Canterbury CT2 7NZ, Kent, England
Source
IEEE ACCESS | 2020 / Vol. 8
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Training; Decoding; Speech recognition; Acoustics; Optimization; Task analysis; posterior attention; divergence penalty; exposure bias; alternate learning strategy; DIVERGENCE;
DOI
10.1109/ACCESS.2020.3001636
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology];
Discipline code
0812;
Abstract
End-to-end attention-based modeling is increasingly popular for tackling sequence-to-sequence mapping tasks. Traditional attention mechanisms utilize prior input information to derive attention, which then conditions the output. However, we believe that knowledge of posterior output information may convey some advantage when modeling attention. A recent technique proposed for machine translation, called the posterior attention model (PAM), demonstrates that posterior output information can be used in this way. This paper explores the use of posterior information for attention modeling in an automatic speech recognition (ASR) task. We demonstrate that direct application of PAM to ASR is unsatisfactory, due to two deficiencies. First, PAM adopts attention-based weighted single-frame output prediction by assuming a single focused attention variable, whereas wider contextual information from acoustic frames is important for output prediction in ASR. Second, in addition to the well-known exposure bias problem, PAM introduces additional mismatches between the attention calculations used in training and in inference. We present extensive experiments combining a number of alternative approaches to solving these problems, leading to a high-performance technique which we call extended PAM (EPAM). To counter the first deficiency, EPAM modifies the encoder to introduce additional context information for output prediction. The second deficiency is overcome in EPAM through a two-part solution: a mismatch penalty term and an alternate learning strategy. The former applies a divergence-based loss to correct the distribution mismatch, while the latter employs a novel update strategy which introduces iterative inference steps alongside each training step. In experiments with both the 80-hour WSJ and 300-hour Switchboard datasets we found significant performance gains. For example, the full EPAM system achieved a word error rate (WER) of 10.6% on the WSJ eval92 test set, compared to 11.6% for traditional prior-attention modeling. Meanwhile, on the Switchboard eval2000 test set, we achieved 16.3% WER, again improving on the traditional prior-attention baseline.
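To make the contrast in the abstract concrete, the following is a minimal sketch of prior versus posterior attention in the spirit of PAM. The notation ($h_i$ for encoder states, $\alpha_{t,i}$ for attention weights, $a_t$ for the focused attention variable, $\lambda$ for a penalty weight) is illustrative rather than taken from the paper, and the penalty shown last is one plausible reading of the divergence-based mismatch term, not the authors' exact formulation.

Conventional (prior) attention blends encoder states into a soft context vector before predicting the output:
\[
  c_t = \sum_i \alpha_{t,i}\, h_i, \qquad p(y_t \mid y_{<t}, \mathbf{x}) = \mathrm{Dec}(y_{<t}, c_t).
\]
PAM instead assumes a single focused attention variable $a_t$ and predicts the output as a mixture of single-frame predictions:
\[
  p(y_t \mid y_{<t}, \mathbf{x}) = \sum_i p(a_t = i \mid y_{<t}, \mathbf{x})\, p(y_t \mid h_i, y_{<t}),
\]
then, once $y_t$ is observed, sharpens the attention distribution by Bayes' rule for use at the next step:
\[
  p(a_t = i \mid y_{\le t}, \mathbf{x}) \propto p(y_t \mid h_i, y_{<t})\, p(a_t = i \mid y_{<t}, \mathbf{x}).
\]
Because training can compute this posterior with ground-truth outputs while inference must rely on its own predictions, a divergence penalty of the form
\[
  \mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, D\!\left( p(a_t \mid y_{\le t}, \mathbf{x}) \,\middle\|\, p(a_t \mid y_{<t}, \mathbf{x}) \right)
\]
is one way to keep the posterior attention used in training from drifting away from the prior attention available at inference, which is the kind of mismatch correction the abstract describes.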
Pages: 108988-108999
Page count: 12