Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR

Cited: 3
Authors
Zhou, Xinyuan [1 ,2 ]
Lee, Grandee [2 ]
Yilmaz, Emre [2 ]
Long, Yanhua [1 ]
Liang, Jiaen [3 ]
Li, Haizhou [2 ]
Affiliations
[1] Shanghai Normal Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Unisound AI Technol Co Ltd, Beijing, Peoples R China
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; attention; Transformer
DOI
10.21437/Interspeech.2020-2556
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Codes
100104; 100213
Abstract
The Transformer has shown impressive performance in automatic speech recognition. It uses an encoder-decoder structure with self-attention to learn the relationship between the high-level representations of source inputs and the embeddings of target outputs. In this paper, we propose a novel decoder structure that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation of Transformer-based LVCSR. Specifically, we introduce a self-attention mechanism to learn a multi-layer deep acoustic structure for multiple levels of acoustic abstraction. We also design a mixed attention mechanism that learns the alignment between different levels of acoustic abstraction and the corresponding linguistic information simultaneously in a shared embedding space. ASR experiments on Aishell-1 show that the proposed structure achieves CERs of 4.8% on the dev set and 5.1% on the test set, which, to the best of our knowledge, are the best reported results on this task.
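The mixed attention described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `mixed_attention`, the choice to form one key/value memory by concatenating acoustic and text states in a shared embedding space (so a single softmax aligns both at once), and all dimensions are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(text_emb, acoustic_emb):
    """Sketch of 'mixed' attention: queries come from the decoder's text
    embeddings, while keys/values are the concatenation of acoustic and
    text representations in a shared embedding space, so one attention
    distribution covers both modalities simultaneously."""
    d_k = text_emb.shape[-1]
    kv = np.concatenate([acoustic_emb, text_emb], axis=0)  # shared memory
    scores = text_emb @ kv.T / np.sqrt(d_k)   # (T_txt, T_ac + T_txt)
    weights = softmax(scores, axis=-1)        # rows sum to 1
    return weights @ kv, weights

# toy example: 4 text steps, 6 acoustic frames, model dimension 8
rng = np.random.default_rng(0)
txt = rng.standard_normal((4, 8))
ac = rng.standard_normal((6, 8))
out, w = mixed_attention(txt, ac)
```

In a real decoder layer this would be multi-headed, use learned query/key/value projections, and be causally masked on the text positions; the sketch keeps only the core idea of attending over acoustic and linguistic states in one shared space.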
Pages: 5016-5020
Page count: 5