SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition

Cited by: 8
Authors
Gao, Zhifu [1 ]
Zhang, Shiliang [1 ]
Lei, Ming [1 ]
McLoughlin, Ian [2 ]
Affiliations
[1] Alibaba DAMO Academy, Speech Lab, Hangzhou, People's Republic of China
[2] Singapore Institute of Technology, ICT Cluster, Singapore, Singapore
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; end-to-end; attentional model; Transformer; san-m;
DOI
10.21437/Interspeech.2020-2471
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Codes
100104; 100213
Abstract
End-to-end speech recognition has become popular in recent years because it integrates the acoustic, pronunciation, and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as superior; a prominent example is the Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by the Transformer is the use of self-attention instead of recurrent mechanisms, enabling both the encoder and the decoder to capture long-range dependencies with lower computational complexity. In this work, we propose boosting the self-attention ability with a DFSMN memory block, forming the proposed memory-equipped self-attention (SAN-M) mechanism. Theoretical and empirical comparisons demonstrate the relevance and complementarity between self-attention and the DFSMN memory block, and the proposed SAN-M provides an efficient mechanism to integrate these two modules. We evaluated our approach on the public AISHELL-1 benchmark and on an industrial-scale 20,000-hour Mandarin speech recognition task. On both tasks, SAN-M systems achieved much better performance than the self-attention-based Transformer baseline system. Notably, SAN-M achieves a CER of 6.46% on the AISHELL-1 task even without any external language model, comfortably outperforming other state-of-the-art systems.
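To make the mechanism described in the abstract concrete, the following is a minimal PyTorch sketch of a SAN-M-style layer. It is an illustration under stated assumptions, not the authors' implementation: the DFSMN memory block is modelled here as a depthwise 1-D convolution (a learnable FIR filter over the time axis) whose output is summed with the multi-head self-attention output, and the class name SANM, the kernel size, the sum-based fusion, and the LayerNorm placement are all illustrative choices. Only the core idea, combining self-attention with a DFSMN-style memory, comes from the paper.

```python
import torch
import torch.nn as nn


class SANM(nn.Module):
    """Minimal sketch of a memory-equipped self-attention (SAN-M) layer.

    Assumptions (not the authors' exact design): the DFSMN memory block
    is modelled as a depthwise 1-D convolution, i.e. a learnable FIR
    filter over the time axis, and its output is summed with the
    multi-head self-attention output before a residual connection.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, kernel_size: int = 11):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Depthwise convolution: one FIR filter per feature channel,
        # capturing local temporal context like a DFSMN memory block.
        self.memory = nn.Conv1d(
            d_model, d_model, kernel_size,
            padding=kernel_size // 2, groups=d_model, bias=False,
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # Conv1d expects (batch, channels, time), so transpose around it.
        mem_out = self.memory(x.transpose(1, 2)).transpose(1, 2)
        # Fuse the attention and memory branches with a residual path.
        return self.norm(x + attn_out + mem_out)


if __name__ == "__main__":
    frames = torch.randn(2, 100, 512)   # (batch, frames, features)
    layer = SANM()
    print(layer(frames).shape)          # torch.Size([2, 100, 512])
```

In the paper the memory block is integrated more tightly with the attention computation itself; this sketch keeps the two branches separate purely for readability.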
Pages: 6-10
Page count: 5