Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR

Cited: 3
Authors
Zhou, Xinyuan [1 ,2 ]
Lee, Grandee [2 ]
Yilmaz, Emre [2 ]
Long, Yanhua [1 ]
Liang, Jiaen [3 ]
Li, Haizhou [2 ]
Affiliations
[1] Shanghai Normal Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Unisound AI Technol Co Ltd, Beijing, Peoples R China
Source
INTERSPEECH 2020 | 2020
Keywords
speech recognition; attention; Transformer
DOI
10.21437/Interspeech.2020-2556
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Codes
100104; 100213
Abstract
The Transformer has shown impressive performance in automatic speech recognition. It uses an encoder-decoder structure with self-attention to learn the relationship between the high-level representations of source inputs and the embeddings of target outputs. In this paper, we propose a novel decoder structure that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation of Transformer-based LVCSR. Specifically, we introduce a self-attention mechanism to learn a multi-layer deep acoustic structure for multiple levels of acoustic abstraction. We also design a mixed attention mechanism that learns the alignment between different levels of acoustic abstraction and the corresponding linguistic information simultaneously in a shared embedding space. ASR experiments on Aishell-1 show that the proposed structure achieves CERs of 4.8% on the dev set and 5.1% on the test set, which, to the best of our knowledge, are the best reported results on this task.
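The mixed attention described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `mixed_attention`, the choice to form one key/value memory by concatenating acoustic and text states in a shared embedding space (so a single softmax aligns both at once), and all dimensions are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(text_emb, acoustic_emb):
    """Sketch of 'mixed' attention: queries come from the decoder's text
    embeddings, while keys/values are the concatenation of acoustic and
    text representations in a shared embedding space, so one attention
    distribution covers both modalities simultaneously."""
    d_k = text_emb.shape[-1]
    kv = np.concatenate([acoustic_emb, text_emb], axis=0)  # shared memory
    scores = text_emb @ kv.T / np.sqrt(d_k)   # (T_txt, T_ac + T_txt)
    weights = softmax(scores, axis=-1)        # rows sum to 1
    return weights @ kv, weights

# toy example: 4 text steps, 6 acoustic frames, model dimension 8
rng = np.random.default_rng(0)
txt = rng.standard_normal((4, 8))
ac = rng.standard_normal((6, 8))
out, w = mixed_attention(txt, ac)
```

In a real decoder layer this would be multi-headed, use learned query/key/value projections, and be causally masked on the text positions; the sketch keeps only the core idea of attending over acoustic and linguistic states in one shared space.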
Pages: 5016-5020
Page count: 5