NAM+: TOWARDS SCALABLE END-TO-END CONTEXTUAL BIASING FOR ADAPTIVE ASR

Cited by: 3
Authors
Munkhdalai, Tsendsuren [1 ]
Wu, Zelin [1 ]
Pundak, Golan [1 ]
Sim, Khe Chai [1 ]
Li, Jiayang [1 ]
Rondon, Pat [1 ]
Sainath, Tara N. [1 ]
Affiliations
[1] Google LLC, Mountain View, CA 94043 USA
Source
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT | 2022
Keywords
speech recognition; on-device learning; fast contextual adaptation
DOI
10.1109/SLT54892.2023.10023323
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Attention-based biasing techniques for end-to-end ASR systems are able to achieve large accuracy gains without requiring the inference algorithm adjustments and parameter tuning common to fusion approaches. However, it is challenging to simultaneously scale up attention-based biasing to realistic numbers of biased phrases; maintain in-domain WER gains, while minimizing out-of-domain losses; and run in real time. We present NAM+, an attention-based biasing approach which achieves a 16X inference speedup per acoustic frame over prior work when run with 3,000 biasing entities, as measured on a typical mobile CPU. NAM+ achieves these run-time gains through a combination of Two-Pass Hierarchical Attention and Dilated Context Update. Compared to the adapted baseline, NAM+ further decreases the in-domain WER by up to 12.6% relative, while incurring an out-of-domain WER regression of 20% relative. Compared to the non-adapted baseline, the out-of-domain WER regression is 7.1% relative.
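The abstract names Two-Pass Hierarchical Attention as the mechanism behind the per-frame speedup but this record does not include the paper's details. The following is a minimal NumPy sketch of the general two-pass idea only, assuming a first pass that scores per-phrase summary embeddings to prune the biasing list and a second pass that attends over tokens of the surviving phrases; every function name, shape, and parameter here is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def two_pass_hierarchical_attention(query, phrase_summaries, phrase_tokens, top_k):
    """Sketch of hierarchical biasing attention for one acoustic frame.

    Pass 1 scores each biasing phrase via a single summary embedding and
    keeps the top_k phrases, so the expensive token-level attention in
    pass 2 no longer scales with the full biasing list (e.g. 3,000 entities).
    """
    # Pass 1: one dot product per phrase -> (num_phrases,) scores.
    scores = phrase_summaries @ query
    selected = np.argsort(scores)[-top_k:]
    # Pass 2: token-level attention restricted to the selected phrases.
    tokens = np.concatenate([phrase_tokens[i] for i in selected], axis=0)
    weights = softmax(tokens @ query)
    context = weights @ tokens  # biasing context vector for this frame
    return context, sorted(int(i) for i in selected)

# Toy usage with random embeddings (dimensions are arbitrary choices).
rng = np.random.default_rng(0)
dim, n_phrases = 8, 3000
summaries = rng.standard_normal((n_phrases, dim))
token_lists = [rng.standard_normal((int(rng.integers(2, 6)), dim))
               for _ in range(n_phrases)]
query = rng.standard_normal(dim)
context, selected = two_pass_hierarchical_attention(query, summaries,
                                                    token_lists, top_k=16)
```

Under this sketch, per-frame cost is one matrix-vector product over phrase summaries plus token attention over only `top_k` phrases, which is the kind of pruning that could yield large speedups at 3,000 entities; the paper's Dilated Context Update (not shown) would further amortize this by refreshing the biasing context only on a subset of frames.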
Pages: 190-196
Page count: 7