NAM+: TOWARDS SCALABLE END-TO-END CONTEXTUAL BIASING FOR ADAPTIVE ASR

Cited by: 3
Authors
Munkhdalai, Tsendsuren [1 ]
Wu, Zelin [1 ]
Pundak, Golan [1 ]
Sim, Khe Chai [1 ]
Li, Jiayang [1 ]
Rondon, Pat [1 ]
Sainath, Tara N. [1 ]
Affiliations
[1] Google LLC, Mountain View, CA 94043 USA
Source
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT | 2022
Keywords
speech recognition; on-device learning; fast contextual adaptation
DOI
10.1109/SLT54892.2023.10023323
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Attention-based biasing techniques for end-to-end ASR systems are able to achieve large accuracy gains without requiring the inference algorithm adjustments and parameter tuning common to fusion approaches. However, it is challenging to simultaneously scale up attention-based biasing to realistic numbers of biased phrases; maintain in-domain WER gains, while minimizing out-of-domain losses; and run in real time. We present NAM+, an attention-based biasing approach which achieves a 16X inference speedup per acoustic frame over prior work when run with 3,000 biasing entities, as measured on a typical mobile CPU. NAM+ achieves these run-time gains through a combination of Two-Pass Hierarchical Attention and Dilated Context Update. Compared to the adapted baseline, NAM+ further decreases the in-domain WER by up to 12.6% relative, while incurring an out-of-domain WER regression of 20% relative. Compared to the non-adapted baseline, the out-of-domain WER regression is 7.1% relative.
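The abstract names Two-Pass Hierarchical Attention as the mechanism behind the per-frame speedup but this record does not include the paper's details. The following is a minimal NumPy sketch of the general two-pass idea only, assuming a first pass that scores per-phrase summary embeddings to prune the biasing list and a second pass that attends over tokens of the surviving phrases; every function name, shape, and parameter here is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def two_pass_hierarchical_attention(query, phrase_summaries, phrase_tokens, top_k):
    """Sketch of hierarchical biasing attention for one acoustic frame.

    Pass 1 scores each biasing phrase via a single summary embedding and
    keeps the top_k phrases, so the expensive token-level attention in
    pass 2 no longer scales with the full biasing list (e.g. 3,000 entities).
    """
    # Pass 1: one dot product per phrase -> (num_phrases,) scores.
    scores = phrase_summaries @ query
    selected = np.argsort(scores)[-top_k:]
    # Pass 2: token-level attention restricted to the selected phrases.
    tokens = np.concatenate([phrase_tokens[i] for i in selected], axis=0)
    weights = softmax(tokens @ query)
    context = weights @ tokens  # biasing context vector for this frame
    return context, sorted(int(i) for i in selected)

# Toy usage with random embeddings (dimensions are arbitrary choices).
rng = np.random.default_rng(0)
dim, n_phrases = 8, 3000
summaries = rng.standard_normal((n_phrases, dim))
token_lists = [rng.standard_normal((int(rng.integers(2, 6)), dim))
               for _ in range(n_phrases)]
query = rng.standard_normal(dim)
context, selected = two_pass_hierarchical_attention(query, summaries,
                                                    token_lists, top_k=16)
```

Under this sketch, per-frame cost is one matrix-vector product over phrase summaries plus token attention over only `top_k` phrases, which is the kind of pruning that could yield large speedups at 3,000 entities; the paper's Dilated Context Update (not shown) would further amortize this by refreshing the biasing context only on a subset of frames.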
Pages: 190-196
Page count: 7