HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

Cited: 0
Authors
Chen, Chen [1 ]
Hu, Yuchen [1 ]
Yang, Chao-Han Huck [2 ,3 ,6 ]
Siniscalchi, Sabato Marco [2 ,4 ]
Chen, Pin-Yu [5 ]
Chng, Eng Siong [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Georgia Inst Technol, Atlanta, GA 30332 USA
[3] NVIDIA Res, Santa Clara, CA USA
[4] Norwegian Univ Sci & Technol, Trondheim, Norway
[5] IBM Res AI, Cambridge, MA USA
[6] Georgia Tech, Atlanta, GA 30332 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
National Research Foundation of Singapore
Keywords
NEURAL-NETWORK; ASR;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, "HyPoradise" (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of LLM-based error correction techniques with varying amounts of labeled hypotheses-transcription pairs, which yield significant word error rate (WER) reductions. Experimental evidence demonstrates that the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking-based methods. More surprisingly, an LLM with a reasonable prompt can use its generative capability to correct even tokens that are missing from the N-best list. We make our results publicly accessible with reproducible pipelines and released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
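As a rough illustration of the generative error-correction setup the abstract describes, the sketch below formats an ASR N-best list into an instruction prompt for a generic LLM, which then generates a single corrected transcription instead of merely re-ranking the candidates. This is a minimal sketch under stated assumptions: the prompt wording, the build_prompt helper, and the commented-out some_llm.generate call are illustrative placeholders, not the paper's released pipeline.

```python
# Minimal sketch (not the paper's released HyPoradise pipeline): prompting a
# generic LLM to recover the true transcription from an ASR N-best list.
# The prompt wording and the LLM backend call are illustrative assumptions.
from typing import List

def build_prompt(nbest: List[str]) -> str:
    """Format N-best ASR hypotheses into an instruction prompt."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "Below are the N-best hypotheses produced by a speech recognizer "
        "for one utterance. Output the most likely true transcription; "
        "you may combine words across hypotheses or introduce words that "
        "appear in none of them.\n"
        f"{hyps}\n"
        "Transcription:"
    )

# Competing hypotheses whose correct merge appears in none of them:
# "i read the book on the plane".
nbest = [
    "i red the book on the plain",
    "i read a book on the plain",
    "he read the book on the plane",
]

prompt = build_prompt(nbest)
print(prompt)
# corrected = some_llm.generate(prompt)  # hypothetical call to any
#                                        # instruction-tuned LLM backend
```

Because the output is generated rather than selected, the model can produce "i read the book on the plane" even though that exact string is not among the candidates; this is the property the abstract contrasts with the upper bound of rescoring, which must return one of the N hypotheses verbatim.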
Pages: 24