HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

Cited: 0
Authors
Chen, Chen [1 ]
Hu, Yuchen [1 ]
Yang, Chao-Han Huck [2 ,3 ,6 ]
Siniscalchi, Sabato Marco [2 ,4 ]
Chen, Pin-Yu [5 ]
Chng, Eng Siong [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Georgia Inst Technol, Atlanta, GA 30332 USA
[3] NVIDIA Res, Santa Clara, CA USA
[4] Norwegian Univ Sci & Technol, Trondheim, Norway
[5] IBM Res AI, Cambridge, MA USA
[6] Georgia Tech, Atlanta, GA 30332 USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
National Research Foundation of Singapore
Keywords
NEURAL-NETWORK; ASR;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, "HyPoradise" (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of LLM-based error correction techniques with varying amounts of labeled hypotheses-transcription pairs, which yield significant word error rate (WER) reductions. Experimental evidence demonstrates that the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking-based methods. More surprisingly, an LLM with a reasonable prompt can use its generative capability to correct even tokens that are missing from the N-best list. We make our results publicly accessible with reproducible pipelines and released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
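As a rough illustration of the generative error-correction setup the abstract describes, the sketch below formats an ASR N-best list into an instruction prompt for a generic LLM, which then generates a single corrected transcription instead of merely re-ranking the candidates. This is a minimal sketch under stated assumptions: the prompt wording, the build_prompt helper, and the commented-out some_llm.generate call are illustrative placeholders, not the paper's released pipeline.

```python
# Minimal sketch (not the paper's released HyPoradise pipeline): prompting a
# generic LLM to recover the true transcription from an ASR N-best list.
# The prompt wording and the LLM backend call are illustrative assumptions.
from typing import List

def build_prompt(nbest: List[str]) -> str:
    """Format N-best ASR hypotheses into an instruction prompt."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "Below are the N-best hypotheses produced by a speech recognizer "
        "for one utterance. Output the most likely true transcription; "
        "you may combine words across hypotheses or introduce words that "
        "appear in none of them.\n"
        f"{hyps}\n"
        "Transcription:"
    )

# Competing hypotheses whose correct merge appears in none of them:
# "i read the book on the plane".
nbest = [
    "i red the book on the plain",
    "i read a book on the plain",
    "he read the book on the plane",
]

prompt = build_prompt(nbest)
print(prompt)
# corrected = some_llm.generate(prompt)  # hypothetical call to any
#                                        # instruction-tuned LLM backend
```

Because the output is generated rather than selected, the model can produce "i read the book on the plane" even though that exact string is not among the candidates; this is the property the abstract contrasts with the upper bound of rescoring, which must return one of the N hypotheses verbatim.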
Pages: 24