Improving Readability for Automatic Speech Recognition Transcription

Cited by: 5
Authors
Liao, Junwei [1]
Eskimez, Sefik [2]
Lu, Liyang [2]
Shi, Yu [2]
Gong, Ming [3]
Shou, Linjun [3]
Qu, Hong [1]
Zeng, Michael [2]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Peoples R China
[2] Microsoft Speech & Dialogue Res Grp, New York, NY USA
[3] Microsoft STCA NLP Grp, Beijing, Peoples R China
Keywords
Automatic speech recognition; post-processing for readability; data synthesis; pre-trained model; punctuation; capitalization
DOI
10.1145/3557894
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Modern Automatic Speech Recognition (ASR) systems can achieve high recognition accuracy. However, even a perfectly accurate transcript can still be challenging to read because of grammatical errors, disfluency, and other noise common in spoken communication. These readability issues, introduced by both speakers and ASR systems, impair the performance of downstream tasks and the understanding of human readers. In this work, we present a task called ASR post-processing for readability (APR) and formulate it as a sequence-to-sequence text generation problem. The APR task aims to transform noisy ASR output into text that is readable for humans and usable by downstream tasks while preserving the speaker's intended meaning. We further study the APR task in terms of benchmark dataset, evaluation metrics, and baseline models. First, to address the lack of task-specific data, we propose a method to construct a dataset for the APR task from data collected for grammatical error correction. Second, we use metrics adapted or borrowed from similar tasks to evaluate model performance on the APR task. Lastly, we use several typical or adapted pre-trained models as baselines for the APR task. We fine-tune the baseline models on the constructed dataset and compare their performance with a traditional pipeline method in terms of the proposed evaluation metrics. Experimental results show that all the fine-tuned baseline models perform better than the traditional pipeline method, and our adapted RoBERTa model outperforms the pipeline method by 4.95 and 6.63 BLEU points on two test sets, respectively. The human evaluation and case study further reveal the ability of the proposed model to improve the readability of ASR transcripts.
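The abstract frames APR as mapping noisy ASR text to readable text, with training pairs derived from grammatical error correction (GEC) data. The sketch below illustrates only the shape of such a pair. It is not the authors' construction method, which the abstract does not detail: the lowercasing and punctuation stripping are a hypothetical stand-in for real ASR output, and the function names `simulate_asr_text` and `make_apr_pair` are invented for illustration.

```python
import string
from typing import Tuple


def simulate_asr_text(sentence: str) -> str:
    """Crudely mimic ASR-style text: lowercase, no punctuation.

    Assumption: a real system would introduce recognition errors too;
    this stand-in only removes casing and punctuation.
    """
    table = str.maketrans("", "", string.punctuation)
    # split()/join() collapses the gaps left by removed punctuation tokens.
    return " ".join(sentence.lower().translate(table).split())


def make_apr_pair(gec_source: str, gec_target: str) -> Tuple[str, str]:
    """Build one (noisy input, readable target) pair from a GEC example.

    gec_source: the ungrammatical sentence from a GEC corpus.
    gec_target: its human-corrected counterpart, kept unchanged.
    """
    return simulate_asr_text(gec_source), gec_target


if __name__ == "__main__":
    noisy, readable = make_apr_pair(
        "She go to school yesterday , and it were fun .",
        "She went to school yesterday, and it was fun.",
    )
    print(noisy)
    print(readable)
```

A sequence-to-sequence model would then be trained to map the first element of each pair to the second, so that it learns to restore punctuation and capitalization and to correct grammar in a single step. Note that `string.punctuation` includes the apostrophe, so contractions such as "don't" become "dont" in this simplified version.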
Pages: 23