WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Cited by: 52
Authors
Bain, Max [1]
Huh, Jaesung [1]
Han, Tengda [1]
Zisserman, Andrew [1]
Affiliations
[1] Univ Oxford, Visual Geometry Grp, Oxford, England
Source
INTERSPEECH 2023 | 2023
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
SEGMENTATION;
DOI
10.21437/Interspeech.2023-78
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, the predicted timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out-of-the-box. Further, their application to long audio via buffered transcription prohibits batched inference due to their sequential nature. To overcome the aforementioned challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference. The code is available open-source.
Pages: 4489-4493
Page count: 5