Earnings-21: A Practical Benchmark for ASR in the Wild

Cited by: 13
Authors
Del Rio, Miguel [1 ]
Delworth, Natalie [1 ]
Westerman, Ryan [1 ]
Huang, Michelle [1 ]
Bhandari, Nishchal [1 ]
Palakapilly, Joseph [1 ]
McNamara, Quinten [1 ]
Dong, Joshua [1 ]
Zelasko, Piotr [2 ,3 ]
Jette, Miguel [1 ]
Affiliations
[1] Rev.com, Austin, TX 78703 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD USA
[3] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD USA
Source
INTERSPEECH 2021 | 2021
Keywords
automatic speech recognition; named entity recognition; dataset; earnings call;
DOI
10.21437/Interspeech.2021-1915
Abstract
Commonly used speech corpora inadequately challenge academic and commercial ASR systems. In particular, speech corpora lack metadata needed for detailed analysis and WER measurement. In response, we present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. This corpus is intended to benchmark ASR systems in the wild with special attention towards named entity recognition. We benchmark four commercial ASR models, two internal models built with open-source tools, and an open-source LibriSpeech model and discuss their differences in performance on Earnings-21. Using our recently released fstalign tool, we provide a candid analysis of each model's recognition capabilities under different partitions. Our analysis finds that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage. Earnings-21 bridges academic and commercial ASR system evaluation and enables further research on entity modeling and WER on real world audio.
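
For orientation, WER is computed from a Levenshtein alignment between the reference and hypothesis word sequences. The sketch below is a minimal, self-contained Python illustration of that computation; it is not the paper's fstalign tool, and the example sentences are hypothetical.

# Illustrative sketch: word error rate (WER) via Levenshtein alignment.
# This is NOT the fstalign tool described in the paper; it is a minimal
# stand-alone example of the underlying metric.

def wer(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    # Hypothetical sentences showing the kind of entity error the paper highlights.
    print(wer("revenue at acme corp rose five percent",
              "revenue at acne corp rose five percent"))  # 1/7 ≈ 0.143
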
Pages: 3465-3469
Page count: 5