Earnings-21: A Practical Benchmark for ASR in the Wild

Cited by: 13
Authors
Del Rio, Miguel [1]
Delworth, Natalie [1]
Westerman, Ryan [1]
Huang, Michelle [1]
Bhandari, Nishchal [1]
Palakapilly, Joseph [1]
McNamara, Quinten [1]
Dong, Joshua [1]
Zelasko, Piotr [2, 3]
Jette, Miguel [1]
Affiliations
[1] Rev Com, Austin, TX 78703 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD USA
[3] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD USA
Source
INTERSPEECH 2021 | 2021
Keywords
automatic speech recognition; named entity recognition; dataset; earnings call
DOI
10.21437/Interspeech.2021-1915
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
Commonly used speech corpora inadequately challenge academic and commercial ASR systems. In particular, these corpora lack the metadata needed for detailed analysis and WER measurement. In response, we present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. This corpus is intended to benchmark ASR systems in the wild, with special attention to named entity recognition. We benchmark four commercial ASR models, two internal models built with open-source tools, and an open-source LibriSpeech model, and discuss their differences in performance on Earnings-21. Using our recently released fstalign tool, we provide a candid analysis of each model's recognition capabilities under different partitions. Our analysis finds that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage. Earnings-21 bridges academic and commercial ASR system evaluation and enables further research on entity modeling and WER on real-world audio.
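The paper's headline metric, word error rate (WER), is reported overall and per partition (e.g., per named-entity category). The following is a minimal Python sketch of the standard Levenshtein-based WER computation; it is an illustrative toy, not Rev's fstalign tool (which performs FST-based alignment), and the example transcripts are invented.

def wer(ref_words, hyp_words):
    """Word error rate: (substitutions + deletions + insertions) / len(ref)."""
    # dp[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[-1][-1] / len(ref_words)

# Hypothetical earnings-call snippet in which the model mangles a named entity.
ref = "revenue at acme corp grew four percent".split()
hyp = "revenue at acne core grew four percent".split()
print(f"WER = {wer(ref, hyp):.2%}")  # 2 substitutions / 7 reference words = 28.57%

Partition-level WER, such as the per-entity-category figures the abstract refers to, uses the same formula restricted to alignment regions whose reference words carry the relevant entity tag.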
Pages: 3465-3469
Page count: 5
Related papers (19 in total)
[11] Peddinti, V.; Wang, Y.; Povey, D.; Khudanpur, S. Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs. IEEE Signal Processing Letters, 2018, 25(3): 373-377.
[12] Povey, D., et al. The Kaldi Speech Recognition Toolkit. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
[13] Shriberg, E. E. Preliminaries to a Theory of Speech Disfluencies. Ph.D. thesis, University of California, Berkeley, 1994.
[14] Snyder, D., et al. MUSAN: A Music, Speech, and Noise Corpus. arXiv:1510.08484, 2015.
[15] Szymański, P., et al. WER We Are and WER We Think We Are. Findings of the Association for Computational Linguistics: EMNLP 2020.
[16] Watanabe, S., et al. CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings. arXiv:2004.09249, 2020. DOI: 10.21437/CHiME.2020-1.
[17] Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Yalta Soplin, N. E.; Heymann, J.; Wiesner, M.; Chen, N.; Renduchintala, A.; Ochiai, T. ESPnet: End-to-End Speech Processing Toolkit. Interspeech 2018, pp. 2207-2211.
[18] Watanabe, S.; Hori, T.; Kim, S.; Hershey, J. R.; Hayashi, T. Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1240-1253.
[19] Zhang, Y., et al. Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition. arXiv:2010.10504, 2020.