IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages

Cited by: 0
Authors
Javed, Tahir [1,2]
Bhogale, Kaushal [1,2]
Raman, Abhigyan [2]
Kumar, Pratyush [2,3]
Kunchukuttan, Anoop [2,3]
Khapra, Mitesh M. [1,2]
Affiliations
[1] Indian Inst Technol Madras, Chennai, Tamil Nadu, India
[2] AI4Bharat, Chennai, Tamil Nadu, India
[3] Microsoft, Redmond, WA, USA
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37, NO 11 | 2023
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
A cornerstone in AI research has been the creation and adoption of standardized training and test datasets to earmark the progress of state-of-the-art models. A particularly successful example is the GLUE dataset for training and evaluating Natural Language Understanding (NLU) models for English. The large body of research around self-supervised BERT-based language models revolved around performance improvements on NLU tasks in GLUE. To evaluate language models in other languages, several language-specific GLUE datasets were created. The area of speech language understanding (SLU) has followed a similar trajectory. The success of large self-supervised models such as wav2vec2 enables the creation of speech models from relatively easy-to-access unlabelled data. These models can then be evaluated on SLU tasks, as in the SUPERB benchmark. In this work, we extend this to Indic languages by releasing the IndicSUPERB benchmark. Specifically, we make the following three contributions. (i) We collect Kathbath, containing 1,684 hours of labelled speech data across 12 Indian languages from 1,218 contributors located in 203 districts in India. (ii) Using Kathbath, we create benchmarks across 6 speech tasks: Automatic Speech Recognition, Speaker Verification, Speaker Identification (mono/multi), Language Identification, Query By Example, and Keyword Spotting for 12 languages. (iii) On the released benchmarks, we train and evaluate different self-supervised models alongside a commonly used baseline, FBANK. We show that language-specific fine-tuned models are more accurate than the baseline on most of the tasks, including a large gap of 76% for the Language Identification task. However, for speaker identification, self-supervised models trained on large datasets demonstrate an advantage. We hope IndicSUPERB contributes to the progress of developing speech language understanding models for Indian languages.
Pages: 12942-12950
Number of pages: 9
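
The abstract contrasts a classical FBANK front-end with self-supervised wav2vec2 representations as inputs to downstream speech tasks. The following is a minimal sketch of how these two kinds of features could be extracted, under the assumption of torchaudio and HuggingFace Transformers as tooling; the file name, the "facebook/wav2vec2-base" checkpoint, and the 16 kHz rate are illustrative assumptions, not the authors' released pipeline.

# Minimal sketch (assumptions noted above): FBANK features vs. wav2vec2 features.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load an utterance and resample to 16 kHz, the rate wav2vec2 models expect.
waveform, sr = torchaudio.load("sample.wav")  # hypothetical audio clip
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

# (1) FBANK baseline: Kaldi-style 80-dimensional log-mel filterbank features.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=16_000
)  # shape: (num_frames, 80)

# (2) Self-supervised features: hidden states from a pretrained wav2vec2 encoder.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
inputs = extractor(
    waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt"
)
with torch.no_grad():
    ssl_feats = encoder(**inputs).last_hidden_state  # shape: (1, num_frames, 768)

# Either feature matrix would feed a lightweight downstream head, e.g. a
# classifier for language identification or keyword spotting.
print(fbank.shape, ssl_feats.shape)

In a benchmark of this kind, the downstream head is kept small so that differences in task accuracy can be attributed mainly to the quality of the front-end features.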