Multi-resolution approach to Identification of spoken languages and to improve overall Language Diarization System using Whisper Model

Cited by: 7

Authors:

Vachhani, Bhavik [1]

Singh, Dipesh [1]

Lawyer, Rustom [1]

Affiliation:

[1] Augnito India Private Ltd, Mumbai, Maharashtra, India

Source:

INTERSPEECH 2023 | 2023

Keywords:

Spoken Language Identification; Language Diarization; Audio Accent Identification; Whisper; DOVER-Lap;

DOI:

10.21437/Interspeech.2023-1354

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

This paper investigates the effectiveness of the Whisper decoder for Language Identification (LI) and Language Diarization (LD) tasks. An audio accent detection system was used as an attention mechanism to narrow down the Whisper LI output classes. The LI system was tested on audio resolutions ranging from 1.0 to 11.0 seconds, and the resulting segments were combined to generate one RTTM per audio resolution. Lastly, the different multi-resolution diarization systems were ensembled using the DOVER-Lap algorithm. This work was part of the DISPLACE challenge organized at INTERSPEECH 2023, so the challenge dataset was used for all experiments. The results show that a 5-second audio resolution (i.e., S-1) yields the best single-system result, with a DER of 38.12% on the development data and 42.45% on the evaluation data. Furthermore, combining the multi-resolution diarization systems (i.e., S-2) produced an absolute improvement of 3.22% over S-1 and 11.66% over the challenge baseline, for a total DER of 34.9% on the development set.
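The step of turning window-level LI decisions into one RTTM per resolution can be illustrated with a minimal sketch. It is not the authors' code. It assumes one language label per fixed-length window, merges adjacent windows that share a label, and writes the label into the speaker-name field of a standard 10-field RTTM line. The function names, the `SPEAKER` type token, the channel value `1`, and the example labels are all assumptions made for illustration, and the sketch treats every window (including the last) as having the full window length.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (onset in seconds, duration in seconds, label)


def windows_to_segments(labels: List[str], window_s: float) -> List[Segment]:
    """Merge consecutive fixed-length windows that carry the same label."""
    segments: List[Segment] = []
    for i, label in enumerate(labels):
        if segments and segments[-1][2] == label:
            # Same language as the previous window: extend that segment.
            onset, dur, _ = segments[-1]
            segments[-1] = (onset, dur + window_s, label)
        else:
            # Language changed (or first window): start a new segment.
            segments.append((i * window_s, window_s, label))
    return segments


def segments_to_rttm(file_id: str, segments: List[Segment]) -> List[str]:
    """Format segments as RTTM lines (label placed in the name field: an assumption)."""
    return [
        f"SPEAKER {file_id} 1 {onset:.3f} {dur:.3f} <NA> <NA> {label} <NA> <NA>"
        for onset, dur, label in segments
    ]


# Hypothetical per-window labels at a 5-second resolution.
labels_5s = ["hi", "hi", "en", "hi"]
rttm_lines = segments_to_rttm("rec1", windows_to_segments(labels_5s, 5.0))
for line in rttm_lines:
    print(line)
```

Running the same two functions with other window lengths would give one RTTM per resolution, which is the kind of input an ensembling method such as DOVER-Lap combines.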


Pages: 1993-1997

Page count: 5
