Spoken Language Identification in Unseen Target Domain Using Centroid Similarity Loss With Adaptive Gradient Blending

Cited by: 0
Authors
Muralikrishna, H. [1 ]
Kumar, Sujeet [2 ]
Dinesh, Dileep Aroor [3 ]
Thenkanidiyoor, Veena [4 ]
Affiliations
[1] Manipal Acad Higher Educ, Manipal Inst Technol, Dept Elect & Commun Engn, Manipal 576104, India
[2] Indian Inst Technol Mandi, MANAS Lab, Mandi 175075, Himachal Pradesh, India
[3] Indian Inst Technol Dharwad, Dept Comp Sci & Engn, Dharwad 580011, Karnataka, India
[4] Natl Inst Technol Goa, Dept Comp Sci & Engn, Ponda 403401, India
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Feature extraction; Training; Robustness; Object recognition; Adaptive systems; Natural language processing; Gradient methods; Spoken language identification; unseen target domain; domain-mismatch; adaptive gradient blending; centroid similarity loss; DEEP NEURAL-NETWORKS; RECOGNITION;
DOI
10.1109/ACCESS.2024.3422380
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline Classification Code
0812 ;
Abstract
In this paper, we propose a centroid similarity loss (CSL) with adaptive gradient blending (AGB) strategy (denoted CSL-with-AGB) to improve the generalization of a spoken language identification (LID) system to unseen target-domain conditions. Unlike most existing approaches, the proposed CSL-with-AGB can improve generalization even when the training dataset lacks domain diversity. Specifically, in this approach the LID network first analyses the input at two different temporal resolutions using a pair of embedding extractors, which allows the network to generalize better by encoding complementary content. We then use the CSL to further improve the generalization of the network by encouraging the embedding extractors to learn discriminative, domain-invariant embeddings. However, applying an auxiliary loss such as the CSL can force the two embedding extractors to learn in an unbalanced way, diminishing their ability to encode complementary content in the input. To overcome this issue, we include the AGB strategy alongside the CSL. With the help of two auxiliary classifiers attached to the two embedding extractors, the AGB monitors them and guides them toward balanced learning, leading to improved performance in unseen target-domain conditions.
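The centroid similarity loss described in the abstract can be sketched as follows. This is a minimal illustration, assuming a cosine-similarity formulation in which each embedding is pulled toward the centroid of its language class; the exact loss used in the paper may differ in normalization and weighting.

```python
import torch
import torch.nn.functional as F

def centroid_similarity_loss(embeddings: torch.Tensor,
                             labels: torch.Tensor,
                             num_classes: int) -> torch.Tensor:
    """Hypothetical sketch of a centroid similarity loss (CSL).

    For each class present in the batch, compute the (L2-normalized)
    centroid of its embeddings and penalize 1 - cos(embedding, centroid),
    encouraging discriminative, tightly clustered embeddings.
    """
    emb = F.normalize(embeddings, dim=1)  # unit-length embeddings
    loss = embeddings.new_tensor(0.0)
    n_present = 0
    for c in range(num_classes):
        mask = labels == c
        if mask.sum() == 0:
            continue  # class absent from this batch
        centroid = F.normalize(emb[mask].mean(dim=0), dim=0)
        # cosine similarity of each class member to its centroid
        sim = emb[mask] @ centroid
        loss = loss + (1.0 - sim).mean()
        n_present += 1
    return loss / n_present
```

In practice such a term would be added to the primary cross-entropy LID loss; with unit-normalized vectors the per-class term lies in [0, 2], so the averaged loss is bounded.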
Pages: 95959-95971
Page count: 13