Fisher ratio-based multi-domain frame-level feature aggregation for short utterance speaker verification

Cited by: 1
Authors
Zi, Yunfei [1 ]
Xiong, Shengwu [1 ]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan, Hubei, Peoples R China
Keywords
Multi-domain feature; Joint learning; Feature enhancement; Discriminative embedding; Speaker verification; Fisher-ratio; SUPPORT VECTOR MACHINES; RECOGNITION; MFCC; GMM
DOI
10.1016/j.engappai.2024.108063
Chinese Library Classification (CLC) number
TP [automation technology, computer technology]
Discipline classification code
0812
Abstract
Because short utterances provide very little speech, it is difficult to learn enough information to distinguish speakers, which makes short-utterance speaker recognition highly challenging. In this paper, we propose a multi-domain frame-level feature joint learning method that aggregates discriminative information across multiple dimensions and domains. The time, frequency, and spectral domains of speech represent distinct physical characteristics and provide complementary information: the time domain captures the temporal behavior of the signal, the frequency domain represents signal strength in different frequency ranges, and the spectral domain reflects the overall structure of the speech. Based on the extracted multi-domain frame-level features, a Multi-Fisher criterion aggregates the feature parameters by category and assigns each parameter its corresponding Multi-Fisher ratio weight, achieving effective feature aggregation while preserving more useful information; we term the method FirmDomain. Extensive experiments are carried out on short-duration text-independent speaker verification datasets derived from the VoxCeleb, SITW, and NIST SRE corpora, which contain speech samples of varying lengths and scenarios. The results demonstrate that the proposed method outperforms state-of-the-art deep learning architectures by at least 13% on the test sets, and ablation experiments confirm that the proposed components significantly outperform previous approaches.
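The abstract does not give the exact formulation of the Multi-Fisher criterion, so the short Python sketch below is only a minimal illustration of the weighting idea, assuming the standard per-dimension Fisher ratio (between-class variance over within-class variance) applied to concatenated time-, frequency-, and spectral-domain frame-level features; the function names fisher_ratio_weights and aggregate_multidomain are hypothetical, not from the paper.

import numpy as np

def fisher_ratio_weights(features, labels, eps=1e-8):
    # Per-dimension Fisher ratio: between-class variance / within-class variance.
    # features: (num_frames, num_dims) array; labels: (num_frames,) speaker ids.
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for spk in np.unique(labels):
        cls = features[labels == spk]
        cls_mean = cls.mean(axis=0)
        between += len(cls) * (cls_mean - global_mean) ** 2
        within += ((cls - cls_mean) ** 2).sum(axis=0)
    ratios = between / (within + eps)
    return ratios / ratios.sum()  # normalise so the weights sum to 1

def aggregate_multidomain(time_feats, freq_feats, spec_feats, labels):
    # Concatenate frame-level features from the three domains and reweight
    # every dimension by its Fisher-ratio weight before pooling/embedding.
    stacked = np.concatenate([time_feats, freq_feats, spec_feats], axis=1)
    return stacked * fisher_ratio_weights(stacked, labels)

# Toy usage with random stand-ins for the three domains (dimensions are illustrative):
# labels = np.repeat([0, 1], 100)
# agg = aggregate_multidomain(np.random.randn(200, 3),    # time-domain statistics
#                             np.random.randn(200, 13),   # MFCC-like frequency features
#                             np.random.randn(200, 20),   # spectral-domain features
#                             labels)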
Pages: 9