Fisher ratio-based multi-domain frame-level feature aggregation for short utterance speaker verification

Cited by: 1
Authors
Zi, Yunfei [1 ]
Xiong, Shengwu [1 ]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan, Hubei, Peoples R China
Keywords
Multi-domain feature; Joint learning; Feature enhancement; Discriminative embedding; Speaker verification; Fisher-ratio; SUPPORT VECTOR MACHINES; RECOGNITION; MFCC; GMM
D O I
10.1016/j.engappai.2024.108063
CLC classification
TP [Automation technology, computer technology]
Subject classification
0812
Abstract
Because short utterances contain little speech, it is difficult to learn enough information to distinguish speakers, which makes short utterance speaker recognition highly challenging. In this paper, we propose a multi-domain frame-level feature joint learning method, termed FirmDomain, that aggregates discriminative information from multiple dimensions and domains. The time, frequency, and spectral domains of speech represent distinct physical characteristics and provide complementary information: the time domain captures the temporal behavior of the signal, the frequency domain represents the signal strength in different frequency ranges, and the spectral domain reflects the overall information of the speech. Based on the extracted multi-domain frame-level features, a Multi-Fisher criterion aggregates the feature parameters by category and assigns the corresponding Multi-Fisher ratio weights to the feature parameters, achieving effective feature aggregation while preserving more useful information. Extensive experiments are carried out on short-duration text-independent speaker verification datasets derived from the VoxCeleb, SITW, and NIST SRE corpora, which contain speech samples of varying lengths and scenarios. The results demonstrate that the proposed method outperforms state-of-the-art deep learning architectures by at least 13% on the test sets, and ablation experiments confirm that the proposed components significantly outperform previous approaches.
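The record does not include the paper's exact Multi-Fisher formulation, but the underlying idea of Fisher-ratio feature weighting can be illustrated with the classical per-dimension Fisher ratio (between-class variance of class means over within-class variance), used here to weight frame-level feature dimensions. This is a minimal sketch under that assumption, not the authors' implementation; the function name and toy data are hypothetical.

```python
import numpy as np

def fisher_ratio_weights(features, labels, eps=1e-8):
    """Per-dimension Fisher ratio: between-class scatter of class means
    divided by within-class scatter. Higher ratio = more discriminative
    dimension; the ratios are normalized into aggregation weights."""
    classes = np.unique(labels)
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in classes:
        fc = features[labels == c]          # frames of one speaker/class
        mc = fc.mean(axis=0)
        between += len(fc) * (mc - overall_mean) ** 2
        within += ((fc - mc) ** 2).sum(axis=0)
    ratio = between / (within + eps)
    return ratio / (ratio.sum() + eps)      # weights summing to 1

# Toy example: dimension 0 separates two speakers, dimension 1 is noise.
rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 0.1, size=(50, 2))
x1 = rng.normal(0.0, 0.1, size=(50, 2))
x1[:, 0] += 2.0                             # class shift on dimension 0 only
X = np.vstack([x0, x1])
y = np.array([0] * 50 + [1] * 50)

w = fisher_ratio_weights(X, y)
weighted = X * w                            # Fisher-weighted feature aggregation
```

In this sketch the discriminative dimension receives nearly all of the weight, mirroring the paper's stated goal of matching Fisher-ratio weights to feature parameters so that more informative components dominate the aggregated representation.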
Pages: 9