Multi-Scene Robust Speaker Verification System Built on Improved ECAPA-TDNN

Cited by: 0

Authors:

Xuan, Xi [1]

Jin, Rong [2]

Xuan, Tingyu [3]

Du, Guolei [4]

Xuan, Kaisheng [5]

Affiliations:

[1] Beijing Inst Fash Technol, Chinese Fash Sci & Technol Res Inst, Beijing, Peoples R China

[2] Dalian Jiaotong Univ, Sch Automat & Elect Engn, Dalian, Liaoning, Peoples R China

[3] Anhui Med Univ, Clin Coll, Dept Nursing, Hefei, Anhui, Peoples R China

[4] Jilin Coll Tradit Chinese Med, Clin Hosp 1, Jilin, Peoples R China

[5] Hefei 9 Middle Sch, Hefei, Anhui, Peoples R China

Source:

2022 IEEE 6TH ADVANCED INFORMATION TECHNOLOGY, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (IAEAC) | 2022

Keywords:

ECAPA-TDNN; FBanks; Speaker embedding; Speaker Verification System;

DOI:

10.1109/IAEAC54830.2022.9929964

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

To address the problems of cross-domain mismatch, short speech, and noise interference in industrial application scenarios of speaker recognition, this paper proposes DD-ECAPA-TDNN, an improved ECAPA-TDNN architecture for a multi-scene robust speaker verification system. The design of DD-ECAPA-TDNN is inspired by ECAPA-TDNN, a model that has recently become popular in automatic speaker verification (ASV) systems. First, FBanks are used to extract acoustic features; next, the DD-SERes2Net Block proposed in this paper captures local features efficiently. The output feature maps of all DD-SERes2Net Blocks are then aggregated at multiple scales, and finally the ASP pooling operation is performed. The experiments were based on the VoxCeleb1-dev dataset, and SC-AAMSoftmax was used to train a speaker identification model for 1211 speakers. This DD-ECAPA-TDNN model was then used as the speaker embedding extractor to construct an ASV system. The VoxMovies and VoxCeleb1-O evaluation sets were used to simulate three scenarios (cross-domain, short speech, and noise interference) in order to evaluate the performance of the DD-ECAPA-TDNN system under multiple scenarios. The system achieves an EER of 2.51% on VoxCeleb1-O. The DD-ECAPA-TDNN system significantly outperforms the ECAPA-TDNN system in recognition performance across multiple scenarios. Finally, the ablation experiments show that the DD-SERes2Net Block has a positive impact on the performance of the ASV system and that DD-ECAPA-TDNN can extract robust and accurate speaker embeddings with good scene generalization.
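The abstract reports results as an equal error rate (EER), the operating point at which the false acceptance rate and the false rejection rate of a verification system coincide. As a rough illustration only, the sketch below estimates an EER from two lists of trial scores by sweeping a decision threshold. The function name `equal_error_rate` and the toy scores are assumptions made for this example; this is not the authors' evaluation code, and the record does not say how their EER was computed.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER (as a fraction in [0, 1]) by a threshold sweep.

    target_scores:    similarity scores of same-speaker trials
    nontarget_scores: similarity scores of different-speaker trials
    A trial is accepted when its score is >= the threshold.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False acceptance rate: non-target trials wrongly accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: target trials wrongly rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest
            # and report their average as the EER estimate.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores, invented purely for illustration.
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
```

On real trial lists the two error rates rarely meet exactly at an observed threshold, so evaluation toolkits typically interpolate between neighbouring points; the averaging used here is a simplification.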

Citation

Pages: 1689-1693

Page count: 5
