Synthetic Speech Detection Based on the Temporal Consistency of Speaker Features

Cited by: 3
Authors
Zhang, Yuxiang [1,2]
Li, Zhuo [1,2]
Lu, Jingze [1,2]
Wang, Wenchao [1,2]
Zhang, Pengyuan [1,2]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Keywords
Feature extraction; Speech synthesis; Signal processing algorithms; Training; Robustness; Partitioning algorithms; Task analysis; Anti-spoofing; Interpretability; Pre-trained system; Speaker verification
DOI
10.1109/LSP.2024.3381890
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
Current synthetic speech detection (SSD) methods perform well on specific datasets, but their interpretability and robustness still require improvement. One possible reason is the lack of interpretability analysis of synthetic speech defects. In this paper, the flaws in the temporal consistency (TC) of speaker features that are inherent in the speech synthesis process are analyzed. Because speaker features are only loosely controlled during synthesis, the TC of intra-utterance speaker features differs between synthetic and bonafide speech: speech generated by text-to-speech algorithms exhibits higher TC, while speech generated by voice conversion algorithms exhibits slightly lower TC than bonafide speech. Based on this finding, a new SSD method built on the TC of speaker features is proposed: the TC of intra-utterance speaker features extracted by a pre-trained automatic speaker verification (ASV) system is modeled and used for detection. The proposed method achieves equal error rates of 0.84%, 3.93%, 12.98% and 24.66% on the ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF and In-the-Wild evaluation sets, respectively, demonstrating strong interpretability and robustness.
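As a rough illustration of the idea summarized above, the sketch below segments an utterance, extracts one speaker embedding per segment, and scores temporal consistency as the mean cosine similarity between adjacent segment embeddings. The segment length, embedding dimension, and the extract_segment_embeddings stand-in are assumptions for illustration only; the paper models TC on embeddings from a pre-trained ASV system rather than with this simple adjacent-pair statistic.

```python
# Illustrative sketch only: quantifying the temporal consistency (TC) of
# intra-utterance speaker features via adjacent-segment cosine similarity.
# The embedding extractor below is a hypothetical stand-in, not the authors' model.
import numpy as np

def extract_segment_embeddings(waveform: np.ndarray, sr: int,
                               seg_dur: float = 1.0, emb_dim: int = 192) -> np.ndarray:
    """Hypothetical speaker-embedding extractor applied to fixed-length segments.

    In practice this would run a pre-trained ASV model (e.g. an ECAPA-TDNN)
    on each segment; here each segment is mapped to a random vector purely so
    the sketch runs end to end.
    """
    seg_len = int(seg_dur * sr)
    n_segs = max(1, len(waveform) // seg_len)
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_segs, emb_dim))

def temporal_consistency_score(embeddings: np.ndarray) -> float:
    """Mean cosine similarity between adjacent segment embeddings.

    Higher values indicate more uniform speaker characteristics across the
    utterance (e.g. typical of TTS output); lower values indicate more variation.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.sum(normed[:-1] * normed[1:], axis=1)  # cosine of adjacent pairs
    return float(sims.mean()) if len(sims) else 1.0

if __name__ == "__main__":
    sr = 16000
    utterance = np.random.randn(5 * sr)  # 5 s of dummy audio
    embs = extract_segment_embeddings(utterance, sr)
    print(f"TC score: {temporal_consistency_score(embs):.3f}")
```

In an SSD pipeline, such a TC score (or a learned model over the segment embeddings) would feed a binary bonafide/spoof classifier; the published method trains this end to end on ASVspoof data.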
Pages: 944-948
Number of pages: 5