Voice Deepfake Detection Using the Self-Supervised Pre-Training Model HuBERT

Cited: 10
Authors
Li, Lanting [1 ]
Lu, Tianliang [1 ]
Ma, Xingbang [1 ]
Yuan, Mengjiao [1 ]
Wan, Da [1 ]
Affiliations
[1] People's Public Security University of China, College of Information and Cyber Security, Beijing 100038, China
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 14
Keywords
voice deepfake detection; self-supervised learning; pre-training; feature map scaling; anti-spoofing;
DOI
10.3390/app13148488
Chinese Library Classification
O6 [Chemistry];
Discipline Code
0703;
Abstract
In recent years, voice deepfake technology has developed rapidly, but current detection methods generalize poorly and extract insufficient features from unknown attacks. This paper presents a forged-speech detection method (HuRawNet2_modified) based on the self-supervised pre-trained model HuBERT to address these problems. A combination of impulsive signal-dependent additive noise and additive white Gaussian noise was adopted for data augmentation, and the HuBERT model was fine-tuned on databases in different languages. On this basis, the sizes of the extracted feature maps were modified independently by the α-feature map scaling (α-FMS) method, within a modified end-to-end architecture using the RawNet2 model as the backbone. The results showed that the HuBERT model extracts features more comprehensively and accurately. The best results were an equal error rate (EER) of 2.89% and a minimum tandem detection cost function (min t-DCF) of 0.2182 on the database of the ASVspoof 2021 LA challenge, which verified the effectiveness of the detection method proposed in this paper. Compared with the baseline systems on the databases of the ASVspoof 2021 LA challenge and the FMFCC-A, both the EER and min t-DCF decreased. The results also showed that the self-supervised pre-trained model with fine-tuning can extract acoustic features across languages, and that detection improves slightly when the pre-training, fine-tuning, and test databases are in the same language.
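The α-FMS operation mentioned in the abstract can be illustrated with a minimal numpy sketch. This is an assumption-laden illustration, not the paper's implementation: it follows the add-then-scale variant of feature map scaling, where a learnable per-channel offset `alpha` is added to the feature map before it is multiplied by a sigmoid gate computed from global average pooling and a fully connected layer. The names `alpha_fms`, `w`, and `b` are hypothetical.

```python
import numpy as np


def alpha_fms(x, alpha, w, b):
    """Sketch of alpha-feature map scaling (alpha-FMS).

    x:     feature map of shape (channels, time)
    alpha: per-channel learnable offset, shape (channels,)
    w, b:  fully connected layer, shape (channels, channels) and (channels,)
    """
    pooled = x.mean(axis=1)                       # global average pooling -> (channels,)
    s = 1.0 / (1.0 + np.exp(-(w @ pooled + b)))   # sigmoid gate, each entry in (0, 1)
    return (x + alpha[:, None]) * s[:, None]      # add offset, then scale per channel


# Tiny demo: with zero offset and a zero FC layer, the gate is sigmoid(0) = 0.5,
# so every feature value is simply halved.
x = np.ones((2, 4))
y = alpha_fms(x, np.zeros(2), np.zeros((2, 2)), np.zeros(2))
```

In the full model, `alpha`, `w`, and `b` would be trained jointly with the backbone; the sketch only shows the forward computation applied to each residual block's output.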
Pages: 15