Potential of Speech-Pathological Features for Deepfake Speech Detection

Cited by: 2
Authors
Chaiwongyen, Anuwat [1, 2]
Duangpummet, Suradej [3 ]
Karnjana, Jessada [3 ]
Kongprawechnon, Waree [2 ]
Unoki, Masashi [1 ]
Affiliations
[1] Japan Adv Inst Sci & Technol, Grad Sch Adv Sci & Technol, Nomi, Ishikawa 9231292, Japan
[2] Thammasat Univ, Sirindhorn Int Inst Technol, Khlong Nueng 12120, Pathum Thani, Thailand
[3] Natl Sci & Technol Dev Agcy, Natl Elect & Comp Technol Ctr NECTEC, Khlong Nueng 12120, Pathum Thani, Thailand
Keywords
Deepfakes; Jitter; Feature extraction; Noise measurement; Pathology; Voice activity detection; Training; Speech analysis; Harmonic analysis; Deepfake speech detection; speech-pathological feature; jitter and shimmer; glottal-to-noise; harmonics-to-noise ratio; cepstral-harmonics-to-noise ratio; normalized noise energy; FREQUENCY;
DOI
10.1109/ACCESS.2024.3447582
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
There is great concern about the misuse of deepfake speech technology to synthesize a real person's voice. Developing speech-security systems capable of detecting deepfake speech is therefore essential to safeguarding against such misuse. Although various speech features and methods have been proposed, their potential for distinguishing between genuine and deepfake speech remains unclear. Because speech-pathological features combined with deep learning are widely used to assess the unnaturalness of disordered voices associated with voice-production mechanisms, we investigated the potential of eleven speech-pathological features for distinguishing between genuine and deepfake speech: jitter (three types), shimmer (four types), harmonics-to-noise ratio, cepstral-harmonics-to-noise ratio, normalized noise energy, and glottal-to-noise excitation ratio. This paper proposes a method that combines two models built on two different dimensions of speech-pathological features, together with mel-spectrogram features, to improve the effectiveness and efficiency of deepfake speech detection. We evaluated the proposed method on the datasets of the Automatic Speaker Verification Spoofing and Countermeasures Challenges, ASVspoof 2019 and ASVspoof 2021. The results indicate that the proposed method outperforms the baselines in terms of accuracy, recall, F1-score, and F2-score, achieving 95.06%, 99.46%, 97.30%, and 98.59%, respectively, on the ASVspoof 2019 dataset. It also surpasses the baselines on the ASVspoof 2021 dataset in terms of recall, F1-score, F2-score, and equal error rate, achieving 99.96%, 96.65%, 98.18%, and 15.97%, respectively.
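For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of two of the feature families (jitter and shimmer) and of the F-beta score behind the reported F1 and F2 values. It is not the authors' implementation: the abstract does not say which three jitter and four shimmer variants were used, so the sketch shows only the commonly used "local" variants, and it assumes the pitch periods and peak amplitudes have already been extracted. The function names `local_jitter`, `local_shimmer`, and `f_beta` are hypothetical.

```python
from statistics import mean


def _relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def local_jitter(periods):
    """Local jitter: cycle-to-cycle variation of pitch periods (assumed in seconds)."""
    return _relative_perturbation(periods)


def local_shimmer(amplitudes):
    """Local shimmer: cycle-to-cycle variation of peak amplitudes."""
    return _relative_perturbation(amplitudes)


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1, beta=2 weights recall more heavily (F2)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom
```

A perfectly periodic signal gives `local_jitter([0.01, 0.01, 0.01]) == 0.0`, and larger values indicate more cycle-to-cycle irregularity. Because F2 weights recall above precision, a detector with very high recall, as reported in the abstract, will show an F2-score above its F1-score.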
Pages: 121958-121970
Number of pages: 13