Analysis of Spectro-Temporal Modulation Representation for Deep-Fake Speech Detection

Cited: 4
Authors
Cheng, Haowei [1 ]
Mawalim, Candy Olivia [1 ]
Li, Kai [1 ]
Wang, Lijun [1 ]
Unoki, Masashi [1 ]
Institutions
[1] Japan Adv Inst Sci & Technol, 1-1 Asahidai, Nomi, Ishikawa 9231292, Japan
Source
2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC | 2023
Keywords
DOI
10.1109/APSIPAASC58517.2023.10317309
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Deep-fake speech detection aims to develop effective techniques for identifying fake speech generated with advanced deep-learning methods, thereby reducing the negative impact of the malicious production or dissemination of fake speech in real-life scenarios. Although humans can distinguish genuine from fake speech relatively easily owing to the mechanisms of auditory perception, machines find it difficult to do so reliably. One major reason for this challenge is that machines struggle to effectively separate speech content from information about the human vocal system; common features used in speech processing handle this poorly, hindering a neural network from learning the discriminative differences between genuine and fake speech. To address this issue, we investigated spectro-temporal modulation representations of genuine and fake speech, which simulate the human auditory perception process. The spectro-temporal modulation representation was then fed to a light convolutional neural network combined with a bidirectional long short-term memory (LCNN-BiLSTM) for classification. We conducted experiments on the benchmark datasets of the Automatic Speaker Verification and Spoofing Countermeasures Challenge 2019 (ASVspoof2019) and the Audio Deep synthesis Detection Challenge 2023 (ADD2023), achieving equal error rates of 8.33% and 42.10%, respectively. The results show that spectro-temporal modulation representations can distinguish genuine from deep-fake speech and achieve adequate performance on both datasets.
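The equal error rate (EER) reported in the abstract is the standard metric for spoofing countermeasures: the operating point at which the false-acceptance rate (fake accepted as genuine) equals the false-rejection rate (genuine rejected as fake). The sketch below is an illustrative brute-force implementation over candidate thresholds, not the paper's evaluation code; the score distributions in the usage example are synthetic.

```python
import numpy as np

def compute_eer(genuine_scores, fake_scores):
    """Equal error rate via a threshold sweep.

    Higher scores are assumed to mean "more likely genuine".
    Returns the average of FAR and FRR at the threshold where
    they are closest (a common simple approximation of the EER).
    """
    thresholds = np.sort(np.concatenate([genuine_scores, fake_scores]))
    best_eer, best_diff = 1.0, np.inf
    for t in thresholds:
        # FAR: fraction of fake trials accepted as genuine (score >= t)
        far = np.mean(fake_scores >= t)
        # FRR: fraction of genuine trials rejected as fake (score < t)
        frr = np.mean(genuine_scores < t)
        if abs(far - frr) < best_diff:
            best_diff = abs(far - frr)
            best_eer = (far + frr) / 2.0
    return best_eer

# Toy example with synthetic, well-separated score distributions:
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)
fake = rng.normal(-2.0, 1.0, 1000)
print(f"EER: {compute_eer(genuine, fake):.4f}")
```

With overlapping distributions the EER rises toward 0.5 (chance level), which is why the 42.10% EER on ADD2023 indicates a much harder condition than the 8.33% on ASVspoof2019.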
Pages: 1822 - 1829
Page count: 8