A Deep Learning Framework for Audio Deepfake Detection

Cited by: 42
Authors
Khochare, Janavi [1 ]
Joshi, Chaitali [1 ]
Yenarkar, Bakul [1 ]
Suratkar, Shraddha [1 ]
Kazi, Faruk [1 ]
Affiliations
[1] Veermata Jijabai Technol Inst, Mumbai, Maharashtra, India
Keywords
Audio deepfakes; Feature-based classification; Image-based classification; Temporal convolutional networks; Spatial transformer networks; SPEAKER VERIFICATION; SPEECH; CLASSIFICATION; NETWORKS; FEATURES;
DOI
10.1007/s13369-021-06297-w
CLC classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject classification codes
07; 0710; 09
Abstract
Audio deepfakes have increasingly emerged as a potential source of deceit alongside the development of avant-garde methods of synthetic speech generation. Differentiating fake audio from real audio is becoming ever more difficult owing to the increasing accuracy of text-to-speech models, posing a serious threat to speaker verification systems. Within the domain of audio deepfake detection, most experiments have been based on the ASVSpoof or AVSpoof datasets, using various machine learning and deep learning approaches. In this work, experiments were performed on a more recent dataset, the Fake or Real (FoR) dataset, which contains data generated by some of the best text-to-speech models. Two approaches were adopted to solve the problem: a feature-based approach and an image-based approach. The feature-based approach converts the audio data into a dataset of spectral features of the audio samples, which are fed to machine learning algorithms that classify each sample as fake or real. In the image-based approach, audio samples are converted into mel-spectrograms, which are input to deep learning algorithms, namely a Temporal Convolutional Network (TCN) and a Spatial Transformer Network (STN). The TCN was chosen because it is a sequential model and has been shown to perform well on sequential data. A comparison between the two approaches shows that the deep learning algorithms, particularly the TCN, outperform the machine learning algorithms by a significant margin, reaching 92% test accuracy. This solution presents a model for audio deepfake classification whose accuracy is comparable to that of traditional CNN models such as VGG16 and XceptionNet.
Pages: 3447-3458
Page count: 12
References
40 entries
[1]   MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation [J].
Abu Farha, Yazan ;
Gall, Juergen .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :3570-3579
[2]  
Alqahtani S., 2019, ARXIV PREPRINT ARXIV
[3]  
Bai Shaojie, 2018, CoRR
[4]   Toward Robust Audio Spoofing Detection: A Detailed Comparison of Traditional and Learned Features [J].
Balamurali, B. T. ;
Lin, Kin Wan Edward ;
Lui, Simon ;
Chen, Jer-Ming ;
Herremans, Dorien .
IEEE ACCESS, 2019, 7 :84229-84241
[5]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[6]   A comparison of features for speech, music discrimination [J].
Carey, MJ ;
Parris, ES ;
Lloyd-Thomas, H .
ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999, :149-152
[7]   XGBoost: A Scalable Tree Boosting System [J].
Chen, Tianqi ;
Guestrin, Carlos .
KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2016, :785-794
[8]   Probabilistic forecasting with temporal convolutional neural network [J].
Chen, Yitian ;
Kang, Yanfei ;
Chen, Yixiong ;
Wang, Zizhuo .
NEUROCOMPUTING, 2020, 399 :491-501
[9]  
Danilyuk K., 2017, DATA SCI
[10]   Detection of COVID-19 from speech signal using bio-inspired based cepstral features [J].
Dash, Tusar Kanti ;
Mishra, Soumya ;
Panda, Ganapati ;
Satapathy, Suresh Chandra .
PATTERN RECOGNITION, 2021, 117