Experimental Case Study of Self-Supervised Learning for Voice Spoofing Detection

Cited by: 2
Authors
Lee, Yerin [1]
Kim, Narin [1]
Jeong, Jaehong [2, 3]
Kwak, Il-Youp [1]
Affiliations
[1] Chung Ang Univ, Dept Appl Stat, Seoul 06974, South Korea
[2] Hanyang Univ, Dept Math, Seoul 04763, South Korea
[3] Hanyang Univ, Res Inst Nat Sci, Seoul 04763, South Korea
Source
IEEE ACCESS | 2023 / Vol. 11
Funding
National Research Foundation, Singapore;
Keywords
Self-supervised learning; Task analysis; Supervised learning; Speech processing; Deep learning; Training; Microphones; Spoofing detection; self-supervised learning; contrastive learning;
DOI
10.1109/ACCESS.2023.3254880
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
This study aims to improve the performance of voice spoofing attack detection through self-supervised pre-training. Supervised learning needs appropriate input variables and corresponding labels for constructing the machine learning models that are to be applied. It is necessary to secure a large number of labeled datasets to improve the performance of supervised learning processes. However, labeling requires substantial inputs of time and effort. One of the methods for managing this requirement is self-supervised learning, which uses pseudo-labeling without the necessity for substantial human input. This study experimented with contrastive learning, a well-performing self-supervised learning approach, to construct a voice spoofing detection model. We applied MoCo's dynamic dictionary, SimCLR's symmetric loss, and COLA's bilinear similarity in our contrastive learning framework. Our model was trained using VoxCeleb data and voice data extracted from YouTube videos. Our self-supervised model improved the performance of the baseline model from 6.93% to 5.26% for a logical access (LA) scenario and improved the performance of the baseline model from 0.60% to 0.40% for a physical access (PA) scenario. In the case of PA, the best performance was achieved when random crop augmentation was applied, and in the case of LA, the best performance was obtained when random crop and random shifting augmentations were considered.
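As a rough illustration of two ingredients named in the abstract, the sketch below combines a COLA-style bilinear similarity with a SimCLR-style symmetric loss, and adds simple random crop and random shift augmentations. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the circular form of the shift, the list-based vectors, and the toy inputs are all hypothetical, and MoCo's dynamic dictionary (a queue of past keys with a momentum encoder) is left out entirely.

```python
import math
import random


def bilinear_similarity(x, w, y):
    """Return x^T W y for vectors x, y and a square matrix w (lists of floats)."""
    wy = [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) for row in w]
    return sum(x_i * v for x_i, v in zip(x, wy))


def cross_entropy_diagonal(sim):
    """Mean cross-entropy over rows, where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum - row[i]
    return total / len(sim)


def symmetric_contrastive_loss(anchors, positives, w):
    """Average the anchor-to-positive and positive-to-anchor losses."""
    sim = [[bilinear_similarity(a, w, p) for p in positives] for a in anchors]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the reverse direction
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))


def random_crop(signal, length, rng):
    """Return a contiguous window of `length` samples at a random offset."""
    start = rng.randrange(len(signal) - length + 1)
    return signal[start:start + length]


def random_shift(signal, max_shift, rng):
    """Circularly shift the signal by a random offset (assumed shift variant)."""
    k = rng.randint(-max_shift, max_shift) % len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


if __name__ == "__main__":
    rng = random.Random(0)
    wave = [float(i) for i in range(16)]           # toy stand-in for audio samples
    view_a = random_crop(wave, 8, rng)             # first augmented view
    view_b = random_shift(random_crop(wave, 8, rng), 2, rng)  # second view
    identity = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical untrained W
    embeddings = [[5.0, 0.0], [0.0, 5.0]]          # toy encoder outputs
    print(len(view_a), len(view_b))
    print(symmetric_contrastive_loss(embeddings, embeddings, identity))
```

In this toy setup each anchor is paired with the positive at the same index, so the loss is small when matching pairs score higher than mismatched ones and grows when the pairing is swapped.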
Pages: 24216 - 24226
Number of pages: 11