CHARACTERIZING THE ADVERSARIAL VULNERABILITY OF SPEECH SELF-SUPERVISED LEARNING

Cited by: 4
Authors
Wu, Haibin [1 ,2 ]
Zheng, Bo [2 ,3 ]
Li, Xu [3 ]
Wu, Xixin [2 ,3 ]
Lee, Hung-Yi [1 ]
Meng, Helen [2 ,3 ]
Affiliations
[1] Natl Taiwan Univ, Grad Inst Commun Engn, Taipei, Taiwan
[2] Chinese Univ Hong Kong, Ctr Perceptual & Interact Intelligence, Hong Kong, Peoples R China
[3] Chinese Univ Hong Kong, Human Comp Commun Lab, Hong Kong, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Adversarial attack; self-supervised learning;
DOI
10.1109/ICASSP43922.2022.9747242
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims to benchmark the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal architecture modification and a small amount of data, has fueled research on speech representation learning. SUPERB demonstrates that speech SSL upstream models improve the performance of various downstream tasks with only minimal adaptation. As the paradigm of a self-supervised upstream model followed by downstream tasks attracts growing attention in the speech community, characterizing the adversarial robustness of this paradigm is a high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of this paradigm under attacks from both zero-knowledge and limited-knowledge adversaries. The experimental results show that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and that attacks generated by zero-knowledge adversaries are transferable. An XAB listening test verifies the imperceptibility of the crafted adversarial attacks.
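The limited-knowledge attacks the abstract refers to are typically gradient-based input perturbations. As an illustrative sketch only — using a toy logistic model in place of the SSL upstream + downstream pipeline, with all names and values being assumptions rather than the paper's setup — a one-step FGSM-style attack can be written as:

```python
import numpy as np

# Minimal FGSM-style adversarial perturbation sketch. A fixed logistic
# "model" stands in for the upstream/downstream speech pipeline; the
# weights, input, and epsilon below are illustrative, not from the paper.

def loss(w, x, y):
    # Logistic loss for a label y in {-1, +1}.
    return float(np.log1p(np.exp(-y * np.dot(w, x))))

def fgsm(w, x, y, eps):
    # Gradient of the logistic loss with respect to the input x.
    margin = -y * np.dot(w, x)
    grad_x = -y * w * (1.0 / (1.0 + np.exp(-margin)))
    # One-step L-infinity attack: nudge every input dimension by eps
    # in the direction that increases the loss.
    return x + eps * np.sign(grad_x)

w = np.array([0.8, -0.5, 0.3])    # stand-in model weights
x = np.array([1.0, -1.0, 0.5])    # "clean" input (e.g. a feature frame)
y = 1                             # ground-truth label
eps = 0.1                         # perturbation budget

x_adv = fgsm(w, x, y, eps)        # adversarial input within eps of x
```

The eps bound is what makes such perturbations imperceptible in the audio domain, which is what the paper's XAB test probes; a zero-knowledge (transfer) attack would instead compute the gradient on a different, surrogate model.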
Pages: 3164 - 3168
Page count: 5