Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification

Cited by: 1

Authors:

Wang, Qing [1]

Yao, Jixun [1]

Wang, Ziqian [1]

Guo, Pengcheng [1]

Xie, Lei [1]

Affiliations:

[1] Northwestern Polytechnical University, Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Xi'an, China

Source:

INTERSPEECH 2023 | 2023

Keywords:

speaker identification; adversarial attack; black-box; timbre-reserved; COUNTERMEASURES; VERIFICATION

DOI:

10.21437/Interspeech.2023-1352

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

In this study, we propose a timbre-reserved adversarial attack approach for speaker identification (SID) that not only exploits the weakness of the SID model but also preserves the timbre of the target speaker in a black-box attack setting. Specifically, we generate timbre-reserved fake audio by adding an adversarial constraint during the training of the voice conversion model. We then leverage a pseudo-Siamese network architecture to learn from the black-box SID model, constraining intrinsic similarity and structural similarity simultaneously. The intrinsic similarity loss learns an intrinsic invariance, while the structural similarity loss ensures that the substitute SID model shares a similar decision boundary with the fixed black-box SID model. The substitute model can then be used as a proxy to generate timbre-reserved fake audio for attacking. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset indicate that the attack success rate of the proposed approach reaches up to 60.58% and 55.38% in the white-box and black-box scenarios, respectively, and that the generated audio can deceive both human beings and machines.

Citation

Pages: 3994-3998

Page count: 5

