Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

Cited by: 0
Authors
Zhu, Biru [1 ]
Cui, Ganqu [2 ]
Chen, Yangyi [3 ]
Qin, Yujia [2 ]
Yuan, Lifan [2 ]
Fu, Chong [4 ]
Deng, Yangdong [1 ]
Liu, Zhiyuan [2 ]
Sun, Maosong [2 ]
Gu, Ming [1 ]
Affiliations
[1] Tsinghua Univ, Sch Software, Beijing, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Univ Illinois, Champaign, IL USA
[4] Zhejiang Univ, Zhejiang, Peoples R China
Funding
National Key R&D Program of China;
DOI
10.1162/tacl_a_00622
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. Attackers can implant transferable, task-agnostic backdoors in PTMs and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor-removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end manner. The regularization term removes backdoor functionalities from PTMs, while the continual pre-training maintains their normal functionalities. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve their benign functionalities using only a small amount of downstream-task-irrelevant auxiliary data, e.g., unlabeled plain text. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense is applied to the backdoored BERT. The code is publicly available at https://github.com/thunlp/RECIPE.
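As a rough illustration of the training structure described in the abstract, the sketch below continually pre-trains a BERT-style masked language model on unlabeled plain text while adding a weighted regularization term to the loss. This is a minimal sketch and not the authors' implementation: the model name, the `reg_lambda` weight, and the `backdoor_regularizer` function (a simple penalty on large hidden activations) are illustrative placeholders; the paper's actual regularizer should be taken from the RECIPE repository linked above.

```python
# Minimal sketch (NOT the authors' implementation): regularized continual
# pre-training of a possibly backdoored BERT on unlabeled plain text.
# Assumes HuggingFace `transformers` and PyTorch are installed, and that
# `plain_texts` stands in for the downstream-task-irrelevant auxiliary data.
import torch
from torch.utils.data import DataLoader
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

model_name = "bert-base-uncased"   # placeholder for the backdoored PTM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.train()

plain_texts = ["An unlabeled auxiliary sentence.", "Another plain-text example."]
encodings = [tokenizer(t, truncation=True, max_length=128) for t in plain_texts]
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
loader = DataLoader(encodings, batch_size=2, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
reg_lambda = 0.1  # hypothetical weight for the regularization term


def backdoor_regularizer(hidden_states: torch.Tensor) -> torch.Tensor:
    """Placeholder regularizer (not the paper's exact formulation): penalizes
    abnormally large token representations, standing in for a term that
    suppresses backdoor-related activations."""
    return hidden_states.pow(2).sum(dim=-1).mean()


for epoch in range(1):
    for batch in loader:
        outputs = model(**batch, output_hidden_states=True)
        mlm_loss = outputs.loss                          # preserves normal functionality
        reg_loss = backdoor_regularizer(outputs.hidden_states[-1])
        loss = mlm_loss + reg_lambda * reg_loss          # end-to-end joint objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The point of the sketch is the joint objective: the masked-language-modeling loss on auxiliary plain text maintains the model's benign behavior, while the added regularization term is what targets the backdoor functionality.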
Pages: 1608-1623
Page count: 16