Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

Cited by: 0
Authors
Zhu, Biru [1 ]
Cui, Ganqu [2 ]
Chen, Yangyi [3 ]
Qin, Yujia [2 ]
Yuan, Lifan [2 ]
Fu, Chong [4 ]
Deng, Yangdong [1 ]
Liu, Zhiyuan [2 ]
Sun, Maosong [2 ]
Gu, Ming [1 ]
Affiliations
[1] Tsinghua Univ, Sch Software, Beijing, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Univ Illinois, Champaign, IL USA
[4] Zhejiang Univ, Zhejiang, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Compendex;
DOI
10.1162/tacl_a_00622
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. Attackers can implant transferable, task-agnostic backdoors in PTMs and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end manner. The regularization term removes backdoor functionalities from PTMs, while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve their benign functionalities using only a small amount of downstream-task-irrelevant auxiliary data, e.g., unlabeled plain texts. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense on the backdoored BERT. The code is publicly available at https://github.com/thunlp/RECIPE.
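The abstract describes the defense at a high level: continue pre-training the backdoored PTM on downstream-task-irrelevant plain text while adding a regularization term that suppresses backdoor behavior. Below is a minimal illustrative sketch of that kind of regularized continual pre-training in PyTorch; the model name `bert-base-uncased`, the trade-off weight `reg_weight`, and the squared-activation penalty are assumptions standing in for the paper's actual regularizer, which the abstract does not specify.

```python
# Illustrative sketch of regularized continual pre-training (not the paper's exact
# recipe): a masked-language-modeling loss on auxiliary plain text plus an assumed
# activation-magnitude penalty as the backdoor-suppressing regularizer.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

model_name = "bert-base-uncased"      # stand-in for a possibly backdoored PTM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name, output_hidden_states=True)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
reg_weight = 1.0                      # hypothetical loss/regularizer trade-off

def continual_pretraining_step(texts):
    """One end-to-end update on downstream-task-irrelevant plain texts."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    features = [{k: v[i] for k, v in enc.items()} for i in range(len(texts))]
    batch = collator(features)        # applies random token masking for MLM
    outputs = model(**batch)
    mlm_loss = outputs.loss           # continual pre-training preserves benign behavior
    # Assumed regularizer: penalize large intermediate activations, which
    # trigger inputs tend to induce in backdoored PTMs.
    hidden = torch.stack(outputs.hidden_states[1:])
    reg_loss = hidden.pow(2).mean()
    loss = mlm_loss + reg_weight * reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage on a handful of unlabeled sentences:
# continual_pretraining_step(["Pre-trained models are widely reused.",
#                             "Unlabeled plain text suffices for this defense."])
```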
Pages: 1608-1623
Number of pages: 16