Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

Cited by: 0
Authors
Zhu, Biru [1 ]
Cui, Ganqu [2 ]
Chen, Yangyi [3 ]
Qin, Yujia [2 ]
Yuan, Lifan [2 ]
Fu, Chong [4 ]
Deng, Yangdong [1 ]
Liu, Zhiyuan [2 ]
Sun, Maosong [2 ]
Gu, Ming [1 ]
Affiliations
[1] Tsinghua Univ, Sch Software, Beijing, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Univ Illinois, Champaign, IL USA
[4] Zhejiang Univ, Zhejiang, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Compendex;
DOI
10.1162/tacl_a_00622
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. Attackers can implant transferable, task-agnostic backdoors in PTMs and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end manner. The regularization term removes backdoor functionalities from PTMs, while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve their benign functionalities using only a small amount of downstream-task-irrelevant auxiliary data, e.g., unlabeled plain texts. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense on the backdoored BERT. The code is publicly available at https://github.com/thunlp/RECIPE.
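The abstract describes the defense at a high level: continue pre-training the backdoored PTM on downstream-task-irrelevant plain text while adding a regularization term that suppresses backdoor behavior. Below is a minimal illustrative sketch of that kind of regularized continual pre-training in PyTorch; the model name `bert-base-uncased`, the trade-off weight `reg_weight`, and the squared-activation penalty are assumptions standing in for the paper's actual regularizer, which the abstract does not specify.

```python
# Illustrative sketch of regularized continual pre-training (not the paper's exact
# recipe): a masked-language-modeling loss on auxiliary plain text plus an assumed
# activation-magnitude penalty as the backdoor-suppressing regularizer.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

model_name = "bert-base-uncased"      # stand-in for a possibly backdoored PTM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name, output_hidden_states=True)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
reg_weight = 1.0                      # hypothetical loss/regularizer trade-off

def continual_pretraining_step(texts):
    """One end-to-end update on downstream-task-irrelevant plain texts."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    features = [{k: v[i] for k, v in enc.items()} for i in range(len(texts))]
    batch = collator(features)        # applies random token masking for MLM
    outputs = model(**batch)
    mlm_loss = outputs.loss           # continual pre-training preserves benign behavior
    # Assumed regularizer: penalize large intermediate activations, which
    # trigger inputs tend to induce in backdoored PTMs.
    hidden = torch.stack(outputs.hidden_states[1:])
    reg_loss = hidden.pow(2).mean()
    loss = mlm_loss + reg_weight * reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage on a handful of unlabeled sentences:
# continual_pretraining_step(["Pre-trained models are widely reused.",
#                             "Unlabeled plain text suffices for this defense."])
```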
Pages: 1608-1623
Number of pages: 16