Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

Cited by: 0
Authors
Zhu, Biru [1 ]
Cui, Ganqu [2 ]
Chen, Yangyi [3 ]
Qin, Yujia [2 ]
Yuan, Lifan [2 ]
Fu, Chong [4 ]
Deng, Yangdong [1 ]
Liu, Zhiyuan [2 ]
Sun, Maosong [2 ]
Gu, Ming [1 ]
Affiliations
[1] Tsinghua Univ, Sch Software, Beijing, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Univ Illinois, Champaign, IL USA
[4] Zhejiang Univ, Zhejiang, Peoples R China
Funding
National Key R&D Program of China;
DOI
10.1162/tacl_a_00622
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. Attackers can implant transferable, task-agnostic backdoors in PTMs and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor-removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end manner. The regularization term removes backdoor functionalities from PTMs, while the continual pre-training maintains their normal functionalities. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve their benign functionalities using only a small amount of downstream-task-irrelevant auxiliary data, e.g., unlabeled plain text. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense is applied to the backdoored BERT. The code is publicly available at https://github.com/thunlp/RECIPE.
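As a rough illustration of the training structure described in the abstract, the sketch below continually pre-trains a BERT-style masked language model on unlabeled plain text while adding a weighted regularization term to the loss. This is a minimal sketch and not the authors' implementation: the model name, the `reg_lambda` weight, and the `backdoor_regularizer` function (a simple penalty on large hidden activations) are illustrative placeholders; the paper's actual regularizer should be taken from the RECIPE repository linked above.

```python
# Minimal sketch (NOT the authors' implementation): regularized continual
# pre-training of a possibly backdoored BERT on unlabeled plain text.
# Assumes HuggingFace `transformers` and PyTorch are installed, and that
# `plain_texts` stands in for the downstream-task-irrelevant auxiliary data.
import torch
from torch.utils.data import DataLoader
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

model_name = "bert-base-uncased"   # placeholder for the backdoored PTM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.train()

plain_texts = ["An unlabeled auxiliary sentence.", "Another plain-text example."]
encodings = [tokenizer(t, truncation=True, max_length=128) for t in plain_texts]
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
loader = DataLoader(encodings, batch_size=2, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
reg_lambda = 0.1  # hypothetical weight for the regularization term


def backdoor_regularizer(hidden_states: torch.Tensor) -> torch.Tensor:
    """Placeholder regularizer (not the paper's exact formulation): penalizes
    abnormally large token representations, standing in for a term that
    suppresses backdoor-related activations."""
    return hidden_states.pow(2).sum(dim=-1).mean()


for epoch in range(1):
    for batch in loader:
        outputs = model(**batch, output_hidden_states=True)
        mlm_loss = outputs.loss                          # preserves normal functionality
        reg_loss = backdoor_regularizer(outputs.hidden_states[-1])
        loss = mlm_loss + reg_lambda * reg_loss          # end-to-end joint objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The point of the sketch is the joint objective: the masked-language-modeling loss on auxiliary plain text maintains the model's benign behavior, while the added regularization term is what targets the backdoor functionality.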
Pages: 1608-1623
Page count: 16