Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training

Cited by: 0
Authors
Liu, Yuanxin [1 ,2 ]
Meng, Fandong [3 ]
Lin, Zheng [1 ,2 ]
Fu, Peng [1 ]
Cao, Yanan [1 ,2 ]
Wang, Weiping [1 ]
Zhou, Jie [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
[3] Tencent Inc, WeChat AI, Pattern Recognit Ctr, Shenzhen, Peoples R China
Source
NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES | 2022
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent studies on the lottery ticket hypothesis (LTH) show that pre-trained language models (PLMs) like BERT contain matching subnetworks that have similar transfer learning performance to the original PLM. These subnetworks are found using magnitude-based pruning. In this paper, we find that the BERT subnetworks have even more potential than these studies have shown. First, we discover that the success of magnitude pruning can be attributed to the preserved pre-training performance, which correlates with the downstream transferability. Inspired by this, we propose to directly optimize the subnetwork structure towards the pre-training objectives, which can better preserve the pre-training performance. Specifically, we train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork, which is agnostic to any specific downstream tasks. We then fine-tune the subnetworks on the GLUE benchmark and the SQuAD dataset. The results show that, compared with magnitude pruning, mask training can effectively find BERT subnetworks with improved overall performance on downstream tasks. Moreover, our method is also more efficient in searching subnetworks and more advantageous when fine-tuning within a certain range of data scarcity. Our code is available at https://github.com/llyx97/TAMT.
Pages: 5840-5857
Page count: 18
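The record above carries only the abstract, but the core idea, learning binary masks over frozen pre-trained weights with a pre-training objective as the search signal, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch illustration, not the authors' TAMT implementation: the `MaskedLinear` module, the magnitude-based score initialization, the straight-through binarization, and the toy regression objective standing in for the masked-language-modeling loss are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of task-agnostic mask training:
# per-weight scores are learned, binarized by a threshold in the forward pass,
# and updated with straight-through gradients, while the pre-trained weights
# themselves stay frozen.

import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarizeSTE(torch.autograd.Function):
    """Binarize scores at a threshold; pass gradients straight through."""

    @staticmethod
    def forward(ctx, scores, threshold):
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through estimator


class MaskedLinear(nn.Module):
    """Linear layer whose frozen weight is multiplied by a learned binary mask."""

    def __init__(self, linear: nn.Linear, sparsity: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach(), requires_grad=False)
        self.bias = nn.Parameter(linear.bias.detach(), requires_grad=False)
        # Initialize scores from weight magnitudes (one common choice).
        self.scores = nn.Parameter(self.weight.abs().clone())
        self.sparsity = sparsity

    def forward(self, x):
        # Threshold chosen so roughly `sparsity` of the weights are zeroed out.
        k = max(int(self.scores.numel() * self.sparsity), 1)
        threshold = self.scores.detach().flatten().kthvalue(k).values
        mask = BinarizeSTE.apply(self.scores, threshold)
        return F.linear(x, self.weight * mask, self.bias)


if __name__ == "__main__":
    # Toy "pre-training" objective: regress a frozen teacher layer's outputs,
    # standing in for the MLM loss used on the real pre-training task.
    torch.manual_seed(0)
    teacher = nn.Linear(64, 64)
    student = MaskedLinear(nn.Linear(64, 64), sparsity=0.5)
    opt = torch.optim.Adam([student.scores], lr=1e-2)

    for step in range(200):
        x = torch.randn(32, 64)
        loss = F.mse_loss(student(x), teacher(x).detach())
        opt.zero_grad()
        loss.backward()  # gradients reach `scores` via the straight-through pass
        opt.step()

    with torch.no_grad():
        thr = student.scores.flatten().kthvalue(student.scores.numel() // 2).values
        density = (student.scores >= thr).float().mean().item()
    print(f"final loss {loss.item():.4f}, surviving weights {density:.2%}")
```

The point worth noting in this sketch is that only the mask scores receive gradients, so the search changes the subnetwork structure rather than the pre-trained weights, matching the task-agnostic setting described in the abstract.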