Continual pre-training mitigates forgetting in language and vision

Cited by: 0
Authors
Cossu, Andrea [1 ]
Carta, Antonio [1 ]
Passaro, Lucia [1 ]
Lomonaco, Vincenzo [1 ]
Tuytelaars, Tinne [2 ]
Bacciu, Davide [1 ]
Affiliations
[1] Univ Pisa, Largo B Pontecorvo 3, I-56127 Pisa, Italy
[2] Katholieke Univ Leuven, Kasteelpk Arenberg 10, B-3001 Leuven, Belgium
Funding
European Union Horizon 2020;
Keywords
Continual learning; Lifelong learning; Pre-training; Self-supervised; Forgetting;
DOI
10.1016/j.neunet.2024.106492
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Pre-trained models are commonly used in Continual Learning to initialize the model before training on the stream of non-stationary data. However, pre-training is rarely applied during Continual Learning. We investigate the characteristics of the Continual Pre-Training scenario, in which a model is continually pre-trained on a stream of incoming data and only later fine-tuned on different downstream tasks. We introduce an evaluation protocol for Continual Pre-Training that monitors forgetting against a Forgetting Control dataset not present in the continual stream. We disentangle the impact on forgetting of three main factors: the input modality (NLP, Vision), the architecture type (Transformer, ResNet), and the pre-training protocol (supervised, self-supervised). Moreover, we propose a Sample-Efficient Pre-training method (SEP) that speeds up the pre-training phase. We show that the pre-training protocol is the most important factor accounting for forgetting. Surprisingly, we find that self-supervised continual pre-training, in both NLP and Vision, is sufficient to mitigate forgetting without the use of any Continual Learning strategy. Other factors, such as model depth, input modality, and architecture type, are not as crucial.
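The evaluation protocol described in the abstract can be pictured as a simple loop: after each pre-training experience, a frozen copy of the backbone is probed on a Forgetting Control dataset that never appears in the stream, so the change in probe accuracy over the stream tracks forgetting. Below is a minimal sketch in PyTorch; the dummy data, dimensions, helper names (pretrain, probe_forgetting_control), and the linear-probe fine-tuning are illustrative assumptions, not the authors' implementation.

# Minimal sketch (PyTorch assumed) of a continual pre-training evaluation loop.
# The backbone is pre-trained over a stream of experiences; after each experience
# a frozen copy is probed on a forgetting-control dataset never seen in the stream.
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))

# Stream of pre-training experiences (dummy regression-style pretext targets).
stream = [TensorDataset(torch.randn(256, 32), torch.randn(256, 64)) for _ in range(3)]
# Forgetting-control data: a downstream task held out from the continual stream.
fc_data = TensorDataset(torch.randn(128, 32), torch.randint(0, 5, (128,)))

def pretrain(model, dataset, epochs=1):
    # Continual pre-training step on one experience of the stream.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in DataLoader(dataset, batch_size=32, shuffle=True):
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()

def probe_forgetting_control(model, dataset, epochs=3):
    # Fine-tune a linear head on frozen features; accuracy proxies forgetting.
    frozen = copy.deepcopy(model).eval()
    for p in frozen.parameters():
        p.requires_grad_(False)
    head = nn.Linear(64, 5)
    opt = torch.optim.Adam(head.parameters(), lr=1e-2)
    for _ in range(epochs):
        for x, y in DataLoader(dataset, batch_size=32, shuffle=True):
            loss = nn.functional.cross_entropy(head(frozen(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        x, y = dataset.tensors
        return (head(frozen(x)).argmax(dim=1) == y).float().mean().item()

for i, experience in enumerate(stream):
    pretrain(backbone, experience)
    acc = probe_forgetting_control(backbone, fc_data)
    print(f"experience {i}: forgetting-control accuracy = {acc:.3f}")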
Pages: 14