Document-level Relation Extraction with Progressive Self-distillation

Cited by: 0
Authors
Wang, Quan [1 ]
Mao, Zhendong [2 ,3 ]
Gao, Jie [4 ]
Zhang, Yongdong [2 ,3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, MOE Key Lab Trustworthy Distributed Comp & Serv, 10 Xitucheng Rd, Beijing 100876, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci & Engn, 96 Jinzhai Rd, Hefei 230026, Anhui, Peoples R China
[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, 5089 Wangjiang West Rd, Hefei 230088, Anhui, Peoples R China
[4] Univ Sci & Technol China, Sch Informat Sci & Engn, 96 Jinzhai Rd, Hefei 230026, Anhui, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Document-level relation extraction; soft-label training regime; online knowledge distillation; self-knowledge distillation;
DOI
10.1145/3656168
CLC Classification Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Document-level relation extraction (RE) aims to simultaneously predict relations (including no-relation cases, denoted NA) between all entity pairs in a document. It is typically formulated as a relation classification task over entities detected in advance and solved with a hard-label training regime, which, however, neglects the divergence within the NA class and the correlations among the other classes. This article introduces progressive self-distillation (PSD), a new training regime that employs online self-knowledge distillation (KD) to produce and incorporate soft labels for document-level RE. The key idea of PSD is to gradually soften hard labels using past predictions from the RE model itself, adjusted adaptively as training proceeds. As such, PSD learns only one RE model within a single training pass, requiring no extra computation or annotation to pretrain a separate high-capacity teacher. PSD is conceptually simple, easy to implement, and generally applicable to various RE models to further improve their performance, without introducing additional parameters or significantly increasing training overhead. It is also a general framework that can be flexibly extended to distill various types of knowledge rather than being restricted to soft labels. Extensive experiments on four benchmark datasets verify the effectiveness and generality of the proposed approach. The code is available at https://github.com/GaoJieCN/psd
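The abstract's central mechanism, gradually replacing one-hot hard labels with the model's own earlier predictions, can be illustrated with a minimal sketch. The function names, the cached past-prediction tensor, and the linear weighting schedule below are assumptions made for illustration only and are not the authors' released PSD implementation (see the linked repository for that).

import torch.nn.functional as F

def psd_style_loss(logits, hard_labels, past_probs, alpha):
    # logits:      [batch, num_classes] current model outputs
    # hard_labels: [batch] gold class indices
    # past_probs:  [batch, num_classes] probabilities cached from an earlier
    #              pass of the same model (detached, no gradient)
    # alpha:       weight on the self-distilled soft targets
    one_hot = F.one_hot(hard_labels, num_classes=logits.size(-1)).float()
    soft_targets = (1.0 - alpha) * one_hot + alpha * past_probs
    log_probs = F.log_softmax(logits, dim=-1)
    # Cross-entropy against the progressively softened target distribution.
    return -(soft_targets * log_probs).sum(dim=-1).mean()

def alpha_schedule(epoch, total_epochs, alpha_max=0.5):
    # Assumed linear ramp: rely more on the model's own predictions
    # as training proceeds ("progressive" softening).
    return alpha_max * epoch / max(1, total_epochs - 1)

In this reading, standard hard-label training is recovered at alpha = 0; as alpha grows, the divergence within the NA class and the correlations among the other classes are expressed through the model's own probability estimates rather than a single hard label.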
Pages: 34