Cracking double-blind review: Authorship attribution with deep learning

Cited by: 8
Authors
Bauersfeld, Leonard [1]
Romero, Angel [1]
Muglikar, Manasi [1]
Scaramuzza, Davide [1]
Affiliations
[1] Univ Zurich, Robot & Percept Grp, Zurich, Switzerland
Source
PLOS ONE | 2023 / Vol. 18 / Issue 06
Funding
European Research Council; Swiss National Science Foundation
DOI
10.1371/journal.pone.0287611
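Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract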
Double-blind peer review is considered a pillar of academic research because it is perceived to ensure a fair, unbiased, and fact-centered scientific discussion. Yet, experienced researchers can often correctly guess from which research group an anonymous submission originates, biasing the peer-review process. In this work, we present a transformer-based, neural-network architecture that only uses the text content and the author names in the bibliography to attribute an anonymous manuscript to an author. To train and evaluate our method, we created the largest authorship-identification dataset to date. It leverages all research papers publicly available on arXiv amounting to over 2 million manuscripts. In arXiv-subsets with up to 2,000 different authors, our method achieves an unprecedented authorship attribution accuracy, where up to 73% of papers are attributed correctly. We present a scaling analysis to highlight the applicability of the proposed method to even larger datasets when sufficient compute capabilities are more widely available to the academic community. Furthermore, we analyze the attribution accuracy in settings where the goal is to identify all authors of an anonymous manuscript. Thanks to our method, we are not only able to predict the author of an anonymous work but we also provide empirical evidence of the key aspects that make a paper attributable. We have open-sourced the necessary tools to reproduce our experiments.
Pages: 20