Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

Cited by: 27
Authors
Bogomolov, Egor [1, 2]
Kovalenko, Vladimir [1, 3]
Rebryk, Yurii [2]
Bacchelli, Alberto [4]
Bryksin, Timofey [1, 2]
Affiliations
[1] JetBrains Res, Foster City, CA 94404 USA
[2] Higher Sch Econ, St Petersburg, Russia
[3] JetBrains NV, Amsterdam, Netherlands
[4] Univ Zurich, Zurich, Switzerland
Source
PROCEEDINGS OF THE 29TH ACM JOINT MEETING ON EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING (ESEC/FSE '21) | 2021
Funding
Swiss National Science Foundation
Keywords
Copyrights; Machine learning; Methods of data collection; Software process; Software maintenance; Security;
DOI
10.1145/3468264.3468606
Chinese Library Classification
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.
Pages: 932-944
Page count: 13