Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

Cited by: 27
Authors
Bogomolov, Egor [1, 2]
Kovalenko, Vladimir [1, 3]
Rebryk, Yurii [2]
Bacchelli, Alberto [4]
Bryksin, Timofey [1, 2]
Affiliations
[1] JetBrains Res, Foster City, CA 94404 USA
[2] Higher Sch Econ, St Petersburg, Russia
[3] JetBrains NV, Amsterdam, Netherlands
[4] Univ Zurich, Zurich, Switzerland
Source
PROCEEDINGS OF THE 29TH ACM JOINT MEETING ON EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING (ESEC/FSE '21) | 2021
Funding
Swiss National Science Foundation;
Keywords
Copyrights; Machine learning; Methods of data collection; Software process; Software maintenance; Security;
DOI
10.1145/3468264.3468606
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.
Pages: 932-944
Page count: 13