Source Code Authorship Identification Using Deep Neural Networks

Cited by: 16
Authors
Kurtukova, Anna [1]
Romanov, Aleksandr [1]
Shelupanov, Alexander [1]
Affiliations
[1] Tomsk State Univ Control Syst & Radioelect, Fac Secur, Tomsk 634050, Russia
Source
SYMMETRY-BASEL | 2020 / Vol. 12 / Issue 12
Keywords
source code; authorship; symmetry; software engineering; machine learning; deanonymization; neural networks
DOI
10.3390/sym12122044
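Chinese Library Classification (CLC) codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract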
Many open-source projects are developed by the community and share a common basis. The more of a project's source code is open, the more the project is open to contributors. The possibility of accidental or deliberate use of someone else's source code as closed functionality in another project (even a commercial one) cannot be excluded. This situation can lead to copyright disputes. Adding a plagiarism check to the project lifecycle during software engineering addresses this problem. However, not all code samples needed for comparison can be found in the public domain. In this case, methods for identifying the source code author can be useful. Therefore, identifying the source code author is an important problem in software engineering, and it is also a research area in symmetry. This article discusses the problem of identifying the source code author and modern methods of solving it. Based on the experience of researchers in the field of natural language processing (NLP), the authors propose a technique based on a hybrid neural network and demonstrate its results both for simple cases of determining code authorship and for cases complicated by obfuscation and the use of coding standards. The results show that the authors' technique successfully solves the essential problems of analogous methods and can be effective even in cases where there are no obvious signs indicating authorship. The average accuracy obtained for all programming languages was 95% in the simple case and exceeded 80% in the complicated ones.
Pages: 17