Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task

被引:5
作者
Gorchakov, Artyom V. [1 ]
Demidova, Liliya A. [1 ]
Sovietov, Peter N. [1 ]
机构
[1] Russian Technol Univ, Fed State Budget Educ Inst Higher Educ, Inst Informat Technol, MIREA, 78 Vernadsky Ave, Moscow 119454, Russia
基金
英国科研创新办公室;
关键词
program classification; Markov chains; abstract syntax trees; task classification; static analysis; k-nearest neighbor; support vector machine; random forest; multilayer perceptron;
D O I
10.3390/fi15090314
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
In this paper we consider the research and development of classifiers that are trained to predict the task solved by source code. Possible applications of such task detection algorithms include method name prediction, hardware-software partitioning, programming standard violation detection, and semantic code duplication search. We provide the comparative analysis of modern approaches to source code transformation into vector-based representations that extend the variety of classification and clustering algorithms that can be used for intelligent source code analysis. These approaches include word2vec, code2vec, first-order and second-order Markov chains constructed from abstract syntax trees (AST), histograms of assembly language instruction opcodes, and histograms of AST node types. The vectors obtained with the forementioned approaches are then used to train such classification algorithms as k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP). The obtained results show that the use of program vectors based on first-order AST-based Markov chains with an RF-based classifier leads to the highest accuracy, precision, recall, and F1 score. Increasing the order of Markov chains considerably increases the dimensionality of a vector, without any improvements in classifier quality, so we assume that first-order Markov chains are best suitable for real world applications. Additionally, the experimental study shows that first-order AST-based Markov chains are least sensitive to the used classification algorithm.
引用
收藏
页数:28
相关论文
共 57 条
[41]  
Python Software Foundation, 2023, Tokenize-Tokenizer for Python Source
[42]   Malware Classification Based on Multilayer Perception and Word2Vec for IoT Security [J].
Qiao, Yanchen ;
Zhang, Weizhe ;
Du, Xiaojiang ;
Guizani, Mohsen .
ACM TRANSACTIONS ON INTERNET TECHNOLOGY, 2022, 22 (01)
[43]  
Rad BB, 2012, 2012 INTERNATIONAL CONFERENCE ON E-LEARNING AND E-TECHNOLOGIES IN EDUCATION (ICEEE), P209, DOI 10.1109/ICeLeTE.2012.6333411
[44]  
Rehurek R., 2011, GENSIM PYTHON FRAMEW, V3, P2
[45]   Learning Syntactic Program Transformations from Examples [J].
Rolim, Reudismam ;
Soares, Gustavo ;
D'Antoni, Loris ;
Polozov, Oleksandr ;
Gulwani, Sumit ;
Gheyi, Rohit ;
Suzuki, Ryo ;
Hartmann, Bjorn .
2017 IEEE/ACM 39TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), 2017, :404-415
[46]   THE PERCEPTRON - A PROBABILISTIC MODEL FOR INFORMATION-STORAGE AND ORGANIZATION IN THE BRAIN [J].
ROSENBLATT, F .
PSYCHOLOGICAL REVIEW, 1958, 65 (06) :386-408
[47]   Automated Vulnerability Detection in Source Code Using Deep Representation Learning [J].
Russell, Rebecca L. ;
Kim, Louis ;
Hamilton, Lei H. ;
Lazovich, Tomo ;
Harer, Jacob A. ;
Ozdemir, Onur ;
Ellingwood, Paul M. ;
McConley, Marc W. .
2018 17TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2018, :757-762
[48]   PathPair2Vec: An AST path pair-based code representation method for defect prediction [J].
Shi, Ke ;
Lu, Yang ;
Chang, Jingfei ;
Wei, Zhen .
JOURNAL OF COMPUTER LANGUAGES, 2020, 59
[49]  
Simon F, 2001, FIFTH EUROPEAN CONFERENCE ON SOFTWARE MAINTENANCE AND REENGINEERING, PROCEEDINGS, P30, DOI 10.1109/CSMR.2001.914965
[50]  
Sovetov S.I., 2023, Russ. Technol. J, V11, P46, DOI [10.32362/2500-316X-2023-11-3-46-55, DOI 10.32362/2500-316X-2023-11-3-46-55]