Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task

被引:5
作者
Gorchakov, Artyom V. [1 ]
Demidova, Liliya A. [1 ]
Sovietov, Peter N. [1 ]
机构
[1] Russian Technol Univ, Fed State Budget Educ Inst Higher Educ, Inst Informat Technol, MIREA, 78 Vernadsky Ave, Moscow 119454, Russia
基金
英国科研创新办公室;
关键词
program classification; Markov chains; abstract syntax trees; task classification; static analysis; k-nearest neighbor; support vector machine; random forest; multilayer perceptron;
D O I
10.3390/fi15090314
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
In this paper we consider the research and development of classifiers that are trained to predict the task solved by source code. Possible applications of such task detection algorithms include method name prediction, hardware-software partitioning, programming standard violation detection, and semantic code duplication search. We provide the comparative analysis of modern approaches to source code transformation into vector-based representations that extend the variety of classification and clustering algorithms that can be used for intelligent source code analysis. These approaches include word2vec, code2vec, first-order and second-order Markov chains constructed from abstract syntax trees (AST), histograms of assembly language instruction opcodes, and histograms of AST node types. The vectors obtained with the forementioned approaches are then used to train such classification algorithms as k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP). The obtained results show that the use of program vectors based on first-order AST-based Markov chains with an RF-based classifier leads to the highest accuracy, precision, recall, and F1 score. Increasing the order of Markov chains considerably increases the dimensionality of a vector, without any improvements in classifier quality, so we assume that first-order Markov chains are best suitable for real world applications. Additionally, the experimental study shows that first-order AST-based Markov chains are least sensitive to the used classification algorithm.
引用
收藏
页数:28
相关论文
共 57 条
[1]  
Alias C., 2003, P 10 WORK C REV ENG
[2]   Mining Idioms from Source Code [J].
Allamanis, Miltiadis ;
Sutton, Charles .
22ND ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING (FSE 2014), 2014, :472-483
[3]   code2vec: Learning Distributed Representations of Code [J].
Alon, Uri ;
Zilberstein, Meital ;
Levy, Omer ;
Yahav, Eran .
PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2019, 3 (POPL)
[4]   AN INTRODUCTION TO KERNEL AND NEAREST-NEIGHBOR NONPARAMETRIC REGRESSION [J].
ALTMAN, NS .
AMERICAN STATISTICIAN, 1992, 46 (03) :175-185
[5]  
[Anonymous], 2023, Python Software Foundation Dis-Disassembler for Python Bytecode
[6]  
Arató P, 2003, I S INTELL SIG PR, P197
[7]   A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python']Python [J].
Bagheri, Amirreza ;
Hegedus, Peter .
QUALITY OF INFORMATION AND COMMUNICATIONS TECHNOLOGY, QUATIC 2021, 2021, 1439 :267-281
[8]   Exploration of Convolutional Neural Network models for source code classification [J].
Barchi, Francesco ;
Parisi, Emanuele ;
Urgese, Gianvito ;
Ficarra, Elisa ;
Acquaviva, Andrea .
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2021, 97 (97)
[9]   Learning from Examples to Improve Code Completion Systems [J].
Bruch, Marcel ;
Monperrus, Martin ;
Mezini, Mira .
7TH JOINT MEETING OF THE EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND THE ACM SIGSOFT SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, 2009, :213-222
[10]  
Bui N.D.Q., 2017, Corr