An alternative to code comment generation? Generating comment from bytecode

Cited by: 0
Authors
Chen, Xiangping [2]
Chen, Junqi [1]
Lian, Zhilu [1]
Huang, Yuan [1]
Zhou, Xiaocong [3, 4]
Wu, Yunzhi [5]
Zheng, Zibin [1]
Affiliations
[1] Sun Yat Sen Univ, Sch Software Engn, Zhuhai, Peoples R China
[2] Sun Yat Sen Univ, Sch Journalism & Commun, Guangzhou, Peoples R China
[3] Sun Yat Sen Univ, Guangzhou, Peoples R China
[4] Sch Comp Sci & Engn, Guangzhou, Peoples R China
[5] Guangzhou Modern Informat Engn Coll, Sch Econ & Management, Guangzhou, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
Code comment; Comment generation; Bytecode; Source code unavailable
DOI
10.1016/j.infsof.2024.107623
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Context: Because code comments are important and necessary, recent work has proposed many comment generation models that take source code as input. Sometimes, however, the source code is not accessible and only the bytecode is available, as is the case for many apps. Objective: If there is a way to generate comments directly from bytecode, tasks such as malware detection and understanding closed-source software can benefit, because the generated comments make the system easier to understand. We therefore propose a novel approach called ByteGen to generate comments from bytecode. Methods: To extract the structural characteristics of the bytecode, we use its control flow graph (CFG) and serialize the CFG with a special traversal named enhanced SBT. The enhanced SBT completely preserves the structure of the CFG in a sequence. We run experiments on a dataset of about 50,000 bytecode-comment pairs collected from Maven. Results: Experimental results show that the average BLEU-4 score of ByteGen is 28.67, which outperforms several baselines, and a human study also indicates the effectiveness of ByteGen in generating comments from bytecode. Conclusion: Overall, ByteGen performs better than the baselines, which demonstrates the effectiveness of our approach for code comment generation when source code is unavailable.
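
As an illustration of what serializing a CFG into a structure-preserving sequence can look like, the following is a minimal sketch in Python. It is not ByteGen's enhanced SBT, whose details are not given in this record. It assumes a bracketed, SBT-style depth-first traversal, and the `ref:` token used for already-visited blocks, the function name `sbt_serialize`, and the example graph are all hypothetical.

```python
from typing import Dict, List


def sbt_serialize(cfg: Dict[str, List[str]], entry: str) -> List[str]:
    """Serialize a CFG into a bracketed token sequence (SBT-style sketch).

    Assumption: `cfg` maps each basic-block label to its successor labels.
    """
    visited = set()
    tokens: List[str] = []

    def visit(node: str) -> None:
        if node in visited:
            # Block already expanded (e.g. a loop back edge): emit a
            # reference token so the edge is still recorded in the sequence.
            ref = f"ref:{node}"
            tokens.extend(["(", ref, ")", ref])
            return
        visited.add(node)
        tokens.extend(["(", node])   # open the bracket for this block
        for succ in cfg.get(node, []):
            visit(succ)              # nest each successor inside the bracket
        tokens.extend([")", node])   # close the bracket, repeating the label

    visit(entry)
    return tokens


# Hypothetical CFG of a simple loop: B1 is the loop header, B2 the body.
cfg = {"B0": ["B1"], "B1": ["B2", "B3"], "B2": ["B1"], "B3": []}
tokens = sbt_serialize(cfg, "B0")
print(" ".join(tokens))
```

Because every edge appears either as a nested bracket or as a `ref:` token, the original graph can be rebuilt from the sequence, which is the property the abstract attributes to the enhanced SBT.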
Pages: 12