A Toolkit for Generating Code Knowledge Graphs

Cited by: 10
Authors
Abdelaziz, Ibrahim [1]
Dolby, Julian [1]
McCusker, Jamie [2]
Srinivas, Kavitha [1]
Affiliations
[1] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
[2] Rensselaer Polytech Inst RPI, Troy, NY USA
Source
PROCEEDINGS OF THE 11TH KNOWLEDGE CAPTURE CONFERENCE (K-CAP '21) | 2021
Keywords
Knowledge Graphs; Code Understanding; Code Analysis; Code Graphs
DOI
10.1145/3460210.3493578
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Knowledge graphs have been proven extremely useful in powering diverse applications in semantic search and natural language understanding. In this work, we present GRAPHGEN4CODE, a toolkit to build code knowledge graphs that can similarly power various applications such as program search, code understanding, bug detection, and code automation. GRAPHGEN4CODE uses generic techniques to capture code semantics with the key nodes in the graph representing classes, functions and methods. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code), and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). Our toolkit uses named graphs in RDF to model graphs per program, or can output graphs as JSON. We show the scalability of the toolkit by applying it to 1.3 million Python files drawn from GitHub, 2,300 Python modules, and 47 million forum posts. This results in an integrated code graph with over 2 billion triples. We make the toolkit to build such graphs as well as the sample extraction of the 2 billion triples graph publicly available to the community for use.
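As a rough illustration of the representation the abstract describes (one named graph per program, with edges for data flow between functions and for documentation, serializable either as RDF quads or as JSON), here is a minimal sketch in Python. It uses only the standard library. The namespace `http://example.org/code/`, the predicate names `flowsTo` and `doc`, and the helpers `demo_quads`, `to_nquads`, and `to_json` are hypothetical and are not taken from the toolkit; its actual schema and output layout may differ.

```python
import json
from collections import defaultdict
from typing import NamedTuple

# Hypothetical namespace for illustration; not the toolkit's own vocabulary.
BASE = "http://example.org/code/"


class Quad(NamedTuple):
    subject: str      # IRI of a class, function, or method node
    predicate: str    # IRI of the edge label
    obj: str          # IRI of another node, or literal text
    is_literal: bool  # True when obj is plain text such as documentation
    graph: str        # IRI of the named graph (one per analyzed program)


def demo_quads(program):
    """Build two example edges inside the named graph for one program."""
    graph = BASE + "program/" + program
    read_csv = BASE + "pandas.read_csv"
    fit = BASE + "sklearn.svm.SVC.fit"
    return [
        # Data-flow edge: output of one call reaches another call.
        Quad(read_csv, BASE + "flowsTo", fit, False, graph),
        # Documentation edge: a node linked to descriptive text.
        Quad(read_csv, BASE + "doc", "placeholder documentation text", True, graph),
    ]


def to_nquads(quads):
    """Serialize quads as N-Quads-style lines (literal escaping is simplified)."""
    lines = []
    for q in quads:
        obj = json.dumps(q.obj) if q.is_literal else "<" + q.obj + ">"
        lines.append("<%s> <%s> %s <%s> ." % (q.subject, q.predicate, obj, q.graph))
    return "\n".join(lines)


def to_json(quads):
    """Group the same edges by named graph and emit them as JSON."""
    by_graph = defaultdict(list)
    for q in quads:
        by_graph[q.graph].append({"s": q.subject, "p": q.predicate, "o": q.obj})
    return json.dumps(by_graph, indent=2)


if __name__ == "__main__":
    quads = demo_quads("demo.py")
    print(to_nquads(quads))
    print(to_json(quads))
```

Keeping the graph IRI as a fourth component of every statement is what lets edges from many programs be stored together while remaining attributable to the program they were extracted from.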
Pages: 137-144
Number of pages: 8