Retrieval on Source Code: A Neural Code Search

Cited by: 104
Authors
Sachdev, Saksham [1]
Li, Hongyu [2]
Luan, Sifei [2]
Kim, Seohyun [2]
Sen, Koushik [3]
Chandra, Satish [2]
Affiliations
[1] Univ Waterloo, Waterloo, ON, Canada
[2] Facebook Inc, Cambridge, MA USA
[3] Univ Calif Berkeley, Berkeley, CA 94720 USA
Source
MAPL'18: PROCEEDINGS OF THE 2ND ACM SIGPLAN INTERNATIONAL WORKSHOP ON MACHINE LEARNING AND PROGRAMMING LANGUAGES | 2018
Keywords
code search; word-embedding; TF-IDF
DOI
10.1145/3211346.3211353
Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Searching over large code corpora can be a powerful productivity tool for both beginner and experienced developers because it helps them quickly find examples of code related to their intent. Code search becomes even more attractive if developers can express their intent in natural language, similar to the interaction that Stack Overflow supports. In this paper, we investigate the use of natural language processing and information retrieval techniques to carry out natural language search directly over source code, i.e., without having a curated Q&A forum such as Stack Overflow at hand. Our experiments using a benchmark suite derived from Stack Overflow and GitHub repositories show promising results. We find that while a basic word-embedding based search procedure works acceptably, better results can be obtained by adding a layer of supervision, as well as by a customized ranking strategy.
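As a rough illustration of what a "basic word-embedding based search procedure" might look like, the sketch below averages word vectors with TF-IDF-style weights and ranks code snippets by cosine similarity to the query. It is a minimal sketch under stated assumptions, not the authors' implementation: the three-dimensional vectors in `TOY_VECTORS`, the tiny `CORPUS`, and every function name are hypothetical, and the IDF formula is one common smoothed variant chosen for convenience. The supervision layer and customized ranking strategy mentioned in the abstract are not shown.

```python
import math
from collections import Counter

# Hypothetical hand-made word vectors; a real system would learn them from code.
TOY_VECTORS = {
    "read": [0.9, 0.1, 0.0],
    "load": [0.8, 0.2, 0.0],
    "file": [0.7, 0.3, 0.1],
    "json": [0.6, 0.1, 0.3],
    "close": [0.1, 0.9, 0.0],
    "socket": [0.0, 0.8, 0.3],
    "connection": [0.1, 0.8, 0.2],
}

# Hypothetical corpus: each snippet is represented by tokens extracted from it.
CORPUS = {
    "loadJsonFile": ["load", "json", "file"],
    "closeSocket": ["close", "socket", "connection"],
}


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def embed(tokens, idf):
    """IDF-weighted average of the word vectors of the known tokens."""
    dim = len(next(iter(TOY_VECTORS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        vec = TOY_VECTORS.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary tokens
        w = idf.get(tok, 1.0)  # assumption: unseen tokens get weight 1.0
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_tokens, corpus, top_k=1):
    """Return the names of the top_k snippets most similar to the query."""
    idf = idf_weights(list(corpus.values()))
    query_vec = embed(query_tokens, idf)
    scored = [(cosine(query_vec, embed(toks, idf)), name) for name, toks in corpus.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]
```

Calling `search(["read", "file"], CORPUS)` embeds the query in the same vector space as the snippets, so a query word such as "read" can match a snippet containing "load" even though the two share no tokens, which is the property that makes embedding-based search attractive for natural language queries over code.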
Pages: 31-41
Number of pages: 11