Evaluating the Evaluations of Code Recommender Systems: A Reality Check

Cited by: 12
Authors
Proksch, Sebastian [1]
Amann, Sven [1]
Nadi, Sarah [1]
Mezini, Mira [1]
Affiliations
[1] Tech Univ Darmstadt, Software Technol Grp, Darmstadt, Germany
Source
2016 31ST IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE) | 2016
Keywords
Empirical Study; Artificial Evaluation; IDE Interaction Data;
DOI
10.1145/2970276.2970330
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
While researchers develop many exciting new code recommender systems, such as method-call completion, code-snippet completion, or code search, accurately evaluating such systems remains a challenge. We analyzed the current literature and found that most evaluations rely on artificial queries extracted from released code, which raises the question: do such evaluations reflect real-life usage? To answer this question, we capture 6,189 fine-grained development histories from real IDE interactions. We use them as ground truth and extract 7,157 real queries for a specific method-call recommender system. We compare the results of these real queries with different artificial evaluation strategies and check several assumptions that are repeatedly used in research but have never been empirically evaluated. We find that an evolving context, which is often observed in practice, has a major effect on the prediction quality of recommender systems but is not commonly reflected in artificial evaluations.
Pages: 111-121
Number of pages: 11