Learning programs from noisy data

被引:0
作者
Raychev V. [1 ]
Bielik P. [1 ]
Vechev M. [1 ]
Krause A. [1 ]
机构
[1] Department of Computer Science, ETH Zurich
来源
| 1600年 / Association for Computing Machinery, 2 Penn Plaza, Suite 701, New York, NY 10121-0701, United States卷 / 51期
基金
欧盟地平线“2020”;
关键词
Anomaly detection; Big code; Noisy data; Program synthesis; Regularization; Statistical code completion;
D O I
10.1145/2837614.2837671
中图分类号
学科分类号
摘要
We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator which produces a candidate program based on a small sample of the entire dataset while avoiding overfitting, and a dataset sampler which carefully samples the dataset by leveraging the candidate program's score on that dataset. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach to two settings: one where the dataset has a bound on the noise, and another without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypotheses classes formed by a discrete search space. We then present two new kinds of program synthesizers which target the two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting - a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open source programs, also known as "Big Code". The key idea is to introduce a domain specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique which generalizes several existing works as we no longer need to decide a priori on what the prediction should be conditioned (another benefit is that the learned functions are a natural mechanism for explaining the prediction). As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems. © 2016 ACM.
引用
收藏
页码:761 / 774
页数:13
相关论文
共 37 条
[1]  
Allamanis M., Sutton C., Mining source code repositories at massive scale using language modeling, MSR, (2013)
[2]  
Allamanis M., Sutton C., Mining idioms from source code, Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 472-483, (2014)
[3]  
Baker J.E., Reducing bias and inefficiency in the selection algorithm, Proceedings of the Second International Conference on Genetic Algorithms on Genetic Algorithms and Their Application, pp. 14-21, (1987)
[4]  
Banzhaf W., Francone F.D., Keller R.E., Nordin P., Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications, (1998)
[5]  
Barowy D.W., Gochev D., Berger E.D., Checkcell: Data debugging for spreadsheets, Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages &38
[6]  
Applications, pp. 507-523, (2014)
[7]  
Chandola V., Banerjee A., Kumar V., Anomaly detection: A survey, ACM Comput. Surv., 41, 3, pp. 151-1558, (2009)
[8]  
Chaudhuri S., Clochard M., Solar-Lezama A., Bridging boolean and quantitative synthesis using smoothed proof search, Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 207-220, (2014)
[9]  
Cramer N.L., A representation for the adaptive generation of simple sequential programs, Proceedings of the 1st International Conference on Genetic Algorithms, pp. 183-187, (1985)
[10]  
De Moura L., BjOrner N., Z3: An efficient smt solver, Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 337-340, (2008)