Learning programs from noisy data

被引：0

作者：

Raychev V. ^{[1
]}

Bielik P. ^{[1
]}

Vechev M. ^{[1
]}

Krause A. ^{[1
]}

机构：

[1] Department of Computer Science, ETH Zurich

来源：

| 1600年 / Association for Computing Machinery, 2 Penn Plaza, Suite 701, New York, NY 10121-0701, United States卷 / 51期

基金：

欧盟地平线“2020”;

关键词：

Anomaly detection; Big code; Noisy data; Program synthesis; Regularization; Statistical code completion;

D O I：

10.1145/2837614.2837671

中图分类号：

学科分类号：

摘要：

We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator which produces a candidate program based on a small sample of the entire dataset while avoiding overfitting, and a dataset sampler which carefully samples the dataset by leveraging the candidate program's score on that dataset. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach to two settings: one where the dataset has a bound on the noise, and another without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypotheses classes formed by a discrete search space. We then present two new kinds of program synthesizers which target the two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting - a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open source programs, also known as "Big Code". The key idea is to introduce a domain specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique which generalizes several existing works as we no longer need to decide a priori on what the prediction should be conditioned (another benefit is that the learned functions are a natural mechanism for explaining the prediction). As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems. © 2016 ACM.

引用

页码：761 / 774

页数：13

共 37 条

[1]

Allamanis M., Sutton C., Mining source code repositories at massive scale using language modeling, MSR, (2013)

[2]

Allamanis M., Sutton C., Mining idioms from source code, Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 472-483, (2014)

[3]

Baker J.E., Reducing bias and inefficiency in the selection algorithm, Proceedings of the Second International Conference on Genetic Algorithms on Genetic Algorithms and Their Application, pp. 14-21, (1987)

[4]

Banzhaf W., Francone F.D., Keller R.E., Nordin P., Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications, (1998)

[5]

Barowy D.W., Gochev D., Berger E.D., Checkcell: Data debugging for spreadsheets, Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages &38

[6]

Applications, pp. 507-523, (2014)

[7]

Chandola V., Banerjee A., Kumar V., Anomaly detection: A survey, ACM Comput. Surv., 41, 3, pp. 151-1558, (2009)

[8]

Chaudhuri S., Clochard M., Solar-Lezama A., Bridging boolean and quantitative synthesis using smoothed proof search, Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 207-220, (2014)

[9]

Cramer N.L., A representation for the adaptive generation of simple sequential programs, Proceedings of the 1st International Conference on Genetic Algorithms, pp. 183-187, (1985)

[10]

De Moura L., BjOrner N., Z3: An efficient smt solver, Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 337-340, (2008)

← 1 2 3 4 →