Learning programs from noisy data

Cited by: 0
Authors
Raychev V. [1]
Bielik P. [1]
Vechev M. [1]
Krause A. [1]
Affiliations
[1] Department of Computer Science, ETH Zurich
Source
ACM SIGPLAN Notices | 2016 / Vol. 51 / No. 1
Funding
European Union Horizon 2020;
Keywords
Anomaly detection; Big code; Noisy data; Program synthesis; Regularization; Statistical code completion;
DOI
10.1145/2837614.2837671
Abstract
We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator, which produces a candidate program based on a small sample of the entire dataset while avoiding overfitting, and a dataset sampler, which carefully samples the dataset by leveraging the candidate program's score on that dataset. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach to two settings: one where the dataset has a bound on the noise, and another without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypothesis classes formed by a discrete search space. We then present two new kinds of program synthesizers which target the two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting - a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open source programs, also known as "Big Code". The key idea is to introduce a domain-specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique which generalizes several existing works, as we no longer need to decide a priori what the prediction should be conditioned on (another benefit is that the learned functions are a natural mechanism for explaining the prediction). As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems. © 2016 ACM.
Pages: 761-774
Page count: 13
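The abstract above describes a feedback-directed loop between a regularized program generator and a dataset sampler. Below is a minimal, hypothetical Python sketch of that loop under stated assumptions: the `Program` class, its `run`/`size` interface, the error-biased sampling policy, and all parameter values are illustrative choices for exposition, not the authors' implementation.

```python
import random

class Program:
    """Toy stand-in for a synthesized candidate program (illustrative only)."""
    def __init__(self, fn, size):
        self.fn = fn
        self._size = size
    def run(self, x):
        return self.fn(x)
    def size(self):
        return self._size

def regularized_generator(sample, candidates, reg_weight=0.1):
    """Return the candidate that best fits the small sample, with a
    complexity penalty (regularization) so the generator does not
    overfit noisy examples."""
    def objective(prog):
        errors = sum(1 for x, y in sample if prog.run(x) != y)
        return errors / max(len(sample), 1) + reg_weight * prog.size()
    return min(candidates, key=objective)

def dataset_sampler(dataset, candidate, sample_size):
    """Use the candidate's behavior on the full dataset to pick the next
    sample, biasing it toward examples the candidate currently gets wrong."""
    wrong = [ex for ex in dataset if candidate.run(ex[0]) != ex[1]]
    right = [ex for ex in dataset if candidate.run(ex[0]) == ex[1]]
    sample = wrong[:sample_size]
    if len(sample) < sample_size:
        sample += random.sample(right, min(sample_size - len(sample), len(right)))
    return sample

def learn(dataset, candidates, sample_size=4, iterations=10):
    """Feedback-directed loop: synthesize on a small sample, then resample."""
    sample = random.sample(dataset, min(sample_size, len(dataset)))
    best = None
    for _ in range(iterations):
        best = regularized_generator(sample, candidates)
        sample = dataset_sampler(dataset, best, sample_size)
    return best

if __name__ == "__main__":
    # Learn y = 2*x from data containing one noisy (incorrect) example.
    data = [(x, 2 * x) for x in range(10)] + [(10, 999)]
    cands = [Program(lambda x: 2 * x, size=2),
             Program(lambda x: x + 1, size=2),
             Program(lambda x: 999, size=1)]
    print(learn(data, cands).run(5))   # expected: 10
```

The regularization term is what lets the generator tolerate the injected noisy pair instead of fitting it, and the sampler feeding mispredicted examples back to the generator is a simplified proxy for the paper's score-guided sampling.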