Learning programs from noisy data

被引：0

作者：

Raychev V. ^{[1
]}

Bielik P. ^{[1
]}

Vechev M. ^{[1
]}

Krause A. ^{[1
]}

机构：

[1] Department of Computer Science, ETH Zurich

来源：

ACM SIGPLAN Notices | 2016年 / 51卷 / 01期

基金：

欧盟地平线“2020”;

关键词：

Anomaly detection; Big code; Noisy data; Program synthesis; Regularization; Statistical code completion;

D O I：

10.1145/2837614.2837671

中图分类号：

学科分类号：

摘要：

We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator which produces a candidate program based on a small sample of the entire dataset while avoiding overfitting, and a dataset sampler which carefully samples the dataset by leveraging the candidate program's score on that dataset. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach to two settings: one where the dataset has a bound on the noise, and another without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypotheses classes formed by a discrete search space. We then present two new kinds of program synthesizers which target the two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting - a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open source programs, also known as "Big Code". The key idea is to introduce a domain specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique which generalizes several existing works as we no longer need to decide a priori on what the prediction should be conditioned (another benefit is that the learned functions are a natural mechanism for explaining the prediction). As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems. © 2016 ACM.

引用

页码：761 / 774

页数：13

共 37 条

[21]

Le V., Gulwani S., Flashextract: A framework for data extraction by examples, Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 542-553, (2014)

[22]

Luxburg U.V., Schoelkopf B., Statistical learning theory: Models, concepts, and results, Inductive Logic., pp. 651-706, (2011)

[23]

Menon A.K., Tamuz O., Gulwani S., Lampson B.W., Kalai A., A machine learning framework for programming by example, Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pp. 187-195, (2013)

[24]

Mitchell T.M., Machine Learning, (1997)

[25]

Nguyen A.T., Nguyen T.N., Graph-based statistical language model for code, Proceedings of the 37th International Conference on Software Engineering, 1, pp. 858-868, (2015)

[26]

Nguyen T.T., Nguyen A.T., Nguyen H.A., Nguyen T.N., A statistical semantic language model for source code, Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 532-542, (2013)

[27]

Nori A.V., Ozair S., Rajamani S.K., Vijaykeerthy D., Efficient synthesis of probabilistic programs, Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 208-217, (2015)

[28]

Och F.J., Minimum error rate training in statistical machine translation, Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, 1, pp. 160-167, (2003)

[29]

Panchekha P., Sanchez-Stern A., Wilcox J.R., Tatlock Z., Automatically improving accuracy for floating point expressions, Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 1-11, (2015)

[30]

Raychev V., Vechev M., Krause A., Predicting program properties from "big code, Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 111-124, (2015)

← 1 2 3 4 →