Learning programs from noisy data

被引:0
作者
Raychev V. [1 ]
Bielik P. [1 ]
Vechev M. [1 ]
Krause A. [1 ]
机构
[1] Department of Computer Science, ETH Zurich
来源
ACM SIGPLAN Notices | 2016年 / 51卷 / 01期
基金
欧盟地平线“2020”;
关键词
Anomaly detection; Big code; Noisy data; Program synthesis; Regularization; Statistical code completion;
D O I
10.1145/2837614.2837671
中图分类号
学科分类号
摘要
We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator which produces a candidate program based on a small sample of the entire dataset while avoiding overfitting, and a dataset sampler which carefully samples the dataset by leveraging the candidate program's score on that dataset. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach to two settings: one where the dataset has a bound on the noise, and another without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypotheses classes formed by a discrete search space. We then present two new kinds of program synthesizers which target the two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting - a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open source programs, also known as "Big Code". The key idea is to introduce a domain specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique which generalizes several existing works as we no longer need to decide a priori on what the prediction should be conditioned (another benefit is that the learned functions are a natural mechanism for explaining the prediction). As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems. © 2016 ACM.
引用
收藏
页码:761 / 774
页数:13
相关论文
共 37 条
[21]  
Le V., Gulwani S., Flashextract: A framework for data extraction by examples, Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 542-553, (2014)
[22]  
Luxburg U.V., Schoelkopf B., Statistical learning theory: Models, concepts, and results, Inductive Logic., pp. 651-706, (2011)
[23]  
Menon A.K., Tamuz O., Gulwani S., Lampson B.W., Kalai A., A machine learning framework for programming by example, Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pp. 187-195, (2013)
[24]  
Mitchell T.M., Machine Learning, (1997)
[25]  
Nguyen A.T., Nguyen T.N., Graph-based statistical language model for code, Proceedings of the 37th International Conference on Software Engineering, 1, pp. 858-868, (2015)
[26]  
Nguyen T.T., Nguyen A.T., Nguyen H.A., Nguyen T.N., A statistical semantic language model for source code, Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 532-542, (2013)
[27]  
Nori A.V., Ozair S., Rajamani S.K., Vijaykeerthy D., Efficient synthesis of probabilistic programs, Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 208-217, (2015)
[28]  
Och F.J., Minimum error rate training in statistical machine translation, Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, 1, pp. 160-167, (2003)
[29]  
Panchekha P., Sanchez-Stern A., Wilcox J.R., Tatlock Z., Automatically improving accuracy for floating point expressions, Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 1-11, (2015)
[30]  
Raychev V., Vechev M., Krause A., Predicting program properties from "big code, Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 111-124, (2015)