Learning programs from noisy data

Cited by: 0
Authors
Raychev V. [1]
Bielik P. [1]
Vechev M. [1]
Krause A. [1]
Affiliations
[1] Department of Computer Science, ETH Zurich
Source
ACM SIGPLAN Notices | 2016 / Vol. 51 / No. 1
Funding
European Union Horizon 2020;
Keywords
Anomaly detection; Big code; Noisy data; Program synthesis; Regularization; Statistical code completion;
DOI
10.1145/2837614.2837671
Abstract
We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator, which produces a candidate program based on a small sample of the entire dataset while avoiding overfitting, and a dataset sampler, which carefully samples the dataset by leveraging the candidate program's score on that dataset. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach to two settings: one where the dataset has a bound on the noise, and another without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypothesis classes formed by a discrete search space. We then present two new kinds of program synthesizers which target the two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting - a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open source programs, also known as "Big Code". The key idea is to introduce a domain-specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique which generalizes several existing works, as we no longer need to decide a priori what the prediction should be conditioned on (another benefit is that the learned functions are a natural mechanism for explaining the prediction). As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems. © 2016 ACM.
Pages: 761-774
Page count: 13
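The abstract above describes a feedback-directed loop between a regularized program generator and a dataset sampler. Below is a minimal, hypothetical Python sketch of that loop under stated assumptions: the `Program` class, its `run`/`size` interface, the error-biased sampling policy, and all parameter values are illustrative choices for exposition, not the authors' implementation.

```python
import random

class Program:
    """Toy stand-in for a synthesized candidate program (illustrative only)."""
    def __init__(self, fn, size):
        self.fn = fn
        self._size = size
    def run(self, x):
        return self.fn(x)
    def size(self):
        return self._size

def regularized_generator(sample, candidates, reg_weight=0.1):
    """Return the candidate that best fits the small sample, with a
    complexity penalty (regularization) so the generator does not
    overfit noisy examples."""
    def objective(prog):
        errors = sum(1 for x, y in sample if prog.run(x) != y)
        return errors / max(len(sample), 1) + reg_weight * prog.size()
    return min(candidates, key=objective)

def dataset_sampler(dataset, candidate, sample_size):
    """Use the candidate's behavior on the full dataset to pick the next
    sample, biasing it toward examples the candidate currently gets wrong."""
    wrong = [ex for ex in dataset if candidate.run(ex[0]) != ex[1]]
    right = [ex for ex in dataset if candidate.run(ex[0]) == ex[1]]
    sample = wrong[:sample_size]
    if len(sample) < sample_size:
        sample += random.sample(right, min(sample_size - len(sample), len(right)))
    return sample

def learn(dataset, candidates, sample_size=4, iterations=10):
    """Feedback-directed loop: synthesize on a small sample, then resample."""
    sample = random.sample(dataset, min(sample_size, len(dataset)))
    best = None
    for _ in range(iterations):
        best = regularized_generator(sample, candidates)
        sample = dataset_sampler(dataset, best, sample_size)
    return best

if __name__ == "__main__":
    # Learn y = 2*x from data containing one noisy (incorrect) example.
    data = [(x, 2 * x) for x in range(10)] + [(10, 999)]
    cands = [Program(lambda x: 2 * x, size=2),
             Program(lambda x: x + 1, size=2),
             Program(lambda x: 999, size=1)]
    print(learn(data, cands).run(5))   # expected: 10
```

The regularization term is what lets the generator tolerate the injected noisy pair instead of fitting it, and the sampler feeding mispredicted examples back to the generator is a simplified proxy for the paper's score-guided sampling.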