Data Augmentation with Hierarchical SQL-to-Question Generation for Cross-domain Text-to-SQL Parsing

Cited by: 0
Authors
Wu, Kun [1, 2]
Wang, Lijie [2]
Li, Zhenghua [1]
Zhang, Ao [2]
Xiao, Xinyan [2]
Wu, Hua [2]
Zhang, Min [1]
Wang, Haifeng [2]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Inst Artificial Intelligence, Suzhou, Peoples R China
[2] Baidu Inc, Beijing, Peoples R China
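Source
2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021) | 2021
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract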
Data augmentation has attracted a lot of research attention in the deep learning era for its ability to alleviate data sparseness. The lack of labeled data for unseen evaluation databases is exactly the major challenge for cross-domain text-to-SQL parsing. Previous work either requires human intervention to guarantee the quality of generated data, or fails to handle complex SQL queries. This paper presents a simple yet effective data augmentation framework. First, given a database, we automatically produce a large number of SQL queries based on an abstract syntax tree grammar. For better distribution matching, we require that at least 80% of SQL patterns in the training data are covered by generated queries. Second, we propose a hierarchical SQL-to-question generation model to obtain high-quality natural language questions, which is the major contribution of this work. Finally, we design a simple sampling strategy that can greatly improve training efficiency given large amounts of generated data. Experiments on three cross-domain datasets, i.e., WikiSQL and Spider in English, and DuSQL in Chinese, show that our proposed data augmentation framework consistently improves performance over strong baselines, and that the hierarchical generation component is the key to the improvement.
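The 80% pattern-coverage requirement in the abstract can be illustrated with a rough sketch. The abstract does not say how a "SQL pattern" is computed, so the code below assumes one plausible reading: a pattern is the query with schema names replaced by `ID` and literals replaced by `VALUE`, and coverage is the fraction of distinct training patterns that also appear among the generated queries. The names `sql_pattern`, `pattern_coverage`, and the keyword list are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical, deliberately small keyword list (an assumption for this sketch);
# a real AST grammar would cover far more of SQL.
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "having", "limit",
    "join", "on", "and", "or", "not", "in", "like", "between", "as",
    "distinct", "desc", "asc", "count", "sum", "avg", "min", "max",
    "union", "intersect", "except",
}

# Quoted strings, numbers, identifiers (optionally dotted), operators, punctuation.
TOKEN_RE = re.compile(
    r"'[^']*'|\"[^\"]*\"|\d+(?:\.\d+)?|[A-Za-z_][A-Za-z_0-9.]*|[<>!=]+|[(),*]"
)


def sql_pattern(sql):
    """Abstract a query into a pattern: keep keywords, mask names and literals."""
    out = []
    for tok in TOKEN_RE.findall(sql):
        low = tok.lower()
        if low in SQL_KEYWORDS:
            out.append(low)
        elif tok[0] in "'\"" or tok[0].isdigit():
            out.append("VALUE")  # string or numeric literal
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")     # table or column name
        else:
            out.append(tok)      # operator or punctuation
    return " ".join(out)


def pattern_coverage(train_sqls, generated_sqls):
    """Fraction of distinct training patterns that occur among generated queries."""
    train = {sql_pattern(s) for s in train_sqls}
    generated = {sql_pattern(s) for s in generated_sqls}
    if not train:
        return 0.0
    return len(train & generated) / len(train)


if __name__ == "__main__":
    train = [
        "SELECT name FROM singer WHERE age > 30",
        "SELECT count(*) FROM concert",
    ]
    generated = ["SELECT title FROM book WHERE price > 10"]
    coverage = pattern_coverage(train, generated)
    print(f"coverage = {coverage:.2f}, meets 80% threshold: {coverage >= 0.8}")
```

Under this reading, a generator would keep producing queries for a database until `pattern_coverage` reaches 0.8. Whether the paper counts distinct patterns or pattern occurrences is not stated in the abstract, so the set-based ratio here is only one possible choice.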
Pages: 8974-8983
Number of pages: 10