Extracting Grammatical Error Corrections from Wikipedia Revision History

Cited by: 0
Authors
Chen, Jhih-Jie [1]
Wu, Yi-Dong [1]
Tai, Yu-Chuan [1]
Yang, Ching-Yu [1]
Tu, Hai-Lun [1]
Chang, Jason S. [1]
Affiliations
[1] National Tsing Hua University, Department of Computer Science, Hsinchu, Taiwan
Source
2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2019
Keywords
Wikipedia; Grammatical Error Correction; MapReduce
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper describes the process of extracting and filtering Wikipedia revision history as a resource for grammatical error correction (GEC). Edits in Wikipedia revision history vary widely and include grammatical error corrections, added information, formatting changes, and even vandalism. To extract only GEC-related revisions, we use an automated error annotation toolkit, ERRANT, and extend it to process large data efficiently in parallel. With error-type analysis, we can then identify GEC-related edits and omit unrelated edits (i.e., only the correction parts are retained). The resulting corpus is, to our knowledge, the largest publicly available corpus of parallel sentences (possibly erroneous and corrected) with error-type labels.
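The extract-then-filter idea in the abstract can be illustrated with a small sketch. This is not the authors' pipeline and does not call ERRANT: token alignment from Python's standard `difflib` stands in for ERRANT's alignment, and a simple length threshold stands in for error-type analysis. The names `extract_edits`, `looks_like_correction`, `count_edit_kinds`, and the threshold `MAX_EDIT_TOKENS` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from functools import reduce

# Hypothetical threshold (not from the paper): edits longer than this
# are treated as content changes rather than grammatical corrections.
MAX_EDIT_TOKENS = 2


def extract_edits(before, after):
    """Map step: align two sentence versions by token and list the differing spans."""
    src, tgt = before.split(), after.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits


def looks_like_correction(edits):
    """Crude stand-in for error-type analysis: keep only small, local edits."""
    if not edits:
        return False
    return all(
        len(old.split()) <= MAX_EDIT_TOKENS and len(new.split()) <= MAX_EDIT_TOKENS
        for _, old, new in edits
    )


def count_edit_kinds(pairs):
    """Reduce step: tally edit operations over the retained sentence pairs."""
    mapped = (extract_edits(before, after) for before, after in pairs)
    kept = (edits for edits in mapped if looks_like_correction(edits))
    return reduce(
        lambda acc, edits: acc + Counter(tag for tag, _, _ in edits),
        kept,
        Counter(),
    )
```

Because each revision pair is processed independently in the map step, the same functions could in principle be distributed across workers; how the authors actually parallelize ERRANT is not specified in this record.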
Pages: 6016-6018
Number of pages: 3
Related papers
7 in total
  • [1] Brockett C, 2006, COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, P249
  • [2] Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction
    Bryant, Christopher
    Felice, Mariano
    Briscoe, Ted
    [J]. PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 793 - 805
  • [3] Cahill Aoife, 2013, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, P507
  • [4] Felice Mariano, 2016, P COLING 2016 26 INT, P825
  • [5] Grundkiewicz R, 2014, LECT NOTES ARTIF INT, V8686, P478, DOI 10.1007/978-3-319-10888-9_47
  • [6] Napoles Courtney, 2017, P 12 WORKSHOP INNOVA, P9728
  • [7] Zesch Torsten, 2012, EACL, P529