Unsupervised Statistical Text Simplification

Cited by: 8
Authors
Qiang, Jipeng [1]
Wu, Xindong [2, 3]
Affiliations
[1] Yangzhou Univ, Dept Comp Sci, Yangzhou 225127, Jiangsu, Peoples R China
[2] Hefei Univ Technol, Key Lab Knowledge Engn Big Data, Minist Educ, Hefei 10084, Anhui, Peoples R China
[3] Mininglamp Acad Sci, Mininglamp, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Encyclopedias; Electronic publishing; Internet; Benchmark testing; Standards; Mathematical model; Text simplification; machine translation; unsupervised
DOI
10.1109/TKDE.2019.2947679
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most recent approaches to Text Simplification (TS) draw on insights from machine translation to learn simplification rewrites from monolingual parallel corpora of complex and simple sentences, so their effectiveness depends heavily on large amounts of parallel sentences. However, TS has faced a serious problem for decades: parallel TS corpora are scarce or ill-suited to the learning task. In this paper, we focus on the especially useful and challenging problem of unsupervised TS, which uses no parallel sentences at all. To the best of our knowledge, we present the first unsupervised text simplification system based on a phrase-based machine translation system, which leverages a careful initialization of phrase tables and language models. On the widely used WikiLarge and WikiSmall benchmarks, our system obtains 39.08 and 25.12 SARI points, respectively, even outperforming some supervised baselines.
Pages: 1802-1806
Page count: 5