Unsupervised incremental acquisition of a thematic corpus from the Web

Authors
Duclaye, F [1]
Yvon, F [1]
Collin, O [1]
Affiliation
[1] France Telecom, R&D, F-22307 Lannion, France
Source
2003 INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING, PROCEEDINGS | 2003
Keywords
paraphrases; synonyms; machine learning; Web; automatic classification; EM
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a nearly unsupervised learning methodology for automatically acquiring a thematic corpus from the Web. Relying on a bootstrapping mechanism, our system starts with a single linguistic expression of a given target semantic relationship. It then samples the Web to progressively accumulate a corpus of potential examples of the same relationship. Sampling steps alternate with filtering steps, which keeps the corpus thematically focused. Finally, the corpus is analysed to search for potential paraphrases of the initial expression of the semantic relationship. These paraphrases will eventually be used to improve our question-answering system. This paper focuses on the learning aspect of the system and reports experimental results on the effectiveness of our filtering strategy.
Pages: 752-757
Page count: 6