Unsupervised incremental acquisition of a thematic corpus from the Web

Cited by: 0
Authors
Duclaye, F [1 ]
Yvon, F [1 ]
Collin, O [1 ]
Affiliation
[1] France Telecom, R&D, F-22307 Lannion, France
Source
2003 INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING, PROCEEDINGS | 2003
Keywords
paraphrases; synonyms; machine learning; Web; automatic classification; EM;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a nearly unsupervised learning methodology for automatically acquiring a thematic corpus from the Web. Relying on a bootstrapping mechanism, our system starts with a single linguistic expression of a given target semantic relationship. It then samples the Web so as to progressively accumulate a corpus of potential examples of the same relationship. Sampling steps alternate with filtering steps, making it possible to keep the corpus thematically focused. The corpus is finally analysed to search for potential paraphrases of the initial expression of the semantic relationship. These paraphrases will eventually be used to improve our question-answering system. This paper focuses on the learning aspect of the system and reports experimental results regarding the effectiveness of our filtering strategy.
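The alternation of sampling and filtering steps described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`bootstrap_corpus`, `toy_sample`, `toy_keep`), the in-memory `TOY_WEB` list standing in for the Web, and the word-overlap filter, which is only a simple stand-in for whatever filtering strategy the authors actually evaluate.

```python
from typing import Callable, List, Set

# Hypothetical stand-in for the Web: a handful of sentences held in memory.
TOY_WEB: List[str] = [
    "Melville wrote Moby Dick",
    "Melville is the author of Moby Dick",
    "Moby Dick was written by Melville",
    "Melville was born in New York",
    "Dickens wrote Oliver Twist",
]


def words(sentence: str) -> Set[str]:
    """Lower-cased bag of words for a sentence."""
    return set(sentence.lower().split())


def toy_sample(corpus: Set[str]) -> List[str]:
    """Sampling step (assumed): fetch sentences sharing any word with the corpus."""
    vocab = {w for s in corpus for w in words(s)}
    return [s for s in TOY_WEB if vocab & words(s)]


def toy_keep(candidate: str, corpus: Set[str], min_overlap: int = 3) -> bool:
    """Filtering step (assumed): keep a candidate only if it shares at least
    min_overlap words with some sentence already in the corpus."""
    cand = words(candidate)
    return any(len(cand & words(s)) >= min_overlap for s in corpus)


def bootstrap_corpus(
    seed: str,
    sample: Callable[[Set[str]], List[str]],
    keep: Callable[[str, Set[str]], bool],
    rounds: int = 5,
) -> Set[str]:
    """Grow a corpus from one seed expression by alternating sampling
    and filtering, stopping early when a round adds nothing new."""
    corpus: Set[str] = {seed}
    for _ in range(rounds):
        candidates = sample(corpus)                            # sampling step
        accepted = {c for c in candidates if keep(c, corpus)}  # filtering step
        if accepted <= corpus:
            break
        corpus |= accepted
    return corpus


result = bootstrap_corpus("Melville wrote Moby Dick", toy_sample, toy_keep)
for sentence in sorted(result):
    print(sentence)
```

In this toy setting the filter is what keeps the corpus focused: the sampling step alone would also pull in sentences that merely mention the same author or the same verb, while the filter admits only candidates that overlap strongly with something already accepted.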
Pages: 752-757
Page count: 6