BAYESIAN RETRIEVAL USING A SIMILARITY-BASED LEMMATIZER

Times cited: 0
Authors
Maragoudakis, Manolis [1]
Lyras, Dimitrios P. [2]
Sgarbas, Kyriakos [2]
Affiliations
[1] Univ Aegean, Dept Informat & Commun Syst Engn, Samos, Greece
[2] Univ Patras, Dept Elect & Comp Engn, Wire Commun Lab, Artificial Intelligence Grp, GR-26500 Patras, Greece
Keywords
Bayesian networks; modern Greek; AhR; Ad-hoc retrieval; lemmatization; AUTOMATIC LEMMATIZATION; INFORMATION-RETRIEVAL; MODERN GREEK; MODEL;
DOI
10.1142/S0218213012500248
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper describes a Bayesian network approach to Information Retrieval (IR) from Web documents. The network structure provides an intuitive representation of uncertainty relationships, and the embedded conditional probability table is used by inference algorithms to identify documents that are relevant to the user's needs, expressed as Boolean queries. Our research has been directed at constructing a probabilistic IR framework that focuses on assisting users in performing Ad-hoc retrieval of documents from various domains such as economics, news, and sports. Furthermore, users can provide feedback on the relevance of the retrieved documents in order to improve performance on subsequent requests. Towards these goals, we have extended the traditional Bayesian network IR system and tested it on several Greek web corpora from different application domains. We have developed two approaches to the network structure: a simple one, where the structure is provided manually, and an automated one, where data mining is used to extract the network's structure. In terms of precision-recall curves, the results show competitive performance against successful IR models of different theoretical backgrounds, such as the vector space model with tf-idf weighting and the probabilistic BM25 model. To further improve the performance of the IR system, we have implemented a novel similarity-based lemmatization framework, thus reducing the ambiguity posed by the plethora of morphological variations of the languages in question. The lemmatization framework comprises three core components (the word segregation, data cleansing, and lemmatization modules) and is language-independent, i.e. it can be applied to other languages with morphological peculiarities and thus improve Ad-hoc retrieval. It maps an input word to its normalized form by employing two state-of-the-art language-independent distance metrics, namely the Levenshtein edit distance and the Dice coefficient similarity measure, combined with a language model describing the most frequent inflectional suffixes of the examined language. Experimental results support our claim about the significance of this incorporation for web retrieval of Greek texts, as results improve by 4% to 11%.
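The lemmatization idea in the abstract (strip a frequent inflectional suffix, then pick the closest known lemma by Levenshtein edit distance and Dice coefficient) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the two metrics are combined, so the equal-weight average, the `min_score` threshold, the use of character bigrams for the Dice coefficient, and the names `strip_suffix` and `lemmatize` are all assumptions made for illustration. The lexicon and suffix list in the demo are toy data.

```python
from collections import Counter
from typing import Iterable, List, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def bigrams(s: str) -> List[str]:
    """Character bigrams of a string, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram multisets (an assumption)."""
    ba, bb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Strings too short to have bigrams: fall back to exact match.
        return 1.0 if a == b else 0.0
    overlap = sum((ba & bb).values())
    return 2 * overlap / total


def strip_suffix(word: str, suffixes: Iterable[str]) -> str:
    """Remove the longest matching suffix, keeping a non-empty stem."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)]
    return word


def lemmatize(word: str, lexicon: Iterable[str], suffixes: Iterable[str],
              min_score: float = 0.5) -> Optional[str]:
    """Return the best-scoring lemma, or None if no score exceeds min_score."""
    suffixes = list(suffixes)
    stem = strip_suffix(word, suffixes)
    best, best_score = None, min_score
    for lemma in lexicon:
        lemma_stem = strip_suffix(lemma, suffixes)
        longest = max(len(stem), len(lemma_stem), 1)
        edit_sim = 1 - levenshtein(stem, lemma_stem) / longest
        # Hypothetical combination: equal-weight average of both measures.
        score = (edit_sim + dice(stem, lemma_stem)) / 2
        if score > best_score:
            best, best_score = lemma, score
    return best


if __name__ == "__main__":
    # Toy Greek data, for illustration only.
    toy_suffixes = ["ος", "ου", "ων", "ους", "οι"]
    toy_lexicon = ["δρόμος", "άνθρωπος"]
    print(lemmatize("δρόμου", toy_lexicon, toy_suffixes))
```

Because the suffix list is the only language-specific input, the same functions could be reused for another language by swapping in that language's frequent suffixes and lemma lexicon, which is the sense in which such an approach is language-independent.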
Pages: 32