Using cited references to improve the retrieval of related biomedical documents

Cited by: 14
Authors
Ortuno, Francisco M. [1]
Rojas, Ignacio [1]
Andrade-Navarro, Miguel A. [2]
Fontaine, Jean-Fred [2]
Affiliations
[1] Univ Granada, Comp Architecture & Comp Technol Dept, E-18071 Granada, Spain
[2] Max Delbruck Ctr Mol Med, D-13125 Berlin, Germany
Source
BMC BIOINFORMATICS | 2013 / Vol. 14
Keywords
Information retrieval; Text categorization; Citations; Full-text documents; Biomedical literature; Query expansion; Document classification; INFORMATION-RETRIEVAL; PROBABILISTIC MODEL; FULL-TEXT; ARTICLES; RANKING; CITATIONS; DATABASE; SEARCH; TERMS;
DOI
10.1186/1471-2105-14-113
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: A common task for scientists reading a biomedical abstract is to search bibliographic databases for topic-related documents. Such a query is challenging because a single abstract carries little information, whereas classification-based retrieval algorithms are best trained with large sets of relevant documents. As a solution to this problem, we propose a query expansion method that extends the information related to a manuscript using its cited references.
Results: Data on cited references and text sections in 249,108 full-text biomedical articles were extracted from the Open Access subset of the PubMed Central® database (PMC-OA). Of the five standard sections of a scientific article, the Introduction and Discussion sections contained most of the citations (mean = 10.2 and 9.9 citations, respectively). A large proportion of articles (98.4%) and their cited references (79.5%) were indexed in the PubMed® database. Using the MedlineRanker abstract classification tool, cited references allowed accurate retrieval of the citing document in a test set of 10,000 documents and also of documents related to six biomedical topics defined by particular MeSH® terms from the entire PMC-OA (p-value<0.01). Classification performance was sensitive to the topic and also to the text sections from which the references were selected. Classifiers trained on the baseline (i.e., only text from the query document and not from the references) were outperformed in almost all cases. The best performance was often obtained when using all cited references, though using only the references from the Introduction and Discussion sections led to similarly good results. This query expansion method performed significantly better than pseudo relevance feedback in 4 out of 6 topics.
Conclusions: The retrieval of documents related to a single document can be significantly improved by using the references cited by this document (p-value<0.01). Using references from the Introduction and Discussion sections performs almost as well as using all references, which might be useful for methods that require reduced datasets due to computational limitations. Cited references from particular sections might not be appropriate for all topics. Our method could be a better alternative to pseudo relevance feedback, though it is limited by full-text availability.
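The idea in the abstract (train a text classifier on the references cited by a query document, then rank candidate documents with it) can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the function names, the toy texts, and the smoothed log-odds term weighting are hypothetical and are not taken from the paper or from the MedlineRanker tool.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def term_weights(positive_docs, background_docs):
    """Log-odds weight per term: smoothed document frequency in the
    positive set divided by that in the background set (hypothetical scheme)."""
    pos = Counter(t for d in positive_docs for t in tokenize(d))
    bg = Counter(t for d in background_docs for t in tokenize(d))
    n_pos, n_bg = len(positive_docs), len(background_docs)
    weights = {}
    for term in set(pos) | set(bg):
        p = (pos[term] + 1) / (n_pos + 2)  # add-one smoothing
        q = (bg[term] + 1) / (n_bg + 2)
        weights[term] = math.log(p / q)
    return weights


def rank(candidates, weights):
    """Sort candidate texts by the summed weight of their distinct terms."""
    scored = [(sum(weights.get(t, 0.0) for t in tokenize(c)), c)
              for c in candidates]
    return sorted(scored, key=lambda s: s[0], reverse=True)


# Toy data (invented): abstracts of the references cited by the query document
# serve as the positive training set, i.e. the expanded query.
cited_abstracts = [
    "kinase signaling regulates tumor growth",
    "tumor suppressor mutation in kinase pathway",
]
background = [
    "soil bacteria nitrogen fixation",
    "bird migration patterns across europe",
]
candidates = [
    "novel kinase inhibitor reduces tumor size",
    "nitrogen cycling by soil bacteria",
]

ranked = rank(candidates, term_weights(cited_abstracts, background))
for score, text in ranked:
    print(f"{score:6.2f}  {text}")
```

In this sketch the baseline described in the abstract would correspond to passing only the query document's own text as `positive_docs`; restricting `cited_abstracts` to references from selected sections would mimic the section-specific variants.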
Pages: 12