Learning Biological Sequence Types Using the Literature

Cited by: 1
Authors
Bouadjenek, Mohamed Reda [1]
Verspoor, Karin [1]
Zobel, Justin [1]
Affiliation
[1] Univ Melbourne, Sch Comp & Informat Syst, Parkville, Vic 3010, Australia
Source
CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT | 2017
Funding
Australian Research Council;
Keywords
Data Analysis; Data Quality; Biological Databases; Data Cleansing;
DOI
10.1145/3132847.3133051
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In this paper, we explore automatic classification of biological sequence type for records in biological sequence databases. The sequence type attribute provides important information about the nature of the sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and is thus subject to many errors, including typos, mis-assignment, and non-assignment. In GenBank, this problem affects roughly 18% of records, an alarming number that should worry the biocuration community. To address it, we propose using the literature associated with sequence records as an external source of knowledge for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition is that sequences appear to be discussed differently in scientific articles depending on their type. Our experiments on the PubMed Central collection show that the literature is indeed an effective basis for sequence type classification. Our classification method reached an accuracy of 92.7% and substantially outperformed two baseline approaches used for comparison.
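The idea in the abstract, that the wording of articles linked to a record hints at its sequence type, can be illustrated with a minimal sketch. This is not the authors' feature set or learning algorithm, which the abstract does not specify: it assumes plain bag-of-words counts and a multinomial Naive Bayes model, and the class name `LiteratureTypeClassifier`, the two labels, and the training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would do far more.
    return text.lower().split()


class LiteratureTypeClassifier:
    """Toy multinomial Naive Bayes over words of articles linked to a record.

    Hypothetical illustration only, not the method from the paper.
    """

    def fit(self, articles, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of articles
        self.vocab = set()
        for text, label in zip(articles, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            # Laplace (add-one) smoothing over the training vocabulary.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training sentences standing in for article text (assumption).
train_articles = [
    "the mrna transcript was expressed and spliced in liver tissue",
    "expression of the mrna transcript increased after treatment",
    "the genomic dna region contains a promoter and several introns",
    "we sequenced genomic dna from the chromosome",
]
train_labels = ["mRNA", "mRNA", "DNA", "DNA"]

clf = LiteratureTypeClassifier().fit(train_articles, train_labels)
prediction = clf.predict("the spliced transcript was expressed in muscle")
print(prediction)
```

In this toy setting, text about spliced, expressed transcripts scores higher for the hypothetical `mRNA` label, while text about promoters and genomic regions scores higher for `DNA`. The paper itself classifies records into six primary sequence types using its own literature-based features.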
Pages: 1991-1994
Page count: 4