Learning Biological Sequence Types Using the Literature

Authors
Bouadjenek, Mohamed Reda [1]
Verspoor, Karin [1]
Zobel, Justin [1]
Affiliation
[1] Univ Melbourne, Sch Comp & Informat Syst, Parkville, Vic 3010, Australia
Source
CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT | 2017
Funding
Australian Research Council
Keywords
Data Analysis; Data Quality; Biological Databases; Data Cleansing
DOI
10.1145/3132847.3133051
Chinese Library Classification
TP [Automation and Computer Technology]
Discipline classification code
0812
Abstract
In this paper we explore automatic biological sequence type classification for records in biological sequence databases. The sequence type attribute provides important information about the nature of the sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and it is therefore subject to many errors, including typos, mis-assignment, and non-assignment. In GenBank, this problem affects roughly 18% of records, an alarming number that should worry the biocuration community. To address this problem, we propose using the literature associated with sequence records as an external source of knowledge that can be leveraged for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition behind using the literature for this task is that sequences appear to be discussed differently in scientific articles, depending on their type. The experiments we conducted on the PubMed Central collection show that the literature is indeed an effective resource for sequence type classification. Our classification method reached an accuracy of 92.7% and substantially outperformed the two baseline approaches used for comparison.
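The abstract does not say which literature-based features or which learning algorithm the authors used, so the following is only a minimal sketch of the general idea: represent the article text linked to a record as a bag of words and train a classifier to predict the record's sequence type. The multinomial naive Bayes model, the two labels ("DNA" and "mRNA"), and the training sentences are all assumptions invented for illustration; they are not the paper's features, six sequence types, or data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTextClassifier:
    """Toy multinomial naive Bayes over bag-of-words features.

    Stand-in for the unspecified learner in the abstract (assumption).
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_counts = Counter()           # documents seen per label
        self.word_counts = defaultdict(Counter) # token counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical article snippets and labels, invented for this sketch.
train_texts = [
    "the genomic DNA fragment was cloned and the chromosome region sequenced",
    "genomic DNA was extracted and the locus amplified by PCR",
    "the mRNA transcript was reverse transcribed and expression was measured",
    "expression of the mRNA transcript increased in treated cells",
]
train_labels = ["DNA", "DNA", "mRNA", "mRNA"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("transcript expression levels were measured")
```

Here `prediction` is the label whose training text shares the most vocabulary with the query snippet, which mirrors the abstract's intuition that articles discuss sequences differently depending on their type.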
Pages: 1991-1994
Number of pages: 4