Learning Biological Sequence Types Using the Literature

Cited by: 1
Authors
Bouadjenek, Mohamed Reda [1]
Verspoor, Karin [1]
Zobel, Justin [1]
Affiliations
[1] Univ Melbourne, Sch Comp & Informat Syst, Parkville, Vic 3010, Australia
Source
CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT | 2017
Funding
Australian Research Council
Keywords
Data Analysis; Data Quality; Biological Databases; Data Cleansing
DOI
10.1145/3132847.3133051
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In this paper, we explore automatic classification of biological sequence types for records in biological sequence databases. The sequence type attribute provides important information about the nature of the sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and is thus subject to many errors, including typos, mis-assignment, and non-assignment. In GenBank, this problem concerns roughly 18% of records, an alarming number that should worry the biocuration community. To address this problem, we propose using the literature associated with sequence records as an external source of knowledge that can be leveraged for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition behind using the literature for this task is that sequences appear to be discussed differently in scientific articles depending on their type. The experiments we conducted on the PubMed Central collection show that the literature is indeed an effective way to address sequence type classification. Our classification method reached an accuracy of 92.7% and substantially outperformed two baseline approaches used for comparison.
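The abstract does not name the literature-based features or the learning algorithm, so the following is only a minimal sketch of the general idea: classifying a record's sequence type from the text of the articles that discuss it. It assumes bag-of-words counts as features and a multinomial naive Bayes classifier as a stand-in for the unspecified method. The two labels (`DNA`, `mRNA`) and the training sentences are invented toy data, not the six types or the PubMed Central data used in the paper.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts (an assumed stand-in,
    not the classifier reported in the paper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_doc_counts = Counter()       # documents seen per label
        self.word_counts = defaultdict(Counter) # word counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_doc_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][tok]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus: article sentences that mention a sequence record.
train_texts = [
    "the genomic dna fragment was amplified and sequenced",
    "genomic dna was extracted from leaf tissue",
    "the mrna transcript was expressed in liver cells",
    "expression of the mrna transcript was measured",
]
train_labels = ["DNA", "DNA", "mRNA", "mRNA"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
predicted = clf.predict("transcript expression was measured in cells")
```

In this sketch, a record with a missing or suspect sequence type would be assigned the label whose training articles use the most similar vocabulary, which mirrors the intuition stated in the abstract that sequences are discussed differently depending on their type.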
Pages: 1991-1994
Number of pages: 4