Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

Citations: 0
Authors
BaniMustafa, Ahmed [1]
Affiliations
[1] Amer Univ Madaba, Comp Sci Dept, Madaba, Jordan
Source
ISECURE-ISC INTERNATIONAL JOURNAL OF INFORMATION SECURITY | 2019 / Vol. 11 / No. 3
Keywords
Data Mining; Metabolomics; Cachexia; Preprocessing; Imbalanced Classes; Re-sampling; Data Reduction; CLASSIFICATION; NORMALIZATION; DISCOVERY; GENOMICS; TOOL;
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
This paper presents a data mining application in metabolomics. It aims to build an enhanced machine learning classifier that can be used to diagnose cachexia syndrome and identify the biomarkers involved. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profiles. This dataset suffers from the problem of imbalanced classes, which is known to degrade classifier performance and also affects the validity and generalizability of the resulting models. The classification models in this study were built using five machine learning algorithms: PLS-DA, MLP, SVM, C4.5 and ID3. The models were built after a number of intensive data preprocessing procedures intended to tackle the class imbalance and improve the performance of the constructed classifiers. These procedures involve data transformation, normalization, standardization, re-sampling and data reduction using a number of variable importance scorers. The best performance was achieved by an MLP model that was trained and tested using five-fold cross-validation on datasets that were re-sampled using the SMOTE method and then reduced using the SVM variable importance scorer. This model succeeded in classifying samples with excellent accuracy and in identifying the potential disease biomarkers. The results confirm the validity of metabolomics data mining for the diagnosis of cachexia. They also emphasize the importance of data preprocessing procedures such as sampling and data reduction for improving data mining results, particularly when the data suffer from imbalanced classes. (C) 2019 ISC. All rights reserved.
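The SMOTE re-sampling step named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `smote`, its parameters, and the toy data are assumptions made for illustration, and the base sample is drawn at random rather than iterating over every minority sample as in the original SMOTE formulation. Each synthetic point is placed on the line segment between a minority sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (Euclidean distance).

    Hypothetical helper for illustration only.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 3 minority-class samples with 2 features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]
new_points = smote(minority, n_new=4, k=2)
balanced_minority = minority + new_points
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already spanned by the minority class. In practice the re-sampling would be applied only to the training folds of the cross-validation, so that synthetic points do not leak into the test folds; whether the paper did so is not stated in the abstract.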
Pages: 79 - 89
Number of pages: 11
Related papers
50 in total
  • [31] DNEA: an R package for fast and versatile data-driven network analysis of metabolomics data
    Patsalis, Christopher
    Iyer, Gayatri
    Brandenburg, Marci
    Karnovsky, Alla
    Michailidis, George
    BMC BIOINFORMATICS, 2024, 25 (01)
  • [32] Data Mining Application in Higher Learning Institutions
    Delavari, Naeimeh
    Phon-Amnuaisuk, Somnuk
    Beikzadeh, Mohammad Reza
    INFORMATICS IN EDUCATION, 2008, 7 (01) : 31 - 54
  • [33] 3DM: Domain-oriented Data-driven Data Mining
    Wang, Guoyin
    Wang, Yan
    FUNDAMENTA INFORMATICAE, 2009, 90 (04) : 395 - 426
  • [34] Enhancing Precision Medicine: A Big Data-Driven Approach for the Management of Genomic Data
    Leon, Ana
    Pastor, Oscar
    BIG DATA RESEARCH, 2021, 26
  • [35] Domain-Oriented Data-Driven Data Mining Based on Rough Sets
    Wang, Guoyin (College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China)
    Journal of Nanchang Institute of Technology (南昌工程学院学报), 2006, (02) : 46 - 46
  • [36] Sustainable Fault Diagnosis of Imbalanced Text Mining for CTCS-3 Data Preprocessing
    Shi, Lijuan
    Li, Ang
    Zhang, Lei
    SUSTAINABILITY, 2021, 13 (04) : 1 - 14
  • [37] Data-driven decision tree learning algorithm based on data relativity
    Wang, Y. (wangyan@lut.cn), 1600, Binary Information Press, Flat F 8th Floor, Block 3, Tanner Garden, 18 Tanner Road, Hong Kong (10) : 1275 - 1282
  • [38] Development of a data-driven scientific methodology: From articles to chemometric data products
    Carballo-Meilan, Ara
    McDonald, Lewis
    Pragot, Wanawan
    Starnawski, Lukasz Michal
    Saleemi, Ali Nauman
    Afzal, Waheed
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2022, 225
  • [39] A NETWORK OF COOPERATIVE LEARNERS FOR DATA-DRIVEN STREAM MINING
    Canzian, Luca
    van der Schaar, Mihaela
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [40] Data Mining in Metabolomics: From Metabolite Profiling to Clinical Diagnosis
    Baumgartner, Christian
    Graber, Armin
    INTEGRATING BIOMEDICAL INFORMATION: FROM E-CELL TO E-PATIENT, 2006, : 39 - +