Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

Cited by: 0
Authors
BaniMustafa, Ahmed [1]
Affiliations
[1] Amer Univ Madaba, Comp Sci Dept, Madaba, Jordan
Source
ISECURE-ISC INTERNATIONAL JOURNAL OF INFORMATION SECURITY | 2019 / Vol. 11 / No. 03
Keywords
Data Mining; Metabolomics; Cachexia; Preprocessing; Imbalanced Classes; Re-sampling; Data Reduction; CLASSIFICATION; NORMALIZATION; DISCOVERY; GENOMICS; TOOL
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
This paper presents a data mining application in metabolomics. It aims to build an enhanced machine learning classifier that can be used to diagnose cachexia syndrome and to identify the biomarkers involved. To achieve this goal, a data-driven analysis is carried out on a public dataset of 1H-NMR metabolite profiles. This dataset suffers from imbalanced classes, a problem that is known to degrade classifier performance and to undermine the validity and generalizability of the resulting models. The classification models in this study were built using five machine learning algorithms: PLS-DA, MLP, SVM, C4.5, and ID3. The models were built after a series of intensive data preprocessing procedures intended to tackle the class imbalance and improve the performance of the constructed classifiers. These procedures involve data transformation, normalization, standardization, re-sampling, and data reduction using a number of variable importance scorers. The best performance was achieved by an MLP model that was trained and tested with five-fold cross-validation on datasets that were re-sampled using the SMOTE method and then reduced using an SVM variable importance scorer. This model classified samples with excellent accuracy and also identified potential disease biomarkers. The results confirm the validity of metabolomics data mining for the diagnosis of cachexia. They also emphasize the importance of data preprocessing procedures such as re-sampling and data reduction for improving data mining results, particularly when the data suffer from imbalanced classes. (C) 2019 ISC. All rights reserved.
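The abstract names SMOTE as the re-sampling step but gives no implementation details. The following is a minimal sketch of the general SMOTE idea only, in standard-library Python: each synthetic minority sample is interpolated between a randomly chosen minority sample and one of its nearest minority neighbours. The function name `smote`, the parameters `n_new`, `k`, and `seed`, and the toy data are all assumptions made for illustration; they are not taken from the paper, and the paper's dataset, tooling, and settings may differ.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Illustrative sketch of the SMOTE idea, not the paper's implementation.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of base (excluding itself).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: 10 majority samples and 4 minority samples.
majority = [[float(i), float(i % 3)] for i in range(10)]
minority = [[0.0, 10.0], [1.0, 11.0], [2.0, 10.5], [1.5, 12.0]]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the original minority samples already occupy. When combined with cross-validation, re-sampling is usually applied to the training folds only, so that synthetic points do not leak into the test folds; the abstract does not say how the study handled this.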
Pages: 79-89
Page count: 11