Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

Cited by: 0
Authors
BaniMustafa, Ahmed [1]
Affiliations
[1] Amer Univ Madaba, Comp Sci Dept, Madaba, Jordan
Source
ISECURE-ISC INTERNATIONAL JOURNAL OF INFORMATION SECURITY | 2019, Vol. 11, No. 03
Keywords
Data Mining; Metabolomics; Cachexia; Preprocessing; Imbalanced Classes; Re-sampling; Data Reduction; CLASSIFICATION; NORMALIZATION; DISCOVERY; GENOMICS; TOOL;
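DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract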
This paper presents a data mining application in metabolomics. It aims to build an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying the biomarkers involved. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profiles. This dataset suffers from the problem of imbalanced classes, which is known to degrade the performance of classifiers and to affect their validity and generalizability. The classification models in this study were built using five machine learning algorithms: PLS-DA, MLP, SVM, C4.5, and ID3. The models were built after carrying out a number of intensive data preprocessing procedures to tackle the problem of imbalanced classes and improve the performance of the constructed classifiers. These procedures involve data transformation, normalization, standardization, re-sampling, and data reduction using a number of variable importance scorers. The best performance was achieved by an MLP model that was trained and tested with five-fold cross-validation on datasets that were re-sampled using the SMOTE method and then reduced using the SVM variable importance scorer. This model was successful in classifying samples with excellent accuracy and in identifying potential disease biomarkers. The results confirm the validity of metabolomics data mining for the diagnosis of cachexia. They also emphasize the importance of data preprocessing procedures such as sampling and data reduction for improving data mining results, particularly when the data suffer from the problem of imbalanced classes. (C) 2019 ISC. All rights reserved.
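The SMOTE re-sampling step named in the abstract can be illustrated with a minimal sketch. This record does not describe the paper's actual implementation, so everything below is an assumption made for illustration: the function name `smote`, its parameters `n_new`, `k` and `seed`, and the toy two-feature data. The sketch shows only the core idea of SMOTE, which is to create each synthetic minority sample by interpolating between a minority sample and one of its k nearest minority neighbours. The data reduction and MLP training steps are not shown.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority neighbours.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base sample.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (made up): 8 majority samples vs 3 minority samples, two features.
majority = [[float(x), float(x % 3)] for x in range(8)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

# Oversample the minority class until both classes have the same size.
new_samples = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_samples
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples stay inside the region already occupied by the minority class. In a cross-validation setting such as the five-fold scheme mentioned in the abstract, a common precaution is to apply re-sampling only to the training folds so that synthetic samples do not leak into the test fold; whether the paper did so is not stated in this record.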
Pages: 79-89
Page count: 11