Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

Times cited: 0
Author
BaniMustafa, Ahmed [1]
Affiliation
[1] Amer Univ Madaba, Comp Sci Dept, Madaba, Jordan
Source
ISECURE-ISC INTERNATIONAL JOURNAL OF INFORMATION SECURITY | 2019 / Vol. 11 / No. 03
Keywords
Data Mining; Metabolomics; Cachexia; Preprocessing; Imbalanced Classes; Re-sampling; Data Reduction; CLASSIFICATION; NORMALIZATION; DISCOVERY; GENOMICS; TOOL;
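DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract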
This paper presents a data mining application in metabolomics. It aims to build an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying the biomarkers involved. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profiles. This dataset suffers from the problem of imbalanced classes, which is known to degrade the performance of classifiers and to undermine their validity and generalizability. The classification models in this study were built using five machine learning algorithms: PLS-DA, MLP, SVM, C4.5 and ID3. The models were built after a number of intensive data preprocessing procedures intended to tackle the problem of imbalanced classes and improve the performance of the constructed classifiers. These procedures involve data transformation, normalization, standardization, re-sampling and data reduction using a number of variable importance scorers. The best performance was achieved by an MLP model that was trained and tested with five-fold cross-validation on datasets that were re-sampled using the SMOTE method and then reduced using the SVM variable importance scorer. This model was successful in classifying samples with excellent accuracy and in identifying potential disease biomarkers. The results confirm the validity of metabolomics data mining for the diagnosis of cachexia. They also emphasize the importance of data preprocessing procedures such as sampling and data reduction for improving data mining results, particularly when the data suffer from imbalanced classes. (C) 2019 ISC. All rights reserved.
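The abstract names SMOTE as the re-sampling step but does not describe it. As a rough illustration only, and not the paper's implementation, the core idea (creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours) can be sketched in plain Python. The function name `smote`, its parameters, and the toy data are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Hypothetical helper: return n_new synthetic minority samples.

    Each synthetic sample lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than peers
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class with two features per sample (made-up values).
minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2]]
new = smote(minority, n_new=4, k=2)
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class. In practice a maintained implementation, such as the one in the imbalanced-learn package, would normally be preferred over a hand-written sketch like this.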
Pages: 79 - 89
Number of pages: 11
Related papers
50 in total
  • [1] Enhancing data-driven modeling of fluoride concentration using new data mining algorithms
    Gupta, Praveen Kumar
    Maiti, Saumen
    ENVIRONMENTAL EARTH SCIENCES, 2022, 81 (03)
  • [2] Learning from class-imbalanced data: review of data driven methods and algorithm driven methods
    Huang, Cui Yin
    Dai, Hong Liang
    DATA SCIENCE IN FINANCE AND ECONOMICS, 2021, 1 (01): : 21 - 36
  • [3] Domain-oriented data-driven data mining: a new understanding for data mining
    Wang, Guo-yin
    Wang, Yan
    2008 INTERNATIONAL FORUM ON KNOWLEDGE TECHNOLOGY, 2008, : 266 - 271
  • [4] Enhancing Ovarian Tumor Dataset Analysis Through Data Mining Preprocessing Techniques
    Shetty, Roopashri
    Geetha, M.
    Dinesh Acharya, U.
    Shyamala, G.
    IEEE ACCESS, 2024, 12 : 122300 - 122312
  • [5] A Comprehensive Evaluation of Metabolomics Data Preprocessing Methods for Deep Learning
    Abram, Krzysztof Jan
    McCloskey, Douglas
    METABOLITES, 2022, 12 (03)
  • [6] Domain-oriented data-driven data mining: a new understanding for data mining
    Wang, Guoyin
    Wang, Yan
    Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, P.R. China; School of Information Science and Technology, Southwest Jiaotong University, Chengdu, P.R. China; College of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu, P.R. China
    Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2008, (03) : 266 - 271
  • [7] Enhancing metabolomics research through data mining
    Martinez-Arranz, Ibon
    Mayo, Rebeca
    Perez-Cormenzana, Miriam
    Minchole, Itziar
    Salazar, Lorena
    Alonso, Cristina
    Mato, Jose M.
    Journal of Proteomics, 2015, 127 : 275 - 288
  • [8] Enhancing techniques for learning decision trees from imbalanced data
    Chaabane, Ikram
    Guermazi, Radhouane
    Hammami, Mohamed
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2020, 14 (03) : 677 - 745
  • [9] Data Mining and Data-Driven Modelling in Engineering Geology Applications
    Doglioni, Angelo
    Galeandro, Annalisa
    Simeone, Vincenzo
    ENGINEERING GEOLOGY FOR SOCIETY AND TERRITORY, VOL 5: URBAN GEOLOGY, SUSTAINABLE PLANNING AND LANDSCAPE EXPLOITATION, 2015, : 647 - 650
  • [10] Data-driven learning from dynamic pricing data - Classification and forecasting
    Christensen, Morten Herget
    Nozal, Diego Caviedes
    Kavadakis, Ioannis
    Pinson, Pierre
    2019 IEEE MILAN POWERTECH, 2019,