Selecting Features for Breast Cancer Analysis and Prediction

Cited by: 6
Authors
Ray, Sujan [1 ]
AlGhamdi, Ali [1 ,2 ]
Alshouiliy, Khaldoon [1 ]
Agrawal, Dharma P. [1 ]
Affiliations
[1] Univ Cincinnati, EECS, Ctr Distributed & Mobile Comp, Cincinnati, OH 45221 USA
[2] AlBaha Univ, Dept Comp Sci & Informat Technol, Al Agig, Saudi Arabia
Source
PROCEEDINGS OF THE 2020 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING AND COMMUNICATION ENGINEERING (ICACCE-2020) | 2020
Keywords
Apache Spark; Big Data; Breast Cancer; Data Pre-processing; Decision Tree; Healthcare; Machine Learning; Normalization; PCA; Random Forest; Wisconsin Diagnosis Breast Cancer Dataset;
DOI
10.1109/icacce49060.2020.9154919
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Breast Cancer (BC) is the second most common cancer in women after skin cancer and has become a major health issue. It is therefore very important to diagnose BC correctly and to categorize tumors as malignant or benign. Machine Learning (ML) techniques have unique advantages, which is why they are widely used to analyze complex BC datasets and predict the disease. Researchers in this field have used the Wisconsin Diagnosis Breast Cancer (WDBC) dataset to develop predictive models for BC. The dataset has 569 instances and 32 features. In this paper, we propose a method for analyzing and predicting BC on the same dataset using Apache Spark. This big data framework is a very powerful tool for working with huge volumes of data, such as healthcare data [4]. Principal Component Analysis (PCA) is applied to the dataset to select the most important features. We run experiments with the top 6 and top 10 features. The experiments are executed on a Hadoop cluster, a cloud platform provided by the Electrical Engineering and Computer Science (EECS) department of the University of Cincinnati. We also compare the performance of two machine learning techniques: Decision Tree and Random Forest Classifier. We set the performance of Decision Tree with the top 10 features as the benchmark in our work. Random Forest Classifier performs better than the Decision Tree algorithm with the top 6 as well as the top 10 features. Random Forest achieves 97.52% accuracy using the top 10 features. Our results show that selecting the right features significantly improves accuracy in predicting BC.
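The workflow summarized in the abstract (normalize, reduce with PCA, then compare Decision Tree and Random Forest) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn on a single machine instead of Apache Spark on a Hadoop cluster, it reads "top 6 and top 10 features" as the first 6 and 10 principal components, and the train/test split, the seed, and the helper name `evaluate` are hypothetical choices. The copy of the WDBC data bundled with scikit-learn exposes 30 numeric features, without the ID and diagnosis columns.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate(n_components, seed=0):
    """Return the test accuracy of each classifier on the first n_components PCs.

    Hypothetical helper: the 70/30 stratified split and the seed are
    illustrative assumptions, not settings reported in the abstract.
    """
    # WDBC as shipped with scikit-learn: numeric features plus a binary label.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, clf in models.items():
        # Normalize, project onto the leading principal components, then classify.
        # Scaler and PCA are fitted on the training split only.
        pipe = make_pipeline(StandardScaler(), PCA(n_components=n_components), clf)
        pipe.fit(X_train, y_train)
        scores[name] = pipe.score(X_test, y_test)
    return scores


if __name__ == "__main__":
    for k in (6, 10):
        print(k, evaluate(k))
```

Accuracies from this sketch depend on the split and seed, so they should not be expected to match the figures reported in the abstract.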
Pages: 6
Related papers
50 in total
  • [1] Analysis and Prediction of Breast Cancer using AzureML Platform
    Alshouiliy, Khaldoon
    Shivanna, Abhishek
    Ray, Sujan
    AlGhamdi, Ali
    Agrawal, Dharma P.
    2019 IEEE 10TH ANNUAL INFORMATION TECHNOLOGY, ELECTRONICS AND MOBILE COMMUNICATION CONFERENCE (IEMCON), 2019, : 212 - 218
  • [2] Analysis of Classification Algorithms for Breast Cancer Prediction
    Rajamohana, S. P.
    Umamaheswari, K.
    Karunya, K.
    Deepika, R.
    DATA MANAGEMENT, ANALYTICS AND INNOVATION, ICDMAI 2019, VOL 1, 2020, 1042 : 517 - 528
  • [3] Analysis of breast cancer prediction and visualisation using machine learning models
    Magesh G.
    Swarnalatha P.
    International Journal of Cloud Computing, 2022, 11 (01) : 43 - 60
  • [4] Analysis of DCE-MRI Features in Tumor for Prediction of the Prognosis in Breast Cancer
    Liu, Bin
    Fan, Ming
    Zheng, Shuo
    Li, Lihua
    MEDICAL IMAGING 2019: IMAGING INFORMATICS FOR HEALTHCARE, RESEARCH, AND APPLICATIONS, 2019, 10954
  • [5] Prediction of Breast Cancer Using AI-Based Methods
    Aamir, Sanam
    Rahim, Aqsa
    Bashir, Sajid
    Naeem, Muddasar
    INTELLIGENT ENVIRONMENTS 2021, 2021, 29 : 213 - 220
  • [6] Random Forest for Breast Cancer Prediction
    Octaviani, T. L.
    Rustam, Z.
    PROCEEDINGS OF THE 4TH INTERNATIONAL SYMPOSIUM ON CURRENT PROGRESS IN MATHEMATICS AND SCIENCES (ISCPMS2018), 2019, 2168
  • [7] BCPUML: Breast Cancer Prediction Using Machine Learning Approach—A Performance Analysis
    Karmakar R.
    Chatterjee S.
    Das A.K.
    Mandal A.
    SN Computer Science, 4 (4)
  • [8] Machine Learning techniques for Prediction from various Breast Cancer Datasets
    Shalini, M.
    Radhika, S.
    2020 SIXTH INTERNATIONAL CONFERENCE ON BIO SIGNALS, IMAGES, AND INSTRUMENTATION (ICBSII), 2020
  • [9] On the Temporal Effects of Features on the Prediction of Breast Cancer Survivability
    Shawky, Doaa M.
    Seddik, Ahmed F.
    CURRENT BIOINFORMATICS, 2017, 12 (04) : 378 - 384
  • [10] MultiOmics analysis of metabolic dysregulation and immune features in breast cancer
    Zhou, Zuo-Yuan
    Bai, Nan
    Zheng, Wen-Jie
    Ni, Su-Jie
    INTERNATIONAL IMMUNOPHARMACOLOGY, 2025, 152