Selecting Features for Breast Cancer Analysis and Prediction

Cited by: 6
Authors
Ray, Sujan [1]
AlGhamdi, Ali [1, 2]
Alshouiliy, Khaldoon [1]
Agrawal, Dharma P. [1]
Affiliations
[1] Univ Cincinnati, EECS, Ctr Distributed & Mobile Comp, Cincinnati, OH 45221 USA
[2] AlBaha Univ, Dept Comp Sci & Informat Technol, Al Agig, Saudi Arabia
Source
PROCEEDINGS OF THE 2020 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING AND COMMUNICATION ENGINEERING (ICACCE-2020) | 2020
Keywords
Apache Spark; Big Data; Breast Cancer; Data Pre-processing; Decision Tree; Healthcare; Machine Learning; Normalization; PCA; Random Forest; Wisconsin Diagnosis Breast Cancer Dataset;
DOI
10.1109/icacce49060.2020.9154919
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Breast cancer (BC) is the second most common cancer in women after skin cancer and has become a major health issue. It is therefore very important to diagnose BC correctly and to categorize tumors as malignant or benign. Machine Learning (ML) techniques have unique advantages, which is why they are widely used to analyze complex BC datasets and predict the disease. Researchers in this field have used the Wisconsin Diagnosis Breast Cancer (WDBC) dataset to develop predictive models for BC. The dataset has 569 instances and 32 features. In this paper, we propose a method for analyzing and predicting BC on the same dataset using Apache Spark. This big data framework is a very powerful tool for working on huge volumes of data, such as healthcare data [4]. Principal Component Analysis (PCA) has been applied to the dataset to select the most important features. We ran experiments with the top 6 and top 10 features. The experiments were executed on a Hadoop cluster, a cloud platform provided by the Electrical Engineering and Computer Science (EECS) department of the University of Cincinnati. We also compare the performance of two machine learning techniques: Decision Tree and Random Forest Classifier. We set the performance of Decision Tree with the top 10 features as the benchmark in our work. The Random Forest Classifier performs better than the Decision Tree algorithm with the top 6 as well as the top 10 features. Random Forest achieves 97.52% accuracy using the top 10 features. Our results show that selecting the right features significantly improves accuracy in predicting BC.
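The comparison described in the abstract can be sketched on a single machine as follows. This is a minimal illustration, not the authors' pipeline: it uses scikit-learn rather than Apache Spark, loads scikit-learn's bundled copy of the WDBC data, and reads "top 6 and 10 features" as the leading 6 and 10 principal components (PCA yields linear combinations of the inputs, not a subset of the original columns). The function name `evaluate`, the 70/30 split, the seed, and the forest size are all assumptions made for illustration, and the sketch is not expected to reproduce the reported 97.52% figure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate(n_components, seed=0):
    """Return the test accuracy of each classifier on n_components PCA components.

    Hypothetical helper: the split ratio, seed, and hyperparameters are
    illustrative assumptions, not values taken from the paper.
    """
    # scikit-learn's bundled copy of the WDBC data (30 numeric features + label).
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        # Normalize, project onto the leading principal components, then classify.
        # Fitting the scaler and PCA inside the pipeline keeps test data out of them.
        pipeline = make_pipeline(
            StandardScaler(), PCA(n_components=n_components), model
        )
        pipeline.fit(X_train, y_train)
        scores[name] = pipeline.score(X_test, y_test)
    return scores


if __name__ == "__main__":
    for k in (6, 10):
        print(k, evaluate(k))
```

Running the script prints one accuracy dictionary per component count, which allows the two classifiers to be compared side by side for 6 and 10 components.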
Pages: 6