Selecting Features for Breast Cancer Analysis and Prediction

Cited by: 6
Authors
Ray, Sujan [1]
AlGhamdi, Ali [1, 2]
Alshouiliy, Khaldoon [1]
Agrawal, Dharma P. [1]
Affiliations
[1] Univ Cincinnati, EECS, Ctr Distributed & Mobile Comp, Cincinnati, OH 45221 USA
[2] AlBaha Univ, Dept Comp Sci & Informat Technol, Al Agig, Saudi Arabia
Source
PROCEEDINGS OF THE 2020 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING AND COMMUNICATION ENGINEERING (ICACCE-2020) | 2020
Keywords
Apache Spark; Big Data; Breast Cancer; Data Pre-processing; Decision Tree; Healthcare; Machine Learning; Normalization; PCA; Random Forest; Wisconsin Diagnosis Breast Cancer Dataset;
DOI
10.1109/icacce49060.2020.9154919
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Breast cancer (BC) is the second most common cancer in women after skin cancer and has become a major health issue. It is therefore very important to diagnose BC correctly and to categorize tumors as malignant or benign. Machine Learning (ML) techniques have unique advantages, which is why they are widely used to analyze complex BC datasets and predict the disease. The Wisconsin Diagnosis Breast Cancer (WDBC) dataset has been used by researchers in this field to develop predictive models for BC. The dataset has 573 instances and 32 features. In this paper, we propose a method for analyzing and predicting BC on the same dataset using Apache Spark. This big data framework is a very powerful tool for working with huge volumes of data, such as healthcare data [4]. Principal Component Analysis (PCA) is applied to the dataset to select the most important features. We run experiments with the top 6 and top 10 features. The experiments are executed on a Hadoop cluster, a cloud platform provided by the Electrical Engineering and Computer Science (EECS) department of the University of Cincinnati. We also compare the performance of two machine learning techniques: Decision Tree and Random Forest classifiers. We set the performance of Decision Tree with the top 10 features as the benchmark in our work. The Random Forest classifier performs better than the Decision Tree algorithm with the top 6 as well as the top 10 features, and it achieves 97.52% accuracy using the top 10 features. Our results show that selecting the right features significantly improves accuracy in predicting BC.
Pages: 6
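The abstract describes normalizing the data and applying PCA to keep the top 6 or top 10 features before classification. The following is a minimal sketch of that pre-processing idea only, not the authors' Spark implementation: it uses plain Python with power iteration instead of a Spark or other library routine, the function names (`min_max_normalize`, `top_components`, `pca_project`) are hypothetical, and the `toy` matrix is made-up illustration data, not the WDBC dataset. Note that PCA yields linear combinations of the input columns, so the "top k features" here are the top k principal components.

```python
import math
import random


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(r, lows, spans)] for r in rows]


def covariance(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def top_components(cov, k, iters=500, seed=0):
    """Top-k eigenvectors of a symmetric matrix: power iteration + deflation."""
    rng = random.Random(seed)
    d = len(cov)
    work = [row[:] for row in cov]
    components, variances = [], []
    for _ in range(k):
        v = [rng.random() for _ in range(d)]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        # Rayleigh quotient: variance explained along the unit vector v.
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(d))
                  for i in range(d))
        components.append(v)
        variances.append(lam)
        # Deflate so the next pass finds the next-largest direction.
        for i in range(d):
            for j in range(d):
                work[i][j] -= lam * v[i] * v[j]
    return components, variances


def pca_project(rows, k):
    """Normalize, then project each sample onto the top-k components."""
    cov, centered = covariance(min_max_normalize(rows))
    components, variances = top_components(cov, k)
    projected = [[sum(x * w for x, w in zip(c, comp)) for comp in components]
                 for c in centered]
    return projected, variances


# Toy data (made up for illustration; NOT the WDBC dataset):
# 6 samples with 4 features each.
toy = [
    [1.0, 10.0, 0.5, 3.0],
    [2.0, 19.0, 0.4, 3.1],
    [3.0, 31.0, 0.6, 2.9],
    [4.0, 42.0, 0.5, 3.2],
    [5.0, 48.0, 0.7, 3.0],
    [6.0, 61.0, 0.3, 3.1],
]
projected, variances = pca_project(toy, k=2)
```

In this sketch, `projected` holds one reduced row per sample and `variances` the variance captured by each kept component; the reduced rows would then be passed to a classifier such as a decision tree or random forest. In a real pipeline a library PCA routine would normally replace the hand-written power iteration.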