A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application

Cited by: 20

Authors:

Mokoatle, Mpho [1]

Marivate, Vukosi [1]

Mapiye, Darlington [2]

Bornman, Riana [4]

Hayes, Vanessa M. [3,4]

Affiliations:

[1] Univ Pretoria, Dept Comp Sci, Pretoria, South Africa

[2] CapeBio TM Technol, Centurion, South Africa

[3] Univ Sydney, Sch Med Sci, Sydney, Australia

[4] Univ Pretoria, Sch Hlth Syst & Publ Hlth, Pretoria, South Africa

Source:

BMC BIOINFORMATICS | 2023 / Vol. 24 / Issue 01

Funding:

UK Medical Research Council;

Keywords:

Cancer detection; DNA; Machine learning; SentenceBert; SimCSE; COLORECTAL-CANCER; IMAGE DATABASE; CLASSIFICATION; EPIDEMIOLOGY; ALGORITHMS; NODULES; TISSUE;

DOI:

10.1186/s12859-023-05235-x

Chinese Library Classification:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Background: Using visual, biological, and electronic health records data as the sole input source, pretrained convolutional neural networks and conventional machine learning methods have been heavily employed for the identification of various malignancies. Initially, a series of preprocessing and image segmentation steps are performed to extract region-of-interest features from noisy features. Then, the extracted features are applied to several machine learning and deep learning methods for the detection of cancer.

Methods: In this work, a review of all the methods that have been applied to develop machine learning algorithms that detect cancer is provided. With more than 100 types of cancer, this study only examines research on the four most common and prevalent cancers worldwide: lung, breast, prostate, and colorectal cancer. Next, by using the state-of-the-art sentence transformers SBERT (2019) and the unsupervised SimCSE (2021), this study proposes a new methodology for detecting cancer. This method requires raw DNA sequences of matched tumor/normal pairs as the only input. The learnt DNA representations retrieved from SBERT and SimCSE are then sent to machine learning algorithms (XGBoost, Random Forest, LightGBM, and CNNs) for classification. As far as we are aware, SBERT and SimCSE transformers have not been applied to represent DNA sequences in cancer detection settings.

Results: The XGBoost model, which had the highest overall accuracy of 73 ± 0.13% using SBERT embeddings and 75 ± 0.12% using SimCSE embeddings, was the best performing classifier. In light of these findings, it can be concluded that incorporating sentence representations from SimCSE's sentence transformer only marginally improved the performance of machine learning models.
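The pipeline outlined in the abstract (raw DNA sequence, then a sentence-style embedding, then a conventional classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the overlapping k-mer tokenization, the value of `k`, the function names, and in particular `hashed_embedding`, which is a standard-library stand-in for a real SBERT or SimCSE encoder so that the example runs without downloading a model.

```python
import hashlib
import math
from typing import Callable, List, Sequence


def kmer_sentence(seq: str, k: int = 6) -> str:
    """Turn a DNA string into a space-separated 'sentence' of overlapping k-mers.

    Assumption: the abstract does not say how sequences are tokenized;
    overlapping k-mers are one common choice, used here for illustration.
    """
    seq = seq.upper()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def hashed_embedding(sentence: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in for a sentence encoder such as SBERT or SimCSE.

    Builds a hashed bag-of-tokens vector and L2-normalises it. It is NOT a
    learned representation; it only mimics the 'sentence -> fixed vector' shape.
    """
    vec = [0.0] * dim
    for token in sentence.split():
        index = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def embed_sequences(
    seqs: Sequence[str],
    encoder: Callable[[str], List[float]] = hashed_embedding,
    k: int = 6,
) -> List[List[float]]:
    """Map each DNA sequence to one feature vector, ready for a classifier."""
    return [encoder(kmer_sentence(s, k)) for s in seqs]


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    features = embed_sequences(["ACGTACGTAC", "TTTTTTTTTT"], k=3)
    print(len(features), len(features[0]))
```

In a fuller version, `encoder` would wrap a pretrained sentence transformer, and the resulting feature matrix would be passed with tumor/normal labels to a classifier such as XGBoost, Random Forest, or LightGBM, the model families named in the abstract.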

Pages: 25
