Breast cancer survivability prediction using labeled, unlabeled, and pseudo-labeled patient data

Cited by: 54
Authors
Kim, Juhyeon [1]
Shin, Hyunjung [1]
Affiliations
[1] Ajou Univ, Dept Ind Engn, Suwon 443749, South Korea
Funding
National Research Foundation, Singapore;
Keywords
MODELS;
DOI
10.1136/amiajnl-2012-001570
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Background: Prognostic studies of breast cancer survivability have been aided by machine learning algorithms, which can predict the survival of a particular patient based on historical patient data. However, labeled patient records are not easy to collect. It takes at least 5 years to label a patient record as 'survived' or 'not survived'. Unguided trials of numerous types of oncology therapies are also very expensive, and confidentiality agreements with doctors and patients are required to obtain labeled patient records.

Proposed method: These difficulties in collecting labeled patient data have led researchers to consider semi-supervised learning (SSL), a machine learning approach that can also use unlabeled patient data, which are relatively easy to collect. SSL is therefore regarded as a way to circumvent the known difficulties. Even with SSL, however, more labeled data still lead to better prediction. To compensate for the lack of labeled patient data, we consider tagging unlabeled patient data with virtual labels, that is, 'pseudo-labels', and treating them as if they were labeled.

Results: Our proposed algorithm, 'SSL Co-training', implements this concept based on SSL. SSL Co-training was tested on the Surveillance, Epidemiology, and End Results (SEER) database for breast cancer and delivered a mean accuracy of 76% and a mean area under the curve of 0.81.
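
The pseudo-labeling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' SSL Co-training algorithm, whose details the abstract does not give. It is a generic co-training loop under stated assumptions: each record has two numeric feature "views", each view gets a simple 1-D nearest-centroid classifier (a stand-in for a real SSL model), there are exactly two classes, and an unlabeled record receives a pseudo-label only when both classifiers agree with a margin of at least `min_margin`. The names `fit_centroids`, `predict`, `co_train`, and `min_margin` are hypothetical.

```python
from statistics import mean


def fit_centroids(values, labels):
    """Fit a 1-D nearest-centroid model: the mean feature value per class."""
    return {c: mean(v for v, y in zip(values, labels) if y == c)
            for c in set(labels)}


def predict(centroids, value):
    """Return (label, margin) for a two-class model.

    margin = distance to the other class centroid minus distance to the
    nearest one; larger means a more confident prediction.
    """
    ranked = sorted(centroids, key=lambda c: abs(value - centroids[c]))
    nearest, other = ranked[0], ranked[1]
    margin = abs(value - centroids[other]) - abs(value - centroids[nearest])
    return nearest, margin


def co_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Grow the labeled set with pseudo-labels (assumed generic scheme).

    labeled:   list of ((view_a, view_b), label)
    unlabeled: list of (view_a, view_b)
    Returns (labeled plus pseudo-labeled records, still-unlabeled records).
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        ys = [y for _, y in labeled]
        model_a = fit_centroids([x[0] for x, _ in labeled], ys)
        model_b = fit_centroids([x[1] for x, _ in labeled], ys)
        remaining, added = [], 0
        for x in pool:
            label_a, margin_a = predict(model_a, x[0])
            label_b, margin_b = predict(model_b, x[1])
            # Pseudo-label only when both views agree and both are confident.
            if label_a == label_b and min(margin_a, margin_b) >= min_margin:
                labeled.append((x, label_a))
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:  # nothing new was pseudo-labeled, so stop early
            break
    return labeled, pool


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    seed = [((0.0, 0.0), 0), ((10.0, 10.0), 1)]
    pool = [(1.0, 2.0), (9.0, 8.0), (5.0, 5.0), (2.0, 9.0)]
    grown, left = co_train(seed, pool)
    print(len(grown), "labeled or pseudo-labeled;", len(left), "left unlabeled")
```

In this toy run, records on which the two views disagree, or which sit between the class centroids, stay unlabeled, which is the conservative behavior one would want before treating pseudo-labels as if they were real ones.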
Pages: 613-618
Number of pages: 6