Investigating the role of Simpson's paradox in the analysis of top-ranked features in high-dimensional bioinformatics datasets

Cited by: 7
Author
Freitas, Alex A. [1]
Affiliation
[1] Univ Kent, Computat Intelligence, Canterbury, Kent, England
Keywords
Gene Ontology; machine learning; classification; feature ranking; ageing-related genes
DOI
10.1093/bib/bby126
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
An important problem in bioinformatics consists of identifying the most important features (or predictors) among a large number of features in a given classification dataset. This problem is often addressed by using a machine learning-based feature ranking method to identify a small set of top-ranked predictors (i.e. the most relevant features for classification). The large number of studies in this area has, however, an important limitation: they ignore the possibility that the top-ranked predictors occur in an instance of Simpson's paradox, where the positive or negative association between a predictor and a class variable reverses sign upon conditioning on each of the values of a third (confounder) variable. In this work, we review and investigate the role of Simpson's paradox in the analysis of top-ranked predictors in high-dimensional bioinformatics datasets, in order to avoid the potential danger of misinterpreting an association between a predictor and the class variable. We perform computational experiments using four well-known feature ranking methods from the machine learning field and five high-dimensional datasets of ageing-related genes, where the predictors are Gene Ontology terms. The results show that occurrences of Simpson's paradox involving top-ranked predictors are much more common for one of the feature ranking methods.
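The sign reversal described in the abstract can be checked mechanically. The following is a minimal sketch, assuming a binary predictor, a binary class variable and a categorical confounder; it is not the procedure used in the paper. The function names (`association_sign`, `is_simpsons_paradox`, `expand`) and the counts are hypothetical and purely illustrative; they do not come from the paper's datasets. The association is measured here as the sign of P(class = 1 | predictor = 1) minus P(class = 1 | predictor = 0), which is one of several possible choices.

```python
from collections import defaultdict


def association_sign(rows):
    """Sign of P(y=1 | x=1) - P(y=1 | x=0) for (x, y) pairs of 0/1 values.

    Returns +1, -1, or 0 (0 also covers the undefined case where one
    predictor value never occurs).
    """
    positives = [0, 0]  # number of y == 1 cases for x == 0 and x == 1
    totals = [0, 0]     # number of cases for x == 0 and x == 1
    for x, y in rows:
        totals[x] += 1
        positives[x] += y
    if totals[0] == 0 or totals[1] == 0:
        return 0
    diff = positives[1] / totals[1] - positives[0] / totals[0]
    return (diff > 0) - (diff < 0)


def is_simpsons_paradox(records):
    """records: iterable of (x, y, z) with binary x, y and confounder z.

    True when every stratum of z shows the same non-zero sign of
    association and that sign is the opposite of the overall sign.
    """
    records = list(records)
    overall = association_sign([(x, y) for x, y, _ in records])
    strata = defaultdict(list)
    for x, y, z in records:
        strata[z].append((x, y))
    stratum_signs = {association_sign(rows) for rows in strata.values()}
    return overall != 0 and stratum_signs == {-overall}


def expand(x, z, positives, total):
    """Hypothetical helper: turn aggregate counts into individual records."""
    return [(x, 1, z)] * positives + [(x, 0, z)] * (total - positives)


# Illustrative counts only (not from the paper): within each stratum the
# predictor is positively associated with the class, yet the pooled
# association is negative.
records = (
    expand(1, "small", 81, 87) + expand(1, "large", 192, 263)
    + expand(0, "small", 234, 270) + expand(0, "large", 55, 80)
)
paradox_found = is_simpsons_paradox(records)
```

In a feature-ranking setting, a check of this kind would presumably be run for each top-ranked predictor against each candidate confounder; how candidates are chosen is left open here.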
Pages: 421-428
Number of pages: 8