Emergent unsupervised clustering paradigms with potential application to bioinformatics

Cited by: 21
Authors
Miller, David J. [1]
Wang, Yue [2]
Kesidis, George [3, 4]
Affiliations
[1] Penn State Univ, Dept Elect Engn, University Pk, PA 16802 USA
[2] Virginia Polytech Inst & State Univ, Dept ECE, Arlington, VA 22203 USA
[3] Penn State Univ, Dept EE, University Pk, PA 16802 USA
[4] Penn State Univ, Dept CSE, University Pk, PA 16802 USA
Source
FRONTIERS IN BIOSCIENCE-LANDMARK | 2008 / Vol. 13
Keywords
clustering; feature selection; model order selection; semisupervised learning; confounding effects; data fusion; information bottleneck; stability criteria; hierarchical clustering; review;
DOI
10.2741/2711
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
In recent years, there has been a great upsurge in the application of data clustering, statistical classification, and related machine learning techniques to the field of molecular biology, in particular analysis of DNA microarray expression data. Clustering methods can be used to group co-expressed genes, shedding light on gene function and co-regulation. Alternatively, they can group samples or conditions to identify phenotypical groups, disease subgroups, or to help identify disease pathways. A rich variety of unsupervised techniques have been applied, including partitional, hierarchical, graph-based, model-based, and biclustering methods. While a number of machine learning problems and tools have found mainstream applications in bioinformatics, in this article we identify some challenging problems which, though clearly relevant to bioinformatics, have not been extensively investigated in this domain. These include i) unsupervised clustering with unsupervised feature selection, ii) semisupervised learning, iii) unsupervised learning (and supervised learning) in the presence of confounding variables, and iv) stability of clustering solutions. We review recent methods which address these problems and take the position that these methods are well-suited to addressing some common scenarios that occur in bioinformatics.
Pages: 677-690
Number of pages: 14