AN OUTLIER MAP FOR SUPPORT VECTOR MACHINE CLASSIFICATION

Cited by: 21
Author
Debruyne, Michel [1]
Affiliation
[1] Univ Antwerp, Dept Wiskunde Informat, B-2020 Antwerp, Belgium
Keywords
Support Vector Machine; high-dimensional data analysis; robust statistics; data visualization; CANCER
DOI
10.1214/09-AOAS256
Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract
Support Vector Machines are a widely used classification technique. They are computationally efficient and provide excellent predictions even for high-dimensional data. Moreover, Support Vector Machines are very flexible due to the incorporation of kernel functions, which make it possible to model nonlinearity and also to handle nonnumerical data such as protein strings. However, Support Vector Machines can suffer considerably from unclean data containing, for example, outliers or mislabeled observations. Although several outlier detection schemes have been proposed in the literature, the selection of outliers versus nonoutliers is often rather ad hoc and does not provide much insight into the data. In robust multivariate statistics, outlier maps are popular tools for assessing the quality of the data under consideration. They provide a visual representation of the data depicting several types of outliers. This paper proposes an outlier map designed for Support Vector Machine classification. The Stahel-Donoho outlyingness measure from multivariate statistics is extended to an arbitrary kernel space. A trimmed version of Support Vector Machines is then defined by trimming the part of the samples with the largest outlyingness. Based on this classifier, an outlier map is constructed that visualizes data in any type of high-dimensional kernel space. The outlier map is illustrated on four biological examples, showing its use in exploratory data analysis.
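As a rough illustration of the outlyingness and trimming ideas named in the abstract, the following is a minimal sketch of the classical Stahel-Donoho outlyingness in ordinary Euclidean space, approximated over random directions, followed by a trimming step. It is not the paper's kernel-space extension or its trimmed classifier. The function names, the number of directions, and the trimming fraction are assumptions chosen for illustration only.

```python
import numpy as np


def stahel_donoho_outlyingness(X, n_directions=500, seed=0):
    """Approximate Stahel-Donoho outlyingness of each row of X.

    For a direction a, the outlyingness of x is
    |a'x - median(a'X)| / MAD(a'X); the measure takes the maximum over
    directions. Here the maximum is approximated over random unit
    directions (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Draw random directions and normalise them to unit length.
    A = rng.standard_normal((n_directions, p))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    proj = X @ A.T                       # shape (n, n_directions)
    med = np.median(proj, axis=0)        # robust location per direction
    dev = np.abs(proj - med)
    mad = np.median(dev, axis=0)         # robust scale per direction
    # Directions with zero spread contribute zero outlyingness.
    mad = np.where(mad > 0, mad, np.inf)
    return np.max(dev / mad, axis=1)     # maximum over directions


def trim_mask(outlyingness, trim_fraction=0.25):
    """Boolean mask keeping the samples with the smallest outlyingness."""
    n = len(outlyingness)
    n_keep = int(np.ceil((1.0 - trim_fraction) * n))
    keep = np.zeros(n, dtype=bool)
    keep[np.argsort(outlyingness)[:n_keep]] = True
    return keep


# Hypothetical demo data: 50 Gaussian points with one planted outlier.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.standard_normal((50, 3))
X_demo[0] = [20.0, 20.0, 20.0]

out = stahel_donoho_outlyingness(X_demo)
keep = trim_mask(out, trim_fraction=0.25)
```

In this sketch a classifier would then be fitted only on `X_demo[keep]`. How the paper selects directions in kernel space and how it builds the outlier map from the trimmed classifier are not described in the abstract and are not reproduced here.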
Pages: 1566-1580
Number of pages: 15