Active metric learning for supervised classification

Cited by: 4
Authors
Kumaran, Krishnan [1]
Papageorgiou, Dimitri J. [1]
Takac, Martin [2]
Lueg, Laurens [3]
Sahinidis, Nikolaos V. [3]
Affiliations
[1] ExxonMobil Res & Engn Co, 1545 US 22, Annandale, NJ 08801 USA
[2] Lehigh Univ, Dept Ind & Syst Engn, 200 West Packer Ave, Bethlehem, PA 18015 USA
[3] Carnegie Mellon Univ, Dept Chem Engn, Pittsburgh, PA 15213 USA
Funding
US National Science Foundation
Keywords
Active learning; Clustering; Metric learning; Mixed-integer optimization
DOI
10.1016/j.compchemeng.2020.107132
Chinese Library Classification
TP39 [Computer applications]
Discipline codes
081203; 0835
Abstract
Clustering and classification critically rely on distance metrics that provide meaningful comparisons between data points. To this end, metric learning, the task of learning optimal distance functions from data, aims to facilitate supervised classification, particularly in high-dimensional spaces where visualization is challenging or infeasible. The Mahalanobis metric is the default choice owing to its simplicity and its interpretability as a transformation of the Euclidean metric by a combination of rotation and scaling. In this work, we present several novel contributions to metric learning, both in formulation and in solution methods. Our approach is motivated by agglomerative clustering, with novel modifications that allow the user-defined classes to be interpreted naturally as clusters under the optimal metric. It generalizes and improves upon leading methods by removing their reliance on pre-designated "target neighbors," "triplets," and "similarity pairs." Starting from the definition of a generalized metric whose second-order term is the Mahalanobis metric, we propose an objective function for metric selection that, unlike most previous work, does not aim to isolate classes from each other, but instead distorts the space minimally while aggregating co-class members into local clusters. We formulate the resulting problem as a mixed-integer optimization problem that can be solved efficiently for small and medium datasets and approximated for larger datasets. Another salient feature of our method is that it facilitates active learning by recommending precise regions to sample, using the optimal metric, to improve classification performance; these regions are the boundary and outlier points of the dataset as defined by the metric. Such targeted acquisition can significantly reduce computation and data-acquisition costs by ensuring that the training data are complete, representative, and economical, and it could also aid training-data selection for other established methods such as deep learning and random forests. We demonstrate the classification and computational performance of our approach through several simple and intuitive examples, followed by results on real image and benchmark datasets. (C) 2020 Elsevier Ltd. All rights reserved.
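To make two ideas from the abstract concrete, below is a minimal NumPy sketch, not the authors' implementation: it shows (i) the Mahalanobis metric as a rotation-and-scaling transform of the Euclidean metric via the factorization M = LᵀL, and (ii) a simple nearest-neighbor ratio score of the kind an active-learning step could use to flag boundary and outlier points for sampling. The function names mahalanobis_dist and boundary_scores and the ratio heuristic are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mahalanobis_dist(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) for positive semidefinite M."""
    d = x - y
    return np.sqrt(d @ M @ d)

# Writing M = L^T L guarantees positive semidefiniteness; the factor L acts
# on the data as a rotation-and-scaling transform, so the Mahalanobis
# distance under M equals the Euclidean distance after mapping points by L.
rng = np.random.default_rng(0)
L = rng.normal(size=(2, 2))
M = L.T @ L

x, y = rng.normal(size=2), rng.normal(size=2)
print(mahalanobis_dist(x, y, M))       # distance under the metric M
print(np.linalg.norm(L @ x - L @ y))   # identical: Euclidean distance after L

def boundary_scores(X, labels, M):
    """Score each point by (nearest other-class distance) /
    (nearest same-class distance) under M; small scores flag candidate
    boundary points, large scores flag outliers within their class.
    Assumes every class has at least two points."""
    scores = []
    for i in range(len(X)):
        same = min(mahalanobis_dist(X[i], X[j], M)
                   for j in range(len(X))
                   if j != i and labels[j] == labels[i])
        diff = min(mahalanobis_dist(X[i], X[j], M)
                   for j in range(len(X)) if labels[j] != labels[i])
        scores.append(diff / same)
    return np.array(scores)

# Toy usage: two 2-D classes; the lowest-score points sit near the boundary.
X = rng.normal(size=(40, 2))
labels = (X[:, 0] > 0).astype(int)
print(np.argsort(boundary_scores(X, labels, M))[:5])  # sampling candidates
```

In the paper's formulation the metric itself is chosen by mixed-integer optimization; here M is random purely to exercise the two functions.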
Pages: 21