A new class of metrics for learning on real-valued and structured data

Cited by: 0
Authors
Ruiyu Yang
Yuxiang Jiang
Scott Mathews
Elizabeth A. Housworth
Matthew W. Hahn
Predrag Radivojac
Affiliations
[1] Indiana University
[2] Northeastern University
Source
Data Mining and Knowledge Discovery | 2019 / Vol. 33
Keywords
Distance; Metric; Ontology; Machine learning; Text mining; High-dimensional data; Computational biology;
DOI
Not available
Abstract
We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with f-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct an empirical investigation to demonstrate their effectiveness on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski ($L^p$) family, the fractional $L^p$ family, two f-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.
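As background, the abstract names several classical distances that the proposed metrics unify. A minimal sketch of two of them follows, using their standard textbook definitions (this illustrates the familiar special cases only, not the paper's new unified construction):

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance on sets: 1 - |A ∩ B| / |A ∪ B|.

    Defined as 0 when both sets are empty.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def manhattan_distance(x, y) -> float:
    """Manhattan (L^1) distance on vectors of equal length:
    sum of coordinate-wise absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


# {1,2,3} and {2,3,4} share 2 of 4 elements in the union -> distance 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 0.5
# |1-3| + |2-0| = 4
print(manhattan_distance([1.0, 2.0], [3.0, 0.0]))  # 4.0
```

Both functions are metrics (non-negative, symmetric, satisfying the triangle inequality), which is the property the paper's generalized family preserves.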
Pages: 995-1016
Page count: 21
Related Papers
50 records
  • [1] A new class of metrics for learning on real-valued and structured data
    Yang, Ruiyu
    Jiang, Yuxiang
    Mathews, Scott
    Housworth, Elizabeth A.
    Hahn, Matthew W.
    Radivojac, Predrag
    DATA MINING AND KNOWLEDGE DISCOVERY, 2019, 33 (04) : 995 - 1016
  • [2] Multiple-instance learning of real-valued data
    Dooly, DR
    Zhang, Q
    Goldman, SA
    Amar, RA
    JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) : 651 - 678
  • [3] A note on the linearity of real-valued functions with respect to suitable metrics
    Frosini, P
    GEOMETRIAE DEDICATA, 2004, 108 (01) : 105 - 110
  • [4] On agnostic learning with {0,*,1}-valued and real-valued hypotheses
    Long, PM
    COMPUTATIONAL LEARNING THEORY, PROCEEDINGS, 2001, 2111 : 289 - 302
  • [5] THE CONVERGENCE OF REAL-VALUED CLASS (B-) GWT
    WANG, ZP
    KEXUE TONGBAO (Science Bulletin), 1986, 31 (13) : 934 - 934
  • [6] Atypical Information Theory for Real-Valued Data
    Host-Madsen, Anders
    Sabeti, Elyas
    2015 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY (ISIT), 2015, : 666 - 670
  • [7] Comments on real-valued negative selection vs. real-valued positive selection and one-class SVM
    Stibor, Thomas
    Timmis, Jonathan
    2007 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION, VOLS 1-10, PROCEEDINGS, 2007, : 3727+