A new class of metrics for learning on real-valued and structured data

Cited by: 0
Authors
Ruiyu Yang
Yuxiang Jiang
Scott Mathews
Elizabeth A. Housworth
Matthew W. Hahn
Predrag Radivojac
Affiliations
[1] Indiana University
[2] Northeastern University
Source
Data Mining and Knowledge Discovery | 2019 / Vol. 33
Keywords
Distance; Metric; Ontology; Machine learning; Text mining; High-dimensional data; Computational biology;
DOI
Not available
Abstract
We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with f-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct an empirical investigation to demonstrate their effectiveness on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski ($L^p$) family, the fractional $L^p$ family, two f-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.
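As background, the abstract names several classical distances that the proposed metrics unify. A minimal sketch of two of them follows, using their standard textbook definitions (this illustrates the familiar special cases only, not the paper's new unified construction):

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance on sets: 1 - |A ∩ B| / |A ∪ B|.

    Defined as 0 when both sets are empty.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def manhattan_distance(x, y) -> float:
    """Manhattan (L^1) distance on vectors of equal length:
    sum of coordinate-wise absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


# {1,2,3} and {2,3,4} share 2 of 4 elements in the union -> distance 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 0.5
# |1-3| + |2-0| = 4
print(manhattan_distance([1.0, 2.0], [3.0, 0.0]))  # 4.0
```

Both functions are metrics (non-negative, symmetric, satisfying the triangle inequality), which is the property the paper's generalized family preserves.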
Pages: 995-1016
Page count: 21
Related Papers
50 records
  • [1] A new class of metrics for learning on real-valued and structured data
    Yang, Ruiyu
    Jiang, Yuxiang
    Mathews, Scott
    Housworth, Elizabeth A.
    Hahn, Matthew W.
    Radivojac, Predrag
    DATA MINING AND KNOWLEDGE DISCOVERY, 2019, 33 (04) : 995 - 1016
  • [2] Multiple-instance learning of real-valued data
    Dooly, DR
    Zhang, Q
    Goldman, SA
    Amar, RA
    JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) : 651 - 678
  • [3] A note on the linearity of real-valued functions with respect to suitable metrics
    Frosini, P
    GEOMETRIAE DEDICATA, 2004, 108 (01) : 105 - 110
  • [4] On agnostic learning with {0,*,1}-valued and real-valued hypotheses
    Long, PM
    COMPUTATIONAL LEARNING THEORY, PROCEEDINGS, 2001, 2111 : 289 - 302
  • [5] THE CONVERGENCE OF REAL-VALUED CLASS (B-) GWT
    WANG, ZP
    KEXUE TONGBAO (Science Bulletin), 1986, 31 (13) : 934 - 934
  • [6] Atypical Information Theory for Real-Valued Data
    Host-Madsen, Anders
    Sabeti, Elyas
    2015 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY (ISIT), 2015, : 666 - 670
  • [7] Comments on real-valued negative selection vs. real-valued positive selection and one-class SVM
    Stibor, Thomas
    Timmis, Jonathan
    2007 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION, VOLS 1-10, PROCEEDINGS, 2007, : 3727+