Multi-view representation learning for tabular data integration using inter-feature relationships

Cited by: 2
Authors
Tripathi, Sandhya [1]
Fritz, Bradley A. [1]
Abdelhack, Mohamed [3]
Avidan, Michael S. [1]
Chen, Yixin [2]
King, Christopher R. [1]
Affiliations
[1] Washington Univ, Dept Anesthesiol, St Louis, MO 63110 USA
[2] Washington Univ, Dept Comp Sci & Engn, St Louis, MO USA
[3] Ctr Addict & Mental Hlth, Krembil Ctr Neuroinformat, Toronto, ON, Canada
Keywords
Schema matching; Electronic health records; Contrastive learning; Fingerprints; Partial autoencoders
DOI
10.1016/j.jbi.2024.104602
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Objective: An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in healthcare. This integration is usually resolved using meta-data such as feature names, which may be unavailable or ambiguous. Our goal is to design methods that create a mapping between structured tabular datasets derived from electronic health records independent of meta-data.
Methods: We evaluate methods in the challenging case of numeric features without reliable and distinctive univariate summaries, such as nearly Gaussian and binary features. We assume that a small set of features are a priori mapped between two datasets, which share unknown identical features and possibly many unrelated features. We expect inter-feature relationships to be the main source of identification. We compare the performance of contrastive learning methods for feature representations, novel partial auto-encoders, mutual-information graph optimizers, and simple statistical baselines on simulated data, public datasets, the MIMIC-III medical-record changeover, and perioperative records from before and after a medical-record system change. Performance was evaluated using both mapping of identical features and reconstruction accuracy of examples in the format of the other dataset.
Results: Contrastive learning-based methods performed best overall, often substantially beating the literature baseline in matching and reconstruction, especially in the more challenging real-data experiments. Partial auto-encoder methods showed matching on par with contrastive methods in all synthetic and some real datasets, along with good reconstruction. However, the statistical method we created performed reasonably well in many cases, with much less dependence on hyperparameter tuning. When validating feature match output in the EHR dataset, we found that some mistakes were actually a surrogate or related feature, as reviewed by two subject matter experts.
Conclusion: In simulation studies and real-world examples, we find that inter-feature relationships are effective at identifying matching or closely related features across tabular datasets when meta-data is not available. Decoder architectures are also reasonably effective at imputing features without an exact match.
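The idea of identifying features through their relationships to a small set of already-mapped features can be illustrated with a toy sketch. This is not the authors' method or code: it is a minimal correlation-based illustration, and every name in it (`pearson`, `fingerprint`, `match_features`, `simulate`) is hypothetical. Each unmapped column is summarized by its correlations with the pre-mapped anchor columns, and columns are then paired across datasets by the cosine similarity of those summaries. The simulated data, noise level, and greedy best-match rule are assumptions chosen for brevity.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fingerprint(column, anchors):
    """Summarize a column by its correlation with each pre-mapped anchor column."""
    return [pearson(column, anchor) for anchor in anchors]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_features(anchors_a, unmapped_a, anchors_b, unmapped_b):
    """Map each unmapped column index in A to its most similar column index in B.

    Greedy best match per column (an assumption made for brevity); it does not
    enforce a one-to-one assignment.
    """
    prints_a = [fingerprint(col, anchors_a) for col in unmapped_a]
    prints_b = [fingerprint(col, anchors_b) for col in unmapped_b]
    return {
        i: max(range(len(prints_b)), key=lambda j: cosine(fa, prints_b[j]))
        for i, fa in enumerate(prints_a)
    }


def simulate(rng, n_rows, n_features):
    """Toy data: each hidden column is a noisy copy of one anchor column."""
    anchors = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]
    hidden = [[a + 0.3 * rng.gauss(0, 1) for a in col] for col in anchors]
    return anchors, hidden


# Two datasets with different rows; dataset B's unmapped columns are shuffled.
rng = random.Random(0)
anchors_a, hidden_a = simulate(rng, 500, 4)
anchors_b, hidden_b = simulate(rng, 500, 4)
order = [2, 0, 3, 1]  # column j of shuffled_b is hidden_b[order[j]]
shuffled_b = [hidden_b[j] for j in order]

mapping = match_features(anchors_a, hidden_a, anchors_b, shuffled_b)
```

In this toy setup, `mapping` pairs each unmapped column of the first dataset with the shuffled column of the second that came from the same anchor, without using any column names. Real data would need a one-to-one assignment step and handling for features that have no counterpart in the other dataset.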
Pages: 12