Learning from multiple sources of inaccurate data

Cited by: 2
Authors
Baliga, G
Jain, S
Sharma, A
Affiliations
[1] NATL UNIV SINGAPORE, DEPT INFORMAT SYST & COMP SCI, SINGAPORE 117548, SINGAPORE
[2] UNIV NEW S WALES, SCH ENGN & COMP SCI, SYDNEY, NSW 2052, AUSTRALIA
Keywords
inductive inference; machine learning; inaccurate data; multiple sources
DOI
10.1137/S0097539792239461
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Most theoretical models of inductive inference make the idealized assumption that the data available to a learner is from a single and accurate source. The subject of inaccuracies in data emanating from a single source has been addressed by several authors. The present paper argues in favor of a more realistic learning model in which data emanates from multiple sources, some or all of which may be inaccurate. Three kinds of inaccuracies are considered: spurious data (modeled as noisy texts), missing data (modeled as incomplete texts), and a mixture of spurious and missing data (modeled as imperfect texts). Motivated by the above argument, the present paper introduces and theoretically analyzes a number of inference criteria in which a learning machine is fed data from multiple sources, some of which may be infected with inaccuracies. The learning situation modeled is the identification in the limit of programs from graphs of computable functions. The main parameters of the investigation are: the kind of inaccuracy, the total number of data sources, the number of faulty data sources which produce data within an acceptable bound, and the bound on the number of errors allowed in the final hypothesis learned by the machine. Sufficient conditions are determined under which, for the same kind of inaccuracy, for the same bound on the number of errors in the final hypothesis, and for the same bound on the number of inaccuracies, learning from multiple texts, some of which may be inaccurate, is equivalent to learning from a single inaccurate text. The general problem of determining when learning from multiple inaccurate texts is a restriction over learning from a single inaccurate text turns out to be combinatorially very complex. Significant partial results are provided for this problem.
Several results are also provided about conditions under which the detrimental effects of multiple texts can be overcome by either allowing more errors in the final hypothesis or by reducing the number of inaccuracies in the texts. It is also shown that the usual hierarchies resulting from allowing extra errors in the final program (results in increased learning power) and allowing extra inaccuracies in the texts (results in decreased learning power) hold. Finally, it is demonstrated that in the context of learning from multiple inaccurate texts, spurious data is better than missing data, which in turn is better than a mixture of spurious and missing data.
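As an illustration only, the three kinds of inaccuracy named in the abstract can be sketched on finite data. The paper itself works with infinite texts for graphs of computable functions; the finite setting, the function names `graph` and `classify`, and the exact way the bound is counted (spurious plus missing pairs for imperfect data) are assumptions made for this toy example, not definitions quoted from the paper.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def graph(f: Dict[int, int]) -> Set[Pair]:
    """Graph of a finite function given as a dict: the set of (x, f(x)) pairs."""
    return set(f.items())


def classify(content: Set[Pair], target: Set[Pair], bound: int) -> Set[str]:
    """Return every inaccuracy label that the data `content` satisfies
    with respect to the true graph `target`, within `bound` inaccuracies.

    Hypothetical helper: the labels mirror the abstract's terminology,
    but the counting convention is an assumption of this sketch.
    """
    spurious = content - target   # pairs present in the data but not in the graph
    missing = target - content    # pairs of the graph absent from the data
    kinds: Set[str] = set()
    if not spurious and not missing:
        kinds.add("accurate")
    if not missing and len(spurious) <= bound:
        kinds.add("noisy")        # only spurious data, within the bound
    if not spurious and len(missing) <= bound:
        kinds.add("incomplete")   # only missing data, within the bound
    if len(spurious) + len(missing) <= bound:
        kinds.add("imperfect")    # any mixture, within the bound
    return kinds


if __name__ == "__main__":
    target = graph({0: 0, 1: 1, 2: 4})
    print(classify(target | {(3, 7)}, target, bound=1))
    print(classify(target - {(2, 4)}, target, bound=1))
```

Under this counting convention, data that is noisy or incomplete within a bound is also imperfect within the same bound, so imperfect data is the most permissive of the three categories.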
Pages: 961-990
Page count: 30