Learning from multiple sources of inaccurate data

Cited by: 2
Authors
Baliga, G
Jain, S
Sharma, A
Affiliations
[1] NATL UNIV SINGAPORE, DEPT INFORMAT SYST & COMP SCI, SINGAPORE 117548, SINGAPORE
[2] UNIV NEW S WALES, SCH ENGN & COMP SCI, SYDNEY, NSW 2052, AUSTRALIA
Keywords
inductive inference; machine learning; inaccurate data; multiple sources
DOI
10.1137/S0097539792239461
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Most theoretical models of inductive inference make the idealized assumption that the data available to a learner is from a single and accurate source. The subject of inaccuracies in data emanating from a single source has been addressed by several authors. The present paper argues in favor of a more realistic learning model in which data emanates from multiple sources, some or all of which may be inaccurate. Three kinds of inaccuracies are considered: spurious data (modeled as noisy texts), missing data (modeled as incomplete texts), and a mixture of spurious and missing data (modeled as imperfect texts). Motivated by the above argument, the present paper introduces and theoretically analyzes a number of inference criteria in which a learning machine is fed data from multiple sources, some of which may be infected with inaccuracies. The learning situation modeled is the identification in the limit of programs from graphs of computable functions. The main parameters of the investigation are: the kind of inaccuracy, the total number of data sources, the number of faulty data sources which produce data within an acceptable bound, and the bound on the number of errors allowed in the final hypothesis learned by the machine. Sufficient conditions are determined under which, for the same kind of inaccuracy, for the same bound on the number of errors in the final hypothesis, and for the same bound on the number of inaccuracies, learning from multiple texts, some of which may be inaccurate, is equivalent to learning from a single inaccurate text. The general problem of determining when learning from multiple inaccurate texts is a restriction over learning from a single inaccurate text turns out to be combinatorially very complex. Significant partial results are provided for this problem.
Several results are also provided about conditions under which the detrimental effects of multiple texts can be overcome by either allowing more errors in the final hypothesis or by reducing the number of inaccuracies in the texts. It is also shown that the usual hierarchies hold: allowing extra errors in the final program increases learning power, and allowing extra inaccuracies in the texts decreases it. Finally, it is demonstrated that in the context of learning from multiple inaccurate texts, spurious data is better than missing data, which in turn is better than a mixture of spurious and missing data.
Pages: 961-990
Page count: 30