Learning from multiple sources of inaccurate data

Cited by: 2
Authors
Baliga, G
Jain, S
Sharma, A
Affiliations
[1] NATL UNIV SINGAPORE, DEPT INFORMAT SYST & COMP SCI, SINGAPORE 117548, SINGAPORE
[2] UNIV NEW S WALES, SCH ENGN & COMP SCI, SYDNEY, NSW 2052, AUSTRALIA
Keywords
inductive inference; machine learning; inaccurate data; multiple sources
DOI
10.1137/S0097539792239461
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Most theoretical models of inductive inference make the idealized assumption that the data available to a learner is from a single and accurate source. The subject of inaccuracies in data emanating from a single source has been addressed by several authors. The present paper argues in favor of a more realistic learning model in which data emanates from multiple sources, some or all of which may be inaccurate. Three kinds of inaccuracies are considered: spurious data (modeled as noisy texts), missing data (modeled as incomplete texts), and a mixture of spurious and missing data (modeled as imperfect texts). Motivated by the above argument, the present paper introduces and theoretically analyzes a number of inference criteria in which a learning machine is fed data from multiple sources, some of which may be infected with inaccuracies. The learning situation modeled is the identification in the limit of programs from graphs of computable functions. The main parameters of the investigation are: the kind of inaccuracy, the total number of data sources, the number of faulty data sources which produce data within an acceptable bound, and the bound on the number of errors allowed in the final hypothesis learned by the machine. Sufficient conditions are determined under which, for the same kind of inaccuracy, for the same bound on the number of errors in the final hypothesis, and for the same bound on the number of inaccuracies, learning from multiple texts, some of which may be inaccurate, is equivalent to learning from a single inaccurate text. The general problem of determining when learning from multiple inaccurate texts is a restriction over learning from a single inaccurate text turns out to be combinatorially very complex. Significant partial results are provided for this problem.
Several results are also provided about conditions under which the detrimental effects of multiple texts can be overcome by either allowing more errors in the final hypothesis or by reducing the number of inaccuracies in the texts. It is also shown that the usual hierarchies hold: allowing extra errors in the final program increases learning power, and allowing extra inaccuracies in the texts decreases it. Finally, it is demonstrated that in the context of learning from multiple inaccurate texts, spurious data is better than missing data, which in turn is better than a mixture of spurious and missing data.
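Illustration (not part of the published abstract): the three kinds of inaccuracy named above can be sketched on a finite example. The sketch assumes the usual reading of the terms, in which a noisy text contains the whole graph plus a bounded number of spurious points, an incomplete text omits a bounded number of graph points, and an imperfect text may do both. The function name `classify_text`, the finite window, and the way the bound is counted for imperfect texts are assumptions made for illustration; the paper itself works with infinite texts for computable functions.

```python
from typing import Set, Tuple

Point = Tuple[int, int]  # one (x, f(x)) pair from the graph of a function


def classify_text(content: Set[Point], graph: Set[Point], bound: int) -> str:
    """Label a finite data set relative to a finite piece of a function's graph.

    Hypothetical helper: `bound` caps the number of inaccuracies tolerated.
    """
    spurious = content - graph  # points in the data that are not in the graph
    missing = graph - content   # graph points that never appear in the data
    if not spurious and not missing:
        return "accurate"
    if not missing and len(spurious) <= bound:
        return "noisy"
    if not spurious and len(missing) <= bound:
        return "incomplete"
    if len(spurious) + len(missing) <= bound:
        return "imperfect"
    return "out of bound"


# Finite window of the graph of f(x) = x * x, used only as toy data.
graph = {(x, x * x) for x in range(5)}

noisy = graph | {(1, 7)}                      # one spurious point
incomplete = graph - {(2, 4)}                 # one missing point
imperfect = (graph - {(2, 4)}) | {(1, 7)}     # one of each

print(classify_text(noisy, graph, bound=1))
print(classify_text(incomplete, graph, bound=1))
print(classify_text(imperfect, graph, bound=2))
```

In the multiple-source setting described in the abstract, a learner would receive several such data streams at once, with only some of them guaranteed to stay within the bound.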
Pages: 961-990
Page count: 30