Learning from multiple sources of inaccurate data

Cited by: 2

Authors:

Baliga, G

Jain, S

Sharma, A

Affiliations:

[1] NATL UNIV SINGAPORE, DEPT INFORMAT SYST & COMP SCI, SINGAPORE 117548, SINGAPORE

[2] UNIV NEW S WALES, SCH ENGN & COMP SCI, SYDNEY, NSW 2052, AUSTRALIA

Source:

SIAM JOURNAL ON COMPUTING | 1997 / Vol. 26 / No. 4

Keywords:

inductive inference; machine learning; inaccurate data; multiple sources

DOI:

10.1137/S0097539792239461

Chinese Library Classification:

TP301 [Theory and methods]

Discipline classification code:

081202

Abstract:

Most theoretical models of inductive inference make the idealized assumption that the data available to a learner is from a single and accurate source. The subject of inaccuracies in data emanating from a single source has been addressed by several authors. The present paper argues in favor of a more realistic learning model in which data emanates from multiple sources, some or all of which may be inaccurate. Three kinds of inaccuracies are considered: spurious data (modeled as noisy texts), missing data (modeled as incomplete texts), and a mixture of spurious and missing data (modeled as imperfect texts). Motivated by the above argument, the present paper introduces and theoretically analyzes a number of inference criteria in which a learning machine is fed data from multiple sources, some of which may be infected with inaccuracies. The learning situation modeled is the identification in the limit of programs from graphs of computable functions. The main parameters of the investigation are: the kind of inaccuracy, the total number of data sources, the number of faulty data sources which produce data within an acceptable bound, and the bound on the number of errors allowed in the final hypothesis learned by the machine. Sufficient conditions are determined under which, for the same kind of inaccuracy, for the same bound on the number of errors in the final hypothesis, and for the same bound on the number of inaccuracies, learning from multiple texts, some of which may be inaccurate, is equivalent to learning from a single inaccurate text. The general problem of determining when learning from multiple inaccurate texts is a restriction over learning from a single inaccurate text turns out to be combinatorially very complex. Significant partial results are provided for this problem. 
Several results are also provided about conditions under which the detrimental effects of multiple texts can be overcome by either allowing more errors in the final hypothesis or by reducing the number of inaccuracies in the texts. It is also shown that the usual hierarchies resulting from allowing extra errors in the final program (results in increased learning power) and allowing extra inaccuracies in the texts (results in decreased learning power) hold. Finally, it is demonstrated that in the context of learning from multiple inaccurate texts, spurious data is better than missing data, which in turn is better than a mixture of spurious and missing data.
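
The three kinds of inaccuracy named in the abstract can be illustrated on a finite window of a function's graph. The sketch below is not taken from the paper: the names `graph_prefix`, `inaccuracy_counts`, and `classify`, the finite window, and the reading of the bound as a cap on the number of spurious and/or missing pairs are all assumptions made for illustration. The paper itself works with infinite texts and identification in the limit.

```python
from typing import Callable, Set, Tuple

Pair = Tuple[int, int]


def graph_prefix(f: Callable[[int], int], n: int) -> Set[Pair]:
    """Finite window of graph(f): the pairs (x, f(x)) for x in 0..n-1."""
    return {(x, f(x)) for x in range(n)}


def inaccuracy_counts(content: Set[Pair], target: Set[Pair]) -> Tuple[int, int]:
    """Return (spurious, missing) counts of `content` relative to `target`."""
    spurious = len(content - target)  # pairs present that should not be
    missing = len(target - content)   # pairs absent that should be present
    return spurious, missing


def classify(content: Set[Pair], target: Set[Pair], bound: int) -> Set[str]:
    """Hypothetical classifier: which inaccuracy kinds fit within `bound`.

    Assumed reading (not the paper's formal definitions):
      noisy      = nothing missing, at most `bound` spurious pairs
      incomplete = nothing spurious, at most `bound` missing pairs
      imperfect  = spurious + missing pairs together at most `bound`
    """
    spurious, missing = inaccuracy_counts(content, target)
    kinds = set()
    if missing == 0 and spurious <= bound:
        kinds.add("noisy")
    if spurious == 0 and missing <= bound:
        kinds.add("incomplete")
    if spurious + missing <= bound:
        kinds.add("imperfect")
    return kinds


# Example: f(x) = x * x observed on x = 0..4.
target = graph_prefix(lambda x: x * x, 5)
with_spurious = target | {(2, 5)}             # one extra, wrong pair
with_missing = target - {(3, 9)}              # one pair dropped
with_both = (target | {(2, 5)}) - {(3, 9)}    # one of each
```

Under this reading, a content set with one spurious and one missing pair fits only the imperfect category, and only once the bound is at least 2.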

Pages: 961-990

Page count: 30