Fast and simple dataset selection for machine learning

Cited by: 5
Authors
Peter, Timm J. [1 ]
Nelles, Oliver [1 ]
Affiliation
[1] Univ Siegen, Inst Mechan & Regelungstech Mechatron, Dept Maschinenbau, Paul Bonatz Str 9-11, D-57068 Siegen, Germany
Keywords
machine learning; dataset selection; design of experiments; space-filling design; domain adaptation
DOI
10.1515/auto-2019-0010
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The task of data reduction is discussed, and a novel selection approach is proposed that makes it possible to control the optimal point distribution of the selected data subset. The proposed approach is based on the estimation of probability density functions (pdfs). Owing to its structure, the new method can select a subset either by approximating the pdf of the original dataset or by approximating an arbitrary, desired target pdf. The new strategy evaluates the estimated pdfs solely at the selected data points, which yields a simple and efficient algorithm with low computational and memory demand. The performance of the new approach is investigated in two scenarios. For representative subset selection from a dataset, the new approach is compared with a recently proposed, more complex method and shows comparable results. To demonstrate the capability of matching a target pdf, a uniform distribution is chosen as an example. Here the new method is compared with strategies for space-filling design of experiments and shows convincing results.
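The idea of selecting points so that the subset's estimated pdf follows a target pdf can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify in detail. It is a simplified one-dimensional greedy variant under stated assumptions: a Gaussian kernel density estimate, a fixed bandwidth `h`, and a rule that always picks the candidate where the subset density most undershoots the target. The names `gaussian_kernel`, `kde` and `select_subset` are hypothetical.

```python
import math


def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h, evaluated at distance u."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def kde(points, x, h):
    """Kernel density estimate built from `points`, evaluated at x."""
    return sum(gaussian_kernel(x - p, h) for p in points) / len(points)


def select_subset(data, k, h, target_pdf=None):
    """Greedily pick k indices of `data` (assumed rule, not from the paper).

    target_pdf=None  -> target is the KDE of the full dataset
                        (representative subset).
    target_pdf=f     -> target is the callable density f, e.g. a uniform
                        pdf for a space-filling subset.
    """
    if target_pdf is None:
        target = [kde(data, x, h) for x in data]
    else:
        target = [target_pdf(x) for x in data]

    chosen = []
    remaining = list(range(len(data)))
    # Unnormalized kernel sum of the chosen points at every data point.
    kernel_sum = [0.0] * len(data)

    for _ in range(k):
        n = max(len(chosen), 1)
        # Candidate where the subset density falls furthest below the target.
        best = max(remaining, key=lambda i: target[i] - kernel_sum[i] / n)
        chosen.append(best)
        remaining.remove(best)
        # Incremental update: add the new point's kernel contribution.
        for i in range(len(data)):
            kernel_sum[i] += gaussian_kernel(data[i] - data[best], h)
    return chosen


if __name__ == "__main__":
    # Four clustered points and one outlier; a uniform target on [0, 1].
    data = [0.0, 0.01, 0.02, 0.03, 1.0]
    print(select_subset(data, 2, 0.1, target_pdf=lambda x: 1.0))
```

Passing a constant density as `target_pdf` corresponds to the uniform-target scenario mentioned in the abstract, while omitting it corresponds to representative subset selection. The bandwidth and the selection rule would need tuning for real data.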
Pages: 833-842
Page count: 10