Fast and simple dataset selection for machine learning

Cited by: 5
Authors
Peter, Timm J. [1 ]
Nelles, Oliver [1 ]
Affiliation
[1] Univ Siegen, Inst Mechan & Regelungstech Mechatron, Dept Maschinenbau, Paul Bonatz Str 9-11, D-57068 Siegen, Germany
Keywords
machine learning; dataset selection; design of experiments; space-filling design; domain adaptation
DOI
10.1515/auto-2019-0010
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper discusses the task of data reduction and proposes a novel selection approach that allows the point distribution of the selected data subset to be controlled. The proposed approach relies on the estimation of probability density functions (pdfs). Owing to its structure, the new method can select a subset either by approximating the pdf of the original dataset or by approximating an arbitrary, desired target pdf. The new strategy evaluates the estimated pdfs solely on the selected data points, resulting in a simple and efficient algorithm with low computational and memory demands. The performance of the new approach is investigated in two scenarios. For representative subset selection from a dataset, the new approach is compared with a recently proposed, more complex method and shows comparable results. To demonstrate the capability of matching a target pdf, a uniform distribution is chosen as an example. Here the new method is compared with strategies for space-filling design of experiments and shows convincing results.
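The idea of steering a subset toward a target pdf can be illustrated with a small sketch. This is not the algorithm from the paper, whose details the abstract does not give; it is a simple greedy heuristic written under stated assumptions. The names `kernel` and `select_subset`, the Gaussian kernel, the fixed `bandwidth`, and the choice to evaluate the subset density at every candidate point are all assumptions made for illustration.

```python
import math
import random


def kernel(a, b, bandwidth):
    """Unnormalized Gaussian kernel between two points (sequences of floats)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq_dist / bandwidth ** 2)


def select_subset(data, n_select, target_pdf, bandwidth=0.3, seed=0):
    """Greedily pick n_select indices so the subset density tracks target_pdf.

    Illustrative heuristic only (an assumption, not the paper's method):
    each step adds the candidate where the kernel density of the current
    subset is lowest relative to the target density.
    """
    n = len(data)
    if not 1 <= n_select <= n:
        raise ValueError("n_select must be between 1 and len(data)")
    rng = random.Random(seed)
    # Target density at every candidate; must be strictly positive.
    target = [target_pdf(x) for x in data]
    first = rng.randrange(n)
    selected = [first]
    chosen = {first}
    # Unnormalized kernel density of the current subset at each candidate.
    density = [kernel(x, data[first], bandwidth) for x in data]
    while len(selected) < n_select:
        best = min(
            (i for i in range(n) if i not in chosen),
            key=lambda i: density[i] / target[i],
        )
        selected.append(best)
        chosen.add(best)
        # Incremental update: add only the new point's kernel contribution.
        for i in range(n):
            density[i] += kernel(data[i], data[best], bandwidth)
    return selected


# Usage: draw clustered 2-D data and select 30 points toward a uniform target.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(300)]
subset = select_subset(data, 30, target_pdf=lambda x: 1.0)
```

With a constant `target_pdf`, the heuristic favors sparsely covered regions, which mirrors the uniform-target scenario mentioned in the abstract. Passing a density estimate of the original data as `target_pdf` would correspond to the representative-subset scenario.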
Pages: 833-842
Number of pages: 10
Related papers
50 in total
  • [21] Dataset for machine learning of microstructures for 9% Cr steels
    Rozman, Kyle A.
    Dogan, Omer N.
    Chinn, Richard
    Jablonski, Paul D.
    Detrois, Martin
    Gao, Michael C.
    DATA IN BRIEF, 2022, 45
  • [22] HelmetML: A dataset of helmet images for machine learning applications
    Patil, Kailas
    Jadhav, Rohini
    Suryawanshi, Yogesh
    Chumchu, Prawit
    Khare, Gaurav
    Shinde, Tanishk
    DATA IN BRIEF, 2024, 56
  • [23] Runtime Data Layout Scheduling for Machine Learning Dataset
    You, Yang
    Demmel, James
    2017 46TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP), 2017, : 452 - 461
  • [24] Quantifying Dataset Quality in Radio Frequency Machine Learning
    Clark, William H.
    Michaels, Alan J.
    2021 IEEE MILITARY COMMUNICATIONS CONFERENCE (MILCOM 2021), 2021,
  • [25] MQTTset, a New Dataset for Machine Learning Techniques on MQTT
    Vaccari, Ivan
    Chiola, Giovanni
    Aiello, Maurizio
    Mongelli, Maurizio
    Cambiaso, Enrico
    SENSORS, 2020, 20 (22) : 1 - 17
  • [26] Empirical Analysis on Cancer Dataset with Machine Learning Algorithms
    Vital, T. PanduRanga
    Krishna, M. Murali
    Narayana, G. V. L.
    Suneel, P.
    Ramarao, P.
    SOFT COMPUTING IN DATA ANALYTICS, SCDA 2018, 2019, 758 : 789 - 801
  • [27] A dataset of attributes from papers of a machine learning conference
    Vallejo-Huanga, Diego
    Morillo, Paulina
    Ferri, Cesar
    DATA IN BRIEF, 2019, 24
  • [28] A machine learning dataset for FRB detection in raw data
    Xu, ZhiJun
    An, Tao
    Guo, ShaoGuang
    Lao, BaoQiang
    Lv, WeiJia
    Wu, XiaoCong
    SCIENTIA SINICA-PHYSICA MECHANICA & ASTRONOMICA, 2023, 53 (02)
  • [29] Dry fruit image dataset for machine learning applications
    Meshram, Vishal
    Choudhary, Chetan
    Kale, Atharva
    Rajput, Jaideep
    Meshram, Vidula
    Dhumane, Amol
    DATA IN BRIEF, 2023, 49
  • [30] ModelSet: A labelled dataset of software models for machine learning
    Lopez, Jose Antonio Hernandez
    Izquierdo, Javier Luis Canovas
    Cuadrado, Jesus Sanchez
    SCIENCE OF COMPUTER PROGRAMMING, 2024, 231