Optimal Data Reduction of Training Data in Machine Learning-Based Modelling: A Multidimensional Bin Packing Approach

Cited by: 11
Authors
Wibbeke, Jelke [1, 2]
Teimourzadeh Baboli, Payam [2]
Rohjans, Sebastian [1]
Affiliations
[1] Jade Univ Appl Sci, Dept Civil Engn Geoinformat & Hlth Technol, D-26121 Oldenburg, Germany
[2] OFFIS Inst Informat Technol, Energy Dept, D-26121 Oldenburg, Germany
Keywords
numerosity reduction; histogram; big data; discretization; neural network; training data; regression; OF-THE-ART
DOI
10.3390/en15093092
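Chinese Library Classification (CLC) number
TE [Petroleum and Natural Gas Industry]; TK [Energy and Power Engineering]
Subject classification code
0807; 0820
Abstract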
Now that complex, IT-controlled systems have found their way into many areas, models and the data on which they are based play an increasingly important role. Because sensor technology makes it ever easier to collect data, extensive data sets are created that need to be mastered. In concrete terms, this means extracting the information required for a specific problem from the data at high quality. In condition monitoring, for example, this includes the relevant system states. Data quality is especially important in machine learning, and several methods already exist to reduce the size of data sets without reducing their information value. This paper presents the multidimensional binned reduction (MdBR) method, an approach that has much lower complexity than existing methods and that addresses regression rather than classification, as most other approaches do. The approach merges discretization approaches with non-parametric numerosity reduction via histograms. MdBR has linear complexity and can be used to reduce large multivariate data sets to smaller subsets, which can then be used for model training. The evaluation, based on a data set from the photovoltaic sector with approximately 92 million samples, trains a multilayer perceptron (MLP) model to estimate the output power of the system. The results show that the approach reduces the number of training samples by more than 99% while also increasing the model's performance. It works best with large data sets of low-dimensional data. Although periodic data often include the most redundant samples and thus offer the best reduction potential, the presented approach can only handle time-invariant data and not sequences of samples, as are often used in time series.
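The idea of histogram-based numerosity reduction described in the abstract can be illustrated with a minimal sketch. This is not the authors' MdBR implementation: the function name `binned_reduction`, the equal-width binning, the `bins_per_dim` parameter, and the choice of the bin mean as the representative sample are all assumptions made for illustration only.

```python
import numpy as np

def binned_reduction(data, bins_per_dim):
    """Hypothetical sketch: keep one representative per occupied histogram bin.

    Assumptions (not taken from the paper): equal-width bins in every
    dimension, and the mean of a bin's samples as its representative.
    """
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    # Guard against constant columns, which would give a zero-width range.
    span = np.where(hi > lo, hi - lo, 1.0)

    # Equal-width bin index per dimension; clip so the maximum value
    # falls into the last bin instead of one past it.
    idx = np.floor((data - lo) / span * bins_per_dim).astype(np.int64)
    idx = np.clip(idx, 0, bins_per_dim - 1)

    # Collapse the per-dimension indices into one flat key per sample.
    dims = (bins_per_dim,) * data.shape[1]
    keys = np.ravel_multi_index(tuple(idx.T), dims)

    # Group samples by bin. np.unique sorts the keys, so this sketch runs
    # in O(n log n); a hash map keyed by bin would keep it linear.
    uniq, inverse, counts = np.unique(
        keys, return_inverse=True, return_counts=True
    )
    sums = np.zeros((uniq.size, data.shape[1]))
    np.add.at(sums, inverse, data)
    return sums / counts[:, None], counts

# Usage: four 2-D samples fall into two occupied bins.
samples = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.9]])
reps, counts = binned_reduction(samples, bins_per_dim=2)
```

Only occupied bins produce a representative, which is why heavily redundant data shrink the most. The returned `counts` could serve as sample weights during training, although whether the original method does so is not stated in the abstract.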
Pages: 13
Related papers
35 in total
[1]  
Alzawaideh B., 2021, P 2021 IEEE MADRID P, P1
[2]   Optimal Temperature-Based Condition Monitoring System for Wind Turbines [J].
Baboli, Payam Teimourzadeh ;
Babazadeh, Davood ;
Raeiszadeh, Amin ;
Horodyvskyy, Susanne ;
Koprek, Isabel .
INFRASTRUCTURES, 2021, 6 (04)
[3]  
Bachem O., 2017, arXiv
[4]   Tool-assisted Surrogate Selection for Simulation Models in Energy Systems [J].
Balduin, Stephan ;
Oest, Frauke ;
Blank-Babazadeh, Marita ;
Niess, Astrid ;
Lehnhoff, Sebastian .
PROCEEDINGS OF THE 2019 FEDERATED CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS (FEDCSIS), 2019, :185-192
[5]  
Barandela R, 2004, LECT NOTES COMPUT SC, V3138, P806
[6]   Machine Learning-Based Condition Monitoring for PV Systems: State of the Art and Future Prospects [J].
Berghout, Tarek ;
Benbouzid, Mohamed ;
Bentrcia, Toufik ;
Ma, Xiandong ;
Djurovic, Sinisa ;
Mouss, Leila-Hayet .
ENERGIES, 2021, 14 (19)
[7]  
Boyd M., 2017, NIST campus photovoltaic (PV) arrays and weather station data sets, DOI 10.18434/M3S67G
[8]   SMOTE: Synthetic minority over-sampling technique [J].
Chawla, Nitesh V. ;
Bowyer, Kevin W. ;
Hall, Lawrence O. ;
Kegelmeyer, W. Philip .
2002, American Association for Artificial Intelligence (16)
[9]   Efficient balanced sampling: The cube method [J].
Deville, JC ;
Tillé, Y .
BIOMETRIKA, 2004, 91 (04) :893-912
[10]  
Dhar S., 2019, ARXIV, DOI 10.1145/3450494