Analysis of Synthetic Data Utilization with Generative Adversarial Network in Flood Classification using K-Nearest Neighbor Algorithm

Times cited: 0
Authors
Afriza, Wahyu [1]
Riasetiawan, Mardhani [1]
Tyas, Dyah Aruming [1]
Affiliations
[1] Gadjah Mada Univ, Dept Comp Sci & Elect, Yogyakarta, Indonesia
Keywords
Classification; rainfall; synthetic data; KNN; GAN
DOI
10.14569/IJACSA.2023.0141270
Chinese Library Classification
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Indonesia has a tropical climate with high rainfall, compounded by uncertain weather and climate conditions. Given this uncertainty, the occurrence of floods, the scarcity of predictive information on flooding, and the limited availability of data on flood causes, this study compares synthetic data generated from the small dataset available from BMKG with synthetic data generated from a dataset on the Kaggle online platform. The BMKG data comprise temperature, humidity, rainfall, and wind speed; the Kaggle data comprise annual rainfall. The research aims to compare synthetic data generated from the different datasets, using the results of a K-Nearest Neighbor (KNN) classification system as the benchmark and a confusion matrix for accuracy evaluation. The study uses climate data from the BMKG DI Yogyakarta Climatology Station covering 20 months, data from the Geophysical Station covering 12 months, and Kerala data spanning 1901-2018. Synthetic data are generated with the Conditional Tabular Generative Adversarial Network (CTGAN) model. CTGAN produces reasonably good data, in terms of distribution and differences from the original data, when the original dataset is large and the amount of synthetic data generated is small. The KNN classification system on the BMKG data overfit, as indicated by evaluation accuracy increasing in the range of 85-94% while validation accuracy decreased in the range of 89%-65%. The authors attribute this to a lack of uniqueness in the data and to too little original data being used to generate the synthetic data, which makes it harder for the classifier to identify CTGAN-generated records that differ considerably in distance and value. For the Kerala data, evaluation accuracy is in the range of 92-95% and validation is in the range of 0.7-0.83%, with Classifier k1 being the best-performing system.
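The evaluation step named in the abstract (KNN classification scored with a confusion matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no code, so the feature values below are invented for illustration, the choice of k=3 is arbitrary, the helper names are hypothetical, and the CTGAN generation step is omitted entirely.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training rows nearest to x (Euclidean distance)."""
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical toy rows: (rainfall, humidity); label 1 = flood, 0 = no flood.
# These numbers are made up and do not come from BMKG or the Kerala dataset.
train_X = [(5, 60), (10, 65), (8, 70), (90, 92), (110, 95), (100, 88)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(7, 62), (95, 90), (12, 68), (105, 93)]
test_y = [0, 1, 0, 1]

preds = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
cm = confusion_matrix(test_y, preds)
print(cm)            # [[2, 0], [0, 2]]
print(accuracy(cm))  # 1.0
```

In a comparison like the one the abstract describes, the same scoring would presumably be run once on data used for fitting and once on held-out data; a widening gap between the two accuracies is the overfitting signal the abstract reports for the BMKG data.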
Pages: 678-683
Number of pages: 6