Data-Centric Solutions for Addressing Big Data Veracity with Class Imbalance, High Dimensionality, and Class Overlapping

Cited by: 1
Authors
Bolivar, Armando [1]
Garcia, Vicente [2]
Alejo, Roberto [3]
Florencia-Juarez, Rogelio [2]
Sanchez, J. Salvador [4]
Affiliations
[1] Univ Autonoma Ciudad Juarez, Inst Ingn & Tecnol, Av Charro 450 NTE, Ciudad Juarez 32310, Chihuahua, Mexico
[2] Univ Autonoma Ciudad Juarez, Div Multidisciplinaria Ciudad Univ, Av Jose de Jesus Delgado 18100, Ciudad Juarez 32579, Chihuahua, Mexico
[3] Inst Tecnol Toluca, Tecnol Nacl Mexico, Div Postgrad Studies & Res, Av Tecnol S-N, Metepec 52149, Estado De Mexic, Mexico
[4] Univ Jaume 1, Inst New Imaging Technol, Dept Comp Languages & Syst, Av Vicent Sos Baynat S-N, Castellon De La Plana 12071, Spain
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 13
Keywords
big data; class imbalance; high dimensionality; fractional norms; dissimilarity representation; MACHINE-LEARNING ALGORITHMS; SMOTE; DESIGN
DOI
10.3390/app14135845
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Machine learning algorithms offer organizations an innovative way to obtain value from their large datasets, guide future strategic actions, and improve their initiatives. This has led to a rapidly growing application of machine learning algorithms, with a predominant focus on building these models and improving their performance. However, this model-centric approach ignores the fact that data quality is crucial for building robust and accurate models. Several dataset issues, such as class imbalance, high dimensionality, and class overlapping, degrade data quality and introduce bias into machine learning models. Therefore, adopting a data-centric approach is essential to constructing better datasets and producing effective models. Besides data issues, Big Data imposes new challenges, such as the scalability of algorithms. This paper proposes a scalable hybrid approach that jointly addresses class imbalance, high dimensionality, and class overlapping in Big Data domains. The proposal is based on well-known data-level solutions whose main operation is finding the nearest neighbor using the Euclidean distance as a similarity metric. However, these strategies may lose their effectiveness on high-dimensional datasets. Hence, data quality is improved by combining a data transformation approach based on fractional norms with SMOTE to obtain a balanced and reduced dataset. Experiments carried out on nine two-class, imbalanced, high-dimensional large datasets showed that our scalable methodology, implemented in Spark, outperforms the traditional approach.
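As a rough illustration of the ideas named in the abstract, the sketch below shows a fractional-norm distance (a Minkowski-style measure with exponent p < 1), a dissimilarity representation built from it, and a SMOTE-style interpolation between a minority sample and one of its nearest neighbors. It is a minimal single-machine sketch in plain Python, not the authors' Spark implementation. The function names, the default p = 0.5, the choice of prototypes, and the way the pieces are combined are all assumptions made for illustration; the abstract does not specify them.

```python
import random


def minkowski_distance(a, b, p):
    """Minkowski-style dissimilarity; p < 1 gives a so-called fractional norm.

    For p < 1 this is not a true metric (the triangle inequality can fail).
    """
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def dissimilarity_representation(samples, prototypes, p=0.5):
    """Map each sample to its vector of distances to a set of prototypes.

    Assumption: how prototypes are chosen is left to the caller.
    """
    return [[minkowski_distance(s, proto, p) for proto in prototypes]
            for s in samples]


def nearest_neighbors(point, candidates, k, p):
    """Return the k candidates closest to `point`, excluding `point` itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: minkowski_distance(point, c, p))[:k]


def smote_like_oversample(minority, n_new, k=3, p=0.5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Hypothetical helper: neighbors are ranked with the fractional norm
    instead of the Euclidean distance.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(base, minority, k, p))
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic
```

With p = 2 the distance reduces to the Euclidean distance, so the same functions can be used to compare the fractional and traditional settings on a small sample before moving to a distributed implementation.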
Pages: 15
Related papers
48 in total
  • [1] Aggarwal CC, 2001, LECT NOTES COMPUT SC, V1973, P420
  • [2] Big data resolving using Apache Spark for load forecasting and demand response in smart grid: a case study of Low Carbon London Project
    Ali, Hussien Ali El-Sayed
    Alham, M. H.
    Ibrahim, Doaa Khalil
    [J]. JOURNAL OF BIG DATA, 2024, 11 (01)
  • [3] Trivial State Fuzzy Processing for Error Reduction in Healthcare Big Data Analysis towards Precision Diagnosis
    Anjum, Mohd
    Min, Hong
    Ahmed, Zubair
    [J]. BIOENGINEERING-BASEL, 2024, 11 (06)
  • [4] [Anonymous], Google Programas de Educacion Superior de Google Cloud
  • [5] SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
    Basgall, Maria Jose
    Hasperue, Waldo
    Naiouf, Marcelo
    Fernandez, Alberto
    Herrera, Francisco
    [J]. JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY, 2018, 18 (03): 203-209
  • [6] A Survey of Predictive Modeling on Imbalanced Domains
    Branco, Paula
    Torgo, Luis
    Ribeiro, Rita P.
    [J]. ACM COMPUTING SURVEYS, 2016, 49 (02)
  • [7] Reducing Data Complexity Using Autoencoders With Class-Informed Loss Functions
    Charte, David
    Charte, Francisco
    Herrera, Francisco
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12): 9549-9560
  • [8] SMOTE: Synthetic minority over-sampling technique
    Chawla, Nitesh V.
    Bowyer, Kevin W.
    Hall, Lawrence O.
    Kegelmeyer, W. Philip
    [J]. 2002, American Association for Artificial Intelligence (16)
  • [9] Fast mining of massive tabular data via approximate distance computations
    Cormode, G
    Indyk, P
    Koudas, N
    Muthukrishnan, S
    [J]. 18TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2002: 605-614
  • [10] The dissimilarity approach: a review
    Costa, Yandre M. G.
    Bertolini, Diego
    Britto, Alceu S., Jr.
    Cavalcanti, George D. C.
    Oliveira, Luiz E. S.
    [J]. ARTIFICIAL INTELLIGENCE REVIEW, 2020, 53 (04): 2783-2808