Dynamic Replication Policy on HDFS Based on Machine Learning Clustering

Cited by: 1
Authors
Ahmed, Motaz A. [1]
Khafagy, Mohamed H. [1]
Shaheen, Masoud E. [1]
Kaseb, Mostafa R. [1]
Affiliations
[1] Fayoum Univ, Fac Comp & Artificial Intelligence, Dept Comp Sci, Al Fayyum 63514, Egypt
Keywords
Feature extraction; File systems; Machine learning; Databases; Big Data; Distributed computing; Support vector machines; Replicability; Availability; big data; clustering; Hadoop distributed file system; high-performance distributed computing; machine learning; reliability; replication policy
DOI
10.1109/ACCESS.2023.3247190
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data has grown rapidly in recent years, giving rise to big data science. Distributed File Systems (DFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), are commonly used to handle big data. A DFS should keep data available and the system reliable in case of failure, and it does so by replicating files in different locations. These replicas consume storage space and other resources. The importance of a file depends on how frequently it is used in the system, so some files, being less important, do not warrant many replicas. This paper introduces a Dynamic Replication Policy using Machine Learning Clustering (DRPMLC) on HDFS, which uses machine learning to cluster files into groups and applies a different replication policy to each group in order to reduce storage consumption, improve read and write operation times, and preserve the availability and reliability of HDFS as a High-Performance Distributed Computing (HPDC) system.
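The idea in the abstract (cluster files, then give each cluster its own replication policy) can be illustrated with a minimal sketch. The abstract does not state which features or which clustering algorithm DRPMLC uses, so everything below is an assumption made for illustration: the single feature (an access count per file), the one-dimensional k-means, the function names `kmeans_1d` and `assign_replication`, the sample paths, and the replication factors 1, 2, and 3.

```python
# Illustrative sketch only. The feature (access count), the 1-D k-means, and the
# cluster-to-replication-factor mapping are assumptions, not the paper's method.

def kmeans_1d(values, k, iterations=20):
    """Cluster scalar values into k groups; returns (labels, centroids)."""
    ordered = sorted(values)
    # Seed centroids at evenly spaced positions of the sorted data (deterministic).
    centroids = [ordered[(len(ordered) - 1) * i // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


def assign_replication(access_counts, factors=(1, 2, 3)):
    """Map each file path to a replication factor based on its access-count cluster."""
    names = list(access_counts)
    values = [access_counts[n] for n in names]
    labels, centroids = kmeans_1d(values, len(factors))
    # Rank clusters from coldest to hottest so hotter clusters get more replicas.
    by_heat = sorted(range(len(factors)), key=lambda c: centroids[c])
    rank = {cluster: position for position, cluster in enumerate(by_heat)}
    return {n: factors[rank[lab]] for n, lab in zip(names, labels)}


# Hypothetical access counts per file path.
sample = {
    "/logs/old.log": 1,
    "/logs/older.log": 2,
    "/data/mid.csv": 40,
    "/data/mid2.csv": 45,
    "/hot/a.parquet": 400,
    "/hot/b.parquet": 420,
}
plan = assign_replication(sample)
```

In this sketch, rarely read files end up with fewer replicas and frequently read files with more. On a real cluster, a plan like this could be applied per path with the standard `hdfs dfs -setrep <numReplicas> <path>` command; whether DRPMLC applies its policies that way is not stated in the abstract.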
Pages: 18551-18559
Page count: 9