Performance Study on Indexing and Accessing of Small File in Hadoop Distributed File System

Cited by: 2
Authors
Rodrigues, Anisha P. [1]
Fernandes, Roshan [1]
Vijaya, P. [2]
Chander, Satish [3]
Affiliations
[1] NMAM Inst Technol, Dept Comp Sci & Engn, Nitte, India
[2] Modern Coll Business & Sci, Dept Math & Comp Sci, Bowshar, Oman
[3] Birla Inst Technol, Dept Comp Sci & Engn, Ranchi, Bihar, India
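Keywords
Hadoop Distributed File System; MapReduce; Hadoop Archive; CombineFileInputFormat; sequence file; big data
DOI
10.1142/S0219649221500519
Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501
Abstract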
The Hadoop Distributed File System (HDFS) was developed to efficiently store and handle a vast quantity of files in a distributed environment over a cluster of computers. A Hadoop cluster is built from commodity hardware, which is inexpensive and easily available. Storing a large number of small files in HDFS consumes more memory and degrades performance, because small files place a heavy load on the NameNode. Several techniques improve the efficiency of indexing and accessing small files on HDFS: archive files, New Hadoop Archive (New HAR), CombineFileInputFormat (CFIF), and sequence file generation. The archive file combines the small files into single blocks. The New HAR file combines the smaller files into a single large file. The CFIF module merges multiple files into a single split using the NameNode, and the sequence file combines all the small files into a single sequence. The indexing and accessing of small files in HDFS are evaluated using performance metrics such as processing time and memory usage. The experiments show that the sequence file generation approach is more efficient than the other approaches, with a file access time of 1.5 s, memory usage of 20 KB in the multi-node setup, and a processing time of 0.1 s.
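The idea behind merging small files into one sequence with an index can be illustrated with a minimal Python sketch. It is a simplified in-memory analogue only: it does not use Hadoop, it does not reproduce the SequenceFile on-disk format, and it is not the authors' implementation. The record layout and the function names `pack_small_files` and `read_small_file` are assumptions made for illustration.

```python
import io
import struct


def pack_small_files(files):
    """Concatenate small files into one blob of length-prefixed records.

    `files` maps a file name to its bytes. Returns (blob, index), where
    index maps each name to the byte offset of its record in the blob.
    """
    buf = io.BytesIO()
    index = {}
    for name, data in files.items():
        key = name.encode("utf-8")
        index[name] = buf.tell()
        # Hypothetical record layout: key length, value length, key, value.
        buf.write(struct.pack(">II", len(key), len(data)))
        buf.write(key)
        buf.write(data)
    return buf.getvalue(), index


def read_small_file(blob, index, name):
    """Fetch one file's bytes by seeking to its record via the index."""
    offset = index[name]
    key_len, val_len = struct.unpack_from(">II", blob, offset)
    start = offset + 8 + key_len  # skip the 8-byte header and the key
    return blob[start:start + val_len]


small_files = {"a.txt": b"alpha", "b.txt": b"", "c.log": b"gamma-gamma"}
blob, index = pack_small_files(small_files)
```

In this sketch the storage layer tracks one large object instead of one entry per small file, and the index allows a single file to be read without scanning the whole container.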
Pages: 21