A Micropartitioning Technique for Massive Data Analysis Using MapReduce

Cited by: 0

Authors:

Mohanapriya, S. [1]

Natesan, P. [1]

Affiliations:

[1] Kongu Engn Coll, Dept CSE, Perundurai 638052, Erode, India

Source:

2014 INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND EMBEDDED SYSTEMS (ICICES) | 2014

Keywords:

Hadoop; MapReduce; TeraSort; Partitioning; Skew; Straggler;

DOI:

Not available

Chinese Library Classification:

TM [Electrical Engineering]; TN [Electronics and Communication Technology];

Subject classification codes:

0808 ; 0809 ;

Abstract:

Over the past years, large amounts of structured and unstructured data have been collected from various sources. Data at this scale is difficult for a single machine to handle, so the work must be distributed across a large number of computers. Hadoop is one such distributed framework; it processes data in a distributed manner using the MapReduce programming model. For MapReduce to work, it has to divide the workload across the machines in the cluster. The performance of MapReduce depends on how evenly it distributes the workload across the machines, without skew, and on whether it avoids running jobs on a poorly performing node, called a straggler. The workload distribution depends on the algorithm that partitions the data. To overcome the problem of skew, an efficient partitioning technique is proposed. The proposed algorithm improves load balancing and reduces memory requirements. Slow-running nodes degrade the performance of a MapReduce job. To overcome this problem, a technique called micropartitioning is used: it divides the work into small tasks, more numerous than the reducers, and assigns them to the reducers. Running many small tasks lessens the impact of stragglers, since the work scheduled on a slow node is only a small task, and the remaining tasks can be picked up by other idle workers.
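The effect of micropartitioning on stragglers described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' algorithm: the function `makespan`, the worker speeds, and the task counts are all hypothetical, tasks are assumed to be equal in size (no skew), per-task scheduling overhead is ignored, and idle workers are assumed to pull the next pending task.

```python
import heapq


def makespan(total_work, num_tasks, worker_speeds):
    """Return the finish time of the last task under pull-based scheduling.

    total_work is split into num_tasks equal tasks (an assumption: no skew).
    worker_speeds[i] is the work units per second handled by worker i.
    Whenever a worker becomes idle, it takes the next pending task.
    """
    task_size = total_work / num_tasks
    # Min-heap of (time at which the worker becomes idle, worker index).
    idle_at = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(idle_at)
    finish = 0.0
    for _ in range(num_tasks):
        start, worker = heapq.heappop(idle_at)
        done = start + task_size / worker_speeds[worker]
        finish = max(finish, done)
        heapq.heappush(idle_at, (done, worker))
    return finish


# Hypothetical cluster: three normal workers and one straggler at 1/4 speed.
speeds = [1.0, 1.0, 1.0, 0.25]

coarse = makespan(400.0, 4, speeds)   # one partition per reducer
micro = makespan(400.0, 64, speeds)   # 16 micropartitions per reducer

print(f"coarse partitions: {coarse:.1f} s")
print(f"micropartitions:   {micro:.1f} s")
```

With these assumed speeds, the coarse split finishes at 400 s because the straggler holds a full quarter of the work, while the micropartitioned split finishes at 125 s because the straggler only ever holds one small task at a time and the faster workers absorb the rest. In a real cluster, very small tasks add scheduling and startup overhead that this sketch does not model.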


Pages: 5

