| Keywords | 
        
            | Distributed computing; MapReduce; Hadoop; Load balancing; HaSim | 
        
| I. INTRODUCTION |
        
| The Hadoop framework [3] [6] [7] is designed to process large scale data in a distributed computing environment based on the MapReduce computing model [4] [5]. The framework facilitates the development of distributed computing applications through the interactions among three main components: HDFS, Map instances (mappers) and Reduce instances (reducers). Although the overall structure of the Hadoop framework simplifies data processing, these components hide a large number of complex low-level hardware and software details in the background. The framework also supplies a simple FIFO (First In First Out) job scheduler which serves jobs in the order of their submission. The sequential scheduler eases job management to some extent and is sometimes efficient in dealing with the job queue. However, a number of important heterogeneity factors have not yet been considered by the job scheduler. It is also well known that, at present, many clusters run in highly homogeneous environments. For example, Hadoop at Yahoo [1] employs a homogeneous cluster with 4000 processors, 2TB of RAM and 1.5PB of storage capacity. A large number of benchmarks and sorting competitions have been run on this environment, and the results have been published to show the power of the Hadoop framework. In these distributed computing tasks the focus is on extreme performance in a homogeneous environment, which ideally avoids unbalanced load issues. Behind these highlighted results, therefore, lies a considerable load balancing issue that has been hidden by the homogeneous environments. In practice, it is extremely hard to build a homogeneous cluster with several thousand nodes, so Hadoop clusters with heterogeneous nodes are quite common. The architecture of the Hadoop framework is flexible enough to adapt to heterogeneous resources, but the heterogeneity of the resources clearly affects the performance of the cluster. Devaraj Das [2], the engineering manager of the Yahoo Bangalore Grid Computing Group, summarizes the load issue from four aspects: imbalance in input splits, imbalance in computations, imbalance in partition sizes, and imbalance in heterogeneous hardware. |
        
| In this paper, we present a load balancing algorithm for heterogeneous MapReduce environments which considers the interactions of a number of factors, including Hadoop parameters. We designed and implemented the algorithm using the Hadoop simulator HaSim, evaluated its performance, and compared it with other load balancing strategies for MapReduce. The results show a significant improvement in the efficiency of the simulated cluster. |
        
            | II. REVIEW OF EXISTING WORKS | 
        
| At present there is little research focusing on load balancing for MapReduce. One study by Sven Groot [8] pointed out that, due to the overheads of data copying, network transfers, and local hard disk reading and writing, a single mapper may limit the job execution time. To show the impact of the unbalanced load issue, the author used Jumbo, based on the Google File System [9]. In his scenario he implemented two algorithms: a simple one ('word count' [3]) and a complex one (the parallel FP-Growth frequent itemset mining algorithm [10]). The experimental results show that slower nodes delay the processing, so that faster nodes are not fully utilized. Based on the results, the author claimed that both mappers and reducers impact the performance of the Hadoop framework. However, that work only reports a number of experiments without offering a solution to the unbalanced load issue. Secondly, the impact attributed to reducers needs to be qualified. For theoretical algorithm experiments, multiple reducers may be employed for efficiency. For a practical algorithm, in contrast, the reducer is normally used to collect the final output, which should be treated as a whole data set without segmentation. For data integrity a single reducer is therefore preferable to multiple reducers, which would require another job to merge the partial outputs of the different reducers into a whole data set. |
        
| It is therefore argued that, in data processing, the load issues among multiple mappers are more critical. Another group of researchers also realized the importance of the load balancing issue in Hadoop. Sadasivam et al. [11] tried to optimize the performance of the Hadoop cluster and proposed an approach called Parallel Hybrid PSO-GA using MapReduce, based on a genetic algorithm. In their approach, the Hadoop framework itself is used to run the genetic algorithm [12], which aims to solve the unbalanced load issue in Hadoop. Their algorithm mainly aims at computing an optimized schedule for multiple users based on the different resource capacities. During the processing they limit the number of iterations to a maximum of 30 to guarantee efficiency. Their results show that the PSO-GA algorithm outperforms Max MIPS, a typical PSO and a typical GA. However, several points in their design and implementation can be criticized. The first is that the overhead of Hadoop is considerable: when the framework is used to compute a Hadoop job schedule for Hadoop itself, the overhead of subsequent jobs may be reduced, but the overhead of the schedule computation itself cannot be avoided. The second is that their design only considers the capacity of a resource in terms of processor utilization. Such a simple model lacks the accuracy needed to describe a real Hadoop system; as studied in [13], a number of factors may impact the performance of the framework, including processing features, IO features and Hadoop working mechanisms. The fitness function in Parallel Hybrid PSO-GA, based purely on processor utilization, therefore cannot compute the schedule accurately. The third is that they consider balancing the load among multiple users but not the load among the mappers of a single job. The resulting unbalanced load leaves a certain number of mappers underutilized, which delays individual jobs and in turn affects the total number of jobs completed. |
        
| III. HADOOP SIMULATOR - HASIM |
        
| The Hadoop simulator HaSim follows a master-slave model in its design. Parameters related to a simulated cluster include the number of Hadoop nodes, the topology of these nodes (currently only simple racks are supported), the number of mappers and reducers, and the CPU speed, memory size, average hard disk reading and writing speeds and network bandwidth of each node. HaSim supports one processor per node and each processor can have one or more processor cores. |
        
| The values of some parameters, such as CPU speed and the reading and writing speeds of the hard disk, can be assigned based on measurements from real-world experiments. The NumberOfReducers parameter specifies the number of reduce instances. Figure 1 shows the software architecture of HaSim. |
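| To make the simulator configuration concrete, the sketch below shows one possible way of grouping the parameters listed above into a node and cluster description. It is a minimal illustration in Python; HaSim's actual classes and parameter names (apart from NumberOfReducers) are not given in this paper, so the identifiers used here are assumptions. |

```python
# Hypothetical, illustrative grouping of the simulated-cluster parameters
# described in the text. Not HaSim's real configuration API.
from dataclasses import dataclass
from typing import List

@dataclass
class SimulatedNode:
    cpu_speed_mb_s: float     # processing speed of the single processor (MB/s)
    cores: int                # one or more cores per processor
    memory_mb: int            # memory size
    disk_read_mb_s: float     # average hard disk reading speed (MB/s)
    disk_write_mb_s: float    # average hard disk writing speed (MB/s)
    bandwidth_mb_s: float     # network bandwidth (MB/s)

@dataclass
class SimulatedCluster:
    nodes: List[SimulatedNode]   # nodes, grouped into simple racks
    number_of_mappers: int       # map instances
    number_of_reducers: int      # the NumberOfReducers parameter

# Example: a small heterogeneous cluster; values would normally be taken
# from real-world measurements.
cluster = SimulatedCluster(
    nodes=[
        SimulatedNode(60.0, 2, 4096, 80.0, 60.0, 100.0),
        SimulatedNode(30.0, 1, 2048, 70.0, 50.0, 100.0),
    ],
    number_of_mappers=2,
    number_of_reducers=1,
)
```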
        
| The validation is based on published benchmark results [14] using the Grep, Selection and UDF Aggregation algorithms. Figures 2 and 3 show that the simulator can simulate the framework with stable and accurate performance. |
        
            | IV. ALGORITHM DESIGN | 
        
| To solve an optimization problem with a genetic algorithm, candidate solutions need to be represented as chromosomes encoded as strings, which are normally binary strings. However, a binary representation is not feasible here because the number of mappers in a Hadoop cluster is normally large, which would result in very long binary strings. Instead, a decimal string is employed to represent a chromosome, in which the size of the data chunk assigned to a mapper is represented as a gene. In Hadoop, the total time $T$ of a mapper in processing a data chunk consists of the following four parts: |

| Data copying time ($T_c$) in copying a data chunk from the Hadoop distributed file system to the local hard disk. |

| Processor running time ($T_p$) in processing a data chunk. |

| Intermediate data merging time ($T_m$) in combining the output files of the mapper into one file. |

| Buffer spilling time ($T_b$) in emptying filled buffers. $$T = T_c + T_p + T_m + T_b$$ |
        
| Let |

| $D_m$ be the size of the data chunk. |

| $H_d$ be the writing speed of the hard disk in MB/second. |

| $B_w$ be the network bandwidth in MB/second. |

| $P_r$ be the speed of the processor running the mapper process in MB/second. |

| $B_f$ be the size of the buffer of the mapper. |

| $R_a$ be the ratio of the size of the intermediate data to the size of the data chunk. |

| $N_f$ be the number of times the intermediate data is merged. |

| $N_b$ be the number of times that the buffer is filled up. |

| $V_b$ be the volume of data processed by the processor when the buffer is filled up. |

| $S$ be the sort factor of Hadoop. |
        
| Therefore $T_c = D_m / \min(H_d, B_w)$. |

| Here $T_c$ depends on the available hard disk and network bandwidth resources; the slower of the two is the bottleneck in copying data chunks from the Hadoop distributed file system to the local hard disk of the mapper. The processor running time is $T_p = D_m / P_r$. While a buffer is filling, the processor keeps writing intermediate data into the buffer and, at the same time, the spilling process keeps writing the sorted data from the buffer to the hard disk. Therefore the filling speed of a buffer can be represented by |
        
| $P_r \times R_a - H_d$. Thus the time to fill up a buffer is $B_f / (P_r \times R_a - H_d)$. As a result, for a buffer to be filled up, the processor generates a volume of intermediate data of size $V_b$, which can be computed using the equation below: |

| $$V_b = \frac{P_r \times R_a \times B_f}{P_r \times R_a - H_d}$$ |
        
| The total amount of intermediate data generated from the original data chunk of size $D_m$ is $D_m \times R_a$. Therefore the number of times the buffer is filled up is $N_b = (D_m \times R_a) / V_b$. The time for the buffer to be spilled once is $B_f / H_d$, so the time for the buffer to be spilled $N_b$ times is $N_b \times B_f / H_d$. Then we have |

| $$T_b = \frac{N_b \times B_f}{H_d}$$ |

| The number of times the intermediate data is merged, $N_f$, can be computed using the equation $N_f = N_b / S - 1$. |
        
| When merging occurs once, the whole volume of intermediate data is written to the hard disk, causing an overhead of $D_m \times R_a / H_d$. |

| Thus, if merging occurs $N_f$ times, the time consumed by hard disk IO operations can be represented by $D_m \times R_a \times N_f / H_d$, and we have |

| $$T_m = \frac{D_m \times R_a \times N_f}{H_d}$$ |
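| The cost model derived above can be transcribed directly into code. The sketch below mirrors the equations for $T_c$, $T_p$, $T_b$ and $T_m$ using the same symbols; the numerical values at the end are arbitrary illustrative inputs, not measurements from the paper, and the model assumes $P_r \times R_a > H_d$ so that the buffer can fill up. |

```python
# Direct transcription of the per-mapper cost model derived above.
# Variable names mirror the symbols in the text.

def mapper_time(Dm, Hd, Bw, Pr, Bf, Ra, S):
    """Total time T = Tc + Tp + Tm + Tb for one mapper processing a chunk of Dm MB."""
    Tc = Dm / min(Hd, Bw)               # copy chunk from HDFS to local disk
    Tp = Dm / Pr                        # processor time for the chunk
    # Intermediate data is produced at Pr*Ra and spilled at Hd, so the buffer
    # only fills up when Pr*Ra > Hd (assumed here, as in the text).
    Vb = Pr * Ra * Bf / (Pr * Ra - Hd)  # intermediate data produced per buffer fill
    Nb = Dm * Ra / Vb                   # number of times the buffer fills up
    Tb = Nb * Bf / Hd                   # total buffer spilling time
    Nf = Nb / S - 1                     # number of merge passes
    Tm = Dm * Ra * Nf / Hd              # merging time (one full rewrite per pass)
    return Tc + Tp + Tm + Tb

# Arbitrary illustrative values: 512 MB chunk, 60 MB/s disk write, 100 MB/s
# network, 100 MB/s processor, 10 MB buffer, Ra = 0.8, sort factor 5.
print(mapper_time(Dm=512, Hd=60, Bw=100, Pr=100, Bf=10, Ra=0.8, S=5))
```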
        
| The total time $T_{total}$ to process the data chunks in one processing wave in Hadoop MapReduce is the maximum time consumed by the $k$ participating mappers, where $T_{total} = \max(T_1, T_2, T_3, \ldots, T_k)$. According to divisible load theory [15], to achieve a minimum $T_{total}$ all the mappers are expected to complete data processing at the same time: |

| $T_1 = T_2 = T_3 = \ldots = T_k$. Let |

| $T_i$ be the processing time of the $i$th mapper, and |

| $\bar{T}$ be the average time of the $k$ mappers in data processing: $\bar{T} = \sum T_i / k$. |
        
| The fitness function can be defined using the equation: |
        
            |  | 
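| The scanned fitness equation is not reproduced above. Purely as an illustration of the stated goal, namely that all $k$ mappers should finish as close to the average time $\bar{T}$ as possible, the sketch below assumes a fitness that is the inverse of the total deviation of the per-mapper times from their mean, reusing the mapper_time function from the earlier sketch. This is an assumed form, not the paper's exact formula. |

```python
# Assumed fitness: rewards chromosomes whose mappers finish close to the
# mean time T̄; not necessarily the exact formula used in the paper.

def fitness(chunk_sizes, node_params):
    """chunk_sizes: decimal chromosome, one gene (chunk size in MB) per mapper.
    node_params: list of (Hd, Bw, Pr, Bf, Ra, S) tuples, one per mapper's node."""
    times = [mapper_time(Dm, *p) for Dm, p in zip(chunk_sizes, node_params)]
    t_bar = sum(times) / len(times)                 # T̄ = ΣTi / k
    deviation = sum(abs(t - t_bar) for t in times)  # total deviation from T̄
    return 1.0 / (1.0 + deviation)                  # larger when mappers finish together
```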
        
| V. SIMULATED RESULTS |
        
            | The heterogeneity of the cluster is defined using the equation: | 
        
            |  | 
        
| where $P$ represents the total processing speed of the cluster, |

| $P_i$ represents the processing speed of the $i$th processor, |

| $\bar{p}$ represents the average processing speed of the cluster, and |

| $k$ represents the number of processors employed in the cluster. |
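| The scanned heterogeneity equation is likewise not reproduced. As a hedged illustration consistent with the symbols defined above, the sketch below measures heterogeneity as the total deviation of the processor speeds from their mean, normalised by the mean, so that a homogeneous cluster scores 0; the paper's exact formula may differ. |

```python
# Assumed heterogeneity measure (the paper's exact equation is not shown):
# total deviation of processor speeds from the mean, normalised by the mean.

def heterogeneity(speeds):
    k = len(speeds)              # number of processors in the cluster
    p_bar = sum(speeds) / k      # average processing speed
    return sum(abs(p - p_bar) for p in speeds) / p_bar

print(heterogeneity([50, 50, 50, 50]))   # 0.0  -> homogeneous cluster
print(heterogeneity([20, 40, 60, 120]))  # 2.0  -> heterogeneous cluster
```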
        
| Table 1 shows the simulated environment in detail, and Figures 4, 5 and 6 show the simulation results. |
        
| Processing speeds of processors: depending on heterogeneity |

| Heterogeneity: from 0 to 2.28 |

| Number of hard disks in each node: 1 |

| Number of Map instances: 2 |

| Number of Reduce instances: 1 |
        
| Firstly, 10GB of data was tested in the simulated cluster with different levels of heterogeneity. The benchmark application is MR-LSI [16], a MapReduce based distributed LSI algorithm. From Figure 4 it can be observed that when the level of heterogeneity is less than 1.08, which indicates a nearly homogeneous environment, the load balancing scheme does not make any difference to the performance of MR-LSI. However, the load balancing scheme reduces the overhead of MR-LSI significantly as the level of heterogeneity increases. |
        
| In the next set of tests the level of heterogeneity was kept the same but the size of the data was varied from 1GB to 10GB. This set of tests was used to evaluate how the load balancing scheme performs with different sizes of datasets. Figure 5 shows that the load balancing scheme always reduces the overhead of MR-LSI. |
        
| The load balancing scheme builds on a genetic algorithm whose convergence affects the efficiency of MR-LSI. To analyze the convergence of the genetic algorithm, the number of generations was varied and the overhead of MR-LSI in processing a 10GB dataset in the simulated Hadoop environment was measured. Figure 6 shows that MR-LSI reaches a stable performance when the number of generations in the genetic algorithm reaches 300, indicating a quick convergence of the genetic algorithm. |
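| To make the convergence discussion concrete, the sketch below shows a minimal generational loop over decimal chromosomes (one chunk size per mapper), reusing the fitness function sketched earlier. The truncation selection, arithmetic crossover and multiplicative mutation operators are generic textbook choices and are not claimed to be the operators used in the paper. |

```python
# Minimal generational GA over chunk-size chromosomes; illustrative only.
import random

def evolve(node_params, total_data_mb, generations=300, pop_size=50):
    k = len(node_params)

    def random_chromosome():
        # Split the total workload at k-1 random cut points.
        cuts = sorted(random.uniform(0, total_data_mb) for _ in range(k - 1))
        bounds = [0.0] + cuts + [total_data_mb]
        return [bounds[i + 1] - bounds[i] for i in range(k)]  # chunk size per mapper

    population = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, node_params), reverse=True)
        parents = population[: pop_size // 2]                  # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]        # arithmetic crossover
            i = random.randrange(k)
            child[i] *= random.uniform(0.9, 1.1)               # mutate one gene
            scale = total_data_mb / sum(child)
            child = [x * scale for x in child]                 # keep total workload fixed
            children.append(child)
        population = parents + children
    return max(population, key=lambda c: fitness(c, node_params))
```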
        
            | VI. CONCLUSION | 
        
| This paper presents a genetic algorithm based load balancing algorithm for MapReduce environments in support of data intensive distributed applications. Simulation results have shown the effectiveness of the algorithm in balancing the workload among MapReduce nodes. The MR-LSI algorithm speeds up the computation of SVD while maintaining a high level of accuracy in information retrieval. |
        
| Tables at a glance |

| Table 1 |

| Figures at a glance |

| Figure 1 | Figure 2 | Figure 3 |

| Figure 4 | Figure 5 | Figure 6 |
        
            | References | 
        
[1] Yahoo. Hadoop at Yahoo! Available at: http://developer.yahoo.com/hadoop/ (Last accessed: 20-July-2013).
[2] Devaraj Das. How to Hadoop. Available at: http://trac.nchc.org.tw/cloud/raw-attachment/Fwiki/HadoopWorkshop/hadoop-assembled.pdf (Last accessed: 03-Sept-2012).
[3] Apache Hadoop. Available at: http://hadoop.apache.org/ (Last accessed: 02-Nov-2012).
[4] Dean, J., and Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In Proc. of OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA.
[5] Lämmel, R. (2007). Google's MapReduce programming model — Revisited. Sci. Comput. Program., vol. 68.
[6] Venner, J. (2011). Pro Hadoop (1st ed.). New York: Springer.
[7] White, T. (2012). Hadoop: The Definitive Guide (2nd ed.). CA: O'Reilly Media.
[8] Groot, S. (2010). Jumbo: Beyond MapReduce for Workload Balancing. In VLDB 2010, 36th International Conference on Very Large Data Bases, Singapore.
[9] Ghemawat, S., Gobioff, H., and Leung, S.T. (2003). The Google File System. In SOSP '03, pp. 29-43, New York, NY, USA.
[10] Li, H., Wang, Y., Zhang, D., Zhang, M., and Chang, E. Y. (2008). PFP: Parallel FP-Growth for Query Recommendation. In RecSys '08, pp. 107-114, New York, NY, USA.
[11] Sadasivam, G. S., and Selvaraj, D. (2012). A Novel Parallel Hybrid PSO-GA using MapReduce to Schedule Jobs in Hadoop Data Grids. In 2012 Fourth World Congress on Nature and Biologically Inspired Computing, Kitakyushu, Fukuoka, Japan.
[12] Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley.
[13] Goda, K., Tamura, T., Oguchi, M., and Kitsuregawa, M. (2002). Run-time load balancing system on SAN-connected PC cluster for dynamic injection of CPU and disk resource - a case study of data mining application.
[14] Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., and Stonebraker, M. (2009). A comparison of approaches to large-scale data analysis. In Proceedings of the 35th SIGMOD International Conference on Management of Data, New York, NY, USA.
[15] Shokripour, A., and Othman, M. (2011). Survey on Divisible Load Theory and its Applications. In International Conference on Information Management and Engineering, pp. 300-304.
[16] Li, M., Hammoud, S., Alham, N. K., and Ponraj, M. (2012). A MapReduce based distributed LSI. In Proceedings of the 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Yantai, China, pp. 206-212.