Green Big Data Processing using Hadoop
Anne-Cécile Orgerie, Shadi Ibrahim
Data volumes are ever growing across a large spectrum of applications, ranging from traditional database applications and scientific simulations to emerging applications such as Web 2.0 and online social networks. To cope with this added weight of Big Data, we have recently witnessed a paradigm shift in the way data is processed, through the MapReduce model. First promoted by Google, MapReduce has become, thanks to the popularity of its open-source implementation Hadoop, the de facto programming paradigm for Big Data processing in large-scale data centers and clouds. Yet, to keep up with the ever-growing size of Big Data, Hadoop has recently been deployed on large-scale data centers equipped with thousands of energy-hungry servers. This results in a tremendous increase in the energy consumed to operate these data centers, leading not only to high monetary costs but also to high carbon emissions. The goal of this tutorial is to serve as a first step towards not only exploring the Hadoop MapReduce engine but also providing deep insight into the challenges of making Big Data greener and discussing the main approaches developed in the literature.
In the first part, we will briefly introduce ourselves, the prerequisites for this tutorial, and its goal. We will also show the relevance of the tutorial's topic before presenting its outline.
- Green Computing
The second part will be dedicated to green computing in general. We will explain the main challenges related to green computing. Then, we will review the main techniques used in green computing, from server-oriented techniques like dynamic voltage and frequency scaling to data-center-wide approaches like follow-the-sun frameworks. This part will give the audience the basic principles and main techniques of green computing.
- Big Data
In this part, we will give the audience an overview of Big Data. We will explore different definitions of Big Data and explain where this data comes from. Then we will discuss what Big Data is useful for and finally illustrate the new challenges it brings.
- Big Data Processing using Hadoop
One of the major challenges of Big Data is processing it. Therefore, in this part we will focus on the MapReduce programming model and explain Hadoop in detail.
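As a rough illustration of the MapReduce programming model, the following single-process Python sketch counts words using the three classic phases: map, shuffle, and reduce. It is not taken from the tutorial material and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example. In Hadoop, the same phases would run in parallel across many nodes.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values sharing a key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "green big data"]))
```

The point of the decomposition is that the map and reduce functions are independent per record and per key, which is what allows a framework to distribute them across servers.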
- Green Hadoop
In this part, we will focus on the green approaches specifically developed to deal with Big Data issues, with particular emphasis on the methods used to make the Hadoop framework greener. We will explain the general ideas behind these approaches and give concrete examples of how they can be used in practice.
- Tutorial Conclusion
Anne-Cécile Orgerie is a permanent research scientist at CNRS. She works in the Myriads team at IRISA in Rennes, France. She received her PhD degree in Computer Science from Ecole Normale Superieure de Lyon (France) in September 2011. Her PhD thesis, entitled "An energy-efficient reservation framework for large-scale distributed systems", was awarded the PhD thesis prize from the French chapter of ACM SIGOPS (ASF). Her research interests focus on energy efficiency, large-scale distributed systems and telecommunication networks from both practical and theoretical perspectives.
Shadi Ibrahim is a permanent Inria research scientist within the KerData research team. He obtained his PhD in Computer Science from Huazhong University of Science and Technology in Wuhan, China, in 2011. His research interests are in cloud computing, big data management, data-intensive computing, virtualization technology, and file and storage systems. He has published several research papers in recognized big data and cloud computing journals and conferences, including several papers on optimizing and improving Hadoop MapReduce performance in the cloud and one book chapter on the MapReduce framework.