PDF slides: MapReduce & Hadoop
Arguably, MapReduce is what started the whole Big Data wave and what got Google its initial technical advantage. In this lecture we first motivate MapReduce by looking at the difficulties previously experienced (in Super Computing) in developing parallel programs, in terms of programming expertise and debugging effort. We argue that for Cluster Computing we need to raise the level of abstraction, viewing the whole cluster as a single machine ("the cluster is the computer"), and developing programming frameworks on that higher level.
We also take a look at cluster hardware architectures (single node, rack of e.g. 80 nodes, cluster of e.g. 40 racks) and analyze the basic costs of accessing data from (i) the local CPU cache, (ii) local memory, (iii) another computer in the same rack, (iv) another computer in the cluster, (v) a computer in a different cluster, and (vi) a computer in a different datacenter. This shows that disk and network latencies can be very high (and do not improve much over time); bandwidth does improve with newer hardware, but remains precious. This motivates cluster programs to communicate as little as possible and, if they must, to do so in bursts; and preferably to read data from the local node, or failing that, at least sequentially and at a large granularity (idea: minimize bandwidth usage, and amortize latency using large transfers).
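The amortization argument can be made concrete with a back-of-envelope calculation. The latency and bandwidth figures below are illustrative assumptions, not measurements:

```python
# Back-of-envelope model: time to fetch `size` bytes = latency + size / bandwidth.
# The numbers below are assumed for illustration only.

def transfer_time(size_bytes, latency_s, bandwidth_bps):
    """One round-trip latency plus the serialized transfer time."""
    return latency_s + size_bytes / bandwidth_bps

NET_LATENCY = 0.5e-3    # assumed 0.5 ms within a cluster
NET_BANDWIDTH = 125e6   # assumed ~1 Gbit/s = 125 MB/s

# Fetching 64 MB in one request vs. in 1000 requests of 64 KB each:
one_big = transfer_time(64e6, NET_LATENCY, NET_BANDWIDTH)
many_small = 1000 * transfer_time(64e3, NET_LATENCY, NET_BANDWIDTH)

# The payload is identical, but the small requests pay the latency 1000
# times over, so the single large transfer finishes roughly twice as fast:
# latency is amortized over the large transfer.
```

Under these assumed numbers, the single 64 MB transfer takes about 0.51 s while the 1000 small requests take about 1.01 s, which is why HDFS works with large blocks.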
We then describe the MapReduce framework. The framework automatically spreads computation over a cluster, following the above rules: it tries to move computation to the data (minimizing bandwidth usage) and, when it must communicate, it does so in large transfers. MapReduce relies on a distributed file system named HDFS (Hadoop Distributed File System) that stores files in large blocks of 64MB and replicates these blocks on three machines (by default). To keep replication simple, HDFS files cannot be modified, only appended to. A special master machine runs the namenode, which keeps track of all files, their blocks, and on which machines (called datanodes) these are stored. HDFS in principle stores all blocks of an HDFS file together on the same three datanodes, and the datanode that originally wrote the data is always one of them. If a datanode goes down, the namenode detects this and replicates its blocks on other datanodes to compensate. The datanodes (tend to) run Linux and store all HDFS data as regular Linux files. However, one cannot see the HDFS filesystem directly from Linux: one has to use a special HDFS utility program to browse through it (cd, ls/list, copy files) and see the data. So Hadoop (and HDFS) is a cluster software layer on top of Linux.
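The re-replication behaviour described above can be sketched as a toy simulation. This is our own illustrative model, not Hadoop's actual code; the class and method names are made up:

```python
# Toy model of the namenode's bookkeeping: every block should live on
# REPLICATION datanodes; when a node dies, its blocks are copied to other
# live nodes to restore that invariant. (Illustrative sketch only.)

REPLICATION = 3  # the HDFS default replication factor

class ToyNamenode:
    def __init__(self, datanodes):
        self.live = set(datanodes)
        self.blocks = {}  # block id -> set of datanodes holding a replica

    def write_block(self, block, origin):
        # The writing datanode keeps one replica; the rest go elsewhere.
        others = sorted(self.live - {origin})[:REPLICATION - 1]
        self.blocks[block] = {origin, *others}

    def datanode_failed(self, node):
        self.live.discard(node)
        for block, holders in self.blocks.items():
            holders.discard(node)
            # Re-replicate onto live nodes until we are back at 3 copies.
            for candidate in sorted(self.live - holders):
                if len(holders) >= REPLICATION:
                    break
                holders.add(candidate)
```

For example, after writing a block from datanode "dn2" on a four-node cluster and then failing "dn2", the model still reports three replicas, none of them on the failed node.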
The user is expected to write a Map() and a Reduce() function, both of which map (key,value) pairs into other (key,value) pairs. Both Map() and Reduce() can emit zero or more result pairs for each input pair. The reducer receives as input (key,value*): the second parameter is the full list of all values that were emitted by the mappers for that key, so each key is handled by exactly one Reduce() call. Optionally, a Combine() function may be placed between Map() and Reduce() to reduce the amount of communication between mappers and reducers; its input and output parameters have the same format as the input of the Reduce() function.
The MapReduce framework optimizes data movement by asking the namenode for the location(s) of the HDFS input files and assigning files (or parts of files, called splits) to mappers on machines where this data is on the local disk. Because the reducers get data from all mappers, as determined by the key, MapReduce must perform network communication for the reduce phase. If a Combine() function is provided, it will be used to reduce the data volume that needs to be sent. Examples were given in the Python programming language to prepare for the practicum.
In the past 10 years, Hadoop has evolved from a MapReduce clone into a whole ecosystem with many tools. It is the most popular "cluster operating system" for organizations that manage their own Big Data clusters ("on premise", as opposed to in the cloud). This evolution was enabled by the second major version of Hadoop, which separated MapReduce from job scheduling, i.e. YARN. This separation of general facilities (storage: HDFS, scheduling: YARN, compute: MapReduce) created room for a whole ecosystem beyond MapReduce, with many alternative tools (e.g. SQL systems such as Impala and Hive) and programming frameworks (e.g. Flink and Spark) operating on top of it.
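As a concrete illustration, here is the classic word-count example written in this style, together with a tiny sequential driver of our own that mimics the shuffle for local testing. The function names and the driver are ours; a real run would execute the mappers and reducers on different machines under Hadoop:

```python
# Word count in the Map/Reduce style: Map() emits a (word, 1) pair per word,
# Reduce() receives all values for one key and sums them. The driver below
# is only for local debugging and is not part of MapReduce itself.
from collections import defaultdict

def map_fn(key, value):
    """key: line offset (ignored here), value: one line of text."""
    for word in value.split():
        yield (word, 1)

def combine_fn(key, values):
    """Optional: same signature as reduce; pre-aggregates on the mapper side."""
    yield (key, sum(values))

def reduce_fn(key, values):
    """Receives ALL values emitted for `key`, across all mappers."""
    yield (key, sum(values))

def run_local(lines):
    """Tiny sequential driver that mimics the shuffle between map and reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for k, v in map_fn(offset, line):
            groups[k].append(v)          # the "shuffle": group values by key
    out = {}
    for k, vs in groups.items():
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

# run_local(["a b a", "b"]) yields {"a": 2, "b": 2}
```

Because word counts are just sums, Combine() here is identical to Reduce(); combining the per-mapper counts before the shuffle is exactly the bandwidth saving described above.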
Practicum Instructions: Google Docs
In the practicum we ask you to implement a Map() and a Reduce() function in the Python programming language. For those not familiar with Python, some pointers to tutorials:
It is highly recommended to install Python on your laptop first and to develop and debug the Python scripts locally before running them inside MapReduce.
Python comes with a development environment called IDLE.
For technical background material, there are two papers. The first describes MapReduce; Hadoop is its open-source clone that has become the standard high-level layer used in compute clusters. The second is a whitepaper on the architecture of the Hadoop Distributed Filesystem (HDFS).
Please bear in mind that we are combining cluster computing (i.e. MapReduce) with cloud computing (Amazon Web Services) in the practicum. While one could do this by powering up individual virtual machines in EC2 and installing Hadoop on them, Amazon makes this easier with its Elastic MapReduce (EMR) service, which pre-installs Hadoop for you on the set of machines you power up, and also allows you to shrink and grow this cluster as you go (elasticity). The below presentation by Amazon gives more specific information on EMR.
An accessible short introduction to understanding MapReduce: