Big Data Infrastructures and Technologies


BADS Module: Big Data Infrastructures & Technologies

Lecture: The MapReduce Framework


Practicum

Practicum Instructions: Google Docs

In the practicum we ask you to implement a Map() and a Reduce() function in the Python programming language. For those not familiar with Python, some pointers to tutorials:

It is highly recommended to install Python on your laptop first, and to develop and debug your Python scripts locally before running them inside MapReduce.

Python comes with a development environment called IDLE.

Open an IDLE window (IDLE is installed together with Python).
Use "Open File" to open your Python script (start with the word-count examples in the Google doc).
Click "Run Module" (F5 on Windows) to run the current script; either a new window will open, or highlighted error messages will appear that should help you fix your code.
Copy/paste some standard input into that window and hit Enter; your script should process that input and respond.
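As an illustration of the kind of script these steps apply to, here is a minimal sketch of a word-count Map() step. It assumes the script reads text lines from standard input and writes tab-separated "word, 1" pairs to standard output; the function name `run_mapper` and the lowercasing of words are choices made for this sketch, and the word-count examples in the Google doc may be structured differently.

```python
import sys


def run_mapper(lines, out):
    """Word-count Map(): write one 'word<TAB>1' line for every word read."""
    for line in lines:
        # Assumption for this sketch: words are whitespace-separated
        # and counted case-insensitively.
        for word in line.strip().lower().split():
            out.write(f"{word}\t1\n")


# As a standalone script, the last line would be:
#     run_mapper(sys.stdin, sys.stdout)
```

With that last line enabled, pasting a sentence into the window that opens after "Run Module" should print one line per word, each followed by a tab and the number 1.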

Deploying code that at least works conceptually on your laptop will cut down the debugging effort considerably compared to debugging inside MapReduce, where each attempt to run costs minutes and error messages are hidden inside huge log files.
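One way to check the logic on a laptop is to imitate the whole job in memory: apply Map() to every input line, sort the resulting pairs by key (a stand-in for the grouping that MapReduce performs between the two phases), and apply Reduce() once per key. The sketch below does this for word count. The names `map_fn`, `reduce_fn` and `run_local` are hypothetical helpers for this illustration, not part of Hadoop or of the practicum code.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """Map(): emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce(): add up all counts that were emitted for one word."""
    return word, sum(counts)


def run_local(lines):
    """Imitate map -> sort/group by key -> reduce on a single machine."""
    pairs = [pair for line in lines for pair in map_fn(line)]
    pairs.sort(key=itemgetter(0))  # stand-in for the shuffle/sort phase
    return [
        reduce_fn(key, (count for _, count in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]
```

For example, `run_local(["a b a", "b c"])` returns `[("a", 2), ("b", 2), ("c", 1)]`. A run like this only checks the logic of the two functions; it says nothing about how the job behaves on a real cluster.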

Technical Literature

For technical background material, there are two papers.

The first is the famous Google paper that introduced MapReduce. Note that MapReduce itself is a proprietary Google system, whereas Hadoop is its open-source clone, which has become the standard high-level layer used in compute clusters. The second paper is a whitepaper on the architecture of the Hadoop Distributed File System (HDFS).

Please bear in mind that in the practicum we are combining cluster computing (i.e. MapReduce) with cloud computing (Amazon Web Services). While one could do this by powering up individual virtual machines in EC2 and installing Hadoop on them, Amazon makes this easier with its Elastic MapReduce (EMR) service, which pre-installs Hadoop on the set of machines you power up and also allows you to shrink or grow this cluster as you go (elasticity). The presentation below by Amazon gives more specific information on EMR.

Amazon EMR Masterclass from Amazon Web Services

Extra Material

An accessible short introduction to understanding MapReduce:

Peter Boncz · Hannes Mühleisen