Big Data technologies for Data Science

Goals & Scope

In this course we dive into cloud technologies that allow organizations to tap into potentially thousands of computers at the click of a button, at little upfront cost. We also explain the software used to provision and program such compute clusters, so that they can be applied to Big Data problems.

Lectures and Practicum

Every week there is a 2-hour lecture with a matching practicum that is performed in the cloud on Amazon Web Services. The practicum is started in lab sessions (2 hours per week), where you work on your own laptop and can ask questions. The solutions need to be submitted via Blackboard. To contact the TAs, post a question on the Canvas discussion forum.

Student Projects and Seminar

During the course, students form teams with people from a variety of backgrounds to solve a (big) data science problem at all its layers (i.e., from raw data to final visualisation of the results). The project should result in a presentation and a short report; to support its development, there are weekly meetings with Ana Varbanescu.

There is a weekly seminar (1 hour per week), which features invited speakers at first and, later in the course, student presentations on the student projects.

Exam and Grading

There is a theoretical exam at the end of the course. This exam counts for 25% of the final grade and must be passed. The average score on the lab session exercises also counts for 25%. The student project counts for 50%.
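As an illustration of how these weights combine, here is a minimal sketch in Python. The function name `final_grade`, the 10-point scale, and the pass mark of 5.5 are assumptions made for this example only; the course does not state a pass mark here, so check the official rules for the actual threshold.

```python
def final_grade(exam, labs, project, pass_mark=5.5):
    """Combine the three components using the 25% / 25% / 50% weights.

    Returns None when the exam is below pass_mark, because the exam
    must be passed. pass_mark=5.5 is an assumed value, not one taken
    from the course rules.
    """
    if not labs:
        raise ValueError("at least one lab score is required")
    if exam < pass_mark:
        return None  # exam not passed, so no final grade is computed
    lab_average = sum(labs) / len(labs)
    return 0.25 * exam + 0.25 * lab_average + 0.50 * project


# Hypothetical scores: exam 8.0, two labs (6.0 and 8.0), project 7.0
print(final_grade(8.0, [6.0, 8.0], 7.0))
```

With these hypothetical scores the lab average is 7.0, so the weighted sum is 2.0 + 1.75 + 3.5.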

Technical Literature and Books

The books below give background information on the hardware and software aspects, respectively, of Big Data infrastructures and technologies:

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (free online edition)
Hadoop: The Definitive Guide (O'Reilly)

These books are not obligatory, but supplement the slides. Every lecture also lists a couple of technical papers that you should read.

The Lecture Pages (in the right menu) provide a short summary of the main points of each lecture. Always read this summary! These pages also give access to some lighter extra material and to recommended related presentations on the web (YouTube, SlideShare).

Some course overview information is available in the course outline (mostly in Dutch).

You, yes, you, will program a cluster like this.

The origins of this material are in the Large Scale Data Engineering MSc course (LSDE).

Authors

The Big Data course in the Data Science master at UvA and VU was developed by Peter Boncz and Hannes Mühleisen from the Database Architectures research group of CWI, specifically for the Amsterdam Data Science initiative.

Acknowledgements

The lecture slides for this course are adapted from those used in the Extreme Computing course, which were graciously provided by Dr. Stratis Viglas of the University of Edinburgh.

Peter Boncz · Hannes Mühleisen