PDF slides: Big Data and Cloud Computing 101
In this lecture we first define Big Data in terms of data management problems with the three V's: (1) Volume: the data is large; (2) Variety: the data is often not clean and tabular, but messy (text, even images); (3) Velocity: new data keeps arriving continuously. We also discuss Power Laws, which appear in many Big Data problems: much of the mass of the data is in the long tail, and the long tail cannot be ignored as it may represent the majority of all datapoints. Power Law distributions are typical in social networks; for example, the number of Twitter followers per account follows a power law.
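For intuition, here is a minimal sketch (the Zipf exponent 2.5 and the sample size are made-up illustration values, not figures from the lecture) that samples follower counts from a power-law distribution and shows that the small accounts in the tail are the overwhelming majority of all datapoints:

```python
import numpy as np

# Hypothetical illustration: draw "follower counts" from a Zipf
# (power-law) distribution and measure the weight of the long tail.
rng = np.random.default_rng(seed=42)
followers = rng.zipf(a=2.5, size=100_000)

tail = followers[followers <= 10]  # the many small accounts
print(f"accounts with <= 10 followers: {len(tail) / len(followers):.0%}")
print(f"their share of all follower links: {tail.sum() / followers.sum():.0%}")
```

With these parameters, roughly 98% of the accounts fall in the tail, and together they still carry the bulk of the follower links: ignoring them would mean ignoring most of the data.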
We explain Eric Brewer's CAP theorem: a global software system cannot achieve all three of Consistency (reads always reflect the latest updates), Availability (the system is always up), and Partition tolerance (the system is resistant against loss of communication between datacenters). We also describe concepts such as replication and sharding and their consequences in terms of read speed, update speed, and consistency.
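To make replication and sharding concrete, here is a minimal sketch (an in-memory toy, not any real system's API; the shard and replica counts are arbitrary) of hash-based sharding where every write is copied to a fixed number of replicas:

```python
import hashlib

NUM_SHARDS = 4
REPLICATION = 2
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_of(key: str) -> int:
    # Stable hash, so every client routes a key to the same home shard.
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def put(key: str, value: str) -> None:
    # Writes touch REPLICATION shards: updates get slower, but the data
    # survives the loss of a machine.
    home = shard_of(key)
    for i in range(REPLICATION):
        shards[(home + i) % NUM_SHARDS][key] = value

def get(key: str) -> str:
    # Here reads go to the home shard; in a real system any replica could
    # answer, speeding up reads but risking stale data if that replica
    # missed an update -- the consistency trade-off from the lecture.
    return shards[shard_of(key)][key]

put("user:42", "alice")
print(get("user:42"))
```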
As we work on large computer infrastructures, we distinguish between three related areas: (1) Super Computing, where performance is king and programmability a side issue; (2) Cluster Computing, which is about quickly getting things done on large clusters of unreliable machines; and (3) Cloud Computing, where computation is performed in large clusters operated by a third party and sold as a commodity service, with users paying for their actual use and seamlessly obtaining more or less capacity as their needs change (elasticity). In the practicum, where we use Hadoop (cluster computing) on Amazon Web Services (cloud computing), we hence combine (2) and (3); super computing is out of scope for this module. In cloud computing, we further distinguish between IaaS: Infrastructure-as-a-Service (virtual machines for rent, e.g. Amazon EC2), PaaS: Platform-as-a-Service (a database system for rent, e.g. Amazon Redshift), and SaaS: Software-as-a-Service (an application for rent, e.g. Microsoft Office 365 or Salesforce).
The last part of the lecture describes Amazon Web Services (AWS) in some detail to prepare you for the practicum. Services to remember are S3 object storage (infinitely scaling storage -- presumably implemented by spreading storage with replication over the hundreds of thousands of machines Amazon owns), EBS (Elastic Block Store) volumes, which are virtual disks that store their data remotely over the network (with snapshots persisted on S3) but do a lot of caching to avoid network traffic, and the EC2 service that allows you to power up virtual machines in the cloud, with options to choose from in terms of CPU, RAM, and local disks (aka "ephemeral storage") which are empty when you start up.
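As a small taste of driving these services programmatically, below is a minimal sketch using boto3, the AWS SDK for Python. The bucket name, file name, and AMI id are placeholders (not real practicum resources), and running it requires configured AWS credentials:

```python
import boto3  # AWS SDK for Python

# Upload a local file to S3 object storage.
s3 = boto3.client("s3")
s3.upload_file("results.csv", "my-practicum-bucket", "results.csv")

# Power up one virtual machine with EC2; its ephemeral/instance-store
# disks start out empty, as noted in the lecture.
ec2 = boto3.client("ec2")
ec2.run_instances(
    ImageId="ami-00000000",   # placeholder AMI id
    InstanceType="t2.micro",  # a small, cheap instance type
    MinCount=1,
    MaxCount=1,
)
```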
Practicum Instructions: Google Doc
The practicum will be performed on Amazon Web Services (AWS).
This book on the hardware architecture of clusters provides the reader with a deeper technical understanding of cloud hardware and software.
A lighter entry into more material, specifically on Amazon Web Services, is this YouTube video, given in 2011 at the Strata conference by Amazon CTO Werner Vogels -- a VU PhD graduate!
Finally, after having attended the lecture, the technical architects among you might have a shot at the following Amazon promo presentation, How to Scale Your Next Idea on AWS: a Love Story, which highlights the function of many of their services for constructing resilient, scalable, and secure web applications. We hope it will make more sense to you after the lecture than before -- although the lecture does not really explain all the Amazon services mentioned there.