Big Data technologies for Data Science

Home - Data Science Master

Lecture: SQL on Big Data

PDF slides: SQL on Big Data

[an error occurred while processing this directive] [an error occurred while processing this directive]

Practicum

Practicum Instructions: Google Doc

Technical Literature

For technical background material, there are the following papers:

Related Material

This course is not sponsored by Amazon.. however here is yet another Amazon technical evangelism video, this time describing its high-performant cloud-based Redshift data warehousing service.

Another cloud-based warehousing solution is Google BigQuery:

Exploring BigData with Google BigQuery from Dharmesh Vaya

Extra Material

In our Hadoop Ecosystem write-up we mention quite a few SQL-on-Hadoop systems: Hive, Impala, Drill, PrestoDB and Spark SQL. It turns out it is really hard to find high-quality non-scientific articles on the topic of SQL on Hadoop systems. The best I could find were two rather superficial papers on infoworld and ZDNet:

However, also without using Hadoop, you can still manage huge amounts of relational data easily in the cloud using SQL-in-the-Cloud systems. The first mover and market leader in the space comes from Amazon, namely Redshift which is based on the columnar parallel database system Paraccel. The second option is Google BigQuery, which also runs on a parallel column store, this time Google's own Dremel system. Recent new entrants to this fast growing market are Snowflake and Microsoft Azure SQL Data Warehouse.

ZDNet article on SQL-in-the-cloud systems

Peter Boncz · Hannes Mühleisen