PDF slides: SQL on Big Data
In this lecture we look at two kinds of systems for SQL on Big Data, namely SQL-on-Hadoop systems (most slides) and SQL-in-the-cloud systems (last few slides only). In the practicum we use Redshift (SQL-in-the-Cloud) as well as Hive (SQL-on-Hadoop), although since we will use Hive on an Amazon cluster, it is actually also running in the cloud. An expected outcome of the practicum is that Redshift is much faster than Hive, as the former is an optimized analytical database system (read below), whereas Hive still relies on MapReduce (or its slightly more modern equivalent Tez) where starting even a "hello world" job takes tens of seconds. The Hive queries iin this practicum are also run on CSV files, so the text has to be interpreted ("parsed" - which is slow) at query time, while Redshift already has the data loaded in its optimized data format (columnar, compressed - see below). If Hive reads data from a faster binary HDFS format (such as Parquet), it would be faster also.
Database systems we focus here on are for analysis; what traditionally is called OLAP (On Line Analytical Processing), in SQL terms. In OLAP workloads, users run a few complex read-only queries that search through huge tables. These queries may take a long time. OLAP can be contrasted with OLTP (On Line Transactional Processing), which is about sustaining a high-throughput of single-record read and update queries, e.g. many thousands per second. OLTP systems also have their Big Data equivalents in the form of NoSQL systems, such as HBase and Cassandra. However, for data science, where we analyze data, we focus on analytical database systems, so noSQL systems are out of scope for this module.
Modern analytical database systems have undergone a transformation in the past decade. The idea that you use the same technology for OLAP and OLTP is gone. Analytical database systems have moved from row-wise to columnar storage and are called "column stores". Column stores typically use data compression; data is compressed per column and this is more effective since the value distribution is more regular than if values from different columns are mixed (as happens in row-stores). Besides general-purpose compression (ZIP, bzip, etc) we discussed a number of database compression schemes that exploit knowledge of the table column datatypes and distributions, among others RLE (Run-Length Encoding), BitMap storage and Differential Encoding and Dictionary Encoding. Compression helps improve performance because the data becomes smaller and therefore less memory, disk and network traffic is needed when executing queries, often improving performance. Sometimes it is even possible to answer queries directly on the compressed data, without decompressing it; in such cases compression is no longer a trade-off between less data movement for more CPU computation (decompression) but a win-win (less data movement and less computation).
Other storage tricks are "zone-maps" which keep simple statistics (Min,Max) for large tuple ranges, which allow to avoid reading (skipping) a zone in the table if the WHERE condition ask for values that outside the [Min,Max] range of a zone. Parallel database systems that run on multiple machines often let the database user (data architect) decide on table distribution (which machine gets which rows?), and partitioning to split the table on each machine further in smaller partitions. Distribution (DISTRIBUTE BY) is typically done by "hashing" on a key column (a hash function is a random but deterministic function) to ensure that data gets spread evenly. Partitioning (PARTITION BY) is often done on a time-related column. One goal of partitioning is to speed up queries by skipping partitions that do not match a WHERE condition. A second goal is data lifecycle management, so one can keep the last X days of data in X partitions, by each day dropping the oldest partition (containing the data of X days ago) and starting a new empty partition in which new data gets inserted. Distribution and partitioning are typically applied both, independently.
Modern analytical database systems also overhaul the SQL query engine, using either vectorization or just-in-time compilation (into fast machine code) to use much less CPU resources per processed tuple than OLTP row-stores. Finally, high-end analytical database systems are parallel systems that run on a cluster and try to involve all cores of all nodes in queries, to make the system faster and more scalable.
Modern Hadoop data formats such as ORC and Parquet have adopted columnar data storage: even though they store all columns together in a HDFS file (to make sure all columns are on the same machine), inside that file the data is placed column after column in huge chunks of rows (rowgroups). Applications that only read a few columns can thus skip over the unused columns. (less I/O). These new data formats also apply columnar compression and sometimes also zone-maps. So, the more advanced SQL-on-Hadoop systems try to adopt all modern ideas on analytical database systems (e.g. Hive uses vectorization, Impala uses JIT, Presto uses neither), while also trying to fit into the Hadoop ecosystem. That means for instance, that they must try to ensure that they execute data near to where it resides on HDFS (to save bandwidth), to be able to query many different Hadoop data formats in situ, without requiring to first load them into the database, and coordinate their work on the Hadoop cluster (busy cores, RAM requirements) with YARN, in order not to run out of resources and/or suffer from interference by other cluster users.
Recently, analytical database systems in the cloud are gaining increasing adoption. Some popular systems currently are Amazon Redshift, Snowflake, Google BigQuery, Amazon Athena and Databricks. From the technical side, an important difference with classic on-premise (and also Hadoop) architectures is that the is no locality in data storage. In other words, there is no hope for running queries on the machines where the data is; rather, each database system will have to fetch data from a cloud storage service such as S3. Cloud database systems can only achieve some locality if they are running on cloud instances with a disk (remember: such disks are for caching only; are empty at start and do not service crashes or restarts). This means they can only be used for caching data that is permanently store in the cloud.
One important question is where the data that a cloud database system queries, is stored. It could be stored in the systems of the service provider; this is the case for Snowflake. The format can be either open source (e.g. ORC, Parquet, CSV, JSON) or proprietary (Snowflake). The latter also means that the data must be loaded into the system using some load pipeline. A next question is the financial model: do you need to pay for storage, for queries, or for both? A final important characteristic is whether you as a customer must start/stop and scale the database service explicitly, determining how many machines are used. Alternatively, this is fully handled by the system ("serverless"). Note that a consequence of serverless is that other processes servicing different customer may run on the same machine or even virtual machine (security risk).
Examples of serverless systems are Athena (Presto under the hood) and Google BigQuery (Dremel under the hood). Redshift is a cloud port of a "shared nothing" high en parallel database system called Paraccel, and is least flexible. In particular, data must first be copied to local disk before it can be queried. The system cannot scale up and down easily as changing the compute node setup also changes the data placement, and moving data takes time. There is a new serverless subsystem in Redshift called Spectrum, that allows to push certain queries for external S3 data into many nodes. Databricks is a spark-in-the-cloud service that uses local storage for caching, and also focuses on working with AI (deep learning) tools..
Practicum Instructions: Google Doc
For technical background material, there are the following papers:
This course is not sponsored by Amazon.. however here is yet another Amazon technical evangelism video, this time describing its high-performant cloud-based Redshift data warehousing service.
Another cloud-based warehousing solution is Google BigQuery:
In our Hadoop Ecosystem write-up we mention quite a few SQL-on-Hadoop systems: Hive, Impala, Drill, PrestoDB and Spark SQL. It turns out it is really hard to find high-quality non-scientific articles on the topic of SQL on Hadoop systems. The best I could find were two rather superficial papers on infoworld and ZDNet:
However, also without using Hadoop, you can still manage huge amounts of relational data easily in the cloud using SQL-in-the-Cloud systems. The first mover and market leader in the space comes from Amazon, namely Redshift which is based on the columnar parallel database system Paraccel. The second option is Google BigQuery, which also runs on a parallel column store, this time Google's own Dremel system. Recent new entrants to this fast growing market are Snowflake and Microsoft Azure SQL Data Warehouse.