PDF slides: Spark (based on slides by Matei Zaharia & Xiangrui Meng & Reynold Xin)
In this lecture we focus on the Spark system, which is by far the hottest technology entering the Hadoop ecosystem (though we should mention that there is a very similar framework from Europe, Apache Flink, which is arguably better on some fronts, in particular streaming data). Where MapReduce first writes all data that Mappers send to Reducers to file, Spark tries to avoid doing so and keeps as much data in RAM as possible. But since Spark does not provide a distributed filesystem or a cluster resource manager, it effectively depends on HDFS and YARN, making Spark a member of the Hadoop ecosystem. In the Amazon cloud, we can use Spark via EMR (Elastic MapReduce), which is Amazon's Hadoop with pre-installed Spark. However, it is also possible to use Spark without Hadoop, though one then needs some other shared filesystem and job scheduler in place of HDFS and YARN. The makers of Spark (the company Databricks) operate such a Hadoop-less Spark in the Amazon cloud as a service, where the shared storage is provided by Amazon S3 (a similar service now also exists in the Microsoft Azure cloud).
The central concept in Spark is the RDD: Resilient Distributed Dataset. This concept is Spark's answer to fault tolerance in distributed computation. Since Spark wants to avoid writing to disk, the question is how fault tolerance is achieved if a node dies during a computation whose input was not written to a replicated disk. RDDs are the answer: they are either persistent datasets (on HDFS) or declarative recipes of how to create a dataset from other RDDs. This recipe (the lineage of the RDD) allows Spark to recompute lost partitions of an RDD on the fly when a job fails.
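As a minimal sketch in Scala (assuming a SparkContext `sc`; the HDFS path and the tab-separated field layout are made up for illustration), this is what such a recipe looks like: each transformation merely records how to derive a new RDD from its parent, and nothing is computed until an action such as collect() is called.

    val lines  = sc.textFile("hdfs:///data/log.txt")            // persistent dataset on HDFS
    val errors = lines.filter(line => line.contains("ERROR"))   // recipe: derived from `lines`
    val counts = errors.map(line => (line.split("\t")(0), 1))   // recipe: derived from `errors`
                       .reduceByKey(_ + _)
    counts.collect()   // only now is the recipe executed; if a node fails,
                       // its lost partitions are recomputed from this lineage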
Spark is a programming framework that offers Map and Reduce, but also 20+ other communication patterns, e.g. joins between datasets. As such, it is a generalization of MapReduce. Also notable is its focus on functional programming, in particular Scala. Functional programming languages have no side effects and are therefore much easier to parallelize than traditional programming languages. This is what Spark does: automatically parallelize and optimize functional programs over a cluster infrastructure. One feature of functional programming that is often used in Spark is the lambda: an anonymous, inline function definition. The functions you would write in Java for Map and Reduce in MapReduce can be provided shortened, inline, as lambdas in Scala. Typically, such a lambda is a parameter to one of Spark's high-level operators, e.g. a filter operator expects a boolean function that decides which items to keep, while a join on two RDDs of (key, value) pairs matches items from the two RDDs that share the same key.
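As an illustration (again assuming a SparkContext `sc` and made-up HDFS paths), the classic MapReduce word count shrinks to a few lines when the Map and Reduce logic is passed inline as lambdas:

    val words = sc.textFile("hdfs:///data/books.txt")
                  .flatMap(line => line.split(" "))         // the "Map" logic, as a lambda
    val freqs = words.map(word => (word, 1))
                     .reduceByKey((a, b) => a + b)          // the "Reduce" logic, as a lambda
    freqs.saveAsTextFile("hdfs:///data/wordcount-out")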
Newer versions of Spark introduced DataFrames. A DataFrame is a conceptual layer on top of RDDs: essentially, a DataFrame is a table with column names and types. Thanks to this extra information, Spark can execute queries on DataFrames more efficiently than queries on plain RDDs, because it can optimize them. Spark's query optimizer is called Catalyst; both Spark SQL and all scripts on DataFrames use it. Spark performance was further improved by low-level compilation of Spark queries into Java code, and by better support for columnar storage.
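A small sketch of the DataFrame side (assuming a SparkSession `spark` and a hypothetical CSV file of song ratings): because column names and types are known, Catalyst can rewrite the query, e.g. push the filter below the aggregation, before it is executed.

    import spark.implicits._

    val ratings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")        // gives every column a name and a type
      .csv("hdfs:///data/ratings.csv")

    ratings.filter($"rating" >= 4)          // declarative: Catalyst decides the execution plan
           .groupBy($"song")
           .count()
           .explain()                       // prints the optimized physical plan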
Spark is a distributed Big Data processing environment for Scala, Java or Python programs, but it also has multiple higher-level components, notably Spark SQL (a distributed columnar database, discussed later), GraphX (vertex-centric graph processing), and MLLIB (a machine learning library). The latter two are summarized as follows.
Spark offers the MLLIB library, which provides a set of ready-to-run machine learning algorithms that work on RDDs (and now also on DataFrames), and thus scale on a cluster. We focus on its Alternating Least Squares (ALS) implementation for collaborative filtering. In collaborative filtering we typically have a very sparse set of data points that describe users and their interests (e.g. songs). The purpose of collaborative filtering is to derive what every user thinks of every possible item of interest, specifically in order to recommend the most interesting items (songs) to each user. ALS models this as a matrix problem where for each user and for each song we keep a vector of numbers: each element in this vector corresponds to a latent factor (characteristic). If we knew these factors, multiplying the user vectors with the song vectors would give the full matrix of all user/song interests. The ALS method iteratively approximates the optimal vector values, using a (possibly small) set of known interests as training set.
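In its standard formulation (the notation here is ours, not taken from the slides), ALS minimizes the squared error on the known interests while regularizing the factor vectors:

    \min_{X,\,Y} \sum_{(u,i)\ \mathrm{known}} \left( r_{ui} - x_u^{\top} y_i \right)^2
                 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

where r_{ui} is the known interest of user u in song i, and x_u and y_i are their latent factor vectors. Fixing all song vectors y_i turns this into an ordinary least-squares problem for each user vector x_u (and vice versa), which is why the method alternates between the two sides until it converges.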
MLLIB contains many different machine learning algorithms that can easily be connected to data through the Spark concept of RDDs. The output of the learning (the model) is also an RDD and can subsequently be used to score new users (i.e. make recommendations). Besides tight integration with Spark processing pipelines (and e.g. Spark SQL), another advantage of MLLIB is that its algorithms are implemented in a scalable, distributed fashion. One can therefore use MLLIB on a cluster to address large problems that would not fit on a single machine.
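As a minimal sketch of what this looks like in practice (using the DataFrame-based ALS; the `ratings` DataFrame and its column names are assumptions for illustration):

    import org.apache.spark.ml.recommendation.ALS

    // `ratings` is assumed to be a DataFrame with columns userId (int), songId (int), rating (float)
    val als = new ALS()
      .setUserCol("userId").setItemCol("songId").setRatingCol("rating")
      .setRank(10)        // number of latent factors per user/song vector
      .setMaxIter(10)     // number of alternating optimization rounds
      .setRegParam(0.1)   // regularization weight

    val model = als.fit(ratings)                  // learns the user and song factor vectors, distributed over the cluster
    val top10 = model.recommendForAllUsers(10)    // scores all songs and keeps the 10 best recommendations per user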
Practicum Instructions: Google Doc
For technical background material, there are four papers,