Graph algorithms are often thought of as more complex than, say, the algorithms used in standard relational data management. This may be true, for example because of the recursive nature of graph queries or the intricate pruning algorithms for finding a shortest path. Still, an important basis for graph data management systems is formed by techniques that were pioneered in relational database systems: everything relevant for relational data management also applies to graph systems.
Here, we present a challenge where the graph traversal part may seem simple, and the important ingredients are database implementation techniques, index structure design, and algorithms that control access patterns. Of course, graphs are richly structured, and this is certainly the case for the highly correlated LDBC SNB (Social Network Benchmark) dataset, which adds a twist of expensive access patterns to the problem.
This challenge aims to let students think about the performance of computational tasks on graph datasets. The social network query that must be implemented requires thinking about the data and its distributions, as well as about algorithms to navigate this data. The underlying goal is that gaining an understanding of the interaction between data, algorithms, and hardware will be very useful for the EDBT Summer School 2015 students.
One issue that makes data management problems potentially "big data" problems is the complexity of the task. High complexity can come in the form of algorithms that are CPU intensive (either many cycles per data element, or quadratic in data size or worse), and/or in the shape of "difficult" access patterns. The access pattern to a dataset can be sequential or scattered (random); in the latter case, both temporal and spatial locality determine how expensive the access will be. You will find that there is a large difference, certainly when managing datasets larger than RAM, between sequential access and random access without locality.
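To see this gap for yourself, the following is a minimal, self-contained C sketch (the array size is an arbitrary choice, merely meant to exceed the CPU caches) that sums the same array once sequentially and once through a shuffled index array:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 25)  /* 32M ints, ~128 MB: larger than typical CPU caches */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int *data = malloc(N * sizeof(int));
    int *perm = malloc(N * sizeof(int));
    if (!data || !perm) return 1;

    /* Fill the array and build a random permutation of its indices. */
    for (int i = 0; i < N; i++) { data[i] = i; perm[i] = i; }
    srand(42);
    for (int i = N - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }

    long long sum = 0;
    double t0 = seconds();
    for (int i = 0; i < N; i++) sum += data[i];        /* sequential */
    double t1 = seconds();
    for (int i = 0; i < N; i++) sum += data[perm[i]];  /* scattered  */
    double t2 = seconds();

    printf("sequential: %.3fs  random: %.3fs  (sum=%lld)\n",
           t1 - t0, t2 - t1, sum);
    free(data); free(perm);
    return 0;
}
```

Both loops touch exactly the same data, yet the scattered pass is typically several times slower than the sequential one; once the dataset no longer fits in RAM, the gap grows by orders of magnitude.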
It is also important to think about which parts of the data are actually needed, and which are not (and could potentially be thrown away). Because this challenge gives you the opportunity to build a data structure of your choice in the data loading phase, thinking about this is certainly worthwhile.
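As an illustration of this idea, the sketch below parses an input file once and retains only a single attribute in a packed array. The file name, layout, and the choice of keeping only the first (numeric) field are hypothetical, not the actual SNB format:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical loader: suppose person.csv has many columns, but the
 * query only ever touches the numeric id in the first field. Then we
 * keep just that field, as a packed array of 32-bit integers, and
 * discard the rest of each row. */
int main(void) {
    FILE *f = fopen("person.csv", "r");
    if (!f) { perror("person.csv"); return 1; }

    size_t cap = 1024, n = 0;
    uint32_t *ids = malloc(cap * sizeof(uint32_t));
    char line[4096];

    while (ids && fgets(line, sizeof line, f)) {
        if (n == cap) {                 /* grow the array as needed */
            cap *= 2;
            ids = realloc(ids, cap * sizeof(uint32_t));
            if (!ids) break;
        }
        /* strtoul stops at the first non-digit, i.e., the field separator */
        ids[n++] = (uint32_t)strtoul(line, NULL, 10);
    }
    fclose(f);
    printf("loaded %zu ids (%zu KB) instead of the full rows\n",
           n, n * sizeof(uint32_t) / 1024);
    free(ids);
    return 0;
}
```

The pay-off is twofold: the working set shrinks, and what remains is laid out contiguously, which turns many later lookups into cheap sequential scans.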
There is no obligatory programming language, but we recommend using C or C++. Java might also work, but it will eat much more memory - of which you have little - if you do not take care (i.e., store everything as basic types in arrays and avoid objects and standard hash maps). Scripting languages like Python are much slower: they will not make you win this competition and might not even let you finish within the benchmarking timeout. So, certain programming skills come in handy. We do provide an example C implementation, and the changes needed to turn that into a viable solution are limited.
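The "basic types in arrays" advice applies in any language. In C, a common way to follow it for graph data is a compressed sparse row (CSR) layout, which stores the entire adjacency structure in two flat integer arrays. The minimal sketch below uses a hard-coded toy graph; it illustrates the layout, not the challenge's actual data format:

```c
#include <stdio.h>
#include <stdint.h>

/* Compressed sparse row (CSR): one offset array plus one neighbor
 * array of plain integers. No per-node objects, no hash maps; the
 * neighbors of node v sit contiguously at neighbors[offsets[v]]
 * through neighbors[offsets[v + 1] - 1]. */
int main(void) {
    /* Toy graph with 4 nodes: 0->{1,2}, 1->{2}, 2->{0,3}, 3->{} */
    uint32_t offsets[]   = {0, 2, 3, 5, 5};  /* length = #nodes + 1 */
    uint32_t neighbors[] = {1, 2, 2, 0, 3};  /* length = #edges     */

    for (uint32_t v = 0; v < 4; v++) {
        printf("node %u:", v);
        for (uint32_t i = offsets[v]; i < offsets[v + 1]; i++)
            printf(" %u", neighbors[i]);
        printf("\n");
    }
    return 0;
}
```

Besides being compact, this layout makes enumerating a node's neighbors a sequential scan, which ties back to the access-pattern discussion above.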
Furthermore, we require you to use git for version control of your software solutions, and suggest using GitHub to host your repository. Using git is part of the basic skill set of any computer scientist.
To make this challenge even more fun, there is a competition to see who can build the fastest solution. The leaderboard is automatically updated by our evaluation infrastructure, which regularly polls your GitHub code repositories for new commits.
In the practical, we force you to use a relatively small machine, actually a virtual machine, to manage this social network dataset. This virtual machine runs Linux, and you can fire it up on your own laptop in VirtualBox, regardless of whether you use MacOS, Windows, or Linux.
Alternatively, you can run in the cloud, for free. For some of you, this might be the first encounter with utility computing. We have prepared a VM image for use in the Amazon EC2 free tier.
The origins of this challenge are in the Large Scale Data Engineering course (LSDE).