Graph Algorithm Performance - LDBC SNB Challenge

Home - EDBT-Summer-School-2015 - LDBC-Social-Network-Benchmark

EDBT Summer School 2015 - LDBC SNB Challenge

Practical Information

Quick Start

We provide a ready-made Linux virtual machine for the VirtualBox system for download (9 GB). Once downloaded, this machine can be imported into VirtualBox and started. The VM contains sample code and datasets (in the folder /opt/lsde) and sample code at /home/admin/leaderboard-demo. To login, use the username "admin" and the password "randomaccess". The virtual machine is also configured to open up SSH access at port 2224. The data can be found in /opt/lsde.

Those using Amazon should register for the free tier (credit card info required, but all this should cost $0), then in the management console go to EC2, and launch a "Community AMI" instance. Searching for "lsde" in the Frankfurt datacenter you will find the lsde-4 image, which can be launched on a "micro instance" for free. Doing so, you will need to create an ssh key pair, and launch the instance. When it is running, you note down the IP address and SSH into it using username "fedora" and passing your key : e.g. ssh -i mykey.pem fedora@1.2.3.4 in a Linux or Mac OS X terminal, or a Windows putty client. The data will be found in the /opt/lsde directory.

You can run the cruncher programs with the data-directory, test-queries and some output-file as parameters, e.g.:

./cruncher /opt/lsde/dataset-sf100 queries-test.csv out.csv

Afterwards, you may inspect the out.csv file with the results.

Code and Data

There is a GitHub repository that contains example code to get you started. The cruncher program reads a binary representation of the CSV files and a query set and produces the required output in a rather straightforward (read: slow) manner.

The example solution goes through the persons P sequentially, counting how many of the artists A2,A3,A4 are liked as the score and for those with score>0, visits all persons F known to P. For each F (note that touching F has a randomly access pattern), it checks on equal location and whether F already likes A1, and whether F also knows P (because friends are people who mutually know each other, in this social network). If all this succeeds (score,P,F) is added to a result table.

There is also a loader program, which converts the CSV files into the binary representation. There are two versions of the dataset available online, "sf100" (about 220 MB) and "sf3000" (5.3 GB). We have already ran the loader program on both of these, producing .bin files. The smaller dataset is for fast experiments, but the larger one will be used to measure your program's performance.

The queries-test.csv file contains several test query sets, with the structure Query.id|A1|A2|A3|A4|D1|D2 as explained above:

1|2811|1990|5183|1749|2015-04-09|2015-05-09
2|11254|143|2807|2799|2015-04-30|2015-05-30
3|7517|568|2820|6945|2015-07-30|2015-08-29
4|782|2008|1022|8|2015-08-27|2015-09-26
5|206|2837|808|7509|2015-04-23|2015-05-23
...

The cruncher program will create the out.csv file, with the structure Query.id|Score|P|F:

1|2|23749085|23166928
1|2|50823520|7696632439459
1|2|3298594899508|16492734279256
1|2|5497600779230|12094670330460
1|2|6597123132851|17592239440354
...

You can verify that the program created the correct output by comparing it with the file queries-test-output-sf100.csv and queries-test-output-sf3000.csv depending on which dataset you used:

diff out.csv queries-test-output-sf100.csv

If this outputs nothing, all is well.

Benchmarking

There is an automatic testing infrastructure that automatically takes your code from your git URL, compiles it, runs it (first the reorganiser (if present), then the ten test queries, then 10 benchmark queries). For the benchmark queries, the median query time is recorded. There is a timeout, if everything takes longer than 20 minutes, the run is a failure. Further, the query results need to be correct. We provide you with 10 query parameters and correct answers that are representative of (but not equal to) the secret 10 query parameter sets that we use for performance evaluation and correctness testing.

The benchmarking process works as follows. You register your public GIT repository URL at the leaderboard by sending email to Peter Boncz (boncz at cwi dot nl). It will then start polling that repository. Once something is checked in, it will:

Checkout your code from GIT
Run make
If no binary named cruncher is produced, abort
If a binary named reorg is produced, it is run with a single parameter, the folder name where the binary data of the sf3000 dataset is stored. This is your possibility to reorganize the binary data.
Run the ten test queries using the binary named "cruncher" with three parameters as described above
Compare the output with the reference output. If there are discrepancies, abort
Run the 10 benchmark queries and record the time
Record the median runtime and note on the leaderboard

Please note that we run these tests in a virtual machine, which is almost identical to the Quickstart VM detailed above. So 1 GB of main memory and a single CPU is all that's available there. Beware that the test environment runs on a magnetic (spinning) disk. So if you would run your VM on a laptop with SSD drive, your machine could be much faster than the benchmark environment. The Amazon EC2 micro instances use an emulated disk, which might get cached in a RAM cache of the EBS Amazon storage layer, and could therefore potentially also be faster than the benchmarking environment.

Binary Dataset structure

The example loader program produces a binary representation of the CSV data, that throws away the unused columns and adds to "person" offset columns into the "likes" and "knows" arrays, so the graph becomes easier to navigate. The "knows" data in this binary form also identifies persons by their offset rather than by their person-ID - for the same reason. This dataset is considerably smaller than the original CSV files, therefore we chose to distribute the sf3000 dataset inside the VM in this binary form and not as CSV files.

Your reorg program could work on this binary data format, rather than the CSV files. Therefore, we explain its structure here.

The binary representation of the datasets consists of three files, person.bin, interest.bin and knows.bin. These are direct representations of the in-memory layout of C arrays of the respective types, hence they can be memory-mapped directly (see cruncher.c for details).

person.bin : Array of struct Person (see utils.h for its exact specification)
- person_id: Person.id from CSV
- birthday: Birthday in short representation (MMDD)
- location: Location id
- knows_first: Array offset into knows.bin
- knows_n: Number of entries in knows.bin for this person
- interests_first: Array offset into interest.bin
- interest_n: Number of entries in interest.bin for this person
interest.bin : Array of unsigned short containing interest ids
knows.bin : Array of unsigned int, contains array offset into person.bin

Peter Boncz · Hannes Mühleisen