We provide a ready-made Linux virtual machine for the VirtualBox system for download (9 GB). Once downloaded, this machine can be imported into VirtualBox and started. The VM contains the sample datasets (in the folder /opt/lsde) and sample code (in /home/admin/leaderboard-demo). To log in, use the username "admin" and the password "randomaccess". The virtual machine is also configured to open up SSH access on port 2224. The data can be found in
Those using Amazon should register for the free tier (credit card info required, but all of this should cost $0), then in the management console go to EC2 and launch a "Community AMI" instance. Searching for "lsde" in the Frankfurt datacenter, you will find the lsde-4 image, which can be launched on a micro instance for free. To do so, you will need to create an SSH key pair and launch the instance. When it is running, note down the IP address and SSH into it using the username "fedora" and passing your key, e.g.:
ssh -i mykey.pem fedora@<ip-address> in a Linux or Mac OS X terminal, or in a PuTTY client on Windows. The data will be found in the
You can run the cruncher program with the data directory, the test queries, and an output file as parameters, e.g.:
./cruncher /opt/lsde/dataset-sf100 queries-test.csv out.csv
Afterwards, you may inspect the out.csv file with the results.
There is a GitHub repository that contains example code to get you started. The cruncher program reads a binary representation of the CSV files and a query set and produces the required output in a rather straightforward (read: slow) manner.
The example solution goes through the persons P sequentially, counting how many of the artists A2, A3, A4 each P likes as the score, and for those with score > 0, it visits all persons F known to P. For each F (note that touching F has a random access pattern), it checks whether F is in the same location, whether F already likes A1, and whether F also knows P (because friends are people who mutually know each other in this social network). If all these checks succeed, (score, P, F) is added to a result table.
There is also a loader program, which converts the CSV files into the binary representation. There are two versions of the dataset available online, "sf100" (about 220 MB) and "sf3000" (5.3 GB). We have already run the loader program on both of these, producing .bin files. The smaller dataset is for fast experiments; the larger one will be used to measure your program's performance.
The queries-test.csv file contains several test query sets, with the structure Query.id|A1|A2|A3|A4|D1|D2 as explained above:
1|2811|1990|5183|1749|2015-04-09|2015-05-09
2|11254|143|2807|2799|2015-04-30|2015-05-30
3|7517|568|2820|6945|2015-07-30|2015-08-29
4|782|2008|1022|8|2015-08-27|2015-09-26
5|206|2837|808|7509|2015-04-23|2015-05-23
...
The cruncher program will create the out.csv file, with the structure Query.id|Score|P|F:
1|2|23749085|23166928
1|2|50823520|7696632439459
1|2|3298594899508|16492734279256
1|2|5497600779230|12094670330460
1|2|6597123132851|17592239440354
...
You can verify that the program produced the correct output by comparing it with the file queries-test-output-sf100.csv or queries-test-output-sf3000.csv, depending on which dataset you used:
diff out.csv queries-test-output-sf100.csv
If this outputs nothing, all is well.
There is an automatic testing infrastructure that takes your code from your git URL, compiles it, and runs it (first the reorganiser, if present, then the 10 test queries, then 10 benchmark queries). For the benchmark queries, the median query time is recorded. There is a timeout: if everything together takes longer than 20 minutes, the run is a failure. Further, the query results need to be correct. We provide you with 10 query parameter sets and correct answers that are representative of (but not equal to) the secret 10 query parameter sets that we use for performance evaluation and correctness testing.
The benchmarking process works as follows. You register your public git repository URL at the leaderboard by sending an email to Peter Boncz (boncz at cwi dot nl). The leaderboard will then start polling that repository. Once something is checked in, it will:
- check out and compile the code; if no executable named cruncher is produced, the run is aborted
- if an executable named reorg is produced, run it with a single parameter: the folder name where the binary data of the sf3000 dataset is stored. This is your opportunity to reorganise the binary data.
Please note that we run these tests in a virtual machine that is almost identical to the Quickstart VM detailed above, so 1 GB of main memory and a single CPU is all that is available there. Beware that the test environment runs on a magnetic (spinning) disk, so if you run your VM on a laptop with an SSD, your machine could be much faster than the benchmark environment. The Amazon EC2 micro instances use an emulated disk, which might get cached in a RAM cache of the Amazon EBS storage layer, and could therefore also be faster than the benchmarking environment.
The loader program produces a binary representation of the CSV data that throws away the unused columns and adds offset columns into the "likes" and "knows" arrays to "person", so the graph becomes easier to navigate. The "knows" data in this binary form also identifies persons by their offset rather than by their person-ID, for the same reason. This dataset is considerably smaller than the original CSV files, which is why we chose to distribute the sf3000 dataset inside the VM in this binary form and not as CSV files.
Your reorg program could work on this binary data format rather than on the CSV files, so we explain its structure here.
The binary representation of the datasets consists of three files: person.bin, interest.bin, and knows.bin. These are direct representations of the in-memory layout of C arrays of the respective types, hence they can be memory-mapped directly (see cruncher.c for details).
person.bin contains an array of person records (see utils.h for its exact specification), with the following fields:
person_id: Person.id from CSV
birthday: Birthday in short representation (MMDD)
location: Location id
knows_first: Array offset into knows.bin for this person
knows_n: Number of entries in knows.bin for this person
interests_first: Array offset into interest.bin for this person
interest_n: Number of entries in interest.bin for this person
interest.bin is an array of unsigned short containing interest ids.
knows.bin is an array of unsigned int, each entry containing an array offset into person.bin (persons are identified by offset rather than by person-ID, as noted above).