by Peter Boncz (email@example.com)
important: You must work in couples of two persons for the assignment!
DEADLINE FOR ASSIGNMENT: APRIL 30, 2010
Theme: Architecture-Conscious Algorithms
We will work in a terminal room.
- bench.c a simple C program that performs two microbenchmarks: random fetch, and random chase.
- bench.sh a simple shell script that tests these on different memory sizes: from 256 (4 byte) integers up to 32M integers.
- bench.gp a gnuplot script to visualize the results
Examine the program (try to understand the lookup functions, at least). It creates a random permutation of numbers [0..n)
that is used both for (random access) pointer chasing, as for random table value lookup. Both of these are typical
database activities and occur in hashing (chasing a bucket-chain) resp. positional join.
Compile the program with maximum optimization (gcc -O6 bench.c); the resulting program will simply be in a.out
Run the bench.sh script. Pass the Mhz of your CPU as its only parameter (eg. 2000 for a 2GHz CPU). You can see the speed
of your CPU in /proc/cpuinfo. You must do this correctly for the results to be a correct count of the cycles per
per iteration (memory access).
Running the script will produce chase.dat and fetch.dat. Visualize with: gnuplot -persist bench.gp
A laptop must be set to maximum power setting for the experiment to make most sense.
What do these graphs tell you about the memory hierarchy in your machine (how many cache levels, and their sizes)?
How many concurrent memory loads do you think the fetch experiment generates on your machine, and why do you think this? (hint: think about what is different in chase and fetch regarding the CPU's ability to speculate ahead)
There is a program that implements the radix-cluster and radix-decluster algorithms, using them for fetching columns after a join.
Whereas for one side of the join (typically the outer side) one can use a naive fetch strategy, since the join result respects the order
of the outer relation, the situation is more difficult for the columns of the other (inner) join relation. Here, the matching keys
are present in the join result out-of-order, Radix-Cluster can first be used to re-order the tuples on keys to make fetching cache-friendly.
In doing so the routine fetchjoin_clustered() also attaches the desired position in the end result. Subsequently, the fetched (desired-position,val)
tuples can be fed into radix-decluster to put them into the desired position.
This program has too many options already -- with apologies!
usage: a.out datasize-bits MHz windowsize-bits verbose-ntuples ncols (pass-bits)*
Note: modern Intel architectures that allow many outstanding misses (see fetch vs chase) seem to have reduced the benefit of radix-cluster/decluster
over naive fetching somehwat. Still, if one must project multiple columns and chooses good parameter settings, radix-cluster/decluster should
typically be faster than naive ("memory wall") fetching. Of course, this algorithm can also be used on the disk I/O level, but we won't be going
there on this occasion.
- datasize-bits: the log2 of the problem size
- MHz: the MHz of your CPU. Setting this correctly means that the reported number is a number of cycles spent per tuple.
- windowsize-bits: Radix-Decluster inserts into an insertion-window, of which you specify the size (as the log2 of the number) here.
NOTE If this size is set to 0, a different experiment is performed!! This experiment just performs NAIVE fetchjoins
for all projection columns without radix-* algorithms.
- verbose-ntuples: if non-zero, the program displays a table with the first verbose-ntuples input and result values
- ncols: typically a join must take care of more than one projection column. This is relevant, as the radix-cluster needs to be
done only once; whereas fetchjoin/radix-decluster must be done for all columns. In other words, more the projection columns there are, the better amortization is obtained for the radix-cluster phase.
- pass-bits*: one or more specifications for radix-cluster telling it in how many passes to cluster, and how many bits to use for each pass. If you pass one number here, single-pass radix-cluster is performed, if you pass three numbers, three-pass radix-cluster is performed with each pass using the indicated number of bits.
Please examine the algorithm (and the presentation) and provide a short explanation and rationale behind the radix-cluster and radix-decluster algorithms.
Recall that radix-cluster deteriorates when the fan-out is too large and then needs multiple passes.
Recall that radix-decluster needs an insertion-window that fits the cache yet the window needs to be a multiple of the number
of clusters in order to ensure that each round multiple (sequential) values from the same cluster get inserted, thus generating
efficient bursty multi-cacheline (sequential) access in each cluster.
You may note that the radix-cluster implementation uses recursion to sub-cluster one-by-one each found cluster after each pass. An alternative would
have been to use a number of sequential passes (no recursion needed). What could be the advantage of using recursion here?
Focus on the setting with datasize-bits between 20 and 25 and ncols=8. Determine the optimal settings on your machine. Compare with the
naive (non radix-*) approach, by setting the windowsize=0 (HACKy interface, sorry).
Create a formula that derives the optimal #passes, bits per passes and windowsize given (N=problemsize, C=cache-size-in-bytes, W=data-item-width-in-bytes, and P=number of projection columns)
Here, we focus on a single cache level (most likely L2).