Since a real AQM performs a considerable amount of I/O, we added to our benchmark a `realistic' amount of reading input data from and writing output to disk. With this code we studied the I/O performance of the Cray T3E for our AQM. We implemented the I/O both in a master/slave approach, i.e., one PE performs all necessary I/O and takes care of the distribution and gathering of the data, and using parallel I/O.
We ran our benchmark on a 32x32 horizontal grid with 32 layers
in the vertical (so with concentration vectors of dimension [32,32,32,66]).
For the explicit message passing we make use of PVM routines.
As the normalizing computational unit we take the performance
on 1 CPU of the C90 (compiler option -Zv). There our benchmark runs at a
speed of 500 Mflop/s, which is half the peak performance.
The table first shows the performance
on multiple processors of the C90, once for the complete model using autotasking
to divide the workload over the processors, i.e., parallelizing at loop level,
and once using the PVM program.
On the Cray T3E the scalability is even superlinear with
the number of processors (due to cache effects), but here the
performance on 1 PE is only 4% of the peak performance.
One can calculate from the figures in the table that 16 PEs of the T3E
are needed to outperform 1 processor of the C90.
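As a rough check (assuming the 300 MHz T3E with a theoretical peak of 600 Mflop/s
per PE, a figure not given above):
\[
  0.04 \times 600 \approx 24 \ \mathrm{Mflop/s\ per\ PE},
  \qquad
  \frac{500}{24} \approx 21 ,
\]
so at the single-PE rate roughly 20 PEs would be required; the superlinear
scaling mentioned above brings this in line with the break-even at 16 PEs
found from the table.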
Experiments we performed on a 64x64 horizontal grid show that the
scalability with the model size is also perfect, i.e., a run on N PEs for
the 64x64 grid is as expensive as a run on N/4 PEs for the 32x32 grid.
Finally, the table shows the results for a cluster of workstations:
SGI O2's coupled via an ATM network.
Here one can clearly see the effect of the use of `virtual' memory:
the memory of these workstations is not large enough to
hold the complete model, so the machine spends more time swapping than calculating,
resulting in a wall clock time that is four times as large as the CPU time
(with a clear day/night rhythm: during daytime it is about 6 times as large).
The speed-up in wall clock time using two O2's instead of one is therefore
huge:
a factor 5.6. The performance is then approximately the same as for 1 PE of the
T3E. However, the scalability of a cluster of workstations is much poorer.
The CPU time can also sometimes decrease superlinearly due to cache effects,
but this does not show up in the wall clock time.
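As an aside, a minimal sketch of how wall clock time and CPU time can be
compared on such a workstation (standard POSIX timing calls; the workload is a
placeholder, not the AQM itself):

\begin{verbatim}
#include <stdio.h>
#include <sys/time.h>      /* gettimeofday: wall clock time     */
#include <sys/resource.h>  /* getrusage: user + system CPU time */

/* Wall clock time in seconds. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

/* CPU time (user + system) of this process in seconds. */
static double cpu_seconds(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1.0e-6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec * 1.0e-6;
}

int main(void)
{
    double w0 = wall_seconds(), c0 = cpu_seconds();

    /* Placeholder workload; in the benchmark this is one model run. */
    volatile double s = 0.0;
    long i;
    for (i = 0; i < 100000000L; i++) s += 1.0 / (i + 1.0);

    double w1 = wall_seconds(), c1 = cpu_seconds();
    /* A wall/CPU ratio well above 1 signals swapping or blocking I/O. */
    printf("wall = %.2f s, cpu = %.2f s, ratio = %.2f\n",
           w1 - w0, c1 - c0, (w1 - w0) / (c1 - c0));
    return 0;
}
\end{verbatim}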
There are many possible strategies to perform I/O on a parallel computer.
For the input of the meteo data we chose the simplest approach:
read the input on a single PE.
Clearly, this is the most portable approach:
it works on every parallel architecture and it is not necessary to
split the data into multiple files before reading.
The disadvantage is that it does not scale. Apart from reading the data,
we also have to reorder them according to the domain decomposition and
scatter them across the PEs.
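A minimal sketch of this single-PE input scheme using the PVM C interface
(the program name, file name, message tag, block decomposition, and array sizes
are illustrative assumptions; in the master/slave approach the portable output
mirrors this pattern with a gather instead of a scatter):

\begin{verbatim}
#include <stdio.h>
#include <pvm3.h>

#define TAG_METEO 10            /* illustrative message tag            */
#define NP        8             /* illustrative number of slave PEs    */
#define NLOC      (32 * 32 * 32 / NP)

int main(void)                  /* same executable for master and slaves */
{
    int parent = pvm_parent();

    if (parent == PvmNoParent) {
        /* Master PE: read the full field, reorder per subdomain, scatter. */
        static double field[32 * 32 * 32];
        static double buf[NLOC];
        int tids[NP], p, i;

        pvm_spawn("aqm_bench", NULL, PvmTaskDefault, "", NP, tids);

        FILE *fp = fopen("meteo.dat", "rb");   /* hypothetical input file */
        if (fp == NULL) { perror("meteo.dat"); return 1; }
        fread(field, sizeof(double), 32 * 32 * 32, fp);
        fclose(fp);

        for (p = 0; p < NP; p++) {
            /* Reorder according to the domain decomposition; here simply a
               contiguous block per PE, the real code permutes the indices. */
            for (i = 0; i < NLOC; i++)
                buf[i] = field[p * NLOC + i];

            pvm_initsend(PvmDataDefault);
            pvm_pkdouble(buf, NLOC, 1);
            pvm_send(tids[p], TAG_METEO);
        }
    } else {
        /* Slave PE: receive only its own subdomain. */
        static double local[NLOC];
        pvm_recv(parent, TAG_METEO);
        pvm_upkdouble(local, NLOC, 1);
        /* ... advection/chemistry on the local subdomain ... */
    }

    pvm_exit();
    return 0;
}
\end{verbatim}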
For the output we tried three different approaches: portable output from a
single PE, parallel synchronous output with every PE writing to its own
private file, and parallel asynchronous output.
For the input we see anomalous behavior on 4 processors, where reading the data takes more time than the combined time for input and parallel asynchronous output. We have no explanation for this yet. In any case, the time needed for the input is clearly not significant.
For the portable output we see a combined input/output time of about 70 s. Since this combined I/O time on 64 PEs is almost three times as expensive as the CPU time, we tried the two other approaches.
For the parallel output every processor writes to its own private file, so we might hope for a parallel speed-up limited by the minimum of the number of PEs and the number of disks, in our case eight. However, we see a performance gain of only 30-50%, probably due to the synchronous I/O and the resulting I/O contention. On 64 PEs the combined I/O time is still four times as expensive as the CPU time.
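A minimal sketch of this per-PE output scheme (the file naming convention and
the size of the local concentration block are illustrative assumptions):

\begin{verbatim}
#include <stdio.h>

/* Illustrative local block: one PE's part of the concentration array,
   assuming 64 PEs. */
#define NLOC (32 * 32 * 32 * 66 / 64)

/* Each PE writes its own subdomain to a private file named after its PE id.
   The write is synchronous: the PE blocks until the data are on disk,
   which is where the observed I/O contention comes from. */
static void write_private_file(int pe, const double *conc, long n)
{
    char name[64];
    sprintf(name, "conc_out.%03d", pe);   /* hypothetical naming convention */

    FILE *fp = fopen(name, "wb");
    if (fp == NULL) { perror(name); return; }
    fwrite(conc, sizeof(double), (size_t)n, fp);
    fclose(fp);
}

int main(void)
{
    static double conc[NLOC];   /* local concentrations, zero in this sketch */
    int pe = 0;                 /* in the real code: derived from the PE id  */
    write_private_file(pe, conc, NLOC);
    return 0;
}
\end{verbatim}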
It can be seen that the output and the computations plus the necessary communications do not overlap completely. The reason for this is unclear: judging from the amount of I/O and the sustained average I/O rates, complete overlap should have been possible. In a separate experiment, in which we only wrote the output files without doing any computation or communication at all (just executing a `sleep' statement), we were able to overlap the output and the `computations' completely.
Again we might have hoped for a parallel speed-up limited by the minimum of the number of PEs and the number of disks, but we do not see a performance gain going from 4 to 8 PEs. We also see that the combined I/O time increases slightly as the number of PEs increases, which must be due to I/O contention. On 64 PEs the combined I/O time is about as large as the CPU time. This seems acceptable, especially since 64 PEs give rise to small subdomains, i.e., the largest amount of I/O relative to the computation. Moreover, this is the most efficient of the three ways to handle the output.
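On the T3E the asynchronous output used the machine's own asynchronous I/O
facilities; purely as an illustration of the idea of posting a write and
computing while it completes, a sketch using POSIX aio_write (an assumed
stand-in, not the routines actually used):

\begin{verbatim}
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NLOC (32 * 32 * 32 * 66 / 64)   /* illustrative local block size */

static double conc[NLOC];               /* this PE's concentrations      */

int main(void)
{
    /* Open this PE's private output file (hypothetical name). */
    int fd = open("conc_out.000", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Post the write and return immediately; the data are transferred to
       disk while the PE continues computing. */
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = conc;
    cb.aio_nbytes = sizeof conc;
    cb.aio_offset = 0;
    if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

    /* ... the next advection/chemistry step overlaps with the output ... */

    /* Before reusing the buffer for the next output step, wait for
       completion; incomplete overlap shows up as time spent here. */
    while (aio_error(&cb) == EINPROGRESS)
        ;                                /* or block with aio_suspend()  */
    if (aio_return(&cb) < 0) perror("aio write failed");

    close(fd);
    return 0;
}
\end{verbatim}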
The final conclusion is probably the most disappointing one: although the Cray T3E has a scalable I/O architecture, the I/O does not scale. Of course this is due to the limited number of disks, but on the other hand we should have seen a speed-up going from 4 to 8 PEs when doing output.