Thanks to Kai Schmerer (mailto:Kai_Schmerer@zdnet.com) of ZDNet.de
(http://www.zdnet.de), I got some "hot" results from the Calibrator
running on Pentium 4/1500-based machines with three different memory
configurations:

- P4/1500 with PC800 Rambus Single Channel on MSI 850 pro
- P4/1500 with PC800 Rambus Dual Channel on Asus P4T
- P4/1500 with PC600 Rambus Dual Channel on Asus P4T

The results are available at
http://www.cwi.nl/~manegold/Calibrator/v0.9e/Pentium_4-1500-Rambus.results/

Additionally, Kai also provided me with some results on an
Athlon-Thunderbird/1000 with PC1600-Double Data Rate SDRAM (Gigabyte 7DX)
(see http://www.cwi.nl/~manegold/Calibrator/v0.9e/Thunderbird-1000-DDR.results/ )

Here's some discussion of the results:

P4/1500 with PC800 Rambus Single Channel on MSI 850 pro
=======================================================

- The delay-loop for measuring the miss-latency takes only 51 cycles
  for 100 pointer/integer-additions instead of the usual 100 cycles on
  PIIIs and Athlons. Maybe we see the benefits of the "double-pumped
  ALU" here.

- L1 access latency is just 2 cycles, while it is 3 cycles on the PIIIs
  and on the Athlons.

- L2 access (L1 miss) latencies look as follows:

                        miss-latency   replace-time
  PIII-Katmai             19 cycles      19 cycles
  Athlon-Classic          19 cycles      19 cycles
  PIII-Katmai-Xeon        15 cycles      15 cycles
  Celeron-Mendocino        8 cycles       8 cycles
  Duron                    8 cycles      17 cycles
  Athlon-Thunderbird       8 cycles      17 cycles
  PIII-Coppermine          4 cycles       4 cycles
  Pentium-4               24 cycles      16 cycles

- Memory latency of 325ns ("replace-time") and 468ns ("miss-latency"),
  respectively, is two to three times as large as with PC100-SDRAM on
  BX-boards (replace-time: 123ns to 156ns; miss-latency: 138ns to
  170ns). An ASUS CUSL2 (i815e) with PC133-SDRAM even achieves
  105ns/105ns! Athlon boards achieve 222ns/215ns (ASUS K7) and
  180ns/180ns (ASUS A7V), respectively.

- For main-memory reads, "miss-latency" is larger than "replace-time".
  As with SDRAM, a read from "cold" memory seems to take significantly
  longer than a read from "hot" memory; however, the difference between
  "hot" and "cold" is almost 50% here (i.e., with Rambus), while it is
  only about 10% with SDRAM.

- I cannot explain (yet?) why, with L2 hits, "miss-latency" is also
  about 50% larger than "replace-time". On PIIIs and on the original
  Athlon, both are equal. On the Thunderbird and the Duron,
  "replace-time" is about twice as big as "miss-latency"; I attribute
  this to the doubled bus-traffic due to the exclusive L2 cache.

P4/1500 with PC800 Rambus Dual Channel on Asus P4T
==================================================

- Using the second Rambus Channel (or due to the different board?),
  memory latency drops by ~50%, and a "replace-time" of 163ns now almost
  reaches SDRAM-performance; the "miss-latency" of 215ns, however, is
  still significantly higher than with SDRAM.

- Further, we notice that with strides up to 32 Bytes, there is no
  difference between L2 hits (L1 misses) and memory accesses (L2
  misses). Maybe the Pentium 4's hardware prefetching is doing a pretty
  good job here. (Well, the Calibrator has a quite regular memory access
  pattern, but it walks "backwards" through the memory...) If hardware
  prefetching is working for strides up to 32 Bytes, why doesn't it also
  work for 64 and 128 Bytes?

- With strides up to 64 Bytes, "miss-latency" for memory accesses is now
  smaller than "replace-time". Why? With 128 Bytes, "miss-latency" is
  again larger than "replace-time" (cold/hot).

P4/1500 with PC600 Rambus Dual Channel on Asus P4T
==================================================

- Reducing the bus speed to 2/3 increases the memory latency by 1/3, as
  expected(?).

Athlon/1000 with PC1600-Double Data Rate on Gigabyte 7DX
========================================================

- The memory access "replace-time" of 163ns lies between those of the
  BX-boards and the other Athlon-boards.
- Surprisingly, "miss-latency" is smaller than "replace-time" not only for L2 hits, but also for memory accesses.
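
The figures above mix cycles (cache latencies) and nanoseconds (memory latencies). To compare them on one scale, here is a small sketch that is not part of the Calibrator. It converts between the two units and computes by how much "miss-latency" exceeds "replace-time". The helper names are made up for illustration, and the core clocks are assumed to be exactly the nominal 1500 MHz and 1000 MHz.

```python
# Nominal core clocks in MHz. Assumption: the real clocks match the
# model names (P4/1500, Athlon/1000) exactly.
P4_MHZ = 1500.0
ATHLON_MHZ = 1000.0


def cycles_to_ns(cycles, clock_mhz):
    """Convert a latency in CPU cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


def ns_to_cycles(ns, clock_mhz):
    """Convert a latency in nanoseconds to CPU cycles."""
    return ns * clock_mhz / 1000.0


def excess(miss_latency, replace_time):
    """Fraction by which miss-latency exceeds replace-time."""
    return (miss_latency - replace_time) / replace_time


if __name__ == "__main__":
    # P4 L2 latencies from the table above, expressed in ns.
    print("P4 L2 miss-latency [ns]:", cycles_to_ns(24, P4_MHZ))
    print("P4 L2 replace-time [ns]:", cycles_to_ns(16, P4_MHZ))
    # Single-channel PC800 memory latencies, expressed in cycles.
    print("P4 mem replace-time [cycles]:", ns_to_cycles(325, P4_MHZ))
    print("P4 mem miss-latency [cycles]:", ns_to_cycles(468, P4_MHZ))
    # Cold vs. hot gap for L2 hits and for main memory.
    print("L2 excess:", excess(24, 16))
    print("memory excess:", excess(468, 325))
    # Dual channel vs. single channel replace-time.
    print("dual/single replace-time:", 163 / 325)
    # Athlon/1000 with DDR: replace-time in cycles.
    print("Athlon mem replace-time [cycles]:", ns_to_cycles(163, ATHLON_MHZ))
```

With these assumptions, the 24-cycle L2 miss-latency on the P4 corresponds to 16ns, and the 325ns single-channel replace-time corresponds to 487.5 cycles. The memory excess comes out at about 0.44, which is the "almost 50%" gap mentioned above.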