
From surfnet.nl!newsfeed2.news.nl.uu.net!sun4nl!oleane.net!oleane!proxad.net!enews.sgi.com!fido.engr.sgi.com!news.corp.sgi.com!mash.engr.sgi.com!mash Sun Sep  3 07:11:57 MET DST 2000
Article: 9415 of comp.arch
Path: cwi.nl!surfnet.nl!newsfeed2.news.nl.uu.net!sun4nl!oleane.net!oleane!proxad.net!enews.sgi.com!fido.engr.sgi.com!news.corp.sgi.com!mash.engr.sgi.com!mash
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: NUMAflex essay [extremely long, ~30 pages]
Date: 3 Sep 2000 01:17:38 GMT
Organization: Silicon Graphics, Inc.
Lines: 1734
Message-ID: <8os8ri$p9j$3@murrow.corp.sgi.com>
NNTP-Posting-Host: 163.154.3.73

A while ago, in response to some questions about SGI's NUMAflex,
I said I'd write up some info on the architecture & especially on the
design process that led there, and alternatives considered but not taken.
I only had time to write all this once, for multiple uses, so please
excuse the occasional marketing-stuff that snuck in amongst the
technology and history.
=============================================================================

NUMAflex Modular Design Approach
A Revolution in Evolution
John R. Mashey
On behalf of the team, I hope!
8/30/00

0. INTRODUCTION
1. BACKGROUND AND OVERALL CHRONOLOGY
	1.1 HISTORICAL BACKGROUND
	1.2 ORIGIN 2000 (LATE 1996)
	1.3 "FLINTSTONES" PROJECT (AS OF JANUARY 1998)
2. SGI'S NUMAFLEX (TM) DESIGN APPROACH - BRICK-AND-CABLE MODULARITY
3. ORIGIN 3000 + ONYX 3000 - FIRST NUMAFLEX FAMILY
4. THE NEXT FAMILY - ITANIUM-BASED NUMAFLEX
5. FURTHER MIPS & IA-64 NUMAFLEX FAMILIES (2001-2006)
6. NUMAFLEX COMMENTARY
7. SUMMARY
8. ACKNOWLEDGEMENTS
9. REFERENCES

0. INTRODUCTION

SGI's "NUMAflex" (TM) modular design approach builds computer families
with unusual scalability and evolvability characteristics. It partitions
CPU, I/O, and other functions into small, 19" rackmount computing "bricks",
then combines them via efficient, high-speed cache-coherent cabled interconnects,
rather than large backplanes.

This particular approach enables a fundamental change in the way *scalable*
computers are designed and evolved, implying an unusual degree of interaction
among the development process, system architectures,
innovative physical packaging, and the resulting system products.

For mid-range and high-end systems, the typical development process at
most computer vendors tends to be a synchronous, "big-bang" process that
produces a complete new system design about every 3-4 years, followed by modest
upgrades until the next major design.  People upgrade CPUs and disks often,
I/O busses rarely, and backplanes hardly ever.  

The NUMAflex development process is a more asynchronous,
"continuous-creation" approach that offers frequent moderate
improvements, trying to incorporate the faster development cycle of clustered,
low-end systems into the design of larger scalable systems.  The NUMAflex
roadmap includes multiple, and often major, upgrades every year.

Most larger computers have big backplanes and strong coupling of multiple
technologies, such as interconnects and I/O. 

By contrast, the NUMAflex design approach eschews big backplanes in favor
of small bricks and high-speed interconnect cables.
The NUMAflex approach uses specific interface standards and components
shared among NUMAflex families, combined with strategies for evolving them.

The first NUMAflex family is the MIPS/IRIX-based
Origin 3000 & Onyx 3000, soon to be followed by an IA-64/Linux family.
Unlike past SGI families (Power Series, Challenge, Origin 2000),
the new family is *not* only another set of new products, but rather,
the first of several families that share a common design approach, and quite
often, common physical components.

NUMAflex designs have several obvious analogies.  Physically, they resemble
modern modular consumer audio systems.  Philosophically, they follow the
classic UNIX shell-and-pipeline approach that handles a huge variety of
problems by using common connections among a set of modular tools.

At SGI, NUMAflex resembles the recent strategies for IRIX and MIPS CPUs.
IRIX has shifted from "big-bang" releases to quarterly releases
with much smoother transitions, and happier customers,
and MIPS-chips have thankfully reverted to a more disciplined
evolutionary process in place of multiple big-bang designs.

1. BACKGROUND AND OVERALL CHRONOLOGY

1.1 HISTORICAL BACKGROUND

TIMELINE AND CLOSE-RELATIVE SYSTEM GENEALOGY

1988		1993		1996		2000
		Cray T3D------>	T3E-----------\						
						NUMAflex (Origin 3000, ....)						
     	 -->DASH-------------->	Origin2000-----/
	/			^	
PowerSeries--->	Challenge------/ 

Power Series & Challenge were classic backplane-bus-based SMPs. Stanford's
DASH prototype [LEN95a] combined 4P SGI 4D/340s with directory controllers
and mesh network to build a research ccNUMA system.  Some key DASH people (such
as Dan Lenoski and Jim Laudon) joined SGI, and then worked on the Origin 2000.
The NUMAflex effort in general, and Origin 3000 in particular, included alumni
of both Origin 2000 and T3E. Although more closely related to the Origin 2000,
it was influenced by both predecessors, and it also acquired some radically
different characteristics.

T3D & T3E are shared-memory MPP (Massive Parallel Processing),
NUMA (Non-Uniform Memory Access) systems.  Each CPU can address all memory,
using normal cached access to local memory, and remote memory is accessed
using uncached memory references, managed via software conventions and
additional special hardware.  T3s were designed to retain the
high CPU counts of message-passing MPPs, such as TMC CM-5s, but offer
more convenient global shared memory.

In common usage, if a system includes hardware to maintain synchronization among
all of the caches in a system, as seen by ordinary memory accesses from any CPU,
it is called *cache-coherent*. Although it would be more precise to say
hardware-cache-coherent (as opposed to software-cache-coherent), people don't.
Hardware technology is only now getting good enough to build reasonable systems
that have both large T3E sizes and (hardware) cache-coherency; lacking
this hardware, T3s are not normally labeled cache-coherent.

Origin 2000 and Origin 3000 are cache-coherent NUMA (ccNUMA) systems. Origin 3000
is SGI's *second* ccNUMA product, but is (properly) labeled a *third-generation*
ccNUMA, acknowledging Stanford DASH's important contribution as the first
along this general line of development.

TERMINOLOGICAL CONFUSION AND UNCONSCIOUS OBFUSCATION
It is common in computing, as elsewhere, that the meaning of a term changes
over time, causing confusion among the unwary.  In particular, a term
used to make a distinction may change meaning if that distinction becomes
less important, and a general term for a large set may acquire a more
specific connotation if one member of the set is especially successful.

MPP (Massively Parallel Processing) has generally meant large systems with
many CPUs, but it often takes on the connotation of message-passing,
since the early MPPs used that approach.

SMP once meant Symmetric MultiProcessing (all CPUs have equal capabilities),
as opposed to Asymmetric MultiProcessing (where perhaps only one CPU could
run the operating system, or only one had I/O device attachments).
Over time, system design favored SMPs, so the symmetric-vs-asymmetric
distinction became less useful.

Later, SMP tended to mean Shared-memory Multiprocessing, as opposed to
message-passing or shared-nothing systems.  By *this* definition,
most multiprocessors are SMPs, whether built around a common shared bus
(many minicomputers and microprocessor-based servers) or a fixed set of
crossbars (as in many mainframes and vector supercomputers, and in the
Sun E10000, HP N-series, and various IBM RS/6000s).  With this usage,
Origin 2000s and other ccNUMAs should be called SMPs, but ....

Later yet, the term SMP has meant "SMP, but with bus or small crossbar",
as opposed to distributed shared memory systems (NUMA, ccNUMA, COMA).
Even more specifically, since most SMPs are small bus-based SMPs, many
people use the term SMP to mean bus-based SMP.

There is related evolution in terms like NUMA and ccNUMA.
If all of the memories in an SMP have the same memory access times from
all CPUs, it is labeled UMA (Uniform Memory Access). Most small SMP
designs are UMAs, and any system that is Non-UMA is labeled NUMA.
NUMA systems with (hardware) cache-coherency are usually called ccNUMAs.
Any ccNUMA is a NUMA, but a NUMA need not have cache-coherency.

As happened with bus-based SMPs, the unadorned term NUMA is often meant
as ccNUMA, simply because most NUMAs in the world are actually ccNUMAs.

Finally, COMA (Cache-Only Memory Architecture) systems (like those of KSR)
are certainly NUMAs, and logically are ccNUMAs as well, since they do
have hardware cache-coherency. However, sometimes people place COMA on
the same level as ccNUMA, i.e., in COMA-vs-ccNUMA arguments.

The moral of all this is to ask people what they mean when they use these terms!

TOP-LEVEL TECHNICAL SUMMARY OF RELATED SYSTEMS
				Peak/raw
Year	System		Arch	Connect	#CPU	Max	I/O
			Type	MB/sec		Mem GB
1988	SGI PowerSeries	Bus SMP	128	2-8P	.256	SCSI, ENET, VME
1991	Stanford DASH	ccNUMA	2*60	48P	.256	(same)
1993	SGI Challenge	Bus SMP	1500	2-36P	16	SCSI, ENET, VME
1993	Cray T3D	Hybrid*	2*300	2048P	128	Attached
1996	SGI/Cray T3E	Hybrid*	2*500	2048P+	4TB	GigaRing (SCI variant)
1996	SGI Origin 2000	ccNUMA	2*800	2-512P	1TB	XIO, PCI (64b, 33MHz)
2000	SGI Origin 3000	ccNUMA	2*1600	2-512P	1TB	PCI (64b, 66MHz), XIO

(*Hybrid is as discussed above: like MPP in scalability, but with
shared-memory).

For consistency, peak/raw bandwidth numbers are given along one connection,
either the shared bus in an SMP, or along one connection in MPP or ccNUMA,
expressed as 2*N, since all the cases here are full-duplex.
Sustained numbers are 50-80% of peaks - YMMV - Your Mileage May Vary!

END OF THE ROAD FOR BIG-BUS SMPS

SMP bus bandwidth rose sharply from 1988 to 1993 - at least 10X
difference in peaks, but efficiency improved, so in the case above,
1988's sustained 60 MB/sec grew to 1993's 1200 MB/sec.

Unfortunately, from 1993-2000, 1200 MB/sec  only improved
by 2-3X (2600-3200 MB/sec) for big-bus (rackmount-width) SMPs,
falling far behind the growth in CPU performance, and
yielding the oft-made complaint:

	SMP busses don't scale

They used to, but they don't any more. Still, they managed to take over
a huge market, displacing the classic minicomputer architectures, and
even attacking the low end of the supercomputer business.  But now...

Big shared-bus backplanes have gotten about as wide (128-256 bit data)
and as fast (83-100MHz) as they are likely to get any time soon.
Skew and bus-loading problems make it more and more difficult to do much better,
especially in the larger systems, whose backplanes may have limited
clock-rates compared to their smaller siblings.  For example, Sun's E6500
bus runs at 90MHz, using the same boards as the smaller E3500, whose bus
achieves 100MHz.

This issue has led people to build MPPs (many), and then shared-memory ccNUMAs
of various flavors (KSR (COMA), Convex/HP, SGI, Sequent, DG, Compaq).
Vendors no longer appear to be building new big-bus SMPs.

Instead, one way or another, people use switch-based designs, with narrower
point-to-point interfaces running at much higher clock rates.  Busses that
appeared in 1996 have by now been scaled to 100MHz, while point-to-point designs
were already at 400MHz in 1996, and 800MHz in 2000.

Of course, mainframes and vector supercomputers have long used crossbar
switches, and various ccNUMAs and MPPs have used cabled memory interconnects,
although usually to connect fairly large hardware aggregates. Some early
ccNUMAs have used SCI (Scalable Coherent Interconnect), although this
appears to be slowly disappearing in more recent designs.  (SCI was an
ambitious, probably too ambitious, attempt to create industry-standard
mechanisms for ccNUMA and local networking. See [SCI00a].)
 
In any case, crossbar switches, and sometimes, high-speed cables, have
been propagating downward in price into the broader systems market.

THE SHAPE OF THINGS TO COME

Going forward, I conjecture that computing will be dominated by 2 types
of computers, whose block diagrams look relatively similar:

(a) COMMODITY SBC (SINGLE-BOARD COMPUTER) - memory, fixed-I/O, 1P, 2P or maybe 4P, 
often clustered.  Evidence is accumulating that 1-2P per shared-bus
seems to be a design sweet spot. Many CPUs and chipsets permit 4P/bus,
but quite often, it is found that 2-3P saturate the bus for many workloads.
Anyway, the most typical designs have 1-2P and 1-2 PCI busses
per memory-control ASIC:

	P	P
	|_______|
	    |
MEMORY-----ASIC
	    |
	   I/O

Of course, people sometimes integrate the controller ASIC with CPU,
and multiple CPUs/die and CPU+memory integrations are coming, but for
a while, the block diagram above is quite common.

(b) SCALABLE SYSTEM built from similar nodes - 1-2P per bus,
memory, I/O (either as part of node, or connection to remote I/O), and
some kind of ccNUMA port that integrates multiple nodes together in
a consistent shared-memory environment.  Much of the ASIC design can be
similar to (a), but needs the ccNUMA port, and extra logic to do
cache-coherence, and the memory must provide bits to hold the coherency
information (wider DIMMs or extra directory DIMMs are typical).

	P	P
	|_______|
	    |
MEMORY+----ASIC+-----ccNUMA interconnect
	    |
	   I/O

Hence, clearly a scalable system needs more hardware, and costs more/CPU
(to handle the domain of problems that want larger systems).  ccNUMA
designers vary wildly in preference, ranging from those who prefer
"light" nodes (2-4P) that resemble (or even are) commodity SBCs, to those
who prefer "heavy" nodes, using crossbars to connect 8-16P before going to
a ccNUMA interconnect.  The trend seems toward light nodes.
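
To get a feel for the cost of those directory bits, here is a small
back-of-the-envelope sketch (Python).  It assumes the textbook
full-bit-vector directory - one presence bit per node per memory line,
plus a few state bits - which is not necessarily how any particular
product stores its directory, but it shows why the overhead grows with
system size:

# Back-of-the-envelope storage cost of directory bits, assuming the
# textbook full-bit-vector scheme: one presence bit per node per memory
# line, plus a few state bits.  This is NOT any particular product's
# directory format; it just shows why the overhead grows with the
# number of nodes, and why wider or extra directory DIMMs appear.

def directory_overhead(nodes, line_bytes=128, state_bits=3):
    dir_bits = nodes + state_bits            # presence vector + state
    return dir_bits, dir_bits / (line_bytes * 8)

for nodes in (8, 64, 256):
    bits, ratio = directory_overhead(nodes)
    print(f"{nodes:3d} nodes: {bits:3d} directory bits per 128-byte line "
          f"({100 * ratio:.1f}% overhead)")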

Of course, many people build clusters of commodity computers,
or similar collections of non-shared-memory systems, like IBM SPs,
and these offer some of the same characteristics as NUMAflex designs,
but without producing arbitrarily-sized shared-memory systems,
which are the focus of the remainder of this writing.

DEVELOPMENT INTERVALS & "INFRASTRESS"
For whatever reasons, 3.5-4 years is a typical interval between one
top-of-the-line system and the next. For example, the following are typical:

--------SGI--------	--------Sun----		----DEC/Compaq---------
Year	System		Year	System		Year	System
1988Q4	PowerSeries		
1993Q1	Challenge	1992	SC2000		1992	DEC 10000
1996Q4	Origin 2000	1996	UEx000		1996	DEC 8400 (GS140)	
2000Q3	Origin 3000	2000?	next?		2000	Compaq GS320

The usual process creates chassis designs used for 4-6 years,
and allows for CPU board upgrades and disk upgrades.
I/O-bus upgrades happen occasionally, but are often awkward,
and serious interconnect upgrades are rare. Occasionally, modest
increments in bus bandwidth have been provided. In practice,
multiple technologies are coupled in such ways that it is difficult to
re-use very much between major generations.  Technologies change at
different intervals, and at different rates, and are not aligned:

Technology	Interval	rate
CPU		~year		1.6X	[noticeable speed grade each year]
Disk		~year		2X capacity (now; used to be slower)
DRAMs		~3 years	4X capacity, costs go down meanwhile
Chassis		~4-6 years	you bought it, you have it
I/O busses	varies		varies	[difficult to predict, political]

The different rates of change cause enough stress on system infrastructure
(and its designers!) that it inspired a talk I gave often, starting in 1997:
"Big Data and The Next Wave of InfraStress", of which an online version
can be found in [MAS99a].
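
To see why a 4-6-year chassis falls behind, just compound the rough
rates in the table over a chassis lifetime.  A small illustrative
calculation (the rates are the ballpark figures from the table,
nothing more precise):

# Compound the rough improvement rates from the table above over a
# typical chassis lifetime, to show how far CPUs, disks, and DRAM move
# while the chassis stands still.  Rates are the ballpark figures from
# the table, nothing more precise.

RATES_PER_YEAR = {
    "CPU speed":     1.6,            # ~1.6X per year
    "Disk capacity": 2.0,            # ~2X per year (recently)
    "DRAM capacity": 4.0 ** (1 / 3), # ~4X every 3 years
}

CHASSIS_YEARS = 5                    # middle of the 4-6 year range

for tech, rate in RATES_PER_YEAR.items():
    print(f"{tech:13s}: ~{rate ** CHASSIS_YEARS:5.1f}X over {CHASSIS_YEARS} years")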

The subtle goal for NUMAflex was *not* just to build SGI's next scalable product,
but to change the big-bang dynamics and timing into a more
continuous process able to incorporate new technologies faster.

But, to understand NUMAflex and Origin 3000, it helps to review the
predecessor Origin 2000, in a bit more detail. 

1.2 ORIGIN 2000 (LATE 1996)

Technically, this is a 2-bristled (2 nodes/Router), 2P/node,
directory-based ccNUMA using hypercube topologies up to 64P,
and then fat-hypercubes above 64P, using extra MetaRouter cabinet(s).

The discussion is ordered from the bottom-up, for reasons described later:
	ASICs, system module, larger systems, assessment.

1.2.1 ORIGIN 2000 ASIC summary

ASIC	Description			Major ports	Peak B/W, MB/sec
								Each port
HUB	CPU Node Crossbar		1 SYSAD <-> 2 MIPS CPUs	800
					1 Memory		800
					1 XIO			2*800
					1 NUMAlink*		2*800
XBOW	I/O Crossbar			8 XIO			2*800

ROUTER	Node interconnect crossbar	6 NUMAlink*		2*800

and, on most XIO cards:
BRIDGE	XIO <-> 64b, 33MHz PCI		1 XIO			2*800
					1 64b 33MHz PCI		267

* NUMAlink: the original marketing name was CrayLink, but with the spinoff
of the Cray Vector unit, the name has been changed to NUMAlink. 
NUMAlink and XIO are simultaneous bidirectional packet-oriented channels,
where each direction is 16 data bits wide, running at an effective 400 MBaud,
to get the peak 800 MB/sec each direction.  See [GAL96a] for details.		
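
As a quick sanity check, the arithmetic from link width and signaling
rate to the quoted peaks is simply bits-to-bytes times the transfer
rate; the sketch below also applies the 50-80% sustained rule of thumb
mentioned earlier:

# Peak-bandwidth bookkeeping for a simultaneous-bidirectional link:
# width in bits times the signaling rate gives the peak per direction,
# and the "2*N" figures in the tables are simply both directions added.
# The 50-80% sustained range is the rule of thumb quoted earlier.

def link_peak(width_bits, mbaud):
    per_direction = width_bits / 8 * mbaud      # MB/sec each way
    return per_direction, 2 * per_direction

each_way, duplex = link_peak(width_bits=16, mbaud=400)   # NUMAlink / XIO
print(f"NUMAlink/XIO: {each_way:.0f} MB/s each direction, {duplex:.0f} MB/s combined")
print(f"sustained estimate: {0.5 * each_way:.0f}-{0.8 * each_way:.0f} MB/s each direction")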
				
1.2.2 ORIGIN 2000 "MODULE"

An "8P+12I/O Module" is a file-cabinet-size box, with a big midplane:

2 XBOW I/O ASIC crossbars, each with 8 ports
	2 ports to CPU nodes
	6 ports to XIO cards; a PCI shoebox could replace some XIOs.
		XIO cards often used a BRIDGE ASIC that converted
		XIO to 64b, 33MHz PCI on-board.
	Total: 12 XIO slots.
4 slots for CPU node cards, each of which includes:
	1 HUB ASIC
		2 MIPS R1x000 CPUs
		Memory DIMMs, including directory for smaller systems
		Directory DIMMs added for larger systems
2 slots for 6-port Router boards, each with one Router chip.
	A 4P system might save cost using a Null Router, and an 8P system
	could use a Star Router, but modules expected to be part of a
	larger system need both Router boards.
Half-a-dozen SCSI disks
Power supplies, blower, etc.

ORIGIN 2000 NODE BLOCK DIAGRAM

	P	P	
	|_______|	
	    |	
MEMORY-----HUB--- (NUMAlink)
	    |	 
	   I/O (XIO)


ORIGIN 2000 MODULE BLOCK DIAGRAM

	P	P	P	P	P	P	P	P	
	|_______|	|_______|	|_______|	|_______|	
	    |		    |		    |		    |
MEMORY-----HUB---\	/--HUB--MEM	MEM-HUB---\	/--HUB-MEM
	    +	  Router    *	   	    +	   Router   *
	   I/O	 / | | \  I/O		   I/O	  / | | \  I/O
	    +	        \-- * ------------- + ---/	    *
	    +		    *		    +		    *
	    ++XBOW+++++++++ * +++++++++++++++  		    *
	    / / | | \ \	    ***********************XBOW******
						  / / | | \ \

The Routers offer 6 NUMAlink ports outside the module, and the XBOWs
provide 12 XIO slots.  Each Router supports a pair of CPU nodes, one of
which is connected to each XBOW.

The same module can be a deskside machine or rackmount; a deskside Onyx2 uses
a different cardcage to incorporate a graphics unit in place of 2 CPU nodes.

An Origin 200 is a smaller box, limited to 2P internally, which uses the
same ASICs, but has very different physical packaging, PCI I/O,
with optional external XIO box.  Two can be cabled together into a 4P system.

ASIC/CPU EFFICIENCY
A full Origin 2000 module has 8 CPUs, and 8 main ASICs (4 HUBs, 2 XBOWs,
and 2 Routers), ignoring I/O cards. A useful first-order metric for cost is
the overhead ratio of ASICs/CPU, about 1:1 for the big ASICs in a full module,
and a bit worse in partially-populated modules. The 1:1 ratio was an
improvement on the previous Challenges, which ran about 3:1.

Lower ratios are *not* automatically better, as fewer ASICs may mean
fewer pins and less bandwidth, but *usually* lower ratios mean lower cost,
and really large ratios may imply very high cost.
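
The metric is simple enough to compute in a few lines.  The sketch
below uses the ASIC and CPU counts quoted above; the half-populated
case assumes the two XBOWs and two Routers are still present, which is
what makes it "a bit worse":

# First-order "overhead" metric: big ASICs per CPU, using the counts
# quoted in the text.  The partially-populated case assumes the two
# XBOWs and two Routers are still present.

def asics_per_cpu(cpus, **asics):
    return sum(asics.values()) / cpus

full_module = asics_per_cpu(8, hubs=4, xbows=2, routers=2)
half_module = asics_per_cpu(4, hubs=2, xbows=2, routers=2)

print(f"full Origin 2000 module (8P): {full_module:.2f} ASICs/CPU")   # 1.00
print(f"half-populated module  (4P): {half_module:.2f} ASICs/CPU")    # 1.50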

This is actually a special case of the more general issue of fixed-costs
in scalable systems.  In many systems, the customer buys some chassis,
then fills it with CPUs and I/O, implying that the worst price-performance
occurs when the chassis is minimally filled.  The best price-performance
occurs when the chassis is filled, except, perhaps, where bandwidth
limitations (as in big-bus SMPs) cause the later CPUs to suffer in performance.
Thus, people like to have incremental, pay-as-you-go costs, rather than large
fixed costs that must be amortized.

A big backplane is expensive, and represents a larger unit of failure
than does a smaller backplane with fewer chips.  Engineers are always
jiggling the boundaries, as the tradeoffs keep changing, and different
design groups can rationally prefer different partitionings.
The next generation of ccNUMA systems illustrates this, as systems with quite
similar topologies display radically different packaging and ASIC counts.

1.2.3 ORIGIN 2000 LARGER SYSTEMS

A rack can hold 2 modules, connected by 2 NUMAlink cables, giving 16P/rack,
with 24 XIO slots, and 4 Routers.
By adding racks and cables, one builds larger and larger hypercubes
(either complete or partially populated), up to 64P.  Then, these
are connected via MetaRouters (additional racks containing only Routers),
as fat hypercubes, to build larger systems.

Hypercubes were chosen for scalable bisection bandwidth, which generally
grows in proportion to the number of CPUs, and for low latency.
See [LEN95a] for the details and comparisons of interconnect topologies.
Local restart memory latency is ~330ns.  Remote restart latency, to
furthest node in 128P system, is about 1.2 microseconds.  Theoretical average
remote latencies (assuming random memory references scattered equally)
are lower, since the furthest-away node is the furthest away, and in a
hypercube, most nodes are of course closer than the furthest.
Effective latencies, as experienced by actual programs, are lower yet,
probably more like 400-600ns (in same ballpark as Sun E10000), because:

- MIPS CPUs are aggressive speculative CPUs that overlap memory accesses,
  hiding latency for many codes.  Many vendors build speculative CPUs,
  because they successfully hide some latency, but one must be careful
  in interpreting latency numbers, as some benchmarks, on purpose,
  defeat the speculative execution features. 

- The theoretical average latency is about as bad as an OS can do,
  whereas the IRIX OS has become well-tuned at allocating memory "near"
  the CPU(s) using it.

- OS code is normally replicated into multiple nodes, so that OS instruction
  cache misses are "closer" than average.

- For special cases, programmers use "cpuset" directives to control memory
  allocation more explicitly.

All of these serve to reduce the actual latency as seen by programs.
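
The earlier point that most nodes are closer than the furthest one is
easy to quantify: in a d-dimensional hypercube, the distance between
nodes is the Hamming distance of their addresses, so the average over
random pairs is d/2 hops, versus d hops worst-case.  A tiny
illustration (hop counts only; real latency depends on far more than
hops):

# In a d-dimensional hypercube, the distance between two nodes is the
# Hamming distance of their addresses, so the average over random pairs
# is d/2 hops, versus d hops worst-case; this is why the theoretical
# average remote latency sits well below the furthest-node figure.
# Hop counts only; real latency depends on far more than hops.

from itertools import product

def hop_histogram(d):
    """Count ordered node pairs at each Hamming distance in a d-cube."""
    counts = [0] * (d + 1)
    for a, b in product(range(2 ** d), repeat=2):
        counts[bin(a ^ b).count("1")] += 1
    return counts

d = 4                     # e.g. 16 Routers
counts = hop_histogram(d)
total = sum(counts)
average = sum(h * c for h, c in enumerate(counts)) / total
print(f"{d}-cube: worst case {d} hops, average {average:.2f} hops")
for hops, c in enumerate(counts):
    print(f"  {hops} hops: {100 * c / total:5.1f}% of pairs")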

Latency citations are often not much better than the old mips-ratings -
unless somebody is very specific, it's hard to know what they mean,
and being really specific rapidly turns a quick note into a thesis.

SGI usually cites "restart latency", or interval from detection
of primary cache miss through restart of the instruction, including CPU
and refill overhead, but with no contention from other CPUs or earlier
cache misses (back-to-back).  Back-to-back latency (as measured in lmbench)
is worse, and other contentions are worse, and very complicated.
For discussion, see [HRI97a].

1.2.4 ORIGIN 2000 RETROSPECTIVE ASSESSMENT

I've lost track of the numbers, but I think there are 30,000 or so Origins
out there (including the smaller Origin 200s, which use the same ASICs), of which:
Size	Quantity
512P	1		(196GB main-memory system @ NASA Ames)
256P	6-8 (?)		(?, because 256Ps sometimes get split into 2*128)
128P	200+

There are hundreds of 64P systems, and thousands of the smaller ones.

GENERALLY-WORKABLE ccNUMAs IN 1996
ORIGIN 2000s were directory-based ccNUMAs that actually worked when
people spread jobs across multiple nodes, and acted more like UMA
(Uniform Memory Access) SMPs than did many ccNUMAs of the time.
The workability was due to several attributes:

(a) Remote:local latency ratios were generally under 4:1, usually more like
	2:1 as seen by typical programs, which means many people didn't
	need to worry too much about data placement, especially in
	comparison with some ccNUMAs where the ratio could hit 10:1.
	Of course, one can keep this latency ratio low (good) by increasing
	the local latency, but that is *not* a good idea.  Much better
	is to pay fanatical attention to latency everywhere.

(b) The hypercube topology scaled up bisection bandwidth, more-or-less
	proportional to the number of CPUs, and hence many workloads scaled
	fairly well.  Bisection bandwidth is the bandwidth obtained by
	slicing the machine in half and computing the total bandwidth across
	the bisection.  In bus-based SMPs, both total memory bandwidth,
	and bisection bandwidth (= bus bandwidth) remain constant, implying
	that an increase in number of CPUs causes a proportionate decrease
	in per-CPU bandwidths (memory bandwidth and bisection bandwidth).
	In ccNUMAs, the total memory bandwidth is normally proportional to
	the number of CPUs, but the scaling of the bisection bandwidth
	varies according to the topology, i.e., hypercube and 3D-torus scale
	fairly well, while a simple ring stays constant, like an SMP bus.
	Bisection bandwidth doesn't matter much if a system is running a
	workload of mostly-independent transactions that stay in
	their local nodes, but large-compute or large-I/O jobs can suffer badly
	if the interconnect gets overloaded, just as people found with
	big-bus SMPs, when adding more CPUs simply did not help.  SGI systems
	are required to deal with high bandwidths, so simple rings were not
	an option.  (A small sketch after this list illustrates the scaling.)

(c) The node size was small (2P), and the same interconnect was used at all
	levels. This avoided the kind of performance dropoff sometimes
	found with big-node ccNUMAs, where the inter-node latency difference
	was so high that some computer centers did not permit users to run
	jobs that spanned multiple nodes, because the dropoff
	for remote memory accesses degraded performance seriously. In
	effect, such machines were run as clusters of independent systems,
	in which case the ccNUMA hardware was not very useful.

(d) IRIX tried to optimize data placement, and tools were provided for
	user directives to override the defaults.
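
Here is the scaling contrast from item (b) in miniature: per-CPU
bisection bandwidth for a shared bus, a simple ring, and a hypercube of
2P nodes, with link and bus bandwidths normalized to 1 and all protocol
and contention effects ignored (purely illustrative):

# Stylized per-CPU bisection bandwidth as a system grows, for a shared
# bus, a simple ring, and a hypercube of 2P nodes.  Link and bus
# bandwidths are normalized to 1; protocol and contention effects are
# ignored.  The point: per-CPU bisection stays flat for the hypercube
# and falls for the other two.

def bisection(topology, cpus, cpus_per_node=2):
    if topology == "bus":         # one shared bus: constant, whatever the size
        return 1
    if topology == "ring":        # bisecting a ring always cuts 2 links
        return 2
    if topology == "hypercube":   # half the nodes' cube links cross the cut
        return (cpus // cpus_per_node) // 2
    raise ValueError(topology)

print(f"{'CPUs':>5} {'bus/CPU':>9} {'ring/CPU':>9} {'cube/CPU':>9}")
for cpus in (8, 32, 128):
    cols = [bisection(t, cpus) / cpus for t in ("bus", "ring", "hypercube")]
    print(f"{cpus:>5} " + " ".join(f"{v:9.3f}" for v in cols))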

Early on, customers seriously worried about data placement,
but most discovered they need not worry about it much.

Of course, some Origin 2000 users *do* worry about data placement,
typically for one of several reasons:

(a) They are going for all-out performance on big parallel CPU jobs,
especially those with higher CPU counts.

(b) They are doing I/O tasks that require careful balancing of
memory bandwidths. Each node's memory system peaks at 800 MB/sec
(and about 600 MB/sec sustained), but we have seen sustained single-file
read rates above 4 GB/sec, which requires careful striping across about 8
memories. These issues show up in big-I/O and media-streaming applications;
the striping arithmetic is sketched just below.
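
Using the sustained per-node figure quoted in (b), the arithmetic is
just a division (ignoring controller and filesystem overheads):

# How many node memories must a large file be striped across to sustain
# a target single-file read rate, given the ~600 MB/sec sustained
# per-node memory bandwidth quoted above?  Pure division; controller,
# XIO, and filesystem overheads are ignored.

import math

def memories_needed(target_mb_per_s, per_node_sustained=600):
    return math.ceil(target_mb_per_s / per_node_sustained)

print(memories_needed(4000))   # ~4 GB/sec -> 7; the "about 8" above adds headroom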

Some people wanted more local memory bandwidth, but it turned out that
the ccNUMA interconnect bandwidths, for most people, were actually a bit
higher than needed.  However, in making the transition from SMP to
ccNUMA, we wanted to make it as smooth and surprise-free as possible,
even if that meant providing a bit more bandwidth than strictly necessary,
expecting to fine-tune later designs.

Anyway, in actual practice, most users didn't worry about it, and certainly,
most third-party software vendors haven't. Many people just ran the existing
Challenge SMP binaries on Origins as well.

Customers liked the ability to start low, buy incrementally,
and build big; they liked being able to disassemble systems and reconfigure.

Some of them loved the massive I/O capabilities:
	7 GB/sec from one file system [although one customer I knew wanted 15]
	4+ GB/sec to/from one file
	1 TB backup in an hour
	7-TB individual files actually used by real customers.
	Disk farms in use at 10-100+ TBs, with no fsck.

While such abilities are relatively straightforward on an Origin 2000,
on clusters of machines they simply do not work at all, or not without
serious special-casing and data partitioning.  Also, the high-speed I/O
pipes were crucial for feeding graphics units, handling digital media
applications, and doing high-speed networking, not just in handling
large numbers of disks.

Most people liked the minimal Non-Uniformity, so they could treat the machine
more like a UMA SMP.  People often think Non-Uniformity is a goal.  It has *never*
been the goal of any SGI NUMA designers, but rather is viewed as a necessary
price to obtain cost-effective scalability, and a reasonable price to pay
as long as it is kept under tight control.

If we could get UMA everywhere, cheap, we'd love it.
Until then, we obey the laws of physics, and have NUMA.

1.2.5 BUT CUSTOMERS ALWAYS WANT MORE

HAVE IT THEIR WAY
	(a) Many customers filled every CPU slot (8 per rack), and
	    2-3 (of 24) I/O slots and complained about the wasted I/O slots.
	(b) A few customers filled every I/O slot, used only half of the
	    CPU slots, just enough to get all of the I/O connected, and
	    moaned about having to buy CPU boards they didn't really need.
	(c) Some customers wished for *more* disk in the main chassis.
	(d) Some customers didn't want *any* disks in the main chassis;
	    they just wanted separate Fibre Channel disk racks, and they
	    complained that SCSI disk slots were a waste.

I once visited a customer whose basement was filled with Origins (good!),
but all 4 complaints were voiced to me on the visit, by different people.

TECHNOLOGY EVOLUTION & MIS-MATCHES

People do not expect major in-chassis upgrades for single-board computers,
but they demand them for their scalable systems. This "InfraStress"
issue was discussed earlier, but the following is more specific.

EXAMPLE 1: the Origin 2000 is an XIO machine with a few concessions to PCI.
The  XIO interface is very fast, with many good characteristics, but
overkill for some uses, and the cards naturally cost more than PCI cards.
PCI 64b, 33MHz simply wasn't fast enough for some SGI needs, but 66MHz
wasn't really there yet, and cards weren't available.  We did have a
PCI shoebox to take care of people with standard-bus needs, but the
packaging remains a bit awkward [the shoebox sticks out a few inches.]

A year earlier, and Origin 2000 might not have had any PCI,
and a year later, it might have been mostly PCI, with a few XIO slots.

Likewise, Origin 2000s spanned the co-existence/changeover between SCSI disks
and FibreChannel disks; had the system come out 2 years later, it might
perhaps have used only FC disks.

This sensitivity of design to such timing is always painful, and of
course, truly awful things happen when some part of a schedule slips,
especially for reasons outside one's control.

EXAMPLE 2: Sun servers are transitioning from SBUS to PCI, leading
to the following sort of comment (from some Sun Web Page):

"Q. PCI boards do not package well in Enterprise servers compared to
SBus boards.  So, why would I recommend SCI PCI over the SBus version?

A.  You are correct and that is why we are keeping SCI SBus in the
product line.  However, our next generation of servers will not support
SBus.  Therefore, customers who will want to cluster today's generation
of Enterprise servers with tomorrow should consider using SCI PCI."

That is not a knock on Sun, it is a typical inter-generation transition
problem that occurs again and again with I/O.

Customers not only keep old I/O devices "forever"
(some SGI customers still have many VME cardcages out there),
but still want access to the newest devices.  I/O is also awkward, in that
it's as much political as technical, the standards come from shifting
industry coalitions, and the schedules are often unpredictable.

REDUNDANCY & RESILIENCY
Some people wanted more redundancy, but others wouldn't pay for it.

In most cases, for a single system size it is cost-optimal to build
a design optimized for that size, although doing so may not be optimal
across a product line.  Computer designers and marketeers are always
arguing about the number and sizes of distinct boxes to be built.

A unit that is too big may create a large unit of failure,
and a unit too small may impose high costs for modularity. For example,
if every replaceable unit needs its own N+1-redundant power supplies,
something built of small units ends up with "redundant redundancy".

While an 8P+12 I/O unit was not that big, especially compared to the
larger servers around, it was still too large a unit of failure for
some customers, even with some redundant internal paths.
With 2 XBOW ASICs, each connected to 2 CPU nodes, there were at
least 2 CPU<-> I/O paths, but still, a node failure required a reboot.

People wanted support for partitioning, and that *almost* worked,
but not quite. Under certain circumstances, reset signals propagated
into other partitions, a Bad Thing.  It was good enough to debug partitioning
software, but not (the required) 100% safe to release.

AGGLOMERATED SYSTEMS
The partitioning issues were just a few among the myriad of details that one
must learn to get right in a system constructed from an agglomeration of
relatively-independent modules.  System controllers aren't simple,
but need to cooperate with others.

Methods that work fine in uniprocessors or small SMPs, or even some big-bus
SMPs may fall apart when scaled up. In the "good old days", a master CPU
could check each slot and see what's there.  In a modular system that supports
a wide range of topologies, and may have broken links, it's not so easy.

Algorithms that used to work may become impractical due to
scaling issues.   Elapsed times may become dominated by serialized code.
For example, UNIX "fsck" is a disaster on a 10-TB filesystem,
which is why people use journaled filesystems like XFS.
Likewise, early on, the reboot time for the 512P + 196GB-main-memory
Origin 2000 at NASA AMES was .... 2 hours. (It's much less now).

In general, leading-edge, high-end systems get to have more
"close encounters of a strange kind" with scaling issues that sound
like fantasy to most people, but later come to afflict many systems.
Just 10 years ago, many people considered the ideas of 64-bit micros
and more-than-4GB-memory as strange, but we've seen 8GB desktops already.

Anyway, although these agglomerated systems overlap in market space with
bus-based SMPs, they add all sorts of exciting new issues to solve.

INTERNAL ELECTRICAL IMPLEMENTATION ISSUES

In some cases, the multiplicity of Origin 2000 physical realizations
caused hassles or overhead.  For example, node cards used a tricky connector,
over which ran both NUMAlink and XIO onto traces that connected with Routers
and XBOWs, which then had different connectors for NUMAlink cables
and XIO cards. To run an XIO cable (XTown) (to a graphics unit) required
an XIO card with an "XC" chip to provide differential signals.
All of this was done for good reasons, but people wished for fewer different
flavors, especially since high-speed-signal engineering is nontrivial.

STATUS - LATE 1997	

The Origin 2000 was introduced in late-1996, and its smaller sibling,
the Origin 200, in early 1997. Hardware engineers were already considering new
ASIC designs.  There was, of course, serious churning around due to digestion
of the (mid-1996) Cray merger, changes in the executive ranks, etc.
The high-end MIPS roadmap had been changed, with two new designs
cancelled in favor of extensions to R10000/R12000. Intel's IA-64 was
to be incorporated into systems.

We also had experience from T3Es, whose sweet-spot was larger than Origin 2000,
but whose engineering philosophies were more similar than one might expect -
both groups shared fanatic attention to latency, bisection
bandwidth, and serious I/O,  even though many other details were different.

We had considerable software efforts going towards scalability,
attempting every 9-12 months to double the size of the largest useful
configuration.

So, we continued to believe strongly in directory-based ccNUMAs,
but we'd learned a great deal from real practice, and we knew we would
have a long overlap of MIPS and IA-64 systems.

At that point (late 1997), we were going to have 2 product lines:
	- Flintstones: MIPS/IA64; 2-128P [hardware primarily Mountain View]
	- Aqua: MIPS; 64-1024P [hardware primarily Chippewa Falls]
	
So, we'll see how we got from there to where we are now, but first:

REMINDER OF AN UNPLEASANT FACT OF CURRENT COMPUTER DESIGN

Once upon a time (15 years ago), people built microprocessor systems
that included many PALs and wires, and they could be changed and fixed fairly
easily. Boards often shipped to customers with fixup wires soldered on them.

These days, people don't have this luxury, but work for
years on monster CPUs and ASICs that have more gates than whole boards had
just a few years ago, cannot be fixed with soldering irons, and
require gate counts, clock rates, or special circuits not yet obtainable
from FPGAs.  It can easily take a few months for a complete turn on a CPU
or aggressive ASIC, so verification tests become ever more important.

Thus, ASICs & CPUs have to be started long before one knows *all* of the details
of system partitioning, packaging, configurations.  If new requirements cause
major changes, especially to interfaces between ASICs (or with CPUs),
people intone the dread words "Major Schedule Slip", meaning year(s).

On the other hand, given a flexible set of ASICs, one may be able to build a wide
variety of systems.  For example, Origin 2000 and 200 use the same major
ASICs, but are otherwise rather different. For this reason, I keep
presenting ASICs first, then system design, because the ASICs tend to
acquire more inertia earlier in the process.

This will help explain what happened; I've often noted that there is
more insight to be gained from hearing about paths considered
but then rejected than from just knowing the path chosen.

1.3 "FLINTSTONES PROJECT" (AS OF JANUARY 1998)

At that point, SGI was fairly far along with the designs for several ASICs,
and people were developing concepts for system partitioning and packaging.

There were to be both MIPS and IA-64 flavors, although whether we
were doing one or the other, or both, and if both, in which order,
sometimes changed according to executive decisions!
In such cases, good engineers keep on slogging away, and opt for
very flexible designs, Just In Case.
[There must surely be a Dilbert to this effect.]

In a net posting, I won't describe all of the variants, but it
is worth looking at our clearly-defined view of the world in early 1998.

1.3.1 ASIC SUMMARY (AS OF JANUARY 1998,  STILL TRUE)

ASIC	Description			Major ports	Peak B/W, MB/sec
								Each port
Bedrock	CPU Node Crossbar		2 SYSAD <-> 4 MIPS CPUs	1600
					1 Memory		3200
					1 XTown2 -> XBridge	2*1200
					1 NUMAlink3		2*1600
XBridge	I/O Crossbar			2 XTown2 (XIO+)		2*1200
					4 XIO ports		2*800
					2 64b 66MHz PCI		533
Router	Node interconnect		8 NUMAlink3		2*1600

XXXXXXX	Itanium bus <-> MIPS bus	Not yet public

Bedrock fills the same role as the Origin 2000 HUB, except that it has:
	4X CPU-bus bandwidth (for 4 CPUs, not 2) =
		2  MIPS SYSAD busses, each 2X faster
	4X memory bandwidth
	2X ccNUMA interconnect bandwidth (NUMAlink -> NUMAlink 3)
	1.5X I/O bandwidth (XIO -> XTown2)

XBridge resembles an XBOW with 2 built-in BRIDGE ASICs, and can be used to
provide either XIO or PCI ports. Three XBridges ganged together the
right way can supply 6 64b 66MHz PCI busses.

The differing improvement ratios reflected costs, physics, and studies of
Origin 2000 performance.  For example, many wanted more memory bandwidth,
but relatively few had saturated the ccNUMA or I/O interconnect.  It
would have been nice to have made the XTown2 bandwidth 2*1600 MB/sec,
but that was not feasible.
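
Those improvement factors can be read straight off the two ASIC tables;
the little sketch below just does the division, using the peak figures
quoted above:

# Ratios of Bedrock port bandwidths to the Origin 2000 HUB's, using the
# peak MB/sec figures from the two ASIC tables.  CPU-bus totals are the
# sums over SYSAD busses (2 at 1600 vs. 1 at 800); ccNUMA and I/O are
# the 2*N duplex totals.

hub     = {"CPU bus": 800,  "memory": 800,  "ccNUMA": 1600, "I/O": 1600}
bedrock = {"CPU bus": 3200, "memory": 3200, "ccNUMA": 3200, "I/O": 2400}

for port in hub:
    print(f"{port:7s}: {bedrock[port] / hub[port]:.1f}X")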

Given these, it is possible to build all sorts of systems, but the
basic CPU nodes must look like the following, with Routers and I/O
placed wherever makes sense for system partitioning.

		MIPS				Itanium

	P	P	P	P	P	P	P	P	
	|_______|	|_______|	|_______|	|_______|
	    \		/		    \		 /
	     \	       /		XXXXXXXX     XXXXXXXX	
	      \	      /		   	      \	       /
MEMORY---------Bedrock---NUMAlink 3	MEMORY--Bedrock---NUMAlink 3
	   	  |				   |
		 XTown2				  XTown2

The NUMAlink 3's (optionally) connect Bedrock <-> Bedrock, or Bedrock <-> Router.
The XTown2's connect to (optional) XBridge chips.

Numerous topologies are possible.  For example, one might have 2 CPU
nodes per Router, leaving 6 ports free to connect to other Routers,
or 4 CPU nodes/Router, leaving 4 ports free.  It is good if the smallest
machines (4P or 8P) can avoid paying for Routers. [In Origin 2000s,
several special Router cards were used to achieve this, but it was irksome
to need the variations, or to change router boards as systems scaled up.]	 
				
1.3.2 FLINTSTONES MODULE - BAMBAM

This was to be the replacement for the Origin 2000 module; it would have
been fairly similar, and would naturally have been named Origin 3000 or 4000.
It was proposed as a 10U (17.5") box:
	8 CPUs, connected to 2 Bedrocks
	3 XBridges, giving either 12 PCIs (3 4-PCI shoeboxes),
		or 8 PCIs and 2 XTown2s
	6 disks
	2 removable media (Device Bay)
	1 Router, with 2 ports connected to the 2 Bedrocks, 6 ports free
	Miscellaneous other items as needed; power, fans, etc.

Ignoring I/O, a full BamBam would have had 2 Bedrocks, 1 Router, and 3
Xbridges, or 6 ASICs / 8 CPUs, a .75:1 ratio, better than Origin 2000's 1:1.

However, if one recalls complaint 1.2.5 (a)-(d) above, nobody is ever
satisfied with a fixed CPU-I/O ratio, and SGI not only has Big-CPU + Big-I/O
customers, but also Big-CPU-hardly-any-I/O customers,
so the following was added, to avoid wasting money on unused I/O:

1.3.3 FLINTSTONES CPU MODULE - PEBBLES
	5U box
	8 CPUs, connected to 2 Bedrocks
	1 Router, with 2 ports connected to the 2 Bedrocks, 6 ports free

	These have a ratio of 3 ASICs: 8 CPUs, or .375 (good).

1.3.4 FLINTSTONES - BIGGER SYSTEMS

It is clearly possible to build Flintstones systems similar to Origin 2000s,
by combining BamBams, and CPU-rich configurations could be gotten by
adding Pebbles boxes.

With 6 free router ports per box, there were of course a myriad of potential
topologies, limited mainly by one's imagination and willingness to
handle the cabling variations in the field.

1.3.5 BROADER CONTEXT AND OTHER PROJECTS

In addition to the teams working on Itanium and MIPS, another
group was getting started on Intel "McKinley" designs.

Looking forward over the next few years, one could observe that:

(a) The various CPUs had different (sometimes, radically different)
power, cooling, and packaging requirements, with different CPU busses.
Among other effects, it is extremely difficult to optimize dense-packed
CPU boards to handle such differences efficiently, especially when
specifications change.

(b) 64b, 66MHz PCI satisfied most needs, but we still had to support XIO
for a few cases, as well as legacy re-use.  Later, we might want PCI-X;
and then we might have either Intel's NGIO or the rival Future I/O, with no way
of knowing who would win, or whether both would be required.
As it happened, thankfully, they merged to create one - InfiniBand.
From past experience, we knew that I/O standards efforts were unpredictable.

(c) There was no end of argument about I/O mixtures for BamBam.
Almost every integrated system design I've been involved with has
had these arguments.  There is only so much physical space, and there are
always serious fights over every bit of it.

(d) There was no end of argument about the amount of redundancy required,
and where it would be, and who would or wouldn't be willing to pay for it.

(e) Any optimizations we came up with for delivery at a specific date
seemed to rapidly become non-optimal over the following few years,
especially given the rapid changes to I/O standards.

1.3.6 WHAT HAPPENED THEN WAS ...

We didn't actually build early-1998's Flintstones at all, although if we had,
it might have shipped a bit earlier.  The name did persist, but what we
built was very different from Origin 2000 or January-1998 Flintstones.

In mid-January, a more modular design was suggested, and went through
frequent iterations with input from numerous people.  Also, in March,
Aqua and Flintstones were combined into one project.  By June 1998, we had
mostly settled on something that was radically different, not-yet-complete,
but recognizable in the systems finally built.

The ASICs mostly stayed the same, of necessity,
but almost everything else changed.

In January 1998, it was not at all clear that the new approach would work.
Physical and low-level electrical design issues often get short shrift
in publicity, but in this case, people continually solved very subtle
and difficult problems, without which this whole approach would have been infeasible.

However, something much more important happened during 1998, although it
wasn't yet quite so obvious.

We created an overall design approach (now named NUMAflex), that
encompasses multiple generations and families of machines, and that
gives better hope of adapting to uncertainty.  We generated plans out
through 2006 looking at evolutionary scenarios.  We modified some earlier
designs to avoid causing problems for later ones, and we modified later
designs to make better re-use of earlier pieces.  Hence, instead of
having several independent projects, we got one big project, with
subprojects that shared many common pieces. 

[This has been done before, of course, but it hasn't been very common
in the large microprocessor-systems business, and especially not at SGI,
so it was a major philosophical change.] 

We changed to a design philosophy that emphasized frequent
improvements, rather than having 3-4-year development efforts optimized for
a delivery date, but then inevitably becoming suboptimal.  High-end systems
suffer from this, across the industry, because it takes big efforts to
do an entire new design.   

NUMAflex is a long-term-oriented design approach,
not just targeted at the first family, Origin 3000.  If it were *only* that
family, some optimizations would be *quite* different.  Hence, this whole
approach is a large bet that a continuous design process will work better
than the "big-bang" approach.

Also during 1998, we decided that Linux-IA64 would be used rather than
IRIX-IA64 for the IA64 versions.  That is a whole separate story, based on
strong input from major third-party software vendors.

2. SGI NUMAFLEX (TM) DESIGN APPROACH - BRICK-AND-CABLE MODULARITY

All of the NUMAflex systems are assembled from "bricks" (i.e., particular
sorts of "cyber-bricks"), using high-speed cabled interconnects, rather than
normal big backplanes.  People seeing an Origin 3000 often look for the big
backplane.  THERE ISN'T ANY. Bricks are rack-wide, mostly 3U-4U high
(5.25" - 7"), connected via cables.

Brick-and-cable is the implementation essence of all NUMAflex systems.

As mentioned earlier, a good physical analogy is that of modern,
modular audio components that plug together and evolve independently.
By wonderful contrast, our marketing folks showed an old wooden TV+record player
entertainment center - beautiful wood, but not very upgradeable, no CD!

FIRST DO NO HARM - we wanted to keep the scalability we had in Origin 2000,
and then do better:

2.1 INDEPENDENT RESOURCE SCALABILITY - CPU+memory, I/O, and storage can be
  scaled (relatively) independently in the varying ratios desired by customers,
  without wasting slots, floor space, power, and cost.
  In early NUMAflex systems, the number of I/O bricks is no more than
  the number of CPU bricks, but we expect to relax that restriction later.

2.2 INDEPENDENT RESOURCE EVOLVABILITY - CPU+memory, I/O interfaces, storage,
  and even interconnects can be evolved at their own natural
  rates, still work together, and offer many years of cost-effective,
  yet compatible upgrades.  Numerous subtle problems must be solved to do this.
  MIPS and Intel IA-64 NUMAflex families offer long roadmaps,
  and most hardware elements are shared, both between the MIPS and
  IA-64 families and between generations.

The *crucial* enabler of the two attributes above is the (somewhat unusual)
partitioning of I/O, wherein a CPU brick only pays for an XTown2 connector.
If I/O is to be connected, one adds an XTown2 cable, and an I/O brick
that internally converts XTown2 to the specific I/O implemented in that brick.
VERY IMPORTANT: CPU BRICKS ARE COMPLETELY DECOUPLED FROM THE END I/O BUSSES -
no CPU brick contains any manifestation of XIO, PCI, PCI-X, InfiniBand, etc.

I/O is only paid for as I/O slots are filled, and CPU decisions are
decoupled from I/O evolution, so that CPU and I/O evolution could become
more asynchronous, for hardware. [There is *always* A Minor Matter of Software.]

This approach is clearly very different in packaging from the integrated
I/O of the Origin 2000 and many older SMPs. It is less different
from designs that have separate I/O ASICs plugged into CPU backplanes,
with cables to remote PCI boxes, like the Compaq GS320.  The subtle, but
important difference, is that no hardware component in the SGI CPU brick
depends on the choice of the I/O busses employed elsewhere.

By 2001, we should have seen the whole new round of ccNUMA designs,
and will then be able to see the different ways people have sliced this problem.
Meanwhile, we are quite happy to have the line drawn
where we put it, that is, with CPU bricks bearing little cost for I/O
and having no dependence on the type of I/O connected.  This allows
great flexibility of evolutionary paths, but also, upgrade paths for
individual installed machines.

2.3 PERFORMANCE - Of course.  Bandwidths up (~2X), latencies down (~.5X),
  CPU performance up about 30% (at same clock rate) in Origin 3000,
  and later NUMAflex machines improve more. 

2.4 PRICE/PERFORMANCE - it is not good enough to build fast, but expensive
  systems.  These systems use commodities wherever they can, and get some
  unusual economies of scale by minimizing the number of distinct entities,
  given the immense range of configurations. We're happy to have had
  big backplanes disappear. It is convenient that more elements are
  reasonably FEDEXable.

  An obvious improvement is that we reduced the typical 1 big ASIC per CPU
  ratio in the O2000 (8 CPUs, 4 HUBs, 2 XBOWs, 2 Routers) down to .3-.6,
  depending on I/O richness.  [8 Bedrocks, 2 Routers,
  and likely 4 Xbridges typical for 32P system: 14:32, or .44]. 

  MIPS/IRIX, and IA-64/Linux systems share most hardware elements,
  and even if drivers are different, the overhead of device qualification
  and support is far less than doing it twice.

2.5 STRONG RAS FEATURES - the use of independently-powered bricks
  solved numerous problems and ended many arguments, once we became
  convinced that it was actually possible at the speeds expected. Becoming
  convinced took serious efforts by many people!  There is an obvious
  pluggable connection at the end of each cable.  There are fewer
  different kinds of connectors, cables, and power supplies.
  PCI cards are warm-pluggable, without moving anything else around.
  Power supplies and fans are N+1-redundant and hot-pluggable.

  I/O bricks can be cabled to 2 separate CPU bricks, rather than being
  incapacitated if their attachment to a unique CPU node fails. Although
  dual-attach is common in mainframes, it is rarely found in micro-based
  systems, even systems introduced in 2000, most of which either integrate
  I/O with CPU boards, or if cabled, have no redundancy in path from
  I/O box to CPU+memory system.
  
2.6 SERVERS AND GRAPHICS - of course, SGI always does this, but with
  even more flexibility, given the increased modularity of the bricks.

2.7 FAMILIES - A NUMAflex family is typically defined by the
  CPU brick, i.e., one may mix different speeds of the same flavor,
  may upgrade within the family with no software-visible changes, etc.
  Typically, different C-brick ASICs create different families, due to
  necessary changes in addressing, cache-coherency protocols, etc.
  For example, early MIPS, Itanium, and later IA-64s are clearly different
  families.

  Some bricks (like I/O & Routers) and cables are used in common by multiple
  families.  Early I/O bricks will live a long time, even if newer
  systems begin using newer I/O bricks, but the general process is one
  of incremental overlap, rather than big-bang changes.

  Some elements (racks, power supply, etc) are common among all families.

  Anyway, the whole point is to maximize the re-use across families,
  avoid premature obsolescence, and give a lot of flexibility with
  regard to I/O and CPU evolution, and in general,
  PROTECT CUSTOMER INVESTMENT.

2.8 UNUSUAL FLEXIBILITIES
  With Origin 2000s, people sometimes built up a big configuration,
  then later split it up into several pieces placed in different
  locations.  NUMAflex systems are like this, but much more so.
  It is also easier to put the bricks down submarine hatches,
  or into other constrained spaces.
  Smaller bricks are inherently more rugged than larger modules,
  so they are directly suitable for some applications that required
  special ruggedizing of Origin 2000s.

So, what are the details?

3. ORIGIN 3000 + ONYX 3000 = FIRST NUMAFLEX FAMILY

Technically, these are 4-bristled ccNUMAs (4 nodes/Router), 4P/node,
using hypercube topology up to 128P, and then fat cubes up to 512P.
In CPU-rich configurations, they use about half the floor space per CPU
as the Origin 2000s. One can have a complete 32P, CPU-rich configuration in
one rack, or 128P in 5 racks (4 CPU, 1 I/O).

An Origin 3000 rack resembles a rack of loosely-coupled small computers,
but as bricks are added, they form a tightly-integrated system,
whose close joining is symbolized by the wavy vertical lines on the front door.

3.1 BRICKS
Each brick has hot-plug, N+1 fans at the front, pulling air through to the back.
This was a change from the big blower used in Origin 2000.

BRICK	HEIGHT: U=1.75"	DESCRIPTION
C-brick	3U	2-4  MIPS R12000A CPUs, 1 Bedrock ASIC,
		512MB - 8GB of memory
		1 NUMAlink connector at back (2 * 1.6 GB/sec)
		1 XTown2 connector at back (2 * 1.2 GB/sec)
		LCD, Level 1 system controller, etc.
		Local memory bandwidth  (peak) = 3.2 GB/sec, regardless of size
			[This is strangely important: bandwidth does *not*
			depend on the number of DIMMs, unlike, for example,
			the Compaq GS320, where the full quoted bandwidth only
			occurs for some memory sizes, whose minimum is
			4GB for a 4P node.]
		Local memory (restart) latency = 180ns.

R-brick	2U	1 8-port Router (or 6-port in smaller systems)
		8 * (2* 1.6 GB/sec) ports
		Latency:
		CPU -> cable -> Router -> = about 30ns for 1m cable,
		a bit more for longer cables.
		A good rough estimate is that Origin 3000 latency tends
		to be about 50% of Origin 2000 latency, at same CPU counts.
		Local memory latency is better, per-hop latency is better,
		and the 4-bristled topology with 8-port routers costs fewer hops.

I-brick	4U	Base I/O, every partition needs at least one
		Connects to 1-2 C-bricks
		1 XBridge ASIC
			2 XTown2 ports (can dual-attach, either for
				resiliency, or bandwidth, or second one
				can attach to G-brick)
			2 64b PCI busses
				1 66MHz, 2-slot
				1 33MHz, 4-way:
					3 slots, plus local I/O
					(1 serial, 2 1394s, 2 USBs, Ethernet,
					RTI/RTO (Real-Time Sync ports))
		2 FC disks (FC controller uses a slot)
		1 DVD/CD-ROM [1394]
		Each partition in a system needs one I-brick.

P-brick	4U	PCI brick, 12 64b 66MHz slots on 6 busses
		3 XBridge ASICs ganged together
			XBridge #1 uses all ports
				1-2 XTown2 to C-brick(s)
				2 PCIbusses, with 2 slots apiece
				2 XIO to Xbridge 2
				2 XIO to XBridge 3
		Xbridge #2 & XBbridge #3 supply 2 PCIs each,
			with 2 XIOs to Xbridge #1, and 2 XIOs unused.

		Sustained PCI bandwidth could get as high as
		6 * 400 MB/sec, or 2400 MB/sec, which would likely
		saturate 1 XTown2 (2*1200 MB/sec peak),
		but of course, the P-brick has 2 XTown2 ports,
		so bandwidth-intense applications would use both.
		This also leaves headroom for later (faster) PCI-X versions.
		Thus, the customer can have:
		- Low-cost, connectivity applications - 1 port connected
		- Redundant, connectivity applications - 2 ports connected
		- High-bandwidth applications - 2 ports connected

	All I & P-brick PCI cards are (in hardware) warm-plug, with horizontal
	insertion/removal from the rear, using a plastic carrier that
	pushes the PCI card down as needed.
	
	[Big-system hardware designers are not fond of PCI's insertion at
	right angles to the external connectors. Note, for example, that
	Compaq's GS320 documentation says nothing about warm-plug PCI.]

X-Brick	4U	XIO Brick, 4 XIO slots (like Origin 2000's)
		1 XBridge ASIC
			2 XTown2 ports to outside
			4 XIO slots (PCIs unused)

G-brick	18U	Graphics (Infinite Reality 3)
		1 XTown port, connectable to XTown2 port in
		I, P, or X-bricks. A pleasant improvement: it no longer
		needs a card to convert XIO to differential XTown.
		An Onyx 3000 is an Origin 3000 + G-brick(s).
		Of course, 18U is a giant "brick" and the graphics
		backplane is the only big one in the system, as this
		brick was essentially brought forward from Onyx2. 

D-brick	4U	JBOD Disk brick [RAIDs are normally in other racks]
		1-12 FC disks

Powerbay 3U	(Not really a brick, although similar size).
		1-6 hot-plug power supplies, yielding 48V, which runs to
		C, R, I, P, X bricks, each of which is responsible
		for its own internal power conversions [they vary].
		They use industry-standard power supplies.
		Tall rack has 1-2 of these, short rack has 1.

Other items	Bricks have Level 1 system controllers.
		Each CPU rack in bigger systems has an L2 syscon as well,
		which runs the door's display and communicates with other
		L2s and the system console (if any) via Ethernet.

	Every brick is independently powered & removable, with the degree of
	pluggability being up to software, which improves over time.
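
	To make the latency numbers above concrete, here is a minimal
	back-of-envelope sketch in Python (mine, purely illustrative).  The
	180ns local latency and ~30ns per Router hop come from the C-brick
	and R-brick entries above; the simple additive model and the example
	hop count are assumptions, not measured Origin 3000 figures.

		# Crude remote-latency estimate built from the per-brick numbers above.
		# The additive model and the hop count are illustrative assumptions.
		LOCAL_NS = 180.0      # C-brick local (restart) memory latency
		HOP_NS = 30.0         # one cable + Router crossing, ~1m cable

		def remote_latency_ns(hops):
		    """Local latency plus one hop cost per Router crossed (assumed model)."""
		    return LOCAL_NS + hops * HOP_NS

		print(remote_latency_ns(2))   # e.g. 2 hops -> 240.0 ns (hypothetical path)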

3.2 CABLES
	Apart from their maximum lengths, the XTown2 (5 meters max) and
	NUMAlink 3 (3 meters max) cables are identical, with the same connectors.
	
	There are no big backplanes or mid-planes:  Routers plus
	cables are the equivalent, in effect a virtual backplane.

	The Origin 2000's multiplicity of connectors and backplane traces
	has been simplified into one connector type for high-speed signals.

Following is a block diagram for the CPU + Router part of a 16P system;
an I-brick plus 0-3 {I-, P-, and X-bricks} would be connected to XTown2 ports.
This shows 4 C-bricks, 1 R-brick, and 4 NUMAlink 3 cables:

	C-brick 1     .	C-brick 2     .	C-brick 3     .	C-brick 4
		      .		      .		      .
	P   P	P   P .	P   P	P   P .	P   P	P   P .	P   P	P   P
	|___|	|___| .	|___|	|___| .	|___|	|___| .	|___|	|___|
	    \	 /    .	    \	 /    .	    \	 /    .	    \	 /
        MEM--BEDROCK  . MEM--BEDROCK  . MEM--BEDROCK  . MEM--BEDROCK
	    / 	|     .	   /	|     .	    /	|     .	    /	|	
	XTown2	|     .	XTown2	|     .	XTown2	|     .	XTown2	|
	............................................................
	Cables	|		|______     ____|		|
		|_____________________ \   /  __________________|
				      \	| | /
	............................................................
				       ROUTER
	R-brick			      / | | \
	............................................................


3.3 RACKS
	Tall rack - 74" high, 30" wide, 50" deep - 39U configurable space
	Half-rack - 34" high, 24" wide, 42" deep
	They have the usual cool-SGI colors & plastic skins, and the tall rack
	takes care of the (serious) cabling issues.

	Of course, some SGI customers will use their own racks, especially
	the ones who put them into vans, airplanes, other embedded
	systems, or the aforementioned submarines.

3.4 MODELS - HAVE IT YOUR WAY ... WITHIN LIMITS OF SANITY
	It is obvious that one can make up zillions of different combinations,
	but anybody who does that too much will be sorry later.
	Hence there are standard configurations and ways to put these
	things together, with a lot of leeway left if somebody wants to
	buy $100M of something different.

	On occasion, a "molecular" notation is useful in describing
	configurations, and it is no accident that brick names have
	distinct letters. Treat the numbers as subscripts.
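
	As an illustration only, here is a small Python sketch (not an SGI
	tool) that expands this molecular notation and estimates the ASIC/CPU
	ratios shown in brackets below.  It assumes 1 Bedrock per C-brick,
	1 Router per R-brick, 1 XBridge per I- or X-brick, 3 XBridges per
	P-brick, and fully-populated 4-CPU C-bricks, per section 3.1; small
	rounding differences from the quoted figures are to be expected.

		import re

		# ASICs per brick type, per the section 3.1 descriptions;
		# D-bricks, G-bricks, and Powerbays contribute none here.
		ASICS = {'C': 1, 'R': 1, 'I': 1, 'P': 3, 'X': 1, 'D': 0, 'G': 0}

		def parse(config):
		    """Expand e.g. 'C8R2IP7D' into {'C': 8, 'R': 2, 'I': 1, 'P': 7, 'D': 1}."""
		    return {b: int(n) if n else 1
		            for b, n in re.findall(r'([A-Z])(\d*)', config)}

		def asic_per_cpu(config):
		    bricks = parse(config)
		    cpus = 4 * bricks.get('C', 0)          # assume 4P C-bricks
		    asics = sum(ASICS.get(b, 0) * n for b, n in bricks.items())
		    return asics / cpus

		print(round(asic_per_cpu('C8R2I'), 2))     # 0.34, matching the figure below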

3.4.1 SGI ORIGIN 3200, 2-8P EXAMPLES
	This is normally in a half-rack; molecular notation follows,
	with ASIC/CPU ratios in []; remember that a C-brick typically has 4 CPUs.

	CI	2-4P, minimal system (has 1 powerbay) 	[1:1 - .5:1]
	CID	CI plus disk-brick
	CID2	CI plus 2 disk trays

	C2I	8P, still routerless			[3 ASICs, .375:1]
	C2ID	8P with JBOD disks			
	C2IP	8P, but more PCI			[5 ASICs, .625:1]			

	C2IG	8P, graphics ... Onyx 3200, in tall rack (the 18U G-brick!)

3.4.2 SGI ORIGIN 3400 OR ONYX 3200 4-32P EXAMPLES
	1-rack, with all CPUs, I/O, and storage
OR	2-racks = 1 CPU-rack, rest in second rack
OR	3-racks (or more) = 1 CPU-rack, 1 I/O rack, and more racks for
		G-bricks and disk
	
	The basic approach builds up C4R groups: a Router plus 1-4 C-bricks.
	For smaller systems, I/O goes in the remainder of the first rack.
	For larger ones, I/O starts in the second rack. A few examples:

	C4R2IPD	16P & some I/O & disk; 1 rack.	 [.63 ASIC/CPU]
		[These come configured with 2 Routers]
	C8R2I	32P, 2 disks, a few Ethernets; 1 rack [.34 ASIC/CPU]
		This is naturally a system liked by CPU-heavy users, but
		actually certain kinds of e-commerce customers do it also.
	C8R2IGD	A 32P Onyx 3000, 2 racks, one disk tray.
	C8R2IP7D*  32P, 2 racks plus disk racks	[1 ASIC/CPU]
		Ignoring the I-brick: 84 PCI slots, 16.8 GB/sec I/O.
		Using an SGI TP9100 storage unit [9 12-drive chassis/rack],
		and supposing you have 73GB drives with single attach,
		you get 84*870GB = 73 TB, in 10 disk racks, dwarfing the
		2 racks for CPU & I/O.
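
	As a quick arithmetic check of the C8R2IP7D figures, here is a short
	sketch; the ~870GB-per-chassis figure assumes a 12-drive chassis of
	73GB FC drives, rounded down, and the same arithmetic reproduces the
	larger numbers quoted in the next subsection.

		# Check the slot, I/O bandwidth, and capacity figures for C8R2IP7D.
		p_bricks = 7
		slots = p_bricks * 12              # 84 PCI slots (ignoring the I-brick)
		io_mb_s = p_bricks * 2 * 1200      # 2 XTown2 ports/brick * 1200 MB/s = 16800 (16.8 GB/s)
		capacity_tb = slots * 0.87         # ~1 chassis per slot, single-attach -> ~73 TB
		disk_racks = -(-slots // 9)        # 9 chassis per TP9100 rack -> 10 racks
		print(slots, io_mb_s, round(capacity_tb), disk_racks)   # 84 16800 73 10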

3.4.3 ORIGIN OR ONYX 3800 16-512P EXAMPLES
	Here, for cabling sanity, CPU racks are kept "pure".

	C32R8I	128P, 2 disks, some Ethernets. 5 racks. [.31 ASIC/CPU]
		Definitely for CPU-heavy applications.
	C32R8IP31 128P, plus disks; looks like you could get
		372 PCI slots (ignoring I-brick), or 323 TB of disk.
		[For some of our friends who make 7-TB files, that's
		only 46 such files, so don't laugh...]
	C128R44..  512P, and some racks have an extra R-brick at top
		to connect with other racks.

NOTE: all of the above are *hardware* possibilities, with no guarantees that
software has tested and supports all such configurations.  Sales literature
and IRIX release notes describe the actual limits, which in practice
rise over the years.  While small configurations may in fact max out
their I/O limits, we have not generally seen large Origin 2000s
fill every I/O slot.

3.5 SOFTWARE, PARTITIONING, ETC

Origin 3000 and Onyx 3000 run IRIX, binary-compatible with the earlier
systems, and from a user viewpoint, essentially identical.  Of course,
IRIX is already comfortable with ccNUMA systems that handle huge memories,
large CPU counts, and big I/O systems.  As usual, large numbers of
software people labored mightily on this project, especially to
make sure that these systems appeared minimally different from the Origin 2000.
Many improvements were made under the skin, especially for
tolerating and recovering from errors.  As is often the case, but
sometimes frustrating to old OS people (like me), some of the most
painstaking and difficult OS work produces no cool-looking visible feature,
but rather improves performance or reliability or lets people stop needing to
work to tune applications as much, i.e., the better the job, the more
invisible it is! This is always unsung-hero(ine) work, and there was a great
deal of it done this time. In fact, the Origin 3000 was released with the
same standard mainline OS release available on other platforms, a major
improvement over the common habit of distinct releases for new hardware.  

Work continues on being able to warm-plug PCI cards, I/O bricks,
and later, CPU bricks, in that order. The hardware for all this works fine,
but ... It's a Minor Matter of Software.

The systems can be partitioned, with each partition requiring an I-brick.
When partitioned, they act like clusters that happen to have high-speed
memory-to-memory links.

Systems can be clustered, and customers can arrange the same hardware in
a myriad of ways.  I believe that the usual argument of single system
image versus cluster is a diversion.  Rather, most environments are likely
to be clusters of machines of the appropriate sizes, where the issue is
workload-dependent sizing.

Workloads of independent, CPU-intense jobs, with minimal
data sharing, and very predictable resource requirements, have been
the most widely successful in clusters.  For example, high-energy physics,
some chemistry problems, final graphics rendering for films, some
Web applications, and some transaction processing applications fit this well.
In addition, some individual codes have been successfully parallelized to run on
clusters of small systems, typically using MPI message-passing.
On the other hand, if jobs are less independent, do more I/O, share more
data, and are more unpredictable, or not very amenable to rewriting for
message-passing, then people sensibly prefer larger systems.  For example,
if one needs to run 1000 copies of a single 32-bit, 240MB,
integer-CPU-intense job, in throughput mode, then one might usefully buy
1000 256MB PCs. If, however, the job changes to require 300MB,
somebody has to open up 1000 PCs and replace DIMMs.  More subtly,
some jobs vary dramatically in their memory requirements, either by
phase, or according to the specific input, causing people to wish for
bigger systems to absorb the dynamic variations.

Some customers are rightfully happy with hundreds or thousands of Linux PCs.
Some SGI customers run clusters of 128P systems, and would like
clusters of 512P systems, if they had the budget.  For others, the
optimal size is somewhere in between 1P and 128P - I've personally seen or heard of
Origin 200s and 2000s used in clusters of 2P, 4P, 8P, and 32P elements,
but the other sizes probably happen as well. For example, I know one customer
whose ideal is a cluster of systems, each with 32P, 16GB memory, 2 disks.
Each system makes an in-core copy of a database, 8GB in size, because the
running system simply cannot afford many disk accesses.

The NUMAflex bricks, of course, adapt to these varied sizing needs even
better than the Origin 2000s did, given independent resource scalability
and the smaller increments of CPU and I/O.

4. THE NEXT FAMILY - ITANIUM-BASED NUMAFLEX

This has not been announced yet, so I cannot say much.
However, the architecture is easy to describe: it's the same, except:

(a) The C-bricks have 4 Itaniums, with 2 XXXXXXXs, and a Bedrock,
	and the rest of the hardware is identical.
(b) The OS is Linux, not IRIX, and IRIX scaling tends to precede
	Linux scaling to larger configurations.

5. FURTHER MIPS & IA-64 NUMAFLEX FAMILIES (2001-2006)
	
Again, not released, but a few hints are OK.  We've got scenarios,
going out years, with improvements every year.

(a) New CPU bricks will work with existing I/O bricks & Routers,
as long as that makes sense.

(b) New I/O bricks will appear, which will generally work with existing
CPU bricks, until that doesn't make sense any more, i.e., at some point
there will likely be I/O bricks that work with newer CPU bricks,
but not the oldest ones.

(c) Some bricks will work with existing interconnects, but allow
for faster ones, and when they (and faster Routers) appear, it will
become possible to upgrade the interconnects at least once,
possibly several times. That's very exciting to me, because
running out of interconnect has usually been the final straw in
ending a system architecture's life.

(d) There is room for some uncertainty, as for I/O busses.  We tried to
make design decisions that allowed for changing our minds about bricks,
and sometimes topologies, without bothering other elements.

Again, I cannot emphasize enough that NUMAflex is a design approach
for multiple generations of strongly-related system families.
Origin 3000 and Onyx 3000 are the first products from a new,
and hopefully improved, model of development.

6. NUMAFLEX EARLY ASSESSMENT

6.1 NUMAflex separates CPU and I/O more than is usually the case with
mid-range systems, and the relatively small size of the bricks is unusual
for scalable systems.  One does find echoes of mainframe channel I/O,
of cluster-of-PCs appearance, and of KSR's modules.

This approach utterly depends on:
	(a) Being able to do high-speed, low-latency cabled interconnects.
	(b) Being able to do cache-coherency well.

At first, it seemed weird to do this, and there was quite a bit of rational
skepticism, because people were much more used to producing integrated
boxes, including I/O and power.  

After the months of work that led to 1/98's Flintstones, it took
6 months (January-June 1998) for discussion and analysis
to make sure this could work. At one point, we went through
iterations where CPU bricks were 5U high, 5U wide, 2-across in rack,
and that might allow a reasonable deskside "stack", but they just didn't
work out.  An amazing variety of cabling designs was examined and rejected.
It was non-trivial to get cabling systems that work for the big machines
without adding a lot of cost to the smaller ones.

C-brick designs went through many iterations, as they were heavily
over-constrained by combinations of trace lengths, cooling issues,
and physical packaging, which differ among the various CPUs.
Having separate bricks at least isolated the problems, which certainly reduced
the inter-design constraints.

6.2 But once this approach is incorporated, it seems to work well.
It is well-matched to the "asynchronous design style" where different
teams can be working at different rates on different bricks.
We think it improves resilience to "surprises" arising from events outside
our control, like changes in oncoming I/O standards.

6.2.1 The I/O ASICs are *NOT* in the C-bricks. If a new I/O bus appears:
	(a) Create an XBridge or other variant to support it.
	(b) Make up some new kind of I/O brick.
	(c) Start shipping new machines that can include the new ones.
	(d) If it makes sense, the new I/O bricks can also be shipped to
	    add to installed-base systems, thus upgrading their I/O systems.
	(e) If the new brick clearly subsumes some older brick,
	    one can stop making the older brick when convenient, but the old
	    I/O interface doesn't just disappear, or require hard choices
	    between old and new (recall the Sun Sbus versus PCI issue).
	
This eliminates	the agonizing decisions needed all too often across
I/O bus switchovers, or trying to mix them in the same machine, or scrambling
to avoid obsoleting customer investments.

Of course, taste is required to avoid Quality Assurance explosions.
The hardest issue has been understanding, of the myriad possibilities,
which ones actually make sense to offer at first, and which
others might be considered upon demand.

Anyway, having now done this, we find we *really* like using one standard
I/O connection available at the CPU brick, with minimal cost burden there,
and then adding bricks that use that interface, letting them bear the
cost and conversion burden. We are happy that we can use the bricks to
build fairly modest-sized machines as well as big ones.  We wish we
could build the very smallest machines, but the minimum machine
uses 10U of rack space (CI = 7U, plus power bay's 3U).
Getting smaller would require a separate dedicated design
(akin to the Origin 200), but even the short-rack design appears to
offer a size and price-point lower than usual for a high-end technology,
as we believe that many competitive systems in the Origin 3000 class will likely
have one-rack minima.

We of course have more ideas on how to do all this even better.

6.2.2 CPU bricks allow headroom.
	The various CPU chips have rather different packaging, power, and
	heat characteristics. NUMAflex design allows the issues to be attacked
	separately: if one needs a C-brick with faster fans, so be it.
	Each C-brick worries about its own voltage and power needs.
	If the board layout for one CPU is radically different from that
	of another, so be it - there is no backplane or airflow they
	must share.

6.2.3 Fans & blowers
	Every brick has its own fans, rather than having a giant blower
	that might be awkward to replace, and the cost of all this
	is incremental as bricks are added, rather than an upfront cost.

6.2.4  Engineers love to start from scratch, but we've learned that it may be
better to not have to do that all of the time.  Nevertheless, it is a
wrenching change for many engineers and marketers, whose natural instinct is
to make *their* product great, to instead take a broader view of making
a whole series of products great, even if that means compromises in their own
pieces.  I salute the number of people who were able to make that change,
because it is never easy to think this far ahead, especially when the
immediate problems are difficult in their own right.

6.2.5 There are numerous subtle issues dealing with the manufacturing,
configuring, and marketing of brick-style systems, since the very word
"system" doesn't necessarily mean the usual thing.

6.3 PERFORMANCE

As of this writing, there are few public benchmark numbers,
and various submitted results are working their way through approvals.
But a few notes are possible.

6.3.1 Comparison of 400MHz R12000A in Origin 2000 and Origin 3000

As one would expect, the two systems, both with 8MB caches, perform
about the same on cache-resident codes, but the Origin 3000 performs
noticeably better on codes with higher cache-miss rates, given 2X
the bandwidth and roughly half the latency.

Of the SPEC CPU benchmarks (SPECint2000, SPECfp2000, SPECint_rate2000,
SPECfp_rate2000), we usually consider SPECfp_rate2000 most useful.
SPECint2000 and SPECint_rate2000 get good hit rates in 4-8MB caches,
so reveal little about the performance of the memory system.
The uniprocessor benchmarks (SPECint2000, SPECfp2000) are not very useful
for multiprocessor comparison, as they completely ignore contention
among CPUs.  That leaves SPECfp_rate2000, which uses multiple CPUs,
stresses the memory system, and whose CPU-scaling curves are useful
in understanding performance dropoffs with increasing CPU counts.
To avoid misleading interpretation of the results, it is a good idea
to compare similar sorts of systems when possible, i.e., smaller systems,
or clusters thereof, should almost always have better price/performance
on workloads for which they are suitable, including SPECfp_rate,
but people continue to buy large scalable systems, because their workloads
include jobs that have additional requirements.

The Origin2x00 has always had relatively flat SPEC*rate curves,
with no drastic dropoffs as the number of CPUs is raised, and the
Origin 3x00 is quite similar.

Following are the public SPEC*rate numbers, followed by the normalized
SPEC*rate/CPU numbers, which allow easier comparisons across machines with
differing numbers of CPUs.

Peaks:	SPECint_rate2000, SGI3x00 vs SGI2x00, unofficial estimates marked "E"
	1P	2P	4P	8P	16P	32P	64P	128P
SGI3x00	-	-	-	-	65.3E	130.15 	259.04	-	
SGI2x00	-	7.79	15.38	30.51	-	124.51	-	476.71

Peaks:	SPECint_rate2000, normalized per CPU, unofficial estimates marked "E"
	1P	2P	4P	8P	16P	32P	64P	128P
SGI3x00	-	-	-	-	4.1E	4.1 	4.1	-	
SGI2x00	-	3.9	3.8	3.8	-	3.9	-	3.7

The SGI3x00 is only 6-7% faster here; there is little significance
to the difference between 3.9 and 3.8.

Peaks:	SPECfp_rate2000, unofficial estimates marked "E"
	1P	2P	4P	8P	16P	32P	64P	128P
SGI3x00	-	-	-	-	66.9E	133.8 	265	-		
SGI2x00	-	6.7	13.2	26.2	-	105.5	-	406.63

Peaks:	SPECfp_rate2000 per CPU, unofficial estimates marked "E"
	1P	2P	4P	8P	16P	32P	64P	128P
SGI3x00	-	-	-	-	4.2E	4.2 	4.1	-		
SGI2x00	-	3.4	3.3	3.3	-	3.3	-	3.2

Here the SGI3x00 is 25-30% higher, and we have seen memory-intensive
real codes that were substantially better.  SPECfp_rate is
much more influenced by the actual memory system than is SPECint_rate.
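
The per-CPU normalization in these tables, and the rough ratios quoted
above and in 6.3.2 below, amount to nothing more than the following
(a trivial sketch; the figures are copied from the tables in this section):

	# Normalize SPEC*_rate results per CPU and compare systems at equal CPU counts.
	def per_cpu(rate, ncpus):
	    return rate / ncpus

	sgi3x00_32p = per_cpu(133.8, 32)      # ~4.2 SPECfp_rate2000 per CPU
	sgi2x00_32p = per_cpu(105.5, 32)      # ~3.3
	cpq_gs_32p = per_cpu(147.8, 32)       # ~4.6 (Compaq GS, from the 6.3.2 table)

	print(round(sgi3x00_32p / sgi2x00_32p, 2))   # 1.27 -> the 25-30% quoted above
	print(round(sgi3x00_32p / cpq_gs_32p, 2))    # 0.91 -> the ~90% noted in 6.3.2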

6.3.2 SGI3x00 versus other comparable machines on SPECfp_rate2000

As of 8/30/00, it is difficult to get good sets of recent and consistent
numbers, especially for comparable larger server systems.  SPEC2000
benchmarks are relatively new.  The only Sun number is for the 480MHz 450,
the HP numbers are for the N-series, and the HP "SuperDome" is not announced yet.
The best immediate comparison is with Compaq GS systems, which are large ccNUMAs
that overlap with the middle of the Origin 3000 range.

The following are taken from [SPE00a], augmented by a few unofficial
estimated results for smaller Origin 3000 CPU counts. The IBM SP numbers
are for the 375MHz High Node.

Peaks:	SPECfp_rate2000, unofficial estimates marked "E"
	1P	2P	4P	8P	16P	32P	64P	128P
CPQ GS	5.2	-	-	-	73.3	147.8	-	N/A			
SGI3x00	-	-	-	-	66.9E	133.8 	265	-	
HPN4000	-	7.84	14.4	23.04	N/A	N/A	N/A	N/A
SGI2x00	-	6.7	13.2	26.2	-	105.5	-	406.63	
IBM SP	-	-	14.5	28	51.7	-	-	-
Sun 450	-	-	11.13	N/A	N/A	N/A	N/A	N/A

Peaks:	SPECfp_rate2000/CPU, unofficial estimates marked "E"
	1P	2P	4P	8P	16P	32P	64P	128P
CPQ GS	5.2	-	-	-	4.6	4.6	-	N/A
SGI3x00	-	-	-	-	4.2E	4.2 	4.1	-		
HPN4000	-	3.9	3.6	2.9	N/A	N/A	N/A	N/A	
SGI2x00	-	3.4	3.3	3.3	-	3.3	-	3.2
IBM SP	-	-	3.6	3.5	3.2	-	-	-
Sun 450	-	-	2.8	N/A	N/A	N/A	N/A	N/A

So, doing the best that I can to compare similar systems,
400MHz SGI 3x00 systems deliver about 90% of the SPECfp_rate2000 performance
of 731MHz Compaq GS systems with identical CPU counts, at least in
the 16-32P range.  Presumably, the charts will get filled in over time.

Detailed price comparisons are beyond the scope of this writing,
especially for systems as configurable as Origin 3000 and Compaq GS.
In CPU-rich configurations, I think Origin 3000s use about 50% of the
floor space of same-CPU-count Compaq GS systems, and I think they
are priced at roughly 50% of the GS prices.

If that is *actually* true, or even close, I'm quite happy!

7. SUMMARY

We think that the NUMAflex design approach blends some good attributes of
small system design (iteration speed, lower cost) into the design model for
larger systems. It will take a while to know if we're right, particularly
because some of the hoped-for improvements show up in cost-savings
and time-to-market issues later in the life-cycle.  Of course, this
approach changes the very nature of the life-cycle, since the system
life-cycle is converted to a series of overlapping life-cycles of
bricks, cables, and racks. 

From the numerous possible styles of modularity,
the NUMAflex design approach chooses a specific kind:
	- Small bricks, connected primarily by high-speed, full-duplex
		source-synchronous, cache-coherent interconnect cables,
		that can be used to create shared-memory nodes of
		various sizes, with good enough bandwidth and
		latency to be usually treated like SMP UMAs
	- I/O busses split into separate bricks, with no
		I/O-specific manifestations in other bricks
	- Practical systems from small to very large, using the
	 	same elements across the entire range of sizes.

This approach supports independent resource scalability (at any one time)
and independent resource evolvability over time.  We think it will allow
much faster evolution of systems, and we think it will help RAS.
We know it helps amortize effort across MIPS and IA-64 systems,
given the commonality of components.  Although it is too early to be sure,
we think this will pay off in seriously-improved customer investment protection.

In 1996, when we announced the Origin 2000, we claimed there were strong
rationales for building scalable systems as switch-based ccNUMAs,
and that others would go this way, and in fact, more (albeit, not yet all)
vendors are doing so.

In 2000, I conjecture that the pressure from quickly-evolving smaller systems,
and customer desires for economic scalability and evolvability,
will tend to drive scalable systems over time to ccNUMAs that look more like
NUMAflex-style designs.  We probably won't really know until around
2003/2004, given the usual life-cycle. 

8. ACKNOWLEDGEMENTS

Shifting the design approach to NUMAflex, and getting the first products
out the door has taken huge efforts by a large cast of engineering,
manufacturing and marketing people spread across Mountain View, Chippewa
Falls, and Eagan.  Amazingly, in the midst of some extremely difficult
years at SGI, people managed not only to ship an innovative product,
but to make a major positive change in the entire product development approach,
converting big bangs to more continuous evolution.

NUMAflex, SGI, and SGI Origin are Trademarks of SGI.  Linux is a Trademark
of Linus Torvalds.  Others are Trademarks of their respective organizations.

9. REFERENCES

[COM00a] Compaq "Wildfire" (AlphaServer GS website)
http://www.compaq.com/AlphaServer/gs320/index.html

[GAL96a] Mike Galles, The SGI SPIDER Chip, Proc. Hot Interconnects IV, Stanford,
August 15-17, 1996, pp. 141-146.  This describes the Router used in Origin 2000.

[HRI97a] Cristina Hristea, Daniel Lenoski, John Keen,
"Measuring Performance of Cache-Coherent Multiprocessors Using
Micro Benchmarks", Proc SC'97.
http://www.supercomp.org/sc97/program/TECH/HRISTEA/INDEX.HTM

[LEN95a] Daniel E. Lenoski & Wolf-Dietrich Weber, Scalable Shared-Memory
Multiprocessing, Morgan Kaufmann, San Francisco, 1995.
This is a good all-around reference.

[MAS97a] John R. Mashey, "Big Data and the Next Wave of Infrastress",
Proc. 1999 USENIX, Monterey, CA.
http://www.usenix.org/events/usenix99/invited_talks/mashey.pdf
In particular, page 12 shows intervals for large servers.

[SCI00a] SCIzzL main Web page:
http://www.SCIzzL.com

[SPE00a] SPEC CFP2000 Rates
http://www.specbench.org/osg/cpu2000/results/rfp2000.html

[SUN00a] Sun Website on mid-range bus-based SMPs
http://www.sun.com/servers/midrange/

SGI Origin and Onyx 3000 Websites:
http://www.sgi.com/features/2000/july/3000/
http://www.sgi.com/origin/3000/
http://www.sgi.com/onyx3000/



-- 
-John Mashey EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-933-2663
USPS:   SGI 1600 Amphitheatre Pkwy., ms. 562, Mountain View, CA 94043-1351
SGI employee 25% time, non-conflicting,local, consulting elsewise.


