Article: 66302 of comp.arch
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Shared Memory vs. Message Passing (Dead Horse Flogging)
Date: 3 Jan 1998 21:04:40 GMT
Organization: Silicon Graphics, Inc.
Lines: 151
Message-ID: <68m958$9b6$1@murrow.corp.sgi.com>
References: <68d5od$b59$1@lyra.csx.cam.ac.uk> <68k919$m11@news2.zippo.com> <68kgpu$fe2$1@murrow.corp.sgi.com>

In article <68kgpu$fe2$1@murrow.corp.sgi.com>,
mccalpin@frakir.engr.sgi.com (John McCalpin) writes:

|> When I last looked at this (February 1997), 63 of the top 75
|> commercial science/engineering codes on SGI systems were available
|> in shared-memory parallel versions, with most of the remaining 12
|> having shared-memory parallel versions in development. Only about
|> five of these apps have message-passing versions now.

|> that HPF is the answer (maybe OpenMP is), but that is a different
|> soap box. See http://www.openmp.org/ for info on the latter.

Since this issue comes up again & again, and since there has been a lot
of confusion in this thread (e.g., calling Beowulf any kind of a ccNUMA
is very confusing, since the nodes don't have coherent caches), and
since the problem domain has been muddy, let's try again, looking at
the interaction of system design and application nature.

1. System: Start where it all starts: bandwidth and latency

Imagine a log-log-scale chart:
   vertical axis, in GB/sec, runs from .001 (1 MB/sec) to 1000 GB/sec
   horizontal axis runs from 1ns at left to 10000 seconds on right,
   and gives time-to-first-data, or minimal latency.
[Of course, total latency is data-size dependent, and ~ minimal latency
+ size/bandwidth; starting with the chart plus a given size, you could
produce another chart showing total latency for that size.]

Now, plot points like:
a) memory access speed in systems (like what lmbench would tell you)
   like Intel SHVs, DEC 8400s, Sun Ultras, Origins, T932s.
(Earlier T932 bandwidth was 800 GB/sec, which is why the chart had to
go to 1000 GB/sec :-))
b) "special" network interconnects like ServerNet, SP2 switch,
   DEC memory channel
c) general networks: Ethernet, FDDI, ATM, HIPPI
d) disks (1, and striped)

(a) fills the upper left corner; in particular, the points lie in the
    box of 500 MB/sec plus, and 1-1.5 microseconds.
(b) is more complicated, but the points lie in the middle zone down
    and to the right of the (a) box.
(c) and (d) form a vertical stack in the range of 1 MB/sec to 7 GB/sec,
    and ~ 1-10 milliseconds (either a disk access, or a user-user
    message). (7 GB/sec off disk was recently achieved on an Origin;
    there may be higher numbers.)

"Money can buy bandwidth, but latency is forever"

2. Without knowing much else about a system, if you can place it on
this kind of graph, you know something about what is practical or not.

2.1 Shared memory programming is practical in the box at the upper
left, and often message-passing is simulated on top of the shared
memory. Programs can:
- synchronize often, since the latency is low
- share memory
- move memory around, since the bandwidth is high

2.2 As you leave the box to the right and down, you get higher latency
and less bandwidth, and the more careful you must be to have
applications whose inherent data needs can survive with less frequent
interactions and/or low bandwidth. Note that the bandwidth ratio from
memory system to general network is ~100-1000, and the latency ratio
is at least 1000X for the general networks, less for the
special-purpose ones.

Note that both bandwidth and latency limits can cause problems. As you
move horizontally to the right, you keep the same bandwidth, but each
synchronization takes longer (which may also make the effective
bandwidth decrease, unless transfers can be efficiently pipelined
through all of the layers of software).
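The relation noted earlier (total latency ~ minimal latency +
size/bandwidth) can be sketched in a few lines of Python. This is only
an illustrative model: the function names and the operating points
(1 ms latency, 10 MB/sec and 100 MB/sec networks) are assumptions
picked from within the ranges discussed above, not measurements of any
real system.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Total time for one transfer: minimal latency + size / bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def effective_bandwidth(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually delivered once latency is counted."""
    return size_bytes / transfer_time(size_bytes, latency_s,
                                      bandwidth_bytes_per_s)


def speedup_from_bandwidth(size_bytes, latency_s, old_bw, new_bw):
    """How much faster one transfer gets if only the bandwidth improves."""
    return (transfer_time(size_bytes, latency_s, old_bw)
            / transfer_time(size_bytes, latency_s, new_bw))


# Assumed, illustrative operating points (not measurements).
NET_LATENCY_S = 1e-3      # ~1 ms user-to-user message latency
SLOW_NET_BW = 10e6        # 10 MB/sec
FAST_NET_BW = 100e6       # 100 MB/sec: 10X the bandwidth, same latency


def main():
    for size in (1_000, 1_000_000, 100_000_000):
        gain = speedup_from_bandwidth(size, NET_LATENCY_S,
                                      SLOW_NET_BW, FAST_NET_BW)
        eff = effective_bandwidth(size, NET_LATENCY_S, FAST_NET_BW)
        print(f"{size:>11} bytes: 10X bandwidth gives {gain:5.2f}X, "
              f"effective {eff / 1e6:7.2f} MB/sec")


if __name__ == "__main__":
    main()
```

With these assumed numbers, a 10X bandwidth upgrade speeds up a 1 KB
message by less than 10%, while a 100 MB transfer gets nearly the full
10X. Short, frequent messages sit in the latency-dominated regime; only
long transfers see the extra bandwidth.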
People who moved up (in network bandwidth) have often been surprised
to discover that they got little more performance, because their
applications were latency-bottlenecked, not bandwidth-bottlenecked.

At any point, if you move vertically down, bandwidth decreases, which
doesn't bother synchronization and short transfers, but directly
increases effective latency for long transfers.

2.3 Put all of this together:

You can do shared memory programming in the upper left, or you can do
message passing. If you have two tasks that don't need to share much
data, or to synchronize fairly often, then either choice works, in
which case message passing is better, since it hides state better.
As the amount of naturally-shared state increases, shared-memory
programming looks more and more attractive.

Despite useful efforts to make networked computers act like they have
shared memory, this is hard: 100X (or larger) differences in bandwidth
and latency are difficult to make invisible, because the penalty is so
high. Hence, if the application is inherently a network one, you do
message-passing.

Finally, if you have a large parallel computation in which the
component computations have a high ratio of computation to
communication, then you can get good results from a network of
individual computers.

There are computations that work well on clusters of multiprocessor
shared-memory systems, where the natural decomposition of the problem
wants a smaller number of bigger chunks of compute power, rather than
a larger number of smaller chunks [in particular, use of larger chunks
of memory avoids bad fragmentation/limits problems. It is extremely
awkward if the "natural" node size should be X, but the actual node
size is .9X; burns PhD-student years quickly.]

As John McCalpin says, there is data about what third-party software
vendors actually do, and as Carlie Coates says, big scalability is
hard, so there is no magic.

Generalizing from a large number of customer discussions, here is what
seems to cover many cases:
a. Start with each node as a shared-memory multiprocessor, with enough
   memory to run any of the problems you want to run.
b. If that's not enough, cluster such nodes together, being careful
   that the node size is big enough to keep the
   computation-communication ratio high. [Some CFD codes partition
   well this way, for example.]

If you have a problem class that's really well-partitioned, and whose
sizes don't vary too much, you can use a loosely-coupled NOW or POPC
farm.

3. Comments on specific kinds of code

- Client/server applications are obviously message-oriented.

- Office-productivity applications that used to be more message-based
  seem to be going more shared-memory. Way back when, UNIX vi and
  spell were separate things, with no knowledge of each other's
  internal state (good). On the other hand, with current word
  processors and integrated spell-checkers, it is probably easier to
  do these with threads that share memory, due to the interactive
  nature of the resulting highlighting and fixing, and the
  possibilities for checking upon entry rather than as a batch
  process.

- Operating systems and DBMS use combinations of message-passing and
  shared-memory: the former for clean interfaces, and the latter for
  performance.

4. Summary

Shared-memory, threads-based programming may cause problems if not
done cleanly, but if anything, there is going to be plenty more of it,
if for no other reason than the existence of low-cost SMPs.

--
-john mashey    DISCLAIMER:
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


Article: 66315 of comp.arch
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Shared Memory vs. Message Passing (Dead Horse Flogging)
Date: 5 Jan 1998 03:26:49 GMT
Organization: Silicon Graphics, Inc.
Lines: 88
Message-ID: <68pjtp$lni$1@murrow.corp.sgi.com>
References: <68p8r4$i9a$1@murrow.corp.sgi.com> <68pd88$nan@news1.zippo.com>
NNTP-Posting-Host: mash.engr.sgi.com

In article <68pd88$nan@news1.zippo.com>, lindahl@cs.virginia.edu
(Greg Lindahl) writes:

|> > >But what customers actually use these codes for the majority of their
|> > >computing?
|> >
|> > All the big ones (except Uncle Sam). Although some large customers
|> > do run their own codes, it is my understanding that it is the commercial
|> > codes that are the primary reasons that the machines are bought and
|> > used.
|>
|> Well, if academic/government and commercial users are completely
|> different, then it doesn't seem useful to me to lump them together
|> when discussing the market. The reasons that commercial users might
|> favor shared memory don't mean squat to academics, for whom message
|> passing might be far superior.

Academic, government, and commercial users are hardly "completely
different". But first, try looking at:
http://www.ncsa.uiuc.edu/Apps/SoftwareTable.html
which is a list of HPC software applications at NCSA, to illustrate
2b-2d below.

Consider what a place like NCSA does:

1) There are people who want to research large parallel work, and
write/tune their own code, and sometimes take over big chunks of the
parallel machines.

2) There are people who just need to run a bunch of smaller jobs,
either on individual CPUs, or with moderate parallelism. Of these:
   2a) Some have their own code.
   2b) Some use code that came from other places in academia.
   2c) Some use binaries that originated in academia, but then got
       migrated into commercial code.
   2d) Some use code that started as commercial code.
Of course, desktops may be perfectly adequate for the smaller/shorter
runs of these things.
For someone to accomplish their work, they may well need programs from
any and all categories, and personally, I think it is very good if
programs are able to migrate. If academic codes are restricted to
programming environments never found in industry, both sides lose.

3) Government sites that I've talked to have mixtures of 1) and 2),
although some government sites exist primarily for 1) and 2a), and of
course, some cannot tell you exactly what they're doing, or even what
their agency's name is :-)

4) Academic sites certainly have a mixture of 1) and 2), especially
academic sites that serve researchers whose research is *not* in
computer algorithms and architecture, but is in chemistry, physics,
mechanical engineering, etc., and part of whose work simply needs
access to current commercial software.

An all-too-common disaster is that of choosing a large computer for a
university based on the wishes of people who want to do research on
parallel processing, but justified by the idea that it will serve many
others in the university. A common result: "Well, A & B love the
machine, as they're the only ones who are using it, and they'll tell
you how great it is ... but nobody else has gotten much use from it."
(I've heard that more than once...)

5) Industry sites tend not to do so much of 1), although there are
always exceptions. Some oil&gas folks seem to write a lot of their own
code and do big parallel tuning.

6) Anyway, in practice, it appears that many large machines must be
able to run commercial codes for at least part of their workload, to
be viable, and they are certainly helped if there exist smaller
versions that are compatible. I've heard of cases where an industrial
organization tried some things out on some central site, and then
bought a smaller version for its local use, reserving the central
research site for especially large jobs. That is good
academia/industry partnership.
Some large parallel machines can be justified 100% by a moderate
number of dedicated, locally-written applications, or perhaps by
offering a large parallel resource to a larger community.

--
-john mashey    DISCLAIMER:
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389