Path: uni-berlin.de!fu-berlin.de!postnews1.google.com!not-for-mail
From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 14 Jun 2004 14:29:31 -0700
Organization: http://groups.google.com
Lines: 488
Message-ID: <ce9d692b.0406141329.69add720@posting.google.com>
References: <f76a7c17.0405260946.345a607e@posting.google.com> <ZWjxc.26680$Gx4.25304@bgtnsc04-news.ops.worldnet.att.net> <ca4sa9$q8e$1@gaudi2.UGent.be> <ybtk6yfgoph.fsf@isis.visi.com> <caagtt$ck$1@osl016lin.hda.hydro.com> <caaih6$ju1$1@news1nwk.SFbay.Sun.COM> <caao9i$l64$1@osl016lin.hda.hydro.com> <2b7c7ad3fb5703a806336b430ca61e2e@news.teranews.com> <cabkp9$ij8$1@osl016lin.hda.hydro.com> <ce9d692b.0406111534.1bc3e50d@posting.google.com> <caesbu$c0h$1@gaudi2.UGent.be>
NNTP-Posting-Host: 69.105.45.85
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: posting.google.com 1087248571 28510 127.0.0.1 (14 Jun 2004 21:29:31 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Mon, 14 Jun 2004 21:29:31 +0000 (UTC)
Xref: uni-berlin.de comp.arch:167017

"Nicolas Capens" <nicolas_capens@hotmail.com> wrote in message news:<caesbu$c0h$1@gaudi2.UGent.be>...
> Hi John,
> 
> > I'm even less optimistic.  Why is anyone assuming *any* use of C on
> > WIZ, much less wonderful optimization?
> >
> > IMHO, the WIZ ISA, such as it is, is about the *least appropriate*
> > target for C that I've seen in 20 years, rivalled in inappropriateness
> > mainly by some 1950s and 1960s ISAs designed well before C existed.
> >
> > The issue was well-discussed in the early 1990s in comp.arch.
> >
> > - Doing the code generator would be no fun, and the best you could do
> > would
> >   be seriously uncompetitive.  This ISA is exceptionally hostile to C.
> 
> Starting point: to execute a conventional 3-operand instruction (MIPS-like),
> WIZ has to execute three instructions. Since it can do these in parallel,
> it's no worse than an 8086. In fact it's better since the 8086 doesn't have
> 3-operand instructions. This obviously limits it to one 'operation' per
> clock cycle, at most. But let's keep assuming it executes three instructions
> (register transfers) in parallel, VLIW.

Let's not.  The WIZ ISA, such as it is, is clearly *not* VLIW, or even
LIW.
The implementation is in-order single-issue, out-of-order completion,
just like MIPS (and many other) chips of the 1980s.  I suppose it
might be possible to imagine a superscalar WIZ, although the ISA, such
as it is, isn't designed to make that easy.

> 
> Now we have a lot of freedom to move these instructions around. We can
> eliminate many of them, and we can schedule them to hide latencies. No ISA
> has this much freedom, not MIPS nor x86. So it can't possibly be worse.
> Since MIPS and x86 are reasonably well suited for C compilation, how can WIZ
> be "exceptionally hostile"?

Breathtakingly wrong.  Most RISCs were designed to allow code
rearrangement and code-scheduling; MIPS certainly did, and we were
hardly the first.
The WIZ ISA, such as it is, makes this harder, not easier.


> > "WIZ" [see below] IS A 32-BIT WORD-ADDRESSED CPU.
> > 8-BIT AND 16-BIT DATA ITEMS HAVE NO ADDRESSES.
> 
> Is it?
Yes.

> > Note: re "WIZ": I'm not very interested in discussions in which a car
> > is described as having terrific speed, great gas mileage, and low
> > power, and besides, it could easily be turned into a submarine or
> > airplane as needed, or in fact anything that anybody might think of,
> > and would be better.  So, it seemed like the best I could do to

> So, since Steve Bush defined "WIZ" this way, it means it is not extendable?
> These problems are easy to solve. Just define partial registers. The ISA
> says very little about exactly how the available registers should be used.
> Steve just presented a minimal implementation, only to show that it works.
> We're past that stage now, fortunately...

OK, if these problems are easy to solve, why don't *you* do it?
They are certainly *possible* to solve, and they should be far easier
to solve than all the other problems.  This issue has been discussed
plenty in this newsgroup, and the answers can be found.

Design a proper feature set with a clear specification, and post it on
the net.
I'll even offer to take the time to critique it, for workability &
performance.

> Please, before you say "WIZ can't do this", think about how it can be
> extended to do it. You're crawling over the floor and see a wall, so you
> think you can't get over it. But in fact if you stood on your feet, you'd
> see it's just a hurdle.

Do you realize how condescending that reads?

> > With a bunch of redesign, one could probably fix some of this,
> > although at the expense of a bunch more complexity.
> 
> How much complexity it requires remains to be seen... I'm optimistic about
> it and you're pessimistic. Fair enough.

I'm sad to see people waste their own time, or get others confused.

You've done some reasonable-looking graphics & software work.
Is there some reason you *don't* want to do a thesis in something
related to an area where you actually have expertise and might
actually contribute to forward motion (which real research is supposed
to do)?

Gent's course syllabus looks OK, but, from your comments, it's not
clear whether you've yet completed the computer architecture course
(that uses H&P).
Prof. de Bosschere certainly has wide-ranging relevant interests and
has supervised plenty of masters students.  Have you bounced WIZ off
him yet?

There are so many *interesting* things to be done, rather than trying
to start with something so retrograde.

> > But, this is just the tip of the iceberg - WIZ is just *filled* with
> > retrograde mis-features, and I'll try to summarize them in a later
> > posting.

Which will take a while, since there are *so* many problems.  Maybe
it's worth doing, mostly as a "why we don't do these bad things any
more" lesson.
=======================

But just for starters: the whole WIZ approach to registers and
functional units is *painfully* wrong for a CPU [as opposed to a bunch
of I/O devices.]  The pain is from experience, not from sitting around
theorizing.

Bear with me for the first part of this, whose connection with WIZ may
not be apparent to everyone.  Basically, a small part of the MIPS ISA
that was probably a mistake is used *everywhere*, only worse, in WIZ.

I will pick a real ISA as an example: MIPS-I (but it applies to later
ones as well), and a real implementation, the R2000/R2010, because:

a) The ISA, in one part of its integer definition, uses a feature from
which much can be learned about the problems with the WIZ approach,
and was probably a mistake, although it was still much cleaner than
WIZ.

b) Whose basic structure (in-order single-issue, out-of-order
completion, which WIZ labels as "asynchronous") is what WIZ is trying
to do.

c) That was shipped in 1986, with software that did most of the things
that you and Steve Bush seem to postulate as WIZ-enabled advances,
i.e., in this case:
- global-optimizing compilers for C, FORTRAN, Pascal
- code-generation with knowledge of latencies
- assembler that scheduled and reordered code appropriately.
[and many of the techniques were hardly new then]

d) For which I've looked at a lot (tens of thousands of lines) of
source & generated code of real applications and operating systems
(not minuscule toy code).  Even better, I got to see a lot of
micro-architecture simulation results over the years.

e) Which is well-documented, implemented by numerous teams, and widely
studied, as per H&P.

f) And of course, whose ISA & implementations I helped design :-)

WHAT WE DID WITH INTEGER MULTIPLY / DIVIDE

The MIPS-I ISA provided a separate integer multiply/divide that is
somewhat similar to the overall WIZ approach:

DIV   rs, rt   signed divide rs by rt;     set LO = quotient, HI = remainder
DIVU  rs, rt   unsigned divide rs by rt;   set LO = quotient, HI = remainder
MULT  rs, rt   signed multiply rs by rt;   LO = low 32 bits, HI = high 32 bits
MULTU rs, rt   unsigned multiply rs by rt; LO = low 32 bits, HI = high 32 bits

MFHI  rd       copy HI to rd
MFLO  rd       copy LO to rd

MTHI  rs       copy rs to HI
MTLO  rs       copy rs to LO

Usage is as follows:
1) A DIV* or MULT* is issued, using two GP registers as input;
the CPU pipeline continues to run.
 
2) The instruction will execute for many cycles; the count is
implementation-dependent, but in the R2000, 10-32 clocks.

3) By definition, none of these instructions can cause exceptions,
which is *REALLY* important if you want to keep precise exceptions on
an out-of-order-completion design.  For instance, the muldiv unit
doesn't check for divide-by-zero, since one can do:
   DIV R1,R2
   BEQ  R2,divide-by-zero
and you don't need the check when dividing by a constant, or when the
compiler can know that a variable is non-zero.

4) When the instruction completes, the LO and HI registers are set.

5) One or two MF* instructions are issued to fetch LO or HI, or both.
If the muldiv unit hasn't finished computing its results, the MF
instruction stalls until it is finished.

6) However, if you issue another DIV* or MULT* instruction before
using MF* to get the result(s), it restarts the muldiv unit.
Actually, the MIPS-I ISA specs a 2-instruction hazard, i.e., if either
of the 2 instructions preceding DIV* or MULT* is an MF*, the result of
the MF is undefined.  In fact, there are various other hazards
regarding MF* and MT*, which the assembler of course knew how to deal
with.
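The visible HI/LO semantics above can be modeled in a few lines of C
(a toy sketch of the architectural state, not real MIPS code; the
function names are mine):

```c
#include <stdint.h>

/* Toy model of the MIPS-I muldiv unit's visible state: MULT writes a
   full 64-bit product into HI/LO, DIV writes quotient/remainder, and
   MFHI/MFLO copy the halves out into general registers. */
static int32_t HI, LO;

void mult(int32_t rs, int32_t rt) {          /* MULT rs, rt */
    int64_t p = (int64_t)rs * (int64_t)rt;   /* full 64-bit product */
    LO = (int32_t)(uint32_t)p;               /* low 32 bits  */
    HI = (int32_t)(uint32_t)(p >> 32);       /* high 32 bits */
}

void divide(int32_t rs, int32_t rt) {        /* DIV rs, rt */
    /* No divide-by-zero check, matching the ISA: the compiler emits
       an explicit BEQ test only when it can't prove rt nonzero. */
    if (rt == 0) return;                     /* result undefined; model leaves HI/LO */
    LO = rs / rt;                            /* quotient  */
    HI = rs % rt;                            /* remainder */
}

int32_t mfhi(void) { return HI; }            /* MFHI rd */
int32_t mflo(void) { return LO; }            /* MFLO rd */
```

Note the structural point: MULT/DIV write two results into dedicated
registers, and every consumer needs an extra MF* move to get them into
the general register file.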

WHAT COMPILER DID IN PRACTICE
1) As long-latency instructions, try to schedule DIV* or MULT* early.
2) Then fill in with other operations as possible.
3) Then, use MF* to retrieve the result.

HOW WELL DID IT WORK
1) The integer muldiv performance was better than on some RISCs, but
not because of the overlap, but because we had actual hardware, and
some didn't.
There was plenty of discussion of the plusses and minuses of various
approaches 10-15 years ago in comp.arch; as usual, it depends on the
choice of benchmarks.

2) On an in-order-issue machine, it proved fairly difficult to
schedule much other useful work in the shadow of the DIV*/MULT*
latencies.  I don't remember the particular numbers, but in practice,
if you issued a DIV*/MULT* opcode, fairly soon you were going to be
stalled in an MF*.
[And this was from a compiler system whose optimization was extremely
well-respected in the industry.]

WHAT WE DID WITH THE REST OF THE INTEGER ALU
1) The straightforward thing, with 1-cycle
add/subtract/shift/logicals, and with fanatic attention paid by most
CPU designers to make back-to-back ALU operations go fast, i.e., like
bypass networks.

WHAT WE DID WITH MIPS FLOATING POINT

This is another set of FUs with ISA-spec'd semantics that allow
correct in-order issue, out-of-order completion.

1) The various FP units [ADD, MUL, DIV] were (fairly) independent as
well, and had longer latencies than the simple integer ALU operations.
It was well-known that there were likely to be implementations with
various latencies and repeat rates, although the compilers always knew
these, and scheduled accordingly.  However, correct execution never
depended on this, since:

2) Unlike the integer muldiv unit, the FPU was scoreboarded and
interlocked.
I.e., there might well be multiple operations in progress, but if the
CPU attempted to issue an operation that the FPU wasn't ready to
accept, the main pipeline stalled until the FPU was ready.  Likewise,
any attempt to read an FP register that was the target of an executing
FP operation would stall.

3) In addition, the FP unit used a clever trick to retain precise
exceptions in an in-order-issue, out-of-order-completion CPU:
GSCA: Hansen floating-point patent.

Basically, this is a requirement of *any* long-latency independent FU
in this type of design: either it can't cause exceptions at all, or if
it can:
a) It stalls the issue logic until it is sure there will be no
exception.
b) Hopefully, there is some quick check it can make [as in Hansen's
scheme], so that such stalls are rare and short.
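The flavor of such a quick check can be sketched in C.  To be clear:
the test below, its thresholds, and the function name are mine for
illustration only; they are a simplification, not the actual scheme in
Hansen's patent.

```c
#include <math.h>
#include <stdbool.h>

/* Toy "could this FP multiply possibly trap?" predictor, computed
   from the operand exponents alone, long before the multiply itself
   finishes.  If the answer is "definitely not", issue logic need not
   stall; if "maybe", it stalls until the unit knows for sure.  The
   exponent bounds are rough and deliberately conservative. */
bool fmul_might_trap(double a, double b) {
    int ea, eb;
    frexp(a, &ea);                 /* a = m * 2^ea, with 0.5 <= |m| < 1 */
    frexp(b, &eb);
    int esum = ea + eb;            /* product exponent lands near ea+eb */
    /* Leave slack around double's exponent range (roughly -1022..1024)
       so the test only ever errs on the "maybe" side. */
    return esum >= 1022 || esum <= -1020;
}
```

For typical operands the check says "no trap possible" and issue
continues; only operands near the overflow/underflow boundaries force
a stall.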

HOW WELL DID THAT WORK?
Fine.

ASSESSMENT OF MIPS ISA FOR MULDIV

1) Design assumptions
   1a) ISA: We wanted hardware integer muldiv, as we had enough
       programs to believe it was useful, and we didn't want to
       assume an FPU (many chips do integer MULTs in the FPU).
   1b) ISA: programs could make use of all 64 bits produced by
       DIV*/MULT*, i.e., the full 64-bit product (2 registers) for
       MULT*, and both quotient & remainder for DIV*.  You get the
       remainder "for free".
       IMPLICATION: unlike all other MIPS ALU instructions, these are
       the only ones that generate 2 results, not just 1 [and that
       proved a painful special case in the MIPS R10000].
   1c) Implementation: we didn't have scoreboarding on the integer
       register file.

Given those, what we did was OK.

2) However, in retrospect, I wish we'd done something more like
Alpha's integer multiply instructions, which are normal 3-operand
instructions, and normally deliver the low-order register [which is
what most compiled code uses], with a separate instruction [UMULH]
to deliver the high-order register.

Alpha doesn't have integer divide hardware, but if one wanted it, the
same approach could be taken, i.e., just produce the quotient, and if
desired, separate opcode for remainder.
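In C terms, the Alpha-style split turns the multiply into two ordinary
single-result operations (a sketch of the idea; the function names are
mine, and I've used a 32-bit flavor to match the MIPS discussion,
whereas Alpha's actual UMULH is 64x64->high-64):

```c
#include <stdint.h>

/* Alpha-style integer multiply as two ordinary 3-operand,
   single-result operations: mull yields the low half (what most
   compiled code wants), and a separate umulh yields the high half
   only when it's actually needed. */
int32_t mull(int32_t a, int32_t b) {
    return (int32_t)((int64_t)a * (int64_t)b);            /* low 32 bits */
}

uint32_t umulh(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32); /* high 32 bits */
}
```

Each result lands directly in an ordinary destination register: no
HI/LO, no extra MF* moves, and nothing special for the register
allocator or the kernel to manage.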

Of our assumptions:
 1a) was an OK choice.
 1b) We knew some people who loved 32x32->64 bit multiplies, and it
     was occasionally useful to get both quotient and remainder ...
     but in practice, most compiled code took little advantage of
     this.
 1c) To have had an integer scoreboard would have been culturally
     hard, and maybe would have cost precious schedule ... but could
     have been done.

WHY DO I WISH WE HAD DONE IT DIFFERENTLY?

1) The regular MIPS integer register file was very clean: 32
registers, of which only the zero register was different in hardware. 
While there are occasionally requirements for registers that have
hardware differences, from a compiler writers' point of view,
especially when doing serious optimization, the nice thing about RISC
was that we got rid of a lot of the weird special cases and dedicated
resources that plagued many older architectures.
For example, having done compiler work on 68Ks, I can attest to the
irritation of the split between A & D registers ... and the 68K is
relatively clean.
Some of the earlier architectures were far worse [and IA32 certainly
retains plenty of such issues, but at least it's a general-register
machine.]

2) But, the MIPS muldiv unit introduced two registers, LO and HI, that
weren't quite like the rest.  In general, coupling function units and
registers this way is awkward, as seen in plenty of code.  The problem
is that the LO & HI registers are unique named resources that must be
allocated to a single muldiv operation at a time.  While it would be
silly to have 2 separate muldiv units for general applications, and
not worth the cost, it would be really awkward with this structure,
because you essentially would need to add another HI/LO pair, plus
instructions to deal with them, and compiler pain.

On the other hand, many MIPS implementations have multiple integer
and/or FP ALUs, with binary compatibility, and they work just fine.

Terminology notes for next section:
A register value is DEAD if you can write over it without disrupting
the computation, i.e., it has no further use.  Otherwise it is LIVE.

Normal calling conventions for register-oriented machines:
- dedicate some registers, i.e., as stack pointer, frame pointer (if
  needed)
- spec some registers as CALLER-SAVE, i.e., not safe across a function
  call, so if the caller has LIVE data in any of them, it had better
  save them before the call.  In some ISAs, it is natural to use some
  of these for the first N arguments, i.e., the caller fills them with
  LIVE data, but considers them DEAD on return.
- spec other registers as CALLEE-SAVE, i.e., safe across function
  calls.  It's up to the callee to save and restore any such
  registers.
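The convention above can be modeled concretely (a toy model: the
8-register machine, the particular split, and all names here are made
up for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-register machine: r0-r3 are CALLER-SAVE (scratch),
   r4-r7 are CALLEE-SAVE. */
int32_t regs[8];

/* A well-behaved callee: it may clobber r0-r3 freely, but any
   callee-save register it touches must be saved in the prologue and
   restored in the epilogue. */
void callee(void) {
    int32_t saved[4];
    memcpy(saved, &regs[4], sizeof saved);   /* prologue: save r4-r7 */
    for (int i = 0; i < 8; i++)
        regs[i] = -1;                        /* scratch work uses everything */
    memcpy(&regs[4], saved, sizeof saved);   /* epilogue: restore r4-r7 */
}

/* The caller's side of the bargain: LIVE data in a caller-save
   register must be spilled before the call (or placed in a
   callee-save register to begin with). */
int32_t caller(void) {
    regs[0] = 111;                  /* LIVE value in caller-save r0 */
    regs[4] = 222;                  /* LIVE value in callee-save r4 */
    int32_t spill = regs[0];        /* r0 is not safe across the call */
    callee();
    /* r4 survived by convention; r0 did not, so use the spilled copy. */
    return spill + regs[4];
}
```

The waste the next section describes is exactly the case where such
save/restore traffic happens even though the value being preserved is
DEAD.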

3) The awkwardnesses of things like HI and LO

- They're different from the regular GP registers.
  In C, on MIPS:
     func(a+b)   does ADD 1st-arg-register,a,b, then calls func.
                 And really nicely, even if a+b took multiple cycles
                 (suppose it were FP), there would be no stall until
                 func actually accessed 1st-arg-register, which in
                 fact gives plenty of cycles.
  whereas
     func(a*b)   does a MULT a,b, followed by MFLO 1st-arg-register
                 to get the data to the right place.
                 I.e., there is *less* parallelism.
   
- They're high-speed registers, and as such precious ... but they're
  really only useful for muldiv operations, yet they have to be saved
  and restored "just to make sure".
  Every bit of my experience says that there had better be a really
  good reason for a dedicated resource, as:
  - Their usage tends to cause serialization bottlenecks.
  - They end up having to be saved/restored, and otherwise managed,
    often with extra data movement, and often when actually
    unnecessary, because the code doing it can't know whether the
    data is actually LIVE.

- In practice, you would never start a DIV*/MULT* before a call
  without using MF* to retrieve the data.  I.e., while these
  theoretically could be CALLER-SAVE, in practice that makes no
  sense, because any code that does a MULT*/DIV* has to use MFHI,
  MFLO to save the registers, and then MTHI, MTLO later to restore
  them.  If the first MFHI stalls, then you stall anyway.
  However, in most cases, saving/restoring them at a call is a waste,
  because they are usually DEAD, because the data that anybody cares
  about has been copied already.

- Interrupt routines, except for very lightweight ones, end up having
to save and restore HI & LO, even though, most of the time, it turns
out they are DEAD anyway, but there's no way for a kernel to know
that.  [yes, we could have introduced "valid" bits or some such thing,
but the complexity wouldn't have been worth it.]

-  In the R10000, they caused a bunch of special cases that caused
   moaning from designers.

BOTTOM LINE
If I had to do it again, I'd do this feature more like Alpha's.
Most of the user-level MIPS ISA is relatively clean and simple,
but the muldiv unit is an oddity in the ISA that remains to this day.
[Some day, I'll write the integrated "MIPS-I: what worked well, what I
wish we'd done differently, and why" document.  While muldiv isn't #1
on my list, it's fairly high...]

But, I hope it's clear that *lots* of people perfectly well understand
the idea of an ISA whose semantics allow for in-order issue,
out-of-order-completion.
It isn't new in WIZ, it wasn't new in MIPS, and that was nearly 20
years ago.
Also, lots of people understand code scheduling and optimization, and
they aren't new either.
Finally, lots of people understand that just throwing lots of
functional units at something doesn't automagically create great
improvements.  In some cases, not even an *infinite* number of FUs,
with perfect lookahead, helps much.
(Wall's 1991 ASPLOS paper "Limits of Instruction-Level Parallelism"
being one of the classics.)

NOW - WIZ

1) MIPS has one integer unit with awkward properties [-]
   An operation is issued:
   + (1) Two operands are fetched [OK]
   + (2) Multi-cycle execution doesn't stall the instruction issue
   - (3) Two results are produced, not one
   - (4) The results sit in special registers
   - (5) The special registers need to be managed; they cannot be
     allocated the same way as the regular registers by compilers.
     They often end up being saved/restored unnecessarily.
   - (6) Unlike the FPU, one cannot issue another operation to the
     unit and just stall if it happens to be busy for any reason,
     because a new issue to it overrides any current execution.

2) Essentially every WIZ unit has similar properties, but worse,
since:
   - (1) Operands must explicitly be moved to the unit, which in
     fact...

   - (7) Creates even more special registers

and then, the absolute disaster:
 ----(8) Does this, not just for operations (like mul or div) that
         are inherently multi-cycle, but for the simplest ALU
         operations, for which most CPU designers work very hard to
         assure efficient back-to-back dependent execution, i.e.,
         adds, shifts, logicals.

and hence, after decades of moving away from computers where most
registers had dedicated functions [i.e., the typical Accumulator / MQ
/ index registers of 1950s designs like the IBM 70x, 709x]...

  ---(9) The WIZ style makes most of its registers special-purpose,
         and some of them (COUNTERs) have side-effects that must be
         coped with.

This has nothing to do with clocked-vs-asynchronous designs [there is
plenty of confusion already.]  It doesn't have much to do with
implementation issues [there are plenty there], or with all the other
missing things that real-world processors need.  Fundamentally, the
connection of FUs with visible register state, in the way WIZ does it,
makes it *harder* to improve ILP, not easier.

SUMMARY
1) Compiler people will not be keen on WIZ.

2) Neither will operating systems people. 

3) And there are many more mis-features to make me believe 1) and 2),
but I'm done for now.

