Msg #177 / 1-189  Time: 22 Dec 94  15:50:24
From: Doug Little
To  : Robert Cooper
Subj: 68040 vs 80486
---------[N.ST.PROG]-----------------------------------------------
The probe droid left hyperspace with a message from Robert Cooper...
 RC> On 18 Dec 94, Doug sent this Borek :
 RC>
 DL>> Oops, maybe I should have looked here before dropping my lengthy
 DL>> 68040 message in N.FALCON.MISC?
 RC>
 RC> Feel free to forward it into this area, not all of us get
 RC> N.FALCON.MISC and we'd hate to miss out. :}

Okeydokey, here it is in its widescreen, digitally remastered, director's
uncut form...

-----------------------------------------------------------------------
Msg #214 / 1-368  Time: 18 Dec 94  21:40:20
From: Doug Little
To  : xxx
Subj: MC68040 vs 486/P5
---------[N.FALCON.MISC]-----------------------------------------------

??>> As I have explained in other messages, a 68040 is basically a 68030
??>> with a stripped-down floatingpoint unit.........

    Hi guys, just noticed this message and thought I'd drop in a few
notes of my own (we have done some work on both chips and might be able
to add something to the situation, hopefully to resolve it once and for
all).

-------------------------------------------------------------------------
There now follows a complete rundown on EXACTLY WHY the 68040 is
something we should all encourage. :)
-------------------------------------------------------------------------

What your friend says is basically correct as far as compatibility is
concerned (040 is an 030 with a 'reduced' FPU) but in the real world of speed 
and efficiency this statement does not do the 68040 the justice it
deserves.

The 040 has a number of additional enhancements which make it a more
efficient chip than its predecessor, and basically every other processor
in its class. I will explain why.

>>>>>>>

The 68030 carries around 300,000 transistor gates on its footprint. The
68040 has approximately 1,200,000. Admittedly much of this goes on the
new cache and the FP math unit, but this still leaves a load for use by
the processor itself.

The 68030 relies on a small (but significant) amount of MICROCODE or
'gunk' as it's referred to in the chip world. This means instructions
are not streamlined and are not as fast as they could be. The 68040
contains no microcode - all of the instructions have been rebuilt and
hard-wired for speed. Few processors can boast of this!

The execution unit is now fully parallel with a multi-stage pipeline
eliminating most of the hold-ups inside the 68030. This means
that different instructions are being fetched, decoded & executed
simultaneously in up to 5 stages inside the pipeline (it could be 7, I
forget). This is what gives the DSP56001 such a high instruction
throughput - but unlike the DSP, the 040 pipeline is dynamic. There are
no ILLEGAL instruction combinations that will clog the pipe!
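
For anyone who wants to see why overlapping the stages matters, here is a
rough Python sketch of an idealised pipeline. It assumes the 5-stage figure
quoted above (which may not be the real count) and no stalls at all, so
treat it as a best-case illustration rather than a model of the actual 040.

```python
def cycles_unpipelined(n_instructions, stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * stages


def cycles_pipelined(n_instructions, stages):
    """Ideal pipeline: fill it once, then one instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return stages + n_instructions - 1


STAGES = 5  # figure quoted in the post; the exact stage count is uncertain

for n in (1, 10, 1000):
    print(n, cycles_unpipelined(n, STAGES), cycles_pipelined(n, STAGES))
```

Over a long run of instructions the pipelined count approaches one cycle
per instruction, which is where the throughput comes from.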

The floating point unit is NOT to be compared with the 68881/2 as it is
WAY ahead of either of these chips as far as technology is concerned. The 68040 
FPU, although not as exhaustive as the 68882, is MUUUUUUCH faster.

A typical example is that a 68882 floating-point multiply can take
between 50 and 100 clock-cycles to graduate (I have the exact number in a book 
somewhere). The 68040 can manage it in just 4 cycles. The 68882
takes well over a hundred cycles to perform a divide - the 68040 takes
about 9. This kills chips such as the 486 which can't even come close to
this sort of efficiency, and is the main contribution to the fact that
the 68040 outperforms the Pentium at the same clock rate. In actual usage
the 68040 leaves the Pentium standing, as the Intel chip is so far still
running old 486 code and the RISC part is mostly ignored.
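
To put the FPU cycle counts above into ratios, here is a small Python
sketch. It uses only the rough numbers quoted in this message (not exact
data-book timings), and takes 100 cycles as a lower bound for the 68882
divide since the text only says "well over a hundred".

```python
def speedup(old_cycles, new_cycles):
    """How many times fewer cycles the new chip needs for the same operation."""
    return old_cycles / new_cycles


MUL_68882_RANGE = (50, 100)  # cycles, rough range quoted in the post
MUL_68040 = 4                # cycles, as quoted
DIV_68882_MIN = 100          # cycles, lower bound for "well over a hundred"
DIV_68040 = 9                # cycles, as quoted ("about 9")

mul_low = speedup(MUL_68882_RANGE[0], MUL_68040)
mul_high = speedup(MUL_68882_RANGE[1], MUL_68040)
div_min = speedup(DIV_68882_MIN, DIV_68040)

print(f"multiply: {mul_low:.1f}x to {mul_high:.1f}x fewer cycles")
print(f"divide:   at least {div_min:.1f}x fewer cycles")
```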

The cache consists of 2 x 4k blocks instead of the 2 x 256-byte blocks
present on the 68030. This makes a HELL of a difference. The chip has a
write-cache with delayed-write/snooping ability which means that data
written to RAM can hang around inside the cache until a device (screen,
blitter, DMA sound etc.) looks for it. Only then will it be written
across the bus and suffer a bottleneck. This means that bus-referencing
is drastically reduced to the point where it only interferes with large
block-transfers, leaving almost everything else significant completely
inside the processor. Also, the chip snoops ALL busmasters, not just the
caches (although the I/O ports are omitted from snooping, but that's a
computer engineering problem...). Motorola claim a 97% hit rate on the
cache on the 68040 - it's a bit of a 'seasonally adjusted' figure but it
is very high.
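
As a rough feel for what a hit rate like that buys you, here is a Python
sketch of the usual average-access-cost calculation. The 97% figure is the
one quoted above; the hit and miss costs are made-up numbers for a slow
bus, purely for illustration.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Weighted average cost of a memory access, given the cache hit rate."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


HIT_RATE = 0.97    # Motorola's claimed figure, as quoted in the post
HIT_CYCLES = 1     # hypothetical cost of a cache hit
MISS_CYCLES = 20   # hypothetical cost of going out over a slow bus

avg = average_access_cycles(HIT_RATE, HIT_CYCLES, MISS_CYCLES)
print(f"average cost per access: {avg:.2f} cycles")
```

With these assumed costs the average works out to about 1.57 cycles, so
even a very slow bus only shows up in a small fraction of accesses.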

This would suit the Falcon fantastically as you could have a crippled
16-bit bus and still get the ridiculous MIPS throughput to be expected of
the 68040. In addition to this, the write-cache & bus-snooping system
means that self-modifying code is no longer a problem - compatibility
with the 68000 is actually HIGHER on this front which can't be bad news.

You might wonder what happens when a large block transfer occurs - as it
inevitably will. Well, Motorola has covered this also with a complete
read/write burst access to & from the cache.

When the cache is enabled, any data required by the processor can be
accessed quickly if it's in the cache already - at the cost of a tiny
overhead. This is a cache 'hit' and is usually good news. If the data is
not present however, the small overhead in addition to the time taken to
finally get the data from the bus is higher than a straight
memory-reference. This is a cache 'miss' and is bad news. What the 040
does (and what the 030 should do but doesn't) is grab a chunk of 16 bytes
in one go and then cache-check the whole lot. This is more efficient than
checking each of the smaller 4-byte chunks one at a time. This is known as
'burst access' and the 040 is fantastic at it.
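
Here is a toy Python sketch of the difference, counting how many separate
fills it takes to read a run of sequential longwords. It assumes an empty
cache that never evicts anything and ignores how long each fill takes, so
it only shows the transaction count, not real bus timing.

```python
def count_line_fills(addresses, line_size):
    """Count the fills needed to service the given byte addresses."""
    cached_lines = set()
    fills = 0
    for addr in addresses:
        line = addr // line_size
        if line not in cached_lines:
            cached_lines.add(line)  # fetch the whole line on a miss
            fills += 1
    return fills


# 64 sequential longword reads covering 256 bytes
addresses = range(0, 256, 4)

print(count_line_fills(addresses, 4))   # 64: one fill per longword
print(count_line_fills(addresses, 16))  # 16: one burst per 16-byte chunk
```

Each burst still moves 16 bytes, but it is set up once rather than as four
separate accesses.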

As an added note, the 486 has no write-cache and no bus-snooping and
cannot perform burst-writes back to memory. Just for information's sake!

One of the most significant improvements on the 030 is the arithmetic
unit. The 030 takes about 18 cycles to perform a multiply instruction (on
a good day), whereas the 68040 takes just 1. This is MUCH faster than the
486, and is similar to the Falcon DSP. This is known as a single-cycle
multiply, or a 'parallel multiplier-array'.

There are new instructions including 'move16' which is complete overkill
as it allows you to move chunks of 16 bytes around in one go. This is a
rare and amusing addition.

Now the juicy bits...

When compared to the 68030 (yes, I will offer figures), the 040 performs
around 15 times faster AT THE SAME CLOCK RATE. That is 16MHz.

The optimisations made to the instruction units were based on statistics
involving hundreds of millions of lines of code from real-world programs.
Few processors can claim this...

The 68040 offers a MIPS rating almost as high as its clock rate, about
26MIPS at 32MHz - that's basically RISC throughput on a CISC
architecture. And Motorola are VERY proud of this as they well should be.

A 68040 is easily capable of this sort of insanity:

    move.l  ([label.w,pc,d0.l*8]),([a4,d4.l*2],offset.w)

A bog standard RISC chip (i.e. Console or otherwise) would need to do
something like this to achieve the same effect:

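    move    #label,a2       ; a2 = label displacement
    move    pc,d1
    add     d1,a2           ; a2 = pc + label
    move    #8,d2
    mult    d2,d0           ; d0 = d0 * 8
    add     a2,d0           ; d0 = pc + label + d0*8
    move    d0,a0
    move    (a0),a0         ; fetch the source pointer
    add     d4,d4           ; d4 = d4 * 2
    add     d4,a4
    move    (a4),a4         ; fetch the destination pointer
    move    #offset,d5
    add     d5,a4           ; a4 = destination pointer + offset
    move    (a0),(a4)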

Good RISC chips could probably get away with half this many instructions,
but you get the idea. If the 68040 and a RISC variant both have roughly
equivalent MIPS ratings at the same clock rate, the 68040 only needs one
instruction where the RISC chip could need 10 to do the same job. That
makes the 68040 at least 5 times faster than the RISC at the same clock
rate - with room to breathe. Not bad for a CISC design!
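
If the big move.l above looks like line noise, here is a rough Python
sketch of what that one instruction does. Memory is modelled as a plain
dict of longwords keyed by byte address, and the register and memory
contents are hypothetical values picked for the example. It ignores operand
sizes, sign extension and exactly which PC value the hardware uses.

```python
def move_l_indirect(mem, pc, label, d0, a4, d4, offset):
    """Toy model of move.l ([label.w,pc,d0.l*8]),([a4,d4.l*2],offset.w).

    mem is a dict mapping byte addresses to 32-bit values.
    """
    # Source: fetch a pointer from pc + label + d0*8, then read through it.
    src_pointer = mem[pc + label + d0 * 8]
    value = mem[src_pointer]

    # Destination: fetch a pointer from a4 + d4*2, add the outer offset,
    # and write the value there.
    dst_pointer = mem[a4 + d4 * 2]
    mem[dst_pointer + offset] = value
    return value


# Hypothetical memory contents, chosen only for illustration.
memory = {
    0x1038: 0x2000,  # pointer stored at pc + label + d0*8
    0x2000: 0xCAFE,  # the longword that gets moved
    0x3004: 0x4000,  # pointer stored at a4 + d4*2
}

moved = move_l_indirect(memory, pc=0x1000, label=0x20, d0=3,
                        a4=0x3000, d4=2, offset=0x10)
print(hex(moved), hex(memory[0x4010]))
```

That is two scaled index calculations, two pointer fetches, a read and a
write, all in one instruction.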

In fact, these statistics make the 68040 the most streamlined, efficient,
powerful and advanced processor in its class. Motorola claimed that it
struck at the heart of RISC with a single blow - and yet the new 68060 is 3.5 
times faster than the 68040 at the same clock rate...

The 68040 boasts one of the most extensive sets of addressing modes (as
shown above) available to any chip. The sick thing is that it doesn't
actually inhibit speed and still takes less time than performing the 
long-winded version using smaller instructions. This makes the chip more 
efficient even than the Pentium which has a much weaker set.

The actual throughput of the 68040 is an astounding 1.25 cycles per
instruction which compares favourably with 1.7 on the 486 (bearing in mind the
'weaker' set of the 486). This is also faster than the ARM (1.3) or SPARC (about
1.3). RISC-users might think 'hey - that's tame by our standards' but then it's
not a (R)educed (I)nstruction (S)et (C)omputer chip. It can easily manage 8
times as much in one instruction as RISC as it has a 'Complex', unrestrained
and even over-the-top instruction set.
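
For the curious, cycles-per-instruction converts to a MIPS rating by
dividing the clock rate by it. Here is a quick Python sketch using only the
figures quoted in this message, with every chip put at the same 32MHz clock
purely for comparison (an assumption, not a claim about shipping parts).

```python
def mips(clock_mhz, cycles_per_instruction):
    """Millions of instructions per second at a given clock and CPI."""
    return clock_mhz / cycles_per_instruction


# Cycles-per-instruction figures as quoted in the post.
CPI = {"68040": 1.25, "486": 1.7, "ARM": 1.3, "SPARC": 1.3}
CLOCK_MHZ = 32  # same clock assumed for every chip

ratings = {chip: mips(CLOCK_MHZ, cpi) for chip, cpi in CPI.items()}
for chip, rating in ratings.items():
    print(f"{chip}: {rating:.1f} MIPS at {CLOCK_MHZ}MHz")
```

The 68040 comes out at 25.6 MIPS, which lines up with the "about 26MIPS at
32MHz" figure given earlier.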

And anyway, it's internally double-clocked - a 32MHz 68040 is actually
ticking away at 64MHz.


I realise that the message quoted above was meant to keep things short
and sweet, but I'm sure people might balk at the thought of getting an
040 on that information. I hope by now that people will not look at the
68040 as a '68030 with a limited FPU built in' and will see it as it actually 
is - it could be the future of our machines!

Anyway, if you want any more statistics or other technical junk on the
Motorola chips / instruction sets then drop us a message!


Doug Little @ Black Scorpion Software - dlittle@nest.demon.co.uk

-!- JetMail 0.99beta9
 ! Origin: dlittle@nest.demon.co.uk [Black Scorpion] (NeST 90:90/0.2)


