We have intercepted a coded Imperial message from the Andersen system!
 KA> Hi Doug.
 KA> Some time ago you'd post a meessage about the 040 and it's spec's. Could
 KA> you please post it again ? (specially for me 8-) ).

No problem.

Here is the 68040 post - digitally remastered, with new special FX.

*-------------------------------------------------------------------*
| There now follows a complete rundown on EXACTLY WHY the 68040 is  |
| something we should all encourage. :)                             |
*-------------------------------------------------------------------*

The 68040 has a number of additional enhancements which make it a more
efficient chip than its predecessors, and every other processor in its
generation.

I will explain why....


The 68030 carries around 300,000 transistors on its die. The 68040 has
approximately 1,200,000. Admittedly much of this goes on the new cache and the
FP math unit, but this still leaves a large margin for use by the processor
itself.

The 68030 relies on a small (but significant) amount of MICROCODE or 'gunk' as 
it's referred to in the chip world. This means instructions are not efficiently 
streamlined and are not as fast as they could be. The microcode itself is like 
a small hardware 'program' that performs a number of smaller operations in 
order to simulate a more complex instruction. The 68040 contains no microcode - 
all of the instructions have been rebuilt and hard-wired for speed. Motorola's 
68040 was the first commercial CISC chip to use this approach, mainly because 
they applied intelligence and foresight where everybody else was just beefing 
up their clockrates.

The execution unit is now fully parallel with a multi-stage pipeline which
eliminates most of the hold-ups common to the 68030 architecture and, of
course, its forerunners. This means that different instructions are being
fetched, decoded & executed simultaneously in the six stages of the integer
pipeline. This is what gives the DSP56001 and other such RISC chips such a
high instruction throughput - but unlike the DSP, the 68040 pipeline is
dynamic. There are no ILLEGAL instruction combinations that will clog the pipe
and therefore require re-organisation of the program.

The floating point unit is NOT to be compared with the 68881/2 as it is WAY 
ahead of either of these chips as far as technology is concerned. The 68040 
FPU, although not as exhaustive as the 68882, is very much faster.

A typical example is that a 68882 floating-point multiply can take between 50
and 100 clock-cycles to complete (the exact numbers are listed elsewhere). The
68040 can manage it in just 4 cycles. The 68882 takes well over a hundred
cycles to perform a divide - the 68040 takes about 9. This is frightening when
compared with chips such as the 486, which can't even come close to this sort
of efficiency, and is the main reason a 68040 could well outstrip a Pentium if
operated at the same clock rate. Unfortunately, the 68040 is limited to lower
clock rates (around 32MHz) due to the double-clocked internal architecture. The
68060 later cured this problem.
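
To put those FPU cycle counts in perspective, here is a rough bit of
arithmetic in Python. It only uses the figures quoted above; the 68882 numbers
are ranges or lower bounds, so treat the ratios as ballpark values and not as
measured results. The function and constant names are made up for this sketch.

```python
# Cycle counts quoted in the post; the 68882 figures are ranges or lower
# bounds, so the resulting ratios are rough.
FMUL_68882 = (50, 100)   # "between 50 and 100 clock-cycles"
FMUL_68040 = 4
FDIV_68882_MIN = 100     # "well over a hundred", treated as a lower bound
FDIV_68040 = 9


def speedup(old_cycles, new_cycles):
    """How many times fewer clock cycles the newer unit needs."""
    return old_cycles / new_cycles


mul_low, mul_high = (speedup(c, FMUL_68040) for c in FMUL_68882)
div_min = speedup(FDIV_68882_MIN, FDIV_68040)

print(f"multiply: {mul_low:.1f}x to {mul_high:.1f}x fewer cycles")
print(f"divide:   at least {div_min:.1f}x fewer cycles")
```

So at the same clock rate the quoted figures work out to roughly 12-25 times
fewer cycles per multiply and at least 11 times fewer per divide.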

In ACTUAL usage, the 68040 leaves the Pentium standing as the Intel chip is so 
far still running old 486 code and the RISC part is not really a RISC
instruction set - it's just a parallel instruction launcher designed to
increase the normal 486 throughput. (i.e. a fudge).

The cache consists of 2 x 4k blocks instead of the 2 x 256byte blocks present 
on the 68030. This makes a HELL of a difference. The chip has a write-cache 
with controlled-write/snooping ability which means that data written to RAM can 
hang around inside the cache until a device (screen, blitter, DMA sound etc.) 
asks for it. Only then will it be written across the bus and suffer a
bottleneck. This means that bus-referencing is drastically reduced to the point 
where it only interferes with large block-transfers, leaving almost everything 
else significant completely inside the processor. Also, the chip snoops ALL 
busmasters, not just the caches (although the I/O ports are omitted from
snooping, but that's a computer engineering problem...). Motorola claim a 97% 
hit rate on the cache on the 68040 - it's a bit of a 'seasonally adjusted' 
figure but it is very high.

This would suit the Falcon fantastically as you could have a crippled bus and 
still get the ridiculous MIPS throughput to be expected of the 68040. In
addition to this, the write-cache & bus-snooping system means that 
self-modifying code is no longer a problem - compatibility with the 68000 is 
actually HIGHER on this front which can't be bad news.

You might wonder what happens when a large block transfer occurs - as it
inevitably will. Well, Motorola has covered this also with a complete
read/write burst access to & from the cache.

When the cache is enabled, any data required by the processor can be accessed 
quickly if it's in the cache already - at the cost of a tiny overhead. This is 
a cache 'hit' and is usually good news. If the data is not present however, the 
small overhead in addition to the time taken to finally get the data from the 
bus is higher than a straight memory-reference. This is a cache 'miss' and is 
bad news. What the 040 does (and what the 030 should do but doesn't) is grab a 
chunk of 16 bytes in one go and then cache-check the whole lot. This is more 
efficient than checking each of the smaller 4-byte chunks one at a time. This is
known as 'burst access' and the 040 is fantastic at it.
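
The effect of grabbing 16 bytes at a time can be sketched with a toy model in
Python. This is NOT how the 68040 cache is actually organised - it is an
assumed, simplified cache that never evicts anything and only counts how often
the bus has to be touched. The function name and the addresses are made up for
illustration.

```python
def count_line_fills(addresses, line_size):
    """Count cold misses for a toy cache that never evicts anything.

    Each miss fetches the whole aligned line containing the address,
    so later accesses to the same line are hits.
    """
    resident = set()
    fills = 0
    for addr in addresses:
        line = addr // line_size   # index of the aligned line
        if line not in resident:
            resident.add(line)
            fills += 1
    return fills


# 64 sequential longword (4-byte) reads starting at a 16-byte boundary.
reads = [0x1000 + 4 * i for i in range(64)]

fills_4 = count_line_fills(reads, 4)     # one bus access per longword
fills_16 = count_line_fills(reads, 16)   # one burst per 16-byte line

print(fills_4, fills_16)                 # 64 16
print(f"hit rate with 16-byte lines: {1 - fills_16 / len(reads):.0%}")  # 75%
```

Even on a completely cold cache, three out of every four sequential longword
reads become hits once a miss pulls in the whole 16-byte line.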

As an added note, the 486 has no write-cache and no bus-snooping and cannot 
perform burst-writes back to memory. Just for information's sake.

One of the most significant improvements on the 030 is the arithmetic unit. The 
030 takes at least 18 (nop) cycles to perform a multiply instruction, which 
equates to 36 actual cycles, whereas the 68040 takes just 1. This is MUCH 
faster than the 486, and is similar to the Falcon DSP. This is known as a 
single-cycle multiply, or a 'parallel multiplier array'.

There are new instructions including 'move16' which is complete overkill as it 
allows you to move chunks of 16 bytes around in one go. This is a rare and 
amusing addition.

Now the juicy bits...

When compared to the 68030, the 040 performs around 15 times faster AT THE SAME
CLOCK RATE. That is - 16MHz.

The optimisations made to the instruction units were based on statistics 
involving hundreds of millions of lines of code from real-world programs. Few 
processors can claim this, other than Motorola's other recent chips.

The 68040 offers a MIPS rating almost as high as its clock rate, about 26MIPS
at 32MHz - that's basically RISC throughput on a CISC architecture. And
Motorola are VERY proud of this, as they well should be.
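
For anyone who wants to check the sums, the MIPS figure converts to an average
cycles-per-instruction count like this. A minimal Python sketch using only the
two numbers quoted above; the function name is just for illustration.

```python
def cycles_per_instruction(clock_mhz, mips):
    """Average clock cycles spent per instruction."""
    return clock_mhz / mips


cpi = cycles_per_instruction(32, 26)
print(f"{cpi:.2f}")  # 1.23
```

That comes out at about 1.23 cycles per instruction, which is in line with the
1.25 figure quoted further down.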

A 68040 is easily capable of this sort of insanity:

    move.l  ([label.w,pc,d0.l*8]),([a4,d4.l*2],offset.w)

A bog standard RISC chip (i.e. Console or otherwise) would need to do something 
like this to achieve the same effect:

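    move    #label,a2       ; displacement of label from pc
    move    pc,d1
    add     d1,a2           ; a2 = address of label
    shl     #3,d0           ; d0 = d0*8
    add     d0,a2           ; a2 = label + d0*8
    move    (a2),a0         ; fetch the source pointer
    add     d4,d4           ; d4 = d4*2
    add     d4,a4
    move    (a4),a4         ; fetch the destination pointer
    move    #offset,d5
    add     d5,a4           ; add the outer displacement
    move    (a0),(a4)       ; copy the longword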

Good RISC chips can get away with half this many instructions, but you get the
idea. If the 68040 and a RISC variant both have roughly equivalent MIPS ratings
at the same clock rate, the 68040 only needs one instruction where the RISC
chip could need 5-12 to do the same job. That makes the 68040 at least 5 times
faster than the RISC at the same clock rate - with room to breathe. Not bad for
a CISC-oriented design!
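
If the double-indirect move above looks like line noise, here is roughly what
it does, modelled in Python. This is only a sketch: memory is a plain dict of
longwords, sign extension and operand sizes are ignored, and the function
names, addresses and values are all hypothetical ones picked for illustration.

```python
def read_long(mem, addr):
    """Read a 32-bit value from a dict that maps addresses to longwords."""
    return mem[addr]


def double_indirect_move(mem, label, d0, a4, d4, offset):
    """Model  move.l ([label.w,pc,d0.l*8]),([a4,d4.l*2],offset.w)."""
    # Source: index into the table at label, fetch a pointer, then
    # read the longword that pointer refers to.
    src_ptr = read_long(mem, label + d0 * 8)
    value = read_long(mem, src_ptr)
    # Destination: fetch a pointer from a4 + d4*2, then add the
    # outer displacement to form the final address.
    dst_ptr = read_long(mem, a4 + d4 * 2)
    mem[dst_ptr + offset] = value
    return dst_ptr + offset


# Hypothetical memory contents, chosen only for illustration.
mem = {
    0x2010: 0x3000,      # table entry at label + 2*8 -> source pointer
    0x3000: 0xCAFEBABE,  # the longword to be copied
    0x4004: 0x5000,      # pointer at a4 + 2*2 -> destination base
}
dest = double_indirect_move(mem, label=0x2000, d0=2, a4=0x4000, d4=2,
                            offset=0x20)
print(hex(dest), hex(mem[dest]))  # 0x5020 0xcafebabe
```

That is two scaled index calculations, two pointer fetches, one read and one
write, all packed into a single instruction.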

In fact, these statistics make the 68040 the most streamlined, efficient,
powerful and advanced processor in its generation. Motorola claimed that it
struck at the heart of RISC with a single blow - and yet the new 68060 is 3.5
times faster than the 68040 at the same clock rate...

The 68040/68060 chips are actually RISC engines with full CISC instruction 
sets, if this puts it in perspective.

The 68040 boasts one of the most extensive sets of addressing modes (as shown 
above) available to any chip. The sick thing is that it doesn't actually 
inhibit speed and still takes less time than performing the long-winded version 
using smaller instructions. This makes the chip more efficient even than the 
Pentium which has a much weaker set.

The actual throughput of the 68040 is an astounding 1.25 cycles per instruction
which compares favourably with 1.7 on the 486 (bearing in mind the 'weaker' set
of the 486). This is also faster than the ARM (1.3) or SPARC (about 1.3).
RISC-users might think 'hey - that's tame by our standards' but then it's not a
(R)educed (I)nstruction (S)et (C)omputer chip. It can easily manage 8 times as
much in one instruction as RISC, as it has a complete, unrestrained and even
over-the-top instruction set.



For exact values on instruction throughputs and statistics, drop Neil Stewart a 
message, as he investigated the subject a while back and may still have some of 
the notes.

Doug - dlittle@nest.demon.co.uk

