<HTML>
<! $Id: risc_doc.txt,v 1.16 1996/02/11 23:00:43 nat Exp $>
<HEAD>
<TITLE>GPU/DSP</TITLE>
</HEAD>
<BODY background="jaguar.gif">
<PRE>
# -------------------------------------------------------------------
# GPU/DSP                               (c) Copyright 1995 KKP & Nat!
# -------------------------------------------------------------------
# These are some of the results/guesses that Klaus and I (Nat!) found
# out about the Jaguar with a few helpful hints by other people, 
# who'd prefer to remain anonymous. 
#
# Since we are not under NDA or anything from Atari we feel free to 
# give this to you for educational purposes only.
#
# Please note, that this is not official documentation from Atari
# or derived work thereof (both of us have never seen the Atari docs)
# and Atari isn't connected with this in any way.
#
# Please use this informationphile as a starting point for your own
# exploration and not as a reference. If you find anything inaccurate,
# missing, needing more explanation etc. by all means please write
# to us:
#    nat@zumdick.rhein-main.de
# or
#    kkp@gamma.dou.dk
#
# If you could do us a small favor, don't use this information for
# those lame flamewars on r.g.v.a or the mailing list.
#
# HTML soon ?
# -------------------------------------------------------------------
# $Id: risc_doc.txt,v 1.16 1996/02/11 23:00:43 nat Exp $              
# -------------------------------------------------------------------

Some of this is cryptic, because I just incorporated third-party
knowledge. There's quite a bit I don't understand yet :) 
[nat/1996]
Please note the high bullshit content when it comes to the
description of the pipeline business, although Klaus added a new
theory which sounds pretty good. Now I just need to run some check
code....


1 RISCy Business
=-=-=-=-=-=-=-=-=

Each RISC has 2 register banks of 32 registers each: the Current
and the Alternative register bank. Register R31 is the stack
pointer, and R0 is normally initialized to 0 (zero).

The PC and the STATUS registers are mapped to memory addresses, and
modifiable by "the outside".

RW: G_FLAGS ($F02100)   GPU
~~~~~~~~~~~~~~~~~~~~~
RW: D_FLAGS ($F1A100)   DSP
~~~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------+---+----^------+-^--------+-+------+
1 |                unused               |aux| irq_clear | irq_enab |m|flags |
  +-------------------------------------+---+-----------+----------+-+------+

flags:
   bit 0:   zero
   bit 1:   carry
   bit 2:   negative

   These are the GPU status flags that are set on arithmetic and logical
   instructions.

mask (m):
   bit 3:   IMASK

   Interrupt mask. If set all interrupts are disabled.

irq_enable:
   bit 4:  IRQ 0 enable
   bit 5:  IRQ 1 enable
   bit 6:  IRQ 2 enable
   bit 7:  IRQ 3 enable
   bit 8:  IRQ 4 enable
   
   You can enable any of the 5 interrupts by setting the appropriate
   bit. (?)

irq_clear:
   bit 9:  IRQ 0 clear
   bit 10: IRQ 1 clear
   bit 11: IRQ 2 clear
   bit 12: IRQ 3 clear
   bit 13: IRQ 4 clear

   When you are through processing an interrupt, you probably have to
   clear the IRQ by clearing/setting the appropriate bit here. (?)

aux:
   bit 14:  register bank selection
   bit 15:  DMA

   Switching between the register banks is done like this:

      movei   #G_FLAGS,r1       ; Status flags (use #D_FLAGS on the DSP)
      load    (r1),r0
      bset    #14,r0
      store   r0,(r1)           ; Switch the GPU/DSP to bank 1

   Normally the GPU is running on Bank 1, since on an IRQ Bank 0
   becomes automatically active.

   bit 15 seems to control the way the GPU load/store instructions
   access memory. If set they run at DMA priority. If cleared ??
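
   To make the layout above a bit more tangible, here is a minimal Python
   sketch (not part of the original notes) that splits a value read from
   G_FLAGS/D_FLAGS into its fields. The function name decode_flags and the
   dictionary keys are made up for this example, and the bit positions are
   only the guesses listed above.

```python
def decode_flags(value):
    """Split a 32 bit G_FLAGS/D_FLAGS value into the fields listed above."""
    return {
        "zero":       bool(value & (1 << 0)),
        "carry":      bool(value & (1 << 1)),
        "negative":   bool(value & (1 << 2)),
        "imask":      bool(value & (1 << 3)),
        # bits 4..8: one enable bit per IRQ 0..4
        "irq_enable": [bool(value & (1 << (4 + n))) for n in range(5)],
        # bits 9..13: one clear bit per IRQ 0..4
        "irq_clear":  [bool(value & (1 << (9 + n))) for n in range(5)],
        "bank":       (value >> 14) & 1,
        "dma":        bool(value & (1 << 15)),
    }

# Hypothetical register value: bank 1 selected, IRQ 1 enabled, zero flag set
flags = decode_flags(0x4021)
```

   Handy when staring at register dumps from the 68K side.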


RW: G_MTXC ($F02104)    GPU
~~~~~~~~~~~~~~~~~~~~
RW: D_MTXC ($F1A104)    DSP
~~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------^--------^-----+--+--------+
1 |                          unused                             | t|  size  |
  +-------------------------------------------------------------+--+--------+

size:
   bits 0-3:   size as a binary number

   Size of one row of the matrix.

type  (t):
   bit 4:      row order

   Specify whether your matrix is Row Major (0) or Column Major (1).
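
   As a small illustration, the value to write into G_MTXC/D_MTXC could be
   put together like this in Python. The helper name mtxc_value is made up
   for this sketch, and it assumes that the size field really is a plain
   binary number in bits 0-3 as described above.

```python
def mtxc_value(size, column_major=False):
    """Build a matrix control value: size in bits 0-3, type in bit 4."""
    if not 0 <= size <= 15:
        raise ValueError("size must fit into 4 bits")
    return size | (int(column_major) << 4)

row_major_3 = mtxc_value(3)                      # 3 wide, row major
col_major_5 = mtxc_value(5, column_major=True)   # 5 wide, column major
```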


RW: G_MTXA ($F02108)    GPU      Metaxa ? 
~~~~~~~~~~~~~~~~~~~~
RW: D_MTXA ($F1A108)    DSP
~~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------^--------^--------^--------+
1 |                                  address                                |
  +-------------------------------------------------------------------------+
   
  Points to the matrix in memory.


RW: G_END ($F0210C)     GPU
~~~~~~~~~~~~~~~~~~~
RW: D_END ($F1A10C)     DSP
~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------+--------^--------^--------^--------+
1 |                                  value                                  |
  +-------------------------------------+-----------------------------------+

   Configure the endianness of the GPU/DSP with this register. How ??
   Well write a $00070007 here.
   Default value: $00070007


RW: G_PC ($F02110)    GPU      
~~~~~~~~~~~~~~~~~~
RW: D_PC ($F1A110)    DSP
~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------^--------^--------^--------+
1 |                                    pc                                   |
  +-------------------------------------------------------------------------+

pc:
   program counter of the RISC. I suspect that writing into this register 
   while the RISC is running is not the best idea...


RW: G_CTRL ($F02114)    GPU
~~~~~~~~~~~~~~~~~~~~
RW: D_CTRL ($F1A114)    DSP
~~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------+--+-----^---+--+-^--------+
1 |                      unused                  | h| irq_lat | d| control  |
  +----------------------------------------------+--+---------+--+----------+

control:
   bit 0:   start the GPU / run status
   bit 1:   allow GPU to interrupt the 68K (?)
   bit 2:   generate a GPU type 0 (??) interrupt 
   bit 3:   enable single step
   bit 4:   perform a single step

This register controls the GPU/DSP (?). 
You can start the RISC, stop it, put it in single step mode, or generate
a host interrupt ("your master is calling")!

Setting bit 0 starts the GPU. When reading this register this bit will
tell you whether the GPU is running or not. You can stop the GPU by 
clearing this bit. (You can't go wrong starting the GPU with setting 
bits 0 and 4 (keeping bit 3 cleared), but bit 0 is sufficient!) 

Perform singlestepping by setting bit #3 and then stepping through the
instructions by setting bit #4 for each step.

dma (d):
   bit 5:   set external DMA ACK (?)

irq_lat:
   bit 6:   IRQ 0 pending  VI-IRQ (VBLANK)
   bit 7:   IRQ 1 pending
   bit 8:   IRQ 2 pending 
   bit 9:   IRQ 3 pending
   bit 10:  IRQ 4 pending

   Clear or poll any pending interrupts with these bits. (?)

bus_hog (h) :
   bit 11:  hog mode on

   Allows the GPU to 'hog' the bus. When the GPU code uses a lot of 
   load/store instructions consecutively it could be that the OP does
   not get enough time to do its processing. Use with care.


Register R31 is used by the RISCs as the stack pointer. It only 
seems to be used by interrupts. See the section on interrupts.


RW: G_HIDATA ($F02118)  GPU
~~~~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------+--------^--------^--------^--------+
1 |                                 high_lword                              |
  +-------------------------------------+-----------------------------------+

high_lword:
   The rest of the phrase that doesn't fit into a GPU register, when 
   using the "loadp" or "storep" instructions. Possibly also used by the
   MAC instructions for the "hi" byte. (See D_MACHI)


RW: D_MOD ($F1A118)     DSP
~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------+--------^--------^--------^--------+
1 |                                    mask                                 |
  +-------------------------------------+-----------------------------------+

mask:
   Mask to be used by the ADDQMOD and SUBQMOD instructions. Create your own
   circular buffers...


W: G_DIVCTRL ($F0211C)  GPU
~~~~~~~~~~~~~~~~~~~~~~
W: D_DIVCTRL ($F1A11C)  DSP
~~~~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------^--------^--------^-----+--+
1 |                             unknown                                  |c |
  +----------------------------------------------------------------------+--+

   This register is write only
 
control (c)
   bit #0      division control

   If bit #0 is set, then the division operation will assume an unsigned (?)
   16.16 integer fractional representation for the divide. 
   Else you get a straight 32 bit unsigned integer divide 
   (like on the 68000 DIVU).


R: G_REMAIN ($F0211C)   GPU
~~~~~~~~~~~~~~~~~~~~~   
R: D_REMAIN ($F1A11C)   DSP
~~~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------+--------^--------^--------^--------+
1 |               unused                |               value               |
  +-------------------------------------+-----------------------------------+

   This register can be read only.
   Remainder of the division operation. Guess: only 16 bits wide.
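
   A rough Python model of the two divide modes, as a sketch only. It
   assumes that the 16.16 mode simply shifts the dividend left by 16 bits
   before dividing (the usual way to divide fixed point numbers), and it
   returns the plain mathematical remainder. Whether the hardware and the
   REMAIN register behave exactly like this is a guess. The name div_model
   is made up for this example.

```python
MASK32 = 0xFFFFFFFF

def div_model(dividend, divisor, fractional=False):
    """Return (quotient, remainder) as unsigned 32 bit values."""
    if divisor == 0:
        # What the hardware does with a zero divisor is not covered here.
        raise ZeroDivisionError("zero divisor not modelled")
    if fractional:
        dividend = dividend << 16      # assumed 16.16 behaviour
    quotient = (dividend // divisor) & MASK32
    remainder = (dividend % divisor) & MASK32
    return quotient, remainder

plain = div_model(7, 2)                          # integer divide
fixed = div_model(0x00030000, 0x00020000, True)  # 3.0 / 2.0 in 16.16
```

   In the 16.16 case the quotient $00018000 reads as 1.5.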


RW: D_MACHI ($F1A120)     DSP
~~~~~~~~~~~~~~~~~~~~~
 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------^--------^--------^--------+
1 |                          unused                       |      byte       |
  +-------------------------------------------------------------------------+

byte:
   high byte of MAC operations (??) 



############################################################################

Architecture:
=-=-=-=-=-=-=

Ingredients:

GPU/DSP: two load/store units
         one ALU
         one divide unit
         various control logic for branching etc.

The GPU and the DSP are both pipelined processors, employing a 
three stage forwarding pipeline. The pipeline is:  (???)

Stage 1:   Load   (LAS1/LAS2)
Stage 2:   Arithmetic and Logic Unit
Stage 3:   Store  (LAS1/LAS2)


Load and Store Unit (LAS)
=-=-=-=-=-=-=-=-=-=-=-=-=

The LAS units aren't called LAS just because they can load and store,
but because they can also load and store at the same time (to the
same register, that is). Writing a register back therefore still
retains the register value in the LAS, so the ALU can use it again.


Arithmetic and Logic Unit  (ALU)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

add, mult, shift: all 'atomic' instructions execute in one cycle



Registers
=-=-=-=-=

64 registers, each 32 bits wide, stored in two banks of r0..r31.
Interrupts always execute out of bank 0 (i.e. your code should
always execute in bank 1).



              I M P O R T A N T     N O T I C E 
******************************************************************
                  Don't take it as gospel (yet).
*******************************************************************
              I M P O R T A N T     N O T I C E 

From the description about the execution units, the pipeline 
should work this way:

Instruction :   OP s,d    ; eg ADD r1,r2

ALU:
           +------+
     S1 -->|      |
           |      |---> D
     S2 -->|      |
           +------+

When instruction is in stage 1, S1 and S2 in the ALU are loaded from s & d.
When instruction is in stage 2, the ALU function OP is executed, D is ready
When instruction is in stage 3, the d is loaded with D

Now let's examine how a normal instruction stream is executed:

pipe  inst regs  operations     scoreboard
----+-----+----+-------------+------------------
t=0
(3)   nop        nop
(2)   nop        nop
(1)   nop        nop
( )   add r0,r1
( )   add r2,r3
( )   add r4,r5

t=1
(-)   nop
(3)   nop        store nop
(2)   nop        alu nop
(1)   add r0,r1  S1=r0, S2=r1      | +r1
( )   add r2,r3
( )   add r4,r5

t=2
(-)   nop
(3)   nop        store nop         | -
(2)   add r0,r1  ALU:ADD S1,S2,D
(1)   add r2,r3  S1=r2, S2=r3      | r1 +r3
( )   add r4,r5
( )   add r6,r7

t=3
(-)   nop
(3)   add r0,r1  r1=D              | (-r1)
(2)   add r2,r3  ALU:ADD S1,S2,D
(1)   add r4,r5  S1=r4, S2=r5      | r3 +r5
( )   add r6,r7
( )   add r7,r9


t=4
(-)   nop
(-)   add r0,r1
(3)   add r2,r3  r3=D              | (-r3)
(2)   add r4,r5  ALU:ADD S1,S2,D
(1)   add r6,r7  S1=r6, S2=r7      | r5 +r7
( )   add r7,r9


t=5
(-)   nop
(-)   add r0,r1
(-)   add r2,r3
(3)   add r4,r5  r5=D              | (-r5)
(2)   add r6,r7  ALU:ADD S1,S2,D
(1)   add r7,r9  S1=r7, S2=r9      | r7 +r9 (STALL???)
( )   nop

t=6
(-)   nop
(-)   add r0,r1
(-)   add r2,r3
(-)   add r4,r5
(3)   add r6,r7  r7=D              | (-r7)
(2)   stall      ALU:NOP
(1)   add r7,r9  S1=r7, S2=r9      | r9
( )   div r0,r1

t=7
(-)   nop
(-)   add r0,r1
(-)   add r2,r3
(-)   add r4,r5
(-)   add r6,r7
(3)   stall      store nop         |
(2)   add r7,r9  ALU:ADD S1,S2,D
(1)   div r0,r1  S1=r0, S2=r1      | r9 +r1
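
If the theory above holds, the stall rule boils down to this: an
instruction cannot sit in the ALU stage earlier than two cycles after the
ALU cycle of the instruction that writes one of its operands. The Python
sketch below (the function name alu_cycles is made up here) replays the
adds from the table under exactly that assumption. The divide unit and the
LAS conflicts are ignored, so treat it as a toy and not as a model of the
real chip.

```python
def alu_cycles(program, first_alu_cycle=2):
    """program is a list of (source, dest) register names of 2 operand
    instructions such as add. Returns, for each instruction, the cycle
    in which it sits in stage 2 (the ALU)."""
    last_alu = {}            # register -> ALU cycle of its latest writer
    cycles = []
    t = first_alu_cycle - 1
    for src, dst in program:
        t += 1                                  # at best one instruction per cycle
        for reg in (src, dst):                  # both operands are read in stage 1
            if reg in last_alu:
                # stage 1 can read the register in the cycle it is stored,
                # which is one cycle after the writer's ALU cycle
                t = max(t, last_alu[reg] + 2)
        cycles.append(t)
        last_alu[dst] = t
    return cycles

trace = alu_cycles([("r0", "r1"), ("r2", "r3"), ("r4", "r5"),
                    ("r6", "r7"), ("r7", "r9")])
```

The result lists the t values at which each add is in stage (2). The last
add (r7,r9) lands one cycle late, which is the stall shown at t=6.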



Here are a few more complex examples: (Thanks, you know who!)

Ex 1:
   div r0,r1;     (r1 is not available now!)
   STALL STALL STALL*12
   add r1,r2;    (yay, we can use r1 again :-)

You could replace the STALLs with code that does not need to
access r1 and the division wouldn't slow you down more than
any other instruction. (Of course a second division is 
impossible while the DIV unit is already in use.)

Ex.2:
   nop
   nop
   nop                         (LS1)    (LS2)     (ALU)
   add r0,r1                  (load  r0, load r1,  nop)
   add r2,r3                  (load  r2, load r3,  add r0,r1)
   add r4,r5                  (store r1, load r4,  add r2,r3)
                              (load  r5,  nop   ,  STALL)
   add r6,r7                  (load  r6, load r7,  add r4,r5)
   add r8,r9                  (store r5, load r8,  add r6,r7)
                              (load  r9, nop    ,  STALL)
   add r0,r1                  (load  r0, load r1,  add r8,r9)
   nop                        (store r9,nop        add r0,r1)   
   nop                        (store r1,nop        nop)



1.0 Move instructions
=-=-=-=-=-=-=-=-=-=-=

       move    Rn,Rn
       move    PC,Rn
       movei   #xxxxxxxx,Rn

       load    (Rn),Rn
       load    (Rm+n),Rn    * Rm = R14 | R15 !
       load    (Rm+Ri),Rn   * Rm = R14 | R15 !
       loadb   (Rn),Rn      * load byte
       loadw   (Rn),Rn      * Load word
       loadp   (Rn),Rn      * Load Phrase (GPU only)
       
       store   Rn,(Rn)      
       store   Rn,(Rm+n)    * Rm = R14 | R15 !
       store   Rn,(Rm+Ri)   * Rm = R14 | R15 !
       storeb  Rn,(Rn)      * Store Byte
       storew  Rn,(Rn)      * Store Word
       storep  Rn,(Rn)      * Store Phrase (GPU only)
       
       moveta  Rn,Rn        * move to alternative register bank
       movefa  Rn,Rn        * move from alternative register bank


1.1 Logical Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=

       or      Rn,Rn
       xor     Rn,Rn
       and     Rn,Rn


1.2 Bitoperation Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

       bset    #,Rn
       bclr    #,Rn
       btst    #,Rn
       

1.3 Shift Instructions
=-=-=-=-=-=-=-=-=-=-=-=
       
       shlq    #xx,Rn
       shrq    #xx,Rn
       
       sharq   #xx,Rn
       
       ror     Rn,Rn 
       rorq    #xx,Rn
       
       
1.4 Arith. Instructions
=-=-=-=-=-=-=-=-=-=-=-=

       mult    Rn,Rn
       imult   Rn,Rn
       mmult   Rn,Rn
       imultn  Rn,Rn
       imacn   Rn,Rn
       resmac  Rn
       
       div     Rn,Rn          * exec seems to use max 4 i-cycles
       
       add     Rn,Rn
       addc    Rn,Rn          * add with carry
       addq    #xx,Rn
       addqt   #xx,Rn         * add quick, transparent (flags unaffected)
       addqmod #xx,Rn         * add quick, take modulo
       
       sub     Rn,Rn
       subc    Rn,Rn          * subtract with carry
       subq    #xx,Rn
       subqt   #xx,Rn         * sub quick, transparent (flags unaffected)
       subqmod #xx,Rn         * sub quick, take modulo
 
       cmp     Rn,Rn
       cmpq    #xx,Rn
 
       neg     Rn
       not     Rn
       abs     Rn
       
        
1.5 Program Structure Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

       jump     CC,(Rn)
       jump     (Rn)
       
       jr       CC,xxxxxx
       jr       xxxxxxx
       
       nop


1.6 Condition Codes
=-=-=-=-=-=-=-=-=-=

Condition codes CC can be any of 

   CC (%00100) CS (%01000) EQ (%00010) MI (%11000)
   NE (%00001) PL (%10100) HI (%00101)  T (%00000).

They are used together with the jump instructions...


2.0 Restrictions
=-=-=-=-=-=-=-=-=

  The following instruction pairs are not allowed:

  'JR+MOVEI', 'JUMP+MOVEI', 'JR+JR', 'JR+JUMP', 'JUMP+JR', 'JUMP+JUMP',
  'JR+MOVE PC', 'JUMP+MOVE PC' 

    IMULTN must be followed by an IMACN (Error displayed)
    IMACN must be followed by an IMACN or RESMAC (Error displayed)
    RESMAC must be preceded by an IMACN (Error displayed)
    A NOP is inserted between LOAD+MMULT and STORE+MMULT (Warning displayed).
    I don't know if LOADB+MMULT, LOADW+MMULT, LOADP+MMULT, ... are valid or
    not. Currently, it's not tested...


3.0 Instruction Encoding
=-=-=-=-=-=-=-=-=-=-=-=-=

Most instructions are only 2 bytes long. This means that 4 
instructions can be pulled from RAM in one memory access!! This also
makes the code extremely tight, which is of prime concern when 
writing cartridge based programs.
One instruction longer than 2 bytes is movei #x,Rn, which has the
32 bit constant right after the 2 byte instruction word. This saves a
lot of time and space compared to other RISCs. The ARM, for example,
uses 4 32 bit instructions to fill a register (8 bits at a time), the
SPARC 2 32 bit instructions.


3.2 Instruction Encoding
=-=-=-=-=-=-=-=-=-=-=-=-=

All instructions use the top 6 bits to encode the instruction.

The 2 operand instructions split the remainder of the 16 bits into
2 5 bit fields, the source (quick or register) and the destination
register.


3.2.1 The Implied Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

            iiiiii 0000000000 
              /\       /\
              ||       |_============== room for extensions
              ||
              \`======================= instruction

The only implied instruction is nop!



3.2.2 The 1 Operand Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

            iiiiii 00000 ddddd  <====== destination register
              /\     /\
              ||     |_================ room for extensions
              ||
              \`======================= instruction

The one operand instructions are:

             neg    R0
             not    R1
             abs    R2
             resmac R3



3.2.3 The 2 Operand Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Most instructions are 2 operand and follow this pattern. The register
to register instructions use the sssss and ddddd to specify source
and destination registers, as in add r1,r0. In the quick to register
instructions the sssss field is used to hold a constant, as in
shlq #3,r0 where the constant is between 1 and 32 and moveq #0,r2
where the constant is between 0 and 31.

            iiiiii sssss ddddd  <====== destination register
              /\     /\
              ||     |_================ source (quick or register)
              ||
              \`======================= instruction

Examples of 2 operand instructions are:

            move  R1,R2
            bset  #31,R2
            etc...
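
The packing described above can be sketched in a few lines of Python. The
helper name encode is made up for this example. The opcode numbers are the
6 bit values from the instruction table in section 3.3.

```python
def encode(opcode, src, dst):
    """Pack a 6 bit opcode and two 5 bit fields into one 16 bit word."""
    for name, value, limit in (("opcode", opcode, 64),
                               ("src", src, 32),
                               ("dst", dst, 32)):
        if not 0 <= value < limit:
            raise ValueError(name + " out of range")
    return (opcode << 10) | (src << 5) | dst

move_r1_r2 = encode(0b100010, 1, 2)    # move  R1,R2
bset_31_r2 = encode(0b001110, 31, 2)   # bset  #31,R2
```

For the quick instructions whose range is given as [32, 1..31] in the
table, a constant of 32 would go into the sssss field as 0.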


3.2.4 The movei Instruction
=-=-=-=-=-=-=-=-=-=-=-=-=-=

The movei instruction is very special! It is the only 6 byte
instruction, which is what makes it special.
The instruction word follows the general structure,

            iiiiii 00000 ddddd  <====== destination register
              /\     /\
              ||     |_================ room for extensions
              ||
              \`======================= instruction ($98)

but the 32 bit constant that is to be loaded into the destination
register follows the instruction word, lower word first:

           +-------------+ +------------+ +------------+
           |   Movei Rn  | | Lower word | | Upper word |
           +-------------+ +------------+ +------------+
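
As a sketch of the layout in the picture, this Python snippet builds the
three 16 bit words of a movei. The helper name encode_movei is made up for
this example, and the word order (lower word first) is taken from the
picture above and not verified any further.

```python
def encode_movei(constant, dst):
    """Return the three 16 bit words of movei #constant,Rdst."""
    opcode = 0b100110                      # $98 in the instruction table
    first = (opcode << 10) | dst           # sssss field stays 0
    low = constant & 0xFFFF
    high = (constant >> 16) & 0xFFFF
    return [first, low, high]              # lower word comes first

words = encode_movei(0x00F02100, 1)        # movei #G_FLAGS,r1
```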


3.2.5 The Load & Store Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The load and store instructions follow this pattern (for stores the
ddddd field holds the source register).

            iiiiii ppppp ddddd  <====== destination register
              /\     /\
              ||     |_================ indirect register
              ||
              \`======================= instruction


3.2.5.1 Addressing Modes For Load/Store Byte/Word/Phrase
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

All load and store instructions support register indirect addressing,
which is written (Rn).
This means that you can load the memory location pointed to by a 
register into yet another register (or the same).


3.2.5.2 Addressing Modes For Load/Store Longword
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Together with the Load/Store longword instructions, there are two
more addressing modes:

  * indexed register indirect addressing, which is written (Rn+Rm),
  * register indirect addressing w. offset, which is written (Rn+xx).

In these addressing modes Rn _has_ to be R14 or R15!

e.g.:        load  (r14+r2),r0
             store r0,(r15+16)
                    

3.2.5.3 Load/Store Phrase (GPU Only)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The GPU has a direct 64 bit (Phrase) interface to the main memory.
The loadp/storep instructions access the full width of this memory.
The lower part of the phrase pointed to by the (Rp) goes from/to the
register specified, the other part of the phrase is in G_HIDATA
( 0xF02118 )  /* GPU Bus Interface high data  */

e.g.:        storep r0,(rp)


3.2.6 The Program Control Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Most Program Control instructions follow this pattern:

            iiiiii ddddd ccccc  <====== Condition Vector
              /\     /\
              ||     |_================ source (quick or register)
              ||
              \`======================= instruction

The ddddd field can either specify an offset (jr instruction) or 
a register containing an absolute address (jump instruction). All
jump instructions are conditional.


3.2.6.1 Condition Codes
=-=-=-=-=-=-=-=-=-=-=-=

Condition codes ccccc can be any 5 bit vector, here are some predefined
useful values:

       CC (%00100)   CS (%01000)   EQ (%00010)  MI (%11000)
       NE (%00001)   PL (%10100)   HI (%00101)  T  (%00000)

Examples of Program Control instructions:

            jump  mi, (r5)
            jr    ne, exit
            jr    t, loop   ; loop forever
            jr    loop      ; loop forever            
            jump  (r5)
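
The eight values above suggest how the 5 bit vector might be put together:
bit 0 wants the zero flag clear, bit 1 wants it set, bit 2 wants a second
flag clear, bit 3 wants it set, and bit 4 picks negative instead of carry
as that second flag. This reading is only inferred from the listed values,
it is an assumption and not something we have checked. A Python sketch of
it (the name condition_met is made up):

```python
def condition_met(cc, zero, carry, negative):
    """Evaluate a 5 bit condition vector against the three status flags."""
    flag = negative if cc & 0b10000 else carry   # bit 4 selects N instead of C
    if cc & 0b00001 and zero:          # bit 0: zero flag must be clear
        return False
    if cc & 0b00010 and not zero:      # bit 1: zero flag must be set
        return False
    if cc & 0b00100 and flag:          # bit 2: selected flag must be clear
        return False
    if cc & 0b01000 and not flag:      # bit 3: selected flag must be set
        return False
    return True

HI = 0b00101
hi_taken = condition_met(HI, zero=False, carry=False, negative=False)
```

With this reading T (%00000) has no requirements at all, so it is always
taken, and HI is simply NE and CC combined.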
            
            
3.2.7 Modulo Arithmetic (DSP only)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The instructions addqmod and subqmod are modular, with the size
specified in D_MOD (0xF1A118)  /* DSP Modulo Instruction Mask */.
The mask register contains a mask that is applied to the register
after the add operation, as in the following two step sequence:
                    movei     #%111111,r1
              loop: addq      #4,r0
                    and       r1,r0
                    ...
                    jr        loop
                    
With the modulo register this can be written:

                    movei     #D_MOD,r3
                    movei     #~%111111,r1
                    store     r1,(r3)     ;; possibly need a (?) nop here
                    nop                   ;; because the D_MOD isn't
              loop: addqmod   #4,r0       ;; in the scoreboard (?)
                    ...
                    jr        loop                 

This is an obvious win! - you save a cycle each loop!

Instructions are 
                   subqmod, addqmod
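
A small Python model of what addqmod might do, as a sketch only. It
assumes that the bits set in D_MOD are the ones left untouched by the add,
which is what the ~ in the example above suggests, so the cleared bits
wrap around on their own. The name addqmod_model is made up.

```python
MASK32 = 0xFFFFFFFF

def addqmod_model(reg, q, d_mod):
    """Add q to reg, but keep every bit that is set in d_mod unchanged."""
    total = (reg + q) & MASK32
    return (reg & d_mod) | (total & ~d_mod & MASK32)

D_MOD_VALUE = ~0b111111 & MASK32                        # as in the example above
wrapped = addqmod_model(60, 4, D_MOD_VALUE)             # 60 + 4 wraps to 0
based = addqmod_model(0x1000 + 60, 4, D_MOD_VALUE)      # upper bits survive
```

Note one difference from the two step version: the plain "and" throws the
upper bits of r0 away, while this model keeps them, so a buffer would not
have to start at address 0.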
                   

3.3 Instruction numbers
=-=-=-=-=-=-=-=-=-=-=-=

   Mnemonic  Mode     iiiiii sssss ddddd  hex  Notes
   --------------------------------------------------------------
     ADD     Rs,Rd    000000 sssss ddddd  $00
     ADDC    Rs,Rd    000001 sssss ddddd  $04
     ADDQ    #q,Rd    000010 qqqqq ddddd  $08  q is [32, 1..31]
     ADDQT   #q,Rd    000011 qqqqq ddddd  $0C  q is [32, 1..31]

     SUB     Rs,Rd    000100 sssss ddddd  $10
     SUBC    Rs,Rd    000101 sssss ddddd  $14
     SUBQ    #q,Rd    000110 qqqqq ddddd  $18  q is [32, 1..31]
     SUBQT   #q,Rd    000111 qqqqq ddddd  $1C  q is [32, 1..31]

     NEG     Rd       001000 00000 ddddd  $20

     AND     Rs,Rd    001001 sssss ddddd  $24
     OR      Rs,Rd    001010 sssss ddddd  $28
     XOR     Rs,Rd    001011 sssss ddddd  $2C

     NOT     Rd       001100 00000 ddddd  $30

     BTST    #q,Rd    001101 qqqqq ddddd  $34  q is [0..31]
     BSET    #q,Rd    001110 qqqqq ddddd  $38  q is [0..31]
     BCLR    #q,Rd    001111 qqqqq ddddd  $3C  q is [0..31]

     MULT    Rs,Rd    010000 sssss ddddd  $40
     IMULT   Rs,Rd    010001 sssss ddddd  $44
     IMULTN  Rs,Rd    010010 sssss ddddd  $48
     RESMAC  Rd       010011 00000 ddddd  $4C
     IMACN   Rs,Rd    010100 sssss ddddd  $50

     DIV     Rs,Rd    010101 sssss ddddd  $54

     ABS     Rd       010110 00000 ddddd  $58
                                          $5C
     SHLQ    #q,Rd    011000 qqqqq ddddd  $60  q is [32, 1..31]
     SHRQ    #q,Rd    011001 qqqqq ddddd  $64  q is [32, 1..31]
                                          $68 
     SHARQ   #q,Rd    011011 qqqqq ddddd  $6C  q is [32, 1..31]
     ROR     Rs,Rd    011100 sssss ddddd  $70  
     RORQ    #q,Rd    011101 qqqqq ddddd  $74  q is [32, 1..31]

     CMP     Rs,Rd    011110 sssss ddddd  $78
     CMPQ    #q,Rd    011111 qqqqq ddddd  $7C  q is [0..31]

DSP  SUBQMOD #q,Rd    100000 qqqqq ddddd  $80  q is [32, 1..31]
                                          $84
     MOVE    Rs,Rd    100010 sssss ddddd  $88
     MOVEQ   #q,Rd    100011 qqqqq ddddd  $8C  q is [0..31]    
     MOVETA  Rs,Rd    100100 sssss ddddd  $90
     MOVEFA  Rs,Rd    100101 sssss ddddd  $94
     MOVEI   #c32,Rd  100110 00000 ddddd  $98  followed by a 32 bit const
     
     LOADB   (Rp),Rd  100111 ppppp ddddd  $9C
     LOADW   (Rp),Rd  101000 ppppp ddddd  $A0
     LOAD    (Rp),Rd  101001 ppppp ddddd  $A4
GPU  LOADP   (Rp),Rd  101010 ppppp ddddd  $A8  Load Phrase
     LOAD  (R14+n),Rd 101011 nnnnn ddddd  $AC
     LOAD  (R15+n),Rd 101100 nnnnn ddddd  $B0
     
     STOREB  Rs,(Rp)  101101 ppppp sssss  $B4
     STOREW  Rs,(Rp)  101110 ppppp sssss  $B8
     STORE   Rs,(Rp)  101111 ppppp sssss  $BC
GPU  STOREP  Rs,(Rp)  110000 ppppp sssss  $C0  Store Phrase
     STORE Rs,(R14+n) 110001 nnnnn sssss  $C4
     STORE Rs,(R15+n) 110010 nnnnn sssss  $C8

     MOVE    PC,Rn    110011 00000 ddddd  $CC

     JUMP    CC,(Rd)  110100 ddddd ccccc  $D0
     JR      CC,q     110101 qqqqq ccccc  $D4  
 
     MMULT   Rs,Rd    110110 sssss ddddd  $D8
                                          $DC
                                          $E0
     NOP              111001 00000 00000  $E4

     LOAD (R14+Ri),Rd 111010 iiiii ddddd  $E8
     LOAD (R15+Ri),Rd 111011 iiiii ddddd  $EC
    STORE Rs,(R14+Ri) 111100 iiiii sssss  $F0
    STORE Rs,(R15+Ri) 111101 iiiii sssss  $F4
                                          $F8    
DSP  ADDQMOD #q,Rd    111111 qqqqq ddddd  $FC  q is [32, 1..31]


3.4   Move instructions
=-=-=-=-=-=-=-=-=-=-=-=

None of the move instructions affect the status flags of the GPU,
except when moving data into the status register itself.


3.5   Arithmetic instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

ABS      sets the carry flag if a negative value was transformed to a 
         positive else clears it

ADDQT    does not affect the status flags

DIV      takes 16 cycles to execute. Supposedly you can do either
         16.16 integer, fractional division or 32 bit integer
         division. The DIV does a 2 bit divide each cycle, hence 16
         cycles total for 32 bit. The remainder of the division is 
         saved in a special register.

IMACN    has no register write back (so it is easier to optimize).
         If you get in the habit of using it, you will find that you
         can normally structure your code to run a bit faster.


3.6   Logical instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=

SHLQ  affects the status flags
SHRQ  affects the status flags


4.0   Matrix multiplication
=-=-=-=-=-=-=-=-=-=-=-=-=-=

   MMULT    starting_register,destination_register
         
You have to set up the matrix control register and the matrix address
register before executing the MMULT instruction. The MMULT instruction
will then multiply the values contained in the interrupt (alternate ?) 
register bank (0) starting at register "starting_register" with the 
matrix pointed to by the matrix address register.

The values used for the multiplication are 16 bit values (sign extended ?)
and are arranged in this slightly peculiar fashion (for a 5x1 Matrix):


               Register bank #0  (RB)
         +-------------+------------+
         |      1      |      0     |  Rn
         +-------------+------------+
         |      3      |      2     |  Rn+1
         +-------------+------------+
         |    unused   |      4     |  Rn+2
         +-------------+------------+


These values are then multiplied with a 32 bit table in memory
and the results of these added together:


                  Main memory    (MM)
$020000  +--------------------------+
         |             0            | $020003
$020004  +--------------------------+
         |             1            | $020007
$020008  +--------------------------+
         |             2            | $02000B
$02000C  +--------------------------+
         |             3            | $02000F
$020010  +--------------------------+
         |             4            | $020013
         +--------------------------+


   result = MM0 * RB0 + MM1 * RB1 + MM2 * RB2 + MM3 * RB3 + MM4 * RB4;

The 32 bit result is stored into "destination_register" in the current 
bank (or bank 1?).
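
Here is the sum above as a Python sketch. It assumes that the register
halves and the memory entries are sign extended 16 bit values (only the
low 16 bits of each 32 bit memory entry are used) and that the low half of
a register holds the even numbered element, as in the picture. All
function names are made up for this example and none of this has been
checked against the hardware.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def unpack_registers(registers, count):
    """Return `count` signed 16 bit values, low half of each register first."""
    values = []
    for reg in registers:
        values.append(to_signed16(reg))          # element 0, 2, 4, ...
        values.append(to_signed16(reg >> 16))    # element 1, 3, 5, ...
    return values[:count]

def mmult_model(registers, memory, count):
    """result = MM0 * RB0 + MM1 * RB1 + ... as an unsigned 32 bit value."""
    rb = unpack_registers(registers, count)
    mm = [to_signed16(word) for word in memory[:count]]
    return sum(m * r for m, r in zip(mm, rb)) & 0xFFFFFFFF

# Rn, Rn+1, Rn+2 holding the elements 1, 2, 3, 4, 5 (upper half of Rn+2 unused)
bank = [0x00020001, 0x00040003, 0x00000005]
result = mmult_model(bank, [10, 20, 30, 40, 50], 5)
```

For the 5x1 example this gives 1*10 + 2*20 + 3*30 + 4*40 + 5*50.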


Brief internal description of MMULT: it strips out program instruction
fetches and forces instructions straight into the pipeline, getting a
throughput of one (16 bit) multiply per tick.. (25 million per second :-)

Supposedly the MMULT is performed by inserting generated instructions 
into the instruction stream.  Supposedly for a MMULT the instructions
inserted are a leading IMULTN, the middle ones IMACN, and finally a 
RESMAC. 

[[ 
   ???  These have their operands modified in the manner described above. ???
   i.e. that funky packed thingy, two elements per register, that
   allows all of an eight by eight matrix to be stored in the secondary
   register bank and is the "raison d'e^tre" of the second bank.. (woosh)
]]

     
5.0 Interrupts
=-=-=-=-=-=-=-=

The GPU and the DSP use an interrupt scheme that looks a lot like
the 56000's way of handling interrupts.

The interrupt entry points are in the lowest part of each processor's
memory. There are 16 bytes for each interrupt. This should
be enough to jump into the real interrupt handler.

   ( If this works like on the 56000 it should be possible 
     to have Fast Interrupts, where the CPU returns automatically
     when the 16 bytes have been executed and no jump 
     instructions have been executed ).

For the DSP it looks like this:

000000        Reset          (or DSP control interrupt)
000010        I2S Interrupt


Enable I2S interrupts:

   movei #D_FLAGS,r1    ; get dsp flags ptr
   load  (r1),r0
   bset  #5,r0       ; enable I2S interrupt
   store r0,(r1)     ; save dsp flags


Handle I2S interrupts:     [ NOTE: this code has been deobfuscated 
                                   and worsened since v1.11 ]
   .org  $10
   movei #i2s_isr,r30
   jump  T,(r30)
   nop                  ; pad to 8 words total
   nop
   nop
   nop

;; actual service routine
i2s_isr:
   movei #D_FLAGS,r30   ; get flags ptr
   load  (r30),r12      ; yup
   bclr  #3,r12         ; clear IMASK
   bset  #10,r12        ; clear I2S interrupt

   load  (r31),r28      ; get last instruction address
   addq  #4,r31         ; update the stack pointer

   addq  #2,r28         ; point at next to be executed

   ...                  ; do some stuff

   store r12,(r30)      ; restore flags   ;; smoother code if put
   jump  T,(r28)        ; and return      ;; after the JUMP!
   nop                  ;                 ;; using the pipeline   



BUGS:
=-=-=

There are also apparently some bugs in the GPU/DSP that you should be 
aware of:

1) INDEXED STORES NEVER STALL
  e.g
     div r0,r3
     store r3,(r14+6)

   should be

     div r0,r3
     or r3,r3
     store r3,(r14+6)

   Here the OR is used to 'touch' the register for the scoreboard. If you 
   didn't touch the r3 register you would most likely (but not always,
   think of those IRQs!) write the old value of r3 back.


2) TWO CONSECUTIVE WRITES TO THE SAME REGISTER MIGHT BE PROBLEMATIC

   Although writing code like this is a bug anyway, you should be aware
   that if you write to the same register with no intervening read, and
   the second instruction finishes first, garbage will result:

      load  (r3),r2
      moveq #3,r2

   should be
      load  (r3),r2
      or    r2,r2
      moveq #3,r2


3)  NEITHER THE DSP NOR THE GPU CAN EXECUTE 'jr' OR 'jump' FROM EXTERNAL
    RAM

4)  NEITHER THE DSP NOR THE GPU MAY BE USED IN HIGH PRIORITY 

5)  A mmult INSTRUCTION MUST NEVER BE INTERRUPTED
    how very convenient...

6)  THE DSP (ONLY) MUST NOT DO AN EXTERNAL WRITE UNLESS PRECEDED BY AN 
    EXTERNAL READ THAT COMPLETES BEFORE THE WRITE STARTS.

    Rumor has it that this bug shows up only sporadically and can remain
    undetected for quite some time.
    Hint: for external I/O use the Blitter (as always :))

   e.g.
   A:
      load  (r1),r2
      or    r10,r11
      store r11,(r3)

   B:
     load   (r1),r2
     or     r2,r11
     store  r11,(r3)

   C:
     load   (r1),r2
     or     r2,r2
     or     r10,r11
     store  r11,(r3)

   [A] will not work but [B] will, this is because the result of the load 
   is required for the 'or' operation to be performed. To make [A] work, 
   change it to [C]...
</PRE>
<HR>
<address><a href="mailto:nat@zumdick.rhein-main.de">Nat! (nat@zumdick.rhein-main.de)</a></address>
<address><a href="mailto:kkp@gamma.dou.dk">Klaus (kkp@gamma.dou.dk)</a></address>
<P>
</BODY>
</HTML>


