Forth UM/MOD for 6502 and 65C816/802

Up

2018, 2021-02-11
2022-10-05 Added: TOC, 816 variants
2023-05-05 Fixed 816 JK Forth: cycles and bytes comments corrected.
J. E. Klasek <j+forth klasek at>

Contents:

6502
65C816/802
1. 816 JK Forth
2. Garth Wilson Forth

6502

FIG Forth

F.I.G. Implementation FIG65ROM.AS

Wrong implementation: Bit 17 isn't handled after ROL 5,X (2nd line after L433).
Even with zero-page indexed addressing mode accessing arguments it is faster than transfering it to a zero-page scratch area (N area) which would take more overall time.

;                                       U/
;                                       SCREEN 24 LINE 1
;
L418      .BYTE $82,'U',$AF
          .WORD L386     ; link to U*
USLAS     .WORD *+2
          LDA 4,X
          LDY 2,X
          STY 4,X
          ASL A
          STA 2,X
          LDA 5,X
          LDY 3,X
          STY 5,X
          ROL A
          STA 3,X
          LDA #16
          STA N
L433      ROL 4,X
          ROL 5,X
          SEC
          LDA 4,X
          SBC 0,X
          TAY
          LDA 5,X
          SBC 1,X
          BCC L444
          STY 4,X
          STA 5,X
L444      ROL 2,X
          ROL 3,X
          DEC N
          BNE L433
          JMP POP

V-Forth

V-Forth 3.7 fix and fix in official release 3.8.
V-Forth is a FIG Forth derivate for the VIC-20 from Simon J. Rowe <srowe mose org uk>
Source: Git-Repo UM/MOD

Handles 17th bit in separate case, already very efficient, but release 3.7 had a false assumption, that the carry flag is always set, but is not in all cases.
Even with zero-page indexed addressing mode accessing arguments it is faster than transfering it to a zero-page scratch area (N area) which would take more overall time.

V-FORTH 3.8

CODE  UM/MOD     ( 31 BIT DIVIDEND-2, -3,  16 BIT DIVISOR-1  *)
                 ( 16 BIT REMAINDER-2,  16 BIT QUOTIENT-1    *)
  SEC 2 + LDA,  SEC     LDY,  SEC 2 + STY,  .A ASL,  SEC    STA,
  SEC 3 + LDA,  SEC 1+  LDY,  SEC 3 + STY,  .A ROL,  SEC 1+ STA,
  10 # LDA,  N STA,
  BEGIN,  SEC 2 + ROL,  SEC 3 + ROL,  SEC 2+ LDA,
          CS  NOT  IF,
             SEC,  BOT SBC,  TAY,
             SEC 3 + LDA,  BOT 1+ SBC,
             CS IF,  SEC 2+ STY,  BEGIN,  SEC 3 + STA,  SWAP  THEN,
          SEC ROL,  SEC 1+ ROL,
          N DEC,  ROT  0=
      UNTIL,    POP  JMP,
      SWAP  THEN,  BOT SBC,  SEC 2+ STA,  SEC 3 + LDA,  BOT 1+ SBC,  SEC,
      CS  NOT  UNTIL,
END-CODE

Fixed V-FORTH 3.7

 259 CODE  UM/MOD     ( 31 BIT DIVIDEND-2, -3,  16 BIT DIVISOR-1  *)
 260                  ( 16 BIT REMAINDER-2,  16 BIT QUOTIENT-1    *)
 261   SEC 2 + LDA,  SEC     LDY,  SEC 2 + STY,  .A ASL,  SEC    STA,
 262   SEC 3 + LDA,  SEC 1+  LDY,  SEC 3 + STY,  .A ROL,  SEC 1+ STA,
 263   10 # LDA,  N STA,
 264   BEGIN,  SEC 2 + ROL,  SEC 3 + ROL,  SEC 2+ LDA,
 265           CS  NOT  IF,
 266              SEC,  BOT SBC,  TAY,
 267              SEC 3 + LDA,  BOT 1+ SBC,
 268              CS IF,  SEC 2+ STY,  LABEL L442  SEC 3 + STA,  THEN,
 269           SEC ROL,  SEC 1+ ROL,
 270           N DEC,  SWAP  0=
 271       UNTIL,    POP  JMP,
 272       THEN,  BOT SBC,  SEC 2+ STA,  SEC 3 + LDA,  BOT 1+ SBC,
 273       L442 JMP,
 274 END-CODE

Correction:

 272       THEN,  BOT SBC,  SEC 2+ STA,  SEC 3 + LDA,  BOT 1+ SBC, SEC,

In case the 17th bit is set (checked in line 265) the subtraction is forced but even with 17th bit is set, the resulting carry flag is not alway set (because the 16 bit from which is subtracted could be less then the divisor). In such case the carry must be set which is ROLed into to quotient.

C64 Micro Prod. Forth

Even with zero-page indexed addressing mode accessing arguments it is faster than transfering it to a zero-page scratch area (N area) which would take more overall time.

More efficient implementation

A re-worked version of the orignal below. Saves 2 temporary byte in zero-page, handling the 17th bit in a separate case.

Standard assembly source:

;  UM/MOD 2018-04-14 J KLASEK AT )

	* = $1000

	TBITCOUNT = $74
	POP = $84

; In:
;   5,X   4,X      3,X    2,X      1,X    0,X
; +-------------+---------------+--------------+
; | Dividend L  |  Dividend H   |   Divisor    |
; +-------------+---------------+--------------+
; Out:
;   3,X   2,X      1,X    0,X
; +-------------+---------------+
; |  Remainder  |   Quotient    |
; +-------------+---------------+

UMMOD:
	LDA 4,X		; Swap dividend high and low
	LDY 2,X
	STY 4,X
	ASL		; shift dividend low
	STA 2,X
	LDA 5,X
	LDY 3,X
	STY 5,X		; dividend low in 4,X/5,X
	ROL		; shift dividend low
	STA 3,X		; dividend low in 2,X/3,X shifted left
	LDA #16
	STA TBITCOUNT
LOOP:
	ROL 4,X		; shift in bit from dividend low
	ROL 5,X
	BCC +		

	LDA 4,X		; 17th bit set, dividend is greater
	SBC 0,X		; so just subtract divisor
	STA 4,X
	LDA 5,X
	SBC 1,X		; subtraction might finish with carry cleared!
	SEC		; always set carry for quotient
	BCS ++		; save A later ...

+	SEC		; fits divisor into dividend high?
	LDA 4,X
	SBC 0,X
	TAY		; subtraction result temporarily put into Y/A
	LDA 5,X
	SBC 1,X
	BCC +++		; does not fit

	STY 4,X		; keep subtraction result 
++	STA 5,X		; entry from 17th bit case
+++	ROL 2,X		; take quotient bit from carry
	ROL 3,X		; and shift out dividend low
	DEC TBITCOUNT	; loop over 16 dividend bits
	BNE LOOP
	JMP POP

Forth assembly source:

( U/ FIXED-V2 2018-04-14 J KLASEK AT )
ASSEMBLER HEX
74 CONSTANT TBITCOUNT
CODE UM/MOD
4 ,X LDA, 2 ,X LDY, 4 ,X STY,
.A ASL, 2 ,X STA, ( SWAP LO/HI )
5 ,X LDA, 3 ,X LDY, 5 ,X STY, 
.A ROL, 3 ,X STA, ( AND SHIFT )
10 # LDA, TBITCOUNT STA,
BEGIN,
  4 ,X ROL, 5 ,X ROL, CS IF,
    4 ,X LDA, 0 ,X SBC, 4 ,X STA,
    5 ,X LDA, 1 ,X SBC, SEC,
    HERE 1+ DUP BCS, ROT ROT ( + LABEL)
  THEN,
  SEC, 4 ,X LDA, 0 ,X SBC, TAY,
  5 ,X LDA, 1 ,X SBC, >= IF, ( C = 1 )
     4 ,X STY, ( + )
     ROT DUP HERE SWAP - 1- SWAP C!
     5 ,X STA,
  THEN,
  2 ,X ROL, 3 ,X ROL,
  TBITCOUNT DEC,
0= UNTIL,
POP JMP,

Original

This version isn't wrong but just inefficient by using a 3rd byte just to cover the 17th bit handling. This results in rotating and subtracting 24 bits instead of 16 bits ...

( U/ ORIGINAL 2018-04-13 )
ASSEMBLER HEX
74 CONSTANT TBITCOUNT
75 CONSTANT TMSB   76 CONSTANT TMP
CODE U/ORIG
TMSB STY, ( ZERO )
4 ,X LDA, 2 ,X LDY, 4 ,X STY,
.A ASL, 2 ,X STA,  ( SWAP LO/HI)
5 ,X LDA, 3 ,X LDY, 5 ,X STY, 
.A ROL, 3 ,X STA,  ( AND SHIFT )
10 # LDA, TBITCOUNT STA,
BEGIN,
  4 ,X ROL, 5 ,X ROL, TMSB ROL,
  SEC, 4 ,X LDA, 0 ,X SBC, TAY,
  5 ,X LDA, 1 ,X SBC, TMP STA,
  TMSB LDA, 0 # SBC, TMP LDA,
  >= IF,
     4 ,X STY, 5 ,X STA,
  THEN,
  2 ,X ROL, 3 ,X ROL,
  TBITCOUNT DEC,
0= UNTIL,
POP JMP,

VolksForth

Source: Volks-forth CBM core on Github

This version is more more space-efficient because the separate case for a set bit 17 is munged together. The trick is that the 17th bit is kept in location N 6 +. Just in case the divisor does not fit into the dividend high the 17th bit is considered and if set, the subtraction result is taken either.
Because the are 17 rounds the dividend high, containing the remainder has to be shifted back by one bit. If the bit shifted out is set this indicates an overflow condition.
The parameter stack implemention (zero-page indirect indexed) here needs to store the values into the zero-page working area (N) because accessing the stack elements is not easily possible (in comparison to zero-page indexed).

Code um/mod  ( ud u -- urem uquot)
 SP X) lda  N 5 + sta                      ( take parameters from stack )
 SP )Y lda  N 4 + sta   SP 2inc            ( divisor )
 SP X) lda  N 1+  sta
 SP )Y lda  N     sta   iny                ( dividend high word )
 SP )Y lda  N 3 + sta   iny
 SP )Y lda  N 2+  sta   $11 # ldx  clc     ( dividend low word )
  [[ N 6 + ror  sec  N 1+ lda  N 5 + sbc   ( save 17th bit )
     tay  N lda  N 4 + sbc		   ( uubtract divisor from dividend high, result in Y/A )
    CS not ?[  N 6 + rol ]?                ( if divisor does not fit, is 17th bit set? )
      CS ?[ N sta  N 1+ sty ]?             ( store new dividend high )
     N 3 + rol N 2+ rol N 1+ rol N rol     ( shift dividend low into dividend high )
  dex  0= ?]                               ( all rounds finished? )
 1 # ldy  N ror  N 1+ ror                  ( fix remainder, if 17th bit set, there is )
   CS ?[ ;c: divovl ; Assembler ]?         ( an overflow! )
 N 2+  lda SP )Y sta iny                   ( store back on stack )
 N 1+  lda SP )Y sta iny
 N     lda SP )Y sta 1 # ldy
 N 3 + lda
 Puta jmp    end-code

Tali Forth

Source: Tali Forth on Github

For 65C02 (using STZ, BRA).
Also an implementation with an extra byte to keep the 17th bit.
Even with zero-page indexed addressing mode accessing arguments it is faster than transfering it to a zero-page scratch area (N area) which would take more overall time.
Similar to the implemention presented by Garth Wilson, Corrected Forth UM/MOD primitive.

; ----------------------------------------------------------------------------
; UM/MOD ( ud u1 -- u2 u3 ) ("UM/MOD")  32/16 -> 16
; Divide double cell number by single number and return the quotient u3 as
; the TOS and remainder as NOS. All numbers are unsigned. This is the basic
; division operation the others use. Based on FIG Forth code, modified by
; Garth Wilson, see http://6502.org/source/integers/ummodfix/ummodfix.htm
l_ummod:        bra a_ummod
                .byte $06
                .word l_udmod    ; link to UDMOD
                .word z_ummod
                .byte "UM/MOD"
.scope
a_ummod:        ; TODO see if we have enough stuff on the stack

                ; prevent division by zero. We currently do not check for
                ; overflow
                lda 1,x
                ora 2,x
                bne _notzero

                lda #$09        ; "Division by zero" string
                jmp error

_notzero:       ; we loop 17 times
                lda #$11
                sta TMPCNT

_loop:          ; rotate low cell of dividend one bit left (LSB)
                rol 5,x
                rol 6,x

                ; loop control
                dec TMPCNT
                beq _done

                ; rotate hi cell of dividend one bit left (MSB)
                rol 3,x
                rol 4,x

                stz TMPADR      ; store the bit we got from hi cell MSB
                rol TMPADR

                ; subtract dividend hi cell minus divisor
                sec
                lda 3,x
                sbc 1,x
                sta TMPADR+1
                lda 4,x
                sbc 2,x

                tay             ; use Y as temporary storage
                lda TMPADR      ; include bit carried
                sbc #$00
                bcc _loop

                ; make result new dividend hi cell
                lda TMPADR+1
                sta 3,x
                sty 4,x         ; used as temp storage

                bra _loop

_done:          ; drop on from the data stack, swap quotient and remainder
                inx
                inx
                jsr l_swap

z_ummod:        rts

65C816/802

816 JK Forth

Even with zero-page indexed addressing mode accessing arguments it is faster than transfering it to a zero-page scratch area (N area) which would take more overall time.

Notes:

Register A, X, Y are in 16-bit mode.
the first part of the commentary column contains the byte count separated by "/" followed by the cycle count with an optional cycle count in case of a branch taken separated by "|".
Above a "-" line is a summary in bytes and cycles of the part above.
Above a "=" line the overall summary can be seen - a distinction between normal and overflow case is shown here.

        #HEADER "UM/MOD", NOT_IMMEDIATE ; ( ud u -- rem quot )
UMsMOD  #CODE
                        ; +-----------+------------------------+
                        ; |  DIVISOR  |    D I V I D E N D     |
                        ; |           |  hi cell     lo cell   |
                        ; +-----------+-----------+------------+
                        ; |  0,X  1,X | 2,X   3,X | 4,X   5,X  |
                        ; +-----------+-----------+------------+
                        ;        Detect overflow or /0 condition:
                        ;        Subtract divisor from hi cell of dividend
        LDA     2,X     ;  2/5   C flag set means, divisor was not big enough
        CMP     0,X     ;  2/5   to avoid overflow.
        BCS     uoflo   ;  2/2|3 This also takes care of any /0 conditions.
        STY     IP      ;  2/4   Save IP because Y is needed a loop counter.
        LDY     #16     ;  3/3   We will loop 16x; but since we shift the
                        ;        dividend over at the same time as shifting
                        ;        the answer in, the operation must start AND
                        ;        finish with a shift of the lo cell of the
                        ;        dividend (which ends up holding the quotient).
                        ;        The hi cell of dividend will be kept in accum.
        ASL     4,X     ;  2/8   Initial lo cell of dividend shift.
                        ; 13/27
                        ; ------
ushft2:
        ROL             ;  1/2   Shift hi cell of dividend left one bit, also
                        ;        if 17th bit of dividend is set, divisor always
        BCS     umsb    ;  2/2|3 fits.
                        ;        Does divisor fit into hi 16 bits of dividend?
                        ;        Just compare (keeping A intact).
        CMP     0,X     ;  2/5   If carry got cleared, divisor did not fit.
        BCC     unotin  ;  2/2|3 determining if the divisor fit into the hi
umsb:                   ;        Carry always set here.
                        ;        16 bit of the dividend. 17th bit is always 0.
        SBC     0,X     ;  2/5|- Divisor fits, so actually do the subtract
        SEC             ;  1/2|- With 17th bit set carry might be zero because
                        ;        the low 16 bits could be less than the divisor.
unotin: ROL     4,X     ;  2/8   Move lo cell of dividend left one bit, also
                        ;        shifting answer in.
        DEY             ;  1/2   later gets pushed off the other end in the
        BNE     ushft2  ;  2/3|2 last rotation.
                        ; 15/ (31*quot_bit_1)
                        ;    +(25*quot_bit17_bit_1)
                        ;    +(23*quot_bit_0)
                        ;    -1
                        ; ------

        LDY     IP      ;  2/4   Restore IP
uleave:                 ;        Put quotient and remainder on stack where
        STA     2,X     ;  2/5   dividend used to be, and restore the SP.
        #INX2           ;  2/4   Leave 2 cells on stack.
        JMP     USswap  ;  3/24  Swap quotient and remainder and restore IP
                        ;  9/37
                        ; ------
                        ; 44/27
                        ;    +(31*quot_bit_1)
                        ;    +(25*quot_bit17_bit_1)
                        ;    +(23*quot_bit_0)-1
                        ;    -1
                        ;    +37
                        ; ====== Normal
uoflo:  LDA     #$FFFF  ;  3/3   An overflow or /0 condition occured:
        STA     4,X     ;  2/5   Both the quotient
        BRA     uleave  ;  2/3   and reminder to -1!
                        ;  7/61 (13+11+37)
                        ; ====== Overflow (intro+uleave)

Garth Wilson's Forth

Implemention presented by Garth Wilson, Corrected Forth UM/MOD primitive.

Notes:

Register A and memory will be accessed in 16-bit mode, index registers X and Y in 8-bit mode.
The MVN is used to shovel all the parameters to the N zero-page area with the side-effect of incrementing X which pops off all values from the parameter stack.
The use of the scratch area N is mainly for portabilty reasons. Accessing the parameters with zero-page index addressing would be faster, because the setup of the N area costs more then the addressing mode savings.

INDEX_16:       MACRO                   ; Put X & Y into 16-bit mode.
                REP     #00010000B
 ;              NOP                     ; NOP necessary for early versions of
                ENDM                    ; '802 and '816 >4MHz.
 ;-------------------
INDEX_8:        MACRO                   ; Put X & Y into 8-bit mode.
                SEP     #00010000B
 ;              NOP                     ; NOP necessary for early versions of
                ENDM                    ; '802 and '816 >4MHz.
 ;-------------------

        HEADER "UM/MOD", NOT_IMMEDIATE                  ; ( ud u -- rem quot )
UMsMOD: CODE            ; To save a cycle from every stack load or store, move
        TXY             ; the inputs to N.  Save stack pointer to restore later
        INY             ; with one less cell.  (At the exit, N+2 below will be
        INY             ; transferred to the top of the stack, and N, N+1 where
        STY     XSAVE   ; the divisor is, will be dropped in the bit bucket.)
                        ; +-----|-----+-----|-----+-----|-----+-----|------+
        LDY     #< N    ; |  DIVISOR  |    D I V I D E N D    |temp storage|
        LDA     #5      ; |           |  hi cell     lo cell  |of carry bit|
        MVN     0,0     ; |  N    N+1 | N+2   N+3 | N+4   N+5 | N+6   N+7  |
                        ; +-----|-----+-----|-----+-----|-----+-----|------+
        SEC             ; Detect overflow or /0 condition.  To find out, sub-
        LDA     N+2     ; tract divisor from hi cell of dividend; if C flag
        SBC     N       ; remains set, divisor was not big enough to avoid
        BCS     uoflo   ; overflow.  This also takes care of any /0 conditions.
                        ; If branch not taken, C flag is left clear for 1st ROL.
                        ; We will loop 16x; but since we shift the dividend
        LDX     #<$11   ; over at the same time as shifting the answer in, the
                        ; operation must start AND finish with a shift of the
                        ; lo cell of the dividend (which ends up holding the
        INDEX_16        ; quotient), so we start with 17 in X.  We will use Y
                        ; for temporary storage too, so set index reg.s 16-bit.
 ushftl: ROL    N+4     ; Move lo cell of dividend left one bit, also shifting
                        ; answer in.  The 1st rotation brings in a 0, which
        DEX             ; later gets pushed off the other end in the last
        BEQ     umend   ; rotation.     Branch to the end if finished.

        ROL     N+2     ; Shift hi cell of dividend left one bit, also shifting
        LDA     #0      ; next bit in from high bit of lo cell.
        ROL     A
        STA     N+6     ; Store old hi bit of dividend in N+6.

        SEC             ; See if divisor will fit into hi 17 bits of dividend
        LDA     N+2     ; by subtracting and then looking at carry flag.
        SBC     N       ; If carry got cleared, divisor did not fit.
        TAY             ; Save the difference in Y until we know if we need it.

        LDA     N+6     ; Bit 0 of N+6 serves as 17th bit.
        SBC     #0      ; Complete the subtraction by doing the 17th bit before
        BCC     ushftl  ; determining if the divisor fit into the hi 17 bits of
                        ; the dividend.  If so, the C flag remains set.
        STY     N+2     ; If divisor fit into hi 17 bits, update dividend hi
        BRA     ushftl  ; cell to what it would be after subtraction.    Branch.

 uoflo: LDA     #$FFFF  ; If an overflow or /0 condition occurs, put FFFF in
        STA     N+4     ; both the quotient
        STA     N+2     ; and the remainder.

 umend: INDEX_8
        LDX     XSAVE
        LDA     N+2     ; Put quotient and remainder on stack where dividend
        STA     2,X     ; used to be, and restore the stack pointer.  Remember
        LDA     N+4     ; that we incremented the stack pointer above with TXY,
        STA     0,X     ; INY, INY, STY XSAVE  to knock one cell off the stack.
        JMP     NEXT    ; (Doing TXY first was to keep X's value for the MVN.)
 ;-------------------

zurück zur Startseite
5.05.2023 Webmaster