In Forth Threading (Is Not Process Threads), I commented on the fig Forth models' insistence on using the CPU stack register for parameters. The code I presented there for the 6800 follows that design choice, but the code for the 6809 and 68000 does not. Here I'll show, for your consideration, the snippets for indirect threading on the 6800 done right (according to me) -- with the CPU stack implementing RP and a synthesized stack implementing Forth's SP.
You'll note that the code here will end up maybe 5% slower and 5% bigger overall than the code the fig model for the 6800 shows. But it will not mix return addresses and parameters on the same stack, and it should end up a bit easier to read and analyze, meaning that it should be more amenable to optimization techniques when generating machine code instead of i-code lists.
In the code below, the virtual registers are allocated as follows:
- W RMB 2 the instruction register points to 6800 code
- IP RMB 2 the instruction pointer points to pointer to 6800 code
- RP is the hardware stack pointer S
- FSP RMB 2 Forth SP, the parameter stack pointer
- UP RMB 2 the pointer to base of current user's 'USER' table ( altered during multi-tasking )
- N RMB 10 used as scratch by various routines
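Rendered as source, that allocation might look like the following sketch; the fig model keeps these variables in the direct page for shorter, faster access, and the ORG address here is only illustrative:
ORG $00E0 illustrative direct-page address
W RMB 2 the instruction register, points to 6800 code
IP RMB 2 the instruction pointer, points to pointer to 6800 code
* RP is the hardware stack pointer S
FSP RMB 2 Forth SP, the parameter stack pointer
UP RMB 2 pointer to the base of the current user's USER table
N RMB 10 scratch for various routines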
The symbol table entry remains the same:
HEADER
LABEL CODE_POINTER
DEFINITION
So we won't discuss it here. (See the discussion in the post on threading techniques.)
I won't repeat the fig model here, but I'll be bringing more of it into the discussion than I did in the post on threading techniques; refer to the model in my repository, here: https://sourceforge.net/p/asm68c/code/ci/master/tree/fig-forth/fig-forth.68c.
This is an exercise in refraining from early optimization, so the parameter stack pointer will point to the last item pushed, even though it's tempting to cater to the lack of negative index offset in the 6800 and point to the first free cell beyond top of stack.
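To make the chosen convention concrete, here is a minimal sketch of a push and a drop with FSP pointing at the last item pushed; the same sequences show up below as PUSHBA and the DEALLx entry points:
* push A:B onto the parameter stack
LDX FSP
DEX
DEX
STX FSP
STAA 0,X
STAB 1,X
* drop the top cell
LDX FSP
INX
INX
STX FSP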
The inner interpreter of the virtual machine looks the same, but the code surrounding it changes.
PULABX is only used by "!" (STORE), so let's start the rework there:
STORE FDB *+2 ; ( n adr --- )
LDX FSP
LDAA 2,X ; get n
LDAB 3,X
LDX 0,X ; get adr
STAA 0,X ; store it away
STAB 1,X
DEALL2 LDX FSP ; potential code-stealing point
INX
INX
INX
INX
STX FSP
JMP NEXT
And we no longer have a use for PULABX.
As a quick estimate, we're using 2 more instructions and maybe 12 more cycles this way. (I should count cycles, but that would take several minutes.) The code is much more readable, and much less likely to be impacted by changing between indirect and direct threading and between subroutine calls and hard jumps.
STABX is used in eight places, including AND, OR, XOR, and +, which all follow the same pattern. Let's look at +, and then at replacement code stealing in AND.
PLUS FDB *+2 ; ( n1 n2 --- sum )
LDX FSP
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
DEAST1 INX ; code steal if X is FSP, to deallocate a cell and overwrite top with A:B
INX
STX FSP
STABX STAB 1,X ; code steal to store A:B at X and proceed
STAA 0,X
JMP NEXT
*
*
AND FDB *+2 ; OR and XOR also follow this template
LDX FSP
LDAA 0,X
LDAB 1,X
ANDB 3,X
ANDA 2,X
JMP DEAST1
*
* 0= and 0< also use the DEAST1 code steal
Again, a few more bytes and a few more cycles than using S for parameters.
GETX is used in five places (six if you don't alias I to R), so we'll look more closely at that, as well. I is the first place in the model, and the simplest. We might consider locating the body of I/R in front of NEXT, to optimize loop counter access, but there are other uses to consider.
R FDB *+2 ; ( --- n )
TSX ; Remember that the 6800 adjusts S for this.
GETX LDA A 0,X
LDA B 1,X ; no deallocate
PUSHBA LDX FSP
DEX
DEX
STX FSP
STAA 0,X
STAB 1,X
JMP NEXT
*
*
I FDB R+2
*
GETX also uses PUSHBA, which is used in 18 places, and we'll examine that, as well.
I possibly has the most frequent use of GETX, but @ has the most frequent use of PUSHBA, and is significantly more used than I and R together, so if we locate a definition body in front of NEXT, it probably should be @.
AT FDB *+2 ; ( adr --- n )
LDX FSP
LDX 0,X
* Same as GETX? No.
LDAA 0,X
LDAB 1,X ; no need to deallocate or allocate
* Same as STABX?
STAB 1,X ; code steal to store A:B at X and proceed
STAA 0,X
* JMP NEXT? Put NEXT here?
* Wait. That's the same code as STABX up there.
At this point, we are sure that we need differently organized code from the model, so we throw out all the code above and start over. Well, we don't just throw it out, we save it away where we can look at it as we rebuild it.
We consider that @ is more frequently used than !, so we make @ the focus:
R FDB *+2 ; ( --- n ) ( n *** n )
TSX ; Remember that the 6800 adjusts S for this.
GETX LDA A 0,X
LDA B 1,X ; no deallocate
PUSHBA LDX FSP
DEX
DEX
STX FSP
BRA STABX ; Doesn't save time, but saves a few bytes of repeated code.
*
*
I FDB R+2
*
*
STORE FDB *+2 ; ( n adr --- )
LDX FSP
LDAA 2,X ; get n
LDAB 3,X
LDX 0,X ; get address
STAA 0,X ; store n away
STAB 1,X
DEALL2 LDX FSP ; potential code-stealing point
DEALLS INX ; code-steal entry when X already holds FSP
INX
DEALL1 INX ; code-steal entry to drop a single cell
INX
STX FSP
BRA NEXT ; although the optimizer will recognize a JMP more directly
*
*
AT FDB *+2 ; ( adr --- n )
LDX FSP
LDX 0,X
LDABX LDAA 0,X
LDAB 1,X ; no need to deallocate or allocate
STOTOP LDX FSP
STABX STAB 1,X ; code steal to store A:B at X and proceed
STAA 0,X
NEXT LDX IP
NEXTW INX ; pre-increment mode
INX
STX IP
NEXT2 LDX 0,X ; get W which points to CFA of word to be done
NEXT3 STX W
LDX 0,X
NEXTGO JMP 0,X
*
*
DOCOL LDAA IP ; 6801 can PSHX
LDAB IP+1
PSHB
PSHA
LDX W ; Get first sub-word of that definition
BRA NEXTW ; and execute it
*
*
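As the comment on DOCOL's first line hints, the 6801's PSHX collapses the push of IP; a sketch of that variant (6801 only):
LDX IP ; push IP in one instruction on the 6801
PSHX
LDX W
BRA NEXTW
*
*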
SEMIS FDB *+2
TSX
INS
INS
LDX 0,X ; get address we have just finished.
BRA NEXTW ; increment the return address & do next word
@ , DOCOL, and ;S are cases where the synthetic stack yields better performance and tighter code than the model code, balancing the hit we take in definitions such as ! .
Continuing: instead of focusing on + for STABX, we'll focus on 0=, which is more frequently used, and return a -1 flag instead of a 1, in keeping with more common Forth practice:
ZEQU FDB *+2 ; ( n --- flag )
LDX FSP
LDAB 0,X ; gather both bytes in B so the test covers the whole cell
ORAB 1,X
BNE ZEQUT
DEC B
TBA
BRA STABX
ZEQUT CLRA
CLRB
JMP STABX
*
*
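0< (not shown in this excerpt) would follow the same pattern, testing the sign of the high byte; a sketch:
ZLESS FDB *+2 ; ( n --- flag )
LDX FSP
LDAB 0,X ; the sign is in the high byte
BPL ZLESSF
LDAB #$FF ; negative, so return true (-1)
TBA
JMP STABX
ZLESSF CLRA ; zero or positive, so return false (0)
CLRB
JMP STABX
*
*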
PLUS FDB *+2 ; ( n1 n2 --- sum )
LDX FSP
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
DEAST1 INX ; code steal if X is FSP, to deallocate a cell and overwrite top with A:B
INX
STX FSP
JMP STABX
*
*
AND FDB *+2 ; ( n1 n2 --- n3 ) OR and XOR can do this, too.
LDX FSP
LDAA 0,X
LDAB 1,X
ANDB 3,X
ANDA 2,X
BRA DEAST1
*
*
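For completeness, here is OR on the same template (a sketch, not taken from the model; XOR would substitute EORB and EORA):
OR FDB *+2 ; ( n1 n2 --- n3 )
LDX FSP
LDAA 0,X
LDAB 1,X
ORAB 3,X
ORAA 2,X
BRA DEAST1
*
*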
DPLUS FDB *+2 ; ( d1 d2 --- dsum )
LDX FSP
CLC
LDA B #4
DPLUS2 LDA A 3,X ; This loop is worth flattening on the 6801 and 6809.
ADC A 7,X
STA A 7,X
DEX
DEC B
BNE DPLUS2
JMP DEALL2 ; reload FSP and drop d2; the loop left X four bytes below FSP
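Since the loop comment mentions flattening, here is a sketch of the unrolled body (it runs on the plain 6800 as well; the 6801 and 6809 would further tighten the loads and stores):
* DPLUS body, flattened:
LDX FSP
LDAA 3,X ; least significant bytes first
ADDA 7,X
STAA 7,X
LDAA 2,X
ADCA 6,X
STAA 6,X
LDAA 1,X
ADCA 5,X
STAA 5,X
LDAA 0,X
ADCA 4,X
STAA 4,X
JMP DEALL2 ; drop d2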
Except for a few very specialized definitions, the coding in i-code list definitions won't change. The 2+ example is not one of those few that would change, so we won't take the time to look at it.
As noted above, improved DOCOL and ;S will improve entry and exit times for high-level definitions, so I think we can be justified in using a synthetic parameter stack to avoid mixing parameters with return addresses.
Why did I waste a day going through these calculations?
One, to demonstrate that focusing too early on optimizations often leads us into tricky code that is not necessary.
(This is not the swamp monster some people have made of it, however. Working out optimizations often helps us understand the dark corners of code that are blocking progress. It's just something to keep in mind.)
Two, as I mentioned in the previous post and above, understandable code lends itself better to mechanical optimization. Even 6800 code can be optimized. Let's look at 2+ again, after all.
TWOP FDB DOCOL
FDB TWO,PLUS
FDB SEMIS
DOCOL and ;S are recognized as entry and exit. Let's probe 2:
TWO FDB DOCON
FDB 2
Our optimizer should give constants specific attention anyway, so maybe we won't probe DOCON further. We can just put
CLRA
LDAB #2
JMP PUSHBA
into the object code trough and expand PUSHBA, then STABX:
CLRA
LDAB #2
LDX FSP
DEX
DEX
STX FSP
STAB 1,X
STAA 0,X
which brings us to NEXT. Now we can probe + and expand it, then expand STABX at the end:
CLRA
LDAB #2
LDX FSP
DEX
DEX
STX FSP
STAB 1,X
STAA 0,X
LDX FSP
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
INX
INX
STX FSP
STAB 1,X
STAA 0,X
Find and clear the redundant STX and LDX:
CLRA
LDAB #2
LDX FSP
DEX
DEX
* STX FSP
STAB 1,X
STAA 0,X
* LDX FSP
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
INX
INX
STX FSP
STAB 1,X
STAA 0,X
Move accumulator activity closer together:
LDX FSP
DEX
DEX
CLRA
LDAB #2
STAB 1,X
STAA 0,X
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
INX
INX
STX FSP
STAB 1,X
STAA 0,X
Recognize that we are storing a constant on the stack and then adding it to what was the top as a pattern, and collapse the data movement:
LDX FSP
DEX
DEX
* CLRA
* LDAB #2
* STAB 1,X
* STAA 0,X
LDAA 2,X
LDAB 3,X
* ADDB 1,X
* ADCA 0,X
ADDB #2
ADCA #0
INX
INX
STX FSP
STAB 1,X
STAA 0,X
Recognize the unnecessary adjustments to the stack pointer, including the unnecessary update at the end:
LDX FSP
* DEX
* DEX
* LDAA 2,X
* LDAB 3,X
LDAA 0,X
LDAB 1,X
ADDB #2
ADCA #0
* INX
* INX
* STX FSP
STAB 1,X
STAA 0,X
Examine the result,
LDX FSP
LDAA 0,X
LDAB 1,X
ADDB #2
ADCA #0
STAB 1,X
STAA 0,X
to weigh it against the allowed size expansion -- 6 bytes of i-code vs. 14 bytes of machine code.
That's more than double the size, but it provides opportunities for further optimizations where 2+ is used, so it is likely to be kept.
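For the record, here is where those numbers come from, assuming FSP lives in the direct page so that LDX FSP assembles to two bytes:
* i-code body of 2+ : FDB TWO, FDB PLUS, FDB SEMIS = 3 cells = 6 bytes
* machine code: 7 instructions of 2 bytes each = 14 bytes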
Of course, an actual optimizer would already have optimized DOCON constants out, so a lot of that would already be done.
But it should be clear that code that is readable is easier to optimize, which, added to the increased safety of not mixing parameters with return addresses, justifies using a synthetic stack.
It might even be motivation to use a locals stack instead of the return stack for the actual target of R , >R and R> , which is something else I would encourage considering.
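To make that concrete, here is a sketch of >R moving the top of the parameter stack to a separate locals stack, assuming a hypothetical LSP variable reserved with RMB 2 and handled exactly like FSP:
TOR FDB *+2 ; ( n --- ) move n to the locals stack
LDX FSP
LDAA 0,X ; fetch n
LDAB 1,X
INX ; drop it from the parameter stack
INX
STX FSP
LDX LSP ; LSP: hypothetical locals stack pointer
DEX
DEX
STX LSP
STAA 0,X
STAB 1,X
JMP NEXT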