In Forth Threading (Is Not Process Threads), I commented on the fig Forth models' insistence on using the CPU stack register for parameters. The code I presented there for the 6800 follows that design choice, but the code for the 6809 and 68000 does not. Here I'll show, for your consideration, the snippets for indirect threading on the 6800 done right (according to me) -- with the CPU stack implementing RP and a synthesized stack implementing Forth's SP.
You'll note that the code here will end up maybe 5% slower and 5% bigger overall than the code the fig model for the 6800 shows. But it will not mix return addresses and parameters on the same stack, and it should end up a bit easier to read and analyze, meaning that it should be more amenable to optimization techniques when generating machine code instead of i-code lists.
In the code below, the virtual registers are allocated as follows:
- W RMB 2 the instruction register points to 6800 code
- IP RMB 2 the instruction pointer points to pointer to 6800 code
- RP is the hardware stack pointer S
- FSP RMB 2 Forth SP, the parameter stack pointer
- UP RMB 2 the pointer to base of current user's 'USER' table ( altered during multi-tasking )
- N RMB 10 used as scratch by various routines
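Rendered as source, that allocation might look like the following sketch; the fig model keeps these variables in the direct page for shorter, faster access, and the ORG address here is only illustrative:
ORG $00E0 illustrative direct-page address
W RMB 2 the instruction register, points to 6800 code
IP RMB 2 the instruction pointer, points to pointer to 6800 code
* RP is the hardware stack pointer S
FSP RMB 2 Forth SP, the parameter stack pointer
UP RMB 2 pointer to the base of the current user's USER table
N RMB 10 scratch for various routines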
The symbol table entry remains the same:
HEADER
LABEL CODE_POINTER
DEFINITION
So we won't discuss it here. (See the discussion in the post on threading techniques.)
I won't repeat the fig model here, but I'll be bringing more of it into the discussion than I did in the post on threading techniques; refer to the model in my repository, here: https://sourceforge.net/p/asm68c/code/ci/master/tree/fig-forth/fig-forth.68c.
This is an exercise in refraining from early optimization, so the parameter stack pointer will point to the last item pushed, even though it's tempting to cater to the lack of negative index offset in the 6800 and point to the first free cell beyond top of stack.
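To make the chosen convention concrete, here is a minimal sketch of a push and a drop with FSP pointing at the last item pushed; the same sequences show up below as PUSHBA and the DEALLx entry points:
* push A:B onto the parameter stack
LDX FSP
DEX
DEX
STX FSP
STAA 0,X
STAB 1,X
* drop the top cell
LDX FSP
INX
INX
STX FSP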
The inner interpreter of the virtual machine looks the same, but the code surrounding it changes.
PULABX is only used by "!" (STORE), so let's start the rework there:
STORE FDB *+2 ; ( n adr --- )
LDX FSP
LDAA 2,X ; get n
LDAB 3,X
LDX 0,X ; get adr
STAA 0,X ; store it away
STAB 1,X
DEALL2 LDX FSP ; potential code-stealing point
INX
INX
INX
INX
STX FSP
JMP NEXT
And we no longer have a use for PULABX.
As a quick estimate, we're using 2 more instructions and maybe 12 more cycles this way. (I should count cycles, but that would take several minutes.) The code is much more readable, and much less likely to be impacted by changing between indirect and direct threading and between subroutine calls and hard jumps.
STABX is used in eight places, including AND, OR, XOR, and +, which all follow the same pattern. Let's look at +, and then at replacement code stealing in AND.
PLUS FDB *+2 ; ( n1 n2 --- sum )
LDX FSP
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
DEAST1 INX ; code steal if X is FSP, to deallocate a cell and overwrite top with A:B
INX
STX FSP
STABX STAB 1,X ; code steal to store A:B at X and proceed
STAA 0,X
JMP NEXT
*
*
AND FDB *+2 ; OR and XOR also follow this template
LDX FSP
LDAA 0,X
LDAB 1,X
ANDB 3,X
ANDA 2,X
JMP DEAST1
*
* 0= and 0< also use the DEAST1 code steal
Again, a few more bytes and a few more cycles than using S for parameters.
GETX is used in five places (six if you don't alias I to R), so we'll look more closely at that, as well. I is the first place in the model, and the simplest. We might consider locating the body of I/R in front of NEXT, to optimize loop counter access, but there are other uses to consider.
R FDB *+2 ; ( --- n )
TSX ; Remember that the 6800 adjusts S for this.
GETX LDA A 0,X
LDA B 1,X ; no deallocate
PUSHBA LDX FSP
DEX
DEX
STX FSP
STAA 0,X
STAB 1,X
JMP NEXT
*
*
I FDB R+2
*
GETX also uses PUSHBA, which is used in 18 places, and we'll examine that, as well.
I possibly has the most frequent use of GETX, but @ has the most frequent use of PUSHBA, and is significantly more used than I and R together, so if we locate a definition body in front of NEXT, it probably should be @.
AT FDB *+2 ; ( adr --- n )
LDX FSP
LDX 0,X
* Same as GETX? No.
LDAA 0,X
LDAB 1,X ; no need to deallocate or allocate
* Same as STABX?
STAB 1,X ; code steal to store A:B at X and proceed
STAA 0,X
* JMP NEXT? Put NEXT here?
* Wait. That's the same code as STABX up there.
At this point, we are sure that we need differently organized code from the model, so we throw out all the code above and start over. Well, we don't just throw it out, we save it away where we can look at it as we rebuild it.
We consider that @ is more frequently used than !, so we make @ the focus:
R FDB *+2 ; ( --- n ) ( n *** n )
TSX ; Remember that the 6800 adjusts S for this.
GETX LDA A 0,X
LDA B 1,X ; no deallocate
PUSHBA LDX FSP
DEX
DEX
STX FSP
BRA STABX ; Doesn't save time, but saves a few bytes of repeated code.
*
*
I FDB R+2
*
*
STORE FDB *+2 ; ( n adr --- )
LDX FSP
LDAA 2,X ; get n
LDAB 3,X
LDX 0,X ; get address
STAA 0,X ; store n away
STAB 1,X
DEALL2 LDX FSP ; potential code-stealing point
DEALLS INX ; code-steal entry when X already holds FSP
INX
DEALL1 INX ; code-steal entry to drop a single cell
INX
STX FSP
BRA NEXT ; although the optimizer will recognize a JMP more directly
*
*
AT FDB *+2 ; ( adr --- n )
LDX FSP
LDX 0,X
LDABX LDAA 0,X
LDAB 1,X ; no need to deallocate or allocate
STOTOP LDX FSP
STABX STAB 1,X ; code steal to store A:B at X and proceed
STAA 0,X
NEXT LDX IP
NEXTW INX ; pre-increment mode
INX
STX IP
NEXT2 LDX 0,X ; get W which points to CFA of word to be done
NEXT3 STX W
LDX 0,X
NEXTGO JMP 0,X
*
*
DOCOL LDAA IP ; 6801 can PSHX
LDAB IP+1
PSHB
PSHA
LDX W ; Get first sub-word of that definition
BRA NEXTW ; and execute it
*
*
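As the comment on DOCOL's first line hints, the 6801's PSHX collapses the push of IP; a sketch of that variant (6801 only):
LDX IP ; push IP in one instruction on the 6801
PSHX
LDX W
BRA NEXTW
*
*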
SEMIS FDB *+2
TSX
INS
INS
LDX 0,X ; get address we have just finished.
BRA NEXTW ; increment the return address & do next word
@ , DOCOL, and ;S are cases where the synthetic stack yields better performance and tighter code than the model code, balancing the hit we take in definitions such as ! .
Continuing: instead of focusing on + for STABX, we'll focus on 0=, which is more frequently used, and return a -1 flag instead of a 1, in keeping with more common Forth practice:
ZEQU FDB *+2 ; ( n --- flag )
LDX FSP
LDAB 0,X ; gather both bytes in B so the test covers the whole cell
ORAB 1,X
BNE ZEQUT
DEC B
TBA
BRA STABX
ZEQUT CLRA
CLRB
JMP STABX
*
*
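0< (not shown in this excerpt) would follow the same pattern, testing the sign of the high byte; a sketch:
ZLESS FDB *+2 ; ( n --- flag )
LDX FSP
LDAB 0,X ; the sign is in the high byte
BPL ZLESSF
LDAB #$FF ; negative, so return true (-1)
TBA
JMP STABX
ZLESSF CLRA ; zero or positive, so return false (0)
CLRB
JMP STABX
*
*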
PLUS FDB *+2 ; ( n1 n2 --- sum )
LDX FSP
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
DEAST1 INX ; code steal if X is FSP, to deallocate a cell and overwrite top with A:B
INX
STX FSP
JMP STABX
*
*
AND FDB *+2 ; ( n1 n2 --- n3 ) OR and XOR can do this, too.
LDX FSP
LDAA 0,X
LDAB 1,X
ANDB 3,X
ANDA 2,X
BRA DEAST1
*
*
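For completeness, here is OR on the same template (a sketch, not taken from the model; XOR would substitute EORB and EORA):
OR FDB *+2 ; ( n1 n2 --- n3 )
LDX FSP
LDAA 0,X
LDAB 1,X
ORAB 3,X
ORAA 2,X
BRA DEAST1
*
*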
DPLUS FDB *+2 ; ( d1 d2 --- dsum )
LDX FSP
CLC
LDA B #4
DPLUS2 LDA A 3,X ; This loop is worth flattening on the 6801 and 6809.
ADC A 7,X
STA A 7,X
DEX
DEC B
BNE DPLUS2
JMP DEALL2 ; reload FSP and drop d2; the loop left X four bytes below FSP
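Since the loop comment mentions flattening, here is a sketch of the unrolled body (it runs on the plain 6800 as well; the 6801 and 6809 would further tighten the loads and stores):
* DPLUS body, flattened:
LDX FSP
LDAA 3,X ; least significant bytes first
ADDA 7,X
STAA 7,X
LDAA 2,X
ADCA 6,X
STAA 6,X
LDAA 1,X
ADCA 5,X
STAA 5,X
LDAA 0,X
ADCA 4,X
STAA 4,X
JMP DEALL2 ; drop d2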
Except for a few very specialized definitions, the coding in i-code list definitions won't change. The 2+ example is not one of those few that would change, so we won't take the time to look at it.
As noted above, improved DOCOL and ;S will improve entry and exit times for high-level definitions, so I think we can be justified in using a synthetic parameter stack to avoid mixing parameters with return addresses.
Why did I waste a day going through these calculations?
One, to demonstrate that focusing too early on optimizations often leads us into tricky code that is not necessary.
(This is not the swamp monster some people have made of it, however. Working out optimizations often helps us understand the dark corners of code that are blocking progress. It's just something to keep in mind.)
Two, as I mentioned in the previous post and above, understandable code lends itself better to mechanical optimization. Even 6800 code can be optimized. Let's look at 2+ again, after all.
TWOP FDB DOCOL
FDB TWO,PLUS
FDB SEMIS
DOCOL and ;S are recognized as entry and exit. Let's probe 2:
TWO FDB DOCON
FDB 2
Our optimizer should give constants specific attention anyway, so maybe we won't probe DOCON further. We can just put
CLRA
LDAB #2
JMP PUSHBA
into the object code trough and expand PUSHBA, then STABX:
CLRA
LDAB #2
LDX FSP
DEX
DEX
STX FSP
STAB 1,X
STAA 0,X
which brings us to NEXT. Now we can probe + and expand it, then expand STABX at the end:
CLRA
LDAB #2
LDX FSP
DEX
DEX
STX FSP
STAB 1,X
STAA 0,X
LDX FSP
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
INX
INX
STX FSP
STAB 1,X
STAA 0,X
Find and clear the redundant STX and LDX:
CLRA
LDAB #2
LDX FSP
DEX
DEX
* STX FSP
STAB 1,X
STAA 0,X
* LDX FSP
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
INX
INX
STX FSP
STAB 1,X
STAA 0,X
Move accumulator activity closer together:
LDX FSP
DEX
DEX
CLRA
LDAB #2
STAB 1,X
STAA 0,X
LDAA 2,X
LDAB 3,X
ADDB 1,X
ADCA 0,X
INX
INX
STX FSP
STAB 1,X
STAA 0,X
Recognize that we are storing a constant on the stack and then adding it to what was the top as a pattern, and collapse the data movement:
LDX FSP
DEX
DEX
* CLRA
* LDAB #2
* STAB 1,X
* STAA 0,X
LDAA 2,X
LDAB 3,X
* ADDB 1,X
* ADCA 0,X
ADDB #2
ADCA #0
INX
INX
STX FSP
STAB 1,X
STAA 0,X
Recognize the unnecessary adjustments to the stack pointer, including the unnecessary update at the end:
LDX FSP
* DEX
* DEX
* LDAA 2,X
* LDAB 3,X
LDAA 0,X
LDAB 1,X
ADDB #2
ADCA #0
* INX
* INX
* STX FSP
STAB 1,X
STAA 0,X
Examine the result,
LDX FSP
LDAA 0,X
LDAB 1,X
ADDB #2
ADCA #0
STAB 1,X
STAA 0,X
to weigh it against the allowed size expansion -- 6 bytes of i-code vs. 14 bytes of machine code.
That's more than double the size, but it provides opportunities for further optimizations where 2+ is used, so it is likely to be kept.
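For the record, here is where those numbers come from, assuming FSP lives in the direct page so that LDX FSP assembles to two bytes:
* i-code body of 2+ : FDB TWO, FDB PLUS, FDB SEMIS = 3 cells = 6 bytes
* machine code: 7 instructions of 2 bytes each = 14 bytes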
Of course, an actual optimizer would already have optimized DOCON constants out, so a lot of that would already be done.
But it should be clear that code that is readable is easier to optimize, which, added to the increased safety of not mixing parameters with return addresses, justifies using a synthetic stack.
It might even be motivation to use a locals stack instead of the return stack for the actual target of R , >R and R> , which is something else I would encourage considering.
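To make that concrete, here is a sketch of >R moving the top of the parameter stack to a separate locals stack, assuming a hypothetical LSP variable reserved with RMB 2 and handled exactly like FSP:
TOR FDB *+2 ; ( n --- ) move n to the locals stack
LDX FSP
LDAA 0,X ; fetch n
LDAB 1,X
INX ; drop it from the parameter stack
INX
STX FSP
LDX LSP ; LSP: hypothetical locals stack pointer
DEX
DEX
STX LSP
STAA 0,X
STAB 1,X
JMP NEXT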