Code-threading in the context of Forth deserves its own rant to help untangle the knot I picked up in Computer Languages -- Interpreted vs. Compiled Forth?. This is not a complete treatment, only background and an overview. YMMV.
Forth Threading (Is Not Process Threads)
If you've worked much with low-level code, then when I mentioned (in comparing compiled and interpreted Forth) that Forth compiles definitions to lists of the addresses of other definitions, you will have recognized a recursion there -- one that has to be broken at leaf calls if a Forth VM ever wants to get any useful work done.
At some point, real code has to be executed.
The terms "direct-threaded",
"indirect-threaded", and "subroutine threaded" are generally invoked to
describe the means of breaking the recursion, but they tend to be used
differently by different engineers.
Before I get into things, consider a casual assertion I slipped past you in the previous post:
The run-time architecture of a compiled language is a corollary of a virtual machine.
I'm
not sure whether this is obvious or not. When I dug into my first fig
model, the one for the 6800, it seemed obvious to me. But many people
talking about Forth seemed to not even want to talk about the
inner-interpreter as a VM. When I started looking at the Pascal virtual
machine, I saw all sorts of parallels, and when I started reading the
output of the C compilers at school, I saw most of the same essential
parallels. (And none of my professors seemed to want to compare run-time architectures with VMs, either.)
- Stack pointer(s): Both VMs and non-VM run-times usually have one or more stack pointers.
- Call protocol: Both VMs and non-VM run-times have call protocols.
- Global variables/parameters: Both VMs and non-VM run-times have them.
- Local variables/parameters: Likewise.
- Instruction pointer(s): Again, in both.
- Scratch registers: In both.
- Current procedure/function pointer: Necessary in either if object-like self-inspection is part of the language.
- Stack frames: Of questionable utility in either; more often not present when parameters are separated from return pointers.
Stack frames in C had me scratching my head, at first. I could almost understand why Pascal defined them, because of the existence of local procedures and functions. But C had no such concepts, and still wasted the time to build and tear-down stack frames. The only conclusion I could come to was that the compiler authors wanted the convenience of not having to remember how deeply things were stacked in the current expression evaluation, plus the offset for the local parameters. And it seemed obvious to me that the difficulty in tracking was primarily due to interleaving the stacks.
Should I have considered the
possibility that certain well-known system architects simply preferred
to have easily crashed stacks?
Anyway, a stack frame is not a required element of a run-time architecture.
Setting aside the call protocols for a bit, let's talk about register mappings.
The fig-Forth model run time includes the following registers:
- W: pointer to the currently running definition
- IP: pointer to the next i-code for the VM to execute
- RP: pointer to the top of the stack of nested calls
- SP: pointer to the top of the stacked parameters
- UP: pointer to the base of the user process state table
- PC: pointer to next machine code to execute (on non-Forth CPUs)
- scratch registers for intermediate values and operators.
Common C run-times statically map the following registers:
- IP or PC: pointer to next CPU instruction to perform
- SP: pointer to the top of the stack of nested calls
- Link: cached most recent return address, not saved in leaf routines (Not present in all run-times.)
- Heap pointer: Often implicitly the bottom of global address space
- Here pointer: Pointer to the state of the currently active object in object-oriented languages
- Scratch registers for intermediate values and operators
Now we can make some comparisons.
IP points to virtual instructions when a VM is running, and is separate from the processor's PC, but the two are the same essential concept.
SP and RP, as mentioned, are tangled concepts for saving the state of a routine or function before it calls another.
W and Link are not the same, even if neither is saved in leaf routines. W is the currently executing definition, Link is the caller.
UP and the heap address/pointer are essentially corollary, even if what they contain is organized differently.
However, ...
Here are some possible register mappings for (fig) Forth vs. certain common C run-times:
On the 68000:
- A3: W vs. Link (maybe), here pointer, or scratch
- A4: IP vs. scratch
- A7: RP vs. interleaved stack pointer
- A6: Forth SP vs. frame pointer (if present) (Note that MOVEM={PUSH|POP} for all An.)
- A5: UP vs. heap pointer (if present)
- A0~A2, D0~D7: scratch
On the 6809:
- DP[0] (first two bytes in the direct page): W vs. unspecified/scratch
- Y: IP (save before using for other things) vs. unspecified/scratch
- S: RP vs. interleaved stack pointer
- U: Forth SP vs. frame pointer or scratch
- DP: UP (putting W in the user state table) vs. optional optimization use or per-task variables
- X, D (A:B): scratch
On the 6800:
- $F0 in the direct (zero) page: W vs. frame pointer or scratch
- $F6 in the direct (zero) page: IP vs. scratch
- SP: RP vs. interleaved stack pointer
- $F4 in the direct (zero) page: SP vs. scratch
- $F2: UP vs. heap pointer or scratch
- X, A, B: scratch
What I want to point out is that C is really stripped down. This is because C was originally structured to be very lightweight, to use as few of the processor resources as possible, leaving the rest to the compiler or application to make optimal use of. (But current trends in C have been adding things like the here pointer to the run-time model.)
Also, C was designed to, if possible, use no global RAM. This is not possible on the 6800 because there are so few registers, and you have to use scratch RAM to do many things. On the 6801, it becomes possible, if a bit awkward, to move all scratch RAM to an interleaved stack.
I should note that my assignment of the CPU's stack pointer to the return stack pointer, and the consequent use of a synthetic stack pointer as the parameter stack pointer, is opposite of the fig Forth models. I'm not sure whether the engineers who built the 68000 or 6809 versions of the fig models were aware that the MOVEM instruction that does PUSH and POP on the 68000 operates equally on all address registers, or that the 6809's U register has its own PSHU and PULU instructions, with no overhead compared to PSHS and PULS.
With the 6800, it was probable that the PSH/PUL A/B instructions were desired for the parameter stack, but I have done scratch calculations showing that they provide only minimal advantage. (For me, five percent is minimal when it means something like data and return addresses on the same stack.)
You have to use some sort of global RAM variable for one or the other of the stack pointers on the 6800. If it's in the direct page, the choice balances out a bit.
I'll emphasize this -- I prefer to avoid mixing parameters with return addresses whenever possible.
Now let's look at the call protocols.
In C, various call protocols have been used. Points where the protocols differ:
- Saving registers --
- Calling routine saves the registers it needs preserved, vs.
- the called routine saves the registers it uses.
- Building a stack frame or not
- Caching the most recent caller in a register or not
Likewise, in Forth, various call protocols have been used.
A certain aspect of the call protocol is the primary point of distinction between the three threading types in Forth, but there is still a bit of variation in how the calls are implemented.
I'll initially analyze the 6800 fig model (Find the actual source here, and the assembled listing here.), then extrapolate from there. In this analysis, I'll use the fig Forth register assignments, with all VM registers in the direct (zero) page except for SP:
- W RMB 2 the instruction register points to 6800 code
- IP RMB 2 the instruction pointer points to pointer to 6800 code
- RP RMB 2 the return stack pointer
- UP RMB 2 the pointer to base of current user's 'USER' table ( altered during multi-tasking )
- N RMB 10 used as scratch by various routines
- * SP is S
(For an in-depth look at how this plays out using a synthetic stack for the parameters and the CPU stack for RP, see here.)
The inner interpreter of the virtual machine looks like this:
NEXT
LDX IP
INX pre-increment fetch from i-code list in definition
INX
STX IP Update it.
NEXT2
LDX 0,X Get the i-code which points to a pointer to CPU code
NEXT3
STX W This is the same as the label in the symbol table.
LDX 0,X Get pointer to executable code.
NEXTGO
JMP 0,X Jump to executable code.
The definition header consists of a symbol table entry like this:
HEADER
LABEL CODE_POINTER
DEFINITION
The HEADER consists of the name that the Forth command line interpreter will recognize, massaged a bit, and a link to the previous definition. (I'll explain some of that below.)
The CODE_POINTER is a pointer to machine language code the CPU can execute directly.
For a low-level definition, the definition consists of the machine code that the CPU can execute, and this is where the CODE_POINTER points. The code ends with a jump back to NEXT (or some other appropriate place that leads back to NEXT).
For a "high-level" (non-leaf) definition, the CODE_POINTER points to a nesting routine, and the definition consists of the list of definition addresses -- as i-codes for the virtual machine inner interpreter as described in part one of this rant. This list will be terminated by an i-code that unnests the VM and returns to the caller.
For an easy example of a low-level (leaf) definition, we can look at the code to add a double integer, starting at line 1008 in the assembled listing.
Note that the header starts with the length of the symbol name for Forth masked with some mode bits, followed by all but the last of the characters of the name, then the last character with its high bit set. ("$" says hexadecimal; ASCII for "+" is hexadecimal 2B, which becomes AB with the high bit set.) The mode/length byte has its high bit set, as well, to brace the name. The link to the previous definition in the table, PLUS, is adjusted to point to that definition's name: two bytes of link, one byte of length, and one byte of name makes 4.
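As a cross-check on that byte-by-byte reading, the name-field encoding is easy to reproduce in C. This is a minimal sketch; the function and buffer names are mine, not fig Forth's:

```c
#include <stddef.h>
#include <string.h>

/* Encode a fig-style name field: a length byte with its high bit set
   as a mode bit, then every character of the name, with the high bit
   also set on the last character to brace the name.
   Returns the number of bytes written. */
static size_t encode_name(const char *name, unsigned char *out)
{
    size_t len = strlen(name);
    out[0] = (unsigned char)(0x80 | len);  /* mode/length byte */
    memcpy(out + 1, name, len);            /* the characters */
    out[len] |= 0x80;                      /* brace the last character */
    return len + 1;
}
```

Encoding "D+" yields exactly the $82, $44, $AB bytes shown in the listing.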
"*+" means "address of here plus 2", which address is where the machine code starts:
13FD 82           FCB  $82
13FE 44           FCC  1,D+
13FF AB           FCB  $AB
1400 13 ED        FDB  PLUS-4
1402 14 04 DPLUS  FDB  *+2
1404 30           TSX
1405 0C           CLC
1406 C6 04        LDA B #4
1408 A6 03 DPLUS2 LDA A 3,X
140A A9 07        ADC A 7,X
140C A7 07        STA A 7,X
140E 09           DEX
140F 5A           DEC B
1410 26 F6        BNE DPLUS2
1412 31           INS
1413 31           INS
1414 31           INS
1415 31           INS
1416 7E 10 34     JMP NEXT
For an easy example of a definition that consists of i-codes, we can look at a convenience definition that adds 2 to the top integer on the stack, starting at line 1535. The listing is a little awkward -- it's hard to see that TWOP labels the address 171D, and that the six bytes following are the label values of DOCOL (1525), TWO (15B2), and PLUS (13F1). (Remember, the 6800 is most significant byte first, so you don't have to reverse the bytes in your mind.)
So the CODE_POINTER for "2+" is DOCOL, and the i-code list ends with SEMIS.
1718 82                     FCB  $82
1719 32                     FCC  1,2+
171A AB                     FCB  $AB
171B 17 0B                  FDB  ONEP-5
171D 15 25 15 B2 13 F1 TWOP FDB  DOCOL,TWO,PLUS
1723 13 67                  FDB  SEMIS
Let's look at DOCOL, starting at line 1206. It's part of the mixed definition COLON, and has no header of its own:
1525 DE F4 DOCOL LDX RP      make room in the stack
1527 09          DEX
1528 09          DEX
1529 DF F4       STX RP
152B 96 F2       LDA A IP
152D D6 F3       LDA B IP+1
152F A7 02       STA A 2,X   Store address of the high level word
1531 E7 03       STA B 3,X   that we are starting to execute
1533 DE F0       LDX W       Get first sub-word of that definition
1535 7E 10 36    JMP NEXT+2  and execute it
We can see how it decrements RP, saves the current IP there, loads the address that NEXT just stored in W, then jumps into the right place in the NEXT inner interpreter to set the new IP and fetch the appropriate new W.
To brace our understanding of this, let's look at SEMIS, starting at line 896:
1362 82          FCB  $82
1363 3B          FCC  1,;S
1364 D3          FCB  $D3
1365 13 52       FDB  RPSTOR-6
1367 13 69 SEMIS FDB  *+2
1369 DE F4       LDX RP
136B 08          INX
136C 08          INX
136D DF F4       STX RP
136F EE 00       LDX 0,X     get address we have just finished.
1371 7E 10 36    JMP NEXT+2  increment the return address & do next word
We can see that it is a low-level leaf. It increments RP, gets the saved IP, and jumps into the right place in NEXT to store it back in IP and continue where the caller left off.
This threading of functionality through low-level and high-level definitions is what the Forth community calls "threaded".
And the above model is an example of indirect threading, where the i-codes are pointers to pointers to code.
In case you're wondering, no, this is not the only way to do indirect threading. It's just one example.
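For readers who think better in C than in 6800 assembly, here is a rough C analogue of the scheme above. All the names (word, next_loop, do_two, and so on) are mine, and a while loop stands in for the chain of jumps; the point is only the double indirection: an i-code points to a code field, and the code field points to the code.

```c
#include <stddef.h>

/* An i-code is a pointer to a word; the word's code field points to the
   routine the CPU actually runs -- a pointer to a pointer to code. */
typedef struct word {
    void (*code)(struct word *self);     /* CODE_POINTER */
    struct word **body;                  /* i-code list, for colon definitions */
} word;

static word **ip;                              /* IP: next i-code */
static long dstack[32], *sp = dstack + 32;     /* parameter stack (SP) */
static word **rstack[32], ***rp = rstack + 32; /* return stack (RP) */
static int running = 1;

/* NEXT, written as a loop instead of a jump */
static void next_loop(void)
{
    while (running) {
        word *w = *ip++;   /* fetch the i-code, advance IP */
        w->code(w);        /* jump through the code field */
    }
}

/* low-level (leaf) definitions */
static void do_two(word *self)  { (void)self; *--sp = 2; }
static void do_plus(word *self) { (void)self; sp[1] += sp[0]; sp++; }
static void do_bye(word *self)  { (void)self; running = 0; }

/* DOCOL nests into a colon definition; SEMIS unnests */
static void do_docol(word *self) { *--rp = ip; ip = self->body; }
static void do_semis(word *self) { (void)self; ip = *rp++; }

static word two   = { do_two,   NULL };
static word plus  = { do_plus,  NULL };
static word bye   = { do_bye,   NULL };
static word semis = { do_semis, NULL };

/* 2+ : a colon definition -- its code field is DOCOL, its body i-codes */
static word *twop_body[] = { &two, &plus, &semis };
static word twop = { do_docol, twop_body };

/* a tiny "program": run 2+, then stop */
static word *prog[] = { &twop, &bye };
```

Pushing 5, pointing ip at prog, and calling next_loop() leaves 7 on the parameter stack.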
Now, what about direct threading?
We'll stick with the same register model. And we'll start with the routine for double add, since it's sort-of familiar. The header structure will be the same, up to the label.
DPLUS TSX ; index the parameters
CLC ; so we can do this in a loop
LDA B #4 ; bytes to add, start with least significant
DPLUSL LDA A 3,X ; right-hand term
ADC A 7,X ; left-hand term
STA A 7,X ; overwrite left-hand term
DEX ;
DEC B ; done?
BNE DPLUSL ; back for more
INS ; Deallocate right-hand term.
INS
INS
INS
JMP NEXT
The only thing that has changed is that the CODE_POINTER is missing. So how do the i-code lists get their starts? Let's look at 2+:
TWOP JMP DOCOL
FDB TWO,PLUS,SEMIS
Now the i-code lists start with a little machine language code. On the 6801, DOCOL could be located in the direct page, and the jump would be only two bytes, FWIW. But you'd still be starting an i-code list with something that doesn't even look like an i-code. That adds a bump in designing debuggers, and in code analysis of the i-code lists.
Let's see how NEXT changes to support this:
NEXT LDX IP
INX ; We can still do pre-increment mode
INX
STX IP
NEXT2 LDX 0,X ; get W which points to definition to be done
NEXT3 STX W
JMP 0,X ; Just jump right to it.
This looks like it ought to be faster, by a small amount. What happens to DOCOL and SEMIS, then? Interestingly enough, they don't have to change, except for losing the CODE_POINTER.
So why use indirect threading?
Basically, it keeps the i-code list pure, which simplifies debugging tools and such. Also, indirect threading helps avoid some of the problems your developers' tools may make for you, when they try to help you. (Yet another topic.)
Again, this is not the only way to do direct-threading.
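Continuing the C analogue (names again mine, and only a sketch), direct threading drops the second indirection: an i-code points straight at executable code, and each colon definition carries its own small entry stub in place of the CODE_POINTER -- the C stand-in for the JMP DOCOL above.

```c
/* An i-code now points straight at executable code. */
typedef void (*icode)(void);

static icode *ip;                              /* IP */
static long dstack[32], *sp = dstack + 32;     /* parameter stack */
static icode *rstack[32], **rp = rstack + 32;  /* return stack */
static int running = 1;

/* NEXT: no second indirection -- just jump right to it */
static void next_loop(void)
{
    while (running)
        (*ip++)();
}

/* leaf definitions */
static void two(void)   { *--sp = 2; }
static void plus(void)  { sp[1] += sp[0]; sp++; }
static void bye(void)   { running = 0; }
static void semis(void) { ip = *rp++; }   /* unnest */

/* 2+ : the stub function below is the "little machine language code"
   in front of the i-code list -- it does what JMP DOCOL would do */
static icode twop_body[] = { two, plus, semis };
static void twop(void) { *--rp = ip; ip = twop_body; }

static icode prog[] = { twop, bye };
```

The trade-off the text describes is visible here: prog and twop_body are no longer uniform lists of word pointers a debugger could walk; twop is a bit of code standing where data used to be.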
How about subroutine threading?
In subroutine threading, the definitions are called by subroutine. This can be done either indirect-threaded or direct-threaded, but it is typically done direct-threaded, on the theory that it is faster. NEXT looks like this:
NEXT LDX IP
INX ; We can still do pre-increment mode
INX
STX IP
NEXT2 LDX 0,X ; get W which points to definition to be done
NEXT3 STX W
* LDX 0,X ; if doing indirect
JSR 0,X ; Use a native call.
To match this, low-level definitions end in an RTS instead of a JMP NEXT. The call and return actually take a bit more time than the JMP, having to save and restore the return address on the CPU stack.
DOCOL doesn't have to change, and SEMIS ends in an RTS. Certain definitions we haven't looked at change because of return addresses that are now on the parameter stack. (Using the CPU's call stack for the parameter stack causes the greater ripple effect here.)
Again, these are not the only ways of doing subroutine threading. In particular, since the call in the direct-threaded model actually saves W on the return stack, explicitly having a W register in the VM and storing X to it at NEXT3 can be done away with. Such an approach does require changes to DOCOL, however.
Some daredevils pervert the CPU's stack pointer into the IP and NEXT becomes a CPU return. I say daredevil, because, if the CPU gets any sort of interrupt, the interrupt walks all over the code. (Most CPUs save interrupt state on the call stack.) If you're that desperate to use an IP register with auto-increment mode, use a CPU that has a valid auto-increment mode for a non-stack register.
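Sketching subroutine threading in C is almost too easy, because native calls are all C knows how to do. In this toy version (names mine, a sketch of the technique rather than any particular Forth), a colon definition is simply a function that calls its sub-words in order, the CPU's call stack holds the return addresses, and the parameters stay on a separate data stack:

```c
/* split parameter stack, kept out of the CPU's call stack */
static long dstack[32], *sp = dstack + 32;

/* leaf definitions */
static void two(void)  { *--sp = 2; }
static void plus(void) { sp[1] += sp[0]; sp++; }

/* 2+ compiled as subroutine calls: JSR TWO / JSR PLUS / RTS */
static void twop(void) { two(); plus(); }
```

The C compiler's own call/return mechanism is doing the job NEXT and SEMIS did in the earlier models.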
Just for curiosity's sake, let's look at non-subroutine indirect threading on the 6809, going with my recommendation of using the U stack for parameters this time. The header will not change. Registers in the virtual machine will be assigned as above:
- W: X (Save before using for other things, if you need W.)
- IP: Y (save before using for other things and restore before NEXT)
- RP: S
- SP: U
- UP: DP
The examples, as with the 6800:
DPLUS FDB *+2
LDD 6,U ; left-hand
ADDD 2,U ; right-hand
STD 6,U
LDD 4,U ; left-hand
ADCB 1,U ; No ADCD, do it by bytes.
ADCA ,U
LEAU 4,U ; deallocate before store to save a cycle.
STD ,U
JMP NEXT
* JMP [,Y++] ; This could be NEXT on the 6809, if ignoring W
*
DOCOL, NEXT, and SEMIS are significantly optimized:
DOCOL PSHS Y ; save the caller's IP
LEAY 2,X ; new IP: the body follows the code field X points to
NEXT LDX ,Y++ ; using post-increment instead of pre-
JMP [,X]
*
SEMIS PULS Y
BRA NEXT
How about the 68000? It gives us a 32-bit add naturally, so we'll define a 64-bit double add. Using the registers as I suggest above,
DPLUS DC.L *+4
MOVEM.L (A6),D0-D3 ; D0:D1 = right-hand term, D2:D3 = left-hand term
ADD.L D1,D3 ; less significant 32 bits
ADDX.L D0,D2 ; more significant 32 bits
LEA 8(A6),A6 ; deallocate right-hand term
MOVEM.L D2-D3,(A6) ; store the sum over the left-hand term
JMP NEXT
DOCOL MOVE.L A4,-(A7) ; save the caller's IP
LEA 4(A3),A4 ; new IP: the body follows the code field A3 points to
NEXT MOVE.L (A4)+,A3 ; using post-increment instead of pre-
MOVE.L (A3),A2
JMP (A2)
*
SEMIS MOVE.L (A7)+,A4
BRA NEXT
The shift from indirect-threaded to direct, and the use of subroutine threading, would follow pretty much as on the 6800, modulo the advantage of having the VM registers in actual registers instead of in memory.
There is one more step I mentioned in passing in part one of this rant. If using native calls, a certain degree of speed optimization can be obtained by flattening an i-code list. Instead of starting the definition to be optimized this way with a jump to the inner interpreter, the definition can be compiled as a series of calls.
On the 6809, 2+ can be changed from
TWOP JMP DOCOL
FDB TWO,PLUS,SEMIS
to
TWOP JSR TWO
JSR PLUS
RTS
And from there, optimizing to
TWOP LDD #2
ADDD ,U
STD ,U
RTS
becomes almost trivial.
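The two stages of 2+ can be mirrored in C (names mine): once the word is a plain sequence of calls, collapsing it to the equivalent of the LDD/ADDD/STD version is a mechanical peephole step.

```c
/* split parameter stack, standing in for the 6809's U stack */
static long dstack[32], *sp = dstack + 32;

static void two(void)  { *--sp = 2; }
static void plus(void) { sp[1] += sp[0]; sp++; }

/* 2+ flattened to calls: JSR TWO / JSR PLUS / RTS */
static void twop_calls(void) { two(); plus(); }

/* 2+ after optimization: LDD #2 / ADDD ,U / STD ,U / RTS */
static void twop_opt(void) { sp[0] += 2; }
```

Both versions leave the same result on the stack; the second just skips the intermediate push and the two calls.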
(I also discuss optimizations on the 6800 a little in the synthetic stack examples here: https://defining-computers.blogspot.com/2020/10/6800-example-vm-with-synthetic.html.)
Caveat: It's way past my bedtime again, and I may have glaring mistakes in the above. If so, leave me comments, please.
Why is this important?
Using the fig Forth VM, it is a bit difficult to share the Forth words as library code for compiled languages such as C. Using the native CPU calls instead, it becomes possible to share, if the compiler for the other language uses a split stack.
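A sketch of what that sharing could look like in C, with entirely hypothetical names: the word keeps its parameters on the split data stack, and a thin wrapper lets ordinary C code stage arguments there and call it, so the word touches C's interleaved stack only for its return address.

```c
/* the shared split parameter stack */
static long dstack[32], *sp = dstack + 32;

/* a Forth primitive compiled for native calls: it expects its operands
   on the data stack and ends in a plain return (RTS) */
static void forth_plus(void) { sp[1] += sp[0]; sp++; }

/* a C-callable wrapper: push the arguments, call the word, pop the result */
static long call_plus(long a, long b)
{
    *--sp = a;
    *--sp = b;
    forth_plus();
    return *sp++;
}
```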