Code-threading in the context of Forth deserves its own rant to help untangle the knot I picked up in Computer Languages -- Interpreted vs. Compiled Forth?. This is not a complete treatment, only background and an overview. YMMV.
Forth Threading (Is Not Process Threads)
If you've worked much with low-level code, then when I mentioned (in comparing compiled and interpreted Forth) that Forth compiles definitions to lists of the addresses of other definitions, you will have recognized a recursion there -- one that has to be broken at leaf calls if a Forth VM ever wants to get any useful work done.
At some point, real code has to be executed.
The terms "direct-threaded",
"indirect-threaded", and "subroutine threaded" are generally invoked to
describe the means of breaking the recursion, but they tend to be used
differently by different engineers.
Before I get into things, consider a casual assertion I slipped past you in the previous post:
The run-time architecture of a compiled language is a corollary of a virtual machine.
I'm
not sure whether this is obvious or not. When I dug into my first fig
model, the one for the 6800, it seemed obvious to me. But many people
talking about Forth seemed to not even want to talk about the
inner-interpreter as a VM. When I started looking at the Pascal virtual
machine, I saw all sorts of parallels, and when I started reading the
output of the C compilers at school, I saw most of the same essential
parallels. (And none of my professors seemed to want to compare run-time architectures with VMs, either.)
- Stack pointer(s): Both VMs and non-VM run-times usually have one or more stack pointers.
- Call protocol: Both VMs and non-VM run-times have call protocols.
- Global variables/parameters: Both VMs and non-VM run-times have them.
- Local variables/parameters: Likewise.
- Instruction pointer(s): Again, in both.
- Scratch registers: In both.
- Current procedure/function pointer: Necessary in either if object-like self-inspection is part of the language.
- Stack frames: Of questionable utility in either; more often not present when parameters are separated from return pointers.
Stack frames in C had me scratching my head, at first. I could almost understand why Pascal defined them, because of the existence of local procedures and functions. But C had no such concepts, and still wasted the time to build and tear-down stack frames. The only conclusion I could come to was that the compiler authors wanted the convenience of not having to remember how deeply things were stacked in the current expression evaluation, plus the offset for the local parameters. And it seemed obvious to me that the difficulty in tracking was primarily due to interleaving the stacks.
Should I have considered the
possibility that certain well-known system architects simply preferred
to have easily crashed stacks?
Anyway, a stack frame is not a required element of a run-time architecture.
Setting aside the call protocols for a bit, let's talk about register mappings.
The fig-Forth model run time includes the following registers:
- W: pointer to the currently running definition
- IP: pointer to the next i-code for the VM to execute
- RP: pointer to the top of the stack of nested calls
- SP: pointer to the top of the stacked parameters
- UP: pointer to the base of the user process state table
- PC: pointer to next machine code to execute (on non-Forth CPUs)
- scratch registers for intermediate values and operators.
Common C run-times statically map the following registers:
- IP or PC: pointer to next CPU instruction to perform
- SP: pointer to the top of the stack of nested calls
- Link: cached most recent return address, not saved in leaf routines (Not present in all run-times.)
- Heap pointer: Often implicitly the bottom of global address space
- Here pointer: Pointer to the state of the currently active object in object-oriented languages
- Scratch registers for intermediate values and operators
Now we can make some comparisons.
IP points to virtual instructions when a VM is running, and is separate from the processor's PC, but the two are the same essential concept.
SP and RP, as mentioned, are tangled concepts for saving the state of a routine or function before it calls another.
W and Link are not the same, even if neither is saved in leaf routines. W is the currently executing definition, Link is the caller.
UP and the heap address/pointer are essentially corollary, even if what they contain is organized differently.
However, ...
Here are some possible register mappings for (fig) Forth vs. certain common C run-times:
On the 68000:
- A3: W vs. Link (maybe), here pointer, or scratch
- A4: IP vs. scratch
- A7: RP vs. interleaved stack pointer
- A6: Forth SP vs. frame pointer (if present) (Note that MOVEM={PUSH|POP} for all An.)
- A5: UP vs. heap pointer (if present)
- A0~A2, D0~D7: scratch
On the 6809:
- DP[0] (first two bytes in the direct page): W vs. unspecified/scratch
- Y: IP (save before using for other things) vs. unspecified/scratch
- S: RP vs. interleaved stack pointer
- U: Forth SP vs. frame pointer or scratch
- DP: UP (putting W in the user state table) vs. optional optimization use or per-task variables
- X, D (A:B): scratch
On the 6800:
- $F0 in the direct (zero) page: W vs. frame pointer or scratch
- $F6 in the direct (zero) page: IP vs. scratch
- SP: RP vs. interleaved stack pointer
- $F4 in the direct (zero) page: SP vs. scratch
- $F2: UP vs. heap pointer or scratch
- X, A, B: scratch
What I want to point out is that C is really stripped down. This is because C was originally structured to be very lightweight, to use as few of the processor resources as possible, leaving the rest to the compiler or application to make optimal use of. (But current trends in C have been adding things like the here pointer to the run-time model.)
Also, C was designed to, if possible, use no global RAM. This is not possible on the 6800 because there are so few registers, and you have to use scratch RAM to do many things. On the 6801, it becomes possible, if a bit awkward, to move all scratch RAM to an interleaved stack.
I should note that my assignment of the CPU's stack pointer to the return stack pointer, and the consequent use of a synthetic stack pointer as the parameter stack pointer, is opposite of the fig Forth models. I'm not sure whether the engineers who built the 68000 or 6809 versions of the fig models were aware that the MOVEM instruction that does PUSH and POP on the 68000 operates equally on all address registers, or that the 6809's U register has its own PSHU and PULU instructions, with no overhead compared to PSHS and PULS.
With the 6800, it was probable that the PSH/PUL A/B instructions were desired for the parameter stack, but I have done scratch calculations showing that they provide only minimal advantage. (For me, five percent is minimal when it means something like data and return addresses on the same stack.)
You have to use some sort of global RAM variable for one or the other of the stack pointers on the 6800. If it's in the direct page, the choice balances out a bit.
I'll emphasize this -- I prefer to avoid mixing parameters with return addresses whenever possible.
Now let's look at the call protocols.
In C, various call protocols have been used. Points where the protocols differ:
- Saving registers --
- Calling routine saves the registers it needs preserved, vs.
- the called routine saves the registers it uses.
- Building a stack frame or not
- Caching the most recent caller in a register or not
Likewise, in Forth, various call protocols have been used.
A certain aspect of the call protocol is the primary point of distinction between the three threading types in Forth, but there is still a bit of variation in how the calls are implemented.
I'll initially analyze the 6800 fig model (Find the actual source here, and the assembled listing here.), then extrapolate from there. In this analysis, I'll use the fig Forth register assignments, with all VM registers in the direct (zero) page except for SP:
- W RMB 2 the instruction register points to 6800 code
- IP RMB 2 the instruction pointer points to pointer to 6800 code
- RP RMB 2 the return stack pointer
- UP RMB 2 the pointer to base of current user's 'USER' table ( altered during multi-tasking )
- N RMB 10 used as scratch by various routines
- * SP is S
(For an in-depth look at how this plays out using a synthetic stack for the parameters and the CPU stack for RP, see here.)
The inner interpreter of the virtual machine looks like this:
NEXT
LDX IP
INX pre-increment fetch from i-code list in definition
INX
STX IP Update it.
NEXT2
LDX 0,X Get the i-code which points to a pointer to CPU code
NEXT3
STX W This is the same as the label in the symbol table.
LDX 0,X Get pointer to executable code.
NEXTGO
JMP 0,X Jump to executable code.
The definition header consists of a symbol table entry like this:
HEADER
LABEL CODE_POINTER
DEFINITION
The HEADER consists of the name that the Forth command line interpreter will recognize, massaged a bit, and a link to the previous definition. (I'll explain some of that below.)
The CODE_POINTER is a pointer to machine language code the CPU can execute directly.
For a low-level definition, the definition consists of the machine code that the CPU can execute, and this is where the CODE_POINTER points. The code ends with a jump back to NEXT (or some other appropriate place that leads back to NEXT).
For a "high-level" (non-leaf) definition, the CODE_POINTER points to a nesting routine, and the definition consists of the list of definition addresses -- as i-codes for the virtual machine inner interpreter as described in part one of this rant. This list will be terminated by an i-code that unnests the VM and returns to the caller.
For an easy example of a low-level (leaf) definition, we can look at the code to add a double integer, starting at line 1008 in the assembled listing.
Note that the header starts with the length of the symbol name for Forth masked with some mode bits, followed by all but the last of the characters of the name, then the last character with its high bit set. ("$" says hexadecimal; ASCII for "+" is hexadecimal 2B, which becomes AB with the high bit set.) The mode/length byte has its high bit set, as well, to brace the name. The link to the previous definition in the table, PLUS, is adjusted to point to that definition's name: two bytes of link, one byte of length, and one byte of name makes 4.
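As a cross-check on that byte-by-byte reading, the name-field encoding is easy to reproduce in C. This is a minimal sketch; the function and buffer names are mine, not fig Forth's:

```c
#include <stddef.h>
#include <string.h>

/* Encode a fig-style name field: a length byte with its high bit set
   as a mode bit, then every character of the name, with the high bit
   also set on the last character to brace the name.
   Returns the number of bytes written. */
static size_t encode_name(const char *name, unsigned char *out)
{
    size_t len = strlen(name);
    out[0] = (unsigned char)(0x80 | len);  /* mode/length byte */
    memcpy(out + 1, name, len);            /* the characters */
    out[len] |= 0x80;                      /* brace the last character */
    return len + 1;
}
```

Encoding "D+" yields exactly the $82, $44, $AB bytes shown in the listing.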
"*+" means "address of here plus 2", which address is where the machine code starts:
13FD 82           FCB  $82
13FE 44           FCC  1,D+
13FF AB           FCB  $AB
1400 13 ED        FDB  PLUS-4
1402 14 04 DPLUS  FDB  *+2
1404 30           TSX
1405 0C           CLC
1406 C6 04        LDA B #4
1408 A6 03 DPLUS2 LDA A 3,X
140A A9 07        ADC A 7,X
140C A7 07        STA A 7,X
140E 09           DEX
140F 5A           DEC B
1410 26 F6        BNE DPLUS2
1412 31           INS
1413 31           INS
1414 31           INS
1415 31           INS
1416 7E 10 34     JMP NEXT
For an easy example of a definition that consists of i-codes, we can look at a convenience definition that adds 2 to the top integer on the stack, starting at line 1535. The listing is a little awkward -- it's hard to see that TWOP labels the address 171D, and that the six bytes following are the label values of DOCOL (1525), TWO (15B2), and PLUS (13F1). (Remember, the 6800 is most significant byte first, so you don't have to reverse the bytes in your mind.)
So the CODE_POINTER for "2+" is DOCOL, and the i-code list ends with SEMIS.
1718 82                     FCB  $82
1719 32                     FCC  1,2+
171A AB                     FCB  $AB
171B 17 0B                  FDB  ONEP-5
171D 15 25 15 B2 13 F1 TWOP FDB  DOCOL,TWO,PLUS
1723 13 67                  FDB  SEMIS
Let's look at DOCOL, starting at line 1206. It's part of the mixed definition COLON, and has no header of its own:
1525 DE F4 DOCOL LDX RP      make room in the stack
1527 09          DEX
1528 09          DEX
1529 DF F4       STX RP
152B 96 F2       LDA A IP
152D D6 F3       LDA B IP+1
152F A7 02       STA A 2,X   Store address of the high level word
1531 E7 03       STA B 3,X   that we are starting to execute
1533 DE F0       LDX W       Get first sub-word of that definition
1535 7E 10 36    JMP NEXT+2  and execute it
We can see how it decrements RP, saves the current IP there, loads the address that NEXT just stored in W, then jumps into the right place in the NEXT inner interpreter to set the new IP and fetch the appropriate new W.
To brace our understanding of this, let's look at SEMIS, starting at line 896:
1362 82          FCB  $82
1363 3B          FCC  1,;S
1364 D3          FCB  $D3
1365 13 52       FDB  RPSTOR-6
1367 13 69 SEMIS FDB  *+2
1369 DE F4       LDX RP
136B 08          INX
136C 08          INX
136D DF F4       STX RP
136F EE 00       LDX 0,X     get address we have just finished.
1371 7E 10 36    JMP NEXT+2  increment the return address & do next word
We can see that it is a low-level leaf. It increments RP, gets the saved IP, and jumps into the right place in NEXT to store it back in IP and continue where the caller left off.
This threading of functionality through low-level and high-level definitions is what the Forth community calls "threaded".
And the above model is an example of indirect threading, where the i-codes are pointers to pointers to code.
In case you're wondering, no, this is not the only way to do indirect threading. It's just one example.
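For readers who think better in C than in 6800 assembly, here is a rough C analogue of the scheme above. All the names (word, next_loop, do_two, and so on) are mine, and a while loop stands in for the chain of jumps; the point is only the double indirection: an i-code points to a code field, and the code field points to the code.

```c
#include <stddef.h>

/* An i-code is a pointer to a word; the word's code field points to the
   routine the CPU actually runs -- a pointer to a pointer to code. */
typedef struct word {
    void (*code)(struct word *self);     /* CODE_POINTER */
    struct word **body;                  /* i-code list, for colon definitions */
} word;

static word **ip;                              /* IP: next i-code */
static long dstack[32], *sp = dstack + 32;     /* parameter stack (SP) */
static word **rstack[32], ***rp = rstack + 32; /* return stack (RP) */
static int running = 1;

/* NEXT, written as a loop instead of a jump */
static void next_loop(void)
{
    while (running) {
        word *w = *ip++;   /* fetch the i-code, advance IP */
        w->code(w);        /* jump through the code field */
    }
}

/* low-level (leaf) definitions */
static void do_two(word *self)  { (void)self; *--sp = 2; }
static void do_plus(word *self) { (void)self; sp[1] += sp[0]; sp++; }
static void do_bye(word *self)  { (void)self; running = 0; }

/* DOCOL nests into a colon definition; SEMIS unnests */
static void do_docol(word *self) { *--rp = ip; ip = self->body; }
static void do_semis(word *self) { (void)self; ip = *rp++; }

static word two   = { do_two,   NULL };
static word plus  = { do_plus,  NULL };
static word bye   = { do_bye,   NULL };
static word semis = { do_semis, NULL };

/* 2+ : a colon definition -- its code field is DOCOL, its body i-codes */
static word *twop_body[] = { &two, &plus, &semis };
static word twop = { do_docol, twop_body };

/* a tiny "program": run 2+, then stop */
static word *prog[] = { &twop, &bye };
```

Pushing 5, pointing ip at prog, and calling next_loop() leaves 7 on the parameter stack.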
Now, what about direct threading?
We'll stick with the same register model. And we'll start with the routine for double add, since it's sort-of familiar. The header structure will be the same, up to the label.
DPLUS TSX ; index the parameters
CLC ; so we can do this in a loop
LDA B #4 ; bytes to add, start with least significant
DPLUSL LDA A 3,X ; right-hand term
ADC A 7,X ; left-hand term
STA A 7,X ; overwrite left-hand term
DEX ;
DEC B ; done?
BNE DPLUSL ; back for more
INS ; Deallocate right-hand term.
INS
INS
INS
JMP NEXT
The only thing that has changed is that the CODE_POINTER is missing. So how do the i-code lists get their starts? Let's look at 2+:
TWOP JMP DOCOL
FDB TWO,PLUS,SEMIS
Now the i-code lists start with a little machine language code. On the 6801, DOCOL could be located in the direct page, and the jump would be only two bytes, FWIW. But you'd still be starting an i-code list with something that doesn't even look like an i-code. That adds a bump in designing debuggers, and in code analysis of the i-code lists.
Let's see how NEXT changes to support this:
NEXT LDX IP
INX ; We can still do pre-increment mode
INX
STX IP
NEXT2 LDX 0,X ; get W which points to definition to be done
NEXT3 STX W
JMP 0,X ; Just jump right to it.
This looks like it ought to be faster, by a small amount. What happens to DOCOL and SEMIS, then? Interestingly enough, they don't have to change, except for losing the CODE_POINTER.
So why use indirect threading?
Basically, it keeps the i-code list pure, which simplifies debugging tools and such. Also, indirect threading helps avoid some of the problems your developers' tools may make for you, when they try to help you. (Yet another topic.)
Again, this is not the only way to do direct-threading.
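Continuing the C analogue (names again mine, and only a sketch), direct threading drops the second indirection: an i-code points straight at executable code, and each colon definition carries its own small entry stub in place of the CODE_POINTER -- the C stand-in for the JMP DOCOL above.

```c
/* An i-code now points straight at executable code. */
typedef void (*icode)(void);

static icode *ip;                              /* IP */
static long dstack[32], *sp = dstack + 32;     /* parameter stack */
static icode *rstack[32], **rp = rstack + 32;  /* return stack */
static int running = 1;

/* NEXT: no second indirection -- just jump right to it */
static void next_loop(void)
{
    while (running)
        (*ip++)();
}

/* leaf definitions */
static void two(void)   { *--sp = 2; }
static void plus(void)  { sp[1] += sp[0]; sp++; }
static void bye(void)   { running = 0; }
static void semis(void) { ip = *rp++; }   /* unnest */

/* 2+ : the stub function below is the "little machine language code"
   in front of the i-code list -- it does what JMP DOCOL would do */
static icode twop_body[] = { two, plus, semis };
static void twop(void) { *--rp = ip; ip = twop_body; }

static icode prog[] = { twop, bye };
```

The trade-off the text describes is visible here: prog and twop_body are no longer uniform lists of word pointers a debugger could walk; twop is a bit of code standing where data used to be.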
How about subroutine threading?
In subroutine threading, the definitions are called by subroutine. This can be done either indirect-threaded or direct-threaded, but it is typically done direct-threaded, on the theory that it is faster. NEXT looks like this:
NEXT LDX IP
INX ; We can still do pre-increment mode
INX
STX IP
NEXT2 LDX 0,X ; get W which points to definition to be done
NEXT3 STX W
* LDX 0,X ; if doing indirect
JSR 0,X ; Use a native call.
To match this, low-level definitions end in an RTS instead of a JMP NEXT. The call and return actually take a bit more time than the JMP, having to save and restore the return address on the CPU stack.
DOCOL doesn't have to change, and SEMIS ends in an RTS. Certain definitions we haven't looked at change because of return addresses that are now on the parameter stack. (Using the CPU's call stack for the parameter stack causes the greater ripple effect here.)
Again, these are not the only ways of doing subroutine threading. In particular, since the call in the direct-threaded model actually saves W on the return stack, explicitly having a W register in the VM and storing X to it at NEXT3 can be done away with. Such an approach does require changes to DOCOL, however.
Some daredevils pervert the CPU's stack pointer into the IP and NEXT becomes a CPU return. I say daredevil, because, if the CPU gets any sort of interrupt, the interrupt walks all over the code. (Most CPUs save interrupt state on the call stack.) If you're that desperate to use an IP register with auto-increment mode, use a CPU that has a valid auto-increment mode for a non-stack register.
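Sketching subroutine threading in C is almost too easy, because native calls are all C knows how to do. In this toy version (names mine, a sketch of the technique rather than any particular Forth), a colon definition is simply a function that calls its sub-words in order, the CPU's call stack holds the return addresses, and the parameters stay on a separate data stack:

```c
/* split parameter stack, kept out of the CPU's call stack */
static long dstack[32], *sp = dstack + 32;

/* leaf definitions */
static void two(void)  { *--sp = 2; }
static void plus(void) { sp[1] += sp[0]; sp++; }

/* 2+ compiled as subroutine calls: JSR TWO / JSR PLUS / RTS */
static void twop(void) { two(); plus(); }
```

The C compiler's own call/return mechanism is doing the job NEXT and SEMIS did in the earlier models.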
Just for curiosity's sake, let's look at non-subroutine indirect threading on the 6809, going with my recommendation of using the U stack for parameters this time. The header will not change. Registers in the virtual machine will be assigned as above:
- W: X (Save before using for other things, if you need W.)
- IP: Y (save before using for other things and restore before NEXT)
- RP: S
- SP: U
- UP: DP
The examples, as with the 6800:
DPLUS FDB *+2
LDD 6,U ; left-hand
ADDD 2,U ; right-hand
STD 6,U
LDD 4,U ; left-hand
ADCB 1,U ; No ADCD, do it by bytes.
ADCA ,U
LEAU 4,U ; deallocate before store to save a cycle.
STD ,U
JMP NEXT
* JMP [,Y++] ; This could be NEXT on the 6809, if ignoring W
*
DOCOL, NEXT, and SEMIS are significantly optimized:
DOCOL PSHS Y ; save the caller's IP
LEAY 2,X ; new IP: the body follows the code field X points to
NEXT LDX ,Y++ ; using post-increment instead of pre-
JMP [,X]
*
SEMIS PULS Y
BRA NEXT
How about the 68000? It gives us a 32-bit add naturally, so we'll define a 64-bit double add. Using the registers as I suggest above,
DPLUS DC.L *+4
MOVEM.L (A6),D0-D3 ; D0:D1 = right-hand term, D2:D3 = left-hand term
ADD.L D1,D3 ; less significant 32 bits
ADDX.L D0,D2 ; more significant 32 bits
LEA 8(A6),A6 ; deallocate right-hand term
MOVEM.L D2-D3,(A6) ; store the sum over the left-hand term
JMP NEXT
DOCOL MOVE.L A4,-(A7) ; save the caller's IP
LEA 4(A3),A4 ; new IP: the body follows the code field A3 points to
NEXT MOVE.L (A4)+,A3 ; using post-increment instead of pre-
MOVE.L (A3),A2
JMP (A2)
*
SEMIS MOVE.L (A7)+,A4
BRA NEXT
The shift from indirect-threaded to direct, and the use of subroutine threading, would follow pretty much as on the 6800, modulo the advantage of having the VM registers in actual registers instead of in memory.
There is one more step I mentioned in passing in part one of this rant. If using native calls, a certain degree of speed optimization can be obtained by flattening an i-code list. Instead of starting the definition to be optimized this way with a jump to the inner interpreter, the definition can be compiled as a series of calls.
On the 6809, 2+ can be changed from
TWOP JMP DOCOL
FDB TWO,PLUS,SEMIS
to
TWOP JSR TWO
JSR PLUS
RTS
And from there, optimizing to
TWOP LDD #2
ADDD ,U
STD ,U
RTS
becomes almost trivial.
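The two stages of 2+ can be mirrored in C (names mine): once the word is a plain sequence of calls, collapsing it to the equivalent of the LDD/ADDD/STD version is a mechanical peephole step.

```c
/* split parameter stack, standing in for the 6809's U stack */
static long dstack[32], *sp = dstack + 32;

static void two(void)  { *--sp = 2; }
static void plus(void) { sp[1] += sp[0]; sp++; }

/* 2+ flattened to calls: JSR TWO / JSR PLUS / RTS */
static void twop_calls(void) { two(); plus(); }

/* 2+ after optimization: LDD #2 / ADDD ,U / STD ,U / RTS */
static void twop_opt(void) { sp[0] += 2; }
```

Both versions leave the same result on the stack; the second just skips the intermediate push and the two calls.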
(I also discuss optimizations on the 6800 a little in the synthetic stack examples here: https://defining-computers.blogspot.com/2020/10/6800-example-vm-with-synthetic.html.)
Caveat: It's way past my bedtime again, and I may have glaring mistakes in the above. If so, leave me comments, please.
Why is this important?
Using the fig Forth VM, it is a bit difficult to share the Forth words as library code for compiled languages such as C. Using the native CPU calls instead, it becomes possible to share, if the compiler for the other language uses a split stack.
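A sketch of what that sharing could look like in C, with entirely hypothetical names: the word keeps its parameters on the split data stack, and a thin wrapper lets ordinary C code stage arguments there and call it, so the word touches C's interleaved stack only for its return address.

```c
/* the shared split parameter stack */
static long dstack[32], *sp = dstack + 32;

/* a Forth primitive compiled for native calls: it expects its operands
   on the data stack and ends in a plain return (RTS) */
static void forth_plus(void) { sp[1] += sp[0]; sp++; }

/* a C-callable wrapper: push the arguments, call the word, pop the result */
static long call_plus(long a, long b)
{
    *--sp = a;
    *--sp = b;
    forth_plus();
    return *sp++;
}
```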