Misunderstanding Computers

Why do we insist on seeing the computer as a magic box for controlling other people?
人はどうしてコンピュータを、人を制する魔法の箱として考えたいのですか?
Why do we want so much to control others when we won't control ourselves?
どうしてそれほど、自分を制しないのに、人をコントロールしたいのですか?

Computer memory is just fancy paper, CPUs are just fancy pens with fancy erasers, and the network is just a fancy backyard fence.
コンピュータの記憶というものはただ改良した紙ですし、CPU 何て特長ある筆に特殊の消しゴムがついたものにすぎないし、ネットワークそのものは裏庭の塀が少し拡大されたものぐらいです。

(original post/元の投稿 -- defining computers site/コンピュータを定義しようのサイト)

Monday, October 26, 2020

6800 Example VM with Synthetic Parameter Stack

In Forth Threading (Is Not Process Threads), I commented on the fig Forth models insisting on using the CPU stack register for parameters. The code I presented for the 6800 follows that design choice, but the code I presented there for the  6809 and 68000 does not. Here I'll show, for your consideration, the snippets for indirect threading on the 6800 done right (according to me) -- with the CPU stack implementing RP and the synthesized stack implementing Forth's SP. 

You'll note that the code here will end up maybe 5% slower and 5% bigger overall than the code that the fig model for the 6800 shows. But it will not mix return addresses and parameters on the same stack, and it should end up a bit easier to read and analyze, meaning that it should be more amenable to optimization techniques, when generating machine code instead of i-code lists.

In the code below, the virtual registers are allocated as follows:

  • W    RMB    2    the instruction register points to 6800 code
  • IP    RMB    2    the instruction pointer points to pointer to 6800 code
  • * RP is S
  • FSP    RMB    2    Forth SP, the parameter stack pointer
  • UP    RMB    2    the pointer to base of current user's 'USER' table  ( altered during multi-tasking ) 
  • N    RMB    10    used as scratch by various routines

The symbol table entry remains the same:

    HEADER
LABEL CODE_POINTER
    DEFINITION
So we won't discuss it here. (See the discussion in the post on threading techniques.)

I won't repeat the fig model here. I'll be bringing more of it into the discussion than I did in the post on threading techniques, refer to the model in my repository, here: https://sourceforge.net/p/asm68c/code/ci/master/tree/fig-forth/fig-forth.68c.

This is an exercise in refraining from early optimization, so the parameter stack pointer will point to the last item pushed, even though it's tempting to cater to the lack of negative index offset in the 6800 and point to the first free cell beyond top of stack.

The inner interpreter of the virtual machine looks the same, but the code surrounding it changes. 

PULABX is only used by "!" (STORE), so let's start the rework there:

Forth Threading (Is Not Process Threads)

Code-threading in the context of Forth deserves its own rant to help untangle the knot I picked up in Computer Languages -- Interpreted vs. Compiled Forth?. This is not a complete treatment, only background and an overview. YMMV.

Forth Threading (Is Not Process Threads)

If you've worked much with low-level code, when I mentioned (in comparing compiled and interpreted Forth) that Forth compiles definitions to a list of the addresses of definitions, you will have recognized that there is a recursion there that has to be broken at leaf calls if a Forth VM ever wants to get any useful work done. 

At some point, real code has to be executed.

The terms "direct-threaded", "indirect-threaded", and "subroutine threaded" are generally invoked to describe the means of breaking the recursion, but they tend to be used differently by different engineers.

Before I get into things, consider a casual assertion I slipped past you in the previous post:

The run-time architecture of a compiled language is a corollary of a virtual machine.

I'm not sure whether this is obvious or not. When I dug into my first fig model, the one for the 6800, it seemed obvious to me. But many people talking about Forth seemed to not even want to talk about the inner-interpreter as a VM. When I started looking at the Pascal virtual machine, I saw all sorts of parallels, and when I started reading the output of the C compilers at school, I saw most of the same essential parallels. (And none of my professors seemed to want to compare run-time architectures with VMs, either.)

  • Stack pointer(s): Both VMs and non-VM run-times usually have one or more stack pointers.
  • Call protocol: Both VMs and non-VM run-times have call protocols.
  • Global variables/parameters: Both VMs and non-VM run-times have them.
  • Local variables/parameters: Likewise.
  • Instruction pointer(s): Again, in both.
  • Scratch registers: In both. 
  • Current procedure/function pointer: Necessary in either if object-like self-inspection is part of the language.
  • Stack frames: of questionable utility in either, more often not present when parameters are separated from return pointers.

Stack frames in C had me scratching my head, at first. I could almost understand why Pascal defined them, because of the existence of local procedures and functions. But C had no such concepts, and still wasted the time to build and tear-down stack frames. The only conclusion I could come to was that the compiler authors wanted the convenience of not having to remember how deeply things were stacked in the current expression evaluation, plus the offset for the local parameters. And it seemed obvious to me that the difficulty in tracking was primarily due to interleaving the stacks.

Should I have considered the possibility that certain well-known system architects simply preferred to have easily crashed stacks?

Anyway, a stack frame is not a required element of a run-time architecture.

Setting aside the call protocols for a bit, let's talk about register mappings.

The fig-Forth model run time includes the following registers:
  • W: pointer to the currently running definition
  • IP: pointer to the next i-code for the VM to execute
  • RP: pointer to the top of the stack of nested calls
  • SP: pointer to the top of the stacked parameters
  • UP: pointer to the base of the user process state table
  • PC: pointer to next machine code to execute (on non-Forth CPUs)
  • scratch registers for intermediate values and operators.

Common C run-times statically map the following registers:

  • IP or PC: pointer to next CPU instruction to perform
  • SP: pointer to the top of the stack of nested calls
  • Link: cached most recent return address, not saved in leaf routines (Not present in all run-times.)
  • Heap pointer: Often implicitly the bottom of global address space
  • Here pointer: Pointer to the state of the currently active object in object-oriented languages
  • Scratch registers for intermediate values and operators

Now we can make some comparisons.

IP is to virtual instructions when a VM is running, and is separate from the processor's PC. But they are the same essential concept.

SP and RP, as mentioned, are tangled concepts for saving the state of a routine or function before it calls another.

W and Link are not the same, even if neither is saved in leaf routines. W is the currently executing definition, Link is the caller.

UP and the heap address/pointer are essentially corollary, even if what they contain is organized differently.

However, ...

Here are some possible register mappings for (fig) Forth vs. certain common C run-times:

On the 68000:

  • A3: W vs. Link (maybe), here pointer, or scratch
  • A4: IP vs. scratch
  • A7: RP vs. interleaved stack pointer
  • A6: Forth SP vs. frame pointer (if present) (Note that MOVEM={PUSH|POP} for all An.)
  • A5:UP vs. heap pointer (if present)
  • A0~A2, D0~D7: scratch

On the 6809:

  • DP[0] (first two bytes in the direct page): W vs. unspecified/scratch
  • Y: IP (save before using for other things) vs. unspecified/scratch
  • S: RP vs. interleaved stack pointer
  • U: Forth SP vs. frame pointer or scratch
  • DP: UP (putting W in the user state table) vs. optional optimization use or per-task variables
  • X, D (A:B) vs. scratch

On the 6800:

  • $F0 in the direct (zero) page: W vs. frame pointer or scratch
  • $F6 in the direct (zero) page: IP vs. scratch
  • SP: RP vs. interleaved stack pointer
  • $F4 in the direct (zero) page: SP vs. scratch
  • $F2: UP vs. heap pointer or scratch
  •  X, A, B: scratch

What I want to point out is that C is really stripped down. This is because C was originally structured to be very lightweight, to use as few of the processor resources as possible, leaving the rest to the compiler or application to make optimal use of. (But current trends in C have been adding things like the here pointer to the run-time model.)

Also, C was designed to, if possible, use no global RAM. This is not possible on the 6800 because there are so few registers, and you have to use scratch RAM to do many things. On the 6801, it becomes possible, if a bit awkward, to move all scratch RAM to an interleaved stack.

I should note that my assignment of the CPU's stack pointer to the return stack pointer and the consequent use of a synthetic stack pointer as the parameter stack pointer is opposite of the fig Forth models. I'm not sure if the engineers who built the 68000 or 6809 versions of the fig models were aware that the MOVEM instruction that does the PUSH and POP on on the 68000 operates equally on all address registers, or that the 6809's U register has it's own PSHU and PULU instructions, with no overhead compared to PSHS and PULS.

With the 6800, it was probable that the the PSH/PUL A/B instructions were desired for the parameter stack, but I have done scratch calculations showing that they provide only minimal advantage. (For me, five percent is minimal when it means something like data and return addresses on the same stack.)

You have to use some sort of global RAM variable for one or the other of the stack pointers on the 6800. If it's in the direct page, the choice balances out a bit.

I'll emphasize this -- I prefer to avoid mixing parameters with return addresses whenever possible.

Now let's look at the call protocols. 

In C, various call protocols have been used. Points where the protocols differ:

  • Saving registers --
    • Calling routine saves the registers it needs preserved, vs.
    • the called routine saves the registers it uses.
  • Building a stack frame or not
  • Caching the most recent caller in a register or not

Likewise, in Forth, various call protocols have been used.

A certain aspect of the call protocol is the primary point of distinction between the three threading types in Forth, but there is still a bit of variation in how the calls are implemented.

I'll initially analyze the 6800 fig model (Find the actual source here, and the assembled listing here.), then extrapolate from there. In this analysis, I'll use the fig Forth register assignments, with all VM registers in the direct (zero) page except for SP:

  • W    RMB    2    the instruction register points to 6800 code
  • IP    RMB    2    the instruction pointer points to pointer to 6800 code
  • RP    RMB    2    the return stack pointer
  • UP    RMB    2    the pointer to base of current user's 'USER' table  ( altered during multi-tasking ) 
  • N    RMB    10    used as scratch by various routines
  • * SP is S

(For an in-depth look at how this plays out using a synthetic stack for the parameters and the CPU stack for RP, see here.)

The inner interpreter of the virtual machine looks like this:

NEXT
    LDX    IP
    INX        pre-increment fetch from i-code list in definition
    INX
    STX    IP   Update it.
NEXT2
    LDX    0,X    Get the i-code which points to a pointer to CPU code
NEXT3
    STX    W      This is the same as the label in the symbol table.
    LDX    0,X    Get pointer to executable code.
NEXTGO
    JMP    0,X    Jump to executable code.

The definition header consists of a symbol table entry like this:

    HEADER
LABEL CODE_POINTER
    DEFINITION
The header consists of the name that the Forth command line interpreter will recognize, massaged a bit, and a link to the previous definition. (I'll explain some of that below.)

The CODE_POINTER is a pointer to machine language code the CPU can execute directly.

For a low-level definition, the definition consists of the machine code that the CPU can execute, and this is where the CODE_POINTER points. The code ends with a jump back to NEXT (or some other appropriate place that leads back to NEXT).

For a "high-level" (non-leaf) definition, the CODE_POINTER points to a nesting routine, and the definition consists of the list of definition addresses -- as i-codes for the virtual machine inner interpreter as described in part one of this rant. This list will be terminated by an i-code that unnests the VM and returns to the caller.

For an easy example of a low-level (leaf) definition, we can look at the code to add a double integer, starting at line 1008 in the assembled listing. 

Note that the header starts with the length of the symbol name for Forth masked with some mode bits, followed by all but the last of the characters of the name, then the last character with its high bit set. ("$" says hexadecimal, ASCII for "+" is hexadecimal 4B.) The mode/length byte has its high bit set, as well, to brace the name. The link to the previous definition in the table, PLUS, is adjusted to point to that definition's name: two bytes of link, one byte of length, and one byte of name makes 4. 

"*+" means "address of here plus 2", which address is where the machine code starts:


  13FD 82          	FCB	$82
  13FE 44          	FCC	1,D+
  13FF AB          	FCB	$AB
  1400 13 ED       	FDB	PLUS-4
  1402 14 04       DPLUS	FDB	*+2
  1404 30          	TSX
  1405 0C          	CLC
  1406 C6 04       	LDA B	#4
  1408 A6 03       DPLUS2	LDA A	3,X
  140A A9 07       	ADC A	7,X
  140C A7 07       	STA A	7,X
  140E 09          	DEX
  140F 5A          	DEC B
  1410 26 F6       	BNE	DPLUS2
  1412 31          	INS
  1413 31          	INS
  1414 31          	INS
  1415 31          	INS
  1416 7E 10 34    	JMP	NEXT

For an easy example of a definition that consists of i-codes, we can look at a convenience definition that adds 2 to the top integer on stack, starting at line 1535. The listing is a little awkward -- it's hard to see that TWOP labels the address 171D, and that the six bytes following are the label values of DOCOL (1525), TWO (15B5), and PLUS (13F1). (Remember, the 6800 is most significant byte first, so you don't have to reverse the bytes in your mind.)

So the CODE_POINTER for "2+" is DOCOL, and the i-code list ends with SEMIS.


  1718 82          	FCB	$82
  1719 32          	FCC	1,2+
  171A AB          	FCB	$AB
  171B 17 0B       	FDB	ONEP-5
  171D 15 25 15 B2 13 F1 
                        TWOP	FDB	DOCOL,TWO,PLUS
  1723 13 67       	FDB	SEMIS

Let's look at DOCOL, starting at line 1206. It's part of the mixed definition COLON, and has no header of its own:


  1525 DE F4       DOCOL	LDX	RP	make room in the stack
  1527 09          	DEX
  1528 09          	DEX
  1529 DF F4       	STX	RP
  152B 96 F2       	LDA A	IP
  152D D6 F3       	LDA B	IP+1	
  152F A7 02       	STA A	2,X	Store address of the high level word
  1531 E7 03       	STA B	3,X	that we are starting to execute
  1533 DE F0       	LDX	W	Get first sub-word of that definition
  1535 7E 10 36    	JMP	NEXT+2	and execute it

We can see how it decrements RP to save the current IP, saves it, loads the address that NEXT just stored in W, then jumps into the right place in the NEXT inner interpreter to save it as the new IP and store the appropriate new W.

To brace our understanding of this, let's look at SEMIS, starting at line 896:


   1362 82          	FCB	$82
   1363 3B          	FCC	1,;S
   1364 D3          	FCB	$D3
   1365 13 52       	FDB	RPSTOR-6
   1367 13 69       SEMIS	FDB	*+2
   1369 DE F4       	LDX	RP
   136B 08          	INX
   136C 08          	INX
   136D DF F4       	STX	RP
   136F EE 00       	LDX	0,X	get address we have just finished.
   1371 7E 10 36    	JMP	NEXT+2	increment the return address & do next word

We can see that it is a low-level leaf. It increments RP, gets the saved IP, and jumps into the right place in NEXT to store it back in IP and continue where the caller left off.

This threading of functionality through low-level and high-level definitions is what the Forth community calls "threaded".

And the above model is an example of indirect threading, where the i-codes are pointers to pointers to code.

In case you're wondering, no, this is not the only way to do indirect threading. It's just one example.

Now, what about direct threading?

We'll stick with the same register model. And we'll start with the routine for double add, since it's sort-of familiar. The header structure will be the same, up to the label.

DPLUS TSX ; index the parameters
CLC ; so we can do this in a loop
LDA B #4 ; bytes to add, start with least significant
DPLUSL LDA A 3,X ; right-hand term
ADC A 7,X ; left-hand term
STA A 7,X ; overwrite left-hand term
DEX ;
DEC B ; done?
BNE DPLUSL ; back for more
INS ; Deallocate right-hand term.
INS
INS
INS
JMP NEXT 

the only thing that has changed is that the CODE_POINTER is missing in SEMIS. So how do the i-code lists get their starts? Let's look at 2+:

TWOP JMP DOCOL
FDB TWO,PLUS,SEMIS

Now the i-code lists start with a little machine language code. On the 6801, DOCOL could be located in the direct page, and the jump would only be two bytes, FWTW. But you'd still be starting an i-code list with something that doesn't even look like an i-code. That adds a bump in designing debuggers, and in code analysis of the i-code lists.

Let's see how NEXT changes to support this:

NEXT LDX IP
INX ; We can still do pre-increment mode
INX
STX IP
NEXT2 LDX 0,X ; get W which points to definition to be done
NEXT3 STX W
JMP 0,X ; Just jump right to it.

This looks like it ought to be faster, by a small amount. What happens to DOCOL and SEMIS, then? Interestingly enough, they don't have to change, except for losing the CODE_POINTER. 

So why use indirect threading?

Basically, it keeps the i-code list pure, which simplifies debugging tools and such. Also, indirect threading helps avoid some of the problems your developers' tools may make for you, when they try to help you. (Yet another topic.)

Again, this is not the only way to do direct-threading.

How about subroutine threading?

In subroutine threading, the definitions are called by subroutine. This can be done both indirect-threaded and direct-threaded, but, typically, it is done direct-threaded, in the idea that it is faster. NEXT looks like this:

NEXT LDX IP
INX ; We can still do pre-increment mode
INX
STX IP
NEXT2 LDX 0,X ; get W which points to definition to be done
NEXT3 STX W
* LDX 0,X ; if doing indirect
JSR 0,X ; Use a native call.

To match this, low-level definitions end in a RTS instead of a JMP NEXT. The call and return actually take a bit more time than JMP, having to save and restore data on the CPU stack. 

DOCOL doesn't have to change, and SEMIS ends in a RTS. Certain definitions we haven't looked at change because of return addresses that are now on the parameter stack. (Using the CPU's call stack for the parameter stack causes the greater ripple effect here.)

Again, these are not the only ways of doing subroutine threading. In particular, since the call in the direct-threaded model actually saves W on the return stack, explicitly having a W register in the VM and storing X to it at NEXT3 can be done away with. Such an approach does requires changes to DOCOL, however.

Some daredevils pervert the CPU's stack pointer into the IP and NEXT becomes a CPU return. I say daredevil, because, if the CPU gets any sort of interrupt, the interrupt walks all over the code. (Most CPUs save interrupt state on the call stack.) If you're that desperate to use an IP register with auto-increment mode, use a CPU that has a valid auto-increment mode for a non-stack register.

Just for curiosity's sake, let's look at non-subroutine indirect threading on the 6809, going with my recommendation of using the U stack for parameters this time. The header will not change. Registers in the virtual machine will be assigned as above:

  • W: X (Save before using for other things, if you need W.)
  • IP: Y (save before using for other things and restore before NEXT)
  • RP: S
  • SP: U
  • UP: DP

The examples, as with the 6800:

DPLUS   FDB *-2
    LDD   6,U  ; left-hand
    ADDD   2,U  ; right-hand
    STD   6,U
    LDD   4,U  ; left-hand
    ADCB   1,U ; No ADCD, do it by bytes.
    ADCA   ,U
    LEAU   4,U ; deallocate before store to save a cycle.
    STD   ,U
    JMP   NEXT
*    JMP   [,Y++] ; This could be NEXT on the 6809, if ignoring W
*

 DOCOL, NEXT, and SEMIS are significantly optimized:

DOCOL    LDX   ,Y++   ; using post-increment instead of pre-
    PSHS   Y    ; save pointer to next
    LDY   ,Y    ; Get new IP
NEXT   LDX   ,Y++  ; using post-increment instead of pre-
    JMP   [,X]
*
SEMIS    PULS   Y
    BRA NEXT

How about the 68000? It gives us a 32-bit add naturally, so we'll define a 64-bit double add. Using the registers as I suggest above,

DPLUS   DC.L   *+4
    MOVEM   (A6),D0/D1/D2/D3
    ADD.L   D1,D3   ; less significant 32 bits
    ADDX.L   D0,D2   ; more significant 32 bits
    LEA   8(A6),A6   ; deallocate right-hand term
    JMP   NEXT
And

DOCOL    MOVE.L   (A4)+,A3   ; using post-increment instead of pre-
    MOVE.L    A4,-(A7)   ; save pointer to next
    MOVE.L    A3,A4    ; Get new IP
NEXT   (A4)+,A3  ; using post-increment instead of pre-
    MOVE.L    (A3),A2
    JMP   (A2)
*
SEMIS    MOVE.L (A7)+,A4
    BRA NEXT
The shift from indirect-threaded to direct, and the use of subroutine-threading would follow pretty much as the shift in the 6800, modulus the advantage of having the registers in actual registers instead of in memory.

There is one more step I mentioned in passing in part one of this rant. If using native calls, a certain degree of speed optimization can be obtained by flattening an i-code list. Instead of starting the definition to be optimized this way with a jump to the inner interpreter, the definition can be compiled as a series of calls.

On the 6809, 2+ can be changed from

TWOP    JMP DOCOL
        FDB TWO,PLUS,SEMIS 

to 

TWOP    JSR     TWO
    JSR     PLUS
    RTS
And from there, optimizing to

TWOP    LDD    #2
    ADDD    ,U
    STD    ,U
    RTS

becomes almost trivial.

(I also discuss optimizations on the 6800 a little in the synthetic stack examples here: https://defining-computers.blogspot.com/2020/10/6800-example-vm-with-synthetic.html.)

Caveat: It's way past my bedtime again, and I may have glaring mistakes in the above. If so, leave me comments, please.

Why is this important?

Using the fig Forth VM, it is a bit difficult to share the Forth words as library code for compiled languages such as C. Using the native CPU calls instead, it becomes possible to share, if the compiler for the other language uses a split stack.

Thursday, October 22, 2020

Computer Languages -- Interpreted vs. Compiled Forth?

[JMR202010251132:

Comments on a post by Peter Forth in the Forth2020 Users-Group on FB set me off on a long response that turned into this webrant on the distinctions between

Interpreter vs. compiler? -- Interpreted vs. Compiled? 

Forth?

][JMR202010251132 -- edited for clarification.]

This is a particularly tangled intersection of concepts and jargon.

Many early BASICs directly interpreted the source code text. If you typed

PRINT "HELLO "; 4*ATN(1)

and hit ENTER, the computer would parse (scan) along the line of text you had typed and find the space after the "PRINT", and go to the symbol table (the BASIC interpreter's dictionary) and look it up. There it would find some routines that would prepare to print something out on the screen (or other active output device).

Then it would keep parsing and find the quoted 

"HELLO " 

followed by the semicolon, and it would prepare the string "HELLO " for printing, probably by putting the string into a print buffer.

Continuing to scan, it would find the numeric text "4", followed by the asterisk. It would recognize it as a number and convert the numeric text to the number 4, then, since asterisk is the multiplication symbol in BASIC, put both the number and the multiplication operation somewhere to remember them. 

Then it would scan the symbol "ATN" followed by the opening parenthesis character. Looking this up in the symbol table, it would find a routine to compute the arctangent of a number, and remember that waiting function call, along with the fact that it was looking for the closing parenthesis character.

Then it would scan the text "1", followed by the closing parenthesis. Recognizing the text as a number, it would convert it to the number 1. Then it would act on the closing parenthesis and call the saved function, passing it the 1 to work on. The function would return the result,  0.7853982, and save it away after the 4 stored earlier.

Then it would see that the line had come to an end, and it would go back and see what was left to do. It would find the multiplication, and multiply 4 x 0.7853982, giving 3.1415927 (which needs a little explaining), and remember the result. Then it would see that it was in the middle of PRINTing, and convert the number to text, put it in the buffer following the string, and print the buffer to the screen, something like

HELLO  3.1415927

I know that seems like a lot of work to go to. Maybe it would not help you to know I'm only giving you the easy overview. There are lots more steps I'm not mentioning.

(So, you're wondering why 4 x 0.7853982 is 3.1415927, not 3.1415928? Cheap calculators are inexact after about four to eight digits of decimal fraction. BASIC languages tended to implement a cheap calculator -- especially early BASIC languages. But, just for the record, written with a few bits more accuracy, π/4 ≒ 0.78539816; π ≒ 3.14159266. The computer often keeps more accuracy than it shows.)

Interpreted vs. compiled actually indicates a spectrum, not a binary classification. What I have described above is one extreme end of the spectrum, referred to in such terms as "pure source text interpreter".

Other approaches to the BASIC language were taken. Some were pure compiled languages. Instead of acting immediately on each symbol as it parses, compilers save the corresponding code in  the CPU's native machine language away in a file, and the user has to call the compiled program back up to run it -- after the compiler has finished checking that the code follows the grammar rules and has finished saving all the generated machine code. (Again, there are steps I am not mentioning at this point.)

In the case of the above line of BASIC, the resultant machine code might look something like the following, if the compiler compiles to object code for the 6809 CPU: 

48454C4C4F20
00
40800000
3F800000
308CEE
3610
17CFE9
308CED
3610
308CEC
3610
17D828
17D801
17CFEE
17CFF7

That isn't very easily read by humans. Here is what the above looks like in 6809 assembly language:

1000                  00007         S0001
1000 48454C4C4F20     00008             FCC 'HELLO '
1006 00               00009             FCB 0
1007                  00010         FP0001
1007 40800000         00011             FCB $40,$80,00,00 ; 4
100B                  00012         FP0002
100B 3F800000         00013             FCB $3F,$80,00,00 ; 1
100F                  00014         L0001
100F 308CEE           00015             LEAX S0001,PCR
1012 3610             00016             PSHU X
1014 17CFE9           00017             LBSR BUFPUTSTR
1017 308CED           00018             LEAX FP0001,PCR
101A 3610             00019             PSHU X
101C 308CEC           00020             LEAX FP0002,PCR
101F 3610             00021             PSHU X
1021 17D828           00022             LBSR ATAN
1024 17D801           00023             LBSR FPMUL
1027 17CFEE           00024             LBSR BUFPUTFP
102A 17CFF7           00025             LBSR PRTBUF

Think you could almost read that? You see the word "HELLO " in there, and you can guess that 

  1. FP0001 and FP0002 are the floating point encodings for 4 and 1, respectively. 
  2. LEAX calculates a pointer to the value given to it, and 
  3. LBSR calls subroutines which
    1. put the string in the print buffer, 
    2. calculate the arctangent, 
    3. multiply the arctangent (by the saved 4.0), 
    4. convert the result to text and put it in the print buffer, 
    5. and send the print buffer to the screen (or other current output device).

Since the CPU interprets the machine language directly, the compiled code is going to run very fast. 

With something this short, we don't really care about how fast it is, but with a really long program we might care a lot.

On the other hand, with something as short as this, we are less interested in how fast it runs than in the effort and time it takes to prepare and run the code. 

Compiled languages add a compile step between typing the code in and running it. If we want to change something, we have to edit the source code and run it through the compiler again. 

On modern computers with certain integrated developer environments, that's actually less trouble than it sounds, but without those IDEs, or on slower computers, the pure text interpreter is much easier to use for short programs.

On eight bit computers, those fancy IDEs took too much of the computer's resources, because the machine code of a CPU can be very complicated.

Hmm. The example of compiling to the 6809 is a bit counter to my purpose here, because the architecture and machine code of the 6809 were at once quite simple and well-matched to the programs we write. 

Let's see what that line of code would look like compiled on a "modern" CPU. I'm going to cheat a little and show you what it looks like written in C, and what it looks like compiled from C, because I don't want to take the time to install either FreeBasic or Gambas in my workstation and figure out how to get a look at the compiled output. Here's the C version, wrapped in the mandatory main() procedure:

/* Using the C compiler as a calculator.
*/

#include <stdlib.h>
#include <math.h>
#include <stdio.h>

int main( int argc, char *argv[] )
{
  printf( "HELLO %g\n", 4 * atan( 1 ) );
}

That's a lot of wrapper just to do calculations, don't you think?

(Not all compiled languages require the wrapping main function to be explicitly declared, but many do.)

Well, here's the assembly language output on my Ubuntu (Intel64 perversion of AMD64 evolution of 'x86 CPU) workstation: 

    .file    "printsample.c"
    .text
    .section    .rodata
.LC1:
    .string    "HELLO %g\n"
    .text
    .globl    main
    .type    main, @function
main:
.LFB5:
    .cfi_startproc
    pushq    %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $32, %rsp
    movl    %edi, -4(%rbp)
    movq    %rsi, -16(%rbp)
    movq    .LC0(%rip), %rax
    movq    %rax, -24(%rbp)
    movsd    -24(%rbp), %xmm0
    leaq    .LC1(%rip), %rdi
    movl    $1, %eax
    call    printf@PLT
    movl    $0, %eax
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE5:
    .size    main, .-main
    .section    .rodata
    .align 8
.LC0:
    .long    1413754136
    .long    1074340347
    .ident    "GCC: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0"
    .section    .note.GNU-stack,"",@progbits

Now, with a bit of effort, I can read that. I can see where it's setting up a stack frame on entry and discarding it on exit. I can see where the compiler has pre-calculated the constant, so that all that is left to do at run-time is load the result and print it. But, even though you can see the string "HELLO " and the final call to printf(), if you can decipher the stuff in between without explanation, you probably already know anything I can tell you about this stuff. 

The IDE running on your computer has to keep track of stuff like this very quickly, in order to make it easy for you to change something and get quick results. (Back in the 1980s, the Think C compiler for the 68000 (Apple Macintosh) was able to do stuff like this, in no small part because the 68000, like it's little brother the 6809, is pretty powerful even though it's relatively simple.)

So, back in the 1970s and 1980s, companies that weren't, for various reasons, willing to use either the 6809 or 68000 in their designs, wanted something a little simpler than the CPU to compile to, so that the IDE (and the humans) could keep track of what was happening.

So they devised "virtual machines" that interpreted an intermediate code between the complexity of the CPU's raw machine code and the readability of pure text. These virtual machines also made it easier to have compilers that were portable between various processors, but that's another topic.

The intermediate codes these VMs ran were known as "i-codes", but were often called "p-codes" (for Pascal i-codes) or "byte-codes" (because many of them, especially Java, were byte-oriented encodings). (And, BTW, "P-code" was also an expression used at the time to indicate pseudo-coding, which can be a bit confusing, but that's another topic.)

If you want an idea what an i-code looked like, one possible i-code using prefix 3-tuples and converted to human-readable form might look like

string 'HELLO '
putstr stdout s0001
float 4
float 1
library    arctan fp0002
multiply fp0001 res0001
putfloat stdout res0002
printbuffer stdout

A stacked mixed-tuple postfix i-code would also be possible:

'HELLO ' putstr
4.0
1.0
arctan
multiply
putfloat
stdout printbuffer

There were versions of BASIC that ran on VMs. The Basic compilers compiled the source code to the i-codes, and the VM interpreted the i-codes. 

Note that we have added a level of interpretation between the source text and the CPU.

You might think that the 6809 wouldn't need an i-code interpreted BASIC, but a byte-code could be designed to be more compact even than 6809 machine code. 

Basic09 was basically developed at the same time as the processor itself was designed. The 6809 was designed to run the  Basic09 VM very efficiently, and it did. But it was a compiled language, in the sense that there was no command-line interpreter built-in. (You could build or buy one, but it wasn't built-in.) It was source-code compiled, i-code interpreted. (Analyzing the functionality of Basic09 is also yet another topic, but it worked out well, partly because of the host OS-9 operating system.)

Even now, there are a number of languages that (mostly) run in a VM, for various reasons. The modular VM of Java, with the division between the CPU and the language, can make the language safer to use, which is yet another topic. (Except that now Java has had so much added to it, that .... Which is another topic, indeed.)

Which brings me to Forth.

Early Forth languages tended to be i-code interpreted VMs, for simplicity and portability. The i-code was very efficient. Not byte-code, at least not originally, but i-code. 

(Specifically, Forth built-in operations were essentially functions, and, when defining a new function, the function definition would be compiled as a list of addresses of the functions to call. A newly-defined function would also be callable by its address in the same way as built-in functions. How that is done involves something the Forth community calls threaded code, which is, ahem, another topic, partially addressed here: https://defining-computers.blogspot.com/2020/10/forth-threading-is-not-process-threads.html.)

Forth interpreters include a command-line interpretation mode, which is made easier because of the i-code interpretation. 

(In a sense, the interpreter mode is much like a directly interpreted BASIC interpreter with a giant case statement, where new definitions implicitly become part of the giant case statement.)

The built-in i-codes include commands that can be invoked as post-fix calculator commands, directly from the command-line.

The above example, entered at the command line of gforth, which is a Forth that knows floating-point, would look like this:

4.0e0   1.0e0   fatan   f*   f.

(giving the result 3.14159265358979.)

Uh, yes, I know, in terms of readability, that only looks marginally improved over 6809 assembler source. Maybe.

Lemme 'splain s'more!

That's because Forth is, essentially, the i-code of it's VM. Not byte-code, but address-word-code VM.

Forth has a postfix grammar, something like the RPN of Hewlett Packard calculators. So the first two text blobs are 4.0 x 100 and 1.0 x 100 in a modified scientific notation, pushed on the floating point stack in that order. Fatan is the floating point arctangent function, which operates on the topmost item on the floating point stack, 1.0. It returns the result -- 0.785398163397448, or a quarter of π -- in the place of the 1.0e0. 

'F*' is not a swear word, it's the floating point multiply, and it multiplies the top two items on the floating point stack, resulting in (an approximation of) π. 'F.'  prints the top floating point number to the current output device, in this case, the screen.

The stack, in combination with the post-fix notation, allows most of the symbols the user types in from the command-line to have a one-to-one correspondence with the i-codes themselves.

The 6809 and 68000 have an interesting quality in common (almost, but not quite, shared by the 8086) -- Both the 6809 and the 68000 have an instruction set that includes op-codes that can be mapped one-to-one or one-to-two with a significant subset of the Forth primitives.

There are have been other processors designed specifically to be mapped one-to-one with all the important Forth leaf-level primitives. The Forth Interest Group lists a few of the current ones on one of their web pages.

Thus, Forth is a bit unusual, even among byte-code and i-code languages. It partakes of both ends of the spectrum, and of the middle, all at once, being both interpreted at the text level and compiled at the machine-code level, as well as operating at the intermediate level (although not at all levels at once in every implementation).

Now, before we think we have covered the bases, we should note that all modern compilers compile to an intermediate form, as a means of separating the handling of the source language from the handling of the target CPU.

Forth might actually define one of the possible intermediate forms a compiler might compile to, if the compiler compiles to a split stack run-time model instead of a linked-list-of-interleaved-frames-in-a-single-stack-model. 

(Note here that the run-time model of compiled languages is essentially a close corollary of a virtual machine.)

An interleaved single-stack model mixes local variables, parameters, and return addresses on the same stack. This means that a called function can walk all over both the link to the calling frame and the address in the calling code to return to, if the programmer is only a little bit careless. When that happens, the program usually loses complete track of what it's doing, and may begin doing many kinds of random and potentially damaging things.

This is where most of the stack vulnerabilities and stack attack vectors that you hear about in software come from.

Some engineers consider the single stack easier to manage and maintain than a split stack. You only have to ask the operating system to allocate one stack region, and you only have to worry about overflow and underflow in that region.

But interleaving the return addresses with the parameters and local variables comes at a trade-off cost of having to worry about frame bounds in every function, in addition to the cost of maintaining the frame.

With a (correctly) split stack, there is no need to maintain frames in most run-times because they maintain themselves, and it takes more than carelessness to overwrite the return address (and optional frame pointer for those languages which need one).

Given this, one might expect that most modern run-times would prefer the problems of maintaining a split stack over the problems of maintaining a single stack. 

But memory was tight in the 1980s (or so we thought), and the single stack was the model the industry grew up with. (Maybe you can tell, but I believe this was a serious error.)

(By the way, this is one of the key places where the 8086 misses being able to implement a near one-to-one mapping -- segmentation biases it towards the single-stack model. BP shares a segment register with SP. If BP had its own segment register, you could put the return address stack in it's own somewhat well-protected segment, but it doesn't, so you can't.)

Especially with CPUs that map well to the Forth model, Forth interpreters can be designed to compile to machine code, and run without an intermediary virtual machine. [JMR202010222211: I'm actually working -- a little at a time -- on a project that will do that, here: https://osdn.net/projects/splitstack-runtimelib/.]

One more topic that needs to be explicitly addressed: Most languages have only limited ability to extend their symbol tables, especially at run-time. Most text-only interpreters have severe limits to extending their symbol tables -- in some cases just ten possible user functions named fn0 to fn9, in some cases, no extensions at all.

I've implied this above, but, by design, Forth interpreters do not have such limits. Symbols defined by the programmer have pretty much equal standing with pre-defined symbols. This makes Forth even less like most interpreted languages.

Yes, Forth is often an interpreted language, but not like other interpreted languages.

And it's after midnight here, and I keep drifting off, so this rant ends here.

[JMR202010251351:

Discussing of threading added in a separate post: https://defining-computers.blogspot.com/2020/10/forth-threading-is-not-process-threads.html, also linked above.

]