defining computers: What Is (the Programming Language) Forth?

[EDIT 20240526, add-overview: ]

This needs a higher-level introduction, I think.

Forth is a simple and compact combination of minimal BIOS and library, rudimentary OS, and a very flexible command-line programming language shell that encourages modifying the Forth itself as an approach to developing applications.

It's compact and simple enough that, using hardware that was common back in the 1970s and '80s, a single moderately competent engineer could put together from scratch, in a few months, a running system that can host its own development.

Because it is so easy to put a system together, and is so flexible, Forth is a bit of a siren. Many incompatible versions have been developed, and many engineers have found themselves starting grand plans that seem doable with Forth, but end up foundering on some of the gotchas that the simplicity hides.

The key features, I believe, are

[EDIT 20240526, end add-overview; ]

(1) The colon definition grammar.
(2) The post-fix expression grammar.
(3) Two stacks.
(4) On-line dictionary (symbol table) at development time, and, unless explicitly removed, at run-time.
(5) The ability to blend run-time with compile-time at development time.

But this is way too loose. It leaves out all sorts of implementation details that non-trivial applications depend on.

[EDIT 20240526, expand-on-key-features: ]

I've left out of the above one key feature that many fans of Forth seem to think is essential, something called

(6) an inner interpreter, or a virtual machine.

I think I have good reason for leaving it out, although doing so requires adding a feature that most consider optional. But I need to expand on the above -- and add a few more key features -- before I pick that question up.

Deliberately taking up the above key features in a different order --

(3) Two stacks is an optimization for parameter passing.

In typical compiling convention, the call stack interleaves parameters and call-return information on a single stack. It's practically an assumed convention, so much so that many central processing units directly support the creation of the call record in a combined stack frame. It's also very inflexible and time-consuming, leading to significant complexities in interpreters, compilers, and the run-time models they work under and support.

One particular complexity for compilers is compiling procedure/function calls as in-line code instead of called code, requiring complex evaluation of the procedure or function, and adding significant bulk to the output code wherever the calls are in-lined.

The combined stack is also more fragile, making it much easier to accidentally or deliberately overwrite the call-return information, resulting in application behavior outside what the application engineers designed -- in other words, (additional) bugs and vulnerabilities.

The only advantage of the single-stack approach that I know of is eliminating a memory management segment. Memory management is perhaps the most complex common problem in general applications, and reducing the segment count from, say, four to three, is generally viewed as vital.

By splitting parameters from the call-return information, calls can be streamlined to the point that in-lining makes sense only for the very simplest functions. This significantly reduces the complexity of the compiler, as well.
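To make the split concrete, here is a toy model in Python (not Forth, and not the internals of any real Forth). The instruction names, the `run` function, and the little program are all made up for illustration. The thing to notice is that `call` and `ret` touch only the return stack, while parameters and results pass through the data stack with no frame to build or tear down.

```python
# Toy model of the split-stack convention. All names are illustrative.

def run(program, entry, data):
    """Run `program` from index `entry`; return (data stack, deepest return stack)."""
    data_stack = list(data)   # parameters and results only
    return_stack = []         # return addresses only
    deepest = 0
    pc = entry
    while True:
        op, arg = program[pc]
        pc += 1
        if op == "lit":                      # push a literal parameter
            data_stack.append(arg)
        elif op == "dup":
            data_stack.append(data_stack[-1])
        elif op == "add":
            b, a = data_stack.pop(), data_stack.pop()
            data_stack.append(a + b)
        elif op == "mul":
            b, a = data_stack.pop(), data_stack.pop()
            data_stack.append(a * b)
        elif op == "call":                   # only the return address is saved
            return_stack.append(pc)
            deepest = max(deepest, len(return_stack))
            pc = arg
        elif op == "ret":
            if not return_stack:             # returning from the entry point
                return data_stack, deepest
            pc = return_stack.pop()
        else:
            raise ValueError("unknown instruction: " + op)


PROGRAM = [
    ("dup", None), ("mul", None), ("ret", None),             # 0: square
    ("call", 0), ("lit", 1), ("add", None), ("ret", None),   # 3: square, then add 1
    ("call", 3), ("call", 0), ("ret", None),                 # 7: entry point
]
```

Here `run(PROGRAM, 7, [3])` squares 3, adds 1, and squares again, giving `([100], 2)`: the result is the only thing left on the data stack, and the return stack never held more than two return addresses.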

As a bonus, keeping the call-return in a separate segment makes it significantly harder to overwrite, eliminating one of the most common sources of bugs and vulnerabilities.
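A rough sketch of that difference, again in Python and again with made-up names. Both functions model an unchecked copy into a local buffer. The layout (buffer first, then whatever sits next to it) is an assumption chosen to keep the model small, not a description of any particular CPU's frame.

```python
def overrun_combined(return_address, buffer_size, data):
    """One stack: an unchecked copy into a local buffer can reach the return address."""
    stack = [0] * buffer_size + [return_address]      # buffer, then return address
    for i, value in enumerate(data[:len(stack)]):     # no check against buffer_size
        stack[i] = value
    return stack[buffer_size]                         # where the call will "return"


def overrun_split(return_address, buffer_size, data):
    """Two stacks: the same overrun damages neighbouring data, not the return address."""
    data_stack = [0] * buffer_size + [0]              # buffer, then the caller's data
    return_stack = [return_address]                   # kept in a separate segment
    for i, value in enumerate(data[:len(data_stack)]):
        data_stack[i] = value
    return return_stack[-1]
```

Copying five values into a four-slot buffer changes the return address in the combined model and leaves it alone in the split model. The split does not make the bug go away (the caller's data still gets clobbered), but the damage no longer lands on the flow of control.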

(2) The (mostly) post-fix expression grammar kind of falls naturally out of the split stack calling convention.

Post-fix is not required, and in-fix and pre-fix expressions are also simplified by the split stack convention. But a primarily post-fix expression grammar allows all elements of the language to be implemented simply, as if they were called functions whose results are not limited to a single scalar.

Post-fix is really simple with the split stack, again, reducing the footprint of the interpreter/compiler.
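Just how simple can be seen in a sketch like the following Python model (the function and its word table are illustrative, not taken from any Forth). Every token is either a number to push or a word to call, and a word like `/mod` can leave two results without any special syntax.

```python
def eval_postfix(source):
    """Evaluate whitespace-separated post-fix tokens on a single data stack."""
    stack = []

    def binary(fn):
        # Wrap a two-argument function as a word that pops two and pushes one.
        def word():
            b, a = stack.pop(), stack.pop()
            stack.append(fn(a, b))
        return word

    def slash_mod():
        # Leaves two results: remainder, then quotient on top.
        # Only tried with positive operands; signed division rules vary.
        b, a = stack.pop(), stack.pop()
        quotient, remainder = divmod(a, b)
        stack.extend([remainder, quotient])

    words = {
        "+": binary(lambda a, b: a + b),
        "-": binary(lambda a, b: a - b),
        "*": binary(lambda a, b: a * b),
        "/mod": slash_mod,
    }
    for token in source.split():
        if token in words:
            words[token]()            # every word is just a call
        else:
            stack.append(int(token))  # anything else must be a number
    return stack
```

For example, `eval_postfix("3 4 + 5 *")` leaves `[35]`, and `eval_postfix("17 5 /mod")` leaves `[2, 3]`, remainder below quotient. Note that, like the kernel interpreters mentioned next, this sketch does no parameter checking at all.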

I should probably note here that most kernel Forth interpreters do not check parameters on parse or call, which is a convenience when working at the low level, but also a vulnerability.

One thing needed for Forth to be more accepted in the industry is a higher-level compiler that checks parameters at compile-time (unless directed not to). The lower-level compiler can be kept for debugging-level work and such, but a higher-level compiler is necessary.

(1) Colon (and other) definition grammar is an exception to the general post-fix expression grammar.

If implementing a stripped-down Forth, the colon definition is the only one necessary, but constant, variable, vocabulary, string, and compiler-compiler (etc.) definition grammars are a necessary convenience if you're not aiming for the most stripped-down implementation possible.

These defining grammars are (usually) all pre-fix or in-fix grammars. They can be done post-fix, but the grammar gets really convoluted when you do that.

Colon definition reads backwards from the usual English usage of the colon, but it gives one the sense of defining functions, procedures, variables, constants, vocabularies, compiler-compilers, etc. as words with definitions.
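As a sketch of how the pre-fix colon grammar sits inside an otherwise post-fix interpreter, here is a small Python model. Everything in it (`interpret`, `compile_body`, the three starting words) is hypothetical. The colon is the one place where the interpreter reads ahead: it takes the next token as the name and everything up to `;` as the definition.

```python
def compile_body(body, dictionary):
    """Resolve each word of a definition now, at definition time (early binding)."""
    compiled = []
    for word in body:
        if word in dictionary:
            compiled.append(dictionary[word])
        else:
            # Not a known word, so it must be a number; int() fails otherwise.
            compiled.append(lambda stack, n=int(word): stack.append(n))

    def run(stack):
        for step in compiled:
            step(stack)
    return run


def interpret(source):
    """Tiny outer interpreter: post-fix words plus pre-fix colon definitions."""
    dictionary = {
        "+": lambda s: s.append(s.pop() + s.pop()),
        "*": lambda s: s.append(s.pop() * s.pop()),
        "dup": lambda s: s.append(s[-1]),
    }
    stack = []
    tokens = iter(source.split())
    for token in tokens:
        if token == ":":                  # pre-fix: name and body follow the colon
            name = next(tokens)
            body = []
            for word in tokens:
                if word == ";":
                    break
                body.append(word)
            dictionary[name] = compile_body(body, dictionary)
        elif token in dictionary:         # post-fix: operands are already stacked
            dictionary[token](stack)
        else:
            stack.append(int(token))
    return stack
```

With this, `interpret(": square dup * ; 7 square")` leaves `[49]`, and a later definition can use an earlier one, as in `: cube dup square * ;`.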

(4) The symbol table is implemented as a dictionary, or, more accurately, as a collection of vocabularies, and the interpreter/compiler is implemented as the initial dictionary.

This is what makes the interpreter/compiler facilities available to the programmer at development time, and even to the end-user at application run-time, if the application is so written.

Essentially, this dictionary implements what are called libraries in other languages, but it makes the whole language available to those who dare use it.
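A minimal sketch of the idea in Python, assuming a simple scheme where each vocabulary falls back to a parent. The class, the `editor` vocabulary, and the placeholder meanings are all invented for illustration, and real Forths link their word lists quite differently. What matters is that the compiler's own words, such as `:`, are ordinary entries found by the same search as everything else.

```python
class Vocabulary:
    """A named word list that falls back to a parent vocabulary."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.words = {}

    def define(self, word, meaning):
        self.words[word] = meaning

    def find(self, word):
        """Return (vocabulary name, meaning) for the nearest definition, or None."""
        vocabulary = self
        while vocabulary is not None:
            if word in vocabulary.words:
                return vocabulary.name, vocabulary.words[word]
            vocabulary = vocabulary.parent
        return None


# The initial dictionary holds the interpreter/compiler itself.
forth = Vocabulary("forth")
forth.define(":", "start a colon definition")
forth.define("dup", "duplicate top of stack")

# A hypothetical application vocabulary that shadows one word.
editor = Vocabulary("editor", parent=forth)
editor.define("dup", "duplicate the current line")
```

Searching from `editor` finds its own `dup` first, but still reaches `:` in the initial vocabulary, so the compiler stays available unless the application takes it out of the search path.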

(5) And, as I mentioned just above, the dictionary can remain accessible to the end-user, blending development-time with compile-time and run-time.

Often, in definition grammars, common patterns call for delayed binding, but I have only once seen a Forth that tried to implement delayed binding. It gets really trippy, and really is (usually) not necessary -- because the compiler itself remains available to the run-time unless you strip it out.

Alarm bells ring in system engineers' brains when they read the above, but without delayed binding, access to the compiler can be controlled by the application.

And that brings us to

(6) The inner interpreter provides the framework within which the parts of the run-time not provided directly by the CPU are implemented.

This was useful back when the concept of a run-time architecture was not well understood by systems and software engineers, and even by CPU architects. It is completely superfluous with CPUs such as the 6809 and the 68000, and even mostly superfluous for the x86 and such.

An inner interpreter can also be useful in defining a debugger, since the debugger doesn't have to know all about every feature of the CPU: all code ends up back at the inner interpreter, and all the core features are defined in the initial dictionary.
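For readers who have not seen one, an inner interpreter can be modeled in a few lines of Python. This is a loose sketch, not fig-Forth's actual NEXT: the `execute` function, the word table, and the `trace` list are assumptions made for illustration. Colon words are lists of word names (the "thread"), primitives are Python functions standing in for machine code, and every step passes through the same loop, which is exactly where a debugger can hook in.

```python
def execute(word, words, stack, trace=None):
    """Run `word` to completion with an explicit fetch-and-dispatch loop."""
    return_stack = []
    thread, ip = [word], 0                 # a one-entry thread that invokes `word`
    while True:
        if ip == len(thread):              # end of thread: return to the caller
            if not return_stack:
                return stack
            thread, ip = return_stack.pop()
            continue
        name = thread[ip]                  # fetch the next word reference
        ip += 1
        if trace is not None:
            trace.append(name)             # the single point a debugger can watch
        body = words[name]
        if callable(body):                 # primitive: run it directly
            body(stack)
        else:                              # colon word: nest into its thread
            return_stack.append((thread, ip))
            thread, ip = body, 0


WORDS = {
    "dup": lambda s: s.append(s[-1]),
    "*": lambda s: s.append(s.pop() * s.pop()),
    "square": ["dup", "*"],
    "fourth": ["square", "square"],
}
```

Calling `execute("fourth", WORDS, [3], trace)` leaves `[81]` and records every word the loop dispatched, in order. The cost is visible too: even `dup` goes through a fetch, a lookup, and a dispatch instead of a native call.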

But that debugger can become cumbersome and hard to control when investigating bugs in the interpreter itself, or bugs in the programmer's understanding of the runtime and interpreter.

Moreover, the inner interpreter usually gets in the way of combining Forth definitions with the libraries of other languages. For instance, combining Forth definitions with C libraries will require implementing a C compiler and libraries that understand and run within the runtime defined by the inner interpreter.

Engineers who write Forth code tend to become impatient with the idea of implementing an entire C compiler in Forth (which is the only real reason we don't see that happen, but is sufficient reason).

For this reason, I have been and am inclined to try to implement the Forth virtual machine run-time as if it were an extension to the CPU rather than a virtual machine defined by an inner interpreter.

This will mean native CPU calls, which will mean that the compiler has to know how to compile at least the native CPU call. And it will mean that the minimal kernel must have some minimal debugger features and know at least how to compile and see the native call instruction. Yeah, it's not as easy, but having now done several fig-Forth implementations, including a buggy conversion of the 6800/6801 fig-Forth to the 6809 (osdn can take a while to come up) and a fairly bug-free conversion of the same to the 68000, I'm even more inclined to think it's worth the trade-offs.

Returning to the fact that Forths are easy to roll your own, and easy to make non-standard, ...

[EDIT 20240526, end expand-on-key-features; ]

Comparing this to the programming language C, it's like saying that C includes not just all the K&R C compilers, but also all the different versions of Small C and Tiny C. (And maybe even Objective C, JavaScript, Java, Ruby, PHP, and a number of other languages that borrow heavily from C syntax and grammar?)

The solution?

fig-Forth is one group of dialects that have a lot in common. We could define and develop a standard fig-Forth.

I've transcribed the 6800 fig-Forth model and optimized it for the 6801. In addition to some I/O bugs, there are enough differences from the 6502 model to cause problems for non-trivial applications.

I have a near-fig-Forth I call BIF-6809 (and a non-functional one I call BIF-C). They use a binary tree symbol table, and that alters the language enough to make it unreasonable not to give them separate names. (Double negative intentional. Not just reasonable to have separate names, but unreasonable not to.)

Forth77 and Forth83 are separate languages, and should be treated as such. And they should be referred to by their complete names.

SwiftForth's language should be referred to as SwiftForth.

ANSI Forth should be renamed. Call it CommitteeForth or something.

The name Forth, unadorned, should be reserved to whatever Charles H. Moore (the original author of Forth) wants to call Forth.

(I should note that Moore himself calls his own dialect colorForth. He's leading the way, here.)

We need to make the nomenclature a part of the dictionary/symbol table. A word called version should bring up a version number, sure.

We need a word that returns (without printing it to the terminal) a string containing the name of the language/dialect. Maybe even one word for the language family name and one for the dialect. That would give us a base point, after which it would become possible to check what kind of glue needs to be brought in to make a particular source code compilable with a particular compiler.

After some consideration, I'll suggest the following four new words:

* language ( --- adr )

Returns a string containing the language name, "Forth". (This would allow distinction from derived but different languages.)

* dialect ( --- adr )

Returns a string containing a dialect name, "fig-Forth", "Forth77", "Forth83", "SwiftForth", "gforth", "BIF-6809", etc.

* sub-dialect ( --- adr )

Returns a string containing a modifier of the dialect name.

* target-cpu ( --- adr )

Returns a string containing the CPU targeted, such as "6502", "6801", "Z80", etc.

In addition,

* version ( --- ud )

Should return an unsigned double integer in which the first byte contains the major version number, the second byte the minor, and the third and fourth contain a sequencing number within the minor version.
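As a worked example of that layout, here is the arithmetic in Python. Two things are assumptions on my part and not settled by the description above: that the double is 32 bits wide, and that "first byte" means the most significant byte. The function names are hypothetical.

```python
def pack_version(major, minor, sequence):
    """Pack major.minor.sequence into one 32-bit unsigned value.

    Assumed layout: major in bits 31-24, minor in bits 23-16,
    sequence number in bits 15-0.
    """
    if not (0 <= major <= 0xFF and 0 <= minor <= 0xFF and 0 <= sequence <= 0xFFFF):
        raise ValueError("version field out of range")
    return (major << 24) | (minor << 16) | sequence


def unpack_version(ud):
    """Split a packed version back into (major, minor, sequence)."""
    return (ud >> 24) & 0xFF, (ud >> 16) & 0xFF, ud & 0xFFFF
```

Under those assumptions version 1.2, sequence 3, packs to `0x01020003`. One reason to put the major number in the high byte is that packed versions then compare in the right order as plain unsigned integers, so source code can test for "at least version 1.2" with a single comparison.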

To further aid tuning source to the host language, run-time, CPU, etc., dialects which adopt this practice should also adopt the practice of defining words that describe such things as cell width, sign representation, the boolean constant used for a logical true, etc. We could take some inspiration from C's (original, sparse) limits.h include file for this, but it should not duplicate the contents of limits.h.
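To show the sort of thing source code could derive from just two such words, here is a small Python sketch. The function and the key names `max-int`, `min-int`, and `max-uint` are placeholders I made up, not proposals for the actual word names.

```python
def cell_limits(cell_bits, representation="twos-complement"):
    """Derive integer limits from a cell width and a sign representation."""
    max_signed = (1 << (cell_bits - 1)) - 1
    if representation == "twos-complement":
        min_signed = -max_signed - 1
    elif representation in ("ones-complement", "sign-magnitude"):
        min_signed = -max_signed           # these have no extra negative value
    else:
        raise ValueError("unknown representation: " + representation)
    return {
        "max-int": max_signed,
        "min-int": min_signed,
        "max-uint": (1 << cell_bits) - 1,
    }
```

For a 16-bit two's complement cell this gives 32767, -32768, and 65535. The logical true is a separate question that cannot be derived this way, since dialects differ on it (fig-Forth uses 1, Forth83 uses all bits set), which is why it wants a word of its own.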

defining computers

Misunderstanding Computers

Monday, March 1, 2021
