I want to note here at the top that this is not about which company's CPU was better. This is not about comparing CPUs at all.
And this is not disparaging Motorola. Motorola did a pretty decent job of designing each of their CPUs, especially when considering that they were not just pioneering microprocessor design. Engineers with experience designing CPUs were basically all already employed, mostly by other companies. (And many of those CPU engineers didn't really understand CPUs all that well, after all.) Motorola was also pioneering the design of CPUs in general.
The engineers at Motorola did a good job. But nobody's perfect.
Taking these in the order that Motorola produced them:
6800 Niggles:
(1) It's not hard to guess that the improper optimization of the CPX (compare X index register) instruction was an attempt to be too clever, a bad case of penny-pinching and setting arbitrary deadlines, an oversight, or any and all of the foregoing. But, as a result, the branches implementing signed and unsigned comparisons just don't do what they would be expected to do after CPX.
- C (Carry) is simply not affected by CPX on the 6800 (and 6802), so the branches implementing unsigned compare, BCC, BCS, BHI, and BLS just won't work after CPX.
- V (oVerflow) is the result from comparing the most-significant byte only, so the branches implementing signed comparison, BGE, BGT, BLE, and BLT fail in hard-to-predict ways after CPX.
- N (Negative) is also the result from comparing the most-significant byte only. It may not seem that this is a problem for BPL (branch if plus) and BMI (branch if minus), but the programmers' manual says neither N nor V is intended for conditional branching. It seems to me that the N flag will actually be set correctly after the CPX, giving the sign of the result of the thrown-away subtraction of the argument address from the address in X. But using BPL and BMI in ordered comparison is just going to be a bit fiddly, no matter what. You probably just won't get what you thought you wanted if you use BPL or BMI after CPX.
Z (Zero) is the result of all 16 bits of the compare, so BEQ (branch if equal) and BNE (branch if not equal) after a CPX work as expected.
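As a rough sketch of what that costs in practice (my own illustration, not anything from Motorola's manuals), here is one way to get a trustworthy unsigned 16-bit comparison on the 6800 by doing the subtraction in the accumulators instead of relying on CPX. The names XTMP, LIMIT, and BELOW are hypothetical; XTMP is assumed to be a two-byte scratch variable in the direct page, and LIMIT a two-byte variable holding the address to compare against.

```asm
; Branch to BELOW if X < LIMIT (unsigned), without trusting C after CPX.
; XTMP, LIMIT, and BELOW are made-up names for this sketch. Clobbers A and B.
        STX   XTMP        ; X can only reach A and B through memory
        LDAA  XTMP        ; high byte of X
        LDAB  XTMP+1      ; low byte of X
        SUBB  LIMIT+1     ; subtract low bytes, C = borrow
        SBCA  LIMIT       ; subtract high bytes with the borrow
        BCS   BELOW       ; C now reflects the full 16-bit compare
```

Only C is meaningful at the end of this sequence. Z reflects just the high byte, so BHI and BLS (which look at both C and Z) still need an extra test of the low byte.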
This mis-feature was avoided from the start in the designs of the 68000 and the 6809, and was fixed, pretty much without issue, in the 6801. In the 6805, it's sidestepped by making the X register an 8-bit register (more on that below).
(2) Addressing temporary variables and parameters on a stack required using X, and if you had something you needed in X, you had to save X somewhere safe -- which meant on a stack if you wanted your code to be re-entrant. But the 6800 had no instructions to directly push or pop X. That left you with a conundrum. You had to save X to use X to save X.
So you had to use a statically allocated temporary variable. Statically allocated temporaries tend to introduce race conditions even in single-processor designs, because you really don't want to take the time to block interrupts just to use the temporaries, especially for something like adjusting a stack pointer.
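To make the conundrum concrete, here is a minimal sketch of saving and restoring X on a plain 6800, assuming a two-byte direct-page scratch variable I am calling XTMP (a hypothetical name). It clobbers A, and the window between the STX and the loads that follow it is exactly where an interrupt routine using the same XTMP would bite.

```asm
; Save X on the S stack (the 6800 has no PSHX). Clobbers A.
        STX   XTMP        ; park X in the static temporary
        LDAA  XTMP+1
        PSHA              ; low byte first, so the high byte lands at the lower address
        LDAA  XTMP
        PSHA

; ... code that uses X for something else goes here ...

; Restore X (the 6800 has no PULX). Clobbers A.
        PULA              ; high byte
        STAA  XTMP
        PULA              ; low byte
        STAA  XTMP+1
        LDX   XTMP
```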
You can potentially work around the race conditions in some cases by having your interrupt-time stack pointers separate from your non-interrupt-time stack pointers, but that can also get pretty tricky pretty quickly.
The 6801 provides push (PSHX) and pop (PULX) instructions for X.
Stack-addressable temporary variables and parameters were supported by definition in the 68000 and 6809 designs, but not on the 6801. They were considered out of scope on the 6805, but were addressed on descendants of the 6805.
(3) This niggle is somewhat controversial, but using a single stack that combines return addresses and parameters and temporary variables is a fiddly solution that has become widely accepted as the standard. Even though it is accepted, and learning how to set up a stack frame is something of a rite of passage, setting up stack frames to keep the regions on stack straight consumes cycles, even when it can be done without inducing race conditions (see the above niggle about using X to address the stack).
Separating parameters and temporaries from return addresses is supported by design on the 68000 and 6809, but not on the 6801 or 6805.
(4) The lack of direct-page mode op-codes for the unary operators was, in my opinion, a serious strategic miss. Sure, you could address variables in the direct page with extended mode addressing, but it cost extra cycles, and it just felt funny.
To explain, the binary instructions (loads, stores, two-operand arithmetic and logic) all have a direct-page mode. This allows saving a byte and a cycle when working on variables in the direct page (called zero page on other processors -- addresses from 0 to 255).
The unary instructions (increment/decrement, shifts, complements, etc.) do not. The irony is that the unary instructions are the ones you use on memory when you don't want to waste time and accumulators loading and storing a result.
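A small sketch of the trade-off, assuming a one-byte counter I am calling COUNT (hypothetical) that lives in the direct page. The byte counts are those of the standard 6800 encodings.

```asm
; What the 6800 gives you: unary on memory, extended mode only.
        INC   COUNT       ; 3 bytes, even though COUNT is below $0100

; Forcing direct-page access means going through an accumulator.
        LDAA  COUNT       ; 2 bytes (direct)
        INCA              ; 1 byte
        STAA  COUNT       ; 2 bytes (direct), and A has been clobbered
```

A direct-page INC, had it existed, would presumably have been a two-byte instruction like the other direct-page forms.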
This may have been another attempt to save transistors by not implementing every possible op-code. But a careful re-examination of the op-code table layout map indicates that it should have been possible without using significantly more transistors. In fact, I'm guessing it actually required more transistors to do it the way they ended up doing it.
Or it may have been an attempt to avoid running into the situation where they would need an op-code for something important but had already used all of the available codes in a particular area of the map. But, again, re-examining the op-code map would have revealed room to fit the op-codes in.
Maybe there just wasn't enough time to re-examine and reconsider the omissions before the scheduled deadlines, and they thought absolute/extended addressing should be good enough.
I'll come back to the reasons it really wasn't further down.
This one was also fixed in the designs of the 68000 and 6809, and sort-of in the 6805, but not addressed or fixed in the 6801.
Fixing it in the 6801 would have been an awkward after-the-fact tack-on, but I'll look at that below.
(5) The 6800 had a few instructions for inter-accumulator math -- ABA (add B to A), SBA (subtract B from A), and CBA (compare B with A, which is SBA but without storing the result).
But it's missing the logical instructions AND, OR, and EOR (Exclusive-OR) of B into A, and doesn't have any instructions at all going the other direction, A into B.
Surprisingly, this is not hard to work around in most cases, but the workarounds are case-by-case tricks with the condition codes. Otherwise, you're back to using statically allocated temporaries, and care must be taken to avoid potential race conditions by such things as using the same temporaries during interrupt processing.
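Two sketches of what I mean, neither of them the only way to do it. The first is a condition-code trick for the common case of checking whether A and B are both zero. The second ORs B into A through the stack, which stays re-entrant but costs you X. NONZERO is a hypothetical label, and the two fragments are independent of each other.

```asm
; Is the 16-bit value in A:B zero? No OR needed, just two tests.
        TSTA
        BNE   NONZERO
        TSTB
        BNE   NONZERO
; execution falls through to here when both bytes are zero

; General A = A OR B without a static temporary. Clobbers X.
        PSHB              ; copy of B onto the stack
        TSX               ; X = S + 1, so X points at the pushed byte
        ORAA  0,X         ; A = A OR (copy of B)
        INS               ; drop the pushed byte, flags from ORAA survive
```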
This is fixed in the design of the 68000, and eliminated from the scope of the 6805, effectively fixed in the 6809 (by the addition of stack-relative addressing for temporaries), and partially addressed in the 6801 (by adding 16-bit math, the most common place where it becomes a problem, more below).
(6) The 6800 has no native 16-bit math other than incrementing, decrementing, and comparing X, and incrementing and decrementing S. Synthesizing 16-bit math is straightforward, but -- especially without the inter-accumulator logical operators -- it does require temporary variables, requiring extra processor cycles and potentially inducing race conditions.
Also, you usually need one or more extra test cases to cover partial results in one or the other byte, or the use of a logical instruction to collect the results, and it's easy to forget or just fail to complete the math, per the problem with CPX.
And you need 16-bit arithmetic to deal with 16-bit addresses.
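As a sketch of the kind of address arithmetic I mean, here is one way to add a 16-bit offset held in A:B to X on the 6800. It assumes a two-byte direct-page scratch variable called XTMP (a hypothetical name for this illustration), and the usual race-condition caveats about static temporaries apply.

```asm
; X = X + A:B on the 6800. Clobbers A and B.
        STX   XTMP        ; X has to go through memory
        ADDB  XTMP+1      ; add low bytes, C = carry out
        ADCA  XTMP        ; add high bytes plus the carry
        STAB  XTMP+1
        STAA  XTMP
        LDX   XTMP        ; the adjusted pointer is back in X
```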
This is solved on the 6809 and 6801 by adding 16-bit addition and subtraction. On the 68000, the problem becomes 32-bit math, and it's solved for addition and subtraction, but, oddly, not quite completely for multiplication and division, more below.
For example, if your system design declares and systematically uses something like the following:
        ORG   $80     ; non-interrupt time global pseudo-registers
PSTK    RMB   2       ; two bytes for parameter stack pointer
QTMP    RMB   2       ; temporary for high bytes of 32-bit quadruple accumulator
DTMP    RMB   2       ; temporary for 16-bit double accumulator
XTMP    RMB   2       ; temporary for index math and copy source pointer
YTMP    RMB   2       ; temporary for index math and copy destination pointer
        ORG   $90     ; interrupt time global pseudo-registers
IPSTK   RMB   2       ; two bytes for parameter stack pointer
IQTMP   RMB   2       ; temporary for high bytes of 32-bit quadruple accumulator
IDTMP   RMB   2       ; temporary for 16-bit double accumulator
IXTMP   RMB   2       ; temporary for index math and copy source pointer
IYTMP   RMB   2       ; temporary for index math and copy destination pointer
... and if all the processes running on your system respect those global variable declarations, then you may at least have a way to avoid the race conditions.
But that chews a piece out of the memory map for user applications.
Now, if the unary operators all had direct-page mode versions, see niggle (4) above, the processor could also define a direct-page address space function code, along with several other such function codes, allowing the system designer to optionally include hardware to separate the direct-page system resources from other resources in the address map, such as general data, stack, code, interrupt vectors, etc.
Two or three extra address lines could be provided as optional address function codes, to allow hardware to separate the spaces out.
This looks kind of like the I/O instructions on the 8080 and 8086 families, but it isn't separate instructions, it's separate address maps.
An example two-bit function code might be
- 00: general (extended/absolute) data and I/O
- 01: direct-page data and I/O
- 10: code/interrupt vectors
- 11: return address stack
But function codes like these provide a place for such things as bank-switch hardware, in addition to general I/O and system globals and temporaries, without having to eat holes in general address space. And completely separating the return pointer stack from general data greatly increases the security of the system.
I'm not sure if Motorola ever did so in any of their evolved microcontrollers, but this could also potentially allow optimizing access to direct-page pseudo-registers when direct-page RAM is provided on-chip in integrated system-on-a-chip devices like the 6801 and 6805 SOC packages.
The 68000 provides similar address function codes, but the address space on the 68000 is so much bigger than 64 kilobytes that the address function codes have been largely ignored.
Before Motorola began designing new microprocessors, such niggles in the 6800 were noticed and discussed in engineering and management within Motorola. The company decided to analyze code they had available, including internally developed code and code customers shared with them for the purpose of the analysis, looking for bottlenecks and inefficient sequences that an improved processor design could help avoid. The results of this code analysis motivated the design of the 68000 and the 6809.
The 68000 and the 6809 were designed concurrently, by different groups within Motorola.
68000 Niggles:
The 68000 significantly increases the number of both accumulators (data registers) and index registers, and directly supports common address math in the instruction set. It also widens address and data registers to 32 bits. They solved a lot of problems, but they left a few niggles.
(1) The processor was excessively complex. Having a lot of registers reduced the need for complex instructions and for instructions that operated directly on memory without going through registers, but the 68000 did complex instructions and instructions that operated directly on memory, as well.
IBM was just beginning work on the 801 (the forerunner of the ROMP) at the time, and reduced instruction sets were still not a common topic, so the assumption of complexity can be understood.
Still, the complexity required a lot of work to test and properly qualify products for production.
(2) They got the stack frame for memory management exceptions wrong. That is, memory management hardware turned out to work significantly better using the approach they did not initially choose to support, so the frames they had defined did not contain enough information to recover using the preferred memory management techniques. This was fixed in the 68010.
(3) The exception vector space being global made it difficult to fully separate the user program space from the system program space. This was also fixed in the 68010.
(4) Constant offsets for the indexed modes were limited to 16 bits. This seems to be another false optimization -- not fatal because they included variable (register) offsets in the addressing modes, so you could load a 32-bit offset into a data register to get what you wanted. But it still had a cost in cycle counts and register usage. This was not fixed until the 68020, and then they went overboard, making the addressing even more complex, which made the 68020 even harder to test.
(5) They added hardware multiplication and division to the 68000, but they didn't fully support 32 bit multiply and divide. This also was not fixed until the 68020. This can make such things as accessing really large data structures in memory suddenly become slow, when the index to the data structure exceeds 32,767.
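For what it's worth, a 32-bit product can be synthesized from the 16-bit MULU. This is a minimal sketch in 68000 assembler, assuming the operands are in D0 and D1, only the low 32 bits of the product are wanted, and D2 and D3 are free for scratch (the register assignments are my own choice for illustration).

```asm
; D0 = low 32 bits of D0 * D1. Clobbers D2 and D3; D1 is preserved.
        MOVE.L  D0,D2
        SWAP    D2          ; D2.W = high word of D0
        MULU    D1,D2       ; D2 = hi(D0) * lo(D1)
        MOVE.L  D1,D3
        SWAP    D3          ; D3.W = high word of D1
        MULU    D0,D3       ; D3 = lo(D0) * hi(D1)
        ADD.W   D3,D2       ; only the low 16 bits of the cross terms survive
        SWAP    D2
        CLR.W   D2          ; D2 = cross terms shifted left 16 bits
        MULU    D1,D0       ; D0 = lo(D0) * lo(D1)
        ADD.L   D2,D0       ; add in the shifted cross terms
```

That is three multiplies and a handful of register shuffles where a full 32-bit multiply instruction would be one, which is where the sudden slowdown comes from.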
Of the above, (4) and (5) could conceivably have been dealt with in the initial design, if management had not been pushing engineering to find corners to cut. The first three were problems that simply required experience to get right.
6809 Niggles:
The 6809 does not increase the number of accumulators, but it does add instructions that combine the two 8-bit accumulators, A and B, into a single 16-bit accumulator D for basic math -- addition, subtraction, load, and store.
On the other hand, it does increase the number of indexable registers to six, and it adds a whole class of address math that can be incorporated into the addressing portion of the instructions themselves, or can be calculated independently of other instructions.
It supports using two of the index registers as stack pointers, and thus supports stack addressing, so that race conditions can generally be completely avoided by using temporary variables on stack. (In comparison, the 68000 can use any of the 8 address registers visible to the programmer as stack pointers.)
One of the stack-pointer capable registers can be used as a frame pointer, making stack frames less of a bottleneck. Or it can be used as a separate parameter stack pointer, pretty much eliminating the bottleneck and improving security. (In comparison, the 68000 includes an instruction to generate a stack frame, which, of course, you don't need when you use properly split stacks. It also includes an entirely superfluous instruction to destroy a stack frame.)
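A minimal sketch of the split-stack style on the 6809, assuming U is dedicated to parameters and S is left to return addresses. ADD16 is a hypothetical routine name, and the caller and callee are shown as two separate fragments.

```asm
; Callee: no frame to build, the parameters are right at the top of U.
ADD16   LDD   ,U++        ; pop the top parameter
        ADDD  ,U          ; add the one beneath it
        STD   ,U          ; replace it with the sum
        RTS               ; the return address on S was never in the way

; Caller (elsewhere): push two 16-bit parameters on U, call, pop the result.
        LDD   #1000
        PSHU  D
        LDD   #234
        PSHU  D
        JSR   ADD16       ; return address goes on S, not mixed with the data
        PULU  D           ; D = 1234
```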
One of the index-capable registers is the PC, which simplifies such things as mixing tables of constants in the code. (This is also supported on the 68000, making a ninth index-capable register for the 68000.)
One of the index registers (DP, for direct page) is a funky 8-bit high-byte partial index for the direct page modes it inherits from the 6800. (This is not done on the 68000, but any of the 68000's address registers can be used in a similar way, with short constant offsets for compact code and reduced cycle counts.)
All unary instructions have a direct page mode op-code, which saves byte count if not cycle count.
(1) As a minor niggle, I can't tell that not providing a full 16-bit base address for the direct addressing mode actually saved them anything in terms of transistor count and instruction cycle count, but we are probably safe in guessing that was their reasoning for doing it that way. It is still useful, although it might have been more useful to have provided finer-grain control of the base address of the direct page. (See above about using any address register in the 68000 in a similar way.)
The DP can be used, with caveat, as a base for process-local static allocations, which greatly reduces potential for inadvertent conflicts in use of global variables and race conditions.
(2) Another niggle about the direct page, the caveat, is that the direct page base is not directly supported for address math. Just finding where the direct page is pointing requires moving DP to the A accumulator and clearing the B accumulator, after which you can move it to one of the index registers. Cycle and register consuming, but not fatal.
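The sequence I'm describing looks something like this (a sketch; VAR is a hypothetical constant giving the offset of a variable within the direct page):

```asm
; Get the base address of the direct page into X. Clobbers A and B.
        TFR   DP,A        ; high byte of the base
        CLRB              ; low byte is always zero
        TFR   D,X         ; X = DP * 256
        LEAX  VAR,X       ; X now points at the variable itself
```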
(3) A third niggle, about both the direct page and the indexed modes: it seems like cycle counts for both could have been better. The 6801 improved cycle counts for both, making the 6809 seem less attractive to engineers seeking speed. It would have been nice for Motorola to have followed the 6801 with an improved 6809 that fixed the DP niggles and cycle count niggles.
(4) The 6809 also does not have address function code signals. The overall design provides enough power to implement mini-computer class operating systems, but the 64 kilobyte address space then limits the size of user applications. Address function signals that allow separating code, stack, direct page, and extended data would have eased the limits significantly.
On the other hand, widening the index registers would have done even more to ease the addressing restrictions. (I've talked about that elsewhere, and I hope to examine it more carefully sometime in a rant on how the 6809 could have evolved.)
(5) Other than those niggles, the 6809 is about as powerful a design as you can get and still call a CPU an 8-bit processor. In spite of the fact that it would have meant letting the 6809 compete with the 68000 in the market, they could have used the 6809 as the base design of a family of very competitive 16-bit CPUs.
In other words, my fifth niggle is that Motorola never pursued the potential of the 6809.
(6) but not really -- 8-bit CPUs are generally focused on keeping transistor count down for 8-bit applications, so hardware multiplication and division of 16-bit numbers doesn't really make sense in an 8-bit CPU design. This is probably the reason the 6809 only had 8- by 8-bit multiplication, and also probably the reason for the irregular structure of the operation.
A similar 8-bit division of accumulator A by accumulator B yielding 8 bits of quotient and 8 bits of remainder might make sense, but I'm not sure we should want to waste the transistors.
16-bit multiply and divide would have been good for a true 16-bit version of the 6809, but that would include a full 16-bit instruction set.
6801 Niggles:
When the 6809 was introduced in the market, it was still a bit too much complexity in the CPU to comfortably integrate peripheral parts -- timers, serial and parallel ports, and such -- into the same semiconductor die that contained the CPU. So Motorola decided to fix just a few of the niggles of the 6800 for use as a core CPU in semi-custom designs that included on-chip peripheral devices.
(It's something that is commonly misunderstood, that the 6801 actually came after the 6809 historically, but is best understood as a slightly improved 6800, not as a stripped-down 6809. Three steps forward, three steps back, half a step forward.)
As noted above, they fixed the CPX instruction in the 6801, but they did not fix the lack of direct-page unary instructions. They also added instructions to directly push and pop the X index register, which greatly helped when you had something in X that you needed to save before you used X for something else.
And they added the 16-bit loads, stores, and math that combined A and B into a single 16-bit double accumulator D -- similar to the 6809, which overcame a lot of the other niggles about the 6800. In particular, you don't feel the lack of an OR B with A instruction to make sure both bytes of the result were zero, because the flags are correctly set after the D accumulator instructions.
And they included the 8-bit multiply A by B from the 6809. They also included a couple of 16-bit double accumulator shifts, but only for D, not for memory, which is a very minor niggle, an engineering trade-off.
They also added an instruction to add B to X, ABX, to help calculate the addresses of fields within records.
This brings up niggle (1) -- ABX is unsigned, and they did not include a subtract B from X instruction. Being able to subtract B from X, or add a negative value in B to X, would have significantly helped with allocating local variable space on the stack. As it is, ABX is primarily useful for addressing elements within records and structures.
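As a sketch of the workaround on the 6801 (my own sequence, not something from the manual), here is one way to subtract an unsigned B from X. It stays re-entrant by going through the stack, but it clobbers A and B and takes eight instructions where a hypothetical subtract-B-from-X would take one.

```asm
; X = X - B (B unsigned) on the 6801. Clobbers A and B.
        CLRA              ; A = 0, C = 0
        NEGB              ; B = -B, C is set when B was non-zero
        SBCA  #0          ; A = $FF when B was non-zero, so D = -(old B)
        PSHX              ; old X onto the stack
        TSX               ; X points at that stacked copy
        ADDD  0,X         ; D = old X - old B
        STD   0,X         ; write the result over the stacked copy
        PULX              ; X = old X - old B
```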
Although I/O devices tended to be assigned addresses in high memory on early 6800 designs, the 6801 put the built-in I/O devices in the direct page. They also put a bit of built-in RAM in the direct page, starting at $80.
But, as I noted above, niggle (2) is that they did not add direct-page mode unary instructions.
If they had done so, either they'd have broken object code compatibility with the 6800, or they'd have had to spread the direct-page op-codes in awkward places in the 6800, which definitely would have cost transistors that they wanted for the I/O devices and such. Either way, I think it would have been worth the cost.
I put together a table showing one possible way to spread them out among unimplemented op-code locations in the inherent/branch section of the op-code table for a chapter of one of my stalled novels, and I'll just copy below a list of where I allocated the direct page op-codes:
- NEG direct: $02
- ROR direct: $12
- ASR direct: $03
- COM direct: $13
- LSR direct: $14
- ROL direct: $15
- ASL direct: $18
- DEC direct: $1A
- INC direct: $1C
- TST direct: $1D
- JMP direct: $1E
- CLR direct: $1F
That doesn't prove anything other than that there were ultimately enough op-codes available. But I'm guessing this layout could be done with a hundred or fewer extra transistors -- transistors that admittedly would then be unavailable for counters or port bits. But it could be done, and it wouldn't have cost that much.
Also, with these in the op-code map, they could have provided this version of the CPU for compatibility, and then provided another version with the direct-page op-codes correctly laid out for customers who were willing to simply re-assemble their source code. (That's all it would have taken, but many customers wouldn't be willing to take a chance that something would sneak up and bite them.)
One possible more efficient layout would have been to repeat the addressing of the binary op-code groups. Working from the right in the opcode map, there are four columns for accumulator B binary operators and four columns for accumulator A binary operators:
- $FX is extended mode B, and $BX is extended mode A;
- $EX is indexed mode B, and $AX is indexed mode A;
- $DX is direct page B, and $9X is direct page A;
- $CX is immediate mode B, and $8X is immediate mode A.
In the existing 6800, this continues down two more for the unaries, but then you have the unary A and B instructions:
- $7X is extended mode unary;
- $6X is indexed mode unary;
- $5X is B unary;
- $4X is A unary.
Then you have inherent mode instructions in columns $3X, $1X, and $0X, with the branches in column $2X.
In a restructured op-code map, it could be done like this:
- $7X is extended mode unary;
- $6X is indexed mode unary;
- $5X would be direct page unary;
- $4X would be B unary;
- $0X would be A unary.
And the inherent mode operators would be more densely packed in the $1X and $3X columns.
This would require either moving the negate instructions or the halt-and-catch-fire instruction, I suppose. [I'm not finding my reference that had me thinking the 6801's test instruction was at $00. Cancel that thought.] Interestingly, when Motorola laid out the op-code map for the 6809, they kept A and B in columns $4X and $5X, and put the direct page in column $0X -- and left the negate at row $X0, so that they had to move the test instruction. [Again, I'm not finding my reference on the location of the 6809's test instruction. But they did leave negate where it was.]
Also interestingly, the 6801 has a direct-page jump to subroutine, which could be put to good use for a small set of quick global routines (like stack?). (The op-code is $9D, which some sources say was one of the accidental test instructions in the 6800.)
Niggle (3) about the 6801 is that I think they should have split the stack. Add a parameter stack U, and then pushes and pops (PULs) would operate on the U stack, but JSR/BSR/RTS would operate on the S stack. This would make stack frames much less of a bottleneck, make it possible to reduce call and return cycle counts, and increase general code security somewhat.
(Note again that the 6809 and the 68000 both directly support this kind of split stack. It was the education system that failed to teach engineers to use it.)
And I'll note here that the 68HC11 derivative of the 6801 added, among other things, a Y index, but no parameter stack.
6805 Niggles:
Really the only niggle I have with the 6805 is the lack of a separate parameter stack, and the lack of any push/pop at all in the original 6805. Motorola did add pushes and pops to some derivatives of the 6805, but they were on the same S stack as the return address was going to.
The idea of an 8-bit index that could have a 16-bit base (as opposed to an offset) was novel to me when I first looked at the 6805, but it is rather useful. Instead of thinking in terms of putting a base address in X and then adding an offset, you think in terms of having a constant base address -- like an array with a known, fixed address, and the X register provides a variable offset. Indexed mode for binary operators includes no base, 8-bit base, and 16-bit base, allowing use anywhere in the address space.
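A minimal sketch of that way of thinking, assuming two fixed buffers I am calling SRC and DST and a constant LEN between 1 and 255 (all hypothetical names):

```asm
; Copy LEN bytes from SRC to DST on the 6805, last byte first.
        LDX   #LEN        ; X is the 8-bit variable offset
LOOP    LDA   SRC-1,X     ; SRC-1 is the constant 16-bit base
        STA   DST-1,X
        DECX
        BNE   LOOP        ; stop when the offset reaches zero
```

The buffer addresses are baked into the instructions themselves, with only the offset varying at run time.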
A small caveat is that unary operators do not have 16-bit base address indexed versions. This is a valid engineering tradeoff, and they cut the right corners here, fully supporting unary instructions for variables in the direct page.
The 8-bit index does not support generalized copying and other generalized functions needed to support self-hosted development environments (without self-modifying code), but that's not necessarily a problem. Hosted development environments are much more powerful tools than self-hosted. (I think a very small Tiny-BASIC interpreter could be constructed without self-modifying code, but that's more of an application than a self-hosted dev environment.)
It does make the CPX operator much simpler -- as an 8-bit operator.
Motorola ultimately extended the index with an XHI in some derivatives of the 6805, which would have allowed self-hosting for those derivatives, but we won't go there today. Also, we won't look at the 68HC11 in detail today. Nor will we do more than glance at the 68HC12 and 68HC16, even though both are quite interesting designs -- in spite of not having split stacks.
I think this is enough to show that Motorola really did do a fairly decent job with their CPU designs.
Actually comparing CPUs, by the way, requires producing a lot of parallel code implementing several real-world applications for each CPU compared. I'd like to do that someday, but I doubt I'll ever have the spare time and money to do so.