defining computers: Converging CPU Models on the 68K Model?

Around 1986, I was talking with a fellow student at BYU about the CPU wars that "everyone" knew had finally (Oh, the relief!) been declared in Intel's favor.

(Do not listen to the salesmen when they tell you their product is the de-facto standard. They are usually exaggerating, but, if they aren't, you should throw them out on their ears. "Everybody does it!" is never a good priority argument.)

I told him that the industry would find itself painted into a corner with the 80x86, and everyone would find themselves having to convert their code to the 680X0.

On the one hand, I did not reckon on Intel being willing to waste as much money on the race to shrink, and on implementing self-fulfilling prophecy heuristics, just to keep their brain-damaged intellectual property somewhat competitive in apparent performance.

If the same amount of money were spent on shrinking and optimizing ARM, there would be no contest at all. Likewise on Freescale's ColdFire.

AMD's 64 bit "extension" to the INTEL 80x86 model basically saved INTEL's bacon.

And this is what is amusing about this whole thing: All the successful modern processors have ended up looking a lot like the 680X0 -- about 16 internal mostly orthogonally accessible general-purpose registers, and a control-flow stack model somewhat supported in the silicon.

If you don't have enough registers (see the 6800), or if the registers are not orthogonal and general-purpose (see the 8080 and 8086), you tend to waste a lot of instructions in moving data between the registers where you will work on them and the temporary storage where you keep intermediate results. Extra code is extra processing time. And it's extra space in cache, which is extra time moving code in and out of cache.

But if you have too many registers (see all the true "RISC" processors and most modified RISC such as the PowerPC), maybe you don't have to waste time getting data in and out of registers, but you lose on code density. The instructions are bigger and they often don't quite do enough, so you end up using more. And that means that you end up with more code to move in and out of cache again.

And it's actually getting the code in and out of cache that generally dings your performance most in general purpose computing. (Embedded applications that don't depend on cache take less of a hit, but the problem of instructions that don't quite do all that you want them to do, requiring more instructions to finish a calculation, still ends up being a trade-off.)

So, there it is. Look at the AMD64 and the first thing you notice is that they generalized the registers in the 80x86 model. Lots of talk about register-renaming, but the biggest thing was making the register set orthogonal to the instructions. Then you notice that they added eight registers somewhere along the line.

We spend a lot of money and time in keeping our code in the intermediate level C language or higher level languages, so that moving to a new CPU is just a compile away.

And the high-level code eats away at performance, too, so we waste a lot of money on bad attempts at optimizing the compilers and interpreters.

Now, we gain quite a lot in developer efficiency by giving up some of the performance to higher-level source code. So higher-level code is actually a good trade-off.

Even though I was wrong in details, I was essentially right. The current most successful "desktop" CPU looks a lot more like the 680X0 than the 80x86, and that CPU competes well in the server market, as well. The code that it spends most of its time running is not 80x86 code any more.

And the most successful CPU overall is the ARM, which also looks a lot more like the 68K than the 80x86. (Low order bytes first is a bit of brain damage that still has to be exorcised from the industry. The more we deal with multi-byte character sets, the more people will realize why.)

Motorola wasted a lot of resources in the CPU war, trying to keep the 680X0 ahead of the 80x86 in salescrew friendly features, many of which turned out to be less-than-useful.

The most important features (full 32 bit branching and a reliable exception model) were implemented in the 68010. Many of the advances between the 68020 and the 68060 actually came in the form of getting rid of features in ways that didn't penalize customers too much when they had blindly used those features.

And ColdFire, which is Freescale's more modern descendant of the 68K, gets rid of even more of the fluff features from the feature wars, and improves the exception model significantly. ColdFire is actually competitive with the ARM architecture in performance and power curve at the same level of shrinkage.

(AMD64/80x86 is only competitive on the power curve in salesperson's dreams, and, even that, with serious handicaps forced on ARM designs by the prevailing INTEL-leaning customs of the industry.)

There is lots of irony in history, and the computing industry is no exception.

Looking ahead, there is one more big irony that I see. When we understand the run-time model a bit better, we will quit interleaving the flow-of-control (return address) stack with the data stack.

Which 8 bit processor of a bygone era had two stacks?

By the way, here's a minimal execution (register) model I see as optimal:

Two accumulators, fully orthogonal, to allow tracking intermediate results of two different sets of calculations without hitting the memory bus;
The ability to concatenate the accumulators to work with double-width integers and pointers;
Two index registers, fully orthogonal, to allow referencing a source and a destination non-scalar without having to save and restore pointers;
Fully indexable data stack, so that accessing local variables on the stack doesn't require stack manipulation or using an index register;
Flow-of-control stack independent of the data stack, to get rid of the problems that happen when return addresses are overwritten by data;
It would be preferable to keep the return addresses in a memory area protected from normal access;
Local, per-task static allocation area for variables that are not accessed concurrently ("thread" local in current vernacular);
Global static allocation area for constants, code, and variables that need to be guarded by some sort of mutual exclusion;
Indexable program counter for calculated jump tables and constants buried in code.

Which archaic 8-bit processor fits this description almost to a "t", if you overlook the clumsiness of the 8-bit direct page register and the lack of a bus signal to indicate when a return address is being pushed or popped? Heh.

I've found that this model supports pretty much all the hand-optimized code I've produced without excess data movement, and allows a very compact machine code expression.

Now, the 6809 doesn't quite fit this model. The direct page register can't be indexed, which is okay for compiled code but not interpreted.

And it doesn't have a separate address space for a return address stack. I have a vague memory about something in the timing diagrams that might allow synthesizing a separate address space, but that's probably wishful thinking. I don't have timing diagrams handy, so I can't check.

Also, the 16 bit address limits are a bit tight.

If having products of your own company that compete with each other were a bad thing, Motorola would have been right to have de-emphasized the 6809.

But look at Freescale's current lineup. PowerPC, ColdFire, ARM? They essentially have three of their own products competing directly with each other in a lot of target product spaces.

When any two are less popular, the third keeps things stable.

The only problem here is that Freescale's salescrew have to educate themselves a bit more about the lineup. That's actually a good thing, because there is nothing more damaging to an industry than a pack of lazy salespeople.

The 68K can be programmed to fit this model, with a bit less compact machine code, except for the separate address space for the return addresses. (Again, I need to check the timing diagrams and, with certain members of the 68K family, the memory access function codes, to see if there might be some way to cache the stack.) Still, with a 32 bit address, you can get the return addresses far enough away from everything else that accidentally overwriting the return address becomes much less common, and most of the attacks on code that involve buffer overruns and such to overwrite the return addresses become very difficult.

The 68K provides other ways to concatenate small integers, so natural concatenation of accumulators is not necessary.

The PowerPC (which Freescale manufactures in cooperation with IBM) can be programmed to fit this model, too, but it usually isn't. Too many registers go unused. (It would be an interesting research project into register counts to compare the results of compiling to the large register set, single stack model on the PowerPC is more efficient than compiling to the above model. Note that the PowerPC code density is about 50% of the 68K at best.)

The ARM (which Freescale also manufactures under license) can also be easily mapped to this model. 64 bit ARM should allow even better separation of the return address stack from other parts of memory.

Renesas's SuperH has some odd non-orthogonalities that require a bit of intimacy with the processor, but it can be mapped to this run-time model as well.

80x86? Not so much. Too much funky stuff happening when you aren't expecting it, too much optimizing to the standard C runtime model which INTEL helped create. AMD64, on the other hand, should map rather well to the model. Maybe, if the register renaming doesn't get too much in the way.

Now, for the 68K, ARM, SH3, and other processors with a bit more registers than you need, the extra registers can always be used.

A third index register allows one to access two sources and a destination without having to save and restore index registers.
An extra index register could be used to differentiate between global constants and global variables. Yet another could be used for linking to global or parent process callbacks, particularly those for maintaining concurrent access coherency -- semaphores, monitors, mutual exclusion code, etc.
Another index register could maintain a second data stack, to separate parameters from local variables. (It's an artificial distinction, I know, but there are many distinctions not required by the math that help optimize code and execution.)
You might even dedicate an indexable registers to token threaded interpreter virtual instruction-code list execution.

Returning to the separate stack that is currently not part of the standard C runtime model, another benefit of a separate stack is reduced function call overhead. Since the return address is elsewhere, call frames really don't need special maintenance. If there is no need to walk the variable space of the calling routines, no frame pointers need to be generated at all.

A downside is that it adds a separate page for memory management functions to keep track of, with the added complexity of the operating system needing to make sure the return address stack and the parameter stack both always stay in memory while the process is in memory.

However, if the CPU manufacturer could be convinced to help just a little, the return address could be cached in a dedicated stack-oriented cache, and the return address page management could be greatly simplified, compared to the rest of memory. Careful caching could basically postpone, or even eliminate most of the remaining overhead of subroutine calls, reducing the overhead to the cost of the two ordinary jumps.

In addition, the return address stack does not have to maintain coherence with external memory. Spurious writes to the cache (in other words, not from the instruction fetch mechanism) could be silently dropped, or could trigger a memory access violation.

(Certain kinds of exception processing are usually done be re-writing return addresses, but there are other, better ways to do that on a decent CPU.)

I'm going to make another wild prediction. Once people get used to having 16 registers to use as the code needs it, instead of being restricted to the standard C model that INTEL helped converge around their 80x86 execution model, we'll see more multi-stack solutions. One of the first splits will be the flow-of-control stack.

(More about this model some other time.)

defining computers

Misunderstanding Computers

Saturday, March 28, 2015

Converging CPU Models on the 68K Model?

No comments:

Post a Comment