Misunderstanding Computers

Why do we insist on seeing the computer as a magic box for controlling other people?
Why do we want so much to control others when we won't control ourselves?

Computer memory is just fancy paper, CPUs are just fancy pens with fancy erasers, and the network is just a fancy backyard fence.

(original post -- defining computers site)

Tuesday, March 31, 2015

Splitting the Return Addresses out of the Parameters

[I think I'm done editing now.]

My previous post here and some recent posts on misc@openbsd got me thinking again about significantly mitigating buffer overflows and the like by properly separating the parameters and the return addresses into separate stacks.

I posted my thoughts and asked for suggestions. Philip Guenther pointed out that I needed to unpack my thoughts a bit better.

I responded to several of his points, but I was asleep at the keyboard and didn't get my example of a function call using the separated stacks posted in my reply.

I don't want to add to the chatter on list, so I decided to unpack things here.

This is a lot more work than it seems on the surface.

First, typing in Euclid's method for greatest common divisors was faster than looking it up someplace where I've typed it in before:
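The C source itself didn't make it into this post, so here is a reconstruction from the hand-compiled assembly farther down -- not the original file. The subtract-and-swap loop is exactly what the assembly does, but main() and its variables are my best guess (the assembly's comments mention n1, n2, and divisor):

-------------------------------
/* gcd.c -- reconstructed, not the original source. Assumes positive
   inputs, as the subtraction-based loop requires. */
#include <stdio.h>

int gcd( int numA, int numB )
{
        int temp;
        while ( numA != numB )
        {
                if ( numA < numB )      /* keep numA the larger */
                {
                        temp = numA;
                        numA = numB;
                        numB = temp;
                }
                numA -= numB;   /* Euclid, by repeated subtraction */
        }
        return numA;
}

int main( void )
{
        int n1 = 1024, n2 = 640;
        int divisor = gcd( n1, n2 );

        printf( "gcd( %d, %d ) == %d\n", n1, n2, divisor );
        return 0;
}
-------------------------------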
Then I compiled to assembler. Don't even have to think to do that:
cc -Wall -S gcd.c


Then I modified the internal gcd() routine to use a separate parameter stack.

Here's how to call the gcd() routine:


-------------------------------
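 /* Note: %ebp serves here as the parameter stack pointer; %esp remains
    the flow-of-control (return address) stack. n1, n2, and divisor
    live in the caller's parameter stack frame. */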
         subl    $2*4, %ebp      /* allocate parameters */
         movl    3*4(%ebp), %eax /* copy n2 for the call to gcd */
         movl    %eax, 1*4(%ebp)
         movl    4*4(%ebp), %eax /* copy n1 for the call to gcd */
         movl    %eax, (%ebp)
         call    gcd             /* return address on the flow-of-control stack */
         addl    $2*4, %ebp      /* de-allocate parameters */
 
 /* Recover the return value. */
         movl    %eax, (%ebp)    /* move the result to divisor */
-------------------------------

Here's the gcd() hand-compiled to access its parameters on the parameter stack:

-------------------------------
 gcd:
         pushl   %ebp    /* Frame for unwinding */
         subl    $4, %ebp        /* Locals */
         jmp     .L2
 .L3:
         movl    4(%ebp), %eax   /* numA */
         cmpl    8(%ebp), %eax   /* numB */
         jge     .L4
         movl    4(%ebp), %eax   /* numA */
         movl    %eax, (%ebp)    /* temp */
         movl    8(%ebp), %eax   /* numB */
         movl    %eax, 4(%ebp)   /* numA */
         movl    (%ebp), %eax    /* temp */
         movl    %eax, 8(%ebp)   /* numB */
 .L4:
         movl    8(%ebp), %eax   /* numB */
         subl    %eax, 4(%ebp)   /* numA */
 .L2:
         movl    4(%ebp), %eax   /* numA */
         cmpl    8(%ebp), %eax   /* numB */
         jne     .L3
         movl    4(%ebp), %eax   /* numA */
         popl    %ebp    /* Restore previous frame */
         ret

-------------------------------

(I'm not putting the full source of that up here because I'm lazy. Sorry. This is taking more time than I have.)

That wasn't too bad, but it really didn't answer Philip's objections.

From there, things got time consuming.



Providing shims to the C library functions was easy. I put generalized shims at the top of the assembler file because it was easy, but the generalized shims require loading the call address into %eax before calling the shim. Specific shims would be one instruction shorter and faster on the call, but would (potentially) require more shims, depending on how many calls with each number of parameters occur.
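For illustration, here's a sketch of what a generalized shim for two-parameter calls might look like (shim2 is a hypothetical name; these are not the actual shims from my assembler file). It assumes the caller has loaded the C function's address into %eax, with the two parameters at (%ebp) and 4(%ebp) on the parameter stack:

-------------------------------
 shim2:
         pushl   4(%ebp)         /* copy 2nd parameter to the C stack */
         pushl   (%ebp)          /* copy 1st parameter to the C stack */
         call    *%eax           /* call the unmodified library function */
         addl    $2*4, %esp      /* drop the copied parameters */
         ret                     /* the result is already in %eax */
-------------------------------

A specific shim would replace the call *%eax with a direct call to the library function, saving the caller the movl of the address into %eax.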

There are other ways to do the shims, of course, but the shims should go away once the whole OS is compiled with the separated stacks.

Tracking which variables were where, so I could demonstrate the shims by keeping them all on the parameter stack that I allocated, required a lot of grunt work -- hand de-compiling and re-compiling.

And the comments I inserted were for my own benefit, more than for yours. Without them, I could not have tracked the variables on the stack. Thankfully, current gcc makes it easier, allocating the whole block of local variables at function entry. I should have done the same, but I found it easier to focus on small bits of code at a time.

Note that I initially just used calloc() to allocate space for the parameter stack in a C source file I called gcddummy.c, and then shifted to using mmap(), so I could map the stack up high, around 64M below the return address stack.
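In outline, the mmap() version looks something like this (a sketch, not the actual gcddummy.c; alloc_pstack and the constants are illustrative):

-------------------------------
/* Sketch: allocate a 1M parameter stack whose top sits a 64M unmapped
   gap below a return address stack assumed near 0xc0000000 on a
   32-bit system. Illustrative numbers, not the real ones. */
#include <sys/mman.h>

#define PSTACK_SIZE     ( 1UL << 20 )                   /* 1M */
#define PSTACK_GAP      ( 64UL << 20 )                  /* 64M gap */
#define PSTACK_TOP      ( 0xc0000000UL - PSTACK_GAP )

char * alloc_pstack( void )
{
        char * base = mmap( (void *) ( PSTACK_TOP - PSTACK_SIZE ),
                            PSTACK_SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANON, -1, 0 );

        if ( base == MAP_FAILED )
                return NULL;
        return base + PSTACK_SIZE;      /* stacks grow down */
}
-------------------------------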

Placement of the two stacks requires some thinking. Putting the parameter stack above the return address stack will leave you vulnerable to certain kinds of collisions, primarily deep recursion with large local variables.

Putting the parameter stack below the return address stack will leave you vulnerable to buffer overflows more than recursion -- large overflows, that is, in the 64 megabyte range in this example. A 256M gap would be even better, but the theoretical vulnerability remains.
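With the parameter stack below, the picture looks roughly like this (placement illustrative):

-------------------------------
 high addresses:  flow-of-control (return address) stack, grows down
                  ~64M unmapped guard gap -- overflows fault here
                  parameter stack top, grows down
 low addresses:   heap, code, and so forth
-------------------------------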


I chose below because it would have been much more work to move the return address stack down.

But there should be a gap of unallocated memory between the stacks, and, with proper memory management hardware, a buffer overflow that runs into the gap should raise an illegal access exception, limiting the attacker to nothing worse than a denial of service.

Recursion may be more difficult to handle, but, if the OS provides automatic extension of the stack, one might hope that it would also provide some way to detect such recursions.

Anyway, ultimately, the OS should support the second stack, so that the compiler and the source code programmer won't have to deal with the allocation, other than compiler switches for the rare case when the default size or placement of the gap is not desired.

64-bit addresses, even if not fully decoded, should help with the collision issues.

This looks pretty simple.

The problem is that no current compiler I know of does it this way. That means I have to write such a compiler myself, or I have to go find the production rules in the source of an existing compiler and see whether I can successfully modify function call, return, and parameter access without breaking the compiler.

Learning (finally) how to use compiler-compilers and lexical analyzers would be the easy part. Not breaking the compiler is the hard part.

Writing a new compiler myself might be easier, considering how familiar I'd have to get with the compiler of choice.

And then I would have to go digging through the library source, looking for all the places that directly access the stack, and fixing the code that steps around a return address that isn't there any more. va_arg is just the tip of the iceberg.

It's a big project, but I think it needs to be done. It's just too easy to walk on the return addresses (and do bad things with them) otherwise.

Now, this doesn't fix the general problem of using buffer overflows to modify the behavior of programs; it just makes the easiest way to do so significantly harder.

It also opens the C language to some paradigms that are currently blocked, and those paradigms just happen to make it easier to write robust programs. But that is a discussion for a later date.

Saturday, March 28, 2015

Converging CPU Models on the 68K Model?

Around 1986, I was talking with a fellow student at BYU about the CPU wars that "everyone" knew had finally (Oh, the relief!) been declared in Intel's favor.

(Do not listen to the salesmen when they tell you their product is the de-facto standard. They are usually exaggerating, but, if they aren't, you should throw them out on their ears. "Everybody does it!" is never a good argument for setting priorities.)

I told him that the industry would find itself painted into a corner with the 80x86, and everyone would find themselves having to convert their code to the 680X0.

On the one hand, I did not reckon on Intel being willing to waste as much money on the race to shrink, and on implementing self-fulfilling prophecy heuristics, just to keep their brain-damaged intellectual property somewhat competitive in apparent performance.

If the same amount of money were spent on shrinking and optimizing ARM, there would be no contest at all. Likewise on Freescale's ColdFire.

AMD's 64 bit "extension" to the INTEL 80x86 model basically saved INTEL's bacon.

And this is what is amusing about this whole thing: All the successful modern processors have ended up looking a lot like the 680X0 -- about 16 internal mostly orthogonally accessible general-purpose registers, and a control-flow stack model somewhat supported in the silicon.

If you don't have enough registers (see the 6800), or if the registers are not orthogonal and general-purpose (see the 8080 and 8086), you tend to waste a lot of instructions in moving data between the registers where you will work on them and the temporary storage where you keep intermediate results. Extra code is extra processing time. And it's extra space in cache, which is extra time moving code in and out of cache.

But if you have too many registers (see all the true "RISC" processors and most modified RISC such as the PowerPC), maybe you don't have to waste time getting data in and out of registers, but you lose on code density. The instructions are bigger and they often don't quite do enough, so you end up using more. And that means that you end up with more code to move in and out of cache again.

And it's actually getting the code in and out of cache that generally dings your performance most in general purpose computing. (Embedded applications that don't depend on cache take less of a hit, but the problem of instructions that don't quite do all that you want them to do, requiring more instructions to finish a calculation, still ends up being a trade-off.)

So, there it is. Look at the AMD64 and the first thing you notice is that they generalized the registers in the 80x86 model. Lots of talk about register-renaming, but the biggest thing was making the register set orthogonal to the instructions. Then you notice that they added eight registers somewhere along the line.

We spend a lot of money and time in keeping our code in the intermediate level C language or higher level languages, so that moving to a new CPU is just a compile away.

And the high-level code eats away at performance, too, so we waste a lot of money on bad attempts at optimizing the compilers and interpreters.

Now, we gain quite a lot in developer efficiency by giving up some of the performance to higher-level source code. So higher-level code is actually a good trade-off.

Even though I was wrong in details, I was essentially right. The current most successful "desktop" CPU looks a lot more like the 680X0 than the 80x86, and that CPU competes well in the server market, as well. The code that it spends most of its time running is not 80x86 code any more.

And the most successful CPU overall is the ARM, which also looks a lot more like the 68K than the 80x86. (Low order bytes first is a bit of brain damage that still has to be exorcised from the industry. The more we deal with multi-byte character sets, the more people will realize why.)

Motorola wasted a lot of resources in the CPU war, trying to keep the 680X0 ahead of the 80x86 in salescrew-friendly features, many of which turned out to be less-than-useful.

The most important features (full 32 bit branching and a reliable exception model) were implemented in the 68010. Many of the advances between the 68020 and the 68060 actually came in the form of getting rid of features in ways that didn't penalize customers too much when they had blindly used those features.

And ColdFire, which is Freescale's more modern descendant of the 68K, gets rid of even more of the fluff features from the feature wars, and improves the exception model significantly. ColdFire is actually competitive with the ARM architecture in performance and power curve at the same level of shrinkage.

(AMD64/80x86 is only competitive on the power curve in salesperson's dreams, and, even that, with serious handicaps forced on ARM designs by the prevailing INTEL-leaning customs of the industry.)

There is lots of irony in history, and the computing industry is no exception.


Looking ahead, there is one more big irony that I see. When we understand the run-time model a bit better, we will quit interleaving the flow-of-control (return address) stack with the data stack.

Which 8 bit processor of a bygone era had two stacks?

By the way, here's a minimal execution (register) model I see as optimal:
  • Two accumulators, fully orthogonal, to allow tracking intermediate results of two different sets of calculations without hitting the memory bus;
  • The ability to concatenate the accumulators to work with double-width integers and pointers;
  • Two index registers, fully orthogonal, to allow referencing a source and a destination non-scalar without having to save and restore pointers;
  • Fully indexable data stack, so that accessing local variables on the stack doesn't require stack manipulation or using an index register;
  • Flow-of-control stack independent of the data stack, to get rid of the problems that happen when return addresses are overwritten by data;
  • Return addresses preferably kept in a memory area protected from normal access;
  • Local, per-task static allocation area for variables that are not accessed concurrently ("thread" local in current vernacular);
  • Global static allocation area for constants, code, and variables that need to be guarded by some sort of mutual exclusion;
  • Indexable program counter for calculated jump tables and constants buried in code.
Which archaic 8-bit processor fits this description almost to a "t", if you overlook the clumsiness of the 8-bit direct page register and the lack of a bus signal to indicate when a return address is being pushed or popped? Heh.

I've found that this model supports pretty much all the hand-optimized code I've produced without excess data movement, and allows a very compact machine code expression.

Now, the 6809 doesn't quite fit this model. The direct page register can't be indexed, which is okay for compiled code but not for interpreted code.

And it doesn't have a separate address space for a return address stack. I have a vague memory about something in the timing diagrams that might allow synthesizing a separate address space, but that's probably wishful thinking. I don't have timing diagrams handy, so I can't check.

Also, the 16 bit address limits are a bit tight.

If having products of your own company that compete with each other were a bad thing, Motorola would have been right to have de-emphasized the 6809.

But look at Freescale's current lineup. PowerPC, ColdFire, ARM? They essentially have three of their own products competing directly with each other in a lot of target product spaces.

When any two are less popular, the third keeps things stable.

The only problem here is that Freescale's salescrew have to educate themselves a bit more about the lineup. That's actually a good thing, because there is nothing more damaging to an industry than a pack of lazy salespeople.

The 68K can be programmed to fit this model, with a bit less compact machine code, except for the separate address space for the return addresses. (Again, I need to check the timing diagrams and, with certain members of the 68K family, the memory access function codes, to see if there might be some way to cache the stack.) Still, with a 32 bit address, you can get the return addresses far enough away from everything else that accidentally overwriting the return address becomes much less common, and most of the attacks on code that involve buffer overruns and such to overwrite the return addresses become very difficult.

The 68K provides other ways to concatenate small integers, so natural concatenation of accumulators is not necessary.

The PowerPC (which Freescale manufactures in cooperation with IBM) can be programmed to fit this model, too, but it usually isn't. Too many registers go unused. (It would be an interesting research project into register counts to compare whether compiling to the large register set, single-stack model on the PowerPC is more efficient than compiling to the above model. Note that the PowerPC's code density is about 50% of the 68K's at best.)

The ARM (which Freescale also manufactures under license) can also be easily mapped to this model. 64 bit ARM should allow even better separation of the return address stack from other parts of memory.

Renesas's SuperH has some odd non-orthogonalities that require a bit of intimacy with the processor, but it can be mapped to this run-time model as well.

80x86? Not so much. Too much funky stuff happening when you aren't expecting it, too much optimizing to the standard C runtime model which INTEL helped create. AMD64, on the other hand, should map rather well to the model. Maybe, if the register renaming doesn't get too much in the way.

Now, for the 68K, ARM, SH3, and other processors with a few more registers than you need, the extra registers can always be put to use:
  • A third index register allows one to access two sources and a destination without having to save and restore index registers.
  • An extra index register could be used to differentiate between global constants and global variables. Yet another could be used for linking to global or parent process callbacks, particularly those for maintaining concurrent access coherency -- semaphores, monitors, mutual exclusion code, etc.
  • Another index register could maintain a second data stack, to separate parameters from local variables. (It's an artificial distinction, I know, but there are many distinctions not required by the math that help optimize code and execution.)
  • You might even dedicate an indexable register to executing the virtual instruction-code lists of a token-threaded interpreter.
Returning to the separate stack that is currently not part of the standard C runtime model: another benefit is reduced function call overhead. Since the return address is elsewhere, call frames really don't need special maintenance. If there is no need to walk the variable space of the calling routines, no frame pointers need to be generated at all, as the sketch below shows.
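To make that concrete, here's a sketch of a one-parameter call to a hypothetical leaf routine under this model (same 32-bit style as the gcd example above); note that neither side touches a frame pointer:

-------------------------------
         subl    $4, %ebp        /* allocate one parameter */
         movl    %eax, (%ebp)    /* pass it on the parameter stack */
         call    leaf            /* return address goes to the other stack */
         addl    $4, %ebp        /* de-allocate the parameter */

 leaf:
         movl    (%ebp), %eax    /* parameters addressed directly; */
         ret                     /* trivial body -- no prologue, no epilogue */
-------------------------------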

A downside is that it adds a separate page for memory management functions to keep track of, with the added complexity of the operating system needing to make sure the return address stack and the parameter stack both always stay in memory while the process is in memory.

However, if the CPU manufacturer could be convinced to help just a little, the return addresses could be cached in a dedicated stack-oriented cache, and the return address page management could be greatly simplified, compared to the rest of memory. Careful caching could basically postpone, or even eliminate, most of the remaining overhead of subroutine calls, reducing it to the cost of two ordinary jumps.

In addition, the return address stack does not have to maintain coherence with external memory. Spurious writes to the cache (writes, in other words, that don't come from the call mechanism itself) could be silently dropped, or could trigger a memory access violation.

(Certain kinds of exception processing are usually done by re-writing return addresses, but there are other, better ways to do that on a decent CPU.)

I'm going to make another wild prediction. Once people get used to having 16 registers to use as the code needs it, instead of being restricted to the standard C model that INTEL helped converge around their 80x86 execution model, we'll see more multi-stack solutions. One of the first splits will be the flow-of-control stack.

(More about this model some other time.)