defining computers: June 2017

Misunderstanding Computers

Why do we insist on seeing the computer as a magic box for controlling other people?
Why do we want so much to control others when we won't control ourselves?

Computer memory is just fancy paper, CPUs are just fancy pens with fancy erasers, and the network is just a fancy backyard fence.

(original post -- defining computers site)

Friday, June 30, 2017

Keeping the Return Address Stack Separate

So far, in this extended rant, I have

tried to show how to look at relevant addresses on your computer
tried to explain how current OSes still let memory regions clash
and talked about using existing processor tech to mitigate the problem.

Now I want to show how a CPU could completely prevent an entire subclass of these attacks, without a whole lot of loss in processor speed.

I first came upon these ideas twenty, maybe thirty years ago, when trying to figure out what made the M6809 such a magical microprocessor. (That's a subject for another day.) The M6809 was (and still is) an 8-bit microprocessor with a 16-bit address bus. That means it can address 64 Kbytes of memory.

Motorola specified a memory management part called the 6829, which supported designs up to 1 megabyte of memory. It was essentially a block of fast RAM used to translate the upper bits of the address, plus a latch that selected which part of that RAM would do the translating, something like this:

(This is from memory, and not really complete. Hopefully, it's enough to get the concept from.)
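The translation idea can also be sketched in code. This is a minimal Python model of a 6829-like scheme, not the part's actual register layout: the 2 KB page size, the four tasks, and the names `Mmu`, `map_page`, `translate`, and `ProtectionFault` are all assumptions made for illustration.

```python
PAGE_BITS = 11                            # assumed 2 KB pages: low 11 bits pass through
PAGES_PER_TASK = 1 << (16 - PAGE_BITS)    # 32 pages cover the 64 KB logical space


class ProtectionFault(Exception):
    """Raised when an access violates a page's protection bits."""


class Mmu:
    def __init__(self, tasks=4):
        # One translation table per task. Each entry is
        # (physical page number, readable, writable); start with identity maps.
        self.tables = [[(page, True, True) for page in range(PAGES_PER_TASK)]
                       for _ in range(tasks)]
        self.task_latch = 0               # selects which table is active

    def map_page(self, task, logical_page, physical_page,
                 readable=True, writable=True):
        self.tables[task][logical_page] = (physical_page, readable, writable)

    def translate(self, logical_addr, write=False):
        page = logical_addr >> PAGE_BITS
        offset = logical_addr & ((1 << PAGE_BITS) - 1)
        physical_page, readable, writable = self.tables[self.task_latch][page]
        if write and not writable:
            raise ProtectionFault(f"write to protected page {page}")
        if not write and not readable:
            raise ProtectionFault(f"read from guard page {page}")
        # Upper bits come from the table, lower bits pass straight through.
        return (physical_page << PAGE_BITS) | offset
```

Marking a page neither readable nor writable turns it into a guard page, and changing `task_latch` swaps the whole map at once.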

Memory management control would provide functions like write protect and read protect, so you could keep the CPU from overwriting program code and set parts of the address space up as guard pages.

With 32 or more bits of address, memory management doesn't work quite this simply, but this is enough to give some confidence that memory management can actually and meaningfully be done.

Now, if you are familiar with the 8086, you may wonder what the difference between the 8086's segment registers and this would be. This kind of scheme provides fairly complete control over the memory map.

The 8086 segment registers only moved 64K address windows around in the physical map, and provided no read or write control. Very simple, but no real management. The 80286 provided write protect and such, but the granularity was still abysmal, mostly based on guesses about the usage of certain registers, guesses which sort of worked with some constrained C programming language run-time models. And these guesses were frozen into silicon before they were tested. It should go without saying that such guesses miss the mark for huge segments of the industry, but Intel's sales crew has always been trained in the art of smooth-talking.

(Intel are not the only bad guys in the industry; they are just the ones who played this particular role.)
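For comparison, the 8086 real-mode calculation fits in one line: shift the segment register left four bits, add the 16-bit offset, and keep 20 bits. The function name is just a label for this sketch.

```python
def real_mode_address(segment, offset):
    """8086 real-mode physical address: (segment * 16 + offset), 20 bits wide."""
    return ((segment << 4) + (offset & 0xFFFF)) & 0xFFFFF


# Two different segment:offset pairs naming the same physical byte.
same_byte = real_mode_address(0x1234, 0x0010) == real_mode_address(0x1235, 0x0000)
```

Nothing in that arithmetic says whether the access is a read or a write, or whether the program should be allowed to make it, which is the "no real management" point.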

Now I knew about both of these approaches, and I knew about the split stack in Forth. And it occurred to me that, if a 6829-like MMU could talk to the CPU, and select a different task latch on accesses through the return stack pointer (S in the 6809), you could make it completely impossible to crash the return stack and overwrite return addresses.

I'm not talking about guard pages. I'm saying that the return addresses simply can't be accessed by any means except call and return. They're outside the range of addresses that application code can generate.

Of course, the OS kernel can access the stack regions by mapping them in, but we expect the OS to behave itself.

(We would provide system calls to allow an application to have the OS adjust a return address when such is necessary.

Also, since we are redesigning the CPU, we might add instructions for exceptional return states, but I would really rather not do that. It seems redundant, since split stacks make multiple return values so much easier.)
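A toy model may make the "outside the range of addresses" idea concrete. In this Python sketch, `ToyCpu` and its methods are hypothetical names, and a private list stands in for the separately mapped return stack. Loads and stores can only index data memory, so no store address names a return-stack slot.

```python
class ToyCpu:
    """Toy machine whose return addresses live outside the data address space."""

    def __init__(self, data_size=256):
        self.data = bytearray(data_size)   # everything load/store can reach
        self._return_stack = []            # touched only by call() and ret()
        self.pc = 0

    def _check(self, addr):
        if not 0 <= addr < len(self.data):
            raise IndexError(f"address {addr:#x} is outside data memory")

    def store(self, addr, value):
        self._check(addr)
        self.data[addr] = value & 0xFF

    def load(self, addr):
        self._check(addr)
        return self.data[addr]

    def call(self, target):
        self._return_stack.append(self.pc + 1)   # save the return address
        self.pc = target

    def ret(self):
        self.pc = self._return_stack.pop()
```

Even a program that scribbles over every data address it can generate still returns to the right place, because the return address was never in that space.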

Another thing that occurred to me is that the stack regions could be mapped to separate RAM from the main memory. This would allow calls and returns that would take no more time than regular branches or jumps.

At this point in my imaginings, I'm thinking about serious redesign of the CPU. So I thought about adding one more stack register to the 6809, a dedicated call/return stack. It would never be indexed, so it would be a very simple bit of circuitry. That would free up registers for other use, including additional stacks and such.

(Well, if we allow frame pointers to be pushed with the instruction pointers, and provide instructions for walking the stack, there would be one kind of indexing -- an instruction to fetch a frame pointer at a specific level above the current one. I'll explain how this would work, to aid understanding what's going on here:

There would be a couple of bits in the processor status area, which the OS would set before calling the application startup. The application must not be allowed to modify these bits, but, since the application must be able to confirm that the frames are present, it should be able to read them.

These bits would tell the processor which stack pointers to save with the instruction pointer on calls. The return instructions would have a bit field to determine whether to restore or discard each saved frame pointer.

"Walking the stack" would be simply a load of a specified saved frame pointer at a specified level of calling routine.

In the example shown here, the instruction GETFP sees from the status register that both LP and SP are being recorded, and multiplies the index argument by 3, then adds 2 to point to LP, checks against the return stack base register, and loads LP0 into X.

But GETFP SP,3,Y, after computing an address in stack that isn't there, checks the return stack base register and refuses to load the frame pointer that isn't there.

Another flag in the status register might select between generating an exception on failure and recording the failure in a status bit.

Maybe. :-/)
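The index arithmetic described for GETFP can be sketched in Python. The frame layout here is an assumption made for the sketch: three cells per frame in the order IP, SP, LP, with the newest frame first. The function name and the choice to report failure by returning `None` are likewise only illustrative.

```python
SLOT = {"IP": 0, "SP": 1, "LP": 2}   # assumed cell order within one frame
CELLS_PER_FRAME = 3                  # IP plus both saved frame pointers


def getfp(return_stack, which, level):
    """Return the saved frame pointer `which` from `level` frames back.

    `return_stack` is a list of cells with the newest frame first, so
    index 0 plays the role of the return stack pointer and
    len(return_stack) plays the role of the base register.
    Returns None when the requested frame lies beyond the base.
    """
    offset = level * CELLS_PER_FRAME + SLOT[which]   # e.g. level * 3 + 2 for LP
    if offset >= len(return_stack):                  # check against the stack base
        return None
    return return_stack[offset]
```

With two frames on the stack, `getfp(stack, "LP", 1)` fetches the older frame's saved LP, while `getfp(stack, "SP", 3)` points past the base and is refused.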

Could we do such a thing with the 68000 or other 32-bit CPUs? Add a dedicated call/return stack and free the existing stack pointer for use as a parameter stack in a split-stack architecture? 64-bit CPUs?

Sure.

But if we intend to completely separate the return addresses, we have to add at least one bit of physical address, or we have to treat at least one of the existing address bits (the highest bit) in a special way. I think I'd personally want to lean towards adding a physical address bit, even for the 64-bit CPUs, to keep the protection simple. But, of course, there are interesting possibilities with keeping the physical and logical addresses the same size, but filtering the high bit in user mode.
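One possible reading of the second option, with physical and logical addresses the same size and the high bit filtered in user mode, is sketched below in Python. Using the top bit as the selector and the function names are assumptions made for the sketch.

```python
ADDRESS_BITS = 64
RETURN_SPACE_BIT = 1 << (ADDRESS_BITS - 1)   # assumed: top bit selects return-stack space


def user_data_address(addr):
    """Address put on the bus for an ordinary user-mode load or store."""
    return addr & (RETURN_SPACE_BIT - 1)      # high bit forced to 0


def return_stack_address(rp):
    """Address put on the bus for a call/return access through the return pointer."""
    return (rp & (RETURN_SPACE_BIT - 1)) | RETURN_SPACE_BIT   # high bit forced to 1
```

Whatever value user code computes, the filtered address lands in the lower half, and only the call/return path ever produces an address in the upper half.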

And that would provide us with a new kind of level-1 cache -- 8, 16, or maybe 32 entries of spill-fill cache attached to the call/return stack, operating in parallel with a (modified) generalized level-1 cache. The interface between memory management and cache would need a bit of redesign, of course.
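As a rough illustration of the spill-fill idea, here is a Python sketch of a return stack with a small fast window backed by slower memory. The class name and counters are hypothetical, and it spills and fills one entry at a time for simplicity, where real hardware would more likely move blocks.

```python
class SpillFillStack:
    """Return stack with a small on-chip window backed by slower memory."""

    def __init__(self, cache_entries=8):
        self.cache_entries = cache_entries
        self.cache = []      # newest entries, held next to the CPU
        self.backing = []    # older entries, spilled to main memory
        self.spills = 0
        self.fills = 0

    def push(self, return_address):
        if len(self.cache) == self.cache_entries:
            # Window full: spill the oldest cached entry to memory.
            self.backing.append(self.cache.pop(0))
            self.spills += 1
        self.cache.append(return_address)

    def pop(self):
        if not self.cache:
            # Window empty: refill from memory.
            if not self.backing:
                raise IndexError("return stack underflow")
            self.cache.append(self.backing.pop())
            self.fills += 1
        return self.cache.pop()
```

As long as call depth stays within the window, `spills` and `fills` stay at zero, which is the case where calls and returns cost no more than ordinary branches.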

I'm not sure it would mix well with register renaming. At bare minimum, the call/return stack pointer would have to be completely separate from the rename-able registers.

Would this require rewriting a lot of software?

Some, but mostly just the programming language compilers would need to be worked on.

And most of the rewrite would focus on simplification of code designed to work around the bottleneck of having the return addresses mixed in with parameters.

There you have a way to completely protect the return addresses on stack.

What about other regions of memory? Can we separate them meaningfully?

That kind of thing is already being done in software used on real mainframes, so, yes. But it does have a much larger impact on existing software and on run-time speed, and it is not as simply accomplished.

But that's actually a question I want to visit when I start ranting about the ideal processor that I want to design but will probably never get a chance to. Later.

Wednesday, June 28, 2017

Proper Use of CPU Address Space

I referred to this in my overview about re-inventing the industry, but I was not very specific. Now that I've been motivated to write a rant or two about memory maps and how they can be exploited, I can write about the ideal, perfect CPU (that may be too perfect for this world), and how it works with memory.

First, these are the general addressable regions of memory that you want to be able to separate out. I'll put them in the order I've been using in the other two rants:

0x7FFFFFFFFFFFFFFF
stack (dynamic variables, stack frames, return pointers)
0x7FFFxxxxxxxxxxxx ← SP
gap
guard page (Access to this page triggers OS responses.)
gap
heap (malloc()ed variables, etc.)
statically allocated variables
0x4000000000000000
application code
0x2000000000000000
operating system code, variables, etc.
0x0000000000000000

The regions we see are

Stack (dynamic variables, stack frames, return pointers)
Heap (malloc()ed variables, etc.)
Application code (including object code, constants, linkage tables, etc.)

Operating system code should include the same sort of regions. But that should not really be visible in the application code map. Only the linkage to the OS should be visible, and that would be clumped with the application code.

Memory management hardware provides the ability to move OS code out of the application map. Let's see how that would look:

We used to talk about the problems of accidentally using small integers as pointers. Basically, when pointer variables get overwritten with random integers, the overwriting integers tend to be relatively small integers. Then when those integers are used as pointers, they access arbitrary stuff in low memory. We can notice that and refrain from allocating small integer space. And we realize that we have already dealt with small negative integers by buffering the wraparound into highest memory:

0xFFFFFFFFFFFFFFFF
gap (wraparound and small negative integers)
0x8000000000000000
stack (dynamic variables, stack frames, return pointers)
0x7FFFxxxxxxxxxxxx ← SP
gap
guard page (Access to this page triggers OS responses.)
gap
heap (malloc()ed variables, etc.)
statically allocated variables
0x4000000000000000
application code
0x0000000100000000
gap (small integers)
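The effect of the two gaps in the map above can be checked with a few lines of Python. The two boundary constants are read off that map; the function name is hypothetical.

```python
LOW_GUARD_END = 0x0000_0001_0000_0000      # bottom of application code in the map
HIGH_GUARD_START = 0x8000_0000_0000_0000   # start of the wraparound gap


def in_guard_gap(pointer):
    """True if a 64-bit pointer value lands in one of the two unmapped gaps."""
    pointer &= 0xFFFF_FFFF_FFFF_FFFF       # view negative ints as 64-bit two's complement
    return pointer < LOW_GUARD_END or pointer >= HIGH_GUARD_START
```

A pointer variable overwritten with 42 or with -1 lands in a gap and faults, while a legitimate address such as 0x4000000000000000 does not.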

I've posted a rant about using a split stack, with a little of the explanation for why at the end. Basically, that would allow us to move those local buffers that can overflow, crash, and/or smash the stack way away from the return address stack.

Thus, even if the attacker could muck in the local variables, he would still be at least one step from overwriting a return address. That means he has to use some harder method to get control of the instruction pointer.
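A toy comparison in Python shows the difference. Both functions are hypothetical and heavily simplified (lists instead of memory, no particular growth direction), and the unchecked loop stands in for something like an unbounded strcpy().

```python
def call_with_unified_stack(payload, buffer_size=8):
    """Locals and return address share one stack; an unchecked copy can reach both."""
    stack = [0] * buffer_size + ["return_to_caller"]   # buffer, then return address
    for i, item in enumerate(payload):                 # unchecked copy
        stack[i] = item
    return stack[buffer_size]                          # where the function will return


def call_with_split_stack(payload, buffer_size=8):
    """Same unchecked copy, but the return address sits on a separate stack."""
    locals_stack = [0] * (buffer_size + 4)             # buffer plus neighboring locals
    return_stack = ["return_to_caller"]
    for i, item in enumerate(payload):
        locals_stack[i] = item                         # can trash locals, nothing more
    return return_stack[-1]


payload = [0x41] * 8 + ["attacker_code"]               # one item longer than the buffer
```

With this payload the unified version "returns" to `attacker_code`, while the split version still returns to its caller, though its neighboring locals have been damaged.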

Stack usage patterns actually point us to using a third stack, or a stack-organized heap separate from the random allocation heap. Parameters and small local variables could be on one stack, and large local variables on the other.

In other words, scalar local/dynamic variables would be on the second stack and vector/structure local/dynamic variables on the third. This would be especially convenient for Forth and C run-times, virtually eliminating all need of function preamble and cleanup, and simplifying stack management.

Another way to use the third stack would be to just put all the local variables on it. It might be easier to understand it this way, and I'll use the parameter/locals division below. As far as the discussion below goes, the two divisions can be interchanged. (The run-time details are significant, but I'll leave that for another day. Besides, there is no reason for a single computer to limit itself to one or the other. With a little care, the approaches could even be mixed in a running process.)

But the third stack could be optional, and its use determined by the language run time support. The OS run-time support really doesn't need to see it other than as a region to be separated from the others. Here is a possible general map, using 64 bit addressing:

0xFFFFFFFFFFFFFFFF
gap (wraparound and all negative integers)
0x8000000000000000
gap (large positive integers)
0x7FFFFF0000000000
gap
return stack ← RP
gap
0x7FFFFE0000000000
guard page (2⁴⁰ addresses)
0x7FFFFD0000000000
gap
parameter stack ← SP
gap
0x7FFFFC0000000000
guard page (2⁴⁰ addresses)
0x7FFFFB0000000000
gap
local stack ← LP
gap
0x7FFFFA0000000000
guard page (really huge)
0x7000000000000000
gap
heap (malloc()ed variables, etc.)
gap
0x4000020000000000
guard page (2⁴⁰ addresses)
0x4000010000000000
gap
statically allocated variables
gap
0x4000000000000000
gap
application code
gap
0x0000010000000000
gap (small positive integer pointer guard)
0x0000000000000000
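To make the map above easier to poke at, here it is as a lookup table in Python. The boundaries are copied from the map; the short region labels and the `region_of` helper are just names for this sketch, and the randomized gaps inside each region are not modeled.

```python
import bisect

# (start address, region) pairs taken from the map above, lowest first.
REGIONS = [
    (0x0000000000000000, "small positive integer pointer guard"),
    (0x0000010000000000, "application code"),
    (0x4000000000000000, "statically allocated variables"),
    (0x4000010000000000, "guard page"),
    (0x4000020000000000, "heap"),
    (0x7000000000000000, "guard page"),
    (0x7FFFFA0000000000, "local stack"),
    (0x7FFFFB0000000000, "guard page"),
    (0x7FFFFC0000000000, "parameter stack"),
    (0x7FFFFD0000000000, "guard page"),
    (0x7FFFFE0000000000, "return stack"),
    (0x7FFFFF0000000000, "large positive integer gap"),
    (0x8000000000000000, "wraparound and negative integer gap"),
]
STARTS = [start for start, _ in REGIONS]


def region_of(addr):
    """Name the region of the map that contains a 64-bit address."""
    addr &= 0xFFFFFFFFFFFFFFFF
    return REGIONS[bisect.bisect_right(STARTS, addr) - 1][1]
```

For example, `region_of(0x7FFFFE8000000000)` falls in the return stack's window, and any address a stack can run into on either side of that window is a guard page.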

If we choose to have stack frames, we could manage them very simply on the return stack by just pushing the local and/or parameter stack pointer when we push the IP. And we just discard them when we pop the IP. Or we can pop them, to force-balance the stack. This gets rid of pretty much all the complexity of walking the stack.
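Here is a small Python sketch of that call/return behavior, under the assumption that both the parameter and local stack pointers are saved with every call. The `Machine` class and the `restore` flag are hypothetical names, not a proposed instruction encoding.

```python
class Machine:
    """Toy CPU state: instruction pointer plus two data stack pointers."""

    def __init__(self):
        self.ip = 0
        self.sp = 0               # parameter stack pointer
        self.lp = 0               # local stack pointer
        self.return_stack = []

    def call(self, target):
        # Push the frame pointers together with the return address.
        self.return_stack.append((self.ip + 1, self.sp, self.lp))
        self.ip = target

    def ret(self, restore=False):
        ip, saved_sp, saved_lp = self.return_stack.pop()
        if restore:
            # Force-balance both data stacks back to their state at the call.
            self.sp, self.lp = saved_sp, saved_lp
        # Otherwise the saved pointers are simply discarded.
        self.ip = ip
```

Either way, the frame record comes and goes with the return address, so there is no separate frame chain to maintain.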

The gaps should be randomized, to make it harder for attacker code to find anything to abuse.
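Randomizing the gaps amounts to picking where, inside its window, each region actually starts. A minimal Python sketch follows, assuming a 4096-byte page and window bounds that are themselves page-aligned; the function name and parameters are hypothetical.

```python
import secrets

PAGE = 4096                            # assumed page size
_system_rng = secrets.SystemRandom()   # OS-provided randomness


def place_region(window_start, window_end, region_size, rng=_system_rng):
    """Pick a page-aligned base for a region inside its window.

    The space left over before and after the region becomes the two gaps.
    """
    slack_pages = (window_end - window_start - region_size) // PAGE
    if slack_pages < 0:
        raise ValueError("region does not fit in its window")
    return window_start + rng.randrange(slack_pages + 1) * PAGE
```

For instance, `place_region(0x7FFFFE0000000000, 0x7FFFFF0000000000, 1 << 20)` would place a 1 MB return stack somewhere in its 2⁴⁰-address window, at a different spot on each run.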

The regions we now have are

Return Stack (return address and maybe frame pointers)
Parameter stack (parameters only)
Locals Stack (dynamically allocated local variables)
Heap (malloc()ed variables, etc.)
Statically allocated process variables (globally and locally visible)
Application code (including object code, constants, linkage tables, etc.)

And we have large guard regions between each.

What's missing?

Multiprocessing requires a region of memory dedicated to process (or thread) shared variables, semaphores, resource monitor counters and such. This is a separate topic, but basically the statically allocated variable area would have a section which could be protected from bare writes, with only reads and locked read-modify-write cycle instructions allowed. These would be a separate region, so their addresses could be somewhat randomized.
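The access rule for that region (reads allowed, bare writes not, updates only through a locked read-modify-write) can be modeled in Python. This is a software stand-in with hypothetical names, with a `threading.Lock` standing in for the hardware's locked cycle.

```python
import threading


class SharedCell:
    """A shared variable that allows reads and locked read-modify-write only."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def read_modify_write(self, update):
        # The whole read-modify-write cycle happens under the lock,
        # standing in for a locked bus cycle. Returns the old value.
        with self._lock:
            old = self._value
            self._value = update(old)
            return old
```

There is deliberately no plain write method, so every change to the cell goes through the locked path, as a semaphore or monitor counter would need.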

I'm not sure that it makes sense to manage allocation of shared variables in the malloc() sense, but there is room with this kind of scheme, and modern processors should support that many different regions of memory.

Also, regions of memory shared mmap-style would be in a separate region, or perhaps a guarded region for each. I'm not sure whether they would be protected in the same way as semaphores and monitor counters. It would seem, rather, that the CPU instructions would be ordinary instructions, and the mmap region would be a resource protected by semaphore- or monitor-controlled access.

We can do the same sort of thing with 32-bit addressing, although, instead of guard pages 2⁴⁰ or so in size, we would be looking at guard pages between 2²⁰ and maybe 2²⁴ in size. This would be more appropriate for some controller applications.

We could do the same thing with 16-bit addressing, but it wouldn't leave us much room for the variables and code. On the other hand, looking twice at 16-bit addressing will give us clues for further refinement of these ideas. But I think I'll save that for another rant, probably another day. I have burned up enough of today on this prolonged rant.

Monday, June 12, 2017

Reinventing computers.

I mention my bad habits a bit, but I don't really go into much detail:

One of these days I'll get someone to pay me to design a language that combines the best of Forth and C.
Then I'll be able to leap wide instruction sets with a single #ifdef, run faster than a speeding infinite loop with a #define,
and stop all integer size bugs with my bare cast.

I recall trying to get a start ages ago, on character encoding, CPUs, and programming languages, and, more recently, more on character encoding.

These are areas in which I think we have gone seriously south with our current technology.

First and foremost, we tend to view computers too much as push-button magic boxes and not enough as tools.

Early PCs came with a bit of programmability in them, such as the early ROMmed BASIC languages, and, more extensively, toolsets like the downloadable Macintosh Programmers' Workbench. Office computers also often came with the ability to be programmed. Unix (and Unix-like) minicomputers and workstations generally came with, at minimum, a decent C compiler and several desktop calculator programs.

Modern computers really don't provide such tools any more. It's not that they are not available, it's that they are presented as task-specific tools, and you often have to pay extra for them. And they are not nearly as flexible (MSExcel macros?).

Computers were not given to us to use as crutches. They were given to us to help us communicate and to help us think.

I'm not alone in my interest in retro-computing, but I think my ultimate goal is a little unusual.

I want to go back and re-open certain paths of exploration that the industry has lopped off as being too unprofitable (or, really, too profitable for someone else).

One is character encoding. Unicode is too complicated. Complicated is great for big companies who want to offer a product that everyone must buy. The more complicated they can make things, the harder it is for ordinary customers to find alternatives. And that is especially true if they can use patents and copyrights on the artificial complexities that they invent, to scare the customer away from trying to solve his or her own problems -- or their own corporate problems, in the case of the corporate customer.

Computers are supposed to help us solve our own problems, not to impose our own solutions on unsuspecting other people, while making them pay for the solutions that really don't solve their problems.

Now, producing something simpler than Unicode is going to be hard work, harder even than putting the original Unicode together was.

Incidentally, for all that I seem to be disparaging Unicode, the Unicode Consortium has done an admirable job, and Unicode is quite useful. They just made a conscious decision to try not to induce changes in the languages they are encoding. It's a worthy and impossible goal.

And they should keep it up. Even though it's an impossible goal, their pursuing that goal is enabling us to communicate in ways we couldn't before.

But we must begin to take the next step.

Rehashing,

The encoding needs to include the ability to encode a single common set of international characters/glyphs in addition to all the national encodings.
It needs to include
- characters,
- non-character numerics,
- bitmap and vector images,
- and other arbitrary (binary/blob) data.
It needs to be easily parsed with a simple, regular encoding grammar.
And it needs to be open-ended, allowing new words and characters to be coined on-the-fly.

Another path involves CPUs. Intel wants us all to believe that they own the pinnacle of CPU design, but, of course, that is just corporate vanity.

In the embedded world, lots of CPUs that the rest of the world has forgotten are still very much in use, because their designs are optimal in specific engineering contexts. Tradition is also influential, but there are real, tangible engineering reasons that certain non-mainstream CPUs are more effective in certain application areas. The complexity and temporal patterns of the input will favor certain interrupt architectures. The format of the data will favor specific register set constructions. Etc.

Many engineers will acknowledge the old Motorola M6809 as the most advanced 8-bit CPU ever, but it seems to have been a dead-end. ("Seems." It is still in use.) "Bits of it lived on in the 68HC12 and 68HC16." But the conventional wisdom is now that, if you need such an advanced CPU, it's cheaper to go with a low-end 32-bit ARM processor.

What got left behind was the use of a split stack.

The stack is where the CPU keeps a record of where it has been as it chases the branching trails of a problem's solution. When the CPU reaches a dead end, the stack provides an organized structure for backtracking and starting back down new branches in the trail.

Even "stackless" run-time environments tend to imitate stacks in the way they operate, because of a principle called problem context, in addition to the principle of backing out of a non-workable solution.

But the stack doesn't just track where the CPU has been. It also keeps the baggage the CPU carries with it, stuff called local (or context-local) variables. Without the data in that baggage, it does no good for the CPU to try to back up. The data is part and parcel of where it has been.

Most "modern" CPUs keep the code location records in the same memory as the context-local data. It seems more efficient, but it also means that a misbehaving program can easily lose track of both the context data and the code location at once. When that happens, there is no information to analyze about what went wrong. The machine ends up in a partially or completely undefined state.

Worse, in a hostile environment, such a partially defined state provides a chance for attacking the machine and the persistent data that it keeps on the hard disk. (Stack crashes are most effective when the state of the program has already become partially undefined.)

[JMR201706301711: I've recently written rather extensively on stack vulnerabilities and using a split stack to reduce the vulnerabilities:

tried to show how to look at relevant addresses on your computer
tried to explain how current OSes still let memory regions clash
and talked about using existing processor tech to mitigate the problem.

]

Splitting the stack allows for more controlled recovery from error states that haven't been provided for. In the process, it reduces the surface area susceptible to attack.

The split stack also provides a more flexible run-time architecture, which can help engineers reduce errors in the code, which means fewer partially-defined states.

There are a couple of other areas in which so-called modern CPUs in use in desktop computers and portable data devices are not well matched to their target application areas, and the programming languages (and operating systems), reflecting the hardware, are likewise not well-matched. This is especially true of the sort of problems we find ourselves trying to solve, now that we think most of the easy ones have been solved.

In order to flesh out better CPU architectures, I want to build a virtual machine for the old M6809, then add some features like system/user separation, and then design an expanded address space and data register CPU following the same principles.

I'm pretty sure it will end up significantly different from the old 68K family. (The M6809 is often presented as a "little brother" to the 68K, but they were developed separately, with separate design goals. And Motorola management never really seemed to understand what they had in the 6809.)

Once I have an emulator for the new CPU, I want to develop a language that takes advantage of the CPU's features, allowing for a richer but cleaner procedural API that becomes less of a data and time bottleneck. It should also allow for a more controlled approach to multiprocessing.

And then I want to build a new operating system on what this language and CPU combination would allow, one which would allow the user to be in control of his tools, instead of the other way around.

This is what I mean when I say I am trying to re-invent the industry.