Misunderstanding Computers

Why do we insist on seeing the computer as a magic box for controlling other people?
Why do we want so much to control others when we won't control ourselves?

Computer memory is just fancy paper, CPUs are just fancy pens with fancy erasers, and the network is just a fancy backyard fence.

(original post -- defining computers site)

Saturday, October 7, 2017

A Common Code for Information Interchange

I've been thinking about this topic since I first heard of plans for what became Unicode, back in the mid-1980s.

At the time, many in the industry still thought that 64K of RAM should be enough for general personal computing, and many people thought 65,536 characters should be enough to cover at least all modern languages. I told as many people as I could that Japanese alone had more than 20,000 characters and that Chinese had an estimated count in the range of a hundred thousand, and no one believed me. I also tried to tell people that we shouldn't conflate Japanese and Chinese characters, but I had a hard time convincing even myself of that.

I also tried to tell people that the Kanji radicals should be encoded first, but the Japanese standards organization wasn't doing that, so why should anyone believe it?

As I noted recently, some of the problems of the current approach to unifying the world's character sets are becoming obvious.

Each natural human language is its own context, and each language has its own set of sub-contexts, which we call dialects. But neither the contexts nor the sub-contexts nor the sets of sub-contexts are well defined, mathematically speaking, which means that mathematical methods cannot perfectly parse any natural language.

Therefore, programs which work with human languages are necessarily buggy. That is, we know that, no matter how carefully they are constructed, they will always contain errors.

When we combine language contexts, we compound their error rates, and the result is at best multiplicative, not simply additive. So we really should not want to do that. But that's what Unicode tries to do -- combine character codes for all relevant languages into one over-arching set.
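To make the "at best multiplicative" point concrete, here is some illustrative arithmetic with made-up rates (the 99% figures are assumptions for the example, not measurements):

```python
# Suppose each language context is handled correctly 99% of the time.
p_single = 0.99

# If the errors are independent, the combined correctness rate is the
# product of the individual rates -- and independence is the best case;
# interactions between contexts can only make things worse.
p_combined = p_single * p_single

assert p_combined < p_single          # combining contexts always loses ground
print(f"{p_combined:.4f}")            # 0.9801 -- nearly double the error rate
```

The best case is the product of the individual correctness rates; any interaction between the contexts pushes the result below that.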

Actually, for all my pessimism, Unicode works about as well as we should expect it to. I just want something better, but it's hard to describe exactly what that something is. This rant is an attempt to do so.

With just US English, it's fairly easy to construct a text editor. Parsing the entered text requires around ten simple functions, and visual formatting less than ten more. Word processors are fairly straightforward, as well.
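As a sketch of the kind of simplicity meant here (the function names and the 72-column width are my own choices, not anything canonical): with plain ASCII, word breaking is just whitespace splitting, and visual width equals character count, so a greedy line filler is a few lines.

```python
def words(text):
    # In plain US-English ASCII text, a word break is just whitespace.
    return text.split()

def wrap(text, width=72):
    # Greedy line fill. One character == one byte == one display column,
    # so len() is the visual width -- none of that holds under Unicode.
    lines, line = [], ""
    for w in words(text):
        if line and len(line) + 1 + len(w) > width:
            lines.append(line)
            line = w
        else:
            line = f"{line} {w}" if line else w
    if line:
        lines.append(line)
    return lines

print(wrap("a bb ccc", width=4))  # ['a bb', 'ccc']
```

Most of the remaining functions in a real editor (cursor movement, deletion, search) are similarly small for the same reason: every unit of storage is a unit of display.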

With Unicode, a simple text editor requires more like a hundred functions, interacting in ways that are anything but obvious.

And if you need to rely on what you read in the text, as I noted in the rant linked above, you find that displaying the text reliably adds significantly more complexity.
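A minimal illustration of why the simple assumptions break, using Python's standard-library `unicodedata` (the particular strings are just examples): the same visible text can be more than one code-point sequence, so even equality testing requires normalization.

```python
import unicodedata

s1 = "\u00e9"        # "é" as one precomposed code point
s2 = "e\u0301"       # "é" as "e" plus a combining acute accent

# Identical on screen, different in memory.
assert s1 != s2
assert len(s1) == 1 and len(s2) == 2

# Comparing reliably requires normalizing first.
assert unicodedata.normalize("NFC", s2) == s1
```

And normalization is only one of the interacting concerns; grapheme clusters, bidirectional text, and variable display widths each add their own layer.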

Actually, US English is, relatively speaking, almost unnaturally simple to parse. That's why it has been adopted over French, Spanish, Russian, and German; why you don't hear much about Japanese plans to make Japanese the international language; and why the Chinese Communist Party's dream of making Chinese the international language will never fly, no matter how large a fraction of the world's population ostensibly speaks Chinese as a native or second language.

Memorizing 9000+ characters for basic literacy requires starting at the age of two, I hear.

The Chinese may claim a full third, but the other two thirds are not going to happily and willingly accept being forced to propagandize their children (or themselves) with that many characters just to be literate. That alone is oppressive enough to prevent a productive peace.

Even the Japanese school-literacy subset of roughly two thousand characters basically requires all twelve years of primary and secondary school to complete.

If we could reduce that burden by teaching the radicals first (we Westerners call the sub-parts of Kanji "radicals"), we might have some hope of addressing the difficulty, but the radicals themselves are an added layer of parsing. That's multiplicative complexity, which is one reason the approach has not been successful as a general one. (It is taught, I understand, in some schools of Japanese calligraphy, but those reach only a small fraction of the population.)

And the rules for assembling and parsing the radicals are anything but simple.

Now, you may be wondering why I think the radicals should be prioritized in the encoding. The dirty secret of Kanji is that they are not a closed set, any more than English vocabulary is a closed set: every now and then someone invents a new one.

Methods to address new coinage must be part of the basic encoding.
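Unicode itself gestures at this with Ideographic Description Characters (U+2FF0 through U+2FFB), which let you describe an unencoded character as a composition of existing ones -- though the result is only a description, not a first-class character that software will render as one glyph. A minimal sketch:

```python
# U+2FF0 is the "left-to-right" composition operator; 木 (U+6728, "tree")
# is a common radical. The sequence ⿰木木 describes the same shape as the
# separately encoded character 林 (U+6797, "grove").
IDC_LEFT_RIGHT = "\u2ff0"
ids = IDC_LEFT_RIGHT + "\u6728\u6728"

# Three code points: one operator and two operands.
assert len(ids) == 3
# The described character and the encoded character remain unrelated
# as far as string comparison is concerned.
assert "\u6797" not in ids
```

An encoding that treated such compositions as characters in their own right, rather than as annotations, is one way new coinage could be handled at the encoding level.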

This is getting long, so I'll wrap up this rant on my motivations for considering something to supersede Unicode here.

I wrote up a summary list of overall goals about three years back, here.

As I've said elsewhere, Unicode has served a purpose until now, and will continue to do so for a few more years, but we need something better.

It needs to provide better separation for the contexts of languages.
