Having said a bit about why I want to re-invent Unicode (so to speak), I want to rant a little about the overall structure, relative to languages, that I propose for this Common Code for Information Interchange, as I am calling it.
I've talked a little about the goals, and the structure, in the past. Much of what I said there I still consider valid, but I want to take a different approach here and look from the outside in a bit.
First, I plan the encoding to be organized in an open-ended way, the primary reason being that language is always changing.
Second, there will be a small subset devoted primarily to the technical needs of encoding and parsing, which I will describe in more detail in a separate rant.
Third, there will be an international or interlocality context or subset, which will be relatively small, and will attempt to include enough of each current language for international business and trade. This will look like a subset of Unicode, but will not strictly be one. I have not defined much of this, but I will describe what I can separately.
Parsing rules for this international subset will be as simple as possible, which means that they will depart, to some extent at least, from the rules of any particular local context.
Third, part two, there will be spans allocated for each locality within which context-local parsing and construction rules will operate.
Fourth, there will be room in each span for expansion, and rules to enable the expansion. Composition will be one such set of rules, and there will be room for dynamically allocating single code points for composed characters used in a document.
The methods of permanently allocating common composed characters should reflect the methods of temporary allocation.
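To make the temporary allocation idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the class name `DocumentAllocator`, the idea that each span reserves one contiguous expansion range, and the example range itself are not part of any settled design.

```python
class DocumentAllocator:
    """Hands out temporary single code points for composed sequences
    within one document (hypothetical sketch)."""

    def __init__(self, first, last):
        # first..last is a hypothetical expansion range inside one span.
        self.next_free = first
        self.last = last
        self.by_sequence = {}  # composed sequence -> temporary code point
        self.by_code = {}      # temporary code point -> composed sequence

    def allocate(self, sequence):
        """Return the code point for a sequence, allocating one if needed."""
        key = tuple(sequence)
        if key in self.by_sequence:
            return self.by_sequence[key]
        if self.next_free > self.last:
            raise OverflowError("expansion range exhausted")
        code = self.next_free
        self.next_free += 1
        self.by_sequence[key] = code
        self.by_code[code] = key
        return code

    def expand(self, code):
        """Recover the composed sequence behind a temporary code point."""
        return self.by_code[code]
```

A document would have to carry the table along with its text so that a reader can expand the temporary code points again. A permanent allocation could then be thought of as the same table, fixed once and shared by everyone, which is one way to read "reflect the methods of temporary allocation."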
Fifth, as much as possible, existing encodings will be included by offset. For instance, the JIS encoding will exist as a span starting at some multiple of 65536, which I have not yet determined, and the other "traditional" encodings will also have spans at offsets of some multiple of two. The rules for parsing will change for each local span.
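Inclusion by offset might look something like the following sketch. The span indices are made up, since no offsets have been determined, and for simplicity the sketch assumes every span sits at a multiple of 65536 and holds at most 65536 values. The names `SPAN_INDEX`, `to_common`, and `from_common` are hypothetical.

```python
SPAN_SIZE = 0x10000  # 65536 code values per span

# Hypothetical span indices; none of these have actually been assigned.
SPAN_INDEX = {"JIS": 3, "OTHER": 4}


def to_common(encoding, local_code):
    """Map a code value of an existing encoding into its span."""
    if not 0 <= local_code < SPAN_SIZE:
        raise ValueError("local code does not fit in one span")
    return SPAN_INDEX[encoding] * SPAN_SIZE + local_code


def from_common(code):
    """Split a common code value into (encoding name, local code)."""
    index, local_code = divmod(code, SPAN_SIZE)
    for name, span in SPAN_INDEX.items():
        if span == index:
            return name, local_code
    raise LookupError("no encoding assigned to span %d" % index)
```

With these made-up numbers, `to_common("JIS", 0x2422)` gives `0x32422`, and `from_common(0x32422)` gives back `("JIS", 0x2422)`. The span index is also what a parser would use to pick which local parsing rules apply.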
I've thought about giving Unicode a span, but am not currently convinced it is possible.
Of course, this means that the encoding is assumed to require more than will fit comfortably in four bytes after UTF-8 compression.
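For a sense of scale, here is the arithmetic as a small Python sketch. It follows the bit layout of UTF-8 as originally specified (up to six bytes); current UTF-8 stops at four bytes and at U+10FFFF. The function names are mine, and the assumption that a span is 65536 values wide comes from the JIS example above.

```python
def payload_bits(length):
    """Bits of code value carried by a UTF-8-style sequence of `length` bytes."""
    if length == 1:
        return 7  # 0xxxxxxx
    if not 2 <= length <= 6:
        raise ValueError("length must be between 1 and 6")
    lead_bits = 7 - length  # the lead byte spends length + 1 bits on the marker
    return lead_bits + 6 * (length - 1)  # each continuation byte carries 6 bits


def spans_available(length, span_bits=16):
    """How many 65536-wide spans fit in sequences of up to `length` bytes."""
    bits = payload_bits(length)
    return 1 << (bits - span_bits) if bits >= span_bits else 0


for n in range(1, 7):
    print(n, payload_bits(n), spans_available(n))
```

Four bytes carry 21 bits, which is room for only 32 spans of 65536. Five and six bytes would carry 26 and 31 bits.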
And thinking of UTF-8 brings me to the next rant.