7. The "Astral Planes":
Supplementary Planes and Greek


Originally, Unicode was intended to be a two-byte standard, as opposed to a one-byte standard: there would be 16 bits allocated per codepoint, as opposed to 8. In that way, Unicode would fall into line with the East Asian character sets, which were already using two bytes, while most other character sets used just one.

Although it is still frequently assumed that Unicode takes up 16 bits per codepoint (giving it a maximum of 64K—65,536 characters), this is no longer the case: there was not enough room in that space to fit in all the living scripts of the world, let alone historical scripts. And even if the historical scripts like Egyptian and Mayan hieroglyphics were ignored, following the "Don't Proliferate, Transliterate" principle, the need to fit in extra characters for Cantonese alone guaranteed Unicode would have to expand its repertoire.

So as of Unicode 3.0.1 (August 2000), Unicode is organised into 17 planes, each of 64K; this gives over a million codepoints (1,114,112 in all), which should be enough for all needs, past, present and future. The Basic Multilingual Plane (BMP), or Plane 0, is the first 64K, which is what was in use until 2000, and where just about everything useful will still reside. The other 16 planes are termed Supplementary.

The supplementary planes are an innovation in how characters are internally represented: programmers have to assume a character can have a million possible values, not just 64K, which means they often have to change their existing code. Furthermore, the supplementary planes are not exactly common in use: most 'real' scripts (though not all) are ensconced in the BMP. So software support for the supplementary planes lags behind that for the BMP: virtually no fonts contain them (Code2001 remains the honourable exception, with the recent additions of Alphabetum and, for the Unicode 4.1 Greek additions, Cardo and New Athena Unicode); old operating systems don't acknowledge them; some browsers still can't deal with them; some text editors don't accept them; and so on. As of this writing, for instance, Dreamweaver MX for MacOSX (which I am currently using to prepare this) will let you paste BMP text into its WYSIWYG window; but pasting Supplementary Plane text there will make it crash.
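To make the jump from 64K to a million values concrete, here is a minimal Python sketch. The helper names `plane_of` and `utf16_surrogates` are made up for illustration and are not any library's API. The surrogate-pair arithmetic is the standard UTF-16 scheme for squeezing a supplementary codepoint into two 16-bit units; this page does not otherwise go into it, so treat that part as my addition.

```python
# Illustrative helpers only; the names are hypothetical, not a library API.

PLANE_SIZE = 0x10000        # 64K codepoints per plane
MAX_CODEPOINT = 0x10FFFF    # last codepoint of the last plane (Plane 16)


def plane_of(codepoint):
    """Return the plane number: 0 for the BMP, 1 to 16 for supplementary."""
    if not 0 <= codepoint <= MAX_CODEPOINT:
        raise ValueError("not a Unicode codepoint")
    return codepoint >> 16


def utf16_surrogates(codepoint):
    """Split a supplementary codepoint into its two 16-bit UTF-16 units."""
    if not PLANE_SIZE <= codepoint <= MAX_CODEPOINT:
        raise ValueError("only supplementary codepoints need surrogates")
    offset = codepoint - PLANE_SIZE
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


if __name__ == "__main__":
    for char in ("\u03A0", "\U00010143"):   # ordinary pi, Attic five
        cp = ord(char)
        units = len(char.encode("utf-16-be")) // 2
        print(f"U+{cp:04X}: plane {plane_of(cp)}, {units} UTF-16 unit(s)")
```

Code that assumes one 16-bit unit per character will see the second sign as two 'characters', which is the kind of existing code that has to change.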

The informal name for the supplementary planes of Unicode is "astral planes", since (especially in the late '90s) their use seemed to be as remote as the theosophical "great beyond". There has been objection to this jocular usage (see "string vs. char" and subsequent discussion on Unicode list); and as Planes 1 and 2 spread in use there will be less occasion to feel that the planes really are 'astral'. But the jocular reference is harmless, and it serves as a reminder that we're not quite there yet.

Different planes are designated for different functions, as detailed in the Unicode Roadmap:

The SMP (the Supplementary Multilingual Plane, Plane 1) includes three types of scripts and notational systems that are associated somehow with Greek:

I choose to call these Semi-Greek, Para-Greek, and Anti-Greek.

1. Semi-Greek

1.1. Derived from the Greek script

The Unicode Standard currently includes two scripts closely derived from the Greek script: Old Italic and Gothic.

Both may be thought of as Greek plus extra characters; in that, of course, they resemble the other interloping scripts of Greek:

However, texts originally written in Old Italic and Gothic appear in scholarly use in Latin-based transcriptions.

Old Italic is actually an abstraction over the epichoric proliferation of letters used in the various languages of Italy. Juan José Marcos' manual to the Alphabetum font (ZIP: PDF) has a listing of variants included in his font in the Private Use Area. Scholars working in Italic of course—all together now—don't Proliferate, but Transliterate; so they have never actually attempted this kind of standardisation for their own scripts. The Unicode Standard (§13.2) admits that fonts designed for different languages will need to select different glyphs.

Some further scripts based on Greek are also under consideration for inclusion in Unicode:

Other ancient scripts that share characters with archaic Greek, such as the Iberian script of Ancient Spain and the Numidian script of Ancient North Africa, are thought to have been derived directly from Phoenician.

1.2. Used to write Greek

Plane 1 also includes two syllabaries formerly used to write Greek: Linear B and the Cypriot syllabary.

Texts in both scripts are conventionally transliterated into Latin. Frequently enough the Greek forms underlying the CV-syllabaries (not a good fit for Greek) are then reconstructed and published in the normal Greek script, with requisite additions (yot, digamma, and so forth), particularly if the text is cited in linguistic discussion of Greek, where it needs to blend in with Classical examples.

Each syllabary has a pre-Hellenic counterpart from which it seems to have been derived:

2. Para-Greek

2.1. Musical Notation

There are two musical notation schemes associated with Greek.

2.2. Numeric Notation

2.2.1. Acrophonic Numerals

Before Greek generally adopted the Milesian system for its numerals, the dominant system was acrophonic—which simply means "initials". The system was pretty much like the Roman system, in that it had a letter for 1, 5, 10, 50, 100, and so on. It used the initials of the numbers involved, outside of the inevitable Ι = 1: Π (πέντε) = 5, Δ (δέκα) = 10, Η (hεκατόν) = 100, Χ (χίλιοι) = 1000, Μ (μύριοι) = 10,000. However, the acrophonic system lacked the Roman shortcut of writing to the left to subtract: 4 was ΙΙΙΙ, not ΙΠ. Numerals not covered by initials due to ambiguity were handled by ligatures: 50 (πεντήκοντα) was formed as a delta nested inside a pi (5 tens).
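The purely additive logic described above can be sketched in a few lines of Python. This is my illustration, not anything taken from an edition or from the Unicode proposal: the function name `to_attic_acrophonic` is hypothetical, and the signs for 500, 5,000 and 50,000 are assumed to be ligatures built by analogy with the one for 50, at the codepoints I read off the Ancient Greek Numbers chart (U+10145 to U+10147). The plain capitals Ι Δ Η Χ Μ are used for the other values, and the dedicated truncated pi (U+10143) for 5.

```python
# Signs in descending order of value. Ι Δ Η Χ Μ are ordinary Greek capitals;
# 5 and its multiples use codepoints from the Ancient Greek Numbers block.
ATTIC_SIGNS = [
    (50000, "\U00010147"),  # assumed: Attic fifty thousand
    (10000, "\u039C"),      # Μ (μύριοι)
    (5000, "\U00010146"),   # assumed: Attic five thousand
    (1000, "\u03A7"),       # Χ (χίλιοι)
    (500, "\U00010145"),    # assumed: Attic five hundred
    (100, "\u0397"),        # Η (hεκατόν)
    (50, "\U00010144"),     # Attic fifty: delta nested inside pi
    (10, "\u0394"),         # Δ (δέκα)
    (5, "\U00010143"),      # Attic five: truncated pi
    (1, "\u0399"),          # Ι
]


def to_attic_acrophonic(number):
    """Write a positive integer additively, largest signs first."""
    if number < 1:
        raise ValueError("this sketch only handles positive integers")
    signs = []
    for value, sign in ATTIC_SIGNS:
        count, number = divmod(number, value)
        signs.append(sign * count)   # repeat the sign; never subtract
    return "".join(signs)


if __name__ == "__main__":
    for n in (4, 9, 67, 1976):
        print(n, to_attic_acrophonic(n))
```

Note that 4 comes out as four iotas: there is no step anywhere in the loop that could produce a subtractive form.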

The acrophonic system is routinely used in publication of inscriptions (the Milesian system did not come into general use until after the Classical era), but the Unicode proposal took a while to emerge. One of the main points of debate was, as usual, whether or not to conflate the numerals with the existing Greek characters. The normal Greek characters used with a numerical value, ΙΠΔΗΧΜ, appear in print in an epigraphical, sans-serif font, which differentiates them from the surrounding Greek text; moreover, pi appears in its archaic form, with its right leg truncated. This means that there was a case both for and against including those characters in any proposal: their typography distinguishes them from the letters, but conceptually they are the same letters, and the distinction is a modern editorial one, not inherent in the inscription.

Acrophonic Greek Numerical notation was proposed by the TLG in June 2003, and included in Unicode 4.1, March 2005, in the block Ancient Greek Numbers, U+10140 - U+1018F (as the subrange U+10140 - U+10174). As formulated by the TLG, the current proposal excludes ΙΔΗΧΜ. However, it includes the archaic, truncated pi (as U+10143 Attic Acrophonic Symbol Five, 𐅃), since it is never typographically conflated with normal pi (except on this webpage).

Apart from numerals, the Attic acrophonic system employed distinct symbols for counts of money and/or weight (talents, staters). For instance, five talents was represented in Attica as U+10148 Attic Acrophonic Symbol Five Talents, 𐅈, and five staters as U+1014F Attic Acrophonic Symbol Five Staters, 𐅏. There was also regional variation in the shape of glyphs used for both numerals and counts of money. The TLG proposal has elected to treat these as distinct codepoints; for instance, a distinction is made between U+10144 Attic Acrophonic Symbol Fifty, 𐅄, U+10166 Troezenian Acrophonic Symbol Fifty Type One, 𐅦, U+10167 Troezenian Acrophonic Symbol Fifty Type Two, 𐅧, U+10168 Hermionian Acrophonic Symbol Fifty, 𐅨, and U+10169 Thespian Acrophonic Symbol Fifty, 𐅩.

The Unicode code chart points out that "These are shown as sans-serif forms because that corresponds more closely to their appearance in ancient texts." More to the point, the convention in epigraphy is to use sans-serif when an unnormalised text is reproduced; since these numeric signs are not alphabetic letters, they are not treated like alphabetic letters, but as signs straight off the stone.

2.2.2. Ancient Greek Papyrological Numerals

Some distinct symbols also evolved post-classically as used in papyri, and most persisted in use in codices. Rather than integers, these symbols include fractions, and measures of time, money, weight, and capacity. Of these, while the symbol for 'year', U+10179 Greek Symbol Year, 𐅹, is often expanded in editions of papyri, the other symbols appear routinely in editions of later technical Greek works—medical works for the symbols of weight and capacity (in prescriptions); astronomical, mathematical and medical works for the fractions.

As the TLG proposal on these symbols makes clear, there is abundant variation in the glyphs used to represent the various values, and modern editors have made no attempt to impose a standard. The angular, the tilde-like, the S-like and the lunate sigma symbols for 1/2, for example, are all in common use in modern editions. The TLG proposal conflates rather than proliferates; mechanisms for representing glyph variation in older texts are an outstanding issue for text markup, but here the TLG has decided against foisting that variation onto Unicode.

Things are made worse when printers have improvised; for example, the use of the Latin L as a glyph for 1/2 in Theon of Alexandria's minor commentary on Ptolemy's Easy Tables (Tihon, A. 1978. Le petit commentaire de Théon d'Alexandrie aux tables faciles de Ptolémée. Studi e Testi 282. Vatican City: Biblioteca Apostolica Vaticana) seems to be a typographical convenience in the absence of a proper angular glyph. (The same editor's edition of the major commentary by Theon uses lunate sigma: Mogenet, J. & Tihon, A. 1985–91. Le grand commentaire de Théon d'Alexandrie aux tables faciles de Ptolémée. 2 vols. Studi e Testi 315 & 340. Vatican City: Biblioteca Apostolica Vaticana.)

The TLG made a proposal concerning these characters in June 2003, and this has been incorporated in Unicode 4.1 as the subrange U+10175 - U+10189 of Ancient Greek Numbers.
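For anyone processing such texts, the two subranges can be told apart by codepoint alone. The following is a small sketch under that assumption: the function name `classify_greek_number_sign` is hypothetical, and the range boundaries are simply the ones quoted on this page for Unicode 4.1, so later versions of the standard may have assigned further codepoints in the block.

```python
# Range boundaries as quoted in the text above (Unicode 4.1).
ANCIENT_GREEK_NUMBERS = (0x10140, 0x1018F)   # the whole block
ACROPHONIC = (0x10140, 0x10174)              # acrophonic numerals
PAPYROLOGICAL = (0x10175, 0x10189)           # papyrological symbols


def _in_range(codepoint, bounds):
    first, last = bounds
    return first <= codepoint <= last


def classify_greek_number_sign(char):
    """Return a label for a sign in the block, or None for anything else."""
    codepoint = ord(char)
    if _in_range(codepoint, ACROPHONIC):
        return "acrophonic"
    if _in_range(codepoint, PAPYROLOGICAL):
        return "papyrological"
    if _in_range(codepoint, ANCIENT_GREEK_NUMBERS):
        return "other"   # in the block, but outside both quoted subranges
    return None


if __name__ == "__main__":
    for char in ("\U00010143", "\U00010179", "\u03A0"):
        print(f"U+{ord(char):04X}", classify_greek_number_sign(char))
```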

There is a clear inconsistency in how acrophonic and papyrological numbers are handled: the papyrological numbers are conflated, the acrophonic proliferated. There's a simple explanation: a prominent conflater in the TLG was involved in the former proposal, and not in the latter. :-) (There is still an alternative form of half preserved: U+10176 Greek One Half Sign Alternate Form, 𐅶, alongside U+10175 Greek One Half Sign, 𐅵.) That said, there is a semantic distinction in the various acrophonic numerals that is absent in the variation in fractions -- even if that distinction, being a matter of regional provenance, is entirely predictable. The multiplicity of possible glyphs poses a challenge, and it may end up proving insuperable; but as technology catches up with the multiple-glyph-per-codepoint issue, then again, it might not.

The S-like glyph for half also turns up in Coptic, as U+2CFD Coptic Fraction One Half, ⳽.

3. Anti-Greek

Mathematics as a system is the great interloper, wrenching characters out of their typographical context and using them with completely different semantics, as mathematical terms. Greek is not the only script to have been scavenged in this way; the Latin script has had different script traditions and typographical styles imbued with distinct semantics. So there is a distinction made between U+0048 Latin Capital Letter H, H; U+210B Script Capital H, ℋ (the Hamiltonian function); U+210C Black-Letter Capital H, ℌ (the Hilbert space); and U+210D Double-Struck Capital H, ℍ (the quaternions; see more on double-struck characters).

In Mathematics, then, shifts of typeface, script, and style are important enough to yield completely distinct meaning: the difference between a script and a blackletter H is rather more grave in Mathematics than it is in normal textual use of the Latin script. In textual use, Helmut in Fraktur, italics, and cursive style are identical in meaning. This is why the distinction is extraneous to the notion of plain text: so the Latin script in Unicode does not have a distinct codepoint for Fraktur H, Italic H, and Cursive H.

Mathematics does. In fact, there's a whole block of them at U+1D400 - U+1D7FF: Mathematical Alphanumeric Symbols. So you will find there distinct codepoints for:

The prospect here should be filling you with horror. What if someone wants to write in K00l 3133t Fraktüür, d00d, and pens their webpage in the Mathematical Fraktur range U+1D504 - U+1D537 (with occasional excursions into U+2100 - U+214F Letterlike Symbols for missing characters)? What will happen is that the webpage will contain a bunch of mathematical symbols, which will look like a normal vintage mid-'70s heavy metal album cover, but will not be searchable as anything but a bunch of pages on Lie Algebra. The elemental requirement of plain text, that characters which mean the same thing should be encoded the same way regardless of presentation, will have been violated—and that is not good news. The characters do have a compatibility decomposition down to the corresponding Latin letters; but that's not enough of an excuse to do this.
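The compatibility decomposition can be seen at work with Python's standard `unicodedata` module. This is only a sketch of the point being made: the helper `fraktur_small` is a made-up name, and it relies on the small Fraktur letters being laid out contiguously from U+1D51E, as they are within the range quoted above. Capital H is one of the 'missing characters', so it comes from Letterlike Symbols (U+210C).

```python
import unicodedata

FRAKTUR_SMALL_A = 0x1D51E   # first small letter in the Mathematical Fraktur run


def fraktur_small(letter):
    """Map an ASCII lowercase letter to its Mathematical Fraktur twin."""
    return chr(FRAKTUR_SMALL_A + ord(letter) - ord("a"))


# "Helmut" as a misguided webpage might encode it: Black-Letter Capital H
# from Letterlike Symbols, then five Mathematical Fraktur small letters.
fake_helmut = "\u210C" + "".join(fraktur_small(c) for c in "elmut")

# A plain-text search for the name finds nothing...
found_raw = "Helmut" in fake_helmut

# ...until compatibility normalisation (NFKC) folds the symbols back to Latin.
folded = unicodedata.normalize("NFKC", fake_helmut)
found_folded = "Helmut" in folded

if __name__ == "__main__":
    print(found_raw, found_folded, folded)
```

A search engine that folds with NFKC can rescue such a page, but nothing obliges any given piece of software to do so, which is why the decomposition is not much of an excuse.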

Since Mathematics pilfers from Greek almost as much as it does from Latin for its symbols, you can be sure Greek was not spared the duplications:

The mathematicians may well need these symbols, and need them to be distinguished from the normal letters in plain text. The rest of you will have worked out why I termed these symbols "Anti-Greek": they are never, ever to be used in a Greek text context, but only in mathematics. Never, ever. Ever.
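If you want to check a Greek text for strays, one possible approach is sketched below, using Python's standard `unicodedata` module. The function name `find_anti_greek` is my own invention, and the test it applies is an assumption about what counts as an offender: a character in Mathematical Alphanumeric Symbols (U+1D400 - U+1D7FF) whose compatibility normalisation lands in the Greek and Coptic block (U+0370 - U+03FF).

```python
import unicodedata

MATH_ALPHANUMERIC = range(0x1D400, 0x1D800)   # U+1D400 - U+1D7FF
GREEK_AND_COPTIC = range(0x0370, 0x0400)      # U+0370 - U+03FF


def find_anti_greek(text):
    """List (index, symbol, plain letter) for mathematical Greek in text."""
    hits = []
    for index, char in enumerate(text):
        if ord(char) not in MATH_ALPHANUMERIC:
            continue
        plain = unicodedata.normalize("NFKC", char)
        # Keep only symbols that fold to a single Greek-block character;
        # mathematical Latin letters and digits are ignored here.
        if len(plain) == 1 and ord(plain) in GREEK_AND_COPTIC:
            hits.append((index, char, plain))
    return hits


if __name__ == "__main__":
    bold_alpha = unicodedata.lookup("MATHEMATICAL BOLD SMALL ALPHA")
    sample = "\u03BA" + bold_alpha + "\u03B9"   # kappa, bold alpha, iota
    for index, symbol, plain in find_anti_greek(sample):
        print(index, f"U+{ord(symbol):04X}", "should be", plain)
```

The third element of each hit is the ordinary letter that should have been typed instead.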

Hope that's clear.

.i ta'o mi ckire la djan.kau,n. noi pu clite kajdygau mi tu'a la'edi'u

Nick Nicholas, opoudjis [AT] optusnet . com . au
Created: 2003-06-15; Last revision: 2005-05-23
URL: http://www.opoudjis.net/unicode/unicode_astral.html