Astral Planes

7. The "Astral Planes": Supplementary Planes and Greek

Originally, Unicode was intended to be a two-byte standard, as opposed to a one-byte standard: there would be 16 bits allocated per codepoint, as opposed to 8. In that way, Unicode would fall into line with the East Asian character sets, which were already using two bytes, while most other character sets used just one.

Although it is still frequently assumed that Unicode takes up 16 bits per codepoint (giving it a maximum of 64K—65,536 characters), this is no longer the case: there was not enough room in that space to fit in all the living scripts of the world, let alone historical scripts. And even if the historical scripts like Egyptian and Mayan hieroglyphics were ignored, following the "Don't Proliferate, Transliterate" principle, the need to fit in extra characters for Cantonese alone guaranteed Unicode would have to expand its repertoire.

So as of Unicode 3.0.1 (August 2000), Unicode is organised into 17 planes, each of 64K; this gives over a million codepoints (1,114,112), which should be enough for all needs, past, present and future. The Basic Multilingual Plane (BMP), or Plane 0, is the first 64K, which is what was in use until 2000, and where just about everything useful will still reside. The other 16 planes are termed Supplementary.
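The arithmetic is easy to check. The following is a minimal sketch, counting Plane 0 through Plane 16 and taking U+10FFFF as the last codepoint; the constant and function names are my own, not anything from the Standard.

```python
# Plane arithmetic. Names are illustrative, not taken from the Standard.
PLANE_SIZE = 0x10000        # 64K codepoints per plane
LAST_CODEPOINT = 0x10FFFF   # highest codepoint Unicode allows

plane_count = (LAST_CODEPOINT + 1) // PLANE_SIZE
total_codepoints = plane_count * PLANE_SIZE


def plane_of(codepoint: int) -> int:
    """Return the plane number: whatever lies above the low 16 bits."""
    return codepoint >> 16


print(plane_count, total_codepoints)
print(plane_of(0x03A9), plane_of(0x10143))
```

The plane number is simply the codepoint with its low 16 bits shifted away, which is why every BMP character comes out as Plane 0.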

The supplementary planes are an innovation in how characters are internally represented—programmers have to assume a character can have a million possible values, not just 64K, which means they often have to change their existing code. Furthermore, they are not drastically common in use: most 'real' scripts (though not all) are ensconced in the BMP. So software support for the supplementary planes lags that of the BMP: virtually no fonts contain them (Code2001 remains the honourable exception, with the recent additions of Alphabetum, and for the Unicode 4.1 Greek additions Cardo and New Athena Unicode); old operating systems don't acknowledge them; some browsers still can't deal with them; some text editors don't accept them; and so on. As of this writing for instance, Dreamweaver MX for MacOSX (which I am currently using to prepare this) will let you paste BMP text into its WYSIWYG window; but pasting Supplementary Plane text there will make it crash.
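To make the programmers' problem concrete, here is a sketch of how a supplementary-plane character looks to software built around 16-bit units. It assumes UTF-16 as the internal encoding, where such a character occupies two units (a surrogate pair) rather than one; the helper name is hypothetical.

```python
def utf16_units(text: str) -> list:
    """Split text into the 16-bit code units of its big-endian UTF-16 encoding."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


omega = "\u03A9"        # BMP: Greek capital omega
thiuth = "\U00010338"   # Plane 1: Gothic letter thiuth

print([hex(u) for u in utf16_units(omega)])
print([hex(u) for u in utf16_units(thiuth)])
```

Code that assumes "one 16-bit unit, one character" will count the Gothic letter as two characters, or cut it in half when truncating a string.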

The informal name for the supplementary planes of Unicode is "astral planes", since (especially in the late '90s) their use seemed to be as remote as the theosophical "great beyond". There has been objection to this jocular usage (see "string vs. char" and subsequent discussion on Unicode list); and as Planes 1 and 2 spread in use there will be less occasion to feel that the planes really are 'astral'. But the jocular reference is harmless, and it serves as a reminder that we're not quite there yet.

Different planes are designated for different functions, as detailed in the Unicode Roadmap:

The Supplementary Multilingual Plane (SMP: Plane 1, U+010000 - U+01FFFF), according to the Standard, is "dedicated to the encoding of lesser-used historic scripts, special-purpose invented scripts, and special notational systems, which either could not be fit into the BMP or which would be of very infrequent usage." The scripts and systems associated with Greek reside here.
The Supplementary Ideographic Plane (SIP: Plane 2, U+020000 - U+02FFFF) contains extra space for CJK (Chinese–Japanese–Korean) characters, including Cantonese-specific characters and obsolete characters.
Since it looks like that will not be enough space for CJK, the Auxiliary Ideographic Plane (AIP: Plane 3, U+030000 - U+03FFFF) has been proposed as additional space (see Unicode list discussion).
The Supplementary Special-Purpose Plane (SSP: Plane 14, U+0E0000 - U+0EFFFF) is designated for format control characters; this currently includes glyph variation selectors, and language tags.
Finally, Planes 15 and 16 (U+0F0000 - U+10FFFF) have been allocated for Private Use, just as U+E000 - U+F8FF have been in the BMP.
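The designations can be sketched as a small lookup keyed on the plane number. The table and function names are mine; Plane 3 is left out because it is only a proposal, and note that by plane arithmetic Plane 14 begins at U+E0000 and Planes 15 and 16 run from U+F0000 to U+10FFFF.

```python
# Plane designations from the list above; the table itself is illustrative.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-Purpose Plane (SSP)",
    15: "Private Use",
    16: "Private Use",
}


def describe_plane(codepoint: int) -> str:
    """Name the plane a codepoint falls in, if this table lists one."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    plane = codepoint >> 16
    return f"Plane {plane}: {PLANE_NAMES.get(plane, 'no designation listed here')}"


print(describe_plane(0x10330))   # a Gothic letter
print(describe_plane(0xE0001))   # a language tag character
```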

The SMP includes three types of scripts and notational systems that are associated somehow with Greek:

Scripts closely derived from the Greek script, or formerly used to write the Greek language.
Notational systems used in a Greek-language context.
Notational systems interloping the Greek script, but extraneous to any Greek-language context.

I choose to call these Semi-Greek, Para-Greek, and Anti-Greek.

1. Semi-Greek

1.1. Derived from the Greek script

The Unicode Standard currently includes two scripts closely derived from the Greek script:

Old Italic (U+010300 - U+01032F), derived from the Western archaic alphabet (specifically that of Chalcis in Euboea);
Gothic (U+010330 - U+01034F), derived from the late uncial script.

Both may be thought of as Greek plus extra characters; in that, of course, they resemble the other interloping scripts of Greek:

Coptic (which was long considered Greek plus extra characters by Unicode, and unified with it, but was disunified in Unicode 4.1 and moved to U+2C80 - U+2CFF), derived from the early uncial script;
Cyrillic (U+0400 - U+052F), derived from the 9th century majuscule script;
and of course Latin (U+0020 - U+024F, U+1E00 - U+1EFF), derived from Old Italic.
Armenian and Georgian are probably also partly based on Greek, but could not really be conflated with it.

However, texts originally written in Old Italic and Gothic appear in scholarly use in Latin-based transcriptions.

Gothic is transliterated basically as Germanic (with a thorn for the psi-derived U+10338 Gothic Letter Thiuth, 𐌸), but with a couple of letters added: q for U+10335 Gothic Letter Qairthra, 𐌵, and U+0195 Latin Small Letter Hv, ƕ, for U+10348 Gothic Letter Hwair, 𐍈.
The Old Italic languages—Etruscan, Oscan, Umbrian, Picene, Faliscan, Messapic, Old Latin, and others—are transliterated in the Latin alphabet with a few Greek additions, notably for the aspirates.

Old Italic is actually an abstraction over the epichoric proliferation of letters used in the various languages of Italy. Juan José Marcos' manual to the Alphabetum font (ZIP: PDF) has a listing of variants included in his font in the Private Use Area. Scholars working in Italic of course—all together now—don't Proliferate, but Transliterate; so they have never actually attempted this kind of standardisation for their own scripts. The Unicode Standard (§13.2) admits that fonts designed for different languages will need to select different glyphs.

Some further scripts based on Greek are also under consideration for inclusion in Unicode:

the Elbasan and Büthakukye (Beitha Kukju) scripts of 19th century Albania (see samples)—possibly also Veso Bei's script, although I can find no information on it and it is not in the current Roadmap;
and the Lycian, Carian, and Lydian scripts of Asia Minor. (There are others in Asia Minor, but only those have a significant corpus and are appreciably different enough from Greek. Phrygian for example, as discussed with reference to the zigzag iota, is readily conflated with Greek—and is transliterated in practice into Latin anyway.)

Other ancient scripts that share characters with archaic Greek, such as the Iberian script of Ancient Spain and the Numidian script of Ancient North Africa, are thought to have been derived directly from Phoenician.

1.2. Used to write Greek

Plane 1 also includes two syllabaries formerly used to write Greek:

Linear B (U+10000 - U+1005F, U+10080 - U+1013F), the syllabary used to write Mycenaean Greek, around 1400 BC, and spectacularly deciphered by Ventris and Chadwick in 1953.
Cypriot (U+10800 - U+1083F) was used in Cyprus to write Greek until quite late (800-200 BC), which accords with the somewhat marginal status Cyprus had in the ancient Greek world.

Texts in both scripts are conventionally transliterated into Latin. Frequently enough the Greek forms underlying the CV-syllabaries (not a good fit for Greek) are then reconstructed and published in the normal Greek script, with requisite additions (yot, digamma, qʷ, and so forth)—particularly if the text is cited in linguistic discussion of Greek, where it needs to blend in with Classical examples.

Each syllabary has a pre-Hellenic counterpart from which it seems to have been derived:

Linear A, which has been proposed for U+10600 - U+1077F; this was used to write the pre-Hellenic language of Crete.
Cypro-Minoan has been proposed for U+10780 - U+107BF; it was used to write the undeciphered, possibly Semitic language (Eteocypriot) of the Cypriot hinterland. It is first attested in the 16th century BC, and appears to have ultimately originated in Linear A (or B). I am currently uncertain of whether it can be conflated with the (Greek) Cypriot syllabary or not. Somewhat over-optimistically, one occasionally sees Cypro-Minoan associated with the Phaistos disk.

2. Para-Greek

2.1. Musical Notation

There are two musical notation schemes associated with Greek.

The Byzantine Musical Symbols (U+01D000 - U+01D0FF) are the notation used for Byzantine chant. The system consists of three phases: the Middle Byzantine, from the 9th century on, the Late Byzantine, from the 14th century on, and the Modern, which was standardised in the early 19th century. Though the basics of the notation are shared, the semantics of the notation has changed, and the decipherment of the Middle Byzantine notation was only done in 1916, by the musicologist and composer Egon Wellesz.

The two earlier notation systems are of course used by the few musicologists working on mediaeval Byzantine music. The Modern Byzantine notation system continues to be used for the notation of liturgical chant in the Greek Orthodox church (including the Balkans, the Middle East, and Romania); Western musical notation has made no inroads there. There has been use in the past of Byzantine notation for Greek folk music as well. This has mostly yielded to Western notation, particularly in the scholarly sphere, but publications with folk music in Byzantine or in both notations still appear.

I am not certain at this time whether the various musical notations employed in the Orthodox church outside the domain of Modern Byzantine notation (e.g. Russia) can be conflated with Byzantine notation as it is currently encoded in Unicode.
Archaic Greek Musical Notation was proposed by the TLG in November 2002, and was included in Unicode 4.1, March 2005, in the block U+01D200 - U+01D24F. I cannot but defer to the comprehensive background document prepared by my colleague Richard Peevers. The notation is used in editions and discussion of Ancient Greek music and musical theory; the corpus is not huge, and neither is the field, but the characters are being used.

Many of the symbols used in Ancient Greek notation are simply reused normal Greek characters (though typically in a distinct, sans-serif typeface); so this is arguably a classic interloper script. The proposal admits that those symbols cannot be differentiated from the normal Greek characters, and does not repeat those characters: it includes only the characters specific to musical notation, which have been formed from normal Greek letters through devices such as truncation, rotation, and added strokes. A couple of characters resemble extant characters, but it has been decided not to conflate them; for instance U+1D209 Greek Vocal Notation Symbol 10, 𝈉 (A below middle C) looks like U+03D8 Greek Letter Archaic Koppa, Ϙ; but the musical symbol is derived from an omicron with a stroke underneath, and so should be regarded as distinct from the koppa.

2.2. Numeric Notation

2.2.1. Acrophonic Numerals

Before Greek generally adopted the Milesian system for its numerals, the dominant system was acrophonic—which simply means "initials". The system was pretty much like the Roman system, in that it had a letter for 1, 5, 10, 50, 100, and so on. It used the initials of the numbers involved, outside of the inevitable Ι = 1: Π (πέντε) = 5, Δ (δέκα) = 10, Η (hεκατόν) = 100, Χ (χίλιοι) = 1000, Μ (μύριοι) = 10,000. However, the acrophonic system lacked the Roman shortcut of writing to the left to subtract: 4 was ΙΙΙΙ, not ΙΠ. Numerals not covered by initials due to ambiguity were handled by ligatures: 50 (πεντήκοντα) was formed as a delta nested inside a pi (5 tens).
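The additive principle can be sketched as a short conversion routine. This is an illustration, not a reconstruction of any ancient practice: the function name is mine, the 50 ligature is taken to be U+10144 in the Ancient Greek Numbers block, and the codepoints U+10145 to U+10147 for 500, 5000 and 50,000 are my assumption about where the analogous ligatures are encoded.

```python
# (value, sign) pairs, largest first. Single letters are ordinary Greek capitals.
# ASSUMPTION: U+10145..U+10147 are the ligatures for 500, 5000 and 50,000.
SIGNS = [
    (50000, "\U00010147"),
    (10000, "\u039C"),       # Mu
    (5000, "\U00010146"),
    (1000, "\u03A7"),        # Chi
    (500, "\U00010145"),
    (100, "\u0397"),         # Eta
    (50, "\U00010144"),      # delta nested inside pi
    (10, "\u0394"),          # Delta
    (5, "\u03A0"),           # Pi
    (1, "\u0399"),           # Iota
]


def to_acrophonic(n: int) -> str:
    """Write n additively: repeat each sign as often as it fits, never subtract."""
    if n < 1:
        raise ValueError("this sketch handles positive integers only")
    parts = []
    for value, sign in SIGNS:
        count, n = divmod(n, value)
        parts.append(sign * count)
    return "".join(parts)


print(to_acrophonic(4))
print(to_acrophonic(67))
```

Because there is no subtractive shortcut, 4 comes out as four iotas and 9 as a pi followed by four iotas.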

The acrophonic system is routinely used in publication of inscriptions (the Milesian system did not come into general use until after the Classical era), but the Unicode proposal took a while to emerge. One of the main points of debate was, as usual, whether or not to conflate the numerals with the existing Greek characters. The normal Greek characters used with a numerical value, ΙΠΔΗΧΜ, appear in print in an epigraphical, sans-serif font, which differentiates them from the surrounding Greek text; moreover, pi appears in its archaic form, with its right leg truncated. This means that there was a case both for and against including those characters in any proposal: their typography distinguishes them from the letters, but conceptually they are the same letters, and the distinction is a modern editorial one, not inherent in the inscription.

Acrophonic Greek Numerical notation was proposed by the TLG in June 2003, and included in Unicode 4.1, March 2005, in the block Ancient Greek Numbers, U+10140 - U+1018F (as the subrange U+10140 - U+10174). As formulated by the TLG, the current proposal excludes ΙΔΗΧΜ. However, it includes the archaic, truncated pi (as U+10143 Attic Acrophonic Symbol Five, 𐅃), since it is never typographically conflated with normal pi (except on this webpage).

Apart from numerals, the Attic acrophonic system employed distinct symbols for counts of money and/or weight (talents, staters). For instance, five talents was represented in Attica as U+10148 Attic Acrophonic Symbol Five Talents, 𐅈, and five staters as U+1014F Attic Acrophonic Symbol Five Staters, 𐅏. There was also regional variation in the shape of glyphs used for both numerals and counts of money. The TLG proposal has elected to treat these as distinct codepoints; for instance, a distinction is made between U+10144 Attic Acrophonic Symbol Fifty, 𐅄, U+10166 Troezenian Acrophonic Symbol Fifty Type One, 𐅦, U+10167 Troezenian Acrophonic Symbol Fifty Type Two, 𐅧, U+10168 Hermionian Acrophonic Symbol Fifty, 𐅨, and U+10169 Thespian Acrophonic Symbol Fifty, 𐅩.
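One way to inspect these codepoints is to ask a Unicode character database for their names. The sketch below uses Python's standard `unicodedata` module; the names it prints come from whatever Unicode version the interpreter bundles, so they may be worded differently from the names quoted above. The list and function names are hypothetical.

```python
import unicodedata

# The five "fifty" signs mentioned above.
FIFTY_SIGNS = [0x10144, 0x10166, 0x10167, 0x10168, 0x10169]


def character_names(codepoints):
    """Map each codepoint to its name in the bundled Unicode database."""
    return {cp: unicodedata.name(chr(cp), "<unnamed>") for cp in codepoints}


for cp, name in character_names(FIFTY_SIGNS).items():
    print(f"U+{cp:05X}  {name}")
```

Five separate names for one numeric value is what proliferation looks like from the software side.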

The Unicode code chart points out that "These are shown as sans-serif forms because that corresponds more closely to their appearance in ancient texts." More to the point, the convention in epigraphy is to use sans-serif when an unnormalised text is reproduced; since these numeric signs are not alphabetic letters, they are not treated like alphabetic letters, but as signs straight off the stone.

2.2.2. Ancient Greek Papyrological Numerals

Some distinct symbols also evolved post-classically as used in papyri, and most persisted in use in codices. Rather than integers, these symbols include fractions, and measures of time, money, weight, and capacity. Of these, while the symbol for 'year', U+10179 Greek Symbol Year, 𐅹, is often expanded in editions of papyri, the other symbols appear routinely in editions of later technical Greek works—medical works for the symbols of weight and capacity (in prescriptions); astronomical, mathematical and medical works for the fractions.

As the TLG proposal on these symbols makes clear, there is abundant variation in the glyphs used to represent the various values, and modern editors have made no attempt to impose a standard. The angular, the tilde-like, the S-like and the lunate sigma symbols for 1/2, for example, are all in common use in modern editions. The TLG proposal conflates rather than proliferates; mechanisms for representing glyph variation in older texts are an outstanding issue for text markup, but here the TLG has decided against foisting that variation onto Unicode.

Things are made worse when printers have improvised; for example, the use of the Latin L as a glyph for 1/2 in Theon of Alexandria's minor commentary on Ptolemy's Easy Tables (Tihon, A. 1978. Le petit commentaire de Théon d'Alexandrie aux tables faciles de Ptolémée. Studi e Testi 282. Vatican City: Biblioteca Apostolica Vaticana) seems to be a typographical convenience in the absence of a proper angular glyph. (The same editor's edition of the major commentary by Theon uses lunate sigma: Mogenet, J. & Tihon, A. 1985–91. Le grand commentaire de Théon d'Alexandrie aux tables faciles de Ptolémée. 2 vols. Studi e Testi 315 & 340. Vatican City: Biblioteca Apostolica Vaticana.)

The TLG made a proposal concerning these characters in June 2003, and this has been incorporated in Unicode 4.1 as the subrange U+10175 - U+10189 of Ancient Greek Numbers.

There is a clear inconsistency in how acrophonic and papyrological numbers are handled: the papyrological numbers are conflated, the acrophonic proliferated. There's a simple explanation: a prominent conflater in the TLG was involved in the former proposal, and not in the latter. :-) (There is still an alternative form of half preserved: U+10176 Greek One Half Sign Alternate Form, 𐅶, alongside U+10175 Greek One Half Sign, 𐅵.) That said, there is a semantic distinction in the various acrophonic numerals that is absent in the variation in fractions -- even if that distinction, being a matter of regional provenance, is entirely predictable. The multiplicity of possible glyphs poses a challenge, and it may end up proving insuperable; but as technology catches up with the multiple-glyph-per-codepoint issue, then again, it might not.

The S-like glyph for half also turns up in Coptic, as U+2CFD Coptic Fraction One Half, ⳽.

3. Anti-Greek

Mathematics as a system is the great interloper, wrenching characters out of their typographical context and using them with completely different semantics, as mathematical terms. Greek is not the only script to have been scavenged in this way; the Latin script has had different script traditions and typographical styles imbued with distinct semantics. So there is a distinction made between U+0048 Latin Capital Letter H, H; U+210B Script Capital H, ℋ (the Hamiltonian function); U+210C Black-Letter Capital H, ℌ (the Hilbert space); and U+210D Double-Struck Capital H, ℍ (the quaternions; see more on double-struck characters).

In Mathematics, then, shifts of typeface, script, and style are important enough to yield completely distinct meaning: the difference between a script and a blackletter H is rather more grave in Mathematics than it is in normal textual use of the Latin script. In textual use, Helmut in Fraktur, italics, and cursive style are identical in meaning. This is why the distinction is extraneous to the notion of plain text: so the Latin script in Unicode does not have a distinct codepoint for Fraktur H, Italic H, and Cursive H.

Mathematics does. In fact, there's a whole block of them at U+1D400 - U+1D7FF: Mathematical Alphanumeric Symbols. So you will find there distinct codepoints for:

U+1D407 Mathematical Bold Capital H, 𝐇;
U+1D43B Mathematical Italic Capital H, 𝐻;
U+1D46F Mathematical Bold Italic Capital H, 𝑯;
U+210B Script Capital H, ℋ (supplanting the allocated U+1D4A3);
U+1D4D7 Mathematical Bold Script Capital H, 𝓗;
U+210C Black-Letter Capital H, ℌ (supplanting the allocated U+1D50B);
U+210D Double-Struck Capital H, ℍ (supplanting the allocated U+1D53F);
U+1D573 Mathematical Bold Fraktur Capital H, 𝕳;
U+1D5A7 Mathematical Sans-Serif Capital H, 𝖧;
U+1D5DB Mathematical Sans-Serif Bold Capital H, 𝗛;
U+1D60F Mathematical Sans-Serif Italic Capital H, 𝘏;
U+1D643 Mathematical Sans-Serif Bold Italic Capital H, 𝙃;
U+1D677 Mathematical Monospace Capital H, 𝙷.
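The three "supplanted" slots can be checked against a character database. A minimal sketch, assuming Python's `unicodedata` module and reading its general category `Cn` as "unassigned"; the helper name is mine.

```python
import unicodedata


def is_assigned(codepoint: int) -> bool:
    """True unless the bundled Unicode database marks the codepoint unassigned (Cn)."""
    return unicodedata.category(chr(codepoint)) != "Cn"


# Slots where H would fall by position, paired with the Letterlike Symbol used instead.
for hole, letterlike in [(0x1D4A3, 0x210B), (0x1D50B, 0x210C), (0x1D53F, 0x210D)]:
    print(f"U+{hole:05X} assigned: {is_assigned(hole)}; "
          f"U+{letterlike:04X} assigned: {is_assigned(letterlike)}")
```

Software that builds a styled alphabet by adding an offset to a base codepoint has to special-case these holes.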

The prospect here should be filling you with horror. What if someone wants to write in K00l 3133t Fraktüür, d00d, and pens their webpage in the Mathematical Fraktur range U+1D504 - U+1D537 (with occasional excursions into U+2100 - U+214F Letterlike Symbols for missing characters)? What will happen is that the webpage will contain a bunch of mathematical symbols, which will look like a normal vintage mid-'70s heavy metal album cover, but will not be searchable as anything but a bunch of pages on Lie Algebra. The elemental requirement of plain text, that characters which mean the same thing should be encoded the same way regardless of presentation, will have been violated—and that is not good news. The characters do have a compatibility decomposition down to the corresponding Latin letters; but that's not enough of an excuse to do this.
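The compatibility decomposition is what a search engine would have to apply to rescue such a page. The sketch below assumes Python's `unicodedata.normalize` with the NFKC form, which applies compatibility decompositions; the function name and the sample string are mine.

```python
import unicodedata


def fold_to_plain(text: str) -> str:
    """Apply NFKC normalisation, which maps the styled mathematical letters
    to their ordinary counterparts."""
    return unicodedata.normalize("NFKC", text)


# "Helmut": a Letterlike black-letter H, then Mathematical Fraktur small letters.
fraktur = "\u210C\U0001D522\U0001D529\U0001D52A\U0001D532\U0001D531"
print(fold_to_plain(fraktur))
```

The folding only works if the software bothers to do it; nothing obliges a text editor or a search index to normalise its input.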

Since Mathematics pilfers from Greek almost as much as it does from Latin for its symbols, you can be sure Greek was not spared the duplications:

U+1D6AE Mathematical Bold Capital Eta, 𝚮;
U+1D6E8 Mathematical Italic Capital Eta, 𝛨;
U+1D722 Mathematical Bold Italic Capital Eta, 𝜢;
U+1D75C Mathematical Sans-Serif Bold Capital Eta, 𝝜;
U+1D796 Mathematical Sans-Serif Bold Italic Capital Eta, 𝞖.

The mathematicians may well need these symbols, and need them to be distinguished from the normal letters in plain text. The rest of you will have worked out why I termed these symbols "Anti-Greek": they are never, ever to be used in a Greek text context, but only in mathematics. Never, ever. Ever.
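For anyone who wants to enforce the rule mechanically, a check along the following lines would flag the offenders. It is a sketch only: it treats every character in the Mathematical Alphanumeric Symbols block (U+1D400 - U+1D7FF) as suspect, and the function name and sample strings are hypothetical.

```python
MATH_ALNUM_START, MATH_ALNUM_END = 0x1D400, 0x1D7FF


def find_math_symbols(text: str):
    """Return (index, codepoint) for each character that belongs to the
    Mathematical Alphanumeric Symbols block."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if MATH_ALNUM_START <= ord(ch) <= MATH_ALNUM_END]


good = "\u0397\u03BB\u03B9\u03BF\u03C2"       # ordinary capital eta, then lambda, iota, omicron, final sigma
bad = "\U0001D6AE\u03BB\u03B9\u03BF\u03C2"    # same, but starting with Mathematical Bold Capital Eta

print(find_math_symbols(good))
print([(i, f"U+{cp:05X}") for i, cp in find_math_symbols(bad)])
```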

Hope that's clear.

Incidentally, I am grateful to John Cowan, who politely cautioned me about the above.

Nick Nicholas, opoudjis [AT] optusnet . com . au
Created: 2003-06-15; Last revision: 2005-05-23
URL: http://www.opoudjis.net/unicode/unicode_astral.html
