Script Mixing


1. Wakhi Kurds

In the olden days, when typographical characters were just pieces of lead type, Greek characters were one bunch of lead slugs, Latin characters another, and Cyrillic a third. If for some reason you needed a Latin character in the middle of a Cyrillic word, you just picked a slug from tray C, and put it in amongst the slugs from tray L. No damage done.

I am ignoring the fact that—as Haralambous touches on (§1.7.4) in his discussion of recent adaptations of Latin fonts to Greek—the results will make the fastidious typographer exclaim "My eyes! My eyes!": Greek and Latin fonts have different design principles (in particular, Greek lowercase doesn't really have serifs as such), and it will take the font designer some work to get the mixed characters to harmonise in the one font. But that is a glyph issue, and not a codepoint issue, which is what I'm discussing here.

Why might you mix characters from different scripts? Almost always, because you're tinkering with your own script, making it do things it wasn't able to do. Occasionally, because you're showing off the fact that you know another script instead; the number of Greek pop singers who slot Latin letters into their stage names is distressingly large (and by distressingly large, I mean non-zero).

Now, if this is just showing off, and everyone knows that the 'real' form of the word involves just the one script, you might be able to dismiss it as formatting and markup: Unicode has no business differentiating between the canonical Modern Greek να σου πω "let me tell you", and Jannis Androutsopoulos' childhood friend's affected να sου πω—any more than it has distinguishing between italics and boldface.

But it's very often not just showing off; and indeed, even when it is, you still may not be able to reduce the combination to one script (see the Basilica below). Often enough, this represents orthographic reform. If you're writing Serbian for the first time in Cyrillic, for example, you are confronted with the fact that Serbian has a phoneme /j/ that Cyrillic doesn't treat adequately. The solution Serbian took was to incorporate j as a new grapheme in Cyrillic; thus, Патријаршија "Patriarchate". Going the other way, if you're studying Old Church Slavonic in Latin transliteration, you still need a way of handling the vowels ь and ъ (soft sign and hard sign, aka jeri and jeru). You may choose to transliterate them as ĭ and ŭ or (ick) ' and "; but many Slavonicists just leave them in Cyrillic, giving transcriptions like sъbьrati instead of sŭbĭrati or (ick) s"b'rati for събьрати "to bestow".

There are more insidious ways of mixing scripts—for example, combining diacritics from one script with letters from another, something Samaritan does (Hebrew letters, Arabic vowel pointing). The horrific and thankfully marginal equivalent to that in Greek is coming up. But if it's merely a matter of mixing one tray of slugs with another, this is not a big deal.

But Unicode is not just a tray of slugs. The characters in Unicode have properties; in particular, they are identified as belonging to a given script. In the case of Serbian, ј is considered no less a part of the Cyrillic alphabet than ш, and it is accordingly given its own Cyrillic codepoint: U+0458 Cyrillic Small Letter Je. This means that text processing involving Serbian does not freak out when it encounters ј. For instance, U+0458 Cyrillic Small Letter Je sorts between U+0438 Cyrillic Small Letter I and U+043A Cyrillic Small Letter Ka, not between U+0069 Latin Small Letter I and U+006B Latin Small Letter K! The same was done for the Slavonicists (sort of): U+0184 Latin Capital Letter Tone Six, Ƅ is the Cyrillic soft sign, used as a tone letter in Zhuang, and you could in a pinch use it as a soft sign in a Slavonic transliteration context.

That this was not the rationale for the "Latin soft sign" being included in Latin script is proven by the fact that there is no corresponding crossover for the hard sign into Latin: Zhuang borrowed the Cyrillic soft sign into its Latin-based script, but not the Cyrillic hard sign.
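The difference between a shared glyph and a separately encoded character can be checked directly. What follows is a minimal Python sketch using the standard unicodedata module; the helper name describe is made up for illustration. It only reports the names that the local Unicode database assigns to the characters discussed above.

```python
import unicodedata

def describe(ch):
    """Return a 'U+XXXX NAME' label for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Serbian je, Latin j, the Zhuang tone letter, and the Cyrillic soft sign
for ch in ["\u0458", "j", "\u0184", "\u044c"]:
    print(describe(ch))

# The two j's may look identical on the page, but they do not compare equal,
# so a search for Latin j will not find the Serbian letter.
print("\u0458" == "j")
```

The names carry the script assignment in their first word, which is about as close as the Python standard library gets to the script property itself.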

So the time-old problem stalking Unicode turns up here too: to conflate, or to disunify? These issues always have a canonical rallying point on the Unicode mailing list, and this time, it is Kurdish q. Kurdish as written in the former Soviet Union uses Cyrillic, but the phoneme /q/ is written with the Latin grapheme q—corresponding to the Latin script Kurdish q and the Arabic script Kurdish ق. The question inevitably came up on what to do with it: make it a Cyrillic letter like Serbian je, or leave it as a Latin letter in the middle of Cyrillic text.
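To see what "a Latin letter in the middle of Cyrillic text" looks like to software, here is a rough Python sketch that flags words drawing on more than one script. It rests on a loud assumption: Python's standard library does not expose the Unicode script property, so the sketch guesses the script from the first word of each character's name. That heuristic is crude and is not how a real implementation would do it. The function names are hypothetical, and the sample words are the ones cited earlier on this page.

```python
import unicodedata

def rough_script(ch):
    """Guess a script label from the first word of the character name.

    This is a heuristic stand-in for the real Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def scripts_in(word):
    """Collect the guessed scripts of the letters in a word."""
    return {rough_script(ch) for ch in word if ch.isalpha()}

def mixed_script_words(text):
    """Return the whitespace-separated words that use more than one script."""
    return [word for word in text.split() if len(scripts_in(word)) > 1]

SERBIAN = "Патри\u0458арши\u0458а"    # with U+0458 Cyrillic je
SERBIAN_LATIN_J = "Патриjаршиjа"      # same word typed with Latin j
SLAVONIC = "s\u044ab\u044crati"       # Latin with Cyrillic hard and soft signs
AFFECTED = "να s\u03bf\u03c5 πω"      # Greek with a Latin s

print(scripts_in(SERBIAN))
print(scripts_in(SERBIAN_LATIN_J))
print(scripts_in(SLAVONIC))
print(mixed_script_words(AFFECTED))
```

A disunified Cyrillic q would make Kurdish text look like the first sample to such a check; the unified Latin q makes it look like the second.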

The proposal for disunifying q in Kurdish was rejected in 1997, but there is still debate on it. As a perusal of the threads shows (Coptic II; mixed-script writing systems; Cyrillic Q; Cyrillic Q, W for Kurdish; orthographic characters for glottal stop; phonetic superscripts, etc.; Pan-Cyrillic Ordering), the issues for and against disunifying are:

For

Typographically, Kurdish Cyrillic Q does things Latin Q would never do: the capital version often looks like a big lowercase q. It is not intractable to have language-dependent handling of glyphs (Serbian italics have to do it), but it is annoying.
If you have a list of both Cyrillic and Latin words, where you fit q in sorting depends on whether it occurs in a Latin or Cyrillic context. Again, not intractable, but annoying.
We do this all the time in Unicode (je, soft sign, o, a...)

Against

Legacy: anyone that has typed Cyrillic Kurdish in Unicode (or an 8-bit equivalent) until now has mixed scripts, and disunifying would make that text invalid. (Though as Michael Everson on the "for" side points out to Asmus Freytag on the "against" side, where exactly is the Cyrillic Kurdish electronic corpus anyway?)
They look the same; characters that look identical and have never not looked identical should be handled the same way, because Unicode users should not have to use a magic ball to work out the script of a character before using it.
People are going to keep thinking they're the same, and using them anyway, particularly as there'll be more fonts with Latin or Cyrillic than with the new script-hopping characters, if they're adopted. And does Unicode really need a hundred more duplicates?
The sorting rationale is a canard: Swedish and German sort differently (German sorts ö with o, Swedish puts ö at the end of the alphabet), yet no one would seriously advocate disunifying German and Swedish ö just to make sorting work. If a solution involving language tagging of individual words is necessary to get Swedish and German mixed lists working, that same solution will work for Kurdish, and obviates any need for duplicating the characters involved. Mixed script sorting is in any case highly variable: while historically Greek and Latin script words have been kept separate, the recent trend in Greek scholarship is to weave the Greek words into the Latin sorting order via implicit transliteration.
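The German and Swedish treatment of ö can be sketched with two sort keys over the same codepoints. This is a toy illustration in Python, not real collation: the key functions and the word list are invented for the example, and production systems would use proper language-specific collation tailorings rather than hand-rolled keys. The point is only that the language, not the codepoint, decides the order.

```python
# German dictionary order: umlauted vowels sort with their base vowels.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

def german_key(word):
    lowered = word.lower()
    # The folded form decides the order; the original breaks ties.
    return (lowered.translate(GERMAN_FOLD), lowered)

# Swedish order: å, ä, ö are separate letters after z.
SWEDISH_RANK = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyzåäö")}

def swedish_key(word):
    # Characters outside the alphabet are ranked after everything else.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK)) for ch in word.lower()]

words = ["zon", "öl", "ost"]
print(sorted(words, key=german_key))   # ö treated like o
print(sorted(words, key=swedish_key))  # ö after z
```

The same trick, choosing the key by the language tag of the list, is what the "against" side proposes for q in Kurdish.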

I'm not convinced there are ultimately rational arguments to decide this, as opposed to biases. But though Ken Whistler did not affix "I am speaking ex cathedra as Unicode Consortium Technical Director" in his response on Wakhi (a language spoken in Afghanistan and neighbouring countries, whose new script mixes Greek and Cyrillic with Latin), I like what he said, and go along with it:

If everyone can hold off on the Kurdish rhetoric for the moment, it should be clear that such mixed orthographies as Peter [Constable] has shown in Wakhi are best handled by simply using the characters that are already encoded, rather than cloning more and more characters into Latin, Greek, and Cyrillic to deal with the artificial constraint that would claim that any LGC-based alphabet [Latin–Greek–Cyrillic—ed.] *must* consist only of a single script. In point of fact, people for centuries have been borrowing back and forth between Latin, Greek, and Cyrillic in particular, so that in some respects LGC is a kind of metascript and should be treated as such.

...

It isn't doing anyone any favors to keep cloning such cross-script borrowings into the character encoding standard, *unless* there is strong evidence of script-specific adaptation of the letters after their borrowing. The handling of Latin Q in the otherwise Cyrillic Kurdish alphabet is what makes it the marginal case it is and argues for encoding of a separate Cyrillic Q.

The executive summary is:

if the duplicate is in a legacy encoding as a distinct character, it stays;
if the duplicate has a distinct typographical tradition in the new script, it might stay;
if it missed the boat, and has never had an independent typographical tradition in the language, the script switching won't kill you.

But there isn't consensus on this, and prominent Unicoders (notably Michael Everson and John Cowan) remain pro splitting. Worthies both, but I'm allowed to disagree with them on this.

2. Hijinks in the Basilica

Byzantium inherited its legal code from the Roman State. In fact, as far as it was concerned, it was the Roman State; it just shifted its official language (and script) gradually from Latin to Greek. As it was doing so, it was translating the laws, but not necessarily transliterating the legal authorities it was citing; their names, and the names of Roman legal institutions, stayed in Latin script—at least to begin with. So far this is nothing unusual; any Greek scholarly article written from the 20th century on routinely leaves names and technical terms untransliterated as well.

But the Byzantines had a new toy at their disposal: diacritics. In the early Middle Ages, they were applying breathing marks consistently to documents that didn't have them originally, and breathings were obligatory for words starting with vowels. Constantinople is a long way from Rome, and regular rules are a powerful force. If every Greek word has to have these newfangled diacritics, the scribes figured, every Latin word should too.

Moreover, yes, the Latin names and terms are taken over untransliterated; but they still have to be inflected in Greek rather than Latin. (It is only contemporary Greek, confronted with loanwords from morphology-lite and morphology-free languages, that has given up and treats borrowed nominals as indeclinable.) So you would have Latin stems and Greek inflections—in Latin and Greek script respectively.

Which means that the manuscripts of early Byzantine legal texts are awash with script mixing; and for better or worse, their editors have respected that script mixing. The major quirk of the Novellae, Justinian's 6th century redaction, is the mixed inflections; that of the 8th century Basilica is the mixing of Latin names with Greek diacritics. Some samples of each to delight the eye:

Οὕτω τοίνυν ἡμῖν τῶν ἀρχῶν διακεκριμένων προσήκει τὸν ἐνταῦθα παραλαβόντα τὴν ἀρχὴν μετὰ τῆς τοῦ θεοῦ μνήμης ἐναντίον ἡμῶν, ἢ εἴπερ ἡμῖν οὐκ εἴη σχολή, ἐναντίον τῆς τε σῆς ὑπεροχῆς, καὶ τῶν ἀεὶ τὸν σὸν κατακοσμησόντων θρόνον, τοῦ τε ἀεὶ ἐνδοξοτάτου κόμητος τῶν θείων ἡμῶν largitionων τοῦ τε ἐνδοξοτάτου quaestorος τοῦ θείου ἡμῶν παλατίου τοῦ τε ἐνδοξοτάτου κόμητος τῶν ἁπανταχοῦ θείων ἡμῶν privatων (Novellae p. 69–70)

So having set out the authorities, it was fitting for us on taking on this authority, bearing in mind God's power over us—or, if indeed we do not care for that, over your eminence and over those who have ever set your throne in order—the ever glorious Count of our divine largitiones, and the glorious quaestor of our divine palace and the glorious Count of our divine privata throughout the land...

(3a.) Paúlu. Συνηγοροῦσι καὶ τοῖς ἄλλοις τοῖς παρ’ αὐτῶν κουρατωρευομένοις, οἷον κωφοῖς καὶ ἀσώτοις καὶ ἀφήλιξι καὶ ἀσθενέσι
(4.) Ulpianû. καὶ τοῖς διὰ τὴν διηνεκῆ νόσον μὴ δυναμένοις διοικεῖν τὰ ἴδια.
(5.) ᾿Idém. Οἱ γὰρ μὴ ἑκουσίως, ἀλλ’ ἐξ ἀνάγκης διοικοῦντες, κἂν ὦσι τῶν ὑπὲρ ἑαυτῶν μόνων συνηγορούντων, ἀκινδύνως καὶ τοῖς παρ’ αὐτῶν κηδεμονευομένοις συνηγοροῦσιν. Ὁ κωλυθεὶς ἀπὸ ἄρχοντος ἐπὶ τῶν χρόνων τῆς ἀρχῆς αὐτοῦ συνηγορῆσαι παρὰ τῷ διαδόχῳ συνηγορεῖ.
(6.) Gaḯu. Ὁ κωλυθεὶς συνηγορεῖν οὐδὲ τοῦ ἀντιδίκου αὐτοῦ συγχωροῦντος συνηγορεῖ.
(7.) Papianû. Ὁ πρὸς καιρὸν κωλυθεὶς συνηγορεῖν μετὰ τὸν χρόνον πᾶσι συνηγορεῖ, καὶ ὁ ἀπὸ ἐξορίας ὑποστρέψας. Καὶ ἀδιάφορον, διὰ ποῖον ἔγκλημα ἐκωλύθη ἢ ἐξωρίσθη. (Basilica 8.1)

(3a.) (Paul) They also legally represent the others in their custody, such as the deaf, the homeless, the elderly and the ill;
(4.) (Ulpian) And those who due to long-standing illness cannot administer their own affairs.
(5.) (Ibid.) Those who administer affairs not willingly but out of necessity, even if they are among those who legally represent themselves, may also represent those in their custody without danger. One who has been barred by a governor from representing his own authority due to age is represented by his heir.
(6.) (Gaius) One who has been barred from offering legal representation may not offer it even if his opponent assents to it.
(7.) (Papian) One who has been barred from offering legal representation for a given time may resume offering legal representation, as may one who has returned from exile. It is immaterial what crime has caused him to be barred or exiled.

Unlike the Novellae, the Basilica keep the Latin-ish words in Latin script, including the Greek genitives attached to them; but the diacritics are Greek in conception and placement. The circumflex may be Latin in the edition, but that is probably a typographical convenience; there is no reason to think the scribes differentiated yet between Greek and Latin circumflex—so the circumflex is probably more accurately represented as a perispomeni (Ulpianu͂, Papianu͂). The combination of i, acute and diaeresis is purely Greek. Even more Greek is the left positioning in titlecase of the breathing mark modifying the capital I in Idem: with default Unicode positioning of the diacritic post-modifying, this ends up instead as the barely legible I̓dém. (If you can read it, your font and browser are doing an exceptional job. Yay MacOSX!)
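The behaviour of these Latin-plus-Greek-diacritic combinations can be probed with Unicode normalisation. The following is a small Python sketch with some assumptions flagged: I am guessing that the breathing is encoded as U+0313 Combining Comma Above and the perispomeni as U+0342 Combining Greek Perispomeni, since the edition's own encoding is not stated, and the helper names are invented. It shows which sequences have a precomposed form and which stay as base letter plus combining mark.

```python
import unicodedata

def codepoints(text):
    """List the codepoints of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

def nfc(text):
    """Compose a string into Normalization Form C."""
    return unicodedata.normalize("NFC", text)

samples = {
    "Latin i + diaeresis + acute": "i\u0308\u0301",
    "Latin u + perispomeni": "u\u0342",
    "Greek upsilon + perispomeni": "\u03c5\u0342",
    "Latin I + psili": "I\u0313",
    "Greek Iota + psili": "\u0399\u0313",
}

for label, text in samples.items():
    print(label, codepoints(text), "->", codepoints(nfc(text)))
```

The Greek bases compose into single precomposed characters; the Latin bases with perispomeni or breathing have nothing to compose into, so their appearance is left entirely to the font's mark positioning.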

You'll be pleased to know that you won't see too much of this; this kind of mixing is pretty close to a one-off.

2.1. Script Mixing Abroad

... Well, close to a one-off. There are a couple of instances of script mixing abroad. If a Latin scribe learned Greek between the ninth and eleventh centuries, and then had to correct a missing h in a Latin manuscript, they thought it the height of fashion to use the new Greek technology, and pop in a rough breathing mark instead; thus, annibal would end up as a҅nnibal, with the archaic form of the rough breathing diacritic. More rarely (and ludicrously), they would even delete redundant h by using the smooth breathing, ҆. (Thompson 1912:64)

The other instance abroad is indicated by the current dwelling place of the archaic breathing signs in Unicode: U+0485 Combining Cyrillic Dasia Pneumata and U+0486 Combining Cyrillic Psili Pneumata. (Pneumata is Greek for "breathing marks".) The Marian Codex of the Gospel according to Matthew, for instance, reads at 6:16 не бѫдѣте ѣко ѵ҅покрити "be not as the hypocrites" (Greek μὴ γίνεσθε ὥσπερ οἱ ὑποκριταί). The breathings occurred in Cyrillic words transliterated from Greek in the oldest Cyrillic manuscripts, but were also used as orthographical devices: Alexander Berdnikov reports и҆ was used as a variant of й. From what I gather, Cyrillic used the modern shapes of the breathings, as well as the older tacks. The breathings, of course, did not outlive the initial flurry of translating religious texts from Greek.

There are modern scholarly instances where Greek letters wander into Latin script; for instance, the transliteration of Etruscan routinely includes the Greek aspirates θ φ χ, but is otherwise in Latin script. The IPA and mathematical script mix the two scripts routinely, and I describe them separately.

3. Heta

A more mainstream instance of script mixing arises in archaic Greek inscriptions: the Latin script version of heta. You may recall the instance of the inscription cited as hα στάλα ἔσστα "was set up". The italicisation of the h shows the ambivalence of typographers as to which script the h belongs to. The convention in Western European scholarship on Greek is that citations of linguistic forms in Latin script are italicised, to differentiate them from the surrounding (meta)text, but linguistic forms in Greek script are left in plain, because the switch in script is enough to signal the difference between text and metatext; so I would cite στάλα, but Latin stetit or Greek (in transliteration) stála. By italicising the h in hα στάλα ἔσστα and leaving the Greek alone, the typesetter is treating this overtly as a case of script mixing, and does not trust the fact that the h is embedded in Greek to make it behave as if it was Greek, and ignore italicisation.

So what is to be done with heta? Is it to be made a distinct character, or is this to be treated as script mixing? David Perry raised this issue on the Unicode list (h in Greek Epigraphy). Ken Whistler responded correctly, as you'd have expected him to: heta should be encoded as the Latin letter h (unless, I should add, it should be encoded as tack heta).

Of course, it is utterly arbitrary that yot should get a Greek codepoint because ELOT was aware of it and included it in their discussions with Unicode, whereas heta didn't, they weren't, and they didn't. Then again, I'm not exactly convinced the world needed yot as a codepoint.

4. Modern Greek Dialectology

Much more Latin gets wedged into Greek script in the transcription of Modern Greek dialect and non-Hellenic languages of Greece. Latin letters are used to transliterate phones which the linguist feels the Greek script cannot handle. Of course, one could just use the IPA for the lot; but that would be implying that the Greek text is something the Greek alphabet is not suited for, which Greek linguists are disinclined to do. IPA or other Latin-based transcriptions of Modern dialect have appeared only sporadically, typically from non-Greeks at the start of the 20th century (Pernot, Kretschmer, Roussel; Georges Drettas' recent grammar of Pontic is exceptional in many ways, not least of which is its avoidance of Greek script).

Very occasionally, the interposed Roman characters have leaked into lay orthographies for dialects (to the extent that non-linguists write dialect down), mostly in Cyprus, whose dialect is written down by non-linguists the most, and which has long had access to Latin script. Usually, however, the distinctions involved are etic and not emic, and in almost all cases a dialect speaker can merely read out traditional Greek script, with slightly different phonetic value for the letters in their dialect.

The following is a list of Latin letters so used, and how widely they are used. I don't count dialects of Greek spoken in Italy, which are normally written in the Latin-based conventional Italian dialectal transcription with added Greek characters.

Letter	Phonetic Value	Distribution	Prevalence
b	[b] as distinct from [mb], < */mp/: μπ	Many dialects, prominently Cretan	The only form used in scholarship for most of the past century; no penetration into lay usage
d	[d] as distinct from [nd], < */nt/: ντ	Many dialects, prominently Cretan	As for b
g	[ɡ] as distinct from [ŋɡ], < */ŋɡ/: γκ, γγ	Many dialects, prominently Cretan	As for b
č	[tʃ]	Cappadocian	Used in Dawkins' Cappadocian grammar; displaced by τσ̌
ǰ	[dʒ]	Cappadocian	Used in Dawkins' Cappadocian grammar; displaced by τζ̌
sh, ch	[ʃ]	Cypriot (lay)	Used earlier last century in lay transcriptions of Cypriot (e.g. the poet Lipertis); displaced in the lay orthography by palatalised /s/ (σι + Vowel). In effect, what used to be written <sj> is now pronounced [sh], and <sj> is used wherever [sh] is needed. This does not apply at the end of words where no vowel is available; lay usage keeps <sh> in that case. E.g. Ancient χοῖρος "pig" is now pronounced as [ʃiros]; scholarly transcription has χ̌οίρος, preserving the etymology, but lay usage consistently uses σιοίρος <sjiros>, where σι = <sj> before a vowel becomes [ʃ]. On the other hand, Turkish baş "chief", in Cypriot [paʃ], is written in scholarly usage as πασ̌, but lay usage prefers παsh
e	[ə]	Albanian	One of several transliterations of Albanian ë; ἐ and ε̠ have also been used
ə	[ɯ]	Cappadocian	Only symbol used
h	[h]	Aroumin, Albanian, Ophitic Pontic	Occasional; speakers of Aroumin and Albanian in Greece usually assimilate the phoneme to the Greek χ [x] anyway
k	[kʰ]	Pontic	Once or twice; Pontic specialists rarely notice that there's a minimal pair of /k/:/kʰ/ in the dialect
q	[q]	Cappadocian	Dawkins' grammar
e	[e̝]	Corsica	Used in Blanken's grammar to indicate raising of stressed vowels
o	[o̝]	Corsica	Used in Blanken's grammar to indicate raising of stressed vowels. In Blanken's typeface, the difference between Greek ο and Latin o is clear
î	[ɨ]	Samothrace	Used in Katsanis' grammar
ê	[ə]	Samothrace	Used in Katsanis' grammar

While the preceding listing is presumably not exhaustive, it's hard for me to think of what I might have left out; Greek dialectologists avoid Latin whenever they can, even when this proves self-defeating.

A straightforward development like l > lˠ > w in Propontis Tsakonian, for instance, or v > w / _r in Samothrace, confounds Greek dialectologists, who use an alphabet which does not distinguish between [u] and [w]. They accordingly write άουο "/ˈauo/" = /ˈawo/ < άλλο /ˈalo/; μάουους "/ˈmauus/" = /ˈmawus/ < μαύρος /ˈmavros/. (Why it does not occur to them to use the non-syllabic diacritics so prominent in their treatment of Modern jot—άο͡υο, μάο͡υους—is a mystery to me. Why they don't use digamma is more clear: it's too obscure even for dialectologists. It's noteworthy that the only scholar ever to have used digamma to transcribe Propontis Tsakonian was Makris—not a professional linguist, but an antiquarian doctor.)

5. Casing

There's a problem with the Latin characters we have just seen; in uppercase, some are identical to Greek letters: H Η B Β E Ε O Ο K Κ. This is not a good thing, even though the problem resolves down to just H Η B Β, since the use of the others is marginal to begin with. The solution turns out to be simple: refuse to use the characters in uppercase.
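The clash is easy to demonstrate: the capitals share a glyph but not a codepoint, and they part ways again as soon as they are lowercased. Below is a short Python sketch; the list of pairs is taken from the paragraph above, and the helper name compare is made up for the example.

```python
import unicodedata

# (Latin capital, Greek capital) pairs that share a glyph in most fonts
LOOKALIKES = [("H", "\u0397"), ("B", "\u0392"), ("E", "\u0395"),
              ("O", "\u039f"), ("K", "\u039a")]

def compare(latin, greek):
    """Report whether two lookalikes are the same character, and their lowercase forms."""
    return {
        "same_codepoint": latin == greek,
        "latin_lower": latin.lower(),
        "greek_lower": greek.lower(),
    }

for latin, greek in LOOKALIKES:
    print(unicodedata.name(latin), "/", unicodedata.name(greek),
          compare(latin, greek))
```

So a reader cannot tell a capitalised Latin b from a capital beta, but software can, and will lowercase the first to b and the second to β.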

I discuss the casing of Latin heta in the appropriate page. For Modern Greek voiced stops, the approach has been simply to abandon capitalising those words. Recall that this only occurs in linguistic transcription (the occurrence of the unprenasalised stops is predictable), so no one is going to use it in a 'real' text, where the absence of casing is likely to be a problem. This occurs even when the result leads to bona fide ambiguity. The following is the epigraph to Loucopoulos & Loucatos' 1951 collection of proverbs from Pharasa:

Σὸ Βαρασ̌ὸ λέν' ἃν gατζ̌ί, / σὴν bόλ̣ην 'gούεται.
so varaˈʃo ˈlen an gaˈdʒi, sin ˈbol̠in ˈɡuete.
In Pharasa they say a word; it's heard in Constantinople.

Now there is a real ambiguity in Modern Greek between the common noun πόλη "city", and the proper name Πόλη "The City, Constantinople" (hence its Turkish name İstanbul). By ignoring casing on bόλ̣ην, the editors have allowed that semantic ambiguity to run unchecked. But the phonetic ambiguity with the /v/ of Βαρασ̌ὸ outweighed this; so the editors allowed the ambiguous case.

The editors did not extend their treatment of casing to the unambiguous Latin capitals, d, g: p. 83 Τὸ νερὸ 'ς τὸν Gούτσουρ' ἔν' θεό "(Rain) water from February is murky". Others do however: Contossopoulos (1996:148) has a Cretan story mention gαθάρια Δευτέρα, which is translated as Καθαρά Δευτέρα "Clean Monday" (the first Monday of Lent), a proper name for a feast day—not just any Monday involving a feather duster.

I have seen in manuscript fieldnotes casing done by using an outsized lowercase b. This is similar to any number of latterday capitals in Latin Extended-B, formed on the basis of lowercase letters (e.g. U+0186 Latin Capital Letter Open O, Ɔ, from U+0254 Latin Small Letter Open O, ɔ), or for that matter the outsized capital q which underlies the entire Cyrillic Kurdish Q controversy. I'd be reluctant to propose a novel codepoint for this b, though, especially since this glyph has not made it to print; at any rate it is clearly a capital version of U+0062 Latin Small Letter B, and we already have a codepoint for that: U+0042 Latin Capital Letter B. I'd prefer to see this handled Serbian Italic T-style, as a Greek-specific glyph variant of U+0042.

The one remaining Latin character which needs casing is yot. Though yot is used in contexts that militate against capital usage (in particular, it cannot be capitalised in titlecase, since it is not initial), it is possible for it to be used in all caps contexts. The default assumption would be that it capitalises just like Latin j → J, and I have no reason to think that it would decline capitalisation in an all caps context, the way b might. So a capital counterpart to it needs to be devised. My own malicious opinion is that this capital counterpart is U+004A Latin Capital Letter J; but I won't insist on it. The font New Athena Unicode has popped a capital Yot into the unallocated slot U+03FF, and Vusillus to the (now no longer unallocated) slot U+03F5; but that's naughty, and Vusillus' mishap with U+03F5 Greek Lunate Epsilon Symbol should be proof of that...

Sure enough: as of Unicode 4.1, U+03FF is now the dotted antisigma.
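The fate of the squatted slots can be checked against whatever Unicode database is at hand. Here is a minimal Python sketch, assuming only the standard unicodedata module; slot_name is a made-up helper. The names it prints, and the uppercase it reports for yot, depend on the Unicode version bundled with the Python build, so it is a way of asking the question rather than a fixed answer.

```python
import unicodedata

def slot_name(codepoint):
    """Return the character name for a codepoint, or a marker if unassigned."""
    return unicodedata.name(chr(codepoint), "<unassigned>")

# Yot itself, and the two slots the fonts borrowed for a capital yot
for cp in (0x03F3, 0x03F5, 0x03FF):
    print(f"U+{cp:04X}", slot_name(cp))

# What the local database gives as the uppercase of yot
yot_upper = "\u03f3".upper()
print(f"upper(U+03F3) = U+{ord(yot_upper):04X}")
```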

Nick Nicholas, opoudjis [AT] optusnet . com . au
Created: 2003-08-03; Last revision: 2008-02-06
URL: http://www.opoudjis.net/unicode/unicode_mixing.html
