Greek Diacritics

1. Combining Diacritics

1.1. Diacritics Shared With Latin

U+0300 Combining Grave Accent [ ̀]; U+0301 Combining Acute Accent [ ́]; U+0308 Combining Diaeresis [ ̈]

The Unicode policy with diacritics in general is that they are encoded by shape, not by semantics or by language. So if two scripts have a diacritic in common that looks identical—even though it may have different functions from script to script or language to language—then it is assigned the same codepoint. In the case of Greek, this holds for grave, acute, and diaeresis. The titlecase combination of acutes and graves with breathing marks, moving to the right rather than above the other diacritic, is behaviour specific to Greek; but this does not alter the fact that the one codepoint covers the diacritic as used in Greek, French, Polish, or Russian. (The stacking of the acute, grave and—rarely—the circumflex over the diaeresis, on the other hand, is normal behaviour for multiple diacritics.)
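The sharing of codepoints can be checked from the character database itself. The following is a minimal sketch using Python's standard unicodedata module; the three sample letters are my own choice for illustration, not anything the Standard singles out.

```python
import unicodedata

# Precomposed letters from three different orthographies, each with an acute.
samples = {
    "French e-acute": "\u00E9",
    "Polish o-acute": "\u00F3",
    "Greek alpha-tonos": "\u03AC",
}

def combining_marks(ch):
    """Return the combining marks left after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [c for c in decomposed if unicodedata.combining(c)]

acutes = {label: combining_marks(ch) for label, ch in samples.items()}
for label, marks in acutes.items():
    print(label, [f"U+{ord(m):04X} {unicodedata.name(m)}" for m in marks])
```

All three letters should report the same mark, U+0301, whatever the language it is put to work in.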

The Unicode standard notes (§7.2) that the acute as used in Greek typography is steeper than it is for Western European languages; the same holds for the acute as used in Polish or Czech (§7.7). (As Victor Gaultney discusses in his Master's thesis (p. 7), this makes Greek and Polish archaic relative to Western European: the acute and grave were consistently steep in Western Europe as well until the early 20th century.) The Polish situation led Adam Twardoch to advocate that the kreska (the Polish acute) be considered distinct from the Western acute; but of course, this is a classic Serbian Italics situation, and is resolved by having the glyphs, not the codepoints, accommodate to the language. Since fonts end up using precomposed glyphs, whether the underlying Unicode encoding is precomposed or not, and since the characters that acute combines with in Polish, Greek, and Western European do not overlap much (though Polish does have ó), this usually comes out in the wash; a more general solution, though, is language-specific fonts and glyphs, just as for Serbian. At any rate, this issue does not affect users of Greek.

An alternative solution, which is happening at the moment with fonts like Palatino Linotype and which Gaultney is in favour of, is designing an acute compatible with more languages: since Western European accepts both a steep (old-fashioned) and a lower acute, but Polish and Czech require only the steep version, the Unicode-specific Palatino Linotype, as opposed to the older Palatino, opts for a steeper acute all round. But this kind of compromise is not always possible, and for fine typography should not be necessary; a Czech-specific font, as he describes (p. 18), with steeper acutes with flat tops, seems a better solution than imposing a Czech look on Italian, as Palatino Linotype has ended up doing (p. 21), even if it does make sense for font foundries in terms of the bottom line.

When Unicode was under the impression that the monotonic accent of Greek is a vertical line, the tonos was identified with U+030D Combining Vertical Line Above. This diacritic is applied to Marshallese, and its origin is clear: backspace on a manual typewriter, and apostrophe overstriking the vowel. (U+030E Combining Double Vertical Line Above is the same trick done on Marshallese, this time with straight double quotation marks.) The diacritics, and the clunky old typewriters that begat them, have nothing to do with Greek.

1.2. Circumflex

U+0342 Combining Greek Perispomeni [ ͂]

With the circumflex, we have a character that theoretically should be conflated with the Latin circumflex, U+0302 Combining Circumflex Accent. After all, circumflex is a translation of the Greek περισπωμένη, so the characters are clearly historically identical. The form of the Latin circumflex also reflects the original Greek form: it was a combination of the acute and the grave (being a rising then falling pitch accent), and was originally also called oxybarys, "acute-grave" (Thompson 1912:61).

The catch is that on the one hand, Unicode does not encode characters by history, but by form. It would be untenable to encode the Greek circumflex with U+0302, since the glyph ˆ is never used in Greek typography. (I've seen some Renaissance use, but that was likely a local printer kludge.) On the other hand, there are two glyph variants of the Greek circumflex: the tilde and the inverted breve. Theoretically the circumflex could have been merged with U+0303 Combining Tilde and U+0311 Combining Inverted Breve; but that would have meant that the same underlying character would be encoded as two distinct codepoints depending on the typographical tradition, which would have made searching intolerably complicated. So Unicode has relaxed its constraint on encoding by form, and has encoded both glyphs for the Greek circumflex under the one codepoint, according to function. The perispomeni diacritic is thus specific to Greek, and joins Koronis, Dialytika Tonos and Ypogegrammeni as the only script-specific diacritics in the Combining Diacritical Marks block. (Admittedly the other diacritics are almost all limited to Latin. Other scripts keep their diacritics in their own blocks.)
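To see what the single functional codepoint buys for searching, here is a minimal sketch with Python's standard unicodedata module. The hand-built look-alike sequences are my own illustration; the point is only that normalisation knows about U+0342 and does not treat the similar-looking Latin marks as equivalent to it.

```python
import unicodedata

alpha_perispomeni = "\u1FB6"  # GREEK SMALL LETTER ALPHA WITH PERISPOMENI
decomposed = unicodedata.normalize("NFD", alpha_perispomeni)
recomposed = unicodedata.normalize("NFC", "\u03B1\u0342")

# Alpha followed by marks that merely look like a Greek circumflex.
lookalikes = {
    "tilde": "\u03B1\u0303",
    "inverted breve": "\u03B1\u0311",
    "circumflex": "\u03B1\u0302",
}
matches = {
    name: unicodedata.normalize("NFC", seq) == alpha_perispomeni
    for name, seq in lookalikes.items()
}

print([f"U+{ord(c):04X}" for c in decomposed])
print(matches)
```

A text that typed the circumflex as a tilde or an inverted breve would therefore never match the properly encoded letter, which is exactly the searching problem the merged codepoint avoids.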

The division of perispomeni shape among fonts available to me is as follows; non-Unicode fonts are italicised:

Tilde                | Inverted Breve
Alphabetum Unicode   | Aisa Unicode
Cardo                | Arial Unicode MS
Code 2000            | Aristarcoj
FreeSerif            | Everson Mono Unicode
Gentium              | Galatia SIL
jGaramond            | Galilee Unicode Gk
MG Old Times UC      | GentiumAlt
Lucida Grande        | Porson
New Athena Unicode   | Attika
Palatino Linotype    | Symbol Greek II
TITUS Cyberbit Basic | Graeca II
Tahoma               | GreekSansLS
Vusillus             | GraecaUBS
Athenian             | OdysseaUBS
Symbol Greek IIP     | Payne
Symbol Greek IIPMono | PayneCondensed
Odyssea              | GreekWin
Korinthus            | Helena
Grecs du roi         | Kadmos
Grammata             | Metrp
GreekSans II         | Mounce
Hellenica            | SIL Galatia Extras
                     | OldFace Anglo

Note that Gentium is offered as two fonts, one for each shape of the circumflex. The Unicode code charts reference glyph is the tilde, which might explain its presence in a system font like Lucida or Tahoma; but the widespread use of the inverted breve shows that font designers (correctly) are not taking the reference glyphs as prescriptive.

As the Unicode standard notes (§7.2), the perispomeni is occasionally realised as U+0304 Combining Macron. My impression is this mostly happens in handwriting and ornamental type (I use it in my site logo); I have not seen any fonts that feature it. (The example Haralambous gives (§2.3), in correcting the standard's reference to U+0302 Combining Circumflex Accent, is a sans-serif font; in other sans-serif fonts, the wave in the tilde is slight enough to look like a macron.) Because macron is used in the linguistics of Ancient Greek, such usage is only tenable in Modern Greek typography, where it does not lead to confusion.

From Haralambous' discussion of the history of Greek typefaces (§5), it is clear that the default typeface in Greece (apla) uses the tilde, while the inverted breve is more common outside Greece (Porson, Belles Lettres).

The inverted breve is associated historically with the early 19th century English classicist Richard Porson, whose typeface design has dominated English classical studies since; it's what you'll see in the Loeb Classical Library (even after their recent typeface redesign). However, Porson's circumflex is in fact a revival of the older shape of the diacritic. Historically, the inverted breve is what appeared first on the papyri, as a quicker way of writing ˆ = ´ `; it continued in use in both uncial (early mediaeval) and minuscule (lowercase, late mediaeval) manuscripts. The very first facsimile of a manuscript in Thompson (1912) in which I can see a tilde circumflex is the very last one chronologically, dating from 1479 (Thompson 1912:268). The innovation of the tilde, of course, came just in time to be transferred into Greek printing, where it has been entrenched ever since.

The distribution of the form of the circumflex inside vs. outside Greece is only a tendency, of course; in fact, when Greek dialectologists use perispomeni as a diacritic over consonants (indicating palatalisation: σ̑ ζ̑ ξ̑ ψ̑ γ̑ χ̑ g̑ κ̑ λ̑ ν̑ /ʃ ʒ kʃ pʃ ʝ ç ɟ c ʎ ɲ/), the form of the perispomeni they use is invariably the inverted breve. (This practice is not very common for alveolars, but it has been done by Thanassis Costakis, who has published the most on Tsakonian. It is more frequent for velars, and even more for liquids. Note the combination of inverted breve with Latin g.)

1.3. Diaeresis + Acute

U+0344 Greek Dialytika Tonos [ ̈́]

The monotonic Greek system has three possible ways of modifying a letter: with an accent (tonos), with diaeresis (dialytika), and with both: ταινία "film", ταϊσμένος "fed", ταΐζω "feed". Early on, ELOT treated these as three mutually exclusive alternatives; the ELOT keyboard mapping has semicolon for tonos, shift-semicolon for dialytika, and option-semicolon for dialytika tonos. Unicode initially followed suit, and adopted the combination as a unique diacritic.

But of course, the combination is not a unique diacritic; it is a transparent combination of two. As a result, the diacritic canonically decomposes to U+0308 Combining Diaeresis + U+0301 Combining Acute Accent, and the use of U+0344 is now discouraged by Unicode:

There is no good reason to invent composite combining marks involving two accents together. (In fact, there are good reasons *not* to do so.) The few that exist, e.g. U+0344, cause implementation problems and are discouraged from use. (Ken Whistler)
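The decomposition can be observed directly. This is a small sketch with Python's standard unicodedata module; the variable names are mine.

```python
import unicodedata

DIALYTIKA_TONOS = "\u0344"

# Canonical decomposition splits the composite mark into its two parts.
nfd = unicodedata.normalize("NFD", DIALYTIKA_TONOS)
# Composition does not put it back together: U+0344 never survives NFC.
nfc = unicodedata.normalize("NFC", DIALYTIKA_TONOS)
# On a base letter, the two marks compose with the letter instead.
iota = unicodedata.normalize("NFC", "\u03B9" + DIALYTIKA_TONOS)

print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfc])
print(f"U+{ord(iota):04X}", unicodedata.name(iota))
```

So any normalised text has already lost U+0344, whichever normalisation form is used; only unnormalised data can still contain it.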

Because the glyph is available, it has been occasionally used in the transcription of stressed [æ] in Pontic: [ˈðævolon], the Pontic equivalent of Standard Greek [ðjaˈvolos] (διάβολος) "devil", is transliterated as δα̈́βολον or δά̤βολον. (Since both historically and phonologically [æ] is underlyingly /ia/, Soviet Pontic avoided the difficulties of diacritics by just writing ια, and there is some use of ι͡α etc. in Greek Pontic.)

1.4. Breathings

U+0313 Combining Comma Above [ ̓]; U+0314 Combining Reversed Comma Above [ ̔]

Unlike the perispomeni, the breathing marks are also intended for use with Latin script, which is why their Unicode names are neutral. The Unicode Standard notes use of the Comma Above in Americanist transcription of Amerindian languages, to indicate glottalisation or ejectives—where the IPA would use U+02C0 Modifier Letter Glottal Stop, ˀ, for the former, and U+02BC Modifier Letter Apostrophe, ʼ, for the latter.

There is a stream of diacritics that look vaguely like the Greek breathings but should not be conflated with them; italicised diacritics are spacing:

Diacritics Usage
U+02BE Modifier Letter Right Half Ring, ʾ; U+02BF Modifier Letter Left Half Ring, ʿ These characters are in common academic use to transliterate Arabic hamza and ain (ء, ع = [ʔ, ʕ]), and similar phonemes elsewhere in Semitic.
U+0312 Combining Turned Comma Above, ̒ Latvian, as a formant of U+0123 Latin Small Letter G with Cedilla, ģ. Note that (as some glyph realisations of the character show) this is underlyingly meant to be a cedilla underneath the g, as in fact occurs for its uppercase equivalent, U+0122 Latin Capital Letter G with Cedilla, Ģ. So U+0123 canonically decomposes as U+0067 U+0327, and U+0312 is not used as a distinct diacritic codepoint in Latvian at all.
U+02BC Modifier Letter Apostrophe, ʼ; U+02BD Modifier Letter Reversed Comma, ʽ See below
U+0315 Combining Comma Above Right, ̕ I haven't been able to work out what this diacritic is for, although it is used in bibliographical transliteration. A breathing mark, however, it ain't.
U+0027 Apostrophe, '; U+2019 Right Single Quotation Mark, ’; U+201B Single High-Reversed-9 Quotation Mark, ‛ See Single Quotes.
U+0485 Combining Cyrillic Dasia Pneumata, ҅; U+0486 Combining Cyrillic Psili Pneumata, ҆ As discussed elsewhere, the Old Cyrillic diacritics are 9th century borrowings of the Greek diacritics, and as such usually have archaic forms.
U+0559 Armenian Modifier Letter Left Half Ring, ՙ; U+055A Armenian Apostrophe, ՚ The Armenian characters are backwards-compatibility redundancies, according to the Unicode Standard (§7.4). The apostrophe is identical to the Latin apostrophe U+2019 Right Single Quotation Mark, which is to be preferred (one wonders why it doesn't decompose to it, at least as a compatibility decomposition). The Left Half Ring does not exist in Armenian script at all, and the Standard thinks it is a stray duplicate of U+02BB Modifier Letter Turned Comma, which is used in transliterating Armenian into Latin script. (That the codepoint endures despite not existing in the target script is a good illustration of the Stare Decisis conservatism of Unicode.)
U+0351 Combining Left Half Ring Above, ͑; U+0357 Combining Right Half Ring Above, ͗ These are from the study in profligacy that is the Uralic Phonetic Alphabet, and as far as I'm concerned they should be quarantined there.
U+02BB Modifier Letter Turned Comma, ʻ The Unicode Standard describes this as a "typographical alternative for U+02BD [Modifier Letter Reversed Comma] or U+02BF [Modifier Letter Left Half Ring]." The former is the spacing version of the rough breathing; but this is not a typographical variant in use, and the ensuing confusion of codepoints in searching is not worth the trouble.
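Since these characters are so easy to confuse by eye, it can help to ask the character database what each one is. The sketch below uses Python's standard unicodedata module; the selection of candidates simply follows the table above, and the helper name describe is my own.

```python
import unicodedata

# The two Greek breathings, followed by look-alikes from the table above.
candidates = [
    "\u0313", "\u0314",            # combining breathings
    "\u02BC", "\u02BD", "\u02BB",  # modifier letters
    "\u02BE", "\u02BF",            # half rings
    "\u0312", "\u0315",            # other combining commas
    "\u0027", "\u2019", "\u201B",  # apostrophe and quotation marks
    "\u0485", "\u0486",            # Old Cyrillic
    "\u0559", "\u055A",            # Armenian
]

def describe(ch):
    """Return (codepoint, general category, formal name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

rows = [describe(ch) for ch in candidates]
for row in rows:
    print(*row)

# The Latvian letter decomposes to g + cedilla, not g + turned comma above.
g_cedilla = unicodedata.normalize("NFD", "\u0123")
```

The general category column is the useful one for a search engine: combining marks (Mn), modifier letters (Lm), and punctuation are treated quite differently in word segmentation.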

1.5. Iota Subscript

U+0345 Combining Greek Ypogegrammeni [ ͅ]

There is more discussion of the iota subscript and its tangle of issues than is humanly decent elsewhere. The diacritic is unique to Greek to my knowledge, although the kind of mediaeval ligature that gave rise to it may also be seen at work in the Latin Mediaeval superscript letters U+0363–U+036F. As I have also discussed, the iota subscript is normally restricted to appearing with an alpha, eta, or omega, though there is a one-off instance with an upsilon that has made it to print (and there may be others in manuscripts that have been normalised in editions).
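The restriction to three vowels is reflected in which precomposed letters exist. A minimal sketch with Python's standard unicodedata module (the variable names are mine):

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"
subscripted = ["\u1FB3", "\u1FC3", "\u1FF3"]  # alpha, eta, omega with ypogegrammeni

# Each precomposed letter is canonically base vowel + U+0345.
decomposed = {ch: unicodedata.normalize("NFD", ch) for ch in subscripted}
for ch, parts in decomposed.items():
    print(f"U+{ord(ch):04X}", [unicodedata.name(p) for p in parts])

# The one-off upsilon has no precomposed form, so NFC leaves two codepoints.
upsilon = unicodedata.normalize("NFC", "\u03C5" + YPOGEGRAMMENI)
print(len(upsilon))
```

The upsilon case is still perfectly encodable; it just has to stay as a base letter plus combining mark.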

Victor Gaultney mentions in his Master's thesis (p. 5) that the iota subscript is not specific to Greek script, but is also used in North American Amerindian scripts. This comes as a surprise to me, and I suspect these are glyph variants of the somewhat similar U+0328 Combining Ogonek (  ̨)—a well-established borrowing into Amerindian orthographies from Polish, indicating nasality in both.

1.6. Ancient Editorial Diacritics

U+1DC0 Combining Dotted Grave Accent [ ᷀]; U+1DC1 Combining Dotted Acute Accent [ ᷁]

These diacritics were proposed by the TLG in November 2003, and included in Unicode 4.1, March 2005. Though they look like the dialytika - acute and grave combinations, they have nothing to do with diaeresis: they are ancient literary editorial diacritics, appearing for instance in the papyri of Sappho, and they indicate an editorial insertion or deletion of an acute or grave. (This means that properly speaking, they are a diacritic of a diacritic, and constitute an ancient editorial markup; but as you might well choose not to imagine, both implementing diacritics of diacritics and markup over diacritics would likely prove very awkward...)

1.7. Modern Editorial Diacritics

U+0359 Combining Asterisk Below [ ͙]

The asterisk below was proposed by the TLG in June 2003, and included in Unicode 4.1, March 2005. The asterisk is used in Dirk Obbink's edition of the Philodemus papyri, to indicate an apograph emendation. The sign is of course markup, but since it is applied letter by letter, the TLG considered it made sense to propose it as a diacritic.

All very well, but what's an apograph? An apograph is a drawn copy someone made once of an inscription or manuscript (in the days before photography). If the original has gone missing since, you rely on the apograph when you work on the text. In the case of Philodemus, we have both the apograph and the original, and things are rather complicated. The papyri of Philodemus were discovered in the 1750s, carbonised, in a villa in Herculaneum that was destroyed along with Pompeii. Technology being what it was in the 1750s, the bundles of charcoal, papyrus, mud and lava were rendered readable by slicing them open -- nuking the text at the slices, peeling out sheet by sheet of the papyrus, and making a drawn copy. Often enough the draughtsman would then scratch the sheet off rather than peel it -- whereupon bye-bye, sheet. In the centuries since, what's left of the papyri has been fading and crumbling some more, so that some of the text that survived the 18th century has now been nuked by the elements as well.

When a papyrologist sees half a letter on the papyrus and conjectures what the rest is, she indicates this by using U+0323 Combining Dot Below. But with the Philodemus papyri, we have cases where half of the letter is visible in the 18th century drawing -- and none of the letter is visible today. So if Obbink is reconstructing a letter in that case, he is not relying on the evidence of the papyrus before him, but of the apograph, the 18th century copy. If the copy was wrong, Obbink certainly doesn't want to take the blame for it. In such cases, he uses an asterisk below, rather than a dot below.
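In encoding terms the two conventions differ only in which combining mark follows each letter. The sketch below is purely illustrative: the function name mark_uncertain and its from_apograph flag are hypothetical, not part of any papyrological toolkit I know of.

```python
import unicodedata

DOT_BELOW = "\u0323"       # letter partly visible on the papyrus itself
ASTERISK_BELOW = "\u0359"  # letter read from the apograph only

def mark_uncertain(letters, from_apograph=False):
    """Attach the appropriate sublinear mark to every letter (hypothetical helper)."""
    mark = ASTERISK_BELOW if from_apograph else DOT_BELOW
    return "".join(unicodedata.normalize("NFC", ch + mark) for ch in letters)

papyrus_reading = mark_uncertain("\u03C6\u03C1")                      # phi, rho
apograph_reading = mark_uncertain("\u03BD", from_apograph=True)       # nu
print(papyrus_reading, apograph_reading)
```

Because the mark is applied letter by letter, a search that wants to ignore the editorial status of a reading only has to strip these two codepoints.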

Cases where we have both the apograph and the original of a text are rare enough that no standard notation to distinguish the two kinds of conjecture has developed. If this helps Obbink's notation become a standard, well, in my opinion papyrology could do with a few more standards anyway. :-)

2. Spacing Diacritics

U+0060 Grave Accent [`]; U+00A8 Diaeresis [¨]; U+00B4 Acute Accent [´]; U+02BC Modifier Letter Apostrophe [ʼ]; U+02BD Modifier Letter Reversed Comma [ʽ]; U+037A Greek Ypogegrammeni [ͺ]; U+0384 Greek Tonos [΄]; U+0385 Greek Dialytika Tonos [΅]; U+1FBE Greek Prosgegrammeni [ι]; U+1FBF Greek Psili [᾿]; U+1FC0 Greek Perispomeni [῀]; U+1FC1 Greek Dialytika And Perispomeni [῁]; U+1FCD Greek Psili And Varia [῍]; U+1FCE Greek Psili And Oxia [῎]; U+1FCF Greek Psili And Perispomeni [῏]; U+1FDD Greek Dasia And Varia [῝]; U+1FDE Greek Dasia And Oxia [῞]; U+1FDF Greek Dasia And Perispomeni [῟]; U+1FED Greek Dialytika And Varia [῭]; U+1FEE Greek Dialytika And Oxia [΅]; U+1FEF Greek Varia [`]; U+1FFD Greek Oxia [´]; U+1FFE Greek Dasia [῾]

2.1. Diacritic Abstractions

Greek in Unicode has oodles and oodles of spacing diacritics, as you can see from the listing above. The reason Unicode gives for their presence (§7.2, Unicode Standard) is as follows:

Each has an alternative representation for use with systems that support nonspacing marks. ... The spacing forms are for keyboards and pedagogical use, and are not to be used in the representation of titlecase words. The compatibility decomposition of these spacing forms consist of the sequence U+0020 Space followed by the non-spacing form equivalents shown in Table 7-2.

So the official reason these spacing diacritics are there is to serve as abstract symbols of the diacritics themselves, to be used in presentation of keyboards and other places where the diacritics are to be discussed in isolation; and in the teaching of the Greek script, where the diacritics are again discussed in isolation. In both cases, the compatibility decomposition is to space, followed by the combining diacritic; this is how Unicode recommends you treat diacritics in isolation generally. (Ken Whistler explains why this decomposition is compatibility and not canonical.) So there will not be a separate codepoint for a spacing version of U+034E Combining Upwards Arrow Below,  ͎; the diacritic is obscure enough as is, and combining it with preceding space does exactly what is needed.

U+034E was introduced in Unicode 4.0; it is one of the additions to the IPA for disordered speech, and represents whistled articulation.

The characters decompose as you would expect; I reproduce Table 7-2 of the Unicode standard below, adding codepoints that the table misses (in italics).

Spacing Form | Nonspacing Form
U+0060 Grave Accent | U+0300 Combining Grave Accent
U+00A8 Diaeresis | U+0308 Combining Diaeresis
U+00B4 Acute Accent | U+0301 Combining Acute Accent
U+02BC Modifier Letter Apostrophe | U+0313 Combining Comma Above
U+02BD Modifier Letter Reversed Comma | U+0314 Combining Reversed Comma Above
U+037A Greek Ypogegrammeni | U+0345 Combining Greek Ypogegrammeni
U+0384 Greek Tonos | U+0301 Combining Acute Accent
U+0385 Greek Dialytika Tonos | U+0308 Combining Diaeresis + U+0301 Combining Acute Accent
U+1FBD Greek Koronis | U+0313 Combining Comma Above
U+1FBE Greek Prosgegrammeni | U+03B9 Greek Small Letter Iota
U+1FBF Greek Psili | U+0313 Combining Comma Above
U+1FC0 Greek Perispomeni | U+0342 Combining Greek Perispomeni
U+1FC1 Greek Dialytika And Perispomeni | U+0308 Combining Diaeresis + U+0342 Combining Greek Perispomeni
U+1FCD Greek Psili And Varia | U+0313 Combining Comma Above + U+0300 Combining Grave Accent
U+1FCE Greek Psili And Oxia | U+0313 Combining Comma Above + U+0301 Combining Acute Accent
U+1FCF Greek Psili And Perispomeni | U+0313 Combining Comma Above + U+0342 Combining Greek Perispomeni
U+1FDD Greek Dasia And Varia | U+0314 Combining Reversed Comma Above + U+0300 Combining Grave Accent
U+1FDE Greek Dasia And Oxia | U+0314 Combining Reversed Comma Above + U+0301 Combining Acute Accent
U+1FDF Greek Dasia And Perispomeni | U+0314 Combining Reversed Comma Above + U+0342 Combining Greek Perispomeni
U+1FED Greek Dialytika And Varia | U+0308 Combining Diaeresis + U+0300 Combining Grave Accent
U+1FEE Greek Dialytika And Oxia | U+0308 Combining Diaeresis + U+0301 Combining Acute Accent
U+1FEF Greek Varia | U+0300 Combining Grave Accent
U+1FFD Greek Oxia | U+0301 Combining Acute Accent
U+1FFE Greek Dasia | U+0314 Combining Reversed Comma Above
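Most rows of this table can be regenerated from the character database. The following is a sketch with Python's standard unicodedata module; the helper nonspacing_form is my own name, and it relies on the compatibility decomposition (NFKD) of these characters being a space followed by the combining marks, as the Standard describes. I have left out the rows that I added to the table myself and U+1FEF, since I would not expect them to decompose this way.

```python
import unicodedata

SPACING = [
    0x037A, 0x0384, 0x0385, 0x1FBD, 0x1FBF, 0x1FC0, 0x1FC1, 0x1FCD,
    0x1FCE, 0x1FCF, 0x1FDD, 0x1FDE, 0x1FDF, 0x1FED, 0x1FEE, 0x1FFE,
]

def nonspacing_form(cp):
    """Full compatibility decomposition, minus the leading space."""
    return unicodedata.normalize("NFKD", chr(cp)).lstrip(" ")

table = {cp: nonspacing_form(cp) for cp in SPACING}
for cp, marks in table.items():
    print(f"U+{cp:04X} ->", " + ".join(f"U+{ord(m):04X}" for m in marks))

# The prosgegrammeni is the odd one out: it is canonically a plain iota.
prosgegrammeni = unicodedata.normalize("NFD", "\u1FBE")
```

Note that canonical normalisation alone (NFD or NFC) does not produce the combining marks; it takes the compatibility forms to get from the spacing characters to them.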

There are two special cases in the foregoing:

2.2. Stealth Titlecase

It seems curious to have 18 codepoints of the Greek ranges assigned to discussion of keyboards and Greek graphology—especially if Unicode would rather treat them as space + combining diacritic combinations, and most of them are composed of multiple diacritics. This immediately tells you that the codepoints were not the Unicode Consortium's idea:

(Similar precomposed diacritical marks do not seem to exist for Vietnamese, which makes me think they've been included for compatibility with legacy encodings rather than for a good reason.

The Greek precomposed spacing accents came in as a lump from the Greek national body ELOT for polytoniko Greek, and had to be accepted into Unicode as part of the merger compromise with [ISO Standard] 10646.

Still, because their decompositions are not canonical, they need to be taken into account, which in my case complicates what would otherwise be somewhat cleaner code.)

True enough. The UTC [Unicode Technical Committee] disliked them from the beginning, but had no choice in the matter. (Ken Whistler, in response to Juliusz Chroboczek)

And the reason ELOT insisted on their inclusion was not concern with graphology. It was 8-bit practice. An 8-bit font has only 224 characters available, and that only because MacOS and Windows both used the range 0x80–0xAF, which is outside of Latin-1 and thereby proprietary, making it very difficult to exchange data encoded in such fonts between the two platforms. You'll notice that Greek Extended alone takes up that much space, and there is no way to fit the entire Greek repertoire represented in Unicode into 224 characters. So if a font had precomposed codepoints, as has been usual in Classics (though not Biblical studies), then how did they do it?

With some triage. Following the Mixed Adscript system, there was no need for capital iota subscript or adscript combinations; that dispensed with those characters. The diaeresis + circumflex combination is quite rare in Classical Greek, and non-existent in Modern Greek; that gets rid of those characters. The crucial piece of triage was the third: rather than have a single glyph for titlecase breathings to the left of the letter, fonts would use one codepoint for the breathing, and one glyph for the capital letter. So Ἄδμητος "Admetus" was encoded as the equivalent of U+1FCE Greek Psili And Oxia, U+0391 Greek Capital Letter Alpha, etc.

This is thoroughly wrong as far as Unicode is concerned (though it may well reflect lead-type reality). Diacritics are always meant to follow what they modify; a case insensitive search for ἄδμητος should match Ἄδμητος in a fuss-free way, and post-modifying diacritics allow this; premodifying diacritics do not, especially when as far as Unicode is concerned they aren't even diacritics, but symbols (Sk):

Word Codepoints Encoding
ἄδμητος 03B1 0313 0301 03B4 03BC 03B7 03C4 03BF 03C2 Decomposed
ἄδμητος 1F04           03B4 03BC 03B7 03C4 03BF 03C2 Precomposed
Ἄδμητος 0391 0313 0301 03B4 03BC 03B7 03C4 03BF 03C2 Decomposed
Ἄδμητος 1F0C           03B4 03BC 03B7 03C4 03BF 03C2 Precomposed
῎Αδμητος 1FCE 0391      03B4 03BC 03B7 03C4 03BF 03C2 Miscomposed

Moreover, at least theoretically, there ends up being too much space between the titlecase diacritic and the letter; however, since this kind of distance has been placed in 8-bit fonts for the past couple of decades, and by printers using separate slugs for diacritic and letter even longer, it's not like many readers of Greek would even notice this as a problem.

So the codepoints are there to do stealth titlecase treatment of premodifying diacritics. This is to be rejected, and the Unicode Standard explicitly rejects this use in the quoted passage above; it also represents Rule 6 (§1.6) of Haralambous' Guidelines.
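The practical cost shows up as soon as you try a caseless comparison. Below is a minimal sketch using only the Python standard library; search_key is my own simplified stand-in for a real search engine's folding (case-fold, then canonical decomposition), not a full implementation of Unicode caseless matching.

```python
import unicodedata

def search_key(word):
    """Fold case, then canonically decompose, so equivalent spellings compare equal."""
    return unicodedata.normalize("NFD", word.casefold())

TAIL = "\u03B4\u03BC\u03B7\u03C4\u03BF\u03C2"  # the letters after the initial alpha
spellings = {
    "lower, decomposed": "\u03B1\u0313\u0301" + TAIL,
    "lower, precomposed": "\u1F04" + TAIL,
    "title, decomposed": "\u0391\u0313\u0301" + TAIL,
    "title, precomposed": "\u1F0C" + TAIL,
    "title, miscomposed": "\u1FCE\u0391" + TAIL,
}

target = search_key(spellings["lower, precomposed"])
matches = {label: search_key(s) == target for label, s in spellings.items()}
for label, ok in matches.items():
    print(f"{label:20} {'match' if ok else 'NO MATCH'}")
```

The four legitimate encodings collapse to the same key; the miscomposed form does not, because the spacing diacritic is a separate symbol sitting in front of a bare capital alpha.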

Of course, much Unicode polytonic text comes from conversion of 8-bit text, for which it is simplest to map the codepoints one-to-one; so there is plenty such legacy text around. Moreover, Unicode users, unaware of the need to keep titlecase and lowercase diacritics in the same place, may persist in using spacing diacritics like they've always done in 8-bit. So if you're a user—don't do it. And if you're a developer, be aware that they're doing it anyway; your search engine may need an option to deal with spacing diacritics on the wrong side of vowels.

2.3. Cheshire Vowels

Back when the world was young (March 2001), I asked on the Unicode list whether, if all usage of the spacing diacritics in Greek was just stealth titlecase, we couldn't just canonically interpret them as being premodifiers. Ken Whistler replied that such a switcharound would be unacceptable for the way Unicode handles decomposition, but that "information to the contrary" can be exploited in converting premodifiers to canonical Unicode: if we know that a particular text is using stealth titlecase, we can convert it accordingly:

This is a knowing transformation of the data from one form to another form by a process aware of these equivalences. But that is comparable, for example, to doing a transliteration from one form to another form, rather than being a built-in normative equivalence defined by the Unicode Standard itself.

Beyond the engineering and political reasons not to disrupt the regular process of normalisation, there is another issue: whether in fact you can always assume that spacing diacritics in text are stealth titlecase.

There is one instance where spacing diacritics are not so used in Greek. In papyrology, a diacritic without a letter is used when all that is preserved on the papyrus is the diacritic—so the letter is missing in print as it is on the original, and all that's left of it, like a Cheshire Cat's smile, is its diacritic:

Ἥρα μιγεῖσα Ζηνὶ θυμοιδ[
δ]ύ̣σ̣α̣ρ̣κτ[ο]ν, αἰδὼς δ’ οὐκ ἐνῆ[ν] φ̣ρ̣[ον]ήματι·
[             ].υκτα τῶν ὁδοιπόρων βέλη
[             ].δως ἀγκύλαισιν ἀρταμων·  
[               ]῀̣ν ἔχ[αι]ρ̣ε κἀγέλα κακὸν
[                           ]ν ῎̣ζοι φόνος·
[                              ]μ̣ουμένη
[                  ].ιπρ[.....]γον χέρα
[                   ]οῦν ἐνδίκως̣ κ̣ικλήσκεται
[                  ]νιν ἔνδικ̣[.....].ος·
(...) (Aeschylus, Fragmenta (Radt) §281a)

Hera mingling with hot-tempered Zeus [...]
ungovernable, and there was no sense of shame in his spirit;
[...] the arrows of the wayfarers
[...] -ly tearing it apart with hooks;
[...] rejoiced and laughed at evil
[...] murder
[...] -ing
[...] hand
[...] shall be justly summoned
[...] summon [...]

This also occurs with other diacritics:

τίει δέ μιν [ἔξοχον ἄλλων
ἀθανάτων μετά γ’] αὐτὸν ἐρισθενέ[α Κρονίωνα
[               ]δι φίλην πόρε π̣[
[        ῎Ολυμπο]ν̣ ἀγάννιφον· .[ 
[               ]σ̣ι.φυὴν καὶ εἶδ[ος
[               ῾Ηρ]ακλῆϊ πτολι[πόρθωι
[                   ]ύ̣ρρ̣οον ἀργυρ[οδίνην
[                   ΄̣.]υῥέει εἰς ἅ[λα δῖαν
[                        ]΄̣γ [.].ν̣ [ (Hesiod, Fragmenta (Merkelbach & West: Fragmenta Hesiodea) §229)

For she [Hera] is honoured [as one outstanding amongst the other
immortals, along with] the mighty [Son of Cronus (Zeus)] himself
[...] friend presented [...]
[...] snow-capped O[lympus];
[...]-bred and form [...]
[...] to Hercules, [Sacker of] Cities [...]
[...]-flowing, silver-[eddying],
[...] flows to the [divine brine],

This does not lead to real ambiguity: the stealth titlecase requires the following character to be uppercase, while the Cheshire diacritic assumes the next character is lowercase, in the absence of Greek StudlyCaps. So a reasonable search engine can still anticipate this kind of distinction.

There is a potential context of an uppercase letter in the middle of a word: crasis where the second word is a proper name. Editorial practice is inconsistent: depending on the editor's compunctions, a capital may or may not appear in the middle of a word in Greek typography. So the crasis form of "to Aphrodite" (τῇ Ἀφροδίτῃ) is τἀφροδίτῃ in the current edition used by the TLG of Aristophanes Acharnians 793 (Coulon & van Daele): Ἀλλ’ οὐχὶ χοῖρος τἀφροδίτῃ θύεται, "But a pig does not get sacrificed to Aphrodite." But the TLG edition of Aeschylus Persians 997 (Page) has κἨγδαδάταν καὶ Λυθίμναν Τόλμον τ’ αἰχμᾶς ἀκόρεστον, "Cegdadatas/and Agdadatas/and Egdadatas, and Lythimnas, and Tolmus, insatiable in war". (The ambiguity is that κἨγ- could represent an underlying καὶ Ἀγ- or καὶ Ἐγ- --- and some editors ignore the coronis, and think this is just Κηγ-.)

Nonetheless, Greek StudlyCaps is unusual, and if the editor knew enough about the word to know this was crasis with a proper name, they would be filling in their guess as to what the two words involved in the crasis were. So if you find a spacing diacritic followed by an uppercase letter (in a text which has been normalised and published by a philologist, following the conventions of Greek capitalisation—even if they are using SPIonic font in TextEdit on MacOSX), you can be confident that this is stealth titlecase, and not StudlyCaps crasis.
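That heuristic is simple enough to sketch as a "knowing transformation" of the kind Whistler describes. The code below is an illustration under stated assumptions, not a tested converter: the function names are mine, the set of spacing marks is limited to those whose compatibility decomposition is a space plus combining marks, and it assumes the text follows normal Greek capitalisation.

```python
import unicodedata

# Spacing tonos/oxia, psili, dasia and their combinations with an accent.
SPACING_MARKS = {chr(cp) for cp in (
    0x0384, 0x1FBF, 0x1FCD, 0x1FCE, 0x1FCF,
    0x1FDD, 0x1FDE, 0x1FDF, 0x1FFD, 0x1FFE,
)}

def is_greek_capital(ch):
    return ch.isupper() and unicodedata.name(ch, "").startswith("GREEK CAPITAL")

def fix_stealth_titlecase(text):
    """Move a spacing diacritic onto the Greek capital that follows it."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in SPACING_MARKS and nxt and is_greek_capital(nxt):
            # Drop the space from the compatibility decomposition, keep the marks.
            marks = unicodedata.normalize("NFKD", ch).lstrip(" ")
            out.append(unicodedata.normalize("NFC", nxt + marks))
            i += 2
        else:
            # Cheshire diacritics (followed by lowercase or a dot below) stay put.
            out.append(ch)
            i += 1
    return "".join(out)

fixed = fix_stealth_titlecase("\u1FCE\u0391\u03B4\u03BC\u03B7\u03C4\u03BF\u03C2")
cheshire = fix_stealth_titlecase("\u1FCE\u0323\u03B6\u03BF\u03B9")
```

Run over the miscomposed spelling of Admetus, this should yield the precomposed capital alpha with psili and oxia, while the Cheshire diacritic from the Aeschylus fragment is left exactly as the editor printed it.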

2.4. Apostrophe

U+0027 Apostrophe [']; U+02BC Modifier Letter Apostrophe [ʼ]; U+2019 Right Single Quotation Mark [’]

The apostrophe in Greek is used for the same function as in Latin script: it indicates an absent vowel, whether from the beginning of a word (aphaeresis), the middle (syncope), or the end (elision). The relative frequencies with which it is seen in each function vary:

Haralambous discusses (§4.2.3; §1.8) the fact that the Latin "smart apostrophe" is the wrong shape for Greek, where the apostrophe needs to be the same shape as the smooth breathing mark, shorter than the apostrophe. As Haralambous recognises in his later paper, this is a Serbian Italics issue yet again: fonts need to be smart enough to produce a Greek shape (a smooth breathing) in Greek contexts, and a Latin shape in Latin contexts. Of course, since fonts haven't been smart, people are now used to seeing the higher apostrophe in Greek text generated by computer.

Haralambous further notes (§1.8) that Greek is the only language where space is systematically inserted after (word-final) apostrophe; I must admit that I was surprised to find out that this was not the case, say, in Italian. (However, it is also the case for Esperanto.)

The preferred character for apostrophe in Unicode (Unicode Standard §6.2, Encoding Characters With Multiple Semantic Values) is U+2019. A Greek font would preferably nudge this character up rather than U+02BC, particularly as U+02BC has already been designated as a spacing smooth breathing. Which, uh, is meant to look the same as the apostrophe in Greek anyway. Oh well. The point is that if you use U+0027 or U+2019, a search engine will know that you're using an apostrophe and not a smooth breathing.

Well, either that or a single quote. But as I argue elsewhere, you have to be pretty brain-damaged to use single quotes in polytonic Greek...
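For a search engine, the distinction comes down to which codepoint was typed. Here is a rough sketch in plain Python; the role labels follow this page's reading of the characters (U+02BC and U+1FBF as spacing smooth breathings), and the function names are hypothetical.

```python
ROLES = {
    "\u0027": "apostrophe (typewriter)",
    "\u2019": "apostrophe (preferred)",
    "\u02BC": "spacing smooth breathing",
    "\u1FBF": "spacing smooth breathing",
    "\u1FBD": "spacing koronis",
}

def classify_marks(text):
    """List (index, codepoint, role) for every apostrophe-like character."""
    return [
        (i, f"U+{ord(ch):04X}", ROLES[ch])
        for i, ch in enumerate(text)
        if ch in ROLES
    ]

def fold_apostrophes(text):
    """Map the typewriter apostrophe to the preferred U+2019 for indexing."""
    return text.replace("\u0027", "\u2019")

elided = "\u03B1\u03BB\u03BB'"  # alpha lambda lambda + typewriter apostrophe
print(classify_marks(elided))
print(classify_marks(fold_apostrophes(elided)))
```

Anything classified as a breathing or koronis in word-final position is then a candidate for a mistyped apostrophe, which an editor can review rather than have silently rewritten.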

