11. Ordering

 

The sorting of polytonic Greek involves two issues: how Greek words sort with respect to diacritics (Level 2); and where non-canonical letters fit into the sorting scheme for base characters (Level 1). The latter issue is easier, since the non-Attic letters already have canonical positions thanks to their erstwhile positions as numerals (and lexica reflect this where they do not conflate non-Attic letters with extant letters, as is usually the case with koppa, san, and sampi): the ordering is

αβγδεϝζ[η⊢]θικλμνξοπϺϙρστυφχψωϠ

So digamma appears between epsilon and zeta; san and then koppa appear between pi and rho; and sampi appears after omega. Heta and eta are variants of the same letter, and I do not know of a canonical decision on which comes first if both occur in an index (which is rare). Jeffery puts eta first, presumably because eta is a canonical letter and heta is not; that is as good a rationale as any. The ordering of sho has not been addressed until recently; I discuss it in the context of my general presentation of the letter. Ligatures, canonical or not, to the extent that they belong in Unicode at all, would presumably sort as their expanded counterparts: even if a stigma were left as is in the word in an index (which is doubtful), it would sort under sigma tau.
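As a minimal sketch of the Level 1 ordering just described, the following Python builds a sort key from the alphabet string above. The function name `primary_key`, the omission of heta, and the treatment of characters outside the alphabet are assumptions made for illustration, not part of any standard.

```python
import unicodedata

# Base-letter order from the text: digamma after epsilon, san then koppa
# between pi and rho, sampi after omega. Heta is left out because the text
# reports no settled position for it relative to eta.
ALPHABET = "αβγδε\u03ddζηθικλμνξοπ\u03fb\u03d9ρστυφχψω\u03e1"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def primary_key(word):
    """Return a Level 1 key: base letters only, diacritics and case ignored."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    key = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # diacritics are a Level 2 matter, skipped here
        if ch == "ς":
            ch = "σ"  # final sigma is the same letter as sigma
        # Assumption: anything outside the alphabet sorts after it, by codepoint.
        key.append(RANK.get(ch, len(ALPHABET) + ord(ch)))
    return key


# \u03d9 is archaic koppa, \u03dd is digamma.
words = ["ῥόδον", "ζῆτα", "\u03d9όρινθος", "ἔπος", "\u03ddοῖκος", "πῦρ"]
print(sorted(words, key=primary_key))
```

The words come out in the order epsilon, digamma, zeta, pi, koppa, rho, regardless of their breathings and accents.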

The trickier issue, and that of relevance to a wider range of people, is how diacritics should be sorted. Though Unicode recognises that sorting is language-specific (as the notoriously different treatment of ö in Swedish and German attests), there does need to be a default in place (the Unicode Collation Algorithm); and if the diacritics in question are mostly going to be used for Greek (as is the case for the perispomeni), all the more reason to get its sorting right for Greek.

Preparatory to formulating the algorithm, Carl-Martin Bunz and Marc Wilhelm Küster prepared in 1998 a survey of usage in Classical and Modern Greek dictionaries, comparing them with standards currently in place (European Ordering Rules, statements from ELOT), and ranging back to Henricus Stephanus' Thesaurus Graecae Linguae, the first modern dictionary of Classical Greek, dating from 1560–1572. (In case you were wondering, the similarity to the modern Thesaurus Linguae Graecae project is not coincidental.) The picture they present is messy, especially when lexica choose to use morphological rather than orthographic principles in sorting; but the overall story is:

Even more succinctly:

Iota Subscript > Breathing: (Smooth > Rough) > Accent (Acute > Grave; Circumflex?)

Forward direction for diacritics means that diacritics at the end of a word sort before the same diacritic at the start of a word. To explain this, consider the following—with the time-honoured minimal pair νομός "nome, prefecture"—νόμος "law":

νομος U+03BD U+03BF U+03BC U+03BF        U+03C2
νομός U+03BD U+03BF U+03BC U+03BF U+0301 U+03C2
νόμος U+03BD U+03BF U+0301 U+03BC U+03BF U+03C2

In its decomposed encoding, νομός differs from νομος only at its fifth codepoint, just like νομοταγής; so its first four codepoints are in common with νομος. But νόμος differs from νομος already at its third codepoint, just like νονός; so it sorts after νομός, just as νονός sorts after νομοταγής.
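The effect of direction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Unicode Collation Algorithm itself: it only looks at Level 2, it gives every codepoint other than the acute the same weight, and the names `forward_key` and `backward_key` are hypothetical. The weights 0020 and 0032 are the ones the DUCET tables quoted later in this section give for an unaccented letter and for U+0301.

```python
import unicodedata

BASE_WEIGHT = 0x0020   # secondary weight of a letter without diacritics
ACUTE_WEIGHT = 0x0032  # secondary weight of U+0301 Combining Acute Accent


def secondary_weights(word):
    """Map each codepoint of the decomposed word to a Level 2 weight."""
    decomposed = unicodedata.normalize("NFD", word)
    return [ACUTE_WEIGHT if ch == "\u0301" else BASE_WEIGHT for ch in decomposed]


def forward_key(word):
    # Forward direction: compare weights from the start of the word.
    return secondary_weights(word)


def backward_key(word):
    # Backward direction: compare weights from the end of the word.
    return secondary_weights(word)[::-1]


# The three words share their base letters, so Level 2 decides the order.
# \u03cc is omicron with acute.
words = ["ν\u03ccμος", "νομ\u03ccς", "νομος"]
print(sorted(words, key=forward_key))   # accent nearer the end sorts earlier
print(sorted(words, key=backward_key))  # accent nearer the start sorts earlier
```

Forward comparison puts the unaccented word first, then νομός, then νόμος; backward comparison swaps the last two.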

The TLG sorting algorithm is described in TLG technical note 002. As the implementation should make clear, the algorithm was inherited from a 1970s mainframe, which is why it originally used 4-bit nybbles of 16-bit words. (My contribution was limited to the treatment of coronis and hypodiastole.) The TLG did not have the luxury of using morphological criteria, in the absence of lemmatisation at the time, so it imposed a rigorous orthographic algorithm. The result is almost the same as what Bunz & Küster sketched, but not quite:

The Unicode Collation Algorithm defaults to forward directionality. It recognises that backwards directionality applies in certain languages (prominently accents in French, which sorts côte before coté); but that's why the Unicode Collation Algorithm is a default, and implementations specific to particular languages and contexts are expected to customise the sort. (Otherwise where would Swedish be?) Outside of that, the algorithm expects a Collation Element Table to assign primary (Level 1), secondary (Level 2), tertiary (Level 3: case) etc. weights for each Unicode codepoint, either explicitly or implicitly. A collation element weight of 0000.0021.0002 for U+0300 Combining Grave, for instance, indicates that the grave is to be ignored at Level 1, but assigned the value 0021 at Level 2. Though the implementor is meant to take care of the Collation Element Table as required, there is a Default Unicode Collation Element Table (DUCET) which can be used in the absence of further information.
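How the levels combine into one comparison can be sketched as follows. This is a simplified illustration, not a conforming implementation: the five-entry `TABLE` is hypothetical, the input is assumed to be already decomposed, and all levels run forward. The letter weights are copied from the DUCET excerpt later in this section, and the grave uses the illustrative 0000.0021.0002 from the paragraph above rather than its actual DUCET value.

```python
# Hypothetical miniature Collation Element Table:
# codepoint -> (primary, secondary, tertiary)
TABLE = {
    "\u03b1": (0x0C91, 0x0020, 0x0002),  # small alpha
    "\u0391": (0x0C91, 0x0020, 0x0008),  # capital alpha
    "\u03b2": (0x0C92, 0x0020, 0x0002),  # small beta
    "\u03c2": (0x0CA6, 0x0020, 0x0019),  # small final sigma
    "\u0300": (0x0000, 0x0021, 0x0002),  # combining grave (illustrative weight)
}


def sort_key(text):
    """Concatenate the non-zero weights level by level, separated by 0."""
    elements = [TABLE[ch] for ch in text]  # text must already be decomposed
    key = []
    for level in range(3):
        if level > 0:
            key.append(0)  # level separator, lower than any real weight
        key.extend(e[level] for e in elements if e[level] != 0)
    return key


plain = "\u03b1\u03c2"         # alpha, final sigma
grave = "\u03b1\u0300\u03c2"   # alpha, combining grave, final sigma
capital = "\u0391\u03c2"       # capital alpha, final sigma
beta = "\u03b2\u03c2"          # beta, final sigma

for word in sorted([beta, grave, capital, plain], key=sort_key):
    print(word, [f"{w:04X}" for w in sort_key(word)])
```

Because the grave has a zero primary weight, it leaves the Level 1 part of the key untouched; the accented form still sorts after the capitalised one, since a Level 2 difference outranks a Level 3 difference.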

Note that as specified by Unicode, backward or forward directionality applies to an entire level, not to individual codepoints within a level. So if an implementor wanted to make Unicode follow the TLG in reverse directionality for breathings but forward directionality for accents, they would have to customise the DUCET, demoting acutes to Level 3. (By the same token hypodiastoles would end up in Level 4, since they are reverse again in TLG sorting.) The implementor would have to decide if this is worth the hassle. To be honest, I don't think it is, particularly given the chaotic history of Greek sorting presented by Bunz & Küster.

So with abundant provisos that the DUCET is intended to be customised and not just taken off the shelf, here is how the DUCET treats Greek diacritics as of version 3.1.1:

Codepoint  Secondary Weight  Tertiary Weight  Quaternary Weight  Name
U+0313     0x0022            0x0002           0x0313             Combining Comma Above (Smooth Breathing)
U+0343     0x0022            0x0002           0x0343             Combining Greek Koronis
U+0314     0x002A            0x0002           0x0314             Combining Reversed Comma Above (Rough Breathing)
U+0301     0x0032            0x0002           0x0301             Combining Acute Accent
U+0300     0x0035            0x0002           0x0300             Combining Grave Accent
U+0306     0x0037            0x0002           0x0306             Combining Breve
U+0342     0x0045            0x0002           0x0342             Combining Greek Perispomeni
U+0308     0x0047            0x0002           0x0308             Combining Diaeresis
U+0304     0x005A            0x0002           0x0304             Combining Macron
U+0323     0x0079            0x0002           0x0323             Combining Dot Below
U+0345     0x0096            0x0002           0x0345             Combining Greek Ypogegrammeni

This means that the DUCET hierarchy is breathing > accent > iota subscript, which is in accordance with TLG practice; this is also the ordering of combining diacritics on letters that Unicode imposes as normative. The disruption with breve and macron sorting either side of the circumflex does not actually affect Greek: circumflexes unambiguously mark their vowels as long, so they do not get combined with quantity diacritics. A Greek-specific implementation should probably demote breve and macron below circumflex and iota subscript, though, since the latter are canonically part of Greek orthography and the former are not.

The DUCET ordering gives the following for Greek:

  1. ας
  2. ἀς
  3. ἄς
  4. ᾄς
  5. ἂς
  6. ᾂς
  7. ἀ̆ς
  8. ἆς
  9. ᾆς
  10. ἀ̄ς
  11. ᾀς
  12. ἁς
  13. ἅς
  14. ᾅς
  15. ἃς
  16. ᾃς
  17. ἇς
  18. ᾇς
  19. ᾁς
  20. άς
  21. ά̆ς
  22. ά̄ς
  23. ᾴς
  24. ὰς
  25. ᾲς
  26. ᾰς
  27. ᾶς
  28. ᾷς
  29. ᾱς
  30. ᾳς
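The listing above can be reproduced from the secondary weights in the DUCET 3.1.1 table. The Python below is a sketch under assumptions: base letters are compared by raw codepoint as a stand-in for their primary weights, combining marks missing from the table are skipped, and the name `diacritic_key` is hypothetical. Only a subset of the thirty forms is used, written with escapes so that the combining sequences are unambiguous.

```python
import unicodedata

# Secondary weights copied from the DUCET 3.1.1 table above.
SECONDARY = {
    "\u0313": 0x0022,  # smooth breathing
    "\u0343": 0x0022,  # koronis
    "\u0314": 0x002A,  # rough breathing
    "\u0301": 0x0032,  # acute
    "\u0300": 0x0035,  # grave
    "\u0306": 0x0037,  # breve
    "\u0342": 0x0045,  # perispomeni
    "\u0308": 0x0047,  # diaeresis
    "\u0304": 0x005A,  # macron
    "\u0323": 0x0079,  # dot below
    "\u0345": 0x0096,  # ypogegrammeni (iota subscript)
}
BASE_SECONDARY = 0x0020  # secondary weight of any base letter


def diacritic_key(word):
    """Return (base letters, forward list of secondary weights)."""
    letters, weights = [], []
    for ch in unicodedata.normalize("NFD", word):
        if ch in SECONDARY:
            weights.append(SECONDARY[ch])
        elif not unicodedata.combining(ch):
            letters.append(ch)  # assumption: codepoint stands in for Level 1
            weights.append(BASE_SECONDARY)
    return (letters, weights)


# A subset of the listing, in the order the listing gives.
expected = [
    "ας", "α\u0313ς", "α\u0313\u0301ς", "α\u0313\u0301\u0345ς",
    "α\u0313\u0300ς", "α\u0313\u0306ς", "α\u0313\u0342ς", "α\u0313\u0304ς",
    "α\u0313\u0345ς", "α\u0314ς", "α\u0301ς", "α\u0301\u0345ς",
    "α\u0300ς", "α\u0306ς", "α\u0342ς", "α\u0304ς", "α\u0345ς",
]
shuffled = list(reversed(expected))
print(sorted(shuffled, key=diacritic_key) == expected)
```

A Greek-specific tailoring of the kind suggested above would amount to changing the values in `SECONDARY`, for instance giving breve and macron weights higher than that of the iota subscript.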

As for the base characters of Greek, DUCET interleaves the Anti-Greek mathematical variants and the Mathematical symbol variants with the corresponding Greek letters; the current ordering is given below. Note that certain codepoints are asterisked; this means that such codepoints in the table have a variable weighting, according to what the implementor decides. The default treatment of such codepoints is "shifted", meaning that they are ignored at levels 1–3, and the character is discriminated at the quaternary level as 0xFFFF. So the keraia by default is ignored until all base letters, diacritics, and casing have been dealt with. (This is the default standing of punctuation in the DUCET.) The table also skips any precomposed combinations of letters and diacritics, since they will be handled in sorting by decomposition.

ʹU+0374 [*02E9.0020.0002.0374] # GREEK NUMERAL SIGN; QQC
͵U+0375 [*02EA.0020.0002.0375] # GREEK LOWER NUMERAL SIGN
;U+037E [*0235.0020.0002.037E] # GREEK QUESTION MARK; QQC
΄U+0384 [*020D.0020.0002.0384] # GREEK TONOS; QQC
·U+0387 [*025F.0020.0002.0387] # GREEK ANO TELEIA; QQC
αU+03B1 [.0C91.0020.0002.03B1] # GREEK SMALL LETTER ALPHA
ΑU+0391 [.0C91.0020.0008.0391] # GREEK CAPITAL LETTER ALPHA
βU+03B2 [.0C92.0020.0002.03B2] # GREEK SMALL LETTER BETA
ϐU+03D0 [.0C92.0020.0004.03D0] # GREEK BETA SYMBOL; QQK
ΒU+0392 [.0C92.0020.0008.0392] # GREEK CAPITAL LETTER BETA
γU+03B3 [.0C93.0020.0002.03B3] # GREEK SMALL LETTER GAMMA
ΓU+0393 [.0C93.0020.0008.0393] # GREEK CAPITAL LETTER GAMMA
δU+03B4 [.0C94.0020.0002.03B4] # GREEK SMALL LETTER DELTA
ΔU+0394 [.0C94.0020.0008.0394] # GREEK CAPITAL LETTER DELTA
εU+03B5 [.0C95.0020.0002.03B5] # GREEK SMALL LETTER EPSILON
ϵU+03F5 [.0C95.0020.0004.03F5] # GREEK LUNATE EPSILON SYMBOL; QQK
ΕU+0395 [.0C95.0020.0008.0395] # GREEK CAPITAL LETTER EPSILON
ϝU+03DD [.0C96.0020.0002.03DD] # GREEK SMALL LETTER DIGAMMA
ϜU+03DC [.0C96.0020.0008.03DC] # GREEK LETTER DIGAMMA
ϛU+03DB [.0C97.0020.0002.03DB] # GREEK SMALL LETTER STIGMA
ϚU+03DA [.0C97.0020.0008.03DA] # GREEK LETTER STIGMA
ζU+03B6 [.0C98.0020.0002.03B6] # GREEK SMALL LETTER ZETA
ΖU+0396 [.0C98.0020.0008.0396] # GREEK CAPITAL LETTER ZETA
ηU+03B7 [.0C99.0020.0002.03B7] # GREEK SMALL LETTER ETA
ΗU+0397 [.0C99.0020.0008.0397] # GREEK CAPITAL LETTER ETA
θU+03B8 [.0C9A.0020.0002.03B8] # GREEK SMALL LETTER THETA
ϑU+03D1 [.0C9A.0020.0004.03D1] # GREEK THETA SYMBOL; QQK
ΘU+0398 [.0C9A.0020.0008.0398] # GREEK CAPITAL LETTER THETA
ϴU+03F4 [.0C9A.0020.000A.03F4] # GREEK CAPITAL THETA SYMBOL; QQK
ͺU+037A [.0C9B.0020.0002.037A] # GREEK YPOGEGRAMMENI; QQK
ιU+03B9 [.0C9B.0020.0002.03B9] # GREEK SMALL LETTER IOTA
ΙU+0399 [.0C9B.0020.0008.0399] # GREEK CAPITAL LETTER IOTA
ϳU+03F3 [.0C9C.0020.0002.03F3] # GREEK LETTER YOT
κU+03BA [.0C9D.0020.0002.03BA] # GREEK SMALL LETTER KAPPA
ϰU+03F0 [.0C9D.0020.0004.03F0] # GREEK KAPPA SYMBOL; QQK
ΚU+039A [.0C9D.0020.0008.039A] # GREEK CAPITAL LETTER KAPPA
ϗU+03D7 [.0C9D.0020.0004.03D7][.0C91.0020.0004.03D7][.0C9B.0020.001F.03D7] # GREEK KAI SYMBOL; QQKN
λU+03BB [.0C9E.0020.0002.03BB] # GREEK SMALL LETTER LAMDA
ΛU+039B [.0C9E.0020.0008.039B] # GREEK CAPITAL LETTER LAMDA
μU+03BC [.0C9F.0020.0002.03BC] # GREEK SMALL LETTER MU
ΜU+039C [.0C9F.0020.0008.039C] # GREEK CAPITAL LETTER MU
νU+03BD [.0CA0.0020.0002.03BD] # GREEK SMALL LETTER NU
ΝU+039D [.0CA0.0020.0008.039D] # GREEK CAPITAL LETTER NU
ξU+03BE [.0CA1.0020.0002.03BE] # GREEK SMALL LETTER XI
ΞU+039E [.0CA1.0020.0008.039E] # GREEK CAPITAL LETTER XI
οU+03BF [.0CA2.0020.0002.03BF] # GREEK SMALL LETTER OMICRON
ΟU+039F [.0CA2.0020.0008.039F] # GREEK CAPITAL LETTER OMICRON
πU+03C0 [.0CA3.0020.0002.03C0] # GREEK SMALL LETTER PI
ϖU+03D6 [.0CA3.0020.0004.03D6] # GREEK PI SYMBOL; QQK
ΠU+03A0 [.0CA3.0020.0008.03A0] # GREEK CAPITAL LETTER PI
ϟU+03DF [.0CA4.0020.0002.03DF] # GREEK SMALL LETTER KOPPA
ϞU+03DE [.0CA4.0020.0008.03DE] # GREEK LETTER KOPPA
ρU+03C1 [.0CA5.0020.0002.03C1] # GREEK SMALL LETTER RHO
ϱU+03F1 [.0CA5.0020.0004.03F1] # GREEK RHO SYMBOL; QQK
ΡU+03A1 [.0CA5.0020.0008.03A1] # GREEK CAPITAL LETTER RHO
σU+03C3 [.0CA6.0020.0002.03C3] # GREEK SMALL LETTER SIGMA
ϲU+03F2 [.0CA6.0020.0004.03F2] # GREEK LUNATE SIGMA SYMBOL; QQK
ΣU+03A3 [.0CA6.0020.0008.03A3] # GREEK CAPITAL LETTER SIGMA
ςU+03C2 [.0CA6.0020.0019.03C2] # GREEK SMALL LETTER FINAL SIGMA; QQK
τU+03C4 [.0CA7.0020.0002.03C4] # GREEK SMALL LETTER TAU
ΤU+03A4 [.0CA7.0020.0008.03A4] # GREEK CAPITAL LETTER TAU
υU+03C5 [.0CA8.0020.0002.03C5] # GREEK SMALL LETTER UPSILON
ΥU+03A5 [.0CA8.0020.0008.03A5] # GREEK CAPITAL LETTER UPSILON
ϒU+03D2 [.0CA8.0020.000A.03D2] # GREEK UPSILON WITH HOOK SYMBOL; QQK
φU+03C6 [.0CA9.0020.0002.03C6] # GREEK SMALL LETTER PHI
ϕU+03D5 [.0CA9.0020.0004.03D5] # GREEK PHI SYMBOL; QQK
ΦU+03A6 [.0CA9.0020.0008.03A6] # GREEK CAPITAL LETTER PHI
χU+03C7 [.0CAA.0020.0002.03C7] # GREEK SMALL LETTER CHI
ΧU+03A7 [.0CAA.0020.0008.03A7] # GREEK CAPITAL LETTER CHI
ψU+03C8 [.0CAB.0020.0002.03C8] # GREEK SMALL LETTER PSI
ΨU+03A8 [.0CAB.0020.0008.03A8] # GREEK CAPITAL LETTER PSI
ωU+03C9 [.0CAC.0020.0002.03C9] # GREEK SMALL LETTER OMEGA
ΩU+03A9 [.0CAC.0020.0008.03A9] # GREEK CAPITAL LETTER OMEGA
ϡU+03E1 [.0CAD.0020.0002.03E1] # GREEK SMALL LETTER SAMPI
ϠU+03E0 [.0CAD.0020.0008.03E0] # GREEK LETTER SAMPI
ϣU+03E3 [.0CAE.0020.0002.03E3] # COPTIC SMALL LETTER SHEI
ϢU+03E2 [.0CAE.0020.0008.03E2] # COPTIC CAPITAL LETTER SHEI
ϥU+03E5 [.0CAF.0020.0002.03E5] # COPTIC SMALL LETTER FEI
ϤU+03E4 [.0CAF.0020.0008.03E4] # COPTIC CAPITAL LETTER FEI
ϧU+03E7 [.0CB0.0020.0002.03E7] # COPTIC SMALL LETTER KHEI
ϦU+03E6 [.0CB0.0020.0008.03E6] # COPTIC CAPITAL LETTER KHEI
ϩU+03E9 [.0CB1.0020.0002.03E9] # COPTIC SMALL LETTER HORI
ϨU+03E8 [.0CB1.0020.0008.03E8] # COPTIC CAPITAL LETTER HORI
ϫU+03EB [.0CB2.0020.0002.03EB] # COPTIC SMALL LETTER GANGIA
ϪU+03EA [.0CB2.0020.0008.03EA] # COPTIC CAPITAL LETTER GANGIA
ϭU+03ED [.0CB3.0020.0002.03ED] # COPTIC SMALL LETTER SHIMA
ϬU+03EC [.0CB3.0020.0008.03EC] # COPTIC CAPITAL LETTER SHIMA
ϯU+03EF [.0CB4.0020.0002.03EF] # COPTIC SMALL LETTER DEI
ϮU+03EE [.0CB4.0020.0008.03EE] # COPTIC CAPITAL LETTER DEI
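The bracketed notation in the table can be read mechanically. The following Python is a small sketch that parses lines in the layout shown above; the function names `parse_entry` and `levels_key` and the regular expressions are assumptions made for this example, not part of any Unicode tooling, and the "shifted" handling of asterisked entries is not implemented.

```python
import re

# "U+XXXX" followed by one or more bracketed collation elements.
ENTRY = re.compile(r"U\+([0-9A-F]{4,6})\s+((?:\[[^\]]+\])+)")
# One element: "." (ordinary) or "*" (variable), then four weights.
ELEMENT = re.compile(
    r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4,6})\]"
)


def parse_entry(line):
    """Return (codepoint, [(is_variable, (w1, w2, w3, w4)), ...])."""
    match = ENTRY.search(line)
    if match is None:
        raise ValueError(f"not a collation table line: {line!r}")
    codepoint = int(match.group(1), 16)
    elements = []
    for flag, *weights in ELEMENT.findall(match.group(2)):
        elements.append((flag == "*", tuple(int(w, 16) for w in weights)))
    return codepoint, elements


def levels_key(line):
    """Compare entries on their first three levels only."""
    _, elements = parse_entry(line)
    return [weights[:3] for _, weights in elements]


SAMPLE = [
    "ΒU+0392 [.0C92.0020.0008.0392] # GREEK CAPITAL LETTER BETA",
    "ϐU+03D0 [.0C92.0020.0004.03D0] # GREEK BETA SYMBOL; QQK",
    "βU+03B2 [.0C92.0020.0002.03B2] # GREEK SMALL LETTER BETA",
]
for line in sorted(SAMPLE, key=levels_key):
    codepoint, elements = parse_entry(line)
    print(f"U+{codepoint:04X}", elements)
```

The three beta entries share a primary and a secondary weight, so the tertiary weight alone places the small letter first, the symbol variant second, and the capital last.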

What this table means is that, at least in the DUCET:

The Greek letters new to Unicode 4.0 have not yet been incorporated into DUCET, but where they will go is mostly predictable: numeric koppa will sort after archaic koppa, reverse epsilon symbol after lunate epsilon symbol, capital theta symbol after capital theta, capital lunate sigma after lowercase lunate sigma, san before both koppas. There is no reason to think sho will not go after omega (and probably also sampi), per Bactrianist practice.

Nick Nicholas, opoudjis [AT] optusnet . com . au
Created: 2003-09-07; Last revision: 2008-02-06
URL: http://www.opoudjis.net/unicode/unicode_ordering.html