11. Ordering
The sorting of polytonic Greek involves two issues: how Greek words sort with respect to diacritics (Level 2); and where non-canonical letters fit into the sorting scheme for base characters (Level 1). The latter issue is easier, since the non-Attic letters already have canonical positions thanks to their erstwhile positions as numerals (and lexica reflect this where they do not conflate non-Attic letters with extant letters, as is usually the case with koppa, san, and sampi): the ordering is
αβγδεϝζ[η⊢]θικλμνξοπϺϙρστυφχψωϠ
So digamma appears between epsilon and zeta; san then koppa appear between pi and rho; and sampi appears after omega. Heta and eta are variants of the same letter, and I do not know of a canonical decision on which comes first if both occur in an index (which is rare). Jeffery puts eta first, presumably because eta is a canonical letter and heta is not; that is as good a rationale as any. The ordering of sho has not been addressed until recently; I discuss it in the context of my general presentation of the letter. Ligatures canonical or not, to the extent that they belong in Unicode at all, would presumably sort as their expanded counterparts: even if a stigma were left as is in the word in an index (which is doubtful), it would sort under sigma tau.
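The Level 1 ordering just described can be sketched as a lookup table. This is a minimal illustration, not an established implementation: the names `ALPHABET`, `EXPANSIONS` and `primary_key` are hypothetical, heta is left out because no canonical position is settled, and stigma is expanded to sigma tau as suggested above.

```python
import unicodedata

# Base-letter order from the text: digamma after epsilon, san then koppa
# after pi, sampi after omega. Archaic letters are given as escapes:
# U+03DD digamma, U+03FB san, U+03D9 archaic koppa, U+03E1 sampi.
ALPHABET = "αβγδε\u03ddζηθικλμνξοπ\u03fb\u03d9ρστυφχψω\u03e1"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}
RANK["ς"] = RANK["σ"]  # final sigma ranks with sigma

# Assumption: ligatures sort as their expanded counterparts (stigma = sigma tau).
EXPANSIONS = {"\u03db": "στ"}


def primary_key(word):
    """Return the list of base-letter ranks, ignoring diacritics and case."""
    text = unicodedata.normalize("NFD", word.lower())
    for ligature, expansion in EXPANSIONS.items():
        text = text.replace(ligature, expansion)
    # Combining marks and anything outside the alphabet are skipped at this level.
    return [RANK[ch] for ch in text if ch in RANK]


letters = ["ρ", "\u03d9", "π", "\u03fb", "ζ", "\u03dd", "ε"]
ordered = sorted(letters, key=primary_key)
```

Here `ordered` places digamma between epsilon and zeta, and san then koppa between pi and rho.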
The trickier issue, and that of relevance to a wider range of people, is how diacritics should be sorted. Though Unicode recognises that sorting is language-specific (as the notorious different treatment of ö in Swedish and German attests), there does need to be a default in place (the Unicode Collation Algorithm); and if the diacritics in question are mostly going to be used for Greek (as is the case for the perispomeni), all the more reason to get its sorting right for Greek.
Preparatory to formulating the algorithm, Carl-Martin Bunz and Marc Wilhelm Küster prepared in 1998 a survey of usage in Classical and Modern Greek dictionaries, comparing them with standards currently in place (European Ordering Rules, statements from ELOT), and ranging back to Henricus Stephanus' Thesaurus Graecae Linguae, the first modern dictionary of Classical Greek, dating from 1560–1572. (In case you were wondering, the similarity to the modern Thesaurus Linguae Graecae project is not coincidental.) The picture they present is messy, especially when lexica choose to use morphological rather than orthographic principles in sorting; but the overall story is:
Even more succinctly:
Iota Subscript > Breathing: (Smooth > Rough) > Accent (Acute > Grave; Circumflex?)
Forward direction for diacritics means that a diacritic at the end of a word sorts before the same diacritic at the start of a word. To explain this, consider the following, using the time-honoured minimal pair νομός "nome, prefecture" and νόμος "law":
νομος | U+03BD U+03BF U+03BC U+03BF U+03C2 |
νομός | U+03BD U+03BF U+03BC U+03BF U+0301 U+03C2 |
νόμος | U+03BD U+03BF U+0301 U+03BC U+03BF U+03C2 |
In its decomposed encoding, νομός differs from νομος only at its fifth codepoint, just like νομοταγής; so its first four codepoints are in common with νομος. But νόμος differs from νομος already at its third codepoint, just like νονός; so it sorts after νομός, just as νονός sorts after νομοταγής.
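The difference between forward and backward directionality can be tried out on the same three words. The following is only a sketch under simplifying assumptions: `split_levels` and `sort_key` are hypothetical helpers, each letter's diacritics are weighted by their raw codepoint values rather than by real collation weights, and final sigma is folded into sigma.

```python
import unicodedata


def split_levels(word):
    """Split a word into base letters and, per letter, a tuple of its diacritics."""
    bases, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if marks:
                marks[-1] += (ord(ch),)  # attach the mark to the preceding letter
        else:
            bases.append("σ" if ch == "ς" else ch)
            marks.append(())
    return bases, marks


def sort_key(word, backward=False):
    """Compare base letters first, then diacritics left-to-right or right-to-left."""
    bases, marks = split_levels(word)
    return (bases, marks[::-1] if backward else marks)


words = ["νόμος", "νομός", "νομος"]
forward = sorted(words, key=sort_key)
backward = sorted(words, key=lambda w: sort_key(w, backward=True))
```

With forward comparison the unaccented word comes first, then the word accented on its last syllable, then the word accented on its first syllable; with backward comparison the last two swap places.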
The TLG sorting algorithm is described in TLG technical note 002. As the implementation should make clear, the algorithm was inherited from a 1970s mainframe, which is why it originally used 5-bit nybbles of 16-bit words. (My contribution was limited to the treatment of coronis and hypodiastole.) The TLG did not have the luxury of using morphological criteria, in the absence of lemmatisation at the time, so it imposed a rigorous orthographic algorithm. The result is almost the same as what Bunz & Küster sketched, but not quite:
The Unicode Collation Algorithm defaults to forward directionality. It recognises that backwards directionality applies in certain languages (prominently accents in French, which sorts côte before coté); but that's why the Unicode Collation Algorithm is a default, and implementations specific to particular languages and contexts are expected to customise the sort. (Otherwise where would Swedish be?) Outside of that, the algorithm expects a Collation Element Table to assign primary (Level 1), secondary (Level 2), tertiary (Level 3: case) etc. weights for each Unicode codepoint, either explicitly or implicitly. A collation element weight of 0000.0021.0002 for U+0300 Combining Grave, for instance, indicates that the grave is to be ignored at Level 1, but assigned the value 0021 at Level 2. Though the implementor is meant to take care of the Collation Element Table as required, there is a Default Unicode Collation Element Table (DUCET) which can be used in the absence of further information.
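To make the role of the levels concrete, here is a rough sketch of how a three-level sort key can be assembled from collation elements. The table `TABLE` is a hypothetical five-entry excerpt whose weights are copied from the DUCET 3.1.1 listings in this section; `sort_key` is an illustrative helper, not the full algorithm (no expansions, contractions, or variable weighting).

```python
import unicodedata

# Hypothetical mini collation element table: (primary, secondary, tertiary).
TABLE = {
    "\u03b1": (0x0C91, 0x0020, 0x0002),  # small alpha
    "\u0391": (0x0C91, 0x0020, 0x0008),  # capital alpha
    "\u03b2": (0x0C92, 0x0020, 0x0002),  # small beta
    "\u0301": (0x0000, 0x0032, 0x0002),  # combining acute: ignored at Level 1
    "\u0300": (0x0000, 0x0035, 0x0002),  # combining grave: ignored at Level 1
}


def sort_key(text):
    """Concatenate all Level 1 weights, then Level 2, then Level 3."""
    elements = [TABLE[ch] for ch in unicodedata.normalize("NFD", text)]
    key = []
    for level in range(3):
        if level:
            key.append(0)  # separator between levels
        # Zero weights mean "ignore at this level" and are left out of the key.
        key.extend(e[level] for e in elements if e[level] != 0)
    return key


samples = ["\u03b1\u03b2", "\u03b1\u0301", "\u0391", "\u03b1"]
ordered = sorted(samples, key=sort_key)
```

Because every Level 1 weight is compared before any Level 2 weight, accented alpha still sorts ahead of alpha-beta, while case only breaks ties between otherwise identical strings.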
Note that as specified by Unicode, backward or forward directionality applies to an entire level, not to individual codepoints within a level. So if an implementor wanted to make Unicode follow the TLG in reverse directionality for breathings but forward directionality for accents, they would have to customise the DUCET, demoting acutes to Level 3. (By the same token hypodiastoles would end up in Level 4, since they are reverse again in TLG sorting.) The implementor would have to decide if this is worth the hassle. To be honest, I don't think it is, particularly given the chaotic history of Greek sorting presented by Bunz & Küster.
So with abundant provisos that the DUCET is intended to be customised and not just taken off the shelf, here is how the DUCET treats Greek diacritics as of version 3.1.1:
Codepoint | Secondary Weight | Tertiary Weight | Quaternary Weight |
---|---|---|---|
U+0313 Combining Comma Above (Smooth Breathing) | 0x0022 | 0x0002 | 0x0313 |
U+0343 Combining Greek Koronis | 0x0022 | 0x0002 | 0x0343 |
U+0314 Combining Reversed Comma Above (Rough Breathing) | 0x002A | 0x0002 | 0x0314 |
U+0301 Combining Acute Accent | 0x0032 | 0x0002 | 0x0301 |
U+0300 Combining Grave Accent | 0x0035 | 0x0002 | 0x0300 |
U+0306 Combining Breve | 0x0037 | 0x0002 | 0x0306 |
U+0342 Combining Greek Perispomeni | 0x0045 | 0x0002 | 0x0342 |
U+0308 Combining Diaeresis | 0x0047 | 0x0002 | 0x0308 |
U+0304 Combining Macron | 0x005A | 0x0002 | 0x0304 |
U+0323 Combining Dot Below | 0x0079 | 0x0002 | 0x0323 |
U+0345 Combining Greek Ypogegrammeni | 0x0096 | 0x0002 | 0x0345 |
This means that the DUCET hierarchy is breathing > accent > iota subscript, which is in accordance with TLG practice; this is also the ordering of combining diacritics on letters that Unicode imposes as normative. The disruption with breve and macron sorting either side of the circumflex does not actually affect Greek: circumflexes unambiguously mark their vowels as long, so they do not get combined with quantity diacritics. A Greek-specific implementation should probably demote breve and macron below circumflex and iota subscript, though, since the latter are canonically part of Greek orthography and the former are not.
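The suggested Greek-specific demotion of breve and macron could be sketched as follows. This is an assumption-laden toy: `DUCET_SECONDARY` holds only the eleven secondary weights from the table above, and `tailor` is a hypothetical helper that simply gives the demoted marks weights beyond the highest one listed; a real tailoring would also have to avoid colliding with diacritics not shown here.

```python
# Secondary weights copied from the DUCET 3.1.1 table above, keyed by codepoint.
DUCET_SECONDARY = {
    0x0313: 0x0022,  # smooth breathing
    0x0343: 0x0022,  # koronis
    0x0314: 0x002A,  # rough breathing
    0x0301: 0x0032,  # acute
    0x0300: 0x0035,  # grave
    0x0306: 0x0037,  # breve
    0x0342: 0x0045,  # perispomeni
    0x0308: 0x0047,  # diaeresis
    0x0304: 0x005A,  # macron
    0x0323: 0x0079,  # dot below
    0x0345: 0x0096,  # ypogegrammeni
}

DEMOTED = (0x0306, 0x0304)  # breve, then macron


def tailor(weights, demoted):
    """Return a copy in which the demoted marks sort after every other listed mark."""
    top = max(weights.values())
    tailored = dict(weights)
    for offset, codepoint in enumerate(demoted, start=1):
        tailored[codepoint] = top + offset
    return tailored


greek = tailor(DUCET_SECONDARY, DEMOTED)
order = sorted(greek, key=greek.get)  # codepoints in tailored secondary order
```

In the tailored order, breve and macron come last, after both the perispomeni and the ypogegrammeni, and the relative order of the canonical Greek diacritics is unchanged.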
The DUCET ordering gives the following for Greek:
As for the base characters of Greek, DUCET interleaves the Anti-Greek mathematical variants and the Mathematical symbol variants with the corresponding Greek letters; the current ordering is given below. Note that certain codepoints are asterisked; this means that such codepoints in the table have a variable weighting, according to what the implementor decides. The default treatment of such codepoints is "shifted", meaning that they are ignored at levels 1–3, and the character is discriminated at the quaternary level as 0xFFFF. So the keraia by default is ignored until all base letters, diacritics, and casing have been dealt with. (This is the default standing of punctuation in the DUCET.) The table also skips any precomposed combinations of letters and diacritics, since they will be handled in sorting by decomposition.
ʹ | U+0374 | [*02E9.0020.0002.0374] # GREEK NUMERAL SIGN; QQC |
͵ | U+0375 | [*02EA.0020.0002.0375] # GREEK LOWER NUMERAL SIGN |
; | U+037E | [*0235.0020.0002.037E] # GREEK QUESTION MARK; QQC |
΄ | U+0384 | [*020D.0020.0002.0384] # GREEK TONOS; QQC |
· | U+0387 | [*025F.0020.0002.0387] # GREEK ANO TELEIA; QQC |
α | U+03B1 | [.0C91.0020.0002.03B1] # GREEK SMALL LETTER ALPHA |
Α | U+0391 | [.0C91.0020.0008.0391] # GREEK CAPITAL LETTER ALPHA |
β | U+03B2 | [.0C92.0020.0002.03B2] # GREEK SMALL LETTER BETA |
ϐ | U+03D0 | [.0C92.0020.0004.03D0] # GREEK BETA SYMBOL; QQK |
Β | U+0392 | [.0C92.0020.0008.0392] # GREEK CAPITAL LETTER BETA |
γ | U+03B3 | [.0C93.0020.0002.03B3] # GREEK SMALL LETTER GAMMA |
Γ | U+0393 | [.0C93.0020.0008.0393] # GREEK CAPITAL LETTER GAMMA |
δ | U+03B4 | [.0C94.0020.0002.03B4] # GREEK SMALL LETTER DELTA |
Δ | U+0394 | [.0C94.0020.0008.0394] # GREEK CAPITAL LETTER DELTA |
ε | U+03B5 | [.0C95.0020.0002.03B5] # GREEK SMALL LETTER EPSILON |
ϵ | U+03F5 | [.0C95.0020.0004.03F5] # GREEK LUNATE EPSILON SYMBOL; QQK |
Ε | U+0395 | [.0C95.0020.0008.0395] # GREEK CAPITAL LETTER EPSILON |
ϝ | U+03DD | [.0C96.0020.0002.03DD] # GREEK SMALL LETTER DIGAMMA |
Ϝ | U+03DC | [.0C96.0020.0008.03DC] # GREEK LETTER DIGAMMA |
ϛ | U+03DB | [.0C97.0020.0002.03DB] # GREEK SMALL LETTER STIGMA |
Ϛ | U+03DA | [.0C97.0020.0008.03DA] # GREEK LETTER STIGMA |
ζ | U+03B6 | [.0C98.0020.0002.03B6] # GREEK SMALL LETTER ZETA |
Ζ | U+0396 | [.0C98.0020.0008.0396] # GREEK CAPITAL LETTER ZETA |
η | U+03B7 | [.0C99.0020.0002.03B7] # GREEK SMALL LETTER ETA |
Η | U+0397 | [.0C99.0020.0008.0397] # GREEK CAPITAL LETTER ETA |
θ | U+03B8 | [.0C9A.0020.0002.03B8] # GREEK SMALL LETTER THETA |
ϑ | U+03D1 | [.0C9A.0020.0004.03D1] # GREEK THETA SYMBOL; QQK |
Θ | U+0398 | [.0C9A.0020.0008.0398] # GREEK CAPITAL LETTER THETA |
ϴ | U+03F4 | [.0C9A.0020.000A.03F4] # GREEK CAPITAL THETA SYMBOL; QQK |
ͺ | U+037A | [.0C9B.0020.0002.037A] # GREEK YPOGEGRAMMENI; QQK |
ι | U+03B9 | [.0C9B.0020.0002.03B9] # GREEK SMALL LETTER IOTA |
Ι | U+0399 | [.0C9B.0020.0008.0399] # GREEK CAPITAL LETTER IOTA |
ϳ | U+03F3 | [.0C9C.0020.0002.03F3] # GREEK LETTER YOT |
κ | U+03BA | [.0C9D.0020.0002.03BA] # GREEK SMALL LETTER KAPPA |
ϰ | U+03F0 | [.0C9D.0020.0004.03F0] # GREEK KAPPA SYMBOL; QQK |
Κ | U+039A | [.0C9D.0020.0008.039A] # GREEK CAPITAL LETTER KAPPA |
ϗ | U+03D7 | [.0C9D.0020.0004.03D7][.0C91.0020.0004.03D7][.0C9B.0020.001F.03D7] # GREEK KAI SYMBOL; QQKN |
λ | U+03BB | [.0C9E.0020.0002.03BB] # GREEK SMALL LETTER LAMDA |
Λ | U+039B | [.0C9E.0020.0008.039B] # GREEK CAPITAL LETTER LAMDA |
μ | U+03BC | [.0C9F.0020.0002.03BC] # GREEK SMALL LETTER MU |
Μ | U+039C | [.0C9F.0020.0008.039C] # GREEK CAPITAL LETTER MU |
ν | U+03BD | [.0CA0.0020.0002.03BD] # GREEK SMALL LETTER NU |
Ν | U+039D | [.0CA0.0020.0008.039D] # GREEK CAPITAL LETTER NU |
ξ | U+03BE | [.0CA1.0020.0002.03BE] # GREEK SMALL LETTER XI |
Ξ | U+039E | [.0CA1.0020.0008.039E] # GREEK CAPITAL LETTER XI |
ο | U+03BF | [.0CA2.0020.0002.03BF] # GREEK SMALL LETTER OMICRON |
Ο | U+039F | [.0CA2.0020.0008.039F] # GREEK CAPITAL LETTER OMICRON |
π | U+03C0 | [.0CA3.0020.0002.03C0] # GREEK SMALL LETTER PI |
ϖ | U+03D6 | [.0CA3.0020.0004.03D6] # GREEK PI SYMBOL; QQK |
Π | U+03A0 | [.0CA3.0020.0008.03A0] # GREEK CAPITAL LETTER PI |
ϟ | U+03DF | [.0CA4.0020.0002.03DF] # GREEK SMALL LETTER KOPPA |
Ϟ | U+03DE | [.0CA4.0020.0008.03DE] # GREEK LETTER KOPPA |
ρ | U+03C1 | [.0CA5.0020.0002.03C1] # GREEK SMALL LETTER RHO |
ϱ | U+03F1 | [.0CA5.0020.0004.03F1] # GREEK RHO SYMBOL; QQK |
Ρ | U+03A1 | [.0CA5.0020.0008.03A1] # GREEK CAPITAL LETTER RHO |
σ | U+03C3 | [.0CA6.0020.0002.03C3] # GREEK SMALL LETTER SIGMA |
ϲ | U+03F2 | [.0CA6.0020.0004.03F2] # GREEK LUNATE SIGMA SYMBOL; QQK |
Σ | U+03A3 | [.0CA6.0020.0008.03A3] # GREEK CAPITAL LETTER SIGMA |
ς | U+03C2 | [.0CA6.0020.0019.03C2] # GREEK SMALL LETTER FINAL SIGMA; QQK |
τ | U+03C4 | [.0CA7.0020.0002.03C4] # GREEK SMALL LETTER TAU |
Τ | U+03A4 | [.0CA7.0020.0008.03A4] # GREEK CAPITAL LETTER TAU |
υ | U+03C5 | [.0CA8.0020.0002.03C5] # GREEK SMALL LETTER UPSILON |
Υ | U+03A5 | [.0CA8.0020.0008.03A5] # GREEK CAPITAL LETTER UPSILON |
ϒ | U+03D2 | [.0CA8.0020.000A.03D2] # GREEK UPSILON WITH HOOK SYMBOL; QQK |
φ | U+03C6 | [.0CA9.0020.0002.03C6] # GREEK SMALL LETTER PHI |
ϕ | U+03D5 | [.0CA9.0020.0004.03D5] # GREEK PHI SYMBOL; QQK |
Φ | U+03A6 | [.0CA9.0020.0008.03A6] # GREEK CAPITAL LETTER PHI |
χ | U+03C7 | [.0CAA.0020.0002.03C7] # GREEK SMALL LETTER CHI |
Χ | U+03A7 | [.0CAA.0020.0008.03A7] # GREEK CAPITAL LETTER CHI |
ψ | U+03C8 | [.0CAB.0020.0002.03C8] # GREEK SMALL LETTER PSI |
Ψ | U+03A8 | [.0CAB.0020.0008.03A8] # GREEK CAPITAL LETTER PSI |
ω | U+03C9 | [.0CAC.0020.0002.03C9] # GREEK SMALL LETTER OMEGA |
Ω | U+03A9 | [.0CAC.0020.0008.03A9] # GREEK CAPITAL LETTER OMEGA |
ϡ | U+03E1 | [.0CAD.0020.0002.03E1] # GREEK SMALL LETTER SAMPI |
Ϡ | U+03E0 | [.0CAD.0020.0008.03E0] # GREEK LETTER SAMPI |
ϣ | U+03E3 | [.0CAE.0020.0002.03E3] # COPTIC SMALL LETTER SHEI |
Ϣ | U+03E2 | [.0CAE.0020.0008.03E2] # COPTIC CAPITAL LETTER SHEI |
ϥ | U+03E5 | [.0CAF.0020.0002.03E5] # COPTIC SMALL LETTER FEI |
Ϥ | U+03E4 | [.0CAF.0020.0008.03E4] # COPTIC CAPITAL LETTER FEI |
ϧ | U+03E7 | [.0CB0.0020.0002.03E7] # COPTIC SMALL LETTER KHEI |
Ϧ | U+03E6 | [.0CB0.0020.0008.03E6] # COPTIC CAPITAL LETTER KHEI |
ϩ | U+03E9 | [.0CB1.0020.0002.03E9] # COPTIC SMALL LETTER HORI |
Ϩ | U+03E8 | [.0CB1.0020.0008.03E8] # COPTIC CAPITAL LETTER HORI |
ϫ | U+03EB | [.0CB2.0020.0002.03EB] # COPTIC SMALL LETTER GANGIA |
Ϫ | U+03EA | [.0CB2.0020.0008.03EA] # COPTIC CAPITAL LETTER GANGIA |
ϭ | U+03ED | [.0CB3.0020.0002.03ED] # COPTIC SMALL LETTER SHIMA |
Ϭ | U+03EC | [.0CB3.0020.0008.03EC] # COPTIC CAPITAL LETTER SHIMA |
ϯ | U+03EF | [.0CB4.0020.0002.03EF] # COPTIC SMALL LETTER DEI |
Ϯ | U+03EE | [.0CB4.0020.0008.03EE] # COPTIC CAPITAL LETTER DEI |
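Rows in the format used above can be read mechanically. The sketch below is illustrative only: `parse_elements` and `ignore_variable` are hypothetical helpers, and `ignore_variable` models just the "ignored at levels 1–3" part of the shifted treatment, leaving the fourth field untouched rather than attempting the full quaternary handling.

```python
import re

# One collation element: "[" + "." or "*" + four 4-digit hex weights + "]".
ELEMENT = re.compile(
    r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]"
)


def parse_elements(row):
    """Return a list of (is_variable, [w1, w2, w3, w4]) for one table row."""
    elements = []
    for flag, *fields in ELEMENT.findall(row):
        weights = [int(field, 16) for field in fields]
        elements.append((flag == "*", weights))  # "*" marks variable weighting
    return elements


def ignore_variable(elements):
    """Zero levels 1-3 of variable elements; other elements pass through."""
    return [[0, 0, 0, w[3]] if variable else w for variable, w in elements]


keraia = parse_elements("[*02E9.0020.0002.0374] # GREEK NUMERAL SIGN; QQC")
alpha = parse_elements("[.0C91.0020.0002.03B1] # GREEK SMALL LETTER ALPHA")
kai = parse_elements(
    "[.0C9D.0020.0004.03D7][.0C91.0020.0004.03D7][.0C9B.0020.001F.03D7]"
)
```

The kai symbol row yields three elements, whose primary weights are those of kappa, alpha and iota, which is how it ends up sorting as the word it abbreviates.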
What this table means is that, at least in the DUCET:
The Greek letters new to Unicode 4.0 have not yet been incorporated into DUCET, but where they will go is mostly predictable: numeric koppa will sort after archaic koppa, reverse epsilon symbol after lunate epsilon symbol, capital theta symbol after capital theta, capital lunate sigma after lowercase lunate sigma, san before both koppas. There is no reason to think sho will not go after omega (and probably also sampi), per Bactrianist practice.
Nick Nicholas, opoudjis [AT] optusnet . com . au
Created: 2003-09-07; Last revision: 2008-02-06
URL: http://www.opoudjis.net/unicode/unicode_ordering.html