Rourke on Lunate Sigma
From Greek Unicode mailing list (2003-05-16).
Greek Lunate Sigma Symbol, U+03F2, is not your father's lunate sigma. Unless you are in a context in which lunate sigma is distinguished from non-lunate sigma (for instance, a paleographical context), I would strongly recommend avoiding this character, because using it will cause normalization issues. If you want a lunate sigma letter-form as a typographical flourish, you should probably encode your text with the standard code points for sigma (U+03C2 for terminal, U+03C3 for initial-medial) and use a font that provides the same lunate sigma glyph for both code points. Otherwise you will have to rely upon search resources, web services (in the specialized sense of "web service" used by developers), etc. having the sense to do normalization for you, and you risk that any user doing anything other than simply reading your text will be unable to recognize it for what it is. (For instance, I do not know, and perhaps someone from Perseus could tell us, whether the Perseus morphological lookup normalizes the lunate sigma symbol to an abstract sigma character: if anyone would have provided this normalization, it would be Perseus.)
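To make the search problem concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The sample words and variable names are illustrative assumptions, not taken from any real search service. It suggests that a plain comparison does not equate the two spellings, that NFC leaves U+03F2 alone, and that compatibility normalization (NFKC) maps it to terminal sigma regardless of its position in the word.

```python
import unicodedata

LUNATE = "\u03f2"  # GREEK LUNATE SIGMA SYMBOL
FINAL = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA

# "logos" spelled with the standard final sigma and with the lunate symbol.
standard = "\u03bb\u03cc\u03b3\u03bf" + FINAL
lunate = "\u03bb\u03cc\u03b3\u03bf" + LUNATE

# A naive comparison (or substring search) treats them as different words.
naive_match = standard == lunate

# NFC does not touch the lunate symbol ...
nfc_match = unicodedata.normalize("NFC", lunate) == standard

# ... while NFKC folds it to final sigma via its compatibility mapping.
nfkc_match = unicodedata.normalize("NFKC", lunate) == standard

# The same folding is wrong word-initially: "sophia" ends up
# starting with a final sigma.
sophia = LUNATE + "\u03bf\u03c6\u03af\u03b1"
nfkc_sophia = unicodedata.normalize("NFKC", sophia)

print(naive_match, nfc_match, nfkc_match, nfkc_sophia[0] == FINAL)
```

In other words, a service would have to do something smarter than off-the-shelf normalization to treat lunate text as ordinary Greek.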
The lunate sigma problem is ultimately an issue with the positional shaping of the non-lunate sigma. Nick Nicholas has argued elsewhere (pretty compellingly, and against my own comments in one case, by the way) that there is a semantic distinction between the initial-medial sigma and the terminal sigma, because the use of initial-medial before a period indicates an abbreviation, etc. I'm not doing his argument justice here, but the point is that there is information preserved by the use of two code points for sigma that is lost if one uses a single code point for sigma, so conversion between lunate and non-lunate sigmas is not conservative of information. By adding the lunate sigma symbols to the repertoire (Unicode 4 introduces an uppercase lunate sigma), the repertoire designers have introduced a normalization problem that can only be resolved by eschewing the use of the lunate sigma code points (as I suspect they intended) except for those rare occasions when both forms are used and the distinction is meaningful (for instance, when the lunate sigma is used as a symbol in a stemma, etc.).
There are already normalization issues, of course, between the extended Greek characters and the basic Greek characters: for instance, one must normalize away the distinction between the basic Greek accented vowels and the extended Greek vowels with acute, though proper usage (NFC) prescribes the use of the former to the exclusion of the latter. Font designers have wrongly taken the glyph shapes shown in the Unicode Standard code charts as prescriptive and normative (they are quite explicitly described as non-normative), and many of them are quite possibly ignorant of the Greek language (especially the modern language; the Unicode Standard, justifiably, gives the names of the accents on Greek characters in modern Greek). As a result, they have introduced spurious distinctions between the letter-forms for the vowels "with tonos" and the vowels with acute, distinctions that mirror the differences between contemporary monotonic typographical usage and traditional polytonic typographical usage. Users see that the accent in their font is nearly vertical on the (correct) alpha "with tonos" while it is more horizontal on the alpha "with oxia," and so naturally use the alpha "with oxia" (U+1F71) because that is what they want to see on the page, even though the correct character to use is the alpha "with tonos" (U+03AC); in doing so they are in fact magnifying the font designer's error.
Fortunately, because there is really no legitimate symbolic usage of U+1F71 that is distinct from the usage of U+03AC (in the same way that one might want to distinguish between U+03F1, the Greek Rho Symbol and the standard U+03C1 lower-case rho, as the former is used in some mathematical and technical contexts in a way that distinguishes it from the latter), normalizing all U+1F71s to U+03ACs is easy: no information is lost. And so these normalizations are commonly provided by applications.
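As a small illustration of why this case is easy, the following Python sketch (standard library only; the helper name `to_nfc` is my own, hypothetical choice) checks that NFC folds the alpha with oxia into the alpha with tonos, while leaving the rho symbol distinct from the ordinary rho.

```python
import unicodedata

ALPHA_OXIA = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA
ALPHA_TONOS = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS
RHO_SYMBOL = "\u03f1"   # GREEK RHO SYMBOL
RHO = "\u03c1"          # GREEK SMALL LETTER RHO


def to_nfc(text: str) -> str:
    """Return the NFC (canonical composed) form of text."""
    return unicodedata.normalize("NFC", text)


# U+1F71 is canonically equivalent to U+03AC, so NFC replaces it outright.
oxia_folds_to_tonos = to_nfc(ALPHA_OXIA) == ALPHA_TONOS

# Both spellings also share one fully decomposed form: alpha + combining acute.
same_nfd = (unicodedata.normalize("NFD", ALPHA_OXIA)
            == unicodedata.normalize("NFD", ALPHA_TONOS)
            == "\u03b1\u0301")

# The rho symbol is only compatibility-equivalent to rho, so NFC keeps it.
rho_symbol_kept = to_nfc(RHO_SYMBOL) == RHO_SYMBOL

print(oxia_folds_to_tonos, same_nfd, rho_symbol_kept)
```

Because the mapping is one-to-one in the direction that matters, an application can apply it unconditionally.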
But if one normalizes U+03F2, one needs to determine whether to use U+03C2 or U+03C3, and because the semantic connotations of the medial sigma, when it occurs in what would normally be a terminal position, cannot be determined programmatically, one is stuck. The same problem crops up in recasing (switching between uppercase and lowercase), and Nick Nicholas's discussion on the subject is an interesting read.
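The difficulty can be sketched in Python. The functions `delunate` and `to_lunate` below are hypothetical helpers written for this illustration, and the positional rule (terminal sigma unless a letter or combining mark follows) is only an assumed heuristic, not an established algorithm. It handles ordinary words, but it has no way of knowing when a sigma before a period was meant as a medial sigma marking an abbreviation.

```python
import unicodedata

LUNATE = "\u03f2"  # GREEK LUNATE SIGMA SYMBOL
FINAL = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
MEDIAL = "\u03c3"  # GREEK SMALL LETTER SIGMA


def delunate(text: str) -> str:
    """Replace each lunate sigma using a purely positional guess."""
    out = []
    for i, ch in enumerate(text):
        if ch != LUNATE:
            out.append(ch)
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Assume the word continues if a letter or combining mark follows.
        word_continues = nxt != "" and unicodedata.category(nxt)[0] in "LM"
        out.append(MEDIAL if word_continues else FINAL)
    return "".join(out)


def to_lunate(text: str) -> str:
    """Collapse both positional sigmas into the lunate symbol."""
    return text.replace(MEDIAL, LUNATE).replace(FINAL, LUNATE)


# An ordinary word ("sophos") comes out as expected.
sophos = delunate("\u03f2\u03bf\u03c6\u03cc\u03f2")

# An abbreviation written with medial sigma before a period is lost:
# both spellings collapse to the same lunate text, and the guess
# restores a final sigma.
abbreviation_lost = to_lunate(MEDIAL + ".") == to_lunate(FINAL + ".")
guessed = delunate(LUNATE + ".")

# Recasing loses the same distinction: both sigmas uppercase to U+03A3.
same_upper = MEDIAL.upper() == FINAL.upper() == "\u03a3"
```

The round trip through lunate sigma therefore cannot, in general, restore the original text.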
If Greek had been an Indic language with little tradition in encoding or typography to deal with, I suspect that the extended characters would never have been encoded, nor even the vowels with tonos, but only the basic characters and combining diacriticals, and that the various sigmas would have been encoded as two sigmas (one upper, one lower) with a refining non-spacing invisible character for use in those cases when a sigma terminal by position is medial by context. But the rich traditions of Greek typography have given us a far more complex beast to work with.
Created: 2004-01-01; Last revision: 2004-01-01