Better casing functions (German ß, etc.)

Discussion:

박신환

2018-07-11 06:59:37 UTC

Current Haskell has 'simple' `Char`-to-`Char` casing functions (as specified by Unicode), namely `toUpper`, `toLower` and `toTitle`.

So to convert the case of a `String`, Haskell expects `fmap toUpper`, etc. But this has some bugs.

Case 1. German ß (Eszett)

'ß' (U+00DF), Latin Small Letter Sharp S, is a lowercase letter itself, but Unicode doesn't specify its 'simple' uppercase counterpart.
It's because its uppercase counterpart is not a single character, but two characters, "SS".
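
To make the failure concrete, here is a minimal sketch (not code from the original post; the name `upperSimple` is made up for illustration). Because `map` can only replace one `Char` with one `Char`, the two-character result "SS" is out of reach:

```haskell
import Data.Char (toUpper)

-- Hypothetical helper: the Char-to-Char approach described above.
-- 'map' always preserves length, so one 'ß' can never become "SS".
upperSimple :: String -> String
upperSimple = map toUpper

-- upperSimple "straße" gives "STRAßE": the 'ß' is left as it is,
-- because it has no 'simple' uppercase mapping.
```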

Case 2. Turkish İ and ı
Rather than the common 'I' and 'i' case pair, the Turkish language has the 'İ' (U+0130) and 'i' pair and the 'I' and 'ı' (U+0131) pair. Those are the dotted I pair and the dotless I pair, respectively.
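
A sketch of what a language-aware variant could look like follows. The `Lang` type and the names `upperIn` and `lowerIn` are assumptions made up for this example, not an existing API; a real library would more likely take a locale. Only the two mappings that differ from the default are overridden, since the simple mappings already send 'ı' to 'I' and 'İ' to 'i':

```haskell
import Data.Char (toLower, toUpper)

-- Hypothetical language tag, for illustration only.
data Lang = Default | Turkish

-- Uppercase, applying the Turkish dotted/dotless I rule when requested.
upperIn :: Lang -> String -> String
upperIn Turkish = map step
  where
    step 'i' = '\x0130'   -- i -> İ (capital I with dot above)
    step c   = toUpper c  -- ı (U+0131) already maps to I
upperIn Default = map toUpper

-- Lowercase, applying the Turkish dotted/dotless I rule when requested.
lowerIn :: Lang -> String -> String
lowerIn Turkish = map step
  where
    step 'I' = '\x0131'   -- I -> ı (dotless i)
    step c   = toLower c  -- İ (U+0130) already maps to i
lowerIn Default = map toLower
```

The point of the sketch is that the correct result depends on information that is not in the `String` itself, so the language has to be an extra argument.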

Case 3. Greek Σ (Sigma)
Greek 'Σ' (U+03A3) must be lowercase mapped to 'ς' (U+03C2) if followed by a whitespace, rather than normal 'σ' (U+03C3).

Case 4. Greek iota subscript (Ypogegrammeni)
Greek 'capital' letters with iota subscripts (for example, 'ᾈ' (U+1F88)), though they are the 'simple' uppercase counterparts of their lowercase forms, are actually treated as titlecase characters. For example, the actual uppercase counterpart of 'ᾀ' (U+1F80) is "ἈΙ" (U+1F08 U+0399), that is, with an actual capital iota instead of the iota subscript.

Case 5. Precomposed letters without upper/lowercase counterpart
For example, ΐ (U+0390) doesn't have a precomposed uppercase counterpart. It must be effectively mapped to "Ϊ́" (U+03AA U+0301).

In summary, we need more elaborate casing functions that are `String`-to-`String`.
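
As a rough sketch of the shape such a function could take (an illustration under assumptions, not a proposed API): look each character up in a table of one-to-many mappings and fall back to `toUpper` otherwise. The names `specialUpper` and `upperFull` are hypothetical, and the table holds only the three unconditional examples from Cases 1, 4 and 5; a real implementation would need the full set of special casings from the Unicode data files, plus the language- and context-sensitive rules from Cases 2 and 3.

```haskell
import Data.Char (toUpper)
import Data.Maybe (fromMaybe)

-- Hypothetical table of one-to-many uppercase mappings (examples only).
specialUpper :: [(Char, String)]
specialUpper =
  [ ('\x00DF', "SS")            -- ß, Case 1
  , ('\x1F80', "\x1F08\x0399")  -- alpha with psili and ypogegrammeni, Case 4
  , ('\x0390', "\x03AA\x0301")  -- iota with dialytika and tonos, Case 5
  ]

-- Hypothetical String-to-String uppercase: table first, toUpper as fallback.
upperFull :: String -> String
upperFull = concatMap step
  where
    step c = fromMaybe [toUpper c] (lookup c specialUpper)
```

Using `concatMap` instead of `map` is what allows the result to be longer than the input, for example `upperFull "straße"` gives "STRASSE".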

Bibliography:
The Unicode Standard Version 11.0 – Core Specification, Section 5.18.

Francesco Ariis

2018-07-11 07:35:35 UTC

Hello 박신환,

Post by 박신환
Case 4. Greek iota subscript (Ypogegrammeni)

I think not even Data.Text handles this correctly!

Mario Blažević

2018-07-11 12:33:31 UTC

Post by 박신환
Current Haskell has 'simple' `Char`-to-`Char` casing functions (as
specified by Unicode), namely `toUpper`, `toLower` and `toTitle`.
So to convert cases of a `String`, Haskell intends `fmap toUpper`,
etc. But this has some bugs.

I've never tested the cases you list, but I believe the text-icu library
covers them. See
http://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU.html#g:4

Post by 박신환
Case 1. German ß (Eszett)
'ß' (U+00DF), Latin Small Letter Sharp S, is a lowercase letter
itself, but Unicode doesn't specify its 'simple' uppercase counterpart.
It's because its uppercase counterpart is not a single character, but two characters, "SS".
Case 2. Turkish İ and ı
Rather than the common 'I' and 'i' case pair, Turkish language has the
'İ' (U+0130) and 'i' pair and the 'I' and 'ı'(U+0131) pair. Those
are, dotted I pair and dotless I pair, respectively.
Case 3. Greek Σ (Sigma)
Greek 'Σ' (U+03A3) must be lowercase mapped to 'ς' (U+03C2) if
followed by a whitespace, rather than normal 'σ' (U+03C3).
Case 4. Greek iota subscript (Ypogegrammeni)
Greek 'Capital' letters with iota subscripts (for example, 'ᾈ'
(U+1F88)), though they are the 'simple' uppercase counterpart of their
lowercase counterpart, they themselves are actually treated as
titlecase characters. For example, the actual uppercase counterpart of
'ᾀ' (U+1F80) is "ἈΙ" (U+1F08 U+0399). That is, an actual capital iota
instead of the iota subscript.
Case 5. Precomposed letters without upper/lowercase counterpart
For example, ΐ (U+0390) doesn't have a precomposed uppercase
counterpart. It must be effectively mapped to "Ϊ́" (U+03AA U+0301).
In Summary, we need more elaborated casing functions which are
`String`-to-`String`.
/The Unicode Standard Version 11.0 – Core Specification/, Section 5.18.
_______________________________________________
Libraries mailing list
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries

Mikhail Glushenkov

2018-07-11 15:26:50 UTC

Hi,

Post by 박신환
[...]
Case 1. German ß (Eszett)
'ß' (U+00DF), Latin Small Letter Sharp S, is a lowercase letter itself, but Unicode doesn't specify its 'simple' uppercase counterpart.
It's because its uppercase counterpart is not a single character, but two characters, "SS".

Capital sharp s is now also considered valid:
https://medium.com/@typefacts/the-german-capital-letter-eszett-e0936c1388f8