2005-06-18

Unicode: The Chillu Challenge

Abstract
  • Chillu-C1 + C2 is not equivalent to C1 + Virama + C2, contrary to Rachana's claims.
  • The chillu issue exists independently of the Samvruthokaram confusion.
Details

This is proved through a counter-example.

If Chillu-C1 + C2 is equivalent to C1 + Virama + C2, provide different Unicode representations for the following two Malayalam words:
  • /pin_nilaavum/
  • /pinnilaavum/

If there is no solution, that indicates that Chillu-C1 + C2 is not equivalent to C1 + Virama + C2 and that the current chillu representation in Malayalam Unicode is flawed, which makes section 3 of the Rachana document incorrect.

In other words, the half-form (notation of vowellessness) of C1 is not the same as chillu-C1.

Why should they have different encodings?

The words /pin_nilaavum/ and /pinnilaavum/ are different in all 3 essential attributes of a word:
  • Meaning. /pin_nilaavum/ means 'and shadow of moonlight'. /pinnilaavum/ means 'will be behind'
  • Pronunciation. The second 'na' of /pinnilaavum/ is alveolar, while that of /pin_nilaavum/ is dental.
  • Orthography. The first 'na' of /pin_nilaavum/ is a chillu, while /pinnilaavum/ has the conjunct double 'na'.
So these two words should have two different Unicode encodings. This argument is exactly the same as why 'apple' and 'banana' should have two different Unicode encodings.

The difference should be in characters other than joiners

See pages 389 to 391 in chapter 15 of Unicode 4.0.0
"ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER are format control
characters. Like other such characters, they should be ignored by
processes that analyze text content. For example, a spelling-checker
or find/replace operation should filter them out. (See Section 2.11,
Special Characters and Noncharacters, for a general discussion of
format control characters.)"
(thanks to Mahesh Pai)
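
The effect of the quoted rule can be sketched in a few lines of Python. This is only an illustration, assuming the chillu is written as consonant + virama + ZWJ; the helper name strip_joiners is made up for this sketch and is not part of any library.

```python
# Sketch: compare the two challenge words the way a find/replace or
# spelling-checker would if it filters out joiners, as the quote asks.
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

# /pinnilaavum/: PA, I, NA, VIRAMA, NA, I, LA, AA, VA, U, ANUSVARA
pinnilaavum = "\u0D2A\u0D3F\u0D28\u0D4D\u0D28\u0D3F\u0D32\u0D3E\u0D35\u0D41\u0D02"

# /pin_nilaavum/ (assumed encoding): the same code points,
# plus a ZWJ right after the virama to request the chillu.
pin_nilaavum = "\u0D2A\u0D3F\u0D28\u0D4D\u200D\u0D28\u0D3F\u0D32\u0D3E\u0D35\u0D41\u0D02"


def strip_joiners(text: str) -> str:
    """Hypothetical helper: remove ZWJ and ZWNJ from the text."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


# As raw code point sequences the two words differ...
print(pinnilaavum == pin_nilaavum)
# ...but once the joiners are filtered out they are the same string.
print(strip_joiners(pinnilaavum) == strip_joiners(pin_nilaavum))
```

Under these assumptions, any process that follows the quoted advice cannot tell the two words apart, which is the point of the challenge.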

More examples

/van_yavanika/ meaning 'big curtain'
/vanyavanika/ meaning 'wild forest'

/kaN_valayam/ meaning 'eye boundary'
/kaNvalayam/ meaning 'peace of Kanvan - the mythical character'.


/than_vinayam/ meaning 'his/her modesty'
/thanvinayam/ meaning 'policy of a woman'


/man_vikshObham/ meaning 'explosion of mind'
/manvikshObham/ meaning 'fury of a lady'

The general patterns of these examples are (a chillu-capable consonant, in either chillu or pure form) + (semi-vowel) and (ന in chillu or pure form) + ന. The examples above are just a few of the vast number possible with these patterns.


Counter-challenge from Kenneth Whistler

If separate characters are encoded for Malayalam Chillus, so that the "challenge" distinction were to be encoded as:

"nn" is U+0D28, U+0D4D, U+0D28

"n_n" is U+0DXX, U+0D4D, U+0D28

implementers are then faced with determining what to do with the following sequence:

"???" is U+0D28, U+0D4D, U+200D, U+0D28

That sequence, of course, exists now, and would be a legitimate and possible sequence even if a Chillu-n is encoded. So how would a rendering engine render that sequence, and how would it be distinguished, by an end user or a text process such as a search engine, from the proposed U+0DXX, U+0D4D, U+0D28 sequence for "n_n"?

That counter-challenge needs a "solution" for the encoding of Chillu characters to make sense for Malayalam. For if there is no solution forthcoming, addition of Chillu characters would potentially be *increasing* the ambiguity potential for the Unicode representation of Malayalam text, rather than decreasing it.
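
To make the sequences in the counter-challenge easier to read, here is a minimal Python sketch that prints the character names of the two sequences that exist today. The function name describe is an assumption of this sketch. The proposed "n_n" sequence is left out because U+0DXX is only a placeholder, not an assigned code point.

```python
import unicodedata

# The two sequences from the counter-challenge that use existing code points.
sequences = {
    "nn": [0x0D28, 0x0D4D, 0x0D28],
    "???": [0x0D28, 0x0D4D, 0x200D, 0x0D28],
}


def describe(code_points):
    """Hypothetical helper: list each code point with its Unicode name."""
    return [f"U+{cp:04X} {unicodedata.name(chr(cp))}" for cp in code_points]


for label, cps in sequences.items():
    print(label)
    for line in describe(cps):
        print("   ", line)
```

The listing shows that the only thing separating "???" from "nn" is the ZERO WIDTH JOINER after the virama.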


Solution to Ken's challenge

The half-form of NA (ന) is not a chillu. It is described in detail here.


Now we can use the rules in the ZERO WIDTH JOINER in Indic Scripts document (pr#37) to see the behavior of the challenge sequence. It will form the conjunct double ന്ന /nna/, as per the second bullet in section 7 (proposal) of the above pr#37.

2 comments:

  1. Hi,

    This is not really a solution, since I really believe the Unicode 4.x standard is unclear (I would not say flawed) and my point was and is that it requires clarification on this issue as soon as possible.


    Here is my reasoning.

    First, how would you enter this using ISCII? (say, a widely available implementation, say, CDAC ;-)) Well, that's easy, just type na+HAL+NUK to get the cillu, and then everything is OK. 'will be behind' is encoded as C8 DB C6 E8 C6 DB D1 DA D4 DD A2, while 'moonlight' gets an additional nukta, ending up as 'C8 DB C6 E8 E9 C6 DB D1 DA D4 DD A2'.

    Then, just applying the usual rules to pass from ISCII to Unicode (HAL+NUK is replaced with VIR+ZWJ), we end up with [hope I can enter UTF8 here]: പിന്നിലാവും for the verbal form, പിന്‍നിലാവും for the moonlight... (looks OK here with Microsoft's Uniscribe 1.473.4067.15 and Kartika, but YMMV). The decomposed sequences are <U+0D2A U+0D3F U+0D28 U+0D4D U+0D28 U+0D3F U+0D32 U+0D3E U+0D35 U+0D41 U+0D02> for the first, and <U+0D2A U+0D3F U+0D28 U+0D4D U+200D U+0D28 U+0D3F U+0D32 U+0D3E U+0D35 U+0D41 U+0D02> for the second.

    Of course, this "solution" is not welcome since it means that the cillu require more codepoints than other letters, and is awkward (to say the least) to be typed in.


    Antoine

  2. Antoine wrote "Of course, this "solution" is not welcome since it means that the cillu require more codepoints than other letters, and is awkward (to say the least) to be typed in."

    Dear, this solution is unwelcome, but not because the chillus need more codepoints. It is understood that ZWJ is only a font directive and does not carry any semantic value. Apart from the ZWJ there is no difference between these two words, so a search for one of them will return both.
