2005-07-20

Unicode: issues with /nta/ & /ta/ encoding

Although the following arguments are detailed only for /nta/, they apply equally to /ta/.


Representation of /nta/ is closely related to what stands for. Malayalam’s practice of representing /ta/ and /rra/ with the same letter has definitely contributed to the confusion over what * is: /nta/ or /nrra/. Here are the details of the two ways in which it is being used:

  1. * is used to represent /nrra/ when writing English words like ‘Henry’ () or ‘Enroll’ () in Malayalam, while represents /nta/. The syllable /nrra/ is not native to Malayalam and is used only for foreign words. A casual reader typically overlooks this difference and considers both * and to be different renderings of the same syllable. Their difference in pronunciation, if required for a foreign word, is inferred from the context.
  2. * is used in many new-orthography fonts to represent /nta/. Typically a font positions itself somewhere between the new and old orthographies, choosing only a convenient subset of conjuncts from the old orthography. As an example, Mathrubhumi is the ASCII Malayalam font closest to the old orthography among those available. Even in that font, /nta/ is rendered as *. So this usage is very common.

These facts give rise to two quite reasonable input scenarios:

  1. There can be Malayalam keyboard layouts with specific keys for Chillu-NA() and RRA (), but none for /nta/.

  2. Even if there is a specific method for inputting /nta/, a writer may choose to input /nta/ as Chillu-NA() followed immediately by RRA().

Along with the above input scenarios, the following not-so-obvious facts should also be considered:

  1. Since Malayalam has just one letter to represent /ta/ and /rra/, and * are essentially the same in the graphemic deep structure: Chillu-NA + RRA

  2. When used as /nta/, * is considered a single unit, while in the /nrra/ context it is treated as having two separate components. This is evident from the reordering of left vowel signs in the following examples: , . The reordering position of the left vowel sign cannot be deduced without knowing the word context from higher-level text processing. This is a serious unsolved problem if we want * to be considered a fallback for stacked .
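
The reordering difference described above can be sketched in a few lines of Python. This is only an illustration, not how any actual rendering engine works: the glyph names (`CHILLU_NA`, `RRA`, `VOWEL_SIGN_E`) and the function `visual_order` are hypothetical, and the `single_unit` flag stands in for the word-level context that, as noted, the renderer cannot deduce by itself.

```python
def visual_order(components, left_vowel, single_unit):
    """Return glyph names in visual (left-to-right) order.

    components  -- glyph names of the cluster in logical order
    left_vowel  -- name of a vowel sign that is displayed on the left
    single_unit -- True if the cluster is one unit (the /nta/ reading),
                   False if its last component stands alone (/nrra/)
    """
    if single_unit:
        # The vowel sign moves in front of the whole cluster.
        return [left_vowel] + components
    # The vowel sign moves in front of the last component only.
    return components[:-1] + [left_vowel] + components[-1:]


# Hypothetical glyph names for illustration only.
cluster = ["CHILLU_NA", "RRA"]
as_nta = visual_order(cluster, "VOWEL_SIGN_E", single_unit=True)
as_nrra = visual_order(cluster, "VOWEL_SIGN_E", single_unit=False)
```

The same logical input yields two different visual orders, and nothing in the cluster itself says which one is wanted. That is why the choice has to be carried by the encoding or supplied by context.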

Thus, we end up with the following choices for encoding stacked :

  1. Graphically , can be considered as Chillu-NA + the subscript form of RRA. Since Chillus are not encoded (yet), as per pr#37, the encoding for will be NA + VIRAMA + ZWJ + ZWJ + VIRAMA + RRA. This is definitely an obscure sequence! There is one more drawback to this encoding: when the font lacks subscript RRA, the rendering engine will fall back to explicit virama + full RRA form, which is not at all desired.
  2. Consider as a conjunct ligature of NA and RRA. This leads to a much simpler encoding: NA + VIRAMA + RRA for and Chillu-NA + RRA forms *. This choice invalidates the use of * for /nta/. As for fallback: per pr#37, the fallback to level 2 can result in the half-form of NA (Chillu-NA) + full RRA form. This is in fact the desired result. If Chillu-NA is separately encoded, we would lose this fallback advantage in this method. Nevertheless, it will have the reordering problem mentioned above. That can be remedied to an extent by always treating NA + VIRAMA + RRA as /nta/ and NA + VIRAMA + ZWJ + RRA as /nrra/. On detecting NA + VIRAMA + RRA, the rendering engine can make sure that the left vowel sign is placed before the NA-form regardless of whether subscript-RRA is present or not.
  3. Chillu-NA + RRA forms the conjunct () to represent /nta/. This conjunct is rendered only if the font has the subscript form of RRA. As per the Zero-Width-Non-Joiner’s usual meaning, Chillu-NA + RRA + ZWNJ should produce * in every font. Then, as per the uniqueness rule, NA + VIRAMA + RRA should not form the /nta/ conjunct. To avoid the reordering problem, on detecting a Chillu-NA + RRA sequence without a trailing ZWNJ, the rendering engine can make sure that the left vowel sign is placed before the NA-form regardless of whether subscript-RRA is present. This option will allow Chillu-NA to be encoded.
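
To make the sequences in choices 1 and 2 concrete, here is a minimal Python sketch that spells them out as code points and applies the /nta/ versus /nrra/ convention suggested under choice 2. The code points for NA, RRA, VIRAMA, ZWJ and ZWNJ are the standard Unicode ones. The constant names and the helpers `code_points` and `classify_choice_2` are made up for illustration. Choice 3 is left out because it depends on a separately encoded Chillu-NA, which this discussion treats as not yet available.

```python
NA = "\u0D28"      # MALAYALAM LETTER NA
RRA = "\u0D31"     # MALAYALAM LETTER RRA
VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER (used by choice 3; shown for reference)

# Choice 1: Chillu-NA written as NA + VIRAMA + ZWJ, then subscript RRA.
CHOICE_1 = NA + VIRAMA + ZWJ + ZWJ + VIRAMA + RRA

# Choice 2: plain conjunct for /nta/, ZWJ variant for /nrra/.
CHOICE_2_NTA = NA + VIRAMA + RRA
CHOICE_2_NRRA = NA + VIRAMA + ZWJ + RRA


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def classify_choice_2(text):
    """Return 'nta', 'nrra', or None for the start of `text` under choice 2."""
    if text.startswith(CHOICE_2_NRRA):
        return "nrra"
    if text.startswith(CHOICE_2_NTA):
        return "nta"
    return None
```

Under this convention, `classify_choice_2(CHOICE_2_NTA)` gives `"nta"`, so a renderer could place the left vowel sign before the NA-form without consulting word context, while the ZWJ variant marks the /nrra/ reading explicitly.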

1 comment:

  1. I cant agree to encode nta as codepoint, only to satisfy the inclusion of english words, henry and sort... Even in english sh is pronounced differently in different contexts.. so nta and nrra will be pronounced correctly on context.
    But I support encoding the chillus.
