2005-07-07

Unicode: Uniqueness Rule on Encoding

Two different encodings should not render the same, irrespective of the font or joiners used.

To see why this rule is required, assume there is a conjunct formation rule for a subset of Chillu-C1 + C2 permutations and that, as per that rule, Chillu-NNA + DDHA (ൺ + ഢ) can form the conjunct (ണ്ഢ) in an old orthography font. Of course, NNA + VIRAMA + DDHA (ണ + ചന്ദ്രക്കല + ഢ) will also form the same conjunct.

Therefore, a document (e.g. a wiktionary.org document) written by multiple people using various input tools can quite possibly contain both spellings of ണ്ഢ, without the reader or the writer being aware of it. This can cause many problems, including ineffective searches and inconsistently sorted word lists.
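
The problem can be sketched in a few lines of Python. This is only an illustration under assumptions: it takes Chillu-NNA to be the single code point U+0D7A, which the text above does not specify, and the helper `find_word` and the word list are hypothetical. The two spellings are different code point sequences, so an exact-match search for one does not find the other.

```python
# Assumption: Chillu-NNA is the atomic code point U+0D7A.
CHILLU_NN = "\u0d7a"  # MALAYALAM LETTER CHILLU NN
NNA = "\u0d23"        # MALAYALAM LETTER NNA
VIRAMA = "\u0d4d"     # MALAYALAM SIGN VIRAMA (chandrakkala)
DDHA = "\u0d22"       # MALAYALAM LETTER DDHA

spelling_a = CHILLU_NN + DDHA     # Chillu-NNA + DDHA
spelling_b = NNA + VIRAMA + DDHA  # NNA + VIRAMA + DDHA

def find_word(word, words):
    """Return the indexes of exact code-point matches (hypothetical helper)."""
    return [i for i, w in enumerate(words) if w == word]

# A hypothetical word list in which two writers used different spellings.
words = [spelling_a, spelling_b]

hits = find_word(spelling_b, words)  # only the entry typed as spelling_b
distinct = len(set(words))           # both spellings survive deduplication
```

If a font rendered both sequences as the same conjunct, the reader would see one word, while the search and the deduplication above would treat it as two.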

Antoine Leca wrote this related text:
"Right now (for fifteen years really), we have a similar problem with Latin in Europe: our accentuated letters have two spellings, one which is the legacy one (using unique codepoints, for example U+00EE (î)), which is the one everybody uses; and the other is the genuine Unicode encoding, the one we ought to use but nobody does in reality, using the base (English) letter and then another codepoint for the accent, i.e. U+0069, U+0302 (î) for my example above. You cannot normally see the difference, and if you do, it is just because of an imperfect Unicode support which does not render correctly the second form (things are getting better here, but still are not perfect). But if you are searching, the different spellings MAY be viewed as different, when of course they should not. Similarly, you could be allowed to enter both forms in a database field as "unique" key, when of course it should be prevented.

As this stuff is pretty evident to anyone in Europe developing in Unicode, this problem has been identified for years; and a "fix" has been developed, that is those two sequences are considered "canonically equivalent", so a "fully conforming" Unicode process should merge the two encodings for processes like searching or inserting. Please note that the majority of the tools used nowadays which deal with Unicode contents do not do that; only the tools specially prepared do it, and this comes with a noticeable performance impact."
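
The canonical equivalence described in the quote can be sketched with Python's standard `unicodedata` module. The helper names `canonical_key`, `canonically_equal` and `insert_unique` are made up for this illustration, and using NFC as the comparison form is one possible choice, not the only one.

```python
import unicodedata

PRECOMPOSED = "\u00ee"       # î as a single code point
DECOMPOSED = "\u0069\u0302"  # i followed by COMBINING CIRCUMFLEX ACCENT

def canonical_key(text: str) -> str:
    """Return the NFC form of text, used here as a comparison key."""
    return unicodedata.normalize("NFC", text)

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return canonical_key(a) == canonical_key(b)

def insert_unique(table: dict, key: str, value) -> bool:
    """Insert value unless a canonically equivalent key already exists."""
    k = canonical_key(key)
    if k in table:
        return False
    table[k] = value
    return True

raw_equal = PRECOMPOSED == DECOMPOSED                   # plain comparison
norm_equal = canonically_equal(PRECOMPOSED, DECOMPOSED)  # after NFC

table = {}
first = insert_unique(table, PRECOMPOSED, "first entry")
second = insert_unique(table, DECOMPOSED, "second entry")  # rejected as duplicate
```

The extra normalization pass on every comparison or insert is where the performance cost mentioned in the quote comes from. Note that this only merges sequences that Unicode defines as canonically equivalent; it would not merge two unrelated sequences that merely happen to render alike in some font.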
