if it were...: Unicode: virama model Vs subjoined model

Copied from Indic Mailing list Archive[username=unicode-ml password=unicode]

This message is motivated by UTC agenda items for Myanmar,
one of which, taken literally, asks for the encoding of subjoined
characters in what is a virama model.

The distinction between the virama model and the subjoined model is often
seen as the source of the dissatisfaction of some of our users. In my
opinion, it is only a secondary issue, one that is not enough to explain
why those users keep coming back. There is a primary issue that we
have not yet identified, and therefore cannot address. It is that primary
issue that we need to find.

I am not sure what the primary issue is. I have a candidate and the
rest of this document is an attempt at describing this candidate. It
is quite possible (and even likely!) that I am entirely wrong; that
does not matter - what matters is that we look for this primary
issue. While I try to be precise and factually correct, there are
certainly intentional white lies and omissions, and arbitrary
simplifications; forests and trees, as they say. If you give up on
reading this, please read the last paragraph anyway.

---


The starting point of the Indic scripts is an inventory of
letters. There is a bunch corresponding to vowel sounds: a, a:, i, i:,
u, u:, etc. There is a bunch corresponding to consonant sounds: ka,
kha, ga, ... At this point, we don't distinguish the dependent and
independent forms of vowels, we have a single "a" in our inventory.

In Latin based writing systems, we also have an inventory of letters
(a, b, c, d, etc). A written word or even a whole text reflects
directly a sequence of those letters. There may be ligatures, but once
you have a sequence of letters and just that, you can legibly render
that word: for each letter, take its letterform, put it at the end of
the line assembled so far.

For Indic scripts, the sequence of letters is not enough: in addition,
you need to know how to segment it. The rendering is: for each
segment, build its letterform, put it at the end of the line assembled
so far. Of course, the segments I am speaking about are the "aksaras"
or something close to that. They are typically a few letters in size
(say, 1 to 5).
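
As a rough sketch of the contrast just described, the same rendering loop
can be fed letters (Latin) or segments (Indic). The names `render_line`
and `boxed` are made up for this illustration, and `boxed` merely stands
in for whatever really builds a letterform.

```python
from typing import Callable, List, Sequence


def render_line(units: Sequence[str], letterform: Callable[[str], str]) -> List[str]:
    """For each unit, build its letterform and put it at the end of the line."""
    line: List[str] = []
    for unit in units:
        line.append(letterform(unit))
    return line


def boxed(unit: str) -> str:
    """Placeholder letterform: one box per rendering unit (an assumption)."""
    return "<" + unit + ">"


# Latin: the sequence of letters is all the renderer needs.
latin_line = render_line("cat", boxed)

# Indic: the same letters give different lines under different segmentations.
indic_line_a = render_line(["pu", "rti"], boxed)
indic_line_b = render_line(["pur", "ti"], boxed)

print(latin_line)
print(indic_line_a)
print(indic_line_b)
```

The loop itself is identical in both cases; what changes is the unit it
iterates over, which is why the segmentation has to be recoverable from
the encoded text.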

The letterform for a segment of course reflects somehow the letters in
that segment.  In some cases, it is a letterform unique to that
sequence of letters, where it is not possible to match a fragment of
the letterform with any of the letters in the segment. In other cases,
some of the letters (but not necessarily all) can be identified with
fragments of the letterform. In yet other cases, the letterform is
obtained by a rather straightforward combination of the letters. These
are not rigid categories: there is a continuum from the "unique shape"
to the "straightforward assembly"; that explains why we end up with a
bunch of exceptions every time we try to come up with rigid
categories.  (This also says that, in my opinion, any formalization in
the encoding of those categories is bound to fail.)

When fragments of the letterform for a segment can be identified with
letters of that segment, then we can observe that they are arranged
with a lot of freedom: the fragment for the first letter may end up on
the far right, the fragment for the last letter may end up on the far
left, a letter in the middle may end up as multiple disconnected
fragments all over the place. Within a segment, anything goes. On
the other hand, there is essentially no typographic interaction
across segments. That's why I like to call the segments "typographic
clusters".

To me, it is this basic inventory of letters and this concept of
typographic clusters that characterize first and foremost the Indic
scripts, and survive more or less across all the historical
transformations and adaptations of the scripts.


So far, we have not discussed at all how the typographic clusters are
determined. In the Sanskrit/Devanagari case, a typographic cluster
corresponds to a /c_1...c_n v/ sound sequence (i.e. some consonants
followed by a vowel), where the /v/ sound is typically not written if
it is the inherent vowel sound /a/ of the script. The sequence of letters
in the typographic cluster will therefore be of the form "[c_1...c_n
v?]". Because the vowel letter is optional, we need some way to
distinguish the two-cluster text "[c_1] [c_2 v_2]" (which writes more
or less the sounds /c_1, a, c_2, v_2/) from the one-cluster text "[c_1
c_2 v_2]" (which writes more or less the sounds /c_1, c_2, v_2/).

We have two encoding models to do that:

- virama model : encode a single coded character per consonant letter,
encode a combining coded character for each vowel letter, encode a
consonant linking coded character, V. Then the typographic cluster
"c_1...c_n v" is represented by the sequence of coded characters "c_1
V ... V c_n v_combining", i.e. we stick the V consonant linking coded
character between the consonants of a cluster. Now we know that "c_1
c_2 v_combining" is segmented "[c_1] [c_2 v]" while "c_1 V c_2
v_combining" is segmented "[c_1 c_2 v]".

- subjoined model : encode two coded characters per consonant letter,
one non-combining and one combining, encode a combining coded
character for each vowel letter; the cluster "c_1 ... c_n v" is
represented by the sequence of coded characters "c_1_non_combining
c_2_combining ... c_n_combining v_combining". Now we know that
"c_1_non_combining c_2_non_combining v_2_combining" is segmented "[c_1]
[c_2 v_2]" while "c_1_non_combining c_2_combining v_2_combining" is
segmented "[c_1 c_2 v_2]".
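
To make the two schemes concrete, here is a minimal Python sketch of both
serializations and of the segmentation that each one lets a reader
recover. The token names follow the notation above; the function names
and the `Cluster` type are assumptions made for this illustration, and
clusters without consonants are deliberately left out.

```python
from typing import List, Optional, Tuple

# A typographic cluster: one or more consonant letters, then an optional vowel.
Cluster = Tuple[List[str], Optional[str]]

V = "V"  # the consonant-linking coded character of the virama model


def encode_virama(clusters: List[Cluster]) -> List[str]:
    """Virama model: put V between the consonants of each cluster."""
    coded: List[str] = []
    for consonants, vowel in clusters:
        for i, c in enumerate(consonants):
            if i > 0:
                coded.append(V)
            coded.append(c)
        if vowel is not None:
            coded.append(vowel + "_combining")
    return coded


def encode_subjoined(clusters: List[Cluster]) -> List[str]:
    """Subjoined model: first consonant non-combining, the others combining."""
    coded: List[str] = []
    for consonants, vowel in clusters:
        for i, c in enumerate(consonants):
            coded.append(c + ("_non_combining" if i == 0 else "_combining"))
        if vowel is not None:
            coded.append(vowel + "_combining")
    return coded


def segment_virama(coded: List[str]) -> List[List[str]]:
    """A consonant starts a new cluster unless a V precedes it."""
    segments: List[List[str]] = []
    for i, ch in enumerate(coded):
        is_consonant = ch != V and not ch.endswith("_combining")
        if i == 0 or (is_consonant and coded[i - 1] != V):
            segments.append([])
        segments[-1].append(ch)
    return segments


def segment_subjoined(coded: List[str]) -> List[List[str]]:
    """Every non-combining coded character starts a new cluster."""
    segments: List[List[str]] = []
    for i, ch in enumerate(coded):
        if i == 0 or ch.endswith("_non_combining"):
            segments.append([])
        segments[-1].append(ch)
    return segments


two_clusters: List[Cluster] = [(["c_1"], None), (["c_2"], "v_2")]
one_cluster: List[Cluster] = [(["c_1", "c_2"], "v_2")]

for text in (two_clusters, one_cluster):
    print(segment_virama(encode_virama(text)))
    print(segment_subjoined(encode_subjoined(text)))
```

Both decoders recover the same cluster boundaries from their respective
encodings; only the coded characters that carry the boundary information
differ.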

In both cases, we also have a non-combining coded character for each
vowel, to mark those clusters which have no consonants.

It should be clear that these two encoding models are just technical
variations, really two serialization schemes to go from a sequence of
typographic clusters to a sequence of coded characters; somehow we
need some way to retain the segments in what is otherwise just a
sequence of coded characters, hence the additional coded
characters. We could dream of other serializations; for example, instead
of having separate combining and non-combining vowels we could use our
V linker: "[v] [cv]" would be "v c V v" instead of our traditional
"v_non_combining c v_combining".
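
A small sketch of that imagined serialization, under the assumption
(mine, for illustration) that V links every letter of a cluster,
consonant or vowel, so that no combining/non-combining pairs are needed.
The function names are hypothetical.

```python
from typing import List

V = "V"  # the linker, here used between all letters of a cluster


def encode_linked(clusters: List[List[str]]) -> List[str]:
    """Join the letters of each cluster with V; clusters just follow each other."""
    coded: List[str] = []
    for letters in clusters:
        for i, letter in enumerate(letters):
            if i > 0:
                coded.append(V)
            coded.append(letter)
    return coded


def segment_linked(coded: List[str]) -> List[List[str]]:
    """A letter starts a new cluster unless a V precedes it."""
    segments: List[List[str]] = []
    for i, ch in enumerate(coded):
        if i == 0 or (ch != V and coded[i - 1] != V):
            segments.append([])
        segments[-1].append(ch)
    return segments


coded = encode_linked([["v"], ["c", "v"]])
print(coded)
print(segment_linked(coded))
```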

Since the two models are a priori equally viable, which one gets used
depends on factors such as the precedent set by other encoding
standards (and we do have a third model, the visual model,
because of that, but we ignore it here). Another factor is the way the
users of the script look at it, which is more or less aligned with the
way the letterform for a cluster is built: for example, if we can
systematically recognize in it the standalone letterform of the first
consonant and deformations of the letterforms of the remaining
consonants, this matches well with a subjoined model.

If the two models are equally viable, why do we get so many fights on
which one to use? Surely, the native writers can see through that; yet
no matter how often we repeat that it should not matter, they keep
coming back.

One interpretation is that they can't accept a less-than-perfect match
between the encoding model and the way they look at their script. Take
Myanmar: the second consonant of most clusters is the one that takes a
so-called medial form. They learn those medial forms as separate
entities, and they get surprised when this does not directly translate
into the encoding of medial forms.

This is a plausible interpretation, but I don't think it is enough to
explain the insistence of the users on changing the encoding model. I
personally trust that they are smart enough to realize that "V ya"
vs. "ya_combining" does not make that much of a difference, and that
the damage done by *changing* the existing encoding far outweighs the
gain. There has to be something else to motivate their requests.

In looking at the Myanmar document L2/04-273 (which proposes what
amounts to four subjoined consonants in a virama model), I was really
intrigued by their examples of representation on page 10. Look at the
string of coded characters "70 bytes", i.e. the current Unicode
representation. In two cases, they write the virama coded character as
"V"; in the other cases, they mark it as "". It is the same coded character in all
cases, why do they make a distinction? They could have used "V"
everywhere or "" everywhere, the point they are making does not
truly depend on that distinction. One observation is that they use
"" when the typographic clustering is indicated visually by this
hook, and "V" when the typographic cluster is indicated visually by
modification of the consonants. Maybe they don't think of those two
uses of the virama coded character as equivalent (hence the desire to
distinguish more strongly those uses in the encoding).

In looking at a random piece of text in Myanmar, one has to be struck
by the frequency of the visible virama sign. In Hindi/Devanagari, a
visible virama is fairly rare. In fact, that should be expected: a
visible virama essentially appears when the letterform of a
typographic cluster is at its "lowest" and least desirable form, with
multiple individual consonants in their standalone form, and the
visible virama serves to indicate that they form a single cluster.
One could say that the whole point of typographic clusters is to avoid
those cases, hence their low frequency in practice. (This also explains
why the occasional attempts at reform that result in increasing the
frequency of visible viramas, perhaps driven by technological
limitations, seem to fail so miserably.)

So what about Myanmar? The syllabification of Myanmar is strongly
either (C(C)V) or (C(C)VC); they say that they have open (ending in a
vowel) and closed (ending in a consonant) syllables. The observation
is that the visible virama occurs essentially on the final consonant
of closed syllables. If you take a fictitious word like "pur-ti", its
written form shows those two syllables: the "r" is not allowed to
interact typographically with the following "ti", so you end up with
"..., ...". By contrast, in
Hindi/Devanagari, the same syllabification is not reflected in the
written form, the "r" is allowed to interact with the "ti", as you
would expect from this typographic cluster; and interact it does,
since visually you have "p_uitr". In short, Hindi/Devanagari does not
care to reflect the syllables in the written form, Myanmar does, and
that is achieved by a high use of visible viramas.


Hypothesis: the real typographic clusters of Myanmar are not [c_1,
..., c_n, v], they are the syllables. That is the segment where
typographic interaction occurs. Our encoding of Indic scripts imposes
[c_1, ..., c_n, v] clusters and that is where the discomfort is, not in
how a cluster is encoded (virama or subjoined). All the attempts at
modifying the encoding are first and foremost attempts to make the
real typographic clusters surface, but since the two alternatives for
encoding are fundamentally equivalent, they fail.
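
The difference between the two clusterings can be sketched in a few lines
of Python for the fictitious word "pur-ti". Everything here is an
assumption made for illustration: the word is romanized, the vowel
inventory is reduced to `aeiou`, syllable boundaries are given by the
hyphen rather than computed, and the function names are invented.

```python
from typing import List

VOWELS = set("aeiou")  # assumed romanized vowel inventory


def aksara_clusters(sounds: str) -> List[str]:
    """[c_1 ... c_n v] clustering: a cluster runs up to and including a vowel."""
    clusters: List[str] = []
    current = ""
    for sound in sounds:
        current += sound
        if sound in VOWELS:
            clusters.append(current)
            current = ""
    if current:  # trailing consonants with no vowel form a last cluster
        clusters.append(current)
    return clusters


def syllable_clusters(hyphenated: str) -> List[str]:
    """Syllable clustering: the clusters are the syllables themselves."""
    return hyphenated.split("-")


word = "pur-ti"
print(aksara_clusters(word.replace("-", "")))
print(syllable_clusters(word))
```

Under the first clustering the "r" belongs with the following "ti" and may
interact with it; under the second it closes the first syllable, which is
where the hypothesis places the visible virama in Myanmar.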

I don't have any "proof" of that hypothesis. All I can say is that if
you start with "The authors of L2/04-273 want to tell us that they
have different typographic clusters; forget the virama/subjoined
battle, it is just noise", then their discourse and the progression
of their argument become much more meaningful. If instead you read it
as "The battle is about virama/subjoined", then the document seems
like just another silly attempt.

It is also interesting to remember the Devanagari eyelash ra. From
what we gathered, in Nepali/Devanagari, there is a desire to
graphically convey whether a ra is the final of a syllable or the
initial of the next. The typographic clusters of "pur-ti" are "[pu]
[rti]", while the typographic clusters of "pu-rti" are "[pur]
[ti]". The graphic device used here is different from what Myanmar
does (because the typographic clustering is the *opposite* of the
syllable clustering), but the overall goal is the same: reflect
syllables in the written form, by constraining the typographic
clusters.

One could also wonder whether the Malayalam chillus are not yet
another manifestation of this: a graphic device to reflect in writing
the syllabification by constraining the formation of typographic
clusters.

---

Is this theory correct? If so, how do we use it? As I said, I don't
care to be the person who will identify the precise disconnect with
our users, but I am pretty sure it is more than virama vs. subjoined, and
I think it's important we find it and fix it. That is the point of
this contribution.


Eric.
2005-06-21

Unicode: virama model Vs subjoined model - by Eric Muller
