Lontar

Issues in Khmer syllable validation

Norbert Lindenberg
January 13, 2019

The Unicode Standard and the OpenType documentation don’t agree on the definition of a valid Khmer syllable, and Khmer syllable validation in OpenType shaping engines and in fonts produces inconsistent results.

What’s syllable validation?

In complex writing systems, it’s not always obvious in which order characters that are pronounced together or that form a cluster in rendering should be stored in a Unicode character sequence. Glyphs are often rendered in a different sequence than the corresponding sounds in spoken language are pronounced, and commonly some glyphs are shown above or below other glyphs. However, a defined character sequence is often important for correct processing of the text – sorting strings, searching for specific words, finding line breaks, and rendering the text using fonts.

Both the Unicode Standard and the OpenType script development documents therefore commonly define the structure of a valid syllable or cluster in a script. OpenType shaping engines usually validate incoming text as the first step in transforming a character sequence into a two-dimensional arrangement of glyphs, and insert a dotted circle ◌ whenever they find a character where they don’t expect it. Fonts implemented using Apple Advanced Typography have to implement validation themselves.

Unfortunately the Unicode Standard and the OpenType script development documents don’t always agree with each other on the structure of valid syllables, or even have internal inconsistencies, and shaping engine implementations don’t always follow either of the two documents.

What’s a valid Khmer syllable?

In the case of Khmer [Unicode, pages 637–648], the Unicode Standard, the OpenType Khmer documentation, and the implementations in OpenType Khmer shaping engines and AAT Khmer fonts diverge significantly.

The Unicode Standard [Unicode, page 646] defines a Khmer cluster as

B {R | C} {S {R}}* {{Z} V} {O} {S}

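where B is a base character (a consonant or independent vowel), R is robat, C is a consonant shifter, S is a subscript (coeng) consonant or independent vowel, V is a dependent vowel sign, Z is zero width non-joiner or zero width joiner, and O is any other sign.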

The OpenType Khmer documentation [OpenType] defines “consonant based syllables” as

Cons + {COENG + (Cons | IndV)} + [PreV | BlwV] + [RegShift] + [AbvV] + {AbvS} + [PstV] + [PstS]

where (clarifying the OpenType definitions):

These two definitions of a Khmer cluster already differ in their notation. Here’s an attempt at rewriting the Unicode definition into the syntax used by OpenType, with * indicating an arbitrary number of occurrences:

(Cons | IndV) + [Robat | RegShift] + (COENG + (Cons | IndV) + [Robat])* + [[Z] + (PreV | BlwV | AbvV | PstV)] + [(AbvS | PstS)] + [COENG + (Cons | IndV)].

For comparison, here again is the OpenType definition:

Cons + {COENG + (Cons | IndV)} + [PreV | BlwV] + [RegShift] + [AbvV] + {AbvS} + [PstV] + [PstS]
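The two definitions can be compared mechanically by turning each into a regular expression. The following is a minimal sketch, not an implementation taken from either specification: the character classes are a simplified assignment of Khmer code points assumed for illustration, Z is taken to be zero width non-joiner or zero width joiner, and the OpenType braces are read as "up to two occurrences". The names `UNICODE_CLUSTER`, `OPENTYPE_SYLLABLE`, and `verdicts` are hypothetical.

```python
import re

# Character classes: simplified assumptions of this sketch, not normative.
CONS = "[\u1780-\u17A2]"                    # consonants
INDV = "[\u17A5-\u17B3]"                    # independent vowels
BASE = f"(?:{CONS}|{INDV})"
COENG = "\u17D2"
SUB = f"(?:{COENG}{BASE})"                  # COENG + (Cons | IndV)
ROBAT = "\u17CC"
REGSHIFT = "[\u17C9\u17CA]"                 # muusikatoan, triisap
ZW = "[\u200C\u200D]"                       # assumed meaning of Z
PREV = "[\u17C1-\u17C3]"                    # assumed pre-base vowels
BLWV = "[\u17BB-\u17BD]"                    # assumed below-base vowels
ABVV = "[\u17B7-\u17BA\u17BE]"              # assumed above-base vowels
PSTV = "[\u17B6\u17BF\u17C0\u17C4\u17C5]"   # assumed post-base vowels
ANYV = "[\u17B6-\u17C5]"
ABVS = "[\u17C6\u17CB\u17CD-\u17D1\u17DD]"  # assumed above-base signs
PSTS = "[\u17C7\u17C8]"                     # assumed post-base signs

# Unicode definition, following the rewritten form given above.
UNICODE_CLUSTER = re.compile(
    f"{BASE}(?:{ROBAT}|{REGSHIFT})?(?:{SUB}{ROBAT}?)*"
    f"(?:{ZW}?{ANYV})?(?:{ABVS}|{PSTS})?{SUB}?"
)

# OpenType definition; robat has no slot in it.
OPENTYPE_SYLLABLE = re.compile(
    f"{CONS}{SUB}{{0,2}}(?:{PREV}|{BLWV})?{REGSHIFT}?"
    f"{ABVV}?{ABVS}{{0,2}}{PSTV}?{PSTS}?"
)

def verdicts(text):
    """Report whether the whole string is one valid cluster per definition."""
    return {
        "unicode": UNICODE_CLUSTER.fullmatch(text) is not None,
        "opentype": OPENTYPE_SYLLABLE.fullmatch(text) is not None,
    }

samples = {
    "sa + coeng ta + coeng ro + ii": "\u179F\u17D2\u178F\u17D2\u179A\u17B8",
    "independent vowel + aa": "\u17A7\u17B6",
    "sa + aa + final coeng ngo": "\u179F\u17B6\u17D2\u1784",
    "sa + three coeng ta": "\u179F\u17D2\u178F\u17D2\u178F\u17D2\u178F",
}
for label, text in samples.items():
    print(label, verdicts(text))
```

Under these assumptions only the first sample is accepted by both expressions; the other three are accepted by the Unicode expression and rejected by the OpenType one.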

While the developers of rendering systems and fonts usually look at the Unicode and OpenType specifications, users commonly refer to a tutorial on typing in Khmer [OpenForum]. The descriptions in this document, which only covers modern Khmer, can be summarized as:

(Cons | IndV) + [ ((COENG + (Cons | IndV))* + [RegShift] + [PreV | BlwV | AbvV | PstV] + [AbvS] + [PstS]) | (Robat + [PreV | BlwV | PstV] + [PstS])]
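The summarized tutorial rule can also be written as a regular expression. This is a sketch under assumptions: the code point ranges for the vowel and sign classes are a simplified assignment chosen for illustration, and the names `OPEN_FORUM` and `is_open_forum_syllable` are hypothetical.

```python
import re

# Character classes: simplified assumptions of this sketch.
BASE = "[\u1780-\u17A2\u17A5-\u17B3]"            # Cons | IndV
SUB = f"(?:\u17D2{BASE})"                        # COENG + (Cons | IndV)
ROBAT = "\u17CC"
REGSHIFT = "[\u17C9\u17CA]"
ANYV = "[\u17B6-\u17C5]"                         # PreV | BlwV | AbvV | PstV
NOT_ABVV = "[\u17B6\u17BB-\u17BD\u17BF-\u17C5]"  # assumed PreV | BlwV | PstV
ABVS = "[\u17C6\u17CB\u17CD-\u17D1\u17DD]"
PSTS = "[\u17C7\u17C8]"

# Two alternatives after the base: the coeng/shifter branch or the robat branch.
OPEN_FORUM = re.compile(
    f"{BASE}(?:{SUB}*{REGSHIFT}?{ANYV}?{ABVS}?{PSTS}?"
    f"|{ROBAT}{NOT_ABVV}?{PSTS}?)"
)

def is_open_forum_syllable(text):
    """True if the whole string matches the summarized tutorial rule."""
    return OPEN_FORUM.fullmatch(text) is not None

print(is_open_forum_syllable("\u179F\u17CC\u17BB"))        # sa, robat, u
print(is_open_forum_syllable("\u179F\u17CC\u17B7"))        # sa, robat, i
print(is_open_forum_syllable("\u179F\u17D2\u178F\u17CA"))  # sa, coeng ta, triisap
```

The robat branch deliberately has no slot for coeng, consonant shifters, or above-base vowels and signs, which mirrors the structure of the summary above.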

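The following sections look at the differences between these definitions and test how different implementations handle them. The comparison tables have the following columns, some of them only in Safari: Text, Edge, Firefox 64, Firefox 66, Safari, Busra, and Sangam (see References for the browser and font versions).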

This document does not propose a new specification of Khmer syllables, as there’s currently a project underway at SIL International in Phnom Penh to comprehensively document usage of the Khmer script, which will likely provide a better foundation for such a specification.

Cluster bases

The Unicode syllable definition and the Open Forum tutorial allow both consonants and independent vowels as bases, while the OpenType definition allows only consonants. However, the OpenType documentation later mentions both independent vowels and numbers as beginning syllables. None of the three mentions the dotted circle, which is commonly used on keyboards and in script documentation for showing combining marks.

This table tests the consonant ស, the independent vowel ឧ, the digit ១, and the dotted circle ◌ by attaching the post-base vowel ◌ា, the below-base vowel ◌ុ, the below-base conjunct form of the consonant ត (◌្ត), and the below-base conjunct form of the independent vowel ឫ (◌្ឫ).

Text | Edge | Firefox 64 | Firefox 66 | Safari | Busra | Sangam
សា
សុ
ស្ត
ស្ឫ
ឧា
ឧុ
ឧ្ត
ឧ្ឫ
១ា
១ុ
១្ត
១្ឫ
◌ា
◌ុ
◌្ត
◌្ឫ
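The sixteen strings in this table are the combinations of four bases with four attachments, so they can be generated rather than typed by hand. A small sketch, with the code points written as escapes so that the combining marks stay visible in source code; the names `BASES`, `ATTACHMENTS`, and `test_strings` are hypothetical.

```python
from itertools import product

# Bases: consonant sa, independent vowel qu, digit one, dotted circle.
BASES = ["\u179F", "\u17A7", "\u17E1", "\u25CC"]

# Attachments: vowel aa, vowel u, coeng + ta, coeng + independent vowel ry.
ATTACHMENTS = ["\u17B6", "\u17BB", "\u17D2\u178F", "\u17D2\u17AB"]

# product() varies the attachment fastest, giving one group of four per base.
test_strings = [base + mark for base, mark in product(BASES, ATTACHMENTS)]

for text in test_strings:
    print(text, " ".join(f"U+{ord(ch):04X}" for ch in text))
```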

Robat

The robat ◌៌ is an initial vowel-less consonant r that is written as a mark above the following consonant, which thus becomes the base consonant. While robat phonetically precedes the base consonant, it is encoded after the base.

The Unicode syllable definition allows robat in two places: Right after the base, or after a subjoined consonant or independent vowel. The OpenType documentation neglects to assign this character a spot in the cluster definition, even though it mentions it in the text (some of that text discusses a conversion of a sequence of the consonant ro and Khmer sign coeng to robat, which contradicts the purpose of having a separate robat character and would leave the robat in the wrong place). The Unicode syllable definition indicates that robat cannot be combined with a consonant shifter, but can be combined with vowels and other signs. The Open Forum tutorial states that robat cannot be combined with coeng, consonant shifters, or above-base vowels or signs.

This table tests placement of robat both right after the base and after a subjoined consonant, and combinations with consonant shifters, vowels, and other signs.

Text | Edge | Firefox 64 | Firefox 66 | Safari | Busra | Sangam
ស៌
ស៌ុ
ស៌្ត
ស៌៊
ស៊៌
ស៌ិ
ស៌ំ
ស្ត៌
ស្ត៌ុ
ស្ត៌ិ
ស្ត៌ំ

Coeng

Coeng is the Khmer word for the conjunct forms of consonants and independent vowels. They are often called subjoined, even though a few are rendered partially post-base (e.g., ្យ) and one partially pre-base (្រ).

The Unicode syllable definition allows an arbitrary number of coeng within a syllable, and also allows a final coeng after vowels and signs. OpenType does not allow a final coeng, allows at most two coeng per syllable, and places further restrictions on them: The ones that are entirely below the base must come first, then the (at most) one with an arm to the left side of the base, then the ones with an arm to the right of the base. Note that this order prevents some words, such as លក្ស្មី, from being phonetically and visually correctly represented. The Open Forum tutorial allows an arbitrary number of coeng, but none can follow a pre-base coeng, and there’s no final coeng.
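The OpenType ordering restriction on coeng can be sketched as a rank check. This is an illustration under assumptions: the set of coeng with an arm to the right contains only yo (the example given above) and sa (inferred from the លក្ស្មី example), and is certainly incomplete; the names `coeng_rank` and `opentype_coeng_order_ok` are hypothetical.

```python
COENG = "\u17D2"
RO = "\u179A"                      # coeng ro has an arm to the left of the base
RIGHT_ARM = {"\u1799", "\u179F"}   # yo, sa: assumed, incomplete set

def coeng_rank(letter):
    """0 = entirely below the base, 1 = arm to the left, 2 = arm to the right."""
    if letter == RO:
        return 1
    if letter in RIGHT_ARM:
        return 2
    return 0

def opentype_coeng_order_ok(cluster):
    """Check the coeng count and order described for OpenType above."""
    # Collect the letter that follows each coeng sign.
    subjoined = [cluster[i + 1] for i in range(len(cluster) - 1)
                 if cluster[i] == COENG]
    ranks = [coeng_rank(letter) for letter in subjoined]
    if len(ranks) > 2:             # at most two coeng per syllable
        return False
    if ranks.count(1) > 1:         # at most one coeng with a left arm
        return False
    return ranks == sorted(ranks)  # below first, then left arm, then right arm

print(opentype_coeng_order_ok("\u179F\u17D2\u179A\u17D2\u1799"))  # coeng ro, coeng yo
print(opentype_coeng_order_ok("\u179F\u17D2\u1799\u17D2\u179A"))  # coeng yo, coeng ro
print(opentype_coeng_order_ok("\u1780\u17D2\u179F\u17D2\u1798"))  # coeng sa, coeng mo
```

The last sample is the cluster from លក្ស្មី; under the assumed classification it is rejected because coeng sa ranks after coeng mo.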

This table tests use of more than two coeng, use of a final coeng, and the additional restrictions imposed by OpenType.

Text | Edge | Firefox 64 | Firefox 66 | Safari | Busra | Sangam
ស្ត្រ
ស្ត្ត្ត
ស្រ្យ
ស្យ្រ
ស្ត្រ្យ
ទាំ្ង
សិ្ង
សា្ង
សា្ងះ
សាះ្ង
ក្ស្ម
ស្ត្រ
ស្រ្ត
ស្រ្រ
ស្យ្យ
ស្យ្យ្យ
ស្ត្ត្រ្យ្យ

Consonant shifters

Khmer consonants are divided into two series, or registers, which influence the pronunciation of the following (inherent or written) vowel. The consonant shifters muusikatoan ◌៉ and triisap ◌៊ move certain consonants from one series to another, thus changing the pronunciation of the vowel (in one case, a consonant shifter can also change the pronunciation of a consonant itself).

The Unicode syllable definition places consonant shifters right after the base, while OpenType (which also calls them RegShift) places them after coeng and pre-base and below-base vowels. Both describe that consonant shifters move below the base when followed by certain vowels or signs, and how the character zero width non-joiner can be used to prevent this move. Both are inconsistent in where this character should be placed, before the consonant shifter or between the consonant shifter and the vowel. The Open Forum tutorial places consonant shifters after coeng, but before any vowels, and doesn’t mention zero width non-joiner.

This table tests placement of consonant shifters both right after the base and after coeng and vowels, and various placements of zero width non-joiner.

Text | Edge | Firefox 64 | Firefox 66 | Safari | Busra | Sangam
ស៊េ
សេ៊
ស៊ុ
សុ៊
ស៊្ហ
ស្ហ៊
ស៊្ហិ
ស្ហ៊ិ
ស៊្ហ៊ិ
ស‌៊ិ
ស៊‌ិ
ស‌៊្ហិ
ស៊‌្ហិ
ស៊្ហ‌ិ
ស្ហ‌៊ិ
ស្ហ៊‌ិ
ស្ហុ៊ំ
ស្ហុ‌៊ំ
ស្ហុ៊‌ំ

Vowels and signs

Unicode, OpenType, and Open Forum agree on allowing only one dependent vowel per syllable. Apart from robat and consonant shifters, Unicode allows only one sign, which must be placed after any vowels, while OpenType allows up to two above-base signs placed after any above-base, but before any post-base vowel, and one post-base sign at the end of the syllable. The Open Forum tutorial allows one above-base and one post-base sign at the end of the syllable.
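The three rules for vowels and signs can be compared on the part of a syllable that follows a bare base. The sketch below ignores coeng, robat, and consonant shifters, and its vowel and sign classes are a simplified assignment of code points assumed for illustration; `TAILS` and `tail_verdicts` are hypothetical names.

```python
import re

# Character classes: simplified assumptions of this sketch.
PREV = "[\u17C1-\u17C3]"
BLWV = "[\u17BB-\u17BD]"
ABVV = "[\u17B7-\u17BA\u17BE]"
PSTV = "[\u17B6\u17BF\u17C0\u17C4\u17C5]"
ANYV = "[\u17B6-\u17C5]"
ABVS = "[\u17C6\u17CB\u17CD-\u17D1\u17DD]"
PSTS = "[\u17C7\u17C8]"

TAILS = {
    # One vowel, then at most one sign.
    "unicode": re.compile(f"{ANYV}?(?:{ABVS}|{PSTS})?"),
    # Up to two above-base signs before the post-base vowel, post-base sign last.
    "opentype": re.compile(
        f"(?:{PREV}|{BLWV})?{ABVV}?{ABVS}{{0,2}}{PSTV}?{PSTS}?"
    ),
    # One vowel, then one above-base and one post-base sign.
    "openforum": re.compile(f"{ANYV}?{ABVS}?{PSTS}?"),
}

def tail_verdicts(syllable):
    """Validate everything after the first character (assumed to be the base)."""
    tail = syllable[1:]
    return {name: rx.fullmatch(tail) is not None for name, rx in TAILS.items()}

print(tail_verdicts("\u179F\u17B6\u17C6"))        # sa, aa, nikahit
print(tail_verdicts("\u179F\u17C6\u17B6"))        # sa, nikahit, aa
print(tail_verdicts("\u179F\u17B6\u17C6\u17C7"))  # sa, aa, nikahit, reahmuk
```

Under these assumptions the order aa + nikahit is accepted by the Unicode and Open Forum rules but not by the OpenType rule, and the reverse order is accepted only by the OpenType rule.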

Implementations show significant variation; in particular, placing the above-base sign  ំ after the post-base vowel  ា is more widely supported than the reverse order.

Text | Edge | Firefox 64 | Firefox 66 | Safari | Busra | Sangam
សិ
សុ
សុិ
សិំ
សំិ
សំា
សាំ
ស៎ា
សា៎
សុំា
សុាំ
សើ
សេិ
សោ
សេា
សាះ
សះា
សោះ
សុំះ
សាំះ
សំាះ
សិិ
សុុ
ស្្
សំំ
ស៊៊

Confusables

This table includes the strings from HarfBuzz bug 667. They are all different character sequences, but can render the same if validation is too loose and the font doesn’t stack duplicate marks.
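Because such strings can look identical on screen, it helps to inspect their code points. The sketch below lists the characters of three of the strings in this table and checks whether NFC normalization would unify them; the helper name `describe` is hypothetical, and the normalization check is an addition for illustration, not part of the bug report.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

samples = [
    "\u179F\u17CA\u17BE\u17BB",        # sa, triisap, oe, u
    "\u179F\u17CA\u17C1\u17B8\u17BB",  # sa, triisap, e, ii, u
    "\u179F\u17CA\u17B8\u17C1\u17BB",  # sa, triisap, ii, e, u
]

for text in samples:
    print(text)
    for line in describe(text):
        print("   ", line)

# Count how many distinct strings remain after NFC normalization.
normalized = {unicodedata.normalize("NFC", text) for text in samples}
print(len(normalized), "distinct strings after NFC")
```

If the count equals the number of samples, normalization alone cannot be relied on to merge these sequences, and any unification has to come from syllable validation or from the font.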

Text | Edge | Firefox 64 | Firefox 66 | Safari | Busra | Sangam
ស៊ើ
ស៉ើុ
ស៉េីុ
ស៉ីេុ
ស៉ើុុ
ស៉េីុុ
ស៉ីេុុ
ស៊ើុ
ស៊េីុ
ស៊ីេុ
ស៊ើុុ
ស៊េីុុ
ស៊ីេុុ
ស៉ើី
ស៉ីើ
ស៉ើីុ
ស៉ីើុ
ស៊ើី
ស៊ីើ
ស៊ើីុ
ស៊ីើុ
សើីុ
សើុី
សីើុ
សីុើ

Marks without bases

Marks without bases are not valid syllables, and OpenType recommends inserting dotted circles to indicate that. The implementations tested here get that right most of the time. One difference is whether sequences of two marks get one or two dotted circles – it seems one should be enough.
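The insertion of a single dotted circle per run of orphaned marks can be sketched as follows. This is a deliberately loose illustration, not how any of the tested shaping engines works: it accepts any order of marks after a base and only detects a missing base. The character ranges and the name `insert_dotted_circles` are assumptions of this sketch.

```python
import re

DOTTED_CIRCLE = "\u25CC"

# Simplified assumptions: bases are consonants and independent vowels;
# marks are coeng + base pairs, dependent vowels, and signs.
BASE = "[\u1780-\u17A2\u17A5-\u17B3]"
MARKS = "(?:\u17D2" + BASE + "|[\u17B6-\u17D1\u17DD])+"

TOKEN = re.compile(
    f"(?P<cluster>{BASE}(?:{MARKS})?)|(?P<orphan>{MARKS})|(?P<other>.)",
    re.DOTALL,
)

def insert_dotted_circles(text):
    """Prefix each run of marks that lacks a base with one dotted circle."""
    out = []
    for match in TOKEN.finditer(text):
        if match.lastgroup == "orphan":
            out.append(DOTTED_CIRCLE)
        out.append(match.group())
    return "".join(out)

fixed = insert_dotted_circles("\u17BB\u17C6")  # vowel u + nikahit, no base
print([f"U+{ord(ch):04X}" for ch in fixed])
```

Treating the whole run of marks as one unit is what yields one dotted circle for a sequence of two marks rather than two.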

Text | Edge | Firefox 64 | Firefox 66 | Safari | Busra | Sangam
្ត
្រ
្យ
ុំ◌◌◌◌◌◌◌◌
ាំ◌◌◌◌◌◌◌◌
៊ំ◌◌✓◌◌◌◌◌◌◌

References

Busra: D. Kanjahn: Khmer Busra. Font version 7. Part of The Mondulkiri Font Family. SIL International, 2014.

Edge: Microsoft Edge. Browser version 42.17134.1.0. Included in Microsoft Windows 10 version 1803. Microsoft, 2018.

Firefox 64: Firefox. Browser version 64.0.2. Mozilla, 2018.

Firefox 66: Firefox. Browser version 66.0a1 (2019-01-10). Mozilla, 2019.

Kanjahn: Didi Kanjahn: The orthographical syllable in Khmer and rules for the rendering of register shifters. SIL International, 2012.

Noto: Danh Hong and the Monotype Design Team: Noto Sans Khmer Regular. Font version 2.001. Google, 2016.

OpenForum: How to Type Khmer Unicode. Open Forum of Cambodia, 2004-11-18.

OpenType: Developing OpenType Fonts for Khmer Script. Microsoft, dated 02/07/2018, accessed 2019-01-10.

Safari: Safari. Browser version 12.0.2. Included in macOS 10.14.2. Apple, 2018.

Sangam: Muthu Nedumaran: Khmer Sangam MN. Font version 14.0d1e8. Included in macOS 10.14.2. Apple, 2018.

Unicode: The Unicode Consortium: The Unicode Standard, Version 11.0. The Unicode Consortium, 2018. For Khmer, in particular section 16.4 Khmer, pages 637–648.