Order and disorder in Unicode
Norbert Lindenberg
May 28, 2025
The Unicode Standard lacks well-defined encoding orders for the orthographic syllables of Brahmic scripts. This creates problems such as missing search results, incorrectly rendered text, and security holes. This article discusses the causes of these problems and various localized attempts at solving them. It proposes a new form of normalization as a more generic solution.
This article requires web fonts to be rendered correctly. Please read it in a browser and mode that supports web fonts (“reader” views don’t).
Contents
- Disorder
- Defining encoding order in the Unicode Standard
- Defining encoding order outside of the Unicode Standard
- The dotted circle problem
- A new type of normalization?
- Acknowledgments
- References
Using copyrighted material without license to create AI systems is theft.
Disorder
Equal inputs, different outputs
Let’s start with a little experiment: Using your favorite browser, go to your favorite search engine, and search for the three strings ស្ត្រី, ស្រ្តី, and ស្រី្ត, one after another.
As of May 2025, these three strings look the same in all major browsers (showing the word “woman” in Khmer), but produce very different search results in all major search engines. Here are the first pages of results I got from Google:
This is pretty disturbing. How can strings that look identical produce such different results? To understand this, we need to look at one aspect of the Unicode Standard that’s poorly understood, poorly specified, and poorly implemented: The order of characters within Unicode-encoded strings for Brahmic scripts, or encoding order. The three strings use the same Unicode characters, but in different sequences:
| Encoding order | Rendering |
|---|---|
| ស ◌្ត ◌្រ ◌ី | ស្ត្រី |
| ស ◌្រ ◌្ត ◌ី | ស្រ្តី |
| ស ◌្រ ◌ី ◌្ត | ស្រី្ត |
Each of these encoding orders conforms to the grammar for Khmer orthographic syllables given in the Unicode Standard. However, the standard says nothing about whether they should be treated as equivalent in search or whether fonts and font rendering systems should render them in a way that makes them distinguishable.
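A quick check with Python’s unicodedata module confirms this: the three strings are permutations of the same six code points, and Unicode normalization does not unify them (the escape sequences below spell out the encoding orders from the table).

```python
import unicodedata

s1 = "\u179F\u17D2\u178F\u17D2\u179A\u17B8"  # ស ◌្ត ◌្រ ◌ី
s2 = "\u179F\u17D2\u179A\u17D2\u178F\u17B8"  # ស ◌្រ ◌្ត ◌ី
s3 = "\u179F\u17D2\u179A\u17B8\u17D2\u178F"  # ស ◌្រ ◌ី ◌្ត

# Same multiset of code points, different order:
assert sorted(s1) == sorted(s2) == sorted(s3)

# NFC does not treat them as equivalent – all three stay distinct:
assert len({unicodedata.normalize("NFC", s) for s in (s1, s2, s3)}) == 3
```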
It’s worth noting that the four characters in ស្ត្រី each denote a separate sound. Transliterated into Latin, but arranging the characters into two-dimensional orthographic syllables the way Khmer does, the table above would look like this:
| Encoding order | Rendering |
|---|---|
| S t r ī | (image) |
| S r t ī | (image) |
| S r ī t | (image) |
Looking good here; not there
Now take a look at these two strings: ① மொழி and ② மாெழி. As of May 2025, browsers render them (the word “language” or “word” in Tamil) in different ways:


Safari on the Mac, as well as any browser on iPhone or iPad, renders string ② with a dotted circle and with the vowel ◌ெ in the wrong place. Other browsers render it without that dotted circle, exactly like string ①. String ① is rendered without a dotted circle everywhere.
Dotted circles are inserted by font rendering systems to indicate that a character sequence is invalid. Here Apple’s font rendering system, CoreText, considers string ② invalid, while the font rendering system used by other browsers, HarfBuzz, considers it valid. Such differences between font rendering systems are quite common, as I’ve documented for Khmer and Devanagari.
The difference between strings ① and ② is the encoding order:
| String | Encoding order | Rendering (CoreText) | Rendering (HarfBuzz) |
|---|---|---|---|
| ① | ம ◌ெ ◌ா ழ ◌ி | (image) | (image) |
| ② | ம ◌ா ◌ெ ழ ◌ி | (image) | (image) |
Is the second encoding order invalid, as the CoreText rendering indicates, or valid, as the HarfBuzz rendering implies? The Tamil section of the Unicode Standard shows several examples with ◌ெ ◌ா, and none with ◌ா ◌ெ, but it doesn’t explicitly say that the latter is not allowed. This is a problem because there are Tamil keyboards, mainly on non-Apple devices, that let users enter the second version, which then renders as broken when sent to Apple devices.
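Unicode normalization is no help here: both Tamil vowel signs have Canonical_Combining_Class 0, so the canonical ordering algorithm never swaps them, and the two encodings remain distinct. A quick check with Python’s unicodedata module:

```python
import unicodedata

s1 = "\u0BAE\u0BC6\u0BBE\u0BB4\u0BBF"  # ① ம ◌ெ ◌ா ழ ◌ி
s2 = "\u0BAE\u0BBE\u0BC6\u0BB4\u0BBF"  # ② ம ◌ா ◌ெ ழ ◌ி

# Both vowel signs have ccc = 0, so canonical ordering leaves them alone:
assert unicodedata.combining("\u0BC6") == unicodedata.combining("\u0BBE") == 0

# Consequently the two strings are not canonically equivalent:
assert unicodedata.normalize("NFC", s1) != unicodedata.normalize("NFC", s2)
```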
From dumb to undirected smart
Keyboards used to be dumb mappings from keys to characters, but today can be much smarter. Ensuring a correct encoding order could be one aspect of their smartness, but there’s no standard giving this smartness direction.
In the 1990s, keyboards simply mapped keys on a physical keyboard to characters in the character encoding used by the computer. Initially one keystroke mapped to one character. Later, the ability to merge a dead key with a subsequent character was added, such as ◌̀ + e → è, and some keys could generate two characters. The context in which text was input never played a role. Getting text right was entirely up to the user.
Separately, input methods (or “input method editors”) were developed, the complicated software packages that enable text input for writing systems with very large character sets, especially Chinese and Japanese. Input methods analyze series of keystrokes, look at the surrounding text in the document being written, and produce a list of candidate character sequences from which the user can pick the right one. Eventually they even predicted what the user might want to write. Operating systems provide special APIs through which input methods and applications can communicate with each other to produce optimal results.
This separation has gone away. First, some developers realized that they could use APIs designed for input methods to enable smart input for scripts other than Chinese, Japanese, or Korean. API functions that let an input method look at the surrounding text in the document being written proved particularly useful. Smart keyboards for Brahmic scripts can thus reorder the characters within an orthographic syllable, replace character sequences that the Unicode Standard says should not be used with their recommended alternatives, and correct text in other ways. Then the Keyman keyboard framework enabled the use of these facilities in a platform-independent way – instead of dealing with input method APIs, keyboard developers could just describe which transformations should be applied to input text. Finally, when text input APIs were added to Android and iOS, they provided the capabilities of input method APIs, even though the API on iOS uses the word “keyboard” throughout. Keyboards provided by Lontar GmbH use these capabilities to reorder and otherwise correct input text.
For at least a decade – the iOS keyboard extension API was released in 2014 – the technology has thus existed to let smart keyboards ensure that text is input as valid character sequences. What’s been missing is direction from a standard defining what “valid” really means. This lack of direction means that many keyboards today remain dumb and don’t correct the encoding order. It even affects the work of other parts of the Unicode Consortium itself: The technical committee for the Common Locale Data Repository provides a standard for describing keyboards in the Keyboards part of the Locale Data Markup Language. This standard, among other things, enables the specification of transformations and reordering of keyboard input – but has nothing to point to that would define expected outcomes.
Lack of interoperability
It’s important to understand that the Unicode Standard is the foundation for many different processes:
- Keyboards let the user enter text into documents. Designers of keyboards and their input transformations need to know which character sequences are considered valid with respect to their encoding order (or “well-formed”) so that they can enable valid input and avoid invalid input.
- Spelling checkers implement a different level of correctness, deciding which words are valid within a language, and how invalid ones might be mapped to the intended valid ones. However, besides knowing that ស្ត្រី is a valid Khmer word, they also need to know which is the valid Unicode character sequence for it.
- Fonts and font rendering systems need to make sure that different character sequences look different. They commonly validate text by inserting dotted circles into sequences they consider invalid (in the case of OpenType, this is normally done by the rendering system; in the case of Graphite or Apple Advanced Typography, by the font). After this is done, fonts can rely on input being well-formed.
- Search algorithms need to be able to rely on text being well-formed, need to know which character sequences can be treated as equivalent, and need to be able to rely on non-equivalent text to actually mean something different to users.
- Optical character recognition systems that convert images of text into digital text need to be able to create well-formed Unicode character sequences.
- Speech input systems that convert speech audio recordings into digital text need to be able to create well-formed Unicode character sequences.
- Text-to-speech systems that convert digital text into speech audio output need to be able to rely on well-formed input.
- Text normalization tools might convert text that is not well-formed into well-formed text.
- Generative artificial “intelligence” systems that analyze existing texts, generate models from them, and use the models to generate derivative texts should create well-formed output.
- Other text processing systems that need to rely on and produce consistently encoded text.
A clear specification of the encoding order is necessary to enable interoperability between all these different processes. In short, text producers need to know what’s valid output, and text consumers need to know what’s valid input, and what to do with invalid input. What we saw in the experiments above were breakdowns in interoperability: In the first case, keyboards let users enter text that fonts render the same way but that search engines treat as different. In the second case, keyboards let users enter text that can be rendered differently in different environments.
Insecure domain names
To see another major problem arising from the lack of a well-specified encoding order, open the following three links: ស្ត្រី.com, ស្រ្តី.com, and ស្រី្ត.com.
You’ll see that these links lead to three different web sites. Now check the URL field of the browser. In this case, the results differ between browsers. Some browsers show the original Khmer word in the URL field, and there’s no visible difference between the URLs. Other browsers show the ASCII representations of the domain names, the Punycode, which all browsers use internally. Using Punycode, the URLs are clearly different:
| Encoding order | Domain name with Punycode | Domain name |
|---|---|---|
| ស ◌្ត ◌្រ ◌ី | xn--x2ewn6hrfb.com | ស្ត្រី.com |
| ស ◌្រ ◌្ត ◌ី | xn--x2evo6hrfb.com | ស្រ្តី.com |
| ស ◌្រ ◌ី ◌្ត | xn--x2evo5hsfc.com | ស្រី្ត.com |
In this case, I designed the three web sites to look clearly different. The risk exists, however, that criminals might design alternate web sites that spoof existing web sites in order to obtain users’ passwords or other information.
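The divergence is mechanical: Punycode (RFC 3492) encodes code point positions, so any reordering produces a different ASCII label. A sketch using Python’s built-in punycode codec (which implements the raw RFC 3492 transform, without the IDNA preprocessing steps a browser performs):

```python
s1 = "\u179F\u17D2\u178F\u17D2\u179A\u17B8"  # ស្ត្រី
s2 = "\u179F\u17D2\u179A\u17D2\u178F\u17B8"  # ស្រ្តី
s3 = "\u179F\u17D2\u179A\u17B8\u17D2\u178F"  # ស្រី្ត

labels = {"xn--" + s.encode("punycode").decode("ascii") for s in (s1, s2, s3)}

# Punycode is injective, so the three identical-looking words yield
# three different domain name labels:
assert len(labels) == 3
```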
Insecure source code
A similar problem may occur in programming languages, even in relatively safe ones such as Swift:
```swift
let ស្ត្រី = true
func validate() -> Bool {
    let ស្រ្តី = false
    print(ស្រ្តី)
    return ស្ត្រី
}
print(validate())
```
Code reviewers might think that this prints “false” twice, because the validate function first prints the value of a variable, then returns the value of that variable, which the caller prints again. That’s not what happens; the output is first “false”, then “true”. The ស្ត្រី in the return statement of validate is not the one otherwise used in the function; instead it refers to the constant defined outside the function. Here’s what the compiler sees, with the two different constants distinguished:
```swift
let ស្ត្រី = true          // ស ◌្ត ◌្រ ◌ី — outer constant
func validate() -> Bool {
    let ស្រ្តី = false     // ស ◌្រ ◌្ត ◌ី — local constant
    print(ស្រ្តី)          // local constant: prints "false"
    return ស្ត្រី          // outer constant: returns true
}
print(validate())          // prints "true"
```
When different representations of code in memory have the same appearance on screen, and this leads to different interpretations of the code by compilers and human readers, it’s called source code spoofing. It can lead to inadvertent errors, but can also enable bad actors to introduce malicious code despite code reviews.
Defining encoding order in the Unicode Standard
Encoding order is the order in which the Unicode characters representing a piece of text are stored in a computer’s memory. (The Unicode Standard uses the term “logical order”, but there’s nothing particularly logical about it.) Unicode text represents writing, which in turn represents speech. The order in which the spacing characters are visually arranged in writing is called “visual order”, the order in which sounds of the language are pronounced “phonetic order”. In simple cases, such as English written in the Latin script, the characters are written in the order the sounds they represent are spoken. In Brahmic scripts, things are usually not so simple. Let’s look at the simple case first.
For simple scripts: Unicode normalization
If a writing system represents text using only spacing characters and by arranging those characters visually in the order in which they’re pronounced, always moving in the main writing direction, things are easy: Phonetic order and visual order are the same, so the encoding order can simply match them. Indonesian written with Latin characters, Chinese written with Han ideographs, and Bantawa written with the simplified Brahmic script Kirat Rai are such writing systems.
In Unicode, however, many of the characters that users see as one entity can be, and sometimes must be, encoded as sequences of Unicode code points. For example, the character l̥̄, which is used in the ISO 15919 transliteration of several Brahmic scripts, consists, as far as Unicode is concerned, of three parts: a base character l, a nonspacing mark ◌̄, and a nonspacing mark ◌̥. Since they’re arranged perpendicular to the writing direction, visual order doesn’t define their encoding order. The Unicode Standard solves this with two rules: First, nonspacing marks are encoded after the base they’re attached to. Second, the order of nonspacing marks that don’t interact typographically, such as in this case one mark above and one below the base, doesn’t matter. The sequence l ◌̄ ◌̥ is “canonically equivalent” to l ◌̥ ◌̄ and should be treated as equal nearly everywhere. Marks that interact typographically because they attach on the same side are ordered outwards from the base, so their order does matter.
The “doesn’t matter” rule is implemented through Unicode normalization, which maps equivalent strings to one preferred form. As part of the Unicode data, nonspacing marks are assigned values of the Unicode property Canonical_Combining_Class (ccc) that reflect their position relative to the base. The value for below-base marks such as ◌̥ is 220, and the value for above-base marks such as ◌̄ is 230. Bases get the value 0. The canonical ordering algorithm, which is part of Unicode normalization, then reorders marks that attach to the same base into a canonical order with ascending ccc values. Since 220 is less than 230, the canonical order for l̥̄ is l ◌̥ ◌̄.
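For this simple case the machinery works as advertised; a sketch with Python’s unicodedata module:

```python
import unicodedata

# ccc values: ring below (U+0325) is 220, macron (U+0304) is 230.
assert unicodedata.combining("\u0325") == 220
assert unicodedata.combining("\u0304") == 230

# Canonical ordering sorts marks on the same base into ascending ccc order:
assert unicodedata.normalize("NFD", "l\u0304\u0325") == "l\u0325\u0304"
```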
Another part of Unicode normalization deals with the precomposed characters that Unicode had to include for compatibility with earlier character encodings or because users saw them as entities that must be encoded atomically. For example, the character ệ, used in Vietnamese, can be represented by the single Unicode code point ệ, or by the sequence ẹ ◌̂, or ê ◌̣, or e ◌̂ ◌̣, or e ◌̣ ◌̂. These should all be treated as equal. The canonical decomposition algorithm maps any of these to the last one (labeled “normalization form D”), the canonical composition algorithm can then map back to the first one (labeled “normalization form C”).
| Encoding order | Rendering | Normalization form |
|---|---|---|
| l ◌̄ ◌̥ | l̥̄ | — |
| l ◌̥ ◌̄ | l̥̄ | NFC, NFD |
| ệ | ệ | NFC |
| ẹ ◌̂ | ệ | — |
| ê ◌̣ | ệ | — |
| e ◌̂ ◌̣ | ệ | — |
| e ◌̣ ◌̂ | ệ | NFD |
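The table can be reproduced with Python’s unicodedata module:

```python
import unicodedata

forms = [
    "\u1EC7",         # ệ precomposed
    "\u1EB9\u0302",   # ẹ + circumflex
    "\u00EA\u0323",   # ê + dot below
    "e\u0302\u0323",  # e + circumflex + dot below
    "e\u0323\u0302",  # e + dot below + circumflex
]

# All five encodings are canonically equivalent; NFC maps them all to the
# precomposed character, NFD to the fully decomposed, canonically ordered one:
assert {unicodedata.normalize("NFC", f) for f in forms} == {"\u1EC7"}
assert {unicodedata.normalize("NFD", f) for f in forms} == {"e\u0323\u0302"}
```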
Brahmic scripts: not so simple
Few Brahmic scripts are that simple. Text in most Brahmic scripts has to be seen as a sequence of orthographic syllables. These orthographic syllables are two-dimensional arrangements of marks surrounding a central base glyph, which may be a consonant or independent vowel or a ligature involving at least one of these. Our earlier Khmer example ស្ត្រី consists of one such orthographic syllable; the Tamil example மொழி of two, மொ and ழி. While the complete orthographic syllables are arranged continuously in writing direction (usually left-to-right for Brahmic scripts, but top-to-bottom for Phags-pa), the marks within an orthographic syllable are often not arranged that way.
Many Brahmic scripts have features where the visual order clearly differs from the phonetic order:
- Vowels that are pronounced after a consonant may be written to the left of it. For example, in the Hindi name for the language itself, हिन्दी hindī, the short vowel ◌ि -i is pronounced after the consonant ह h-, but written to the left of it.
- Some syllable-initial consonants may be written as nonspacing marks above the second consonant. Most commonly this happens with the consonant r-, in which case this representation is called repha. For example, in the Sanskrit word सूर्य sūrya, the consonant र्◌ r- is pronounced before the consonant य ya, but written on top of it. In Unicode, nonspacing marks are normally encoded after the base character they attach to, but repha are pronounced before the base consonant.
- Syllable-final consonants may be written above the syllable-initial consonant despite intervening spacing vowels. For example, in the Balinese syllable ᬩᭀᬃ bor, the consonant ◌ᬃ -r is pronounced last, but it is written on top of the consonant ᬩ ba and therefore to the left of ◌ᬵ, which is part of the split vowel ◌ᭀ o.
- Medial consonants may be written to the left of the syllable-initial consonant, even though they’re pronounced after them. We’ve seen this in the earlier Khmer example ស្ត្រី strī: ◌្រ -r- is pronounced after the initial ស s- (and after ◌្ត -t-), but written to its left.
In addition, canonical ordering is fundamentally incompatible with several characteristics of the encoding of many Brahmic scripts:
- Many Brahmic scripts use nonspacing reduced forms of consonants, or sometimes of independent vowels, to indicate that the previous consonant has lost its inherent vowel. In our earlier example ស្ត្រី, ◌្ត is the reduced form of ត ta. In Unicode these reduced forms are usually encoded as multi-character sequences. ◌្ត is in fact encoded as the sequence <U+17D2 ◌្ KHMER SIGN COENG, U+178F ត KHMER LETTER TA>. Canonical_Combining_Class values can only be assigned to individual Unicode characters, not to character sequences.
- Several scripts have dependent vowels that have both a spacing and a nonspacing component, and do not have a canonical decomposition. Examples are U+17BE ◌ើ KHMER VOWEL SIGN OE or U+1925 ◌ᤥ LIMBU VOWEL SIGN OO. The nonspacing component may typographically interact with other marks attached to the same base and would need a non-zero ccc value describing its position relative to the base, while the spacing component requires ccc=0. A few dependent vowels, such as U+1B3C ◌ᬼ BALINESE VOWEL SIGN LA LENGA, even have two nonspacing components in different positions, which would require two different non-zero ccc values. A Unicode character can have only one ccc value.
- Some characters, such as U+102F ◌ု MYANMAR VOWEL SIGN U, can change from nonspacing to spacing. As a nonspacing mark, it would need ccc=220; as a spacing character, ccc=0.
- The Khmer consonant shifters U+17C9 ◌៉ KHMER SIGN MUUSIKATOAN and U+17CA ◌៊ KHMER SIGN TRIISAP can change from above-base to below-base marks; one position would require ccc=230, the other ccc=220.
- The characters U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER are used within sequences of nonspacing marks to influence rendering; their ccc=0 assignments prevent canonical ordering around them.
Given these issues, and also given precedents set by earlier character encodings, two separate models have evolved for defining the encoding orders of Brahmic scripts. For a few Brahmic scripts the encoding order is primarily based on the visual order; for all others primarily on the phonetic order. In neither case, however, does the underlying visual or phonetic order fully determine the encoding order – a more precise specification is required.
Using visual order
Visual order is used for four Brahmic scripts: Lao, New Tai Lue, Tai Viet, and Thai.
When basing the encoding order on the visual order, spacing characters are stored in the order in which they occur in writing direction. For example, the Thai word for “Thai” is ไทย. Using visual order, the characters are encoded from left to right, ไ ท ย, even though ไ ai is pronounced after ท th.
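A minimal check of the stored code points (using Python; the values can be verified against the Unicode code charts):

```python
# ไทย is stored in visual order: the preposed vowel ไ comes first in memory,
# even though it is pronounced after ท.
thai = "ไทย"
assert [f"U+{ord(c):04X}" for c in thai] == ["U+0E44", "U+0E17", "U+0E22"]
```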
The issue then is what to do with nonspacing marks, which Lao, Tai Viet, and Thai have. As we’ve seen above, for simple scripts this is handled using Unicode normalization. Tai Viet is simple enough, and its ccc values are assigned such that this can work.
For Thai and Lao, encoding decisions were made that break normalization. First, both scripts encode am vowels, Thai ◌ำ and Lao ◌ຳ, that represent combinations of a nonspacing and a spacing mark, without canonical decompositions – one of the fundamentally incompatible features discussed above. Second, ccc values were assigned in a number of incompatible ways. Most nonspacing marks in these scripts received “fixed position” ccc values whose existence the standard describes as a historical artifact of an earlier stage in its development. The virama characters received the ccc value 9 that’s used for most such characters across Unicode. Both assignments interact badly with the ccc values of script-independent marks that are also used in Thai. Some other marks received ccc=0, which prevents reordering. The result is that canonical ordering in some cases reorders characters such that syllables are rendered incorrectly, and doesn’t always treat syllables that look the same as equivalent.
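Both problems are visible in the character data; a sketch with Python’s unicodedata module:

```python
import unicodedata

# Thai tone marks carry the "fixed position" class 107 instead of a
# position-based value such as 230:
assert unicodedata.combining("\u0E48") == 107  # THAI CHARACTER MAI EK

# THAI CHARACTER SARA AM has only a compatibility decomposition, so
# canonical normalization (NFC/NFD) leaves it as a single code point:
assert unicodedata.decomposition("\u0E33").startswith("<compat>")
assert unicodedata.normalize("NFC", "\u0E33") == "\u0E33"
```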
The sections of the Unicode Standard on both Thai and Lao discuss some workarounds; another proposal with more workarounds should be integrated into the standard eventually. In the meantime, the leading OpenType rendering systems have given up on validating Thai text, and Thai font developers have added ad-hoc normalization routines into their fonts to achieve the rendering users expect.
Unfortunately, the incorrect ccc values cannot be fixed because of the Unicode policy of Normalization Stability.
Using phonetic order
When basing the encoding order on phonetic order, consonants and vowels are stored in the order they are pronounced. Virama marks that indicate the absence of inherent vowels of consonants are placed immediately after those consonants. In our initial Khmer example ស្ត្រី strī, the correct encoding order based on pronunciation is ស ◌្ត ◌្រ ◌ី representing s- -t- -r- -ī.
This general idea leaves a number of issues unspecified.
First, as we saw earlier, it’s not clear in which order the components of multi-part vowels, such as the Tamil vowel ◌ொ o, should be stored.
Second, it doesn’t clarify what to do with other characters whose position in speech is not so well-defined, such as nuktas, tone marks, consonant shifters, syllable modifiers, or ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
Here the Unicode Standard fails almost entirely. Only for a few scripts does it try to fully specify the orderings of syllable components, and only for Cham and Lepcha do they look complete and reasonable. For Khmer, it provides a regular expression that contradicts other parts of the same section, and has various other flaws. For Myanmar, it covers only modern Burmese and then refers to a comprehensive Unicode Technical Note on Myanmar, which however is no longer up to date, and to Microsoft documentation that partially conflicts with the UTN and is no longer up to date either. For Sundanese, information about Old Sundanese is missing.
Third, there can be cases where two or more strings look the same that are pronounced differently, and therefore would be encoded differently when using a purely phonetic order. For such cases, it needs to be decided whether only a single encoding is valid, or whether a disambiguating rendering is required, or whether search operations should treat the different sequences as equal. Sometimes language-specific requirements come into play. Our initial Khmer example is one such case: ស្ត្រី represents strī, ស្រ្តី represents srtī, and ស្រី្ត represents srīt. In the modern Khmer language, only the first pronunciation is valid; in this language, the medial consonant ◌្រ -r- must come after any other medial consonant, and final consonants are written as separate orthographic syllables. However, in the Tampuan language ◌្រ -r- comes before ◌្វ -v-, and in Middle Khmer, an older form of the Khmer language, final consonants can be written by using the same marks as for medial consonants, but after the vowel. For Tampuan, some fonts distinguish the different orders of medial consonants. For Middle Khmer, the use of the mark of a medial consonant as a final consonant can only sometimes be visually distinguished from its use as a medial consonant; in most cases, the reader has to derive the correct pronunciation from the context.
Finally, phonetic order raises the question of how to prevent interference with Unicode normalization. Such interference can come in two ways:
- The canonical ordering algorithm, as mentioned above, reorders marks based on their ccc values, and the canonical order might not correspond to the phonetic order. The encoding of most Brahmic scripts minimizes this issue by assigning ccc=0 to most of their combining marks. However, virama-like characters usually have the ccc value 9, and older nukta characters the ccc value 7. These values may be reasonable relative to each other because the nukta modifies the preceding consonant itself while the virama-like characters indicate the absence of its inherent vowel. However, they also imply that a sequence of a virama followed by a nukta is canonically equivalent to, and therefore should render the same as, the nukta followed by the virama, and that doesn’t always work out. In addition, a number of other marks also received ccc values other than 0 and may therefore be reordered, resulting in a deviation from an encoding order based on phonetic order as well as in rendering problems.
- Canonical decomposition decomposes certain characters into other characters representing their components. For example, U+0DDA ◌ේ SINHALA VOWEL SIGN DIGA KOMBUVA is decomposed into U+0DD9 ◌ෙ SINHALA VOWEL SIGN KOMBUVA and U+0DCA ◌් SINHALA SIGN AL-LAKUNA. Such decompositions may conflict with an encoding order based on phonetic order that doesn’t take them into consideration.
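The Sinhala example can be checked directly:

```python
import unicodedata

# U+0DDA decomposes canonically into U+0DD9 followed by U+0DCA (ccc = 9),
# so NFD splits the two-part vowel into its components:
assert unicodedata.combining("\u0DCA") == 9
assert unicodedata.normalize("NFD", "\u0DDA") == "\u0DD9\u0DCA"
```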
Neither incorrect ccc values nor inappropriate canonical decomposition mappings can ever be fixed, because of the Unicode policy of Normalization Stability.
Indic data
The Unicode Standard provides some data that, outside the standard, has become useful in defining encoding orders: the Unicode character properties Indic_Syllabic_Category (InSC) and Indic_Positional_Category (InPC), collectively the “Indic data”. This data covers the Brahmic scripts encoded in the Unicode Standard as well as the structurally similar Kharoshthi script. The syllabic category specifies the function each character normally has within an orthographic syllable, while the positional category specifies for each combining mark how it is normally positioned relative to its base character. For example, U+17BE ◌ើ KHMER VOWEL SIGN OE has InSC=Vowel_Dependent and InPC=Top_And_Left. The documentation for the properties has room for improvement, and some categories are poorly named – “virama” here means the shapeshifters, while the characters representing real-world visible virama marks are designated as “pure killers”. But the main issue is that the Unicode Standard doesn’t actually use the Indic data to define an encoding order. The next section describes the only known specification that does.
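The Indic data ships as plain-text files in the Unicode Character Database (IndicSyllabicCategory.txt and IndicPositionalCategory.txt); Python’s unicodedata doesn’t expose these properties, but the file format is trivial to parse. A sketch, with sample lines written in the UCD format (the values match those cited above):

```python
def parse_ucd_file(lines):
    """Parse UCD lines of the form 'codepoint(..codepoint) ; Value # comment'."""
    data = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        rng, value = (field.strip() for field in line.split(";"))
        lo, _, hi = rng.partition("..")
        for cp in range(int(lo, 16), int(hi or lo, 16) + 1):
            data[cp] = value
    return data

# Illustrative sample lines in the UCD file format:
inpc_sample = ["17BE          ; Top_And_Left   # KHMER VOWEL SIGN OE"]
insc_sample = ["1780..17A2    ; Consonant      # Khmer consonants"]

assert parse_ucd_file(inpc_sample)[0x17BE] == "Top_And_Left"
assert parse_ucd_file(insc_sample)[0x17A0] == "Consonant"
```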
Defining encoding order outside of the Unicode Standard
There have been a number of attempts outside the Unicode Standard to specify the encoding orders of Brahmic scripts. The most influential ones are those of the OpenType shaping engines developed by Microsoft, which are – with variations – implemented in all major OpenType rendering systems. Another interesting effort is that by the Internet Corporation for Assigned Names and Numbers (ICANN), whose goal is to prevent spoofing of domain names. Finally, researchers at SIL Global and others have worked to define encoding orders for the most complex Brahmic scripts, including Myanmar and Khmer, as part of extended script documentation. We’ll take a look at these efforts and their relationships to the Unicode Standard.
OpenType shaping engines
Shaping engines are the components of OpenType rendering systems that take an initial sequence of representative glyphs for the characters in the text to be rendered and transform it into a two-dimensional arrangement of glyphs by applying glyph reordering, substitution, and positioning operations. There’s one relatively simple shaping engine for all simple scripts (strangely called the “standard” scripts), a variety of single-script shaping engines for more complicated scripts, and finally the “universal” shaping engine that attempts to handle all scripts that weren’t already handled by some other shaping engine. Each engine is described in its own document.
Among Brahmic scripts, Bengali (Bangla), Devanagari, Gujarati, Gurmukhi, Kannada, Khmer, Lao, Malayalam, Myanmar, Oriya (Odia), Tamil, Telugu, and Thai are currently described as being handled by dedicated single-script engines, the engine for the New Tai Lue script is unspecified, and all others are handled by the Universal Shaping Engine (USE).
One key feature of the shaping engines for Brahmic scripts is validation: They’re documented to check for invalid character sequences, and to insert the character U+25CC ◌ DOTTED CIRCLE before any combining mark found in an invalid position (base characters are valid anywhere). What’s valid is described in most engines’ documentation by regular expressions, thus defining an encoding order for each script. The USE uses several properties of the Unicode Character Database, such as the general category and the Indic syllabic and positional categories, to classify characters for the regular expressions; the documentation of the other engines sometimes uses explicit lists of characters, and sometimes relies on the intuition of the reader.
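The general mechanism can be sketched in a few lines of Python. The syllable pattern below is a drastically reduced, hypothetical stand-in for a real script grammar (just a Tamil consonant plus two optional vowel-sign slots in a fixed order); the dotted-circle insertion mirrors what the engines are documented to do:

```python
import re
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def validate(text, syllable_re):
    """Insert U+25CC before combining marks stranded outside a valid syllable."""
    out, i = [], 0
    while i < len(text):
        m = syllable_re.match(text, i)
        if m and m.end() > i:
            out.append(m.group())
            i = m.end()
        else:
            ch = text[i]
            if unicodedata.category(ch).startswith("M"):  # stranded mark
                out.append(DOTTED_CIRCLE + ch)
            else:
                out.append(ch)
            i += 1
    return "".join(out)

# Toy syllable: one Tamil consonant, optionally ◌ெ/◌ே/◌ை, then ◌ா/◌ி/◌ீ.
toy_tamil = re.compile("[\u0B95-\u0BB9][\u0BC6-\u0BC8]?[\u0BBE-\u0BC0]?")

# String ① (ம ◌ெ ◌ா ழ ◌ி) passes; string ② (ம ◌ா ◌ெ ழ ◌ி) gets a dotted circle:
assert DOTTED_CIRCLE not in validate("\u0BAE\u0BC6\u0BBE\u0BB4\u0BBF", toy_tamil)
assert DOTTED_CIRCLE in validate("\u0BAE\u0BBE\u0BC6\u0BB4\u0BBF", toy_tamil)
```

This reproduces, in miniature, the CoreText behavior shown earlier for the two Tamil strings.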
Unfortunately, the documentation of most of these shaping engines has always been imprecise, and was abandoned at least a decade ago. Technical bugs have not been fixed, changes in the implementations not reflected, many characters not classified, inconsistencies with the Unicode Standard and other script documentation (such as Representing Myanmar in Unicode) not resolved, and imprecise statements not clarified. In the meantime, the major implementations have diverged substantially, as documented for Devanagari and Khmer.
Nevertheless, the best available specification of encoding orders for many Brahmic scripts is currently provided by the documentation of the OpenType Universal Shaping Engine (USE). This specification classifies characters based on several Unicode properties, including the Indic data mentioned above, and provides a regular expression describing a generic syllable structure for Brahmic scripts. As the encoding order resulting from this generic syllable structure and the property values provided by the Unicode Standard doesn’t always allow all syllables that can be found in real texts, the USE overrides the Indic data where necessary. Deriving the specific syllable structure for any particular script takes a bit of effort, so Encoding orders of Brahmic scripts provides the complete set as of Unicode 16.
The USE specification is good, but not free of issues:
- It does not cover the most widely used Brahmic scripts, as those are still handled by dedicated single-script shaping engines.
- Its generic syllable structure is insufficient for the Tai Tham script, a particularly complex script that, despite several past efforts, is also particularly underspecified in the Unicode Standard.
- It’s hidden inside the documentation of a rendering system. Implementers of all the systems listed above that should interoperate are unlikely to look into the specification of a rendering system unless specifically told to do so. Neither the Unicode Standard nor, for example, the Unicode Consortium’s specification for keyboard descriptions within the Locale Data Markup Language mentions it.
- The USE specification was largely abandoned around 2020. While newly encoded scripts were added, some small issues corrected, and the associated character override data updated, major issues in the specification have not been addressed. Implementations have started to drift apart.
- The USE was not designed to be compatible with canonical ordering. For the majority of Brahmic scripts that it supports, that’s not a problem, because the ccc values for all characters except virama-like characters and nuktas are set to 0, thus disabling reordering. Some Brahmic scripts, however, have characters with ccc values such that canonical ordering may produce an order that’s incompatible with the USE cluster structure, resulting in dotted circles unless the USE character data provides appropriate overrides. For example, within each class of combining marks, the USE expects above-base marks before below-base marks, while canonical ordering results in the reverse order.
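The reordering effect is easy to observe with Python’s unicodedata module. The example below uses Thai marks (Thai is handled by a dedicated engine rather than the USE, but its ccc values illustrate the mechanism): the tone mark ◌่ U+0E48 has ccc 107 (above-base) and the vowel ◌ุ U+0E38 has ccc 103 (below-base), so canonical ordering always moves the below-base mark first.

```python
import unicodedata

# Thai ก (U+0E01) followed by an above-base tone mark and a below-base
# vowel. Canonical ordering sorts marks by ascending ccc, so Unicode
# normalization swaps the two marks:
s = "\u0E01\u0E48\u0E38"            # base, above-base mark, below-base mark
assert unicodedata.combining("\u0E48") == 107
assert unicodedata.combining("\u0E38") == 103
nfc = unicodedata.normalize("NFC", s)
assert nfc == "\u0E01\u0E38\u0E48"  # below-base mark now comes first
```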
As far as I know, Microsoft has never brought a proposal to the Unicode Consortium that would have specified the encoding orders of Brahmic scripts in the Unicode Standard based on the validation rules of their OpenType shaping engines.
Domain names
ICANN has defined label generation rulesets that restrict the set of valid top-level (root zone) and second-level domain names in various scripts in order to prevent spoofing of such domain names. The rulesets use a custom XML schema defined in RFC 7940: Representing Label Generation Rulesets Using XML. Unlike regular expressions, these rulesets can identify variants of a given name, that is, other names that could be confused with the given one. Rather than declaring one variant valid and all others invalid, this allows a registry to register the first variant applied for and then block registration of all others.
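The register-first-then-block policy can be sketched in a few lines (this is a toy model, not RFC 7940’s actual XML ruleset machinery). The variant_key function maps all confusable variants of a name to one canonical key; the crude stand-in used below treats names with the same first character and the same multiset of remaining characters as variants of each other, which happens to group the three encodings of ស្ត្រី.

```python
# Toy sketch of variant blocking: the first name registered under a key
# is accepted, and all later confusable variants of it are blocked.
class Registry:
    def __init__(self, variant_key):
        self.variant_key = variant_key
        self.registered = {}

    def register(self, name: str) -> bool:
        key = self.variant_key(name)
        if key in self.registered and self.registered[key] != name:
            return False  # a confusable variant is already registered
        self.registered[key] = name
        return True

# Crude confusability key (an assumption for this demo only):
registry = Registry(lambda s: s[0] + "".join(sorted(s[1:])))
# The first encoding of Khmer "woman" is accepted ...
assert registry.register("\u179F\u17D2\u178F\u17D2\u179A\u17B8")
# ... and the two look-alike encodings are then blocked.
assert not registry.register("\u179F\u17D2\u179A\u17D2\u178F\u17B8")
assert not registry.register("\u179F\u17D2\u179A\u17B8\u17D2\u178F")
```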
Interestingly, the rules for top-level domain names (which ICANN controls) are not the same as the rules for second-level names (which are only recommendations for registrars). For example, the rules for second-level names for Khmer allow only one of the encodings of ស្ត្រី, while the rules for top-level names allow two of them. (The .com registrars seem to ignore the rules entirely.)
The ICANN rulesets prioritize security above all: they support only major languages, prohibit control characters that are only needed to control rendering, and rule out character sequences that could create confusion between different domain names. They are therefore primarily of interest to other areas that have to prioritize security over expressiveness, such as programming language identifiers. They wouldn’t be useful as a general specification of encoding orders for the Unicode Standard, whose goal is that “everyone in the world should be able to use their own language on phones and computers”. On the other hand, well-defined encoding orders in the Unicode Standard would very likely have simplified the development of the ICANN rulesets.
Extended script documentation
Several people have written documents with information about specific scripts that goes beyond the short introductions in the Unicode Standard. Researchers at SIL Global have been particularly active in this regard.
Some of these documents include descriptions of encoding orders and have been published as Unicode Technical Notes (UTNs), for Myanmar, Javanese, Kawi, and most recently Khmer. The problem with UTNs is that they are not reviewed by the Unicode Technical Committee (UTC) and not normative. They’re basically personal opinions, and readers have to judge for themselves whether they find them trustworthy, and to what extent they want to follow them when implementing fonts, keyboards, and other software for these scripts.
The UTN on Khmer is an outcome of a project started around 2020 to resolve the serious issues in the encoding of Khmer identified in Spoof-Vulnerable Rendering in Khmer Unicode Implementations and Issues in Khmer syllable validation. Contributors were SIL Global researchers Makara Sok, Martin Hosken, Diethelm Kanjahn, and Marc Durdin, as well as myself. This project exemplifies the amount of work that is necessary to define the encoding order of a complex Brahmic script:
- Linguistic research: Makara Sok provided a solid foundation with Khmer Character Specification/Usages, which documents the characters of the Khmer script with their properties, their uses in the different languages that use the script, including possible character combinations, and the differences in rendering requirements between these languages. His earlier thesis “Phonological Principles and Automatic Phonemic and Phonetic Transcription of Khmer Words” included a detailed analysis of the use of register shifters, a particularly complex aspect of the Khmer script and its encoding.
- Definition of encoding order: A new encoding order was developed based on the linguistic research, on a detailed analysis of existing definitions of encoding orders for Khmer, and on encoding practice in fonts, keyboards, and encoded text as they have evolved since Khmer was added to the Unicode Standard in 1999. The goals for this encoding order were to allow the representation of text in any language for which the script is or has been used, to allow only a single encoding of any correct visual rendering, and to be as compatible as possible with current encoding practice. One particularly difficult area was the development of encoding and rendering rules for register shifters, where Makara’s analysis helped find a solution. Another was how to prevent common mistakes in writing the modern Khmer language while still providing all the capabilities needed to write Middle Khmer or Tampuan. The new encoding order and rationale for the decisions that led to it are now documented in the Unicode technical note.
- Collaboration with stakeholders: Problems and possible solutions were extensively discussed with linguistic and technical experts in Cambodian government organizations, including the Royal Academy of Cambodia and the Cambodia Academy of Digital Technology. Cambodia doesn’t have a national standards body, but in other countries such a body would be a very important partner to work with as well.
- Development of enabling technology: The new encoding order then became the basis for updates to the smart Khmer keyboard developed by SIL Global, the Angkor keyboard. The technical note also provides reference implementations for functions that normalize pre-existing Khmer text to the new encoding order or verify that text is well-formed according to the new encoding order. Both functions take a language parameter and handle Modern and Middle Khmer according to their respective requirements.
- Transition plan: A particularly hard problem in moving towards a better defined and therefore stricter encoding order is what to do with all the existing text, and text still being produced, that does not conform to the new encoding order. If only one encoding should be possible for ស្ត្រី, then the others have to be prohibited, which ultimately means rendering them with dotted circles. The transition plan for the Khmer encoding order tries to minimize the impact of this: It starts with a number of steps enabling the proposed new encoding order, including improved keyboards and other text producers. It then proposes to eventually convert all existing text that needs to be maintained long-term to the new encoding order, and finally turns on validation of the new encoding order in font rendering by default. (I take responsibility for this transition plan.)
Despite all this work, experts in the Unicode Script Ad Hoc (now Script Encoding Working Group) did not like the outcome. Their most important comment in Recommendations to UTC #174 on the 2022 precursor document of the UTN: “Most experts agreed that we want to make sure existing Khmer text should render as it has been, instead of dotted circles starting to appear suddenly.” Let’s consider why.
The dotted circle problem
Using dotted circles for validation as part of font rendering has a major problem: Most of the time they’re just a nuisance. That’s because most of the time users are reading, not editing text. When readers see dotted circles in the text they read, there’s nothing they can do about it. Dotted circles are only helpful while an author is editing text (and maybe while reading in a security-sensitive context). Theoretically one might imagine showing dotted circles only during editing, but that would require that font rendering systems (and possibly fonts) know when they’re being used for editing and when not. They don’t.
As the primary job of a font rendering system is to make text readable, not to act as an encoding validator, the trend has been to reduce dotted circle insertion, especially when resolving differences between rendering systems on different platforms. In fact, this did happen for ស្ត្រី: In January 2019, the font rendering system on Apple platforms inserted dotted circles into two of the three character sequences; now it doesn’t do that anymore. Shaping engine implementations for Thai seem to have stopped validating entirely.
Making font rendering systems responsible for validation may have been a good idea back when keyboards were generally dumb and a single font rendering system controlled rendering on the vast majority of all computer screens. It may not be the best solution today.
A new type of normalization?
How normalization can help
Let’s look again at the failures in rendering and search discussed above:
Consumer | Rendering | Search
---|---|---
Text received | ம ◌ா ◌ெ ழ ◌ி | ស ◌្ត ◌្រ ◌ី
Result | (rendered with dotted circles) | ស្ត្រី ≠ ស្រ្តី
 | visible failure | silent failure
Instead of highlighting invalid encoding orders in rendering, and failing to find matches in search, a better solution may be a new form of normalization. Rather than separating “correct” and “incorrect” encoding orders, normalization selects one preferred encoding order among several “equivalent” or “similar” ones, and maps the equivalent or similar encodings to the preferred one.
Normalization is particularly useful in processes interpreting text (“consumers”), as it immediately yields better results.
Consumer | Rendering | Search
---|---|---
Text received | ம ◌ா ◌ெ ழ ◌ி | ស ◌្ត ◌្រ ◌ី
→ normalization → | |
Text normalized | ம ◌ெ ◌ா ழ ◌ி | ស ◌្ត ◌្រ ◌ី
Result | மொழி | ស្ត្រី = ស្ត្រី
 | success | success
However, normalization should also be used in processes generating text (“producers”), such as smart keyboards or speech input systems. The main advantage is that consumers that haven’t yet been upgraded to the new normalization still receive normalized text and produce better results. But it also improves performance in consumers that have been upgraded – it’s cheaper to check whether text is normalized than to normalize it.
Producer | Smart keyboard |
---|---|---
Text input | ம ◌ா ◌ெ ழ ◌ி | ស ◌្ត ◌្រ ◌ី
→ normalization → | |
Text normalized | ம ◌ெ ◌ா ழ ◌ி | ស ◌្ត ◌្រ ◌ី
→ transmission → | |
Consumer | Rendering | Search
Text received | ம ◌ெ ◌ா ழ ◌ி | ស ◌្ត ◌្រ ◌ី
→ normalization verification → | |
Result | மொழி | ស្ត្រី = ស្ត្រី
 | success | success
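The performance point is concrete: verifying needs only a single pass with early exit and no allocation. The sketch below checks just Unicode’s existing canonical ordering (ccc values must not decrease within a run of combining marks); a verifier for a Brahmic normalization form would follow the same single-pass pattern with richer ordering rules.

```python
import unicodedata

def is_canonically_ordered(text: str) -> bool:
    """Single pass, early exit, no copy: within any run of combining
    marks, combining class values must not decrease."""
    prev = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        if ccc != 0 and prev > ccc:
            return False  # out-of-order mark found, bail out immediately
        prev = ccc
    return True
```

For example, the Thai sequence ก + ◌ุ (ccc 103) + ◌่ (ccc 107) passes, while the same marks in the opposite order fail.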
Requirements for Brahmic normalization
Normalization for Brahmic scripts would have to differ from existing Unicode normalization forms in several ways:
- Its basic elements would in many cases have to be sequences of characters, not individual characters. Conjunct forms represented as conjoiner-consonant sequences are the most common case, but there are also cases such as the Myanmar mark ◌် asat, which can attach to a variety of other characters and must remain with the character it’s attached to.
- It cannot distinguish between spacing and nonspacing characters. Instead, it needs to distinguish between the bases of orthographic syllables, characters that can precede such bases (for example, repha in certain scripts), and other syllable components.
- It may have to come with several different variants per script. Smart keyboards and other producers may need language-specific variants, for example, to cover the differences between the usages of the Khmer script for the Modern Khmer, Middle Khmer, and Tampuan languages. A variant for rendering has to maintain all differences that matter to any of the languages, even if they seem irrelevant to or are overlooked by users of other languages, in the same way that diacritics in the Latin script have to be rendered accurately even if they seem irrelevant to users of languages that don’t use them. A variant for searching, on the other hand, might match all “similar” character sequences, erasing differences that matter only for some languages, in the same way that search functions for the Latin script offer an option to ignore diacritics and case distinctions.
- It cannot have stability guarantees, as it must remain possible to correct it when better or more complete information about a script’s use becomes available. This in turn means that interchange between systems based on different versions of Unicode cannot rely on text normalized on one system remaining normalized on the other.
However, normalization for Brahmic scripts cannot ignore Unicode normalization in the way the specifications for OpenType shaping engines do, as that could lead to incompatible results depending on whether text has gone through Unicode normalization already. Instead, it will need to perform Unicode normalization as its first step and then correct the results, avoiding the introduction of unnecessary differences such as reversing the order of above- and below-base marks.
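The two-step shape of such a normalization can be sketched as follows. The code first applies standard Unicode normalization, then a script-specific correction; the correction here is a toy rule for illustration only, putting Thai above-base marks (ccc 107) back in front of below-base marks (ccc 103), undoing the swap that canonical ordering performs. Real correction rules would be per-script and far richer.

```python
import unicodedata

BELOW = 103  # ccc of the below-base marks in this toy correction

def normalize_then_fix(text: str) -> str:
    # Step 1: standard Unicode normalization, so the result is the same
    # whether or not the input has already been normalized elsewhere.
    text = unicodedata.normalize("NFC", text)
    # Step 2: script-specific correction (toy rule): within each run of
    # combining marks, move marks with ccc 103 after the other marks.
    out, run = [], []
    for ch in text + "\0":            # sentinel flushes the final run
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            run.sort(key=lambda c: unicodedata.combining(c) == BELOW)
            out.extend(run)
            run = []
            out.append(ch)
    return "".join(out)[:-1]          # drop the sentinel

# Thai ก + tone mark (above, ccc 107) + vowel (below, ccc 103):
# both input orders converge on above-before-below.
assert normalize_then_fix("\u0E01\u0E48\u0E38") == "\u0E01\u0E48\u0E38"
assert normalize_then_fix("\u0E01\u0E38\u0E48") == "\u0E01\u0E48\u0E38"
```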
Normalization that extends beyond the current canonical equivalence was previously proposed by Lawrence Wolf-Sonkin. His proposal, however, seems limited to applying the variety of equivalences that were at the time only informally documented in the Unicode Standard, and which starting in Unicode 16 are also collected in the file DoNotEmit.txt. Normalizing encoding order for Brahmic scripts would go beyond this, and require more work, because encoding order is currently largely undocumented.
The Arabic Mark Transient Reordering Algorithm (AMTRA) specified in Unicode Arabic Mark Rendering, an annex to the Unicode Standard, is another example of correcting the encoding order that may result from Unicode normalization. Rendering Arabic marks in the way implied by their ccc values can lead to them being stacked incorrectly. AMTRA reorders them such that marks are positioned the way users expect.
Not a simple fix
Developing a new standardized normalization for Brahmic scripts is not easier than fully specifying encoding orders. It still requires collecting information about the requirements of all the different languages that use a script. For many scripts it may turn out that two or more different normalization variants are needed, as discussed above.
If and when the owners of search engines realize that languages such as Khmer are worth supporting well, it’s very likely that they will develop an algorithmic solution, and not try to encourage the owners of web sites to convert text to a standardized encoding order as the UTN on Khmer proposes. Normalization is the obvious solution. The big question then is whether each search engine will get its own normalization, or whether the owners will collaborate and develop a standard encoding order and normalization. Standardization would have the advantage that the same normalized encoding order could also be used by text producers, whether smart keyboards or speech input systems, leading to better results overall.
Unicode may be in a similar situation as HTML was 21 years ago, before the WHATWG was founded. While HTML at the time had a well-defined syntax, and the W3C was promoting XHTML as a stricter variant, many web pages used incorrect syntax, “tag soup”. Browsers tried to make sense of tag soup and render it, but in different ways. Then the WHATWG was formed to develop the HTML5 standard, part of which standardizes the interpretation of tag soup. This was a big multi-year multi-company project, but it paid off.
Acknowledgments
I’d like to thank Martin Hosken, Makara Sok, Didi Kanjahn, and Marc Durdin for many discussions in the Khmer encoding project that helped me better understand the issues around encoding order as well as possible solutions. I’d also like to thank Alolita Sharma, Andrew Glass, Aditya Bayu Perdana, Behdad Esfahbod, Ben Mitchell, Deborah Anderson, Lee Collins, Liang Hai, Maaike Langerak, Makara Sok, Marc Durdin, Martin Hosken, Muthu Nedumaran, Peter Constable, Peter Lofting, and Richard Ishida for providing information for this document or feedback on drafts of it.
References
Deborah Anderson, Ken Whistler, Roozbeh Pournader, and Peter Constable: Recommendations to UTC #174 January 2023 on Script Proposals. UTC document L2/23-012. The Unicode Consortium, 2023.
Peter Constable: Canonical Ordering of Marks in Thai Script. UTC document L2/18-216. The Unicode Consortium, 2018.
Kim Davies, Asmus Freytag: RFC 7940: Representing Label Generation Rulesets Using XML. Internet Engineering Task Force, 2016.
Mark Davis et al.: Unicode Locale Data Markup Language (LDML). Version 47. Unicode Technical Standard 35. The Unicode Consortium, 2025. See in particular Steven Loomis et al.: Keyboards.
Joshua Horton, Makara Sok, Marc Durdin, Rasmey Ty: Spoof-Vulnerable Rendering in Khmer Unicode Implementations. European Language Resources Association, 2019.
Martin Hosken: Representing Myanmar in Unicode. Details and Examples. Version 4. Unicode Technical Note 11. The Unicode Consortium, 2012.
Martin Hosken: Khmer Encoding Structure (Nov 2022). Contributors: Makara Sok, Norbert Lindenberg. UTC document L2/22-290. The Unicode Consortium, 2022.
Martin Hosken: Khmer Encoding Structure. Contributors: Makara Sok, Norbert Lindenberg. Unicode Technical Note 61, Version 2. The Unicode Consortium, 2025.
Internet Corporation for Assigned Names and Numbers: Root Zone Label Generation Rules. ICANN, 2022. See in particular Root Zone LGR for script: Khmer (Khmr).
Internet Corporation for Assigned Names and Numbers: Second-Level Reference Label Generation Rules. ICANN, 2024. See in particular Reference LGR for script: Khmer (Khmr).
Robin Leroy, Mark Davis: Unicode Source Code Handling. Unicode Technical Standard 55. The Unicode Consortium, 2024.
Norbert Lindenberg: Issues in Khmer syllable validation. Lindenberg Software LLC, 2019.
Norbert Lindenberg et al.: Unilateral change to USE cluster model. HarfBuzz issue 3498. GitHub, 2022.
Norbert Lindenberg: Issues in Devanagari cluster validation. Lindenberg Software LLC, 2020.
Norbert Lindenberg: Implementing Javanese. Unicode Technical Note 47. The Unicode Consortium, 2022.
Norbert Lindenberg: Implementing Kawi. Unicode Technical Note 48. The Unicode Consortium, 2022.
Norbert Lindenberg: Encoding orders of Brahmic scripts. Lontar GmbH, 2024.
Norbert Lindenberg: Introduction to Brahmic scripts. Lontar GmbH, 2025.
Lontar GmbH: Keyboards and fonts for iPhone and iPad. Lontar GmbH, 2014-2024.
Microsoft Corporation: Creating and supporting OpenType fonts for the Universal Shaping Engine. Microsoft Corporation, dated 2024-09-15. See also shaping engine documentation for Bengali (Bangla), Devanagari, Gujarati, Gurmukhi, Kannada, Khmer, Lao, Malayalam, Myanmar, Oriya (Odia), Tamil, Telugu, and Thai.
Roozbeh Pournader, Bob Hallissy, Lorna Evans: Unicode Arabic Mark Rendering. Unicode Standard Annex 53. The Unicode Consortium, 2024.
SIL Global: Khmer Angkor. SIL Global, 2025.
Makara Sok: Phonological Principles and Automatic Phonemic and Phonetic Transcription of Khmer Words. Payap University, 2016.
Makara Sok: Khmer Character Specification/Usages. Version of 2024-08-02. GitHub, 2024.
Richard Wordingham: Decomposed Sinhala Vowels. Microsoft Typography Issues, issue 905. GitHub, 2022.
The Unicode Consortium: Topical Document List: Tai Tham. Accessed 2025-05-22.
The Unicode Consortium: Normalization Stability. The Unicode Consortium, dated 2024-01-09.
The Unicode Consortium: The Unicode Standard, Version 16.0.0. The Unicode Consortium, 2024. See in particular Core Specification.
Lawrence Wolf-Sonkin: Beyond Canonical Equivalence: A discussion on a way forward. UTC document L2/23-056. The Unicode Consortium, 2023.