Lontar

Order and disorder in Unicode

Norbert Lindenberg
May 28, 2025

The Unicode Standard lacks well-defined encoding orders for the orthographic syllables of Brahmic scripts. This creates problems such as missing search results, incorrectly rendered text, and security holes. This article discusses the causes of these problems and various localized attempts at solving them. It proposes a new form of normalization as a more generic solution.

This article requires web fonts to be rendered correctly. Please read it in a browser and mode that supports web fonts (“reader” views don’t).

Contents

Using copyrighted material without license to create AI systems is theft.

Disorder

Equal inputs, different outputs

Let’s start with a little experiment: Using your favorite browser, go to your favorite search engine, and search for the three strings ស្ត្រី, ស្រ្តី, and ស្រី្ត, one after another.

As of May 2025, these three strings look the same in all major browsers (showing the word “woman” in Khmer), but produce very different search results in all major search engines. Here are the first pages of results I got from Google:

[First pages of Google results for ស្ត្រី, ស្រ្តី, and ស្រី្ត]

This is pretty disturbing. How can strings that look identical produce such different results? To understand this, we need to look at one aspect of the Unicode Standard that’s poorly understood, poorly specified, and poorly implemented: The order of characters within Unicode-encoded strings for Brahmic scripts, or encoding order. The three strings use the same Unicode characters, but in different sequences:

Encoding order  Rendering
ស  ◌្ត  ◌្រ  ◌ី  ស្ត្រី
ស  ◌្រ  ◌្ត  ◌ី  ស្រ្តី
ស  ◌្រ  ◌ី  ◌្ត  ស្រី្ត

Each of these encoding orders conforms to the grammar for Khmer orthographic syllables given in the Unicode Standard. However, the standard says nothing about whether they should be treated as equivalent in search or whether fonts and font rendering systems should render them in a way that makes them distinguishable.
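These properties can be checked directly. A short Python sketch using the standard unicodedata module confirms that the three strings contain the same code points and that standard Unicode normalization leaves all three orders distinct:

```python
import unicodedata

# The three visually identical Khmer renderings of "strī" (woman),
# built from the same four letters in three different encoding orders.
s1 = "\u179F\u17D2\u178F\u17D2\u179A\u17B8"  # ស ◌្ត ◌្រ ◌ី
s2 = "\u179F\u17D2\u179A\u17D2\u178F\u17B8"  # ស ◌្រ ◌្ត ◌ី
s3 = "\u179F\u17D2\u179A\u17B8\u17D2\u178F"  # ស ◌្រ ◌ី ◌្ត

# Same multiset of code points:
assert sorted(s1) == sorted(s2) == sorted(s3)

# Unicode normalization does NOT unify them: canonical ordering only
# reorders adjacent marks with different nonzero ccc values, and here
# the vowel sign ◌ី has ccc=0, so nothing gets reordered.
assert len({unicodedata.normalize("NFC", s) for s in (s1, s2, s3)}) == 3
```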

It’s worth noting that the four characters in ស្ត្រី each denote a separate sound. Transliterated into Latin, but arranging the characters into two-dimensional orthographic syllables the way Khmer does, the table above would look like this:

Encoding order  Rendering
S  t  r  ī  [“strī” laid out like a Khmer syllable]
S  r  t  ī  [“strī” laid out like a Khmer syllable]
S  r  ī  t  [“strī” laid out like a Khmer syllable]

Looking good here; not there

Now take a look at these two strings: மொழி and மாெழி. As of May 2025, browsers render them (the word “language” or “word” in Tamil) in different ways:

[மொழி and மாெழி on Safari: with dotted circle]
Rendering in Safari
[மொழி and மாெழி on other browsers: no dotted circle]
Rendering in Firefox and Chromium-based browsers

Safari on the Mac, and any browser on iPhone or iPad, render the second string with a dotted circle, and with the vowel ◌ெ in the wrong place. Other browsers render it without that dotted circle and exactly like the first string. The first string is rendered without a dotted circle everywhere.

Dotted circles are inserted by font rendering systems to indicate that a character sequence is invalid. Here Apple’s font rendering system, CoreText, considers the second string invalid, while the font rendering system used by other browsers, HarfBuzz, considers it valid. Such differences between font rendering systems are quite common, as I’ve documented for Khmer and Devanagari.

The difference between the two strings is the encoding order:

String  Encoding order  Rendering (CoreText)  Rendering (HarfBuzz)
1  ம  ◌ெ  ◌ா  ழ  ◌ி  மொழி without dotted circle  மொழி without dotted circle
2  ம  ◌ா  ◌ெ  ழ  ◌ி  மொழி with dotted circle  மொழி without dotted circle

Is the second encoding order invalid, as the CoreText rendering indicates, or valid, as the HarfBuzz rendering implies? The Tamil section of the Unicode Standard shows several examples with ◌ெ  ◌ா, and none with ◌ா  ◌ெ, but it doesn’t explicitly say that the latter is not allowed. This is a problem because there are Tamil keyboards, mainly on non-Apple devices, that let users enter the second version, which then renders as broken when sent to Apple devices.

From dumb to undirected smart

Keyboards used to be dumb mappings from keys to characters, but today can be much smarter. Ensuring a correct encoding order could be one aspect of their smartness, but there’s no standard giving this smartness direction.

In the 1990s, keyboards simply mapped keys on a physical keyboard to characters in the character encoding used by the computer. Initially one keystroke mapped to one character. Later, the ability was added to merge a dead key with a subsequent character, such as ◌̀ + e → è, and some keys could generate two characters. The context in which text was input never played a role. Getting text right was entirely up to the user.

Separately, input methods (or “input method editors”) were developed, the complicated software packages that enable text input for writing systems with very large character sets, especially Chinese and Japanese. Input methods analyze series of keystrokes, look at the surrounding text in the document being written, and produce a list of candidate character sequences from which the user can pick the right one. Eventually they even predicted what the user might want to write. Operating systems provide special APIs through which input methods and applications can communicate with each other to produce optimal results.

This separation has gone away. First, some developers realized that they could use APIs designed for input methods to enable smart input for scripts other than Chinese, Japanese, or Korean. API functions that let an input method look at the surrounding text in the document being written proved particularly useful. Smart keyboards for Brahmic scripts can thus reorder the characters within an orthographic syllable, replace character sequences that the Unicode Standard says should not be used with their recommended alternatives, and correct text in other ways. Then the Keyman keyboard framework enabled the use of these facilities in a platform-independent way – instead of dealing with input method APIs, keyboard developers could just describe which transformations should be applied to input text. Finally, when text input APIs were added to Android and iOS, they provided the capabilities of input method APIs, even though the API on iOS uses the word “keyboard” throughout. Keyboards provided by Lontar GmbH use these capabilities to reorder and otherwise correct input text.

For at least a decade – the iOS keyboard extension API was released in 2014 – the technology has thus existed to let smart keyboards ensure that text is input as valid character sequences. What’s been missing is direction from a standard defining what “valid” really means. Because of this lack of direction, many keyboards today remain dumb and don’t correct the encoding order. It even affects the work of other parts of the Unicode Consortium itself: the technical committee for the Common Locale Data Repository provides a standard for describing keyboards in the Keyboards part of the Locale Data Markup Language. This standard, among other things, enables the specification of transformations and reordering of keyboard input – but it has nothing to point to that would define expected outcomes.

Lack of interoperability

It’s important to understand that the Unicode Standard is the foundation for many different processes: entering text with keyboards or other input methods, rendering it with fonts, searching it, registering domain names, compiling source code, and more.

A clear specification of the encoding order is necessary to enable interoperability between all these different processes. In short, text producers need to know what’s valid output, and text consumers need to know what’s valid input, and what to do with invalid input. What we saw in the experiments above were breakdowns in interoperability: in the first case, keyboards let users enter text that fonts render the same way but that search engines treat as different; in the second case, keyboards let users enter text that can be rendered differently in different environments.

Insecure domain names

To see another major problem arising from the lack of a well-specified encoding order, open the following three links: ស្ត្រី.com, ស្រ្តី.com, and ស្រី្ត.com.


You’ll see that these links lead to three different web sites. Now check the URL field of the browser. In this case, the results differ between browsers. Some browsers show the original Khmer word in the URL field, and there’s no visible difference between the URLs. Other browsers show the ASCII representations of the domain names, the Punycode, which all browsers use internally. Using Punycode, the URLs are clearly different:

Encoding order  Domain name with Punycode  Domain name
ស  ◌្ត  ◌្រ  ◌ី xn--x2ewn6hrfb.com ស្ត្រី.com
ស  ◌្រ  ◌្ត  ◌ី xn--x2evo6hrfb.com ស្រ្តី.com
ស  ◌្រ  ◌ី  ◌្ត xn--x2evo5hsfc.com ស្រី្ត.com
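The distinct labels can be reproduced with Python’s built-in Punycode codec (a sketch; real IDNA processing adds further mapping and validation steps on top of the raw Punycode encoding):

```python
# The three visually identical Khmer spellings of "strī" produce
# three distinct ASCII domain labels under Punycode (RFC 3492).
spellings = [
    "\u179F\u17D2\u178F\u17D2\u179A\u17B8",  # ស ◌្ត ◌្រ ◌ី
    "\u179F\u17D2\u179A\u17D2\u178F\u17B8",  # ស ◌្រ ◌្ត ◌ី
    "\u179F\u17D2\u179A\u17B8\u17D2\u178F",  # ស ◌្រ ◌ី ◌្ត
]
labels = ["xn--" + s.encode("punycode").decode("ascii") for s in spellings]
assert len(set(labels)) == 3  # three different domain names
```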

In this case, I designed the three web sites to look clearly different. The risk exists, however, that criminals might design alternate web sites that spoof existing web sites in order to obtain users’ passwords or other information.

Insecure source code

A similar problem may occur in programming languages, even in relatively safe ones such as Swift:

let ស្ត្រី = true

func validate() -> Bool {
    let ស្រ្តី = false
    print(ស្រ្តី)
    return ស្ត្រី
}

print(validate())

Code reviewers might think that this prints “false” twice, because the validate function first prints the value of a variable, then returns the value of that variable, which the caller prints again. That’s not what happens; the output is first “false”, then “true”. The ស្ត្រី in the return statement of validate is not the one otherwise used in the function; instead it refers to the constant defined outside the function. Here’s what the compiler sees, with different variables in different colors:

let ស្ត្រី = true

func validate() -> Bool {
    let ស្រ្តី = false
    print(ស្រ្តី)
    return ស្ត្រី
}

print(validate())

When different representations of code in memory have the same appearance on screen, and this leads to different interpretations of the code by compilers and human readers, it’s called source code spoofing. It can lead to inadvertent errors, but can also enable bad actors to introduce malicious code despite code reviews.
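The problem is not specific to Swift. As a sketch, Python normalizes identifiers to NFKC, but NFKC does not unify the differently ordered Khmer strings either, so a Python compiler likewise sees two unrelated names where a reader sees one:

```python
import unicodedata

a = "\u179F\u17D2\u178F\u17D2\u179A\u17B8"  # ស្ត្រី, one encoding order
b = "\u179F\u17D2\u179A\u17D2\u178F\u17B8"  # ស្រ្តី, another order

# Both are syntactically valid Python identifiers...
assert a.isidentifier() and b.isidentifier()

# ...and the NFKC normalization that Python applies to identifiers
# leaves them distinct, so they name two different variables.
assert unicodedata.normalize("NFKC", a) != unicodedata.normalize("NFKC", b)
```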


Defining encoding order in the Unicode Standard

Encoding order is the order in which the Unicode characters representing a piece of text are stored in a computer’s memory. (The Unicode Standard uses the term “logical order”, but there’s nothing particularly logical about it.) Unicode text represents writing, which in turn represents speech. The order in which the spacing characters are visually arranged in writing is called “visual order”, the order in which sounds of the language are pronounced “phonetic order”. In simple cases, such as English written in the Latin script, the characters are written in the order the sounds they represent are spoken. In Brahmic scripts, things are usually not so simple. Let’s look at the simple case first.

For simple scripts: Unicode normalization

If a writing system represents text using only spacing characters and by arranging those characters visually in the order in which they’re pronounced, always moving in the main writing direction, things are easy: Phonetic order and visual order are the same, so the encoding order can simply match them. Indonesian written with Latin characters, Chinese written with Han ideographs, and Bantawa written with the simplified Brahmic script Kirat Rai are such writing systems.

In Unicode, however, many of the characters that users see as one entity can be, and sometimes must be, encoded as sequences of Unicode code points. For example, the character l̥̄, which is used in the ISO 15919 transliteration of several Brahmic scripts, consists for Unicode of three parts: a base character l, a nonspacing mark ◌̄, and a nonspacing mark ◌̥. Since they’re arranged perpendicular to the writing direction, visual order doesn’t define their encoding order. The Unicode Standard solves this with two rules: first, nonspacing marks are encoded after the base they’re attached to. Second, the order of nonspacing marks that don’t interact typographically – such as, in this case, one mark above and one below the base – doesn’t matter. The sequence l  ◌̄  ◌̥ is “canonically equivalent” to l  ◌̥  ◌̄ and should be treated as equal nearly everywhere. Marks that interact typographically because they attach on the same side are ordered outwards from the base, so their order does matter.

The “doesn’t matter” rule is implemented through Unicode normalization, which maps equivalent strings to one preferred form. As part of the Unicode data, nonspacing marks are assigned values of the Unicode property Canonical_Combining_Class (ccc) that reflect their position relative to the base. The value for below-base marks such as ◌̥ is 220, and the value for above-base marks such as ◌̄ is 230. Bases get the value 0. The canonical ordering algorithm, which is part of Unicode normalization, then reorders marks that attach to the same base into a canonical order with ascending ccc values. Since 220 is less than 230, the canonical order for l̥̄ is l  ◌̥  ◌̄.
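This behavior can be observed with Python’s standard unicodedata module:

```python
import unicodedata

# l (U+006C) + combining macron (U+0304) + combining ring below (U+0325)
s = "l\u0304\u0325"
assert unicodedata.combining("\u0325") == 220  # ccc of the below-base mark
assert unicodedata.combining("\u0304") == 230  # ccc of the above-base mark

# Canonical ordering sorts marks on the same base by ascending ccc,
# so normalization moves the ring below in front of the macron:
assert unicodedata.normalize("NFD", s) == "l\u0325\u0304"
```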

Another part of Unicode normalization deals with the precomposed characters that Unicode had to include for compatibility with earlier character encodings, or because users saw them as entities that must be encoded atomically. For example, the character ệ, used in Vietnamese, can be represented by the single Unicode code point ệ, or by the sequence ẹ  ◌̂, or ê  ◌̣, or e  ◌̂  ◌̣, or e  ◌̣  ◌̂. These should all be treated as equal. The canonical decomposition algorithm maps any of these to the last one (labeled “normalization form D”); the canonical composition algorithm can then map back to the first one (labeled “normalization form C”).

Encoding order  Rendering  Normalization form
l  ◌̄  ◌̥  l̥̄
l  ◌̥  ◌̄  l̥̄  NFC, NFD
ệ  ệ  NFC
ẹ  ◌̂  ệ
ê  ◌̣  ệ
e  ◌̂  ◌̣  ệ
e  ◌̣  ◌̂  ệ  NFD
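These equivalences can again be checked with the unicodedata module:

```python
import unicodedata

# Five equivalent encodings of the Vietnamese character ệ:
forms = [
    "\u1EC7",         # precomposed ệ
    "\u1EB9\u0302",   # ẹ + ◌̂
    "\u00EA\u0323",   # ê + ◌̣
    "e\u0302\u0323",  # e + ◌̂ + ◌̣
    "e\u0323\u0302",  # e + ◌̣ + ◌̂
]
# All five collapse to a single NFC form and a single NFD form:
assert {unicodedata.normalize("NFC", f) for f in forms} == {"\u1EC7"}
assert {unicodedata.normalize("NFD", f) for f in forms} == {"e\u0323\u0302"}
```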

Brahmic scripts: not so simple

Few Brahmic scripts are that simple. Text in most Brahmic scripts has to be seen as a sequence of orthographic syllables. These orthographic syllables are two-dimensional arrangements of marks surrounding a central base glyph, which may be a consonant or independent vowel or a ligature involving at least one of these. Our earlier Khmer example ស្ត្រី consists of one such orthographic syllable; the Tamil example மொழி of two, மொ and ழி. While the complete orthographic syllables are arranged continuously in writing direction (usually left-to-right for Brahmic scripts, but top-to-bottom for Phags-pa), the marks within an orthographic syllable are often not arranged that way.

Many Brahmic scripts have features where the visual order clearly differs from the phonetic order:

In addition, canonical ordering is fundamentally incompatible with several characteristics of the encoding of many Brahmic scripts:

Given these issues, and also given precedents set by earlier character encodings, two separate models have evolved for defining the encoding orders of Brahmic scripts. For a few Brahmic scripts the encoding order is primarily based on the visual order; for all others primarily on the phonetic order. In neither case, however, does the underlying visual or phonetic order fully determine the encoding order – a more precise specification is required.

Using visual order

Visual order is used for four Brahmic scripts: Lao, New Tai Lue, Tai Viet, and Thai.

When basing the encoding order on the visual order, spacing characters are stored in the order in which they occur in writing direction. For example, the Thai word for “Thai” is ไทย. Using visual order, the characters are encoded from left to right, ไ  ท  ย, even though ai is pronounced after th.

The issue then is what to do with nonspacing marks, which Lao, Tai Viet, and Thai have. As we’ve seen above, for simple scripts this is handled using Unicode normalization. Tai Viet is simple enough, and its ccc values are assigned such that this can work.

For Thai and Lao, encoding decisions were made that break normalization. First, both scripts encode am vowels, Thai ◌ำ and Lao ◌ຳ, that represent combinations of a nonspacing and a spacing mark, but have no canonical decompositions – one of the fundamentally incompatible features discussed above. Second, ccc values were assigned in a number of incompatible ways. Most nonspacing marks in these scripts received “fixed position” ccc values, whose existence the standard describes as a historical artifact of an earlier stage in its development. The virama characters received the ccc value 9 that’s used for most such characters across Unicode. Both assignments interact badly with the ccc values of script-independent marks that are also used in Thai. Some other marks received ccc=0, which prevents reordering. The result is that canonical ordering in some cases reorders characters such that syllables are rendered incorrectly, and doesn’t always treat as equivalent syllables that look the same.
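Both problems are visible in the Unicode character data, for example through Python’s unicodedata module:

```python
import unicodedata

# Thai SARA AM (◌ำ, U+0E33) has only a compatibility decomposition,
# so canonical normalization cannot unify the two ways of writing
# NIKHAHIT-above plus SARA AA after a consonant:
assert unicodedata.decomposition("\u0E33").startswith("<compat>")
assert (unicodedata.normalize("NFC", "\u0E19\u0E33") !=
        unicodedata.normalize("NFC", "\u0E19\u0E4D\u0E32"))

# Thai marks carry "fixed position" ccc values instead of the generic
# below-base (220) / above-base (230) classes:
assert unicodedata.combining("\u0E38") == 103  # SARA U, below the base
assert unicodedata.combining("\u0E48") == 107  # MAI EK, above the base
```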

The sections of the Unicode Standard on both Thai and Lao discuss some workarounds; another proposal with more workarounds should be integrated into the standard eventually. In the meantime, the leading OpenType rendering systems have given up on validating Thai text, and Thai font developers have added ad-hoc normalization routines into their fonts to achieve the rendering users expect.

Unfortunately, the incorrect ccc values cannot be fixed because of the Unicode policy of Normalization Stability.

Using phonetic order

When basing the encoding order on phonetic order, consonants and vowels are stored in the order they are pronounced. Virama marks that indicate the absence of inherent vowels of consonants are placed immediately after those consonants. In our initial Khmer example ស្ត្រី strī, the correct encoding order based on pronunciation is ស  ◌្ត  ◌្រ  ◌ី representing s-  -t-  -r-  -ī.

This general idea leaves a number of issues unspecified.

First, as we saw earlier, it’s not clear in which order the components of multi-part vowels, such as the Tamil vowel ◌ொ o, should be stored.

Second, it doesn’t clarify what to do with other characters whose position in speech is not so well-defined, such as nuktas, tone marks, consonant shifters, syllable modifiers, or ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.

Here the Unicode Standard fails almost entirely. Only for a few scripts does it try to fully specify the orderings of syllable components, and only for Cham and Lepcha do they look complete and reasonable. For Khmer, it provides a regular expression that contradicts other parts of the same section, and has various other flaws. For Myanmar, it covers only modern Burmese and then refers to a comprehensive Unicode Technical Note on Myanmar, which however is no longer up to date, and to Microsoft documentation that partially conflicts with the UTN and is no longer up to date either. For Sundanese, information about Old Sundanese is missing.

Third, there can be cases where two or more strings look the same that are pronounced differently, and therefore would be encoded differently when using a purely phonetic order. For such cases, it needs to be decided whether only a single encoding is valid, or whether a disambiguating rendering is required, or whether search operations should treat the different sequences as equal. Sometimes language-specific requirements come into play. Our initial Khmer example is one such case: ស្ត្រី represents strī, ស្រ្តី represents srtī, and ស្រី្ត represents srīt. In the modern Khmer language, only the first pronunciation is valid; in this language, the medial consonant ◌្រ -r- must come after any other medial consonant, and final consonants are written as separate orthographic syllables. However, in the Tampuan language ◌្រ -r- comes before ◌្វ -v-, and in Middle Khmer, an older form of the Khmer language, final consonants can be written by using the same marks as for medial consonants, but after the vowel. For Tampuan, some fonts distinguish the different orders of medial consonants. For Middle Khmer, the use of the mark of a medial consonant as a final consonant can only sometimes be visually distinguished from its use as a medial consonant; in most cases, the reader has to derive the correct pronunciation from the context.

Finally, phonetic order raises the question of how to prevent interference with Unicode normalization. Such interference can come in two ways: through incorrect ccc values, which let the canonical ordering algorithm reorder characters in unwanted ways, and through inappropriate canonical decomposition mappings.

Neither incorrect ccc values nor inappropriate canonical decomposition mappings can ever be fixed, because of the Unicode policy of Normalization Stability.

Indic data

The Unicode Standard provides some data that, outside the standard, has become useful in defining encoding orders: The Unicode character properties Indic_Syllabic_Category (InSC) and Indic_Positional_Category (InPC), collectively the “Indic data”. This data covers the Brahmic scripts encoded in the Unicode Standard as well as the structurally similar Kharoshthi script. The syllabic category specifies the function each character normally has within an orthographic syllable, while the positional category specifies for each combining mark how it is normally positioned relative to its base character. For example, U+17BE ◌ើ KHMER VOWEL SIGN OE has InSC=Vowel_Dependent and InPC=Top_And_Left. The documentation for the properties has room for improvement, and some categories are poorly named – “virama” here means the shapeshifters, while the characters representing real-world visible virama marks are designated as “pure killers”. But the main issue is that the Unicode Standard doesn’t actually use Indic data to define an encoding order. The next section describes the only known specification that does.


Defining encoding order outside of the Unicode Standard

There have been a number of attempts outside the Unicode Standard to specify the encoding orders of Brahmic scripts. The most influential ones are those of the OpenType shaping engines developed by Microsoft, which are – with variations – implemented in all major OpenType rendering systems. Another interesting effort is that by the Internet Corporation for Assigned Names and Numbers (ICANN), whose goal is to prevent spoofing of domain names. Finally, researchers at SIL Global and others have worked to define encoding orders for the most complex Brahmic scripts, including Myanmar and Khmer, as part of extended script documentation. We’ll take a look at these efforts and their relationships to the Unicode Standard.

OpenType shaping engines

Shaping engines are the components of OpenType rendering systems that take an initial sequence of representative glyphs for the characters in the text to be rendered and transform it into a two-dimensional arrangement of glyphs by applying glyph reordering, substitution, and positioning operations. There’s one relatively simple shaping engine for all simple scripts (strangely called the “standard” scripts), a variety of single-script shaping engines for more complicated scripts, and finally the “universal” shaping engine that attempts to handle all scripts that weren’t already handled by some other shaping engine. Each engine is described in its own document.

Among Brahmic scripts, Bengali (Bangla), Devanagari, Gujarati, Gurmukhi, Kannada, Khmer, Lao, Malayalam, Myanmar, Oriya (Odia), Tamil, Telugu, and Thai are currently described as being handled by dedicated single-script engines, the engine for the New Tai Lue script is unspecified, and all others are handled by the Universal Shaping Engine (USE).

One key feature of the shaping engines for Brahmic scripts is validation: they’re documented to check for invalid character sequences and to insert the character U+25CC DOTTED CIRCLE before any combining mark found in an invalid position (base characters are valid anywhere). In most engine documentation, what’s valid is described by regular expressions, thus defining an encoding order for each script. The USE uses several properties of the Unicode Character Database, such as the general category and the Indic syllabic and positional categories, to classify characters for the regular expressions; the documentation of the other engines sometimes uses explicit lists of characters, and sometimes relies on the intuition of the reader.
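To illustrate the mechanism (a toy sketch only – the real engines use far more character classes and rules), here is a grossly simplified Khmer “syllable” pattern of my own that inserts U+25CC before characters that don’t fit it:

```python
import re

# Toy pattern, NOT the actual USE rules: a base consonant or independent
# vowel, any number of coeng + consonant pairs, at most one dependent vowel.
KHMER_SYLLABLE = re.compile(
    r"[\u1780-\u17B3](\u17D2[\u1780-\u17B3])*[\u17B6-\u17C5]?")

def insert_dotted_circles(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        m = KHMER_SYLLABLE.match(text, i)
        if m:
            out.append(m.group())
            i = m.end()
        else:
            # Character in invalid position: prefix a dotted circle.
            # (Real engines do this only for combining marks.)
            out.append("\u25CC" + text[i])
            i += 1
    return "".join(out)

# ស ◌្រ ◌ី ◌្ត: the coeng + consonant after the vowel is flagged...
assert "\u25CC" in insert_dotted_circles("\u179F\u17D2\u179A\u17B8\u17D2\u178F")
# ...while ស ◌្ត ◌្រ ◌ី parses as one valid syllable.
assert (insert_dotted_circles("\u179F\u17D2\u178F\u17D2\u179A\u17B8")
        == "\u179F\u17D2\u178F\u17D2\u179A\u17B8")
```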

Unfortunately, the documentation of most of these shaping engines has always been imprecise, and was abandoned at least a decade ago. Technical bugs have not been fixed, changes in the implementations not reflected, many characters not classified, inconsistencies with the Unicode Standard and other script documentation (such as Representing Myanmar in Unicode) not resolved, and imprecise statements not clarified. In the meantime, the major implementations have diverged substantially, as documented for Devanagari and Khmer.

Nevertheless, the best available specification of encoding orders for many Brahmic scripts is currently provided by the documentation of the OpenType Universal Shaping Engine (USE). This specification classifies characters based on several Unicode properties, including the Indic data mentioned above, and provides a regular expression describing a generic syllable structure for Brahmic scripts. As the encoding order resulting from this generic syllable structure and the property values provided by the Unicode Standard doesn’t always allow all syllables that can be found in real texts, the USE overrides the Indic data where necessary. Deriving the specific syllable structure for any particular script takes a bit of effort, so Encoding orders of Brahmic scripts provides the complete set as of Unicode 16.

The USE specification is good, but not free of issues:

As far as I know, Microsoft has never brought a proposal to the Unicode Consortium that would have specified the encoding orders of Brahmic scripts in the Unicode Standard based on the validation rules of their OpenType shaping engines.

Domain names

ICANN has defined label generation rulesets that restrict the set of valid top-level (root zone) and second-level domain names in various scripts in order to prevent spoofing of such domain names. The rulesets use a custom XML schema defined in RFC 7940: Representing Label Generation Rulesets Using XML. Unlike regular expressions, these rulesets can identify variants of a given name, that is, other names that could be confused with the given one. Rather than declaring one variant valid and all others invalid, this makes it possible to register the first variant applied for and then block registration of all others.

Interestingly, the rules for top-level domain names (which ICANN controls) are not the same as the rules for second-level names (which are only recommendations for registrars). For example, the rules for second-level names for Khmer allow only one of the encodings of ស្ត្រី, while the rules for top-level names allow two of them. (The .com registrars seem to ignore the rules entirely.)

The ICANN rulesets prioritize security above all: they support only major languages, prohibit control characters that are only needed to control rendering, and rule out character sequences that could create confusion between different domain names. They are therefore primarily of interest to other areas that have to prioritize security over expressiveness, such as programming language identifiers. They wouldn’t be useful as a general specification of encoding orders for the Unicode Standard, whose goal is that “everyone in the world should be able to use their own language on phones and computers”. On the other hand, well-defined encoding orders in the Unicode Standard would very likely have simplified the development of the ICANN rulesets.

Extended script documentation

Several people have written documents with information about specific scripts that goes beyond the short introductions in the Unicode Standard. Researchers at SIL Global have been particularly active in this regard.

Some of these documents include descriptions of encoding orders and have been published as Unicode Technical Notes (UTNs), for Myanmar, Javanese, Kawi, and most recently Khmer. The problem with UTNs is that they are not reviewed by the Unicode Technical Committee (UTC) and not normative. They’re basically personal opinions, and readers have to judge for themselves whether they find them trustworthy, and to what extent they want to follow them when implementing fonts, keyboards, and other software for these scripts.

The UTN on Khmer is an outcome of a project started around 2020 to resolve the serious issues in the encoding of Khmer identified in Spoof-Vulnerable Rendering in Khmer Unicode Implementations and Issues in Khmer syllable validation. Contributors were SIL Global researchers Makara Sok, Martin Hosken, Diethelm Kanjahn, and Marc Durdin, as well as me. This project exemplifies the amount of work that is necessary to define the encoding order of a complex Brahmic script:

Despite all this work, experts in the Unicode Script Ad Hoc (now Script Encoding Working Group) did not like the outcome. Their most important comment in Recommendations to UTC #174 on the 2022 precursor document of the UTN: “Most experts agreed that we want to make sure existing Khmer text should render as it has been, instead of dotted circles starting to appear suddenly.” Let’s consider why.


The dotted circle problem

Using dotted circles for validation as part of font rendering has a major problem: Most of the time they’re just a nuisance. That’s because most of the time users are reading, not editing text. When readers see dotted circles in the text they read, there’s nothing they can do about it. Dotted circles are only helpful while an author is editing text (and maybe while reading in a security sensitive context). Theoretically one might imagine showing dotted circles only during editing, but that would require that font rendering systems (and possibly fonts) know when they’re being used for editing and when not. They don’t.

As the primary job of a font rendering system is to make text readable, not to act as an encoding validator, the trend has been to reduce dotted circle insertion, especially when resolving differences between rendering systems on different platforms. In fact, this did happen for ស្ត្រី: In January 2019, the font rendering system on Apple platforms inserted dotted circles into two of the three character sequences; now it doesn’t do that anymore. Shaping engine implementations for Thai seem to have stopped validating entirely.

Making font rendering systems responsible for validation may have been a good idea back when keyboards were generally dumb and a single font rendering system controlled rendering on the vast majority of all computer screens. It may not be the best solution today.


A new type of normalization?

How normalization can help

Let’s look again at the failures in rendering and search discussed above:

Consumer  Rendering  Search
Text received  ம  ◌ா  ◌ெ  ழ  ◌ி  ស  ◌្ត  ◌្រ  ◌ី  is it equal?  ស  ◌្រ  ◌្ត  ◌ី
Result  மொழி with dotted circle  ស្ត្រី ≠ ស្រ្តី
visible failure  silent failure

Instead of highlighting invalid encoding orders in rendering, and failing to find matches in search, a better solution may be to use a new form of normalization. Instead of separating “correct” and “incorrect” encoding orders, normalization selects one preferred encoding order among several “equivalent” or “similar” ones, and maps from equivalent or similar encodings to the preferred one.

Normalization is particularly useful in processes interpreting text (“consumers”), as it immediately yields better results.

Consumer  Rendering  Search
Text received ம  ◌ா  ◌ெ  ழ  ◌ி ស  ◌្ត  ◌្រ  ◌ី  is it equal?  ស  ◌្រ  ◌្ត  ◌ី
normalization
Text normalized ம  ◌ெ  ◌ா  ழ  ◌ி ស  ◌្ត  ◌្រ  ◌ី  is it equal?  ស  ◌្ត  ◌្រ  ◌ី
Result மொழி ស្ត្រី = ស្ត្រី
success success
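To make the mapping step concrete, here is a toy normalizer in Python. It handles only single clusters of the shape seen in the ស្ត្រី example, and it assumes, purely for illustration, the preferred order shown in the tables above: base consonant, then subscript (coeng) consonants with COENG RO last, then dependent vowels. A real Khmer normalizer would need the full cluster grammar described in UTN 61.

```python
COENG = "\u17D2"  # KHMER SIGN COENG
RO = "\u179A"     # KHMER LETTER RO

def is_vowel_sign(ch):
    # Khmer dependent vowel signs U+17B6..U+17C5
    return "\u17B6" <= ch <= "\u17C5"

def normalize_cluster(cluster):
    """Reorder one orthographic syllable into the preferred encoding order."""
    base, coengs, vowels = cluster[0], [], []
    i = 1
    while i < len(cluster):
        ch = cluster[i]
        if ch == COENG and i + 1 < len(cluster):
            coengs.append(cluster[i:i + 2])  # COENG + subscript consonant
            i += 2
        elif is_vowel_sign(ch):
            vowels.append(ch)
            i += 1
        else:
            return cluster  # shape not handled by this sketch
    # Stable sort: subscripts keep their order, except COENG RO moves last.
    coengs.sort(key=lambda c: c[1] == RO)
    return base + "".join(coengs) + "".join(vowels)
```

All three encodings of “woman” from the beginning of this article map to the same preferred sequence ស ◌្ត ◌្រ ◌ី, so a search that normalizes both query and document would find them all.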

However, normalization should also be used in processes generating text (“producers”), such as smart keyboards or speech input systems. The main benefit is that consumers that haven’t yet been upgraded to use the new normalization still receive well-ordered text. But it also improves performance in consumers that have been upgraded: it’s cheaper to check whether text is already normalized than to normalize it.
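Existing Unicode normalization already exposes this asymmetry: since Python 3.8, the unicodedata module offers a quick check that can scan once and bail out early, while normalize always does the full pass and rebuilds the string. A Brahmic normalization form would presumably want an analogous verification entry point.

```python
import unicodedata

s = "\u0BAE\u0BCA\u0BB4\u0BBF"  # மொழி, already in composed (NFC) form

# Verification: a single pass that can return as soon as a
# non-normalized character is found.
assert unicodedata.is_normalized("NFC", s)

# Normalization: always allocates and rebuilds the string,
# even when the input was already normalized.
assert unicodedata.normalize("NFC", s) == s
```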

Producer          Smart keyboard
Text input        ம  ◌ா  ◌ெ  ழ  ◌ி                ស  ◌្ត  ◌្រ  ◌ី  is it equal?  ស  ◌្រ  ◌្ត  ◌ី
                  ↓ normalization
Text normalized   ம  ◌ெ  ◌ா  ழ  ◌ி                ស  ◌្ត  ◌្រ  ◌ី  is it equal?  ស  ◌្ត  ◌្រ  ◌ី
                  ↓ transmission
Consumer          Rendering                       Search
Text received     ம  ◌ெ  ◌ா  ழ  ◌ி                ស  ◌្ត  ◌្រ  ◌ី  is it equal?  ស  ◌្ត  ◌្រ  ◌ី
                  ↓ normalization verification
Result            மொழி                            ស្ត្រី = ស្ត្រី
                  success                         success

Requirements for Brahmic normalization

Normalization for Brahmic scripts would have to differ from existing Unicode normalization forms in several ways:

However, normalization for Brahmic scripts cannot ignore Unicode normalization in the way the specifications for OpenType shaping engines do, as that could lead to incompatible results depending on whether text has gone through Unicode normalization already. Instead, it will need to perform Unicode normalization as its first step and then correct the results, avoiding the introduction of unnecessary differences such as reversing the order of above- and below-base marks.
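The Tamil example shows why Unicode normalization alone doesn’t suffice. The two-part vowel sign ◌ொ canonically decomposes into ◌ெ (U+0BC6) followed by ◌ா (U+0BBE); both marks have combining class 0, so canonical ordering never touches them, and the misordered sequence ◌ா ◌ெ survives NFC unchanged:

```python
import unicodedata

right = "\u0BAE\u0BC6\u0BBE\u0BB4\u0BBF"  # ம ◌ெ ◌ா ழ ◌ி (valid order)
wrong = "\u0BAE\u0BBE\u0BC6\u0BB4\u0BBF"  # ம ◌ா ◌ெ ழ ◌ி (misordered)

# Both marks have combining class 0, so NFC cannot reorder them.
assert unicodedata.combining("\u0BC6") == 0
assert unicodedata.combining("\u0BBE") == 0

# The valid order composes to the single vowel sign U+0BCA...
assert unicodedata.normalize("NFC", right) == "\u0BAE\u0BCA\u0BB4\u0BBF"
# ...while the misordered sequence passes through NFC untouched,
# so the two strings remain unequal after Unicode normalization.
assert unicodedata.normalize("NFC", wrong) == wrong
```

A Brahmic normalization would have to add a script-specific reordering step after this, mapping the misordered sequence to the valid one.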

Using normalization that extends beyond the current canonical equivalence was proposed before by Lawrence Wolf-Sonkin. His proposal, however, seems limited to applying the variety of equivalences that at the time were only informally documented in the Unicode Standard, and that, starting with Unicode 16, are also collected in the file DoNotEmit.txt. Normalizing encoding order for Brahmic scripts would go beyond this and require more work, because encoding order is currently largely undocumented.

The Arabic Mark Transient Reordering Algorithm (AMTRA) specified in Unicode Arabic Mark Rendering, an annex to the Unicode Standard, is another example of correcting the encoding order that may result from Unicode normalization. Rendering Arabic marks in the way implied by their ccc values can lead to them being stacked incorrectly. AMTRA reorders them such that marks are positioned the way users expect.

Not a simple fix

Developing a new standardized normalization for Brahmic scripts is not easier than fully specifying encoding orders. It still requires collecting information about the requirements of all the different languages that use a script. For many scripts it may turn out that two or more different normalization variants are needed, as discussed above.

If and when the owners of search engines realize that languages such as Khmer are worth supporting well, it’s very likely that they will develop an algorithmic solution, and not try to encourage the owners of web sites to convert text to a standardized encoding order as the UTN on Khmer proposes. Normalization is the obvious solution. The big question then is whether each search engine will get its own normalization, or whether the owners will collaborate and develop a standard encoding order and normalization. Standardization would have the advantage that the same normalized encoding order could also be used by text producers, whether smart keyboards or speech input systems, leading to better results overall.

Unicode may be in a situation similar to the one HTML was in 21 years ago, before the WHATWG was founded. While HTML at the time had a well-defined syntax, and the W3C was promoting XHTML as a stricter variant, many web pages used incorrect syntax, “tag soup”. Browsers tried to make sense of tag soup and render it, but each in its own way. Then the WHATWG was formed to develop the HTML5 standard, part of which standardizes the interpretation of tag soup. This was a big multi-year, multi-company project, but it paid off.

Acknowledgments

I’d like to thank Martin Hosken, Makara Sok, Didi Kanjahn, and Marc Durdin for many discussions in the Khmer encoding project that helped me better understand the issues around encoding order as well as possible solutions. I’d also like to thank Alolita Sharma, Andrew Glass, Aditya Bayu Perdana, Behdad Esfahbod, Ben Mitchell, Deborah Anderson, Lee Collins, Liang Hai, Maaike Langerak, Makara Sok, Marc Durdin, Martin Hosken, Muthu Nedumaran, Peter Constable, Peter Lofting, and Richard Ishida for providing information for this document or feedback on drafts of it.

References

Deborah Anderson, Ken Whistler, Roozbeh Pournader, and Peter Constable: Recommendations to UTC #174 January 2023 on Script Proposals. UTC document L2/23-012. The Unicode Consortium, 2023.

Peter Constable: Canonical Ordering of Marks in Thai Script. UTC document L2/18-216. The Unicode Consortium, 2018.

Kim Davies, Asmus Freytag: RFC 7940: Representing Label Generation Rulesets Using XML. Internet Engineering Task Force, 2016.

Mark Davis et al.: Unicode Locale Data Markup Language (LDML). Version 47. Unicode Technical Standard 35. The Unicode Consortium, 2025. See in particular Steven Loomis et al.: Keyboards.

Joshua Horton, Makara Sok, Marc Durdin, Rasmey Ty: Spoof-Vulnerable Rendering in Khmer Unicode Implementations. European Language Resources Association, 2019.

Martin Hosken: Representing Myanmar in Unicode. Details and Examples. Version 4. Unicode Technical Note 11. The Unicode Consortium, 2012.

Martin Hosken: Khmer Encoding Structure (Nov 2022). Contributors: Makara Sok, Norbert Lindenberg. UTC document L2/22-290. The Unicode Consortium, 2022.

Martin Hosken: Khmer Encoding Structure. Contributors: Makara Sok, Norbert Lindenberg. Unicode Technical Note 61, Version 2. The Unicode Consortium, 2025.

Internet Corporation for Assigned Names and Numbers: Root Zone Label Generation Rules. ICANN, 2022. See in particular Root Zone LGR for script: Khmer (Khmr).

Internet Corporation for Assigned Names and Numbers: Second-Level Reference Label Generation Rules. ICANN, 2024. See in particular Reference LGR for script: Khmer (Khmr).

Robin Leroy, Mark Davis: Unicode Source Code Handling. Unicode Technical Standard 55. The Unicode Consortium, 2024.

Norbert Lindenberg: Issues in Khmer syllable validation. Lindenberg Software LLC, 2019.

Norbert Lindenberg: Issues in Devanagari cluster validation. Lindenberg Software LLC, 2020.

Norbert Lindenberg et al.: Unilateral change to USE cluster model. HarfBuzz issue 3498. GitHub, 2022.

Norbert Lindenberg: Implementing Javanese. Unicode Technical Note 47. The Unicode Consortium, 2022.

Norbert Lindenberg: Implementing Kawi. Unicode Technical Note 48. The Unicode Consortium, 2022.

Norbert Lindenberg: Encoding orders of Brahmic scripts. Lontar GmbH, 2024.

Norbert Lindenberg: Introduction to Brahmic scripts. Lontar GmbH, 2025.

Lontar GmbH: Keyboards and fonts for iPhone and iPad. Lontar GmbH, 2014-2024.

Microsoft Corporation: Creating and supporting OpenType fonts for the Universal Shaping Engine. Microsoft Corporation, dated 2024-09-15. See also shaping engine documentation for Bengali (Bangla), Devanagari, Gujarati, Gurmukhi, Kannada, Khmer, Lao, Malayalam, Myanmar, Oriya (Odia), Tamil, Telugu, and Thai.

Roozbeh Pournader, Bob Hallissy, Lorna Evans: Unicode Arabic Mark Rendering. Unicode Standard Annex 53. The Unicode Consortium, 2024.

SIL Global: Khmer Angkor. SIL Global, 2025.

Makara Sok: Phonological Principles and Automatic Phonemic and Phonetic Transcription of Khmer Words. Payap University, 2016.

Makara Sok: Khmer Character Specification/Usages. Version of 2024-08-02. GitHub, 2024.

Richard Wordingham: Decomposed Sinhala Vowels. Microsoft Typography Issues, issue 905. GitHub, 2022.

The Unicode Consortium: Topical Document List: Tai Tham. Accessed 2025-05-22.

The Unicode Consortium: Normalization Stability. The Unicode Consortium, dated 2024-01-09.

The Unicode Consortium: The Unicode Standard, Version 16.0.0. The Unicode Consortium, 2024. See in particular Core Specification.

Lawrence Wolf-Sonkin: Beyond Canonical Equivalence: A discussion on a way forward. UTC document L2/23-056. The Unicode Consortium, 2023.