Constructing fonts for the Batak script

Norbert Lindenberg
October 14, 2020



Until the early 20th century, the Batak script was used to write several related languages spoken in North Sumatra. Five variants are generally distinguished: The northern variants Karo and Pakpak, and the southern variants Simalungun, Toba, and Mandailing (Kozok 1996, Kozok 2009a, Everson and Kozok 2008, Unicode 2020).

The Batak script is an abugida: At its core is a set of consonants with the inherent vowel a, which can be modified with a consonant modifier, vowel marks, final consonant marks, and virama signs. Compared to other Brahmic scripts, it has no conjuncts, no repha forms, and no pre-base marks. The final consonants of phonetic syllables can be represented in two ways: The consonants ng and h, corresponding to anusvara and visarga, are represented as above-base marks ◌ᯰ and ◌ᯱ, which are positioned on top of a spacing vowel mark if there is one. Other final consonants are represented by a consonant letter followed by a virama, and in this case the consonant is placed before any vowel mark in the syllable: The syllable consisting of ta, ◌ᯪ i, pa, ◌᯲ virama has to be displayed as ᯖᯪᯇ᯲ tip. There are a maximum of two rows of above-base and one row of below-base marks. The positioning of above-base marks relative to a base consonant is semantically significant. The encoding of Batak in Unicode follows phonetic order, requiring reordering in the case of final consonants, and encodes most language-dependent glyph variants as separate code points.

This document provides information on the required character sets for the different variants, the structure of valid clusters and how to validate them, the steps needed to bring a sequence of nominal glyphs representing a Unicode character sequence into the corresponding sequence of glyphs for correct rendering, commonly used ligatures, glyph variants, and the positioning information required for the glyphs. It relies on the OpenType rendering system, where Batak is supported through the Universal Shaping Engine (USE, Microsoft 2020), which has been available since Android 7, iOS 10, Windows 10, and macOS 10.12. The document assumes knowledge equivalent to Nedumaran and Lindenberg 2017. To create an actual font for Batak, it needs to be complemented with research and design of Batak letter shapes. The Batak font used in this document is based on the fonts for Karo, Mandailing, Pakpak, Simalungun, and Variants 1.1 for Windows in Kozok 2009b, but has been reengineered for Unicode and the USE.

Script and language tags

The ISO script tag for Batak is Batk. The ISO language tag for Batak in general is btk; the ones for the variants mentioned above are Karo btx, Pakpak btd, Simalungun bts, Toba bbc, and Mandailing btm. When forming BCP 47 language tags with these codes, the script tag may need to be included, as the BCP 47 language tag registry doesn’t define a default script for these languages (if it did, it would more likely be Latin than Batak).

The OpenType script tag for Batak is batk. The only registered OpenType language tags for Batak and its variants are Simalungun BTS and Toba BBC. The HarfBuzz shaping system has a fallback mechanism that maps three-letter language strings to their uppercase equivalents as pseudo-OpenType language tags, including Batak BTK, Karo BTX, Pakpak BTD, and Mandailing BTM. An update to the OpenType language tag registry to add more Batak language tags is in progress.
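
The tag conventions above can be illustrated with a small Python sketch. It is not HarfBuzz code: both helper names are hypothetical, and padding the pseudo-tag with a trailing space to the four-character OpenType tag width is an assumption about how such a tag would be written out.

```python
def bcp47_tag(iso_language, script="Batk"):
    """Join an ISO language code and an ISO script code, e.g. 'btx-Batk'."""
    return f"{iso_language.lower()}-{script.title()}"


def pseudo_opentype_language_tag(iso_language):
    """Uppercase a three-letter code and pad it to four characters (assumed width)."""
    return iso_language.upper().ljust(4)


# Language codes as listed in this section.
BATAK_LANGUAGES = [("Karo", "btx"), ("Pakpak", "btd"), ("Simalungun", "bts"),
                   ("Toba", "bbc"), ("Mandailing", "btm")]

for name, code in BATAK_LANGUAGES:
    print(name, bcp47_tag(code), repr(pseudo_opentype_language_tag(code)))
```

For Simalungun and Toba the result coincides with the registered tags BTS and BBC; for the other languages it is only the fallback form described above.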


The table below shows the Unicode characters relevant to a Batak font, with Unicode code point, Unicode name, a proposed glyph name, and representative glyphs in the columns of the languages that use the character. It indicates cases where a character is pronounced differently from the consonant or vowel used for the Unicode name. A glyph in parentheses means that this character is less commonly used for this language.

Code Point | Unicode Name | Glyph Name | Karo, Pakpak, Simalungun, Toba, Mandailing
Unicode character names in this group have the prefix BATAK LETTER.
Glyph names in this group have the suffix “-batak”.
1BC0 | A | a | a, ha a, ha
1BC2 | HA | ha | ka ka ha, ka( ha, ka)
1BC3 | SIMALUNGUN HA | haSima | ha, ka ( ha, ka)
1BC4 | MANDAILING HA | haMandai | ha, ka
1BC5 | BA | ba | mba
1BD8 | SA | sa | sa, ca()
1BE0 | NYA | nya | ca
Unicode character names in this group have the prefix BATAK.
Glyph names in this group have the suffix “-batak”.
1BE6 | SIGN TOMPI | tompi | ◌᯦
1BE7 | VOWEL SIGN E | eSign | ◌ᯧ(◌ᯧ)
1BE9 | VOWEL SIGN EE | eeSign | ◌ᯩ◌ᯩ◌ᯩ◌ᯩ◌ᯩ
1BEA | VOWEL SIGN I | iSign | ◌ᯪ◌ᯪ ◌ᯪ◌ᯪ
1BEC | VOWEL SIGN O | oSign | ◌ᯬ u◌ᯬ◌ᯬ◌ᯬ◌ᯬ
1BED | VOWEL SIGN KARO O | oSignKaro | ◌ᯭ ◌ᯭ ou
1BEE | VOWEL SIGN U | uSign | ◌ᯮ◌ᯮ◌ᯮ◌ᯮ
1BF0 | CONSONANT SIGN NG | ngSign | ◌ᯰ◌ᯰ◌ᯰ◌ᯰ◌ᯰ
1BF2 | PANGOLAT | pangolat | ◌᯲ ◌᯲◌᯲
1BF3 | PANONGONAN | panongonan | ◌᯳ ◌᯳
Unicode character names in this group have the prefix BATAK SYMBOL.
Glyph names in this group have the suffix “-batak”.
1BFF | BINDU PANGOLAT | binduPangolat | ᯿᯿᯿᯿᯿
These non-Batak characters should be provided in a Batak font.

A few notes on some of these characters:

Cluster validation

The Unicode Standard describes the Batak syllable structure as C(V)(Cs|Cd): a consonant, followed by an optional vowel sign, which may be followed either by a consonant sign Cs (-ng or -h) or a killed final consonant Cd. It also specifies that text is stored in “logical order”, by which it means phonetic order, and that the virama pangolat can’t follow a dependent vowel, by which it means in stored order. This description isn’t complete, as it doesn’t account for the consonant modifier tompi, and doesn’t say whether the virama panongonan can follow a dependent vowel (it can’t).
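
The structure described above, extended with the tompi, can be sketched as a regular expression over code points in stored order. This is a minimal Python sketch, not part of any shaping engine. The helper name is_valid_batak is hypothetical, the placement of the optional tompi directly after a consonant is an assumption based on its role as a consonant modifier, and the independent vowels U+1BE4 and U+1BE5 are left out for brevity.

```python
import re

CONSONANT = "[\u1BC0-\u1BE3]"    # consonant letters
TOMPI = "\u1BE6"                 # consonant modifier (assumed to follow its consonant)
VOWEL = "[\u1BE7-\u1BEF]"        # dependent vowel signs
FINAL_SIGN = "[\u1BF0\u1BF1]"    # Cs: consonant signs -ng and -h
VIRAMA = "[\u1BF2\u1BF3]"        # pangolat and panongonan

# C(V)(Cs|Cd), where Cd is a consonant followed by a virama.
SYLLABLE = (f"{CONSONANT}{TOMPI}?{VOWEL}?"
            f"(?:{FINAL_SIGN}|{CONSONANT}{TOMPI}?{VIRAMA})?")
TEXT = re.compile(f"(?:{SYLLABLE})+")


def is_valid_batak(text):
    """Return True if text consists only of syllables of the form above."""
    return TEXT.fullmatch(text) is not None


# ta, i, pa, pangolat: the stored order for "tip".
print(is_valid_batak("\u1BD6\u1BEA\u1BC7\u1BF2"))
# ta, pa, i, pangolat: the virama follows a dependent vowel.
print(is_valid_batak("\u1BD6\u1BC7\u1BEA\u1BF2"))
```

A check like this is stricter than the validation performed by the USE, so it is better suited to testing text than to predicting what a shaping engine will accept.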

The Universal Shaping Engine (USE) relies on Unicode character properties to classify characters.

Code points | Characters | General category | Canonical combining class | Indic syllabic category | Indic positional category | USE subclass
1BFC..1BFF | ᯼ ᯽ ᯾ ᯿ | Po | 0 | Other | NA | BASE_IND
0020 | (space) | Zs | 0 | Other | NA | OTHER
00A0 | (no-break space) | Zs | 0 | Consonant_Placeholder | NA | BASE_OTHER
1BC0..1BE3 | ᯀ ᯁ ᯂ ᯃ ᯄ ᯅ ᯆ ᯇ ᯈ ᯉ ᯊ ᯋ ᯌ ᯍ ᯎ ᯏ ᯐ ᯑ ᯒ ᯓ ᯔ ᯕ ᯖ ᯗ ᯘ ᯙ ᯚ ᯛ ᯜ ᯝ ᯞ ᯟ ᯠ ᯡ ᯢ ᯣ | Lo | 0 | Consonant | NA | BASE
1BE4..1BE5 | ᯤ ᯥ | Lo | 0 | Vowel_Independent | NA | BASE
1BE8..1BE9, 1BED, 1BEF | ◌ᯨ ◌ᯩ ◌ᯭ ◌ᯯ | Mn | 0 | Vowel_Dependent | Top | VOWEL_ABOVE
1BE7, 1BEA..1BEC, 1BEE | ◌ᯧ ◌ᯪ ◌ᯫ ◌ᯬ ◌ᯮ | Mc | 0 | Vowel_Dependent | Right | VOWEL_POST
1BF2..1BF3 | ◌᯲ ◌᯳ | Mc | 9 | Pure_Killer | Right | VOWEL_POST
1BF0..1BF1 | ◌ᯰ ◌ᯱ | Mn | 0 | Consonant_Final | Top | CONS_FINAL_ABOVE

The classification of U+1BEE ◌ᯮ is likely wrong – it is usually positioned below the base, and should therefore have the general category Mn and the Indic positional category Bottom, and the USE subclass VOWEL_BELOW. A proposal is being prepared to correct this, and the following text assumes corrected data. You don’t need to worry about this, as the change can only affect incorrect text – in correct text at most one vowel or virama can be attached to each consonant.
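
Two of the properties used here, general category and canonical combining class, can be inspected with Python's standard unicodedata module. This is a minimal sketch with a hypothetical helper name. The values printed for the general category depend on the Unicode version bundled with the Python build in use, and the Indic categories and USE subclasses are not available through this module.

```python
import unicodedata


def describe(code_point):
    """Return (name, general category, canonical combining class)."""
    ch = chr(code_point)
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.combining(ch))


# Tompi, vowel sign u, consonant sign ng, and the two viramas.
for cp in (0x1BE6, 0x1BEE, 0x1BF0, 0x1BF2, 0x1BF3):
    name, category, ccc = describe(cp)
    print(f"U+{cp:04X} {name}: gc={category} ccc={ccc}")
```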

Based on the USE subclasses identified above, but assuming corrected data for U+1BEE, the USE defines the following clusters for Batak:

The standard cluster definition is a bit difficult to compare to the description in the Unicode Standard, but the key differences are:

  1. If a syllable, as defined in the Unicode Standard, includes a final consonant group Cd, that is, a consonant followed by a virama, then the USE treats that final consonant as a separate cluster. This will impact the implementation of the required reordering of the consonant and a preceding dependent vowel, as discussed below.
  2. As the Batak virama signs are classified as VOWEL_POST, and the USE allows multiple dependent vowels of any kind, the USE does not enforce the rule that a virama can’t follow a dependent vowel.
  3. The USE allows multiple consonant modifiers, multiple vowels, and multiple final consonant signs within a cluster, and, because they’re separate clusters, multiple final consonant groups. A Batak syllable as described in the Unicode Standard allows only one of each, and only either a final consonant sign or a final consonant group.

Items 2 and 3 mean that the validation performed by the USE is partial at best, and a font should make potential problems visible. Some of the problems in item 3 are immediately visible, for example, if an author writes multiple final consonant groups. Others, such as repeated below-base vowels, can be made visible through mark-to-mark positioning, as described in Glyph positioning below.

The one issue a font must deal with is item 2: If a virama follows a dependent vowel, a dotted circle should be inserted between them. If that’s not done, authors may write a syllable such as ᯖᯪᯇ᯲ as a sequence of consonant, consonant, dependent vowel, virama and see what looks like the correct visual sequence, which is however based on an incorrect logical sequence. (Starting with Corbett 2017 a hack was added to USE implementations to insert the dotted circle automatically, but older implementations don’t do that.)

A font can implement the insertion of the dotted circle as follows:

# Nested lookup: puts a dotted circle in front of a virama.
lookup insert_dotted_circle_batak {
    lookupflag 0;
    sub pangolat-batak by dottedCircle pangolat-batak;
    sub panongonan-batak by dottedCircle panongonan-batak;
} insert_dotted_circle_batak;

feature ccmp {
    lookup validate_batak {
        lookupflag 0;
        # A virama directly after a vowel sign is invalid; make that visible.
        sub @vowelSign_batak @virama_batak' lookup insert_dotted_circle_batak;
    } validate_batak;
} ccmp;
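
The effect of this lookup can be modeled at the character level in Python, which is handy for predicting what a test string should look like. This is only a sketch: the font works on glyphs rather than characters, and the function name is hypothetical.

```python
import re

VOWEL_SIGN = "[\u1BE7-\u1BEF]"
VIRAMA = "[\u1BF2\u1BF3]"
DOTTED_CIRCLE = "\u25CC"

# A vowel sign that is immediately followed by a virama.
VOWEL_BEFORE_VIRAMA = re.compile(f"({VOWEL_SIGN})(?={VIRAMA})")


def insert_dotted_circle(text):
    """Insert U+25CC between a vowel sign and a directly following virama."""
    return VOWEL_BEFORE_VIRAMA.sub(lambda m: m.group(1) + DOTTED_CIRCLE, text)


# Incorrect logical order: ta, pa, i, pangolat.
wrong = "\u1BD6\u1BC7\u1BEA\u1BF2"
print([f"U+{ord(c):04X}" for c in insert_dotted_circle(wrong)])
```

Text in the correct stored order (ta, i, pa, pangolat) passes through unchanged, because there the virama follows a consonant.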

USE cluster validation does not take Unicode normalization into consideration. This creates a conflict with the canonical ordering algorithm that’s part of normalization: This algorithm reorders a sequence of a Batak virama, which has canonical combining class 9, followed by a tompi, which has canonical combining class 7, into a tompi-virama sequence, and so the two sequences are considered canonically equivalent in Unicode. The USE, on the other hand, inserts a dotted circle into the first sequence, treating it as decidedly non-equivalent. A font could try to remove such a dotted circle and attach the tompi to the preceding consonant to restore equivalence, but deciding whether a dotted circle was in the original text or was added by cluster validation is tricky. It’s probably safer to punt on this issue and accept that Batak rendering only works correctly on normalized text.
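
The canonical reordering described above can be observed with Python's unicodedata module. The sketch below uses ta as an arbitrary base consonant and normalizes a virama-tompi sequence. The variable names are illustrative only.

```python
import unicodedata

TA, TOMPI, PANGOLAT = "\u1BD6", "\u1BE6", "\u1BF2"

virama_first = TA + PANGOLAT + TOMPI   # combining classes 9, then 7
tompi_first = TA + TOMPI + PANGOLAT    # combining classes 7, then 9

# Canonical ordering sorts adjacent marks by combining class.
normalized = unicodedata.normalize("NFC", virama_first)

print([f"U+{ord(c):04X}" for c in normalized])
print(normalized == tompi_first)
```

Normalizing text this way before it reaches the shaping engine is one way to make sure only the tompi-virama order is ever rendered.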

Glyph reordering

Batak requires none of the reordering features that the USE supports: pre-base vowels, pre-base vowel modifiers, pre-base medial consonants, or repha forms. On the other hand, Batak requires a reordering feature that the USE doesn’t support: A logical sequence of dependent vowel, consonant, and virama has to be reordered into the corresponding consonant-vowel-virama sequence for display. For example, the syllable stored as ta, ◌ᯪ i, pa, ◌᯲ virama has to be displayed as ᯖᯪᯇ᯲ tip.

This reordering can’t be performed in the USE’s feature application I phase, because the USE applies this phase’s features one cluster at a time and treats the consonant-virama sequence as a separate cluster. It has to be performed in the USE’s standard typographic presentation phase, where features are applied to the entire run. Of the features applied in this phase, rclt seems the least inappropriate, and has the advantage that it normally can’t be turned off.

OpenType lacks a reordering substitution that would let us implement this feature directly. Everson and Kozok 2008 propose ligatures as one possible implementation. However, Batak in its Unicode representation has 36 consonants and 9 dependent vowels, so a full set of ligatures would consist of 324 glyphs. Most of them would likely never be used, either because the consonant is not used as a final consonant at all, or because the specific vowel-consonant combination is not used. Unfortunately, there’s no documentation available on which combinations are used and which ones aren’t. This makes ligatures a rather inefficient way to implement this reordering.
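
The glyph count can be checked from the code point ranges in the classification table above. A small Python sketch, with illustrative variable names:

```python
# Code point ranges taken from the classification table.
CONSONANTS = range(0x1BC0, 0x1BE3 + 1)
DEPENDENT_VOWELS = range(0x1BE7, 0x1BEF + 1)

# One ligature per vowel-consonant pair.
ligature_count = len(CONSONANTS) * len(DEPENDENT_VOWELS)

print(len(CONSONANTS), len(DEPENDENT_VOWELS), ligature_count)  # 36 9 324
```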

Instead, we can use contextual substitutions to remove glyphs from where they shouldn’t be, and insert them where they should be. Since there are 36 consonants, 9 dependent vowels, and 2 viramas involved, it’s best to avoid rules that would have to enumerate the consonants. We therefore remove and insert the dependent vowels.

OpenType also lacks a substitution to remove glyphs. Instead, we remove a dependent vowel from its old location by replacing it with an “invisible” glyph that is classified as an OpenType mark, has advance width 0, no outline, and a name that doesn’t map to a Unicode character according to the Adobe Glyph List.

lookup remove_vowel_batak {
    lookupflag 0;
    sub @vowelSign_batak by __.invisible;
} remove_vowel_batak;

The dependent vowel is then inserted before a virama. As this lookup is specific to the vowel, it needs to be repeated for each of the vowel signs.

lookup insert_eSign_batak {
    lookupflag 0;
    sub pangolat-batak by eSign-batak pangolat-batak;
    sub panongonan-batak by eSign-batak panongonan-batak;
} insert_eSign_batak;

We use these lookups nested within a contextual lookup that detects the vowel-consonant-virama sequence, removes the dependent vowel from its pre-consonant location, and inserts it before the virama.

feature rclt {
    lookup reorder_batak {
        lookupflag UseMarkFilteringSet @vowelOrConsonantMark;
        # One rule per vowel sign glyph: blank out the vowel where it is stored,
        # then insert the same vowel before the virama.
        sub eSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_eSign_batak;
        sub eSignPak-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_eSignPak_batak;
        sub eeSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_eeSign_batak;
        sub iSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_iSign_batak;
        sub iSignKaro-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_iSignKaro_batak;
        sub oSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_oSign_batak;
        sub oSignKaro-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_oSignKaro_batak;
        sub uSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_uSign_batak;
        sub uSignSima-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_uSignSima_batak;
    } reorder_batak;
} rclt;
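
To generate expected results for testing a font, the net effect of these lookups can be modeled on character strings in Python. This is a sketch under stated assumptions: the function name is hypothetical, the invisible placeholder glyph is simply dropped, and an optional tompi is assumed to stay with the final consonant.

```python
import re

CONSONANT = "[\u1BC0-\u1BE3]"
TOMPI = "\u1BE6"                 # assumed to stay attached to its consonant
VOWEL_SIGN = "[\u1BE7-\u1BEF]"
VIRAMA = "[\u1BF2\u1BF3]"

# Stored order: dependent vowel, final consonant, virama.
STORED = re.compile(f"({VOWEL_SIGN})({CONSONANT}{TOMPI}?)({VIRAMA})")


def to_display_order(text):
    """Move a dependent vowel behind the final consonant, before the virama."""
    return STORED.sub(lambda m: m.group(2) + m.group(1) + m.group(3), text)


# ta, i, pa, pangolat becomes ta, pa, i, pangolat for display.
stored = "\u1BD6\u1BEA\u1BC7\u1BF2"
print([f"U+{ord(c):04X}" for c in to_display_order(stored)])
```

Syllables without a final consonant group are left untouched, which matches the contextual rules above firing only when a virama is present.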

Language-dependent features

Batak has two language-dependent features. For Karo, ja is usually written in a more symmetric form. For Mandailing, ◌ᯮ uSign is flipped horizontally when attached to the consonant pa: ᯇᯮ.

These features can be enabled in browsers by setting the lang attribute of an element containing the text to the desired language, e.g., lang="btx-Batk". Some word processors also provide a way to specify the language of text. Unfortunately, currently these features work only with browsers and applications based on the HarfBuzz shaping system, because other OpenType implementations don’t yet map the ISO language tags btx and btm to the corresponding OpenType language tags BTX and BTM. Browsers based on HarfBuzz are Firefox, Chrome, and new Edge, except for their versions for iOS, which have to use WebKit and CoreText. Here is what you should be seeing and what your browser actually renders:

Language | Expected | Your browser

The features can’t be implemented in Batak fonts using the usual OpenType feature locl because that feature is applied as part of the USE’s feature application I phase, while we don’t know whether a ◌ᯮ sits under a ᯇ pa until reordering is done. We therefore use the rclt feature tag again.

feature rclt {
    script batk;

    language BTX;
    lookup karo_ja_batak {
        lookupflag 0;
        sub ja-batak by ja-batak.v1;
    } karo_ja_batak;

    language BTM;
    lookup flip_uSign_batak {
        lookupflag UseMarkFilteringSet @below_base_batak;
        sub pa-batak uSign-batak' by uSign-batak.paMandai;
    } flip_uSign_batak;
} rclt;

Ligatures and contextual forms

The below-base dependent vowel ◌ᯮ uSign forms ligatures with several consonants; with several other consonants it changes into a simplified shape. Combinations of this vowel with consonants that occur only in Karo are not used (indicated in gray).

Consonant glyph name | Unligated combination | Correct combination | Combination type
aSima | ᯁᯮ | ᯁᯮ | attachment of simplified glyph
haSima | ᯃᯮ | ᯃᯮ | attachment of simplified glyph
pa | ᯇᯮ | ᯇᯮ, ᯇᯮ | attachment
waSima | ᯌᯮ | ᯌᯮ | attachment of simplified glyph
gaSima | ᯏᯮ | ᯏᯮ | attachment of simplified glyph
raSima | ᯓᯮ | ᯓᯮ | attachment of simplified glyph
maSima | ᯕᯮ | ᯕᯮ | ligature or attachment of simplified glyph

Glyph variants

Several Batak characters can be written with different glyphs. While the additional glyphs described here are not essential for reading or writing Batak, they may be necessary to faithfully transcribe existing documents. OpenType offers two features to support variant glyphs: stylistic sets (ss01–ss20) and character variants (cv01–cv99) (Microsoft 2018). Stylistic sets are intended for cases where variant glyphs are used in logically defined sets that should change together. So far, no such logically defined sets have been identified for Batak. Character variants are intended for cases where variant glyphs are not systematically related, which appears to be the case for Batak.

Character variants (and stylistic sets) are always specific to a particular font. The character variants shown here are derived from Kozok 2009a and implemented in the font used in this article. In some cases only the combination with the vowel sign ◌ᯮ uSign is affected; in other cases that combination is irrelevant (and indicated in gray below) because the variant is only known to be used for Karo, which does not use that vowel.

Variant feature | Consonant | Consonant with vowel u | Variant number | Variant of consonant | Variant of consonant with vowel u

Character variants can be enabled in browsers using the CSS property font-feature-settings, specifying the variant feature and the variant number. For example, to replace the default glyph for ma with its variant number 2, use font-feature-settings: "cv08" 2. Legacy Edge appears to disable the liga feature when font-feature-settings is used, so for compatibility with this browser you may want to explicitly enable it and use: font-feature-settings: "liga", "cv08" 2. Character variants for different characters can be enabled in combination; for example, to use variant 1 of both cv06 and cv10 (the latter affects ᯞᯮ), use font-feature-settings: "liga", "cv06" 1, "cv10" 1.

The implementation of a character variant feature is in many cases an alternate substitution that offers one or more alternative glyphs. For example, the variants for ma are implemented as:

feature cv08 {
    sub ma-batak from [ma-batak.v1 ma-batak.v2];
    sub ma_uSign-batak from [ma_uSign-batak.v1 ma_uSign-batak.v2];
} cv08;

However, where only one alternative is needed (i.e., the feature is essentially a binary on-off feature), other substitutions can be used. For example, the variant for ᯇᯮ is implemented by reusing the contextual lookup for the language-dependent feature for that glyph sequence:

feature cv03 {
    lookup flip_uSign_batak;
} cv03;

Similarly, the variant for ᯞᯮ is implemented by breaking up the ligature into its components:

feature cv10 {
    sub la_uSign-batak by la-batak uSign-batak;
} cv10;

Glyph positioning

Mark-to-base positioning

In the Batak script, the position of an above-base mark can be semantically significant: ◌ᯩ is the vowel ee, but ◌ᯰ is the final consonant ng, and ◌᯦ is the consonant modifier tompi, while ◌ᯱ is the final consonant h. We therefore have to distinguish topleft, top, and topright marks. Combinations do occur; ᯚ᯦ᯩᯰ is a valid syllable. There’s only one below-base mark, which tends to sit in the bottom right corner, so a single bottomright class suffices here. The classes are:

For correct text, base glyphs need anchors as follows:

Mark-to-mark positioning

For correct text, mark-to-mark positioning is necessary only for those marks that can sit above the right-hand side of a base consonant: The final consonants ◌ᯰ ngSign and ◌ᯱ hSign above the vowels ◌ᯨ eSign and ◌ᯭ oSignKaro. If the final consonant marks occur together with other above-base vowels or with ◌᯦ tompi, they are positioned to their right, as discussed above. In order to make incorrect text obvious, it is useful to add mark-to-mark positioning for combinations that shouldn’t occur but are not prohibited by syllable validation (indicated in red).



The following materials are available to help with the construction of fonts for the Batak script:


I’d like to thank Uli Kozok for his generous advice on the requirements for Batak fonts, as well as for licensing the glyphs of his Batak fonts to Lindenberg Software LLC. I also thank முத்து நெடுமாறன் (Muthu Nedumaran) and Ben Mitchell for their comments on drafts of this article.


Corbett 2017: David Corbett: Forbid Batak killers after vowel signs. GitHub 2017.

Everson and Kozok 2008: Michael Everson, Uli Kozok: Proposal for encoding the Batak script in the UCS. Unicode Consortium, 2008.

Kozok 1996: Uli Kozok: Bark, Bones, and Bamboo: Batak Traditions of Sumatra. In: Ann Kumar and John McGlynn: Illuminations. The Writing Traditions of Indonesia. The Lontar Foundation / Weatherill 1996. Available at Internet Archive.

Kozok 2009a: Uli Kozok: Surat Batak. Sejarah Perkembangan Tulisan Batak. Ecole française d’Extrême-Orient / Kepustakaan Populer Gramedia 2009.

Kozok 2009b: Uli Kozok: Unduh Font Batak, in particular Karo 1.1. for Windows, Mandailing 1.1. for Windows, Pakpak 1.1. for Windows, Simalungun 1.1. for Windows, Variants 1.1. for Windows. 2009.

Microsoft 2018: Feature tags. Microsoft, dated 08/14/2018. In particular, character variants and stylistic sets.

Microsoft 2020: Creating and supporting OpenType fonts for the Universal Shaping Engine. Microsoft, dated 07/31/2020.

Nedumaran and Lindenberg 2017: Muthu Nedumaran, Norbert Lindenberg: Creating fonts for Brahmic scripts with OpenType and Apple Advanced Typography.

Unicode 2020: The Unicode Consortium: The Unicode Standard, Version 13.0. The Unicode Consortium, 2020. For Batak, section 17.6 Batak, pages 702-703. For abugidas, section 6.1 Writing Systems, pages 255-256. For canonical ordering, section 3.11 Normalization Forms, pages 135-136.