Constructing fonts for the Batak script

Norbert Lindenberg

October 14, 2020

This article requires web fonts to be rendered correctly. Please read it in a browser and mode that supports web fonts (“reader” views don’t).

Introduction
Script and language tags
Characters
Cluster validation
Glyph reordering
Language-dependent features
Ligatures and contextual forms
Glyph variants
Glyph positioning
Materials
Acknowledgments
References

Using copyrighted material without license to create AI systems is theft.

Introduction

The Batak script was used until the early 20th century to write several related languages spoken in North Sumatra. Five variants are generally distinguished: The northern variants Karo and Pakpak, and the southern variants Simalungun, Toba, and Mandailing (Kozok 1996, Kozok 2009a, Everson and Kozok 2008, Unicode 2020).

The Batak script is an abugida: At its core is a set of consonants with the inherent vowel a, which can be modified with a consonant modifier, vowel marks, final consonant marks, and virama signs. Compared to other Brahmic scripts, it has no conjuncts, no repha forms, and no pre-base marks. The final consonants of phonetic syllables can be represented in two ways: The consonants ng and h, corresponding to anusvara and visarga, are represented as above-base marks ◌ᯰ and ◌ᯱ, which are positioned on top of a spacing vowel mark if there is one. Other final consonants are represented by a consonant letter followed by a virama, and in this case the consonant is placed before any vowel mark in the syllable: The syllable consisting of ᯖ ta, ◌ᯪ i, ᯇ pa, ◌᯲ virama has to be displayed as ᯖᯪᯇ᯲ tip. There are a maximum of two rows of above-base and one row of below-base marks. The positioning of above-base marks relative to a base consonant is semantically significant. The encoding of Batak in Unicode follows phonetic order, requiring reordering in the case of final consonants, and encodes most language-dependent glyph variants as separate code points.

This document provides information on the required character sets for the different variants, the structure of valid clusters and how to validate them, the steps needed to bring a sequence of nominal glyphs representing a Unicode character sequence into the corresponding sequence of glyphs for correct rendering, commonly used ligatures, glyph variants, and the positioning information required for the glyphs. It relies on the OpenType rendering system, where Batak is supported through the Universal Shaping Engine (USE, Microsoft 2020), which has been available since Android 7, iOS 10, Windows 10, and macOS 10.12. The document assumes knowledge equivalent to Nedumaran and Lindenberg 2017. To create an actual font for Batak, this information needs to be complemented with research into and design of Batak letter shapes. The Batak font used in this document is based on the fonts for Karo, Mandailing, Pakpak, Simalungun, and Variants 1.1 for Windows in Kozok 2009b, but has been reengineered for Unicode and the USE.

Script and language tags

The ISO 15924 script tag for Batak is Batk. The ISO 639 language tag for Batak in general is btk; the ones for the variants mentioned above are Karo btx, Pakpak btd, Simalungun bts, Toba bbc, and Mandailing btm. When forming BCP 47 language tags with these codes, the script tag may need to be included, as the BCP 47 language tag registry doesn’t define a default script for these languages (if it did, it would more likely be Latin than Batak).

The OpenType script tag for Batak is batk. The only registered OpenType language tags for Batak and its variants are Simalungun BTS and Toba BBC. The HarfBuzz shaping system has a fallback mechanism that maps three-letter language strings to their uppercase equivalents as pseudo-OpenType language tags, including Batak BTK, Karo BTX, Pakpak BTD, and Mandailing BTM. An update to the OpenType language tag registry to add more Batak language tags is in progress.
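
The tag handling described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the uppercase fallback mirrors the HarfBuzz behavior described in the text rather than calling HarfBuzz, and the padding to four characters follows the general OpenType convention for tags.

```python
# Registered OpenType language tags for Batak variants, as listed above.
REGISTERED_OT_TAGS = {"bts": "BTS", "bbc": "BBC"}


def bcp47_tag(language: str, script: str = "Batk") -> str:
    """Form a BCP 47 tag with an explicit script subtag, e.g. 'btx-Batk'."""
    return f"{language.lower()}-{script.title()}"


def opentype_language_tag(language: str) -> str:
    """Return a four-character OpenType language tag for an ISO 639 code.

    Unregistered codes fall back to their uppercase form, which is the
    HarfBuzz behavior described in the text (hypothetical helper).
    """
    code = language.lower()
    tag = REGISTERED_OT_TAGS.get(code, code.upper())
    return tag.ljust(4)  # OpenType tags are padded with spaces to length 4


if __name__ == "__main__":
    for code in ("btk", "btx", "btd", "bts", "bbc", "btm"):
        print(bcp47_tag(code), repr(opentype_language_tag(code)))
```

In a web page, the first value is what goes into the lang attribute; the second is what a shaper looks for in the font.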

Characters

The table below shows the Unicode characters relevant to a Batak font, with Unicode code point, Unicode name, a proposed glyph name, and representative glyphs in the columns of the languages that use the character. It indicates cases where a character is pronounced differently from the consonant or vowel used for the Unicode name. A glyph in parentheses means that this character is less commonly used for this language.

Code Point	Unicode Name	Glyph Name	Karo	Pakpak	Simalungun	Toba	Mandailing
Letters: Unicode character names in this group have the prefix BATAK LETTER. Glyph names in this group have the suffix “-batak”.
1BC0	A	a	ᯀ a, ha	ᯀ a, ha		ᯀ	ᯀ
1BC1	SIMALUNGUN A	aSima			ᯁ
1BC2	HA	ha	ᯂ ka	ᯂ ka		ᯂ ha, ka	(ᯂ ha, ka)
1BC3	SIMALUNGUN HA	haSima			ᯃ ha, ka		(ᯃ ha, ka)
1BC4	MANDAILING HA	haMandai					ᯄ ha, ka
1BC5	BA	ba	ᯅ mba	ᯅ	ᯅ	ᯅ	ᯅ
1BC6	KARO BA	baKaro	ᯆ
1BC7	PA	pa	ᯇ	ᯇ		ᯇ	ᯇ
1BC8	SIMALUNGUN PA	paSima			ᯈ
1BC9	NA	na	ᯉ	ᯉ	ᯉ	ᯉ	ᯉ
1BCA	MANDAILING NA	naMandai					(ᯊ)
1BCB	WA	wa	ᯋ			ᯋ	ᯋ
1BCC	SIMALUNGUN WA	waSima			ᯌ
1BCD	PAKPAK WA	waPak		ᯍ		ᯍ
1BCE	GA	ga	ᯎ	ᯎ		ᯎ	ᯎ
1BCF	SIMALUNGUN GA	gaSima			ᯏ
1BD0	JA	ja	ᯐ	ᯐ	ᯐ	ᯐ	ᯐ
1BD1	DA	da	ᯑ	ᯑ	ᯑ	ᯑ	ᯑ
1BD2	RA	ra	ᯒ	ᯒ		ᯒ	ᯒ
1BD3	SIMALUNGUN RA	raSima			ᯓ
1BD4	MA	ma	ᯔ	ᯔ		ᯔ	ᯔ
1BD5	SIMALUNGUN MA	maSima			ᯕ
1BD6	SOUTHERN TA	taSouth			ᯖ	ᯖ	ᯖ
1BD7	NORTHERN TA	taNorth	ᯗ	ᯗ		ᯗ
1BD8	SA	sa	ᯘ	ᯘ sa, ca		ᯘ	(ᯘ)
1BD9	SIMALUNGUN SA	saSima			ᯙ		(ᯙ)
1BDA	MANDAILING SA	saMandai			(ᯚ)		ᯚ
1BDB	YA	ya	ᯛ	ᯛ		ᯛ	ᯛ
1BDC	SIMALUNGUN YA	yaSima			ᯜ
1BDD	NGA	nga	ᯝ	ᯝ	ᯝ	ᯝ	ᯝ
1BDE	LA	la	ᯞ	ᯞ		ᯞ	ᯞ
1BDF	SIMALUNGUN LA	laSima			ᯟ
1BE0	NYA	nya	ᯠ ca		ᯠ	ᯠ	ᯠ
1BE1	CA	ca	ᯡ
1BE2	NDA	nda	ᯢ
1BE3	MBA	mba	ᯣ
1BE4	I	i	ᯤ	ᯤ	ᯤ	ᯤ	ᯤ
1BE5	U	u	ᯥ	ᯥ	ᯥ	ᯥ	ᯥ
Marks: Unicode character names in this group have the prefix BATAK. Glyph names in this group have the suffix “-batak”.
1BE6	SIGN TOMPI	tompi					◌᯦
1BE7	VOWEL SIGN E	eSign	◌ᯧ	(◌ᯧ)
1BE8	VOWEL SIGN PAKPAK E	eSignPak	◌ᯨ o	◌ᯨ
1BE9	VOWEL SIGN EE	eeSign	◌ᯩ	◌ᯩ	◌ᯩ	◌ᯩ	◌ᯩ
1BEA	VOWEL SIGN I	iSign	◌ᯪ	◌ᯪ		◌ᯪ	◌ᯪ
1BEB	VOWEL SIGN KARO I	iSignKaro	◌ᯫ		◌ᯫ
1BEC	VOWEL SIGN O	oSign	◌ᯬ u	◌ᯬ	◌ᯬ	◌ᯬ	◌ᯬ
1BED	VOWEL SIGN KARO O	oSignKaro	◌ᯭ		◌ᯭ ou
1BEE	VOWEL SIGN U	uSign		◌ᯮ	◌ᯮ	◌ᯮ	◌ᯮ
1BEF	VOWEL SIGN U FOR SIMALUNGUN SA	uSignSima			◌ᯯ
1BF0	CONSONANT SIGN NG	ngSign	◌ᯰ	◌ᯰ	◌ᯰ	◌ᯰ	◌ᯰ
1BF1	CONSONANT SIGN H	hSign	◌ᯱ	◌ᯱ	◌ᯱ
1BF2	PANGOLAT	pangolat		◌᯲		◌᯲	◌᯲
1BF3	PANONGONAN	panongonan	◌᯳		◌᯳
Punctuation: Unicode character names in this group have the prefix BATAK SYMBOL. Glyph names in this group have the suffix “-batak”.
1BFC	BINDU NA METEK	binduNaMetek	᯼	᯼	᯼	᯼	᯼
1BFD	BINDU PINARBORAS	binduPinarboras	᯽	᯽	᯽	᯽	᯽
1BFE	BINDU JUDUL	binduJudul	᯾	᯾	᯾	᯾	᯾
1BFF	BINDU PANGOLAT	binduPangolat	᯿	᯿	᯿	᯿	᯿
Other: These non-Batak characters should be provided in a Batak font.
0020	SPACE	space
00A0	NO-BREAK SPACE	nbspace
25CC	DOTTED CIRCLE	dottedCircle	◌

A few notes on some of these characters:

ᯀ a and ᯁ aSima can carry vowel marks and final consonants, so for rendering purposes they need to be treated like consonants.
The post-base vowel signs ◌ᯧ eSign, ◌ᯪ iSign, ◌ᯫ iSignKaro, and ◌ᯬ oSign are spacing and can carry final consonant signs, so in OpenType they need to be treated as bases.
◌᯲ pangolat and ◌᯳ panongonan are both viramas that differ only in shape. They are always visible and do not form ligatures with surrounding characters.

Cluster validation

The Unicode Standard describes the Batak syllable structure as C(V)(C_s|C_d): a consonant, followed by an optional vowel sign, which may be followed either by a consonant sign C_s (-ng or -h) or a killed final consonant C_d. It also specifies that text is stored in “logical order”, by which it means phonetic order, and that the virama pangolat can’t follow a dependent vowel, by which it means in stored order. This description isn’t complete, as it doesn’t account for the consonant modifier tompi, and doesn’t say whether the virama panongonan can follow a dependent vowel (it can’t).

The Universal Shaping Engine (USE) relies on Unicode character properties to classify characters.

Code points	Characters	General category	Canonical combining class	Indic syllabic category	Indic positional category	USE subclass
1BFC..1BFF	᯼ ᯽ ᯾ ᯿	Po	0	Other	NA	BASE_IND
0020		Zs	0	Other	NA	OTHER
00A0		Zs	0	Consonant_Placeholder	NA	BASE_OTHER
25CC	◌	So	0	Consonant_Placeholder	NA	BASE_OTHER
1BC0..1BE3	ᯀ ᯁ ᯂ ᯃ ᯄ ᯅ ᯆ ᯇ ᯈ ᯉ ᯊ ᯋ ᯌ ᯍ ᯎ ᯏ ᯐ ᯑ ᯒ ᯓ ᯔ ᯕ ᯖ ᯗ ᯘ ᯙ ᯚ ᯛ ᯜ ᯝ ᯞ ᯟ ᯠ ᯡ ᯢ ᯣ	Lo	0	Consonant	NA	BASE
1BE4..1BE5	ᯤ ᯥ	Lo	0	Vowel_Independent	NA	BASE
1BE6	◌᯦	Mn	7	Nukta	Top	CONS_MOD_ABOVE
1BE8..1BE9, 1BED, 1BEF	◌ᯨ ◌ᯩ ◌ᯭ ◌ᯯ	Mn	0	Vowel_Dependent	Top	VOWEL_ABOVE
1BE7, 1BEA..1BEC, 1BEE	◌ᯧ ◌ᯪ ◌ᯫ ◌ᯬ ◌ᯮ	Mc	0	Vowel_Dependent	Right	VOWEL_POST
1BF2..1BF3	◌᯲ ◌᯳	Mc	9	Pure_Killer	Right	VOWEL_POST
1BF0..1BF1	◌ᯰ ◌ᯱ	Mn	0	Consonant_Final	Top	CONS_FINAL_ABOVE
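
Two of the columns above, general category and canonical combining class, can be checked against the Unicode data that ships with Python. The sketch below does that for one sample code point per row group; the sample selection and function names are choices made for this example. Python’s unicodedata module does not expose the Indic syllabic or positional categories, so those columns are not covered.

```python
import unicodedata

# Sample code points mapped to (general category, canonical combining class),
# copied from the table above.
SAMPLES = {
    0x1BC0: ("Lo", 0),  # BATAK LETTER A
    0x1BE6: ("Mn", 7),  # BATAK SIGN TOMPI
    0x1BE8: ("Mn", 0),  # BATAK VOWEL SIGN PAKPAK E
    0x1BEA: ("Mc", 0),  # BATAK VOWEL SIGN I
    0x1BF0: ("Mn", 0),  # BATAK CONSONANT SIGN NG
    0x1BF2: ("Mc", 9),  # BATAK PANGOLAT
    0x1BFC: ("Po", 0),  # BATAK SYMBOL BINDU NA METEK
}


def properties(code_point: int) -> tuple:
    """Return (general category, canonical combining class) of a code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.combining(ch)


def mismatches() -> dict:
    """Return the samples whose live properties differ from the table."""
    return {
        f"U+{cp:04X}": properties(cp)
        for cp, expected in SAMPLES.items()
        if properties(cp) != expected
    }


if __name__ == "__main__":
    print(mismatches() or "all sampled properties match the table")
```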

The classification of U+1BEE ◌ᯮ is likely wrong – it is usually positioned below the base, and should therefore have the general category Mn and the Indic positional category Bottom, and the USE subclass VOWEL_BELOW. A proposal is being prepared to correct this, and the following text assumes corrected data. You don’t need to worry about this, as the change can only affect incorrect text – in correct text at most one vowel or virama can be attached to each consonant.

Based on the USE subclasses identified above, but assuming corrected data for U+1BEE, the USE defines the following clusters for Batak:

Independent cluster: BASE_IND | OTHER. This kind of “cluster” contains only a single character that doesn’t interact with surrounding characters.

Standard cluster: (BASE | BASE_OTHER) CONS_MOD_ABOVE* VOWEL_ABOVE* VOWEL_BELOW* VOWEL_POST* CONS_FINAL_ABOVE*. The overall visual layout of this cluster is:

`CONS_FINAL_ABOVE`
`VOWEL_ABOVE`
`CONS_MOD_ABOVE`
`BASE | BASE_OTHER`	`VOWEL_POST`
`VOWEL_BELOW`

The standard cluster definition is a bit difficult to compare to the description in the Unicode Standard, but the key differences are:

1. If a syllable, as defined in the Unicode Standard, includes a final consonant group C_d, that is, a consonant followed by a virama, then the USE treats that final consonant as a separate cluster. This will impact the implementation of the required reordering of the consonant and a preceding dependent vowel, as discussed below.
2. As the Batak virama signs are classified as VOWEL_POST, and the USE allows multiple dependent vowels of any kind, the USE does not enforce the rule that a virama can’t follow a dependent vowel.
3. The USE allows multiple consonant modifiers, multiple vowels, and multiple final consonant signs within a cluster, and, because they’re separate clusters, multiple final consonant groups. A Batak syllable as described in the Unicode Standard allows only one of each, and only either a final consonant sign or a final consonant group.

Items 2 and 3 mean that the validation performed by the USE is partial at best, and a font should make potential problems visible. Some of the problems in item 3 are immediately visible, for example, if an author writes multiple final consonant groups. Others, such as repeated below-base vowels, can be made visible through mark-to-mark positioning, as described in Glyph positioning below.
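
To make the differences concrete, here is a rough Python model of the cluster definitions above. It is a sketch, not the USE: it covers only the Batak subclasses from the table (with U+1BEE treated as VOWEL_BELOW, as assumed in the text), uses one-letter class codes invented for this example, and raises an error where a real implementation would start a broken cluster with a dotted circle.

```python
import re


def use_class(ch: str) -> str:
    """Map a character to a one-letter code for its USE subclass."""
    cp = ord(ch)
    if 0x1BFC <= cp <= 0x1BFF:
        return "I"  # BASE_IND
    if cp == 0x0020:
        return "O"  # OTHER
    if cp in (0x00A0, 0x25CC) or 0x1BC0 <= cp <= 0x1BE5:
        return "B"  # BASE or BASE_OTHER
    if cp == 0x1BE6:
        return "M"  # CONS_MOD_ABOVE
    if cp in (0x1BE8, 0x1BE9, 0x1BED, 0x1BEF):
        return "A"  # VOWEL_ABOVE
    if cp == 0x1BEE:
        return "U"  # VOWEL_BELOW (corrected data assumed)
    if cp in (0x1BE7, 0x1BEA, 0x1BEB, 0x1BEC, 0x1BF2, 0x1BF3):
        return "P"  # VOWEL_POST, which includes both viramas
    if cp in (0x1BF0, 0x1BF1):
        return "F"  # CONS_FINAL_ABOVE
    raise ValueError(f"not covered by this sketch: U+{cp:04X}")


# Independent cluster | standard cluster, as defined above.
CLUSTER = re.compile(r"[IO]|BM*A*U*P*F*")


def clusters(text: str) -> list:
    """Split text into clusters; raise if a mark has no base to attach to."""
    classes = "".join(use_class(ch) for ch in text)
    result, pos = [], 0
    while pos < len(classes):
        match = CLUSTER.match(classes, pos)
        if match is None:
            raise ValueError(f"broken cluster at index {pos}")
        result.append(text[pos:match.end()])
        pos = match.end()
    return result


if __name__ == "__main__":
    correct = "\u1BD6\u1BEA\u1BC7\u1BF2"    # ta, i, pa, pangolat
    incorrect = "\u1BD6\u1BC7\u1BEA\u1BF2"  # ta, pa, i, pangolat
    print([len(c) for c in clusters(correct)])    # final consonant is its own cluster
    print([len(c) for c in clusters(incorrect)])  # accepted although invalid Batak
```

Both sequences pass this grammar, which is exactly the gap that the font has to close for item 2.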

The one issue a font must deal with is item 2: If a virama follows a dependent vowel, a dotted circle should be inserted between them. If that’s not done, authors may write a syllable such as ᯖᯪᯇ᯲ as a sequence of consonant, consonant, dependent vowel, virama and see what looks like the correct visual sequence, which is however based on an incorrect logical sequence. (Starting with Corbett 2017 a hack was added to USE implementations to insert the dotted circle automatically, but older implementations don’t do that.)

A font can implement the insertion of the dotted circle as follows:

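# @vowelSign_batak and @virama_batak are glyph classes defined elsewhere in the feature file.

# Multiple substitution: put a dotted circle in front of a virama.
lookup insert_dotted_circle_batak {
    lookupflag 0;
    sub pangolat-batak by dottedCircle pangolat-batak;
    sub panongonan-batak by dottedCircle panongonan-batak;
} insert_dotted_circle_batak;

feature ccmp {
    # Apply the insertion only where a virama directly follows a dependent vowel sign.
    lookup validate_batak {
        lookupflag 0;
        sub @vowelSign_batak @virama_batak' lookup insert_dotted_circle_batak;
    } validate_batak;
} ccmp;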
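
The effect of this lookup can be mimicked on a character string in Python, which is handy for testing text before a font exists. This is a sketch under the assumption that the class of vowel signs covers all nine dependent vowels U+1BE7..U+1BEF; the function name is invented for the example.

```python
DOTTED_CIRCLE = "\u25CC"
VOWEL_SIGNS = {chr(cp) for cp in range(0x1BE7, 0x1BF0)}  # U+1BE7..U+1BEF
VIRAMAS = {"\u1BF2", "\u1BF3"}  # pangolat, panongonan


def insert_dotted_circles(text: str) -> str:
    """Insert a dotted circle wherever a virama directly follows a vowel sign."""
    out = []
    for ch in text:
        if ch in VIRAMAS and out and out[-1] in VOWEL_SIGNS:
            out.append(DOTTED_CIRCLE)
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    correct = "\u1BD6\u1BEA\u1BC7\u1BF2"    # ta, i, pa, pangolat
    incorrect = "\u1BD6\u1BC7\u1BEA\u1BF2"  # ta, pa, i, pangolat
    print(insert_dotted_circles(correct) == correct)  # correct text is untouched
    print([f"U+{ord(c):04X}" for c in insert_dotted_circles(incorrect)])
```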

USE cluster validation does not take Unicode normalization into consideration. This creates a conflict with the canonical ordering algorithm that’s part of normalization: This algorithm reorders a sequence of a Batak virama, which has canonical combining class 9, followed by a tompi, which has canonical combining class 7, into a tompi–virama sequence, and so the two sequences are considered canonically equivalent in Unicode. The USE, on the other hand, inserts a dotted circle into the first sequence, treating it as decidedly non-equivalent. A font could try and remove such a dotted circle and attach the tompi to the preceding consonant to restore equivalence, but deciding whether a dotted circle was in the original text or was added by cluster validation is tricky. It’s probably safer to punt on this issue and accept that Batak rendering only works correctly on normalized text.
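
The canonical reordering is easy to observe with Python’s unicodedata module. The sketch below builds both sequences on the consonant pa, which is chosen only to have a base to attach to, and shows that normalization turns the virama–tompi order into the tompi–virama order; the helper name is made up for this example.

```python
import unicodedata

PA, TOMPI, PANGOLAT = "\u1BC7", "\u1BE6", "\u1BF2"

virama_first = PA + PANGOLAT + TOMPI  # combining classes 9, then 7
tompi_first = PA + TOMPI + PANGOLAT   # combining classes 7, then 9


def is_in_normalized_order(text: str) -> bool:
    """True if NFC normalization leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


if __name__ == "__main__":
    print(unicodedata.normalize("NFC", virama_first) == tompi_first)
    print(is_in_normalized_order(tompi_first), is_in_normalized_order(virama_first))
```

Normalizing text before it reaches the shaper, where an application can do so, sidesteps the conflict.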

Glyph reordering

Batak requires none of the reordering features that the USE supports: pre-base vowels, pre-base vowel modifiers, pre-base medial consonants, or repha forms. On the other hand, Batak requires a reordering feature that the USE doesn’t support: A logical sequence of dependent vowel, consonant, and virama has to be reordered into the corresponding consonant-vowel-virama sequence for display. For example, the syllable stored as ᯖ ta, ◌ᯪ i, ᯇ pa, ◌᯲ virama has to be displayed as ᯖᯪᯇ᯲ tip.

This reordering can’t be performed in the USE’s feature application I phase, because the USE applies this phase’s features one cluster at a time and treats the consonant-virama sequence as a separate cluster. It has to be performed in the USE’s standard typographic presentation phase, where features are applied to the entire run. Of the features applied in this phase, rclt seems the least inappropriate, and has the advantage that it normally can’t be turned off.

OpenType lacks a reordering substitution that would let us implement this feature directly. Everson and Kozok 2008 propose ligatures as one possible implementation. However, Batak in its Unicode representation has 36 consonants and 9 dependent vowels, so a full set of ligatures would consist of 324 glyphs. Most of them would likely never be used, either because the consonant is not used as a final consonant at all, or because the specific vowel-consonant combination is not used. Unfortunately, there’s no documentation available on which combinations are used and which ones aren’t. This makes ligatures a rather inefficient way to implement this reordering.
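
The count follows directly from the code point ranges in the Characters table. A minimal Python check, with variable names chosen for this example:

```python
CONSONANTS = range(0x1BC0, 0x1BE4)        # U+1BC0..U+1BE3, includes a and aSima
DEPENDENT_VOWELS = range(0x1BE7, 0x1BF0)  # U+1BE7..U+1BEF


def ligature_count() -> int:
    """Number of vowel-consonant ligatures a full set would need."""
    return len(CONSONANTS) * len(DEPENDENT_VOWELS)


if __name__ == "__main__":
    print(len(CONSONANTS), len(DEPENDENT_VOWELS), ligature_count())
```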

Instead, we can use contextual substitutions to remove glyphs from where they shouldn’t be, and insert them where they should be. Since there are 36 consonants, 9 dependent vowels, and 2 viramas involved, it’s best to avoid rules that would have to enumerate the consonants. We therefore remove and insert the dependent vowels.

OpenType also lacks a substitution to remove glyphs. Instead, we remove a dependent vowel from its old location by replacing it with an “invisible” glyph that is classified as an OpenType mark, has advance width 0, no outline, and a name that doesn’t map to a Unicode character according to the Adobe Glyph List.

lookup remove_vowel_batak {
    lookupflag 0;
    sub @vowelSign_batak by __.invisible;
} remove_vowel_batak;

The dependent vowel is then inserted before a virama. As this lookup is specific to the vowel, it needs to be repeated for each of the vowel signs.

lookup insert_eSign_batak {
    lookupflag 0;
    sub pangolat-batak by eSign-batak pangolat-batak;
    sub panongonan-batak by eSign-batak panongonan-batak;
} insert_eSign_batak;
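
Since the nine insertion lookups differ only in the vowel sign, they can be generated rather than written by hand. The following Python sketch prints feature code in the shape shown above; the glyph names come from the Characters table, while the generator itself is just one possible way to do this and is not how the font used in this article was built.

```python
VOWEL_SIGNS = [
    "eSign", "eSignPak", "eeSign", "iSign", "iSignKaro",
    "oSign", "oSignKaro", "uSign", "uSignSima",
]
VIRAMAS = ["pangolat", "panongonan"]


def insert_lookup(vowel: str) -> str:
    """Return feature code for the lookup that inserts one vowel sign."""
    name = f"insert_{vowel}_batak"
    lines = [f"lookup {name} {{", "    lookupflag 0;"]
    for virama in VIRAMAS:
        # Multiple substitution: virama becomes vowel sign plus virama.
        lines.append(f"    sub {virama}-batak by {vowel}-batak {virama}-batak;")
    lines.append(f"}} {name};")
    return "\n".join(lines)


def all_insert_lookups() -> str:
    """Return the feature code for all nine insertion lookups."""
    return "\n\n".join(insert_lookup(vowel) for vowel in VOWEL_SIGNS)


if __name__ == "__main__":
    print(all_insert_lookups())
```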

We use these lookups nested within a contextual lookup that detects the vowel-consonant-virama sequence, removes the dependent vowel from its pre-consonant location, and inserts it before the virama.

feature rclt {
    # Match vowel sign, consonant, virama; replace the vowel sign with the
    # invisible glyph and insert a copy of it before the virama.
    lookup reorder_batak {
        lookupflag UseMarkFilteringSet @vowelOrConsonantMark;
        sub eSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_eSign_batak;
        sub eSignPak-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_eSignPak_batak;
        sub eeSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_eeSign_batak;
        sub iSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_iSign_batak;
        sub iSignKaro-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_iSignKaro_batak;
        sub oSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_oSign_batak;
        sub oSignKaro-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_oSignKaro_batak;
        sub uSign-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_uSign_batak;
        sub uSignSima-batak' lookup remove_vowel_batak @consonant_batak' @virama_batak' lookup insert_uSignSima_batak;
    } reorder_batak;
} rclt;
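
What this lookup does to a glyph run can be traced with a small Python model. It is a simplification: it works on lists of glyph names, assumes that the consonant class holds the 36 letters U+1BC0..U+1BE3, and ignores the mark filtering set, so it only handles a vowel sign, consonant, and virama that are directly adjacent.

```python
INVISIBLE = "__.invisible"
VOWEL_SIGNS = {f"{name}-batak" for name in (
    "eSign eSignPak eeSign iSign iSignKaro oSign oSignKaro uSign uSignSima"
).split()}
VIRAMAS = {"pangolat-batak", "panongonan-batak"}
CONSONANTS = {f"{name}-batak" for name in (
    "a aSima ha haSima haMandai ba baKaro pa paSima na naMandai wa waSima "
    "waPak ga gaSima ja da ra raSima ma maSima taSouth taNorth sa saSima "
    "saMandai ya yaSima nga la laSima nya ca nda mba"
).split()}


def reorder(glyphs: list) -> list:
    """Move each vowel sign that precedes consonant + virama to before the virama."""
    out, i = [], 0
    while i < len(glyphs):
        window = glyphs[i:i + 3]
        if (len(window) == 3 and window[0] in VOWEL_SIGNS
                and window[1] in CONSONANTS and window[2] in VIRAMAS):
            vowel, consonant, virama = window
            # "Remove" the vowel via the invisible glyph, then re-insert it.
            out += [INVISIBLE, consonant, vowel, virama]
            i += 3
        else:
            out.append(glyphs[i])
            i += 1
    return out


if __name__ == "__main__":
    tip = ["taSouth-batak", "iSign-batak", "pa-batak", "pangolat-batak"]
    print(reorder(tip))
```

The invisible glyph stays in the run, which is why it must have zero advance width and no outline.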

Language-dependent features

Batak has two language-dependent features. For Karo, ᯐ ja is usually written in a more symmetric form, ᯐ. For Mandailing, ◌ᯮ uSign is flipped horizontally when attached to the consonant ᯇ pa: ᯇᯮ.

These features can be enabled in browsers by setting the lang attribute of an element containing the text to the desired language, e.g., lang="btx-Batk". Some word processors also provide a way to specify the language of text. Unfortunately, currently these features work only with browsers and applications based on the HarfBuzz shaping system, because other OpenType implementations don’t yet map the ISO language tags btx and btm to the corresponding OpenType language tags BTX and BTM. Browsers based on HarfBuzz are Firefox, Chrome, and new Edge, except for their versions for iOS, which have to use WebKit and CoreText. Here is what you should be seeing and what your browser actually renders:

Language	Expected	Your browser
Karo	ᯐ	ᯐ
Mandailing	ᯇᯮ	ᯇᯮ

The features can’t be implemented in Batak fonts using the usual OpenType feature locl because that feature is applied as part of the USE’s feature application I phase, while we don’t know whether a ◌ᯮ sits under a ᯇ until reordering is done. We therefore use the rclt feature tag again.

feature rclt {
    script batk;

    language BTX;
    lookup karo_ja_batak {
        lookupflag 0;
        sub ja-batak by ja-batak.v1;
    } karo_ja_batak;

    language BTM;
    lookup flip_uSign_batak {
        lookupflag UseMarkFilteringSet @below_base_batak;
        sub pa-batak uSign-batak' by uSign-batak.paMandai;
    } flip_uSign_batak;
} rclt;
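
The following Python sketch models the Mandailing lookup on glyph names to show why it has to run after reordering. The set of mark glyphs and the content of the filtering set are assumptions made for this example (the article does not list them): marks outside the filtering set are skipped when looking for the preceding glyph, as UseMarkFilteringSet prescribes.

```python
# Assumed OpenType marks; spacing vowel signs and consonants are bases.
MARKS = {
    "__.invisible", "tompi-batak", "eSignPak-batak", "eeSign-batak",
    "oSignKaro-batak", "uSign-batak", "uSignSima-batak",
    "ngSign-batak", "hSign-batak",
}
FILTER_SET = {"uSign-batak"}  # assumed content of @below_base_batak


def flip_usign(glyphs: list) -> list:
    """Replace uSign with its flipped variant when it attaches to pa."""
    out = list(glyphs)
    for i, glyph in enumerate(out):
        if glyph != "uSign-batak":
            continue
        j = i - 1
        while j >= 0 and out[j] in MARKS and out[j] not in FILTER_SET:
            j -= 1  # marks outside the filtering set are invisible to the rule
        if j >= 0 and out[j] == "pa-batak":
            out[i] = "uSign-batak.paMandai"
    return out


if __name__ == "__main__":
    # Stored order pa, u, sa, virama: before reordering, u still follows pa.
    before = ["pa-batak", "uSign-batak", "sa-batak", "pangolat-batak"]
    # After reordering, u has moved to sa and must not be flipped.
    after = ["pa-batak", "__.invisible", "sa-batak", "uSign-batak", "pangolat-batak"]
    print(flip_usign(before))  # flipped too early: wrong result
    print(flip_usign(after))   # unchanged: correct result
```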

Ligatures and contextual forms

The below-base dependent vowel ◌ᯮ uSign forms ligatures with several consonants; with several other consonants it changes into a simplified shape. Combinations of this vowel with consonants that occur only in Karo are not used (indicated in gray).

Consonant glyph name	Unligated combination	Correct combination	Combination type
a	ᯀᯮ	ᯀᯮ	ligature
aSima	ᯁᯮ	ᯁᯮ	attachment of simplified glyph
ha	ᯂᯮ	ᯂᯮ	ligature
haSima	ᯃᯮ	ᯃᯮ	attachment of simplified glyph
haMandai	ᯄᯮ	ᯄᯮ	ligature
ba	ᯅᯮ	ᯅᯮ	ligature
baKaro	ᯆᯮ
pa	ᯇᯮ	ᯇᯮ, ᯇᯮ	attachment
paSima	ᯈᯮ	ᯈᯮ	attachment
na	ᯉᯮ	ᯉᯮ	ligature
naMandai	ᯊᯮ	ᯊᯮ	ligature
wa	ᯋᯮ	ᯋᯮ	ligature
waSima	ᯌᯮ	ᯌᯮ	attachment of simplified glyph
waPak	ᯍᯮ	ᯍᯮ	attachment
ga	ᯎᯮ	ᯎᯮ	ligature
gaSima	ᯏᯮ	ᯏᯮ	attachment of simplified glyph
ja	ᯐᯮ	ᯐᯮ	ligature
da	ᯑᯮ	ᯑᯮ	ligature
ra	ᯒᯮ	ᯒᯮ	ligature
raSima	ᯓᯮ	ᯓᯮ	attachment of simplified glyph
ma	ᯔᯮ	ᯔᯮ	ligature
maSima	ᯕᯮ	ᯕᯮ	ligature or attachment of simplified glyph
taSouth	ᯖᯮ	ᯖᯮ	ligature
taNorth	ᯗᯮ	ᯗᯮ	ligature
sa	ᯘᯮ	ᯘᯮ	attachment
saSima	ᯙᯮ	ᯙᯮ	attachment
saMandai	ᯚᯮ	ᯚᯮ	ligature
ya	ᯛᯮ	ᯛᯮ	attachment
yaSima	ᯜᯮ	ᯜᯮ	attachment
nga	ᯝᯮ	ᯝᯮ	ligature
la	ᯞᯮ	ᯞᯮ	ligature
laSima	ᯟᯮ	ᯟᯮ	attachment
nya	ᯠᯮ	ᯠᯮ	ligature
ca	ᯡᯮ
nda	ᯢᯮ
mba	ᯣᯮ

Glyph variants

Several Batak characters can be written with different glyphs. While the additional glyphs described here are not essential for reading or writing Batak, they may be necessary to faithfully transcribe existing documents. OpenType offers two features to support variant glyphs: stylistic sets (ss01–ss20) and character variants (cv01–cv99) (Microsoft 2018). Stylistic sets are intended for cases where variant glyphs are used in logically defined sets that should change together. So far, no such logically defined sets have been identified for Batak. Character variants are intended for cases where variant glyphs are not systematically related, which appears to be the case for Batak.

Character variants (and stylistic sets) are always specific to a particular font. The character variants shown here are derived from Kozok 2009a and implemented in the font used in this article. In some cases only the combination with the vowel sign ◌ᯮ uSign is affected; in other cases that combination is irrelevant (and indicated in gray below) because the variant is only known to be used for Karo, which does not use that vowel.

Variant feature	Consonant	Consonant with vowel u	Variant number	Variant of consonant	Variant of consonant with vowel u
`cv01`	ᯀ	ᯀᯮ	1	ᯀ	ᯀᯮ
`cv02`	ᯂ	ᯂᯮ	1	ᯂ	ᯂᯮ
`cv03`	ᯇ	ᯇᯮ	1	ᯇ	ᯇᯮ
`cv04`	ᯉ	ᯉᯮ	1	ᯉ	ᯉᯮ
`cv05`	ᯌ	ᯌᯮ	1	ᯌ	ᯌᯮ
`cv06`	ᯐ	ᯐᯮ	1	ᯐ	ᯐᯮ
`cv07`	ᯒ	ᯒᯮ	1	ᯒ	ᯒᯮ
`cv08`	ᯔ	ᯔᯮ	1	ᯔ	ᯔᯮ
			2	ᯔ	ᯔᯮ
`cv09`	ᯘ	ᯘᯮ	1	ᯘ	ᯘᯮ
			2	ᯘ	ᯘᯮ
			3	ᯘ	ᯘᯮ
			4	ᯘ	ᯘᯮ
`cv10`	ᯞ	ᯞᯮ	1	ᯞ	ᯞᯮ
`cv11`	ᯡ	ᯡᯮ	1	ᯡ	ᯡᯮ
`cv12`	ᯢ	ᯢᯮ	1	ᯢ	ᯢᯮ
			2	ᯢ	ᯢᯮ
			3	ᯢ	ᯢᯮ
`cv13`	ᯣ	ᯣᯮ	1	ᯣ	ᯣᯮ

Character variants can be enabled in browsers using the CSS property font-feature-settings, specifying the variant feature and the variant number. For example, to replace the default glyph for ᯔ with its variant number 2, ᯔ, use font-feature-settings: "cv08" 2. Legacy Edge appears to disable the liga feature when font-feature-settings is used, so for compatibility with this browser you may want to explicitly enable it and use: font-feature-settings: "liga", "cv08" 2. Character variants for different characters can be enabled in combination; for example, to use both ᯐ and ᯞᯮ, use font-feature-settings: "liga", "cv06" 1, "cv10" 1.
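
When such declarations are produced by a script, for example for a specimen page, a small helper keeps the syntax consistent. This Python sketch builds the declaration from a mapping of variant features to variant numbers and adds "liga" by default for the Legacy Edge issue mentioned above; the function name and its parameters are invented for this example.

```python
import re

# Character variant features are named cv01 through cv99.
CV_TAG = re.compile(r"cv(0[1-9]|[1-9][0-9])")


def font_feature_settings(variants: dict, keep_liga: bool = True) -> str:
    """Build a CSS font-feature-settings declaration for character variants."""
    parts = ['"liga"'] if keep_liga else []
    for tag, number in variants.items():
        if not CV_TAG.fullmatch(tag):
            raise ValueError(f"not a character variant feature: {tag}")
        parts.append(f'"{tag}" {number}')
    return "font-feature-settings: " + ", ".join(parts) + ";"


if __name__ == "__main__":
    print(font_feature_settings({"cv08": 2}))
    print(font_feature_settings({"cv06": 1, "cv10": 1}))
```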

The implementation of a character variant feature is in many cases an alternate substitution that offers one or more alternative glyphs. For example, the variants for ᯔ are implemented as:

feature cv08 {
    sub ma-batak from [ma-batak.v1 ma-batak.v2];
    sub ma_uSign-batak from [ma_uSign-batak.v1 ma_uSign-batak.v2];
} cv08;

However, where only one alternative is needed (i.e., the feature is essentially a binary on-off feature), other substitutions can be used. For example, the variant for ᯇᯮ is implemented by reusing the contextual lookup for the language-dependent feature for that glyph sequence:

feature cv03 {
    lookup flip_uSign_batak;
} cv03;

Similarly, the variant for ᯞᯮ is implemented by breaking up the ligature into its components:

feature cv10 {
    sub la_uSign-batak by la-batak uSign-batak;
} cv10;

Glyph positioning

Mark-to-base positioning

In the Batak script, the position of an above-base mark can be semantically significant: ◌ᯩ is the vowel ee, but ◌ᯰ is the final consonant ng, and ◌᯦ is the consonant modifier tompi, while ◌ᯱ is the final consonant h. We therefore have to distinguish topleft, top, and topright marks. Combinations do occur; ᯚ᯦ᯩᯰ is a valid syllable. There’s only one below-base mark, which tends to sit in the bottom right corner, so a single bottomright class suffices here. The classes are:

topleft: ◌ᯩ eeSign.
top: ◌᯦ tompi, ◌ᯯ uSignSima.
topright: ◌ᯨ eSignPak, ◌ᯭ oSignKaro, ◌ᯰ ngSign, ◌ᯱ hSign.
bottomright: ◌ᯮ uSign and its variants.

For correct text, base glyphs need anchors as follows:

topleft: All consonants and their ligatures and variants, including ᯀ a and ᯁ aSima (which can carry dependent vowels and final consonants), as well as all consonant placeholders.
top: Only ᯂ ha, ᯃ haSima, ᯄ haMandai, ᯘ sa, ᯙ saSima, ᯚ saMandai, and their ligatures and variants, as well as all consonant placeholders.
topright: All letters and their ligatures, all consonant placeholders, as well as the spacing dependent vowels ◌ᯧ eSign, ◌ᯪ iSign, ◌ᯫ iSignKaro, and ◌ᯬ oSign.
bottomright: All consonants, including ᯀ a and ᯁ aSima, and their variants, as well as all consonant placeholders.
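
The anchor rules above can be written down as a lookup function, for example to check a font source for missing anchors. The Python sketch below uses the glyph names from the Characters table without the “-batak” suffix and leaves out ligatures and variants; the function name is made up for this example.

```python
CONSONANTS = set((
    "a aSima ha haSima haMandai ba baKaro pa paSima na naMandai wa waSima "
    "waPak ga gaSima ja da ra raSima ma maSima taSouth taNorth sa saSima "
    "saMandai ya yaSima nga la laSima nya ca nda mba"
).split())
INDEPENDENT_VOWELS = {"i", "u"}
PLACEHOLDERS = {"nbspace", "dottedCircle"}
TOP_BASES = {"ha", "haSima", "haMandai", "sa", "saSima", "saMandai"}
SPACING_VOWEL_SIGNS = {"eSign", "iSign", "iSignKaro", "oSign"}


def anchors_for(glyph: str) -> set:
    """Return the anchors a base glyph needs for correct text."""
    anchors = set()
    if glyph in CONSONANTS | PLACEHOLDERS:
        anchors |= {"topleft", "bottomright"}
    if glyph in TOP_BASES | PLACEHOLDERS:
        anchors.add("top")
    if glyph in CONSONANTS | INDEPENDENT_VOWELS | PLACEHOLDERS | SPACING_VOWEL_SIGNS:
        anchors.add("topright")
    return anchors


if __name__ == "__main__":
    for name in ("pa", "sa", "i", "iSign", "dottedCircle", "uSign"):
        print(name, sorted(anchors_for(name)))
```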

Mark-to-mark positioning

For correct text, mark-to-mark positioning is necessary only for those marks that can sit above the right-hand side of a base consonant: The final consonants ◌ᯰ ngSign and ◌ᯱ hSign above the vowels ◌ᯨ eSignPak and ◌ᯭ oSignKaro. If the final consonant marks occur together with other above-base vowels or with ◌᯦ tompi, they are positioned to their right, as discussed above. In order to make incorrect text obvious, it is useful to add mark-to-mark positioning for combinations that shouldn’t occur but are not prohibited by syllable validation (indicated in red).

	◌ᯩ	◌᯦	◌ᯯ	◌ᯨ	◌ᯭ	◌ᯰ	◌ᯱ	◌ᯮ
◌ᯩ	◌ᯩᯩ
◌᯦		◌᯦᯦	◌᯦ᯯ
◌ᯯ		◌ᯯ᯦	◌ᯯᯯ
◌ᯨ				◌ᯨᯨ	◌ᯨᯭ	◌ᯨᯰ	◌ᯨᯱ
◌ᯭ				◌ᯭᯨ	◌ᯭᯭ	◌ᯭᯰ	◌ᯭᯱ
◌ᯰ				◌ᯰᯨ	◌ᯰᯭ	◌ᯰᯰ	◌ᯰᯱ
◌ᯱ				◌ᯱᯨ	◌ᯱᯭ	◌ᯱᯰ	◌ᯱᯱ
◌ᯮ								◌ᯮᯮ
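
The table pairs each mark with the marks of its own class. A short Python sketch can enumerate these pairs and separate the ones needed for correct text from the ones that only make incorrect text visible; the names are chosen for this example, and variants of uSign are left out.

```python
MARK_CLASSES = {
    "topleft": ["eeSign"],
    "top": ["tompi", "uSignSima"],
    "topright": ["eSignPak", "oSignKaro", "ngSign", "hSign"],
    "bottomright": ["uSign"],
}
# In correct text, only a final consonant sign above one of these vowels occurs.
LOWER_IN_CORRECT_TEXT = {"eSignPak", "oSignKaro"}
UPPER_IN_CORRECT_TEXT = {"ngSign", "hSign"}


def mark_pairs() -> list:
    """All (first mark, second mark) pairs that need mark-to-mark positioning."""
    return [
        (first, second)
        for marks in MARK_CLASSES.values()
        for first in marks
        for second in marks
    ]


def occurs_in_correct_text(first: str, second: str) -> bool:
    """True if the pair can occur in correct text."""
    return first in LOWER_IN_CORRECT_TEXT and second in UPPER_IN_CORRECT_TEXT


if __name__ == "__main__":
    pairs = mark_pairs()
    correct = [pair for pair in pairs if occurs_in_correct_text(*pair)]
    print(len(pairs), "pairs in total,", len(correct), "in correct text:", correct)
```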

Materials

The following materials are available to help with the construction of fonts for the Batak script:

Complete feature code for the Batak font used in this document. You can use this code as a starting point for your own font.
A visualization of that feature code, produced by the Röntgen tool.
Glyph data and sidebar data for the Glyphs font editor, version 2.6.2 and higher. Install these files as follows:
- GlyphData.xml in ~/Library/Application Support/Glyphs/Info.
- Groups.plist in the same folder.
- BatakTemplate.pdf in the Icons folder within the above folder.

Acknowledgments

I’d like to thank Uli Kozok for his generous advice on the requirements for Batak fonts, as well as for licensing the glyphs of his Batak fonts. I also thank முத்து நெடுமாறன் (Muthu Nedumaran) and Ben Mitchell for their comments on drafts of this article.

References

Corbett 2017: David Corbett: Forbid Batak killers after vowel signs. GitHub 2017.

Everson and Kozok 2008: Michael Everson, Uli Kozok: Proposal for encoding the Batak script in the UCS. Unicode Consortium, 2008.

Kozok 1996: Uli Kozok: Bark, Bones, and Bamboo: Batak Traditions of Sumatra. In: Ann Kumar and John McGlynn: Illuminations. The Writing Traditions of Indonesia. The Lontar Foundation / Weatherill 1996. Available at Internet Archive.

Kozok 2009a: Uli Kozok: Surat Batak. Sejarah Perkembangan Tulisan Batak. Ecole française d’Extrême-Orient / Kepustakaan Populer Gramedia 2009.

Kozok 2009b: Uli Kozok: Unduh Font Batak, in particular Karo 1.1. for Windows, Mandailing 1.1. for Windows, Pakpak 1.1. for Windows, Simalungun 1.1. for Windows, Variants 1.1. for Windows. 2009.

Microsoft 2018: Feature tags. Microsoft, dated 08/14/2018. In particular, character variants and stylistic sets.

Microsoft 2020: Creating and supporting OpenType fonts for the Universal Shaping Engine. Microsoft, dated 07/31/2020.

Nedumaran and Lindenberg 2017: Muthu Nedumaran, Norbert Lindenberg: Creating fonts for Brahmic scripts with OpenType and Apple Advanced Typography.

Unicode 2020: The Unicode Consortium: The Unicode Standard, Version 13.0. The Unicode Consortium, 2020. For Batak, section 17.6 Batak, pages 702-703. For abugidas, section 6.1 Writing Systems, pages 255-256. For canonical ordering, section 3.11 Normalization Forms, pages 135-136.
