February 23, 2015 7:01 am

Windows shapes the world’s languages

For Windows to be a truly global product, anyone in the world should be able to type in their language. The first step to unlocking text input for the world is to be able to display any of the world’s languages. This is a challenging task, one which most people don’t need to worry about because their language is already supported, but for millions of people around the world getting basic text support has been a problem. The stumbling block in most such cases is a little-known component called a “shaping engine”. A shaping engine is used for so-called complex text layout, which is needed for about half of the world’s writing systems. For many years, Windows customers have been able to install their own fonts and keyboards but before Windows 10, if there was no shaping engine for your script things wouldn’t look right.

Incorrect: N’ko script without a shaping engine.

Correct: N’ko script with a shaping engine.

In order to get things to look right, that is, to get a complex script to render correctly on Windows, linguists and software engineers had to take time to study all of the features and requirements of that script and craft a shaping engine that would provide the necessary support. This meant that just a small number of new writing systems could be added in any release of Windows. By Windows 8.1, it had taken around 15 years to build shaping engines for 27 of the most widely used complex writing systems. But if your script wasn’t one of these 27, you were out of luck.

“The Universal Shaping Engine is the proverbial ‘game changer’ for complex script font development, especially for new Unicode scripts that might otherwise languish unsupported in software and fonts for years.” – John Hudson, CEO, Tiro Typeworks Ltd.

In recent years, the Unicode Standard has made amazing progress in setting standards for how all of the world’s writing systems should be supported in the digital era. The most recent version of Unicode includes 125 different writing systems, 56 of which need shaping support. The Script Encoding Initiative at the University of California, Berkeley estimates there are almost 50 more complex scripts yet to be added to Unicode. So for some language communities and scholars it looked like it would be many years before they would be able to type their language on Windows, if that day would ever come.

This problem led a small team of engineers in Microsoft’s Operating Systems Group to think about how to design a shaping engine so that any script defined in Unicode could be displayed correctly without the time and effort required to create a dedicated shaping engine. The result is a new kind of shaping engine, a “universal shaping engine”, that is capable of supporting any complex script when provided with a suitable font.

There are four parts to this engine that make it universal:

  1. It consumes data directly from the Unicode Standard
  2. It uses a “universal cluster model” that models the superset of human writing systems
  3. It enables OpenType font features to support cutting edge typography
  4. The specification is available publicly

Consuming data directly from The Unicode Standard

The team worked with experts from the Unicode Technical Committee to make sure that all of the research needed to shape each script is made available in a machine-readable format, and that this data will be kept up to date when new scripts are added to the standard. As a result, the burden of script research is contained within Unicode’s script encoding process, so Microsoft doesn’t need to do additional research that would delay adding support for a new script. This approach makes it easy for Windows to keep current with the latest version of Unicode.
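
To make "machine-readable" concrete, here is a minimal Python sketch of how an engine might load per-character shaping categories from a data file. It assumes the semicolon-delimited layout used by Unicode Character Database property files. The sample lines, the category names in them, and the helper names `parse_property_file` and `category_of` are illustrative assumptions, not the actual data or code used by Windows.

```python
# Illustrative lines in the UCD property-file layout:
#   code point or range ; property value # comment
# These few lines are an assumption for demonstration, not the full data.
SAMPLE = """\
# Sample shaping categories for a few Devanagari characters
0915..0939 ; Consonant # DEVANAGARI LETTER KA..HA
093E..0940 ; Vowel_Dependent # DEVANAGARI VOWEL SIGN AA..II
094D       ; Virama # DEVANAGARI SIGN VIRAMA
"""


def parse_property_file(text):
    """Return a dict mapping each code point to its property value."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        code_range, value = (field.strip() for field in data.split(";"))
        first, _, last = code_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point or range
        for code_point in range(start, end + 1):
            table[code_point] = value
    return table


CATEGORIES = parse_property_file(SAMPLE)


def category_of(char):
    """Look up a character's category; anything unlisted is 'Other'."""
    return CATEGORIES.get(ord(char), "Other")
```

Because the table is built entirely from data, supporting a newly encoded script in this sketch would mean loading a newer data file rather than writing new code.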

The Universal cluster model

The team did original research into the unsupported scripts of the world to determine formulas that describe how letter forms and other signs can combine. The result is a generalized “cluster model” that is applicable to any known writing system. Here is an example of the formula for the standard cluster in the universal cluster model:

< B | GB > [VS] (CMAbv)* (CMBlw)* (< H B | SUB > [VS] (CMAbv)* (CMBlw)*)* [R]
  [MPre] [MAbv] [MBlw] [MPst]
  (VPre)* (VAbv)* (VBlw)* (VPst)*
  (VMPre)* (VMAbv)* (VMBlw)* (VMPst)*
  (FAbv)* (FBlw)* (FPst)* [FM]

For an explanation of this formula you can check out the full specification.

Here is the same formula as a diagram:

diagram2
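
One way to read the formula is as a regular expression over character classes rather than over characters. The Python sketch below translates it on that assumption: `< a | b >` is taken to mean "one of", `[x]` "optional", and `(x)*` "zero or more". The class names are copied verbatim from the formula; the helper names and the space-delimited token encoding are choices made for this illustration, not part of the specification.

```python
import re


def one(*alternatives):
    """< a | b >: exactly one alternative; each class name ends with a space."""
    return "(?:" + "|".join(alt + " " for alt in alternatives) + ")"


def opt(pattern):
    """[x]: zero or one occurrence."""
    return "(?:" + pattern + ")?"


def many(pattern):
    """(x)*: zero or more occurrences."""
    return "(?:" + pattern + ")*"


def classes(*names, wrap):
    """Apply the same quantifier to several classes, in the given order."""
    return "".join(wrap(one(name)) for name in names)


consonant_marks = classes("CMAbv", "CMBlw", wrap=many)
head = one("B", "GB") + opt(one("VS")) + consonant_marks
conjuncts = many(one("H B", "SUB") + opt(one("VS")) + consonant_marks)

STANDARD_CLUSTER = re.compile(
    head + conjuncts + opt(one("R"))
    + classes("MPre", "MAbv", "MBlw", "MPst", wrap=opt)
    + classes("VPre", "VAbv", "VBlw", "VPst", wrap=many)
    + classes("VMPre", "VMAbv", "VMBlw", "VMPst", wrap=many)
    + classes("FAbv", "FBlw", "FPst", wrap=many)
    + opt(one("FM"))
)


def is_standard_cluster(class_names):
    """True if the sequence of class names forms one standard cluster."""
    text = "".join(name + " " for name in class_names)
    return STANDARD_CLUSTER.fullmatch(text) is not None
```

For example, `is_standard_cluster(["B", "H", "B", "VPre", "VMAbv"])` returns `True`, while `is_standard_cluster(["B", "VBlw", "VAbv"])` returns `False` because the formula places `VAbv` before `VBlw`.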

Enable the full set of font features to support cutting edge typography

The team consulted with a leading font designer and other OpenType font experts to determine the complete set of features that would enable font developers to create fonts that will meet the orthographic requirements for the newly enabled scripts as well as making it possible to do cutting edge typography. For example, Soyombo has a unique property among writing systems in that the length of a vertical bar that is part of each letter must match the longest bar in the line. We were able to use features of the Universal Shaping Engine to show that this script, which is still in the process of being encoded in Unicode, would be supported by the Universal Shaping Engine once it is published in the standard.

Sample of Soyombo script showing the consistent vertical descenders.

“The Universal Shaping Engine enabled us to work out how to encode Soyombo, a Mongolian script with very different clustering and typographic requirements.” – Anshuman Pandey, Post-Doctoral Researcher, University of California, Berkeley

Publish the specification

Microsoft Typography has been publishing specifications for its shaping engines for years so that font developers and other platforms can build compatible systems. By publishing the technical details for the Universal Shaping Engine we enable font developers to understand how to create fonts for the world’s complex scripts so that they will display correctly on Windows 10. We hope that other platforms and text layout software will create compatible systems so that documents and fonts produced on Windows will display correctly on other systems and vice versa. That way language communities, enthusiasts and scholars can share documents in any of the world’s more than 7,000 languages using one of the 125 writing systems in Unicode (and counting).

“The Universal Shaping Engine … is of great importance for all language groups to be able to communicate on computers and through the internet.” – Lorna Evans, Script Technologist

This new engine is part of Windows 10, so if you want to type in Balinese or Tirhuta, or any of the other complex scripts included in Unicode 7.0, the shaping support is there and will keep up with Unicode as each new writing system is added to the standard. Now it’s over to font developers and language communities to take advantage of this support!

What scripts are covered?

The following writing systems are now supported using the Universal Shaping Engine on Windows 10. Some of these were supported on previous versions of Windows using different technology.

Balinese, Batak, Brahmi, Buginese, Buhid, Chakma, Cham, Duployan, Egyptian Hieroglyphs, Grantha, Hanunoo, Javanese, Kaithi, Kayah Li, Kharoshthi, Khojki, Khudawadi, Lepcha, Limbu, Mahajani, Mandaic, Manichaean, Meitei Mayek, Modi, Mongolian, N’Ko, Pahawh Hmong, Phags-pa, Psalter Pahlavi, Rejang, Saurashtra, Sharada, Siddham, Sinhala, Sundanese, Syloti Nagri, Tagalog, Tagbanwa, Tai Le, Tai Tham, Tai Viet, Takri, Tibetan, Tifinagh, and Tirhuta.

The examples below illustrate some of the Universal Shaping Engine’s supported scripts showing various degrees of shaping. All of these examples have been rendered using Unicode sequences in Notepad on Windows 10 with the Universal Shaping Engine. For Balinese, Batak, Lepcha, Sundanese, and Tai Viet, we used fonts from Google’s Noto Project (thanks for the great fonts!). The remaining fonts are Microsoft’s.

Balinese script.

Batak script.

Lepcha.

Brahmi script.

Buhid.

Buginese.

Javanese.

Hanunoo.

Tagbanwa.

Egyptian Hieroglyphs.

Kharoshthi.

N’Ko.

Sundanese.

Tai Viet.

Phags-pa.

Tai Le.

Tifinagh.

What is a shaping engine?

A shaping engine enables scripts with contextual and non-linear typographic requirements to be displayed on a computer. Common types of complex text layout include:

Ligatures

In Latin script, the combination of f and i may form a ligature fi:

Ligatures_Latin

Such ligatures are optional in Latin script, but are required in other writing systems. In Devanagari, the sign क (ka) must form a ligature when combined with the sign ष (ṣa):

Ligatures_Devanagari
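
As a rough illustration of ligature formation, the Python sketch below replaces known glyph sequences with a single ligature glyph, preferring the longest match. The glyph names and the `LIGATURES` table are hypothetical; in a real system the substitutions come from the font, not from the engine. Note that in Unicode text the Devanagari conjunct is encoded with a virama between the two consonants, which is why the second entry has three components.

```python
# Hypothetical glyph names: a real font defines its own substitution rules.
LIGATURES = {
    ("f", "i"): "f_i",                  # optional Latin ligature
    ("ka", "virama", "ssa"): "k_ssa",   # required Devanagari conjunct
}


def apply_ligatures(glyphs, table=LIGATURES):
    """Replace each listed glyph sequence with its ligature, longest first."""
    longest = max(len(sequence) for sequence in table)
    result = []
    i = 0
    while i < len(glyphs):
        # Try the longest candidate sequence first, down to length 2.
        for size in range(min(longest, len(glyphs) - i), 1, -1):
            key = tuple(glyphs[i:i + size])
            if key in table:
                result.append(table[key])
                i += size
                break
        else:
            result.append(glyphs[i])  # no ligature starts at this glyph
            i += 1
    return result
```

With this table, `apply_ligatures(["f", "i", "n"])` yields `["f_i", "n"]`.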

Reordering

In some scripts, vowels may be written in front of a letter that they follow in pronunciation. In Sinhala, the vowel sign ෙ (e) is written to the left of a consonant, such as ක (ka), so the sound ke is written:

Reordering
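
The difference between stored order and displayed order can be sketched in a few lines of Python. This toy version assumes its input is a single cluster and knows about only one pre-base vowel sign. The set `PRE_BASE_VOWEL_SIGNS` and the function name are assumptions for illustration; a real engine takes positional information from Unicode data.

```python
# Toy assumption: the only pre-base sign known here is U+0DD9,
# SINHALA VOWEL SIGN KOMBUVA (the vowel e).
PRE_BASE_VOWEL_SIGNS = {"\u0DD9"}


def visual_order(cluster):
    """Move pre-base vowel signs in front of the rest of one cluster."""
    pre_base = [ch for ch in cluster if ch in PRE_BASE_VOWEL_SIGNS]
    others = [ch for ch in cluster if ch not in PRE_BASE_VOWEL_SIGNS]
    return pre_base + others


# Text is stored in pronunciation order: consonant ka, then vowel sign e.
stored = "\u0D9A\u0DD9"
displayed = visual_order(stored)  # vowel sign first, then the consonant
```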

Joining

Cursive scripts such as Arabic and Mongolian connect letters so that words may be written as a single stroke. This means that letters must change shape depending on whether they occur at the beginning, middle or end of a word. For example, the word “Mongolia” written in Mongolian script is written like this:

Joining
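
A simplified Python sketch of how positional forms might be chosen is shown below. It assumes every letter in the word joins on both sides, which is not true of real cursive scripts, where some letters join on one side only or not at all. The labels `isol`, `init`, `medi` and `fina` are the OpenType feature tags for isolated, initial, medial and final forms; the function itself is hypothetical.

```python
def positional_forms(word):
    """Pair each letter with the form implied by its position in the word.

    Simplifying assumption: every letter connects to both neighbours.
    """
    count = len(word)
    forms = []
    for index, letter in enumerate(word):
        if count == 1:
            form = "isol"   # a lone letter takes its isolated form
        elif index == 0:
            form = "init"   # first letter of the word
        elif index == count - 1:
            form = "fina"   # last letter of the word
        else:
            form = "medi"   # joined on both sides
        forms.append((letter, form))
    return forms
```

The font then supplies a different glyph for each (letter, form) pair, which is how the same letter can look different at the beginning, middle and end of a word.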

To learn more about International Mother Language Day, and what Microsoft is doing to support technology on this front, please visit the Official Microsoft Blog.

Andrew Glass
Program Manager, Operating Systems Group

Updated February 23, 2015 4:59 pm