June 1, 2016 10:00 am

Introducing the Speech Synthesis API in Microsoft Edge


Starting with the Windows 10 Anniversary Update, Microsoft Edge will support the Speech Synthesis APIs defined in the W3C Web Speech API Specification. These APIs allow websites to convert text to audible speech with customizable voice and language settings. With them, website developers can add and control text-to-speech features specific to their page content and design.

Speech Synthesis is useful whenever narration might be applied. Our implementation also supports Speech Synthesis Markup Language (SSML) Version 1.0 to provide further control over the speech output.

Speech Synthesis is enabled by default in Windows Insider Preview builds starting with EdgeHTML 14.14316 – try it out with our new Speech Synthesis Demo on Test Drive!

API Overview

The Web Speech API Specification defines a SpeechSynthesisUtterance interface that lets JavaScript set the speech text, along with attributes that select the voice and control the language, volume, rate and pitch of the spoken output. Other interfaces allow playback control and monitoring the state of the synthesized speech.

Microsoft Edge implements these SpeechSynthesis interfaces:

  • SpeechSynthesis: Provides speech playback control and state
  • SpeechSynthesisUtterance: Controls speech content, voice and pronunciation
  • SpeechSynthesisEvent: Provides state information on the current utterance
  • SpeechSynthesisVoice: Describes an available speech service voice
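To make the division of labor concrete, here is a minimal sketch of enumerating SpeechSynthesisVoice entries. The describeVoice helper is our own illustration, not part of the API:

```javascript
// Format a voice entry for display. The object shape mirrors
// SpeechSynthesisVoice (name, lang, default).
function describeVoice(voice) {
  return voice.name + ' (' + voice.lang + ')' +
         (voice.default ? ' [default]' : '');
}

// Browser usage (illustrative): list the installed voices. Some engines
// populate the list asynchronously, hence the voiceschanged handler.
// window.speechSynthesis.onvoiceschanged = function () {
//   window.speechSynthesis.getVoices().forEach(function (v) {
//     console.log(describeVoice(v));
//   });
// };
```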

Our implementation of these Speech Synthesis APIs is based on the WinRT Windows.Media.SpeechSynthesis APIs, which directly support most of the W3C Speech Synthesis interfaces. There are a few SpeechSynthesis details that we don’t support in this release, which we’re evaluating for future releases:

  • Playback pitch: Used to vary the voice pitch on playback.
  • onmark event: Used to indicate that an SSML mark tag has been reached.
  • onboundary event: Used to signal boundaries of spoken words or sentences.
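Given these gaps, pages that rely on pitch or the mark/boundary events may want to feature-detect before depending on them. A hedged sketch (the speechSupport helper is our own; the property checks are the standard JavaScript 'in' test):

```javascript
// Probe which optional SpeechSynthesis details an engine exposes.
// The global window object is taken as a parameter for testability.
function speechSupport(win) {
  var supported = 'speechSynthesis' in win &&
                  'SpeechSynthesisUtterance' in win;
  var utter = supported ? new win.SpeechSynthesisUtterance('') : null;
  return {
    synthesis: supported,
    pitch: supported && 'pitch' in utter,
    onmark: supported && 'onmark' in utter,
    onboundary: supported && 'onboundary' in utter
  };
}

// Browser usage (illustrative):
// var caps = speechSupport(window);
// if (caps.onboundary) { /* safe to highlight words as they are spoken */ }
```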

Speech Synthesis Demo

To illustrate these new speech features, we’ve published a Speech Synthesis Demo on Test Drive. It accepts arbitrary text input (try something really long) and exposes parameters like voice, language, rate and volume that allow tuning of the resulting speech.

The demo includes sample code that uses SpeechSynthesisUtterance to take your selected text and speech settings and synthesize them into spoken audio.

This sample reads in data from the demo HTML, and then uses window.speechSynthesis.speak to start playback. It shows how simple it is to add basic speech synthesis features to your website.
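The listing itself isn’t reproduced in this copy of the post, but the pattern looks roughly like the sketch below. The collectSettings helper and the form field names are illustrative, not the demo’s actual code; the rate and volume clamps follow the ranges the specification defines (0.1–10 and 0–1):

```javascript
// Gather speech settings from form values into a plain object,
// clamping rate and volume to their specified ranges.
function collectSettings(form) {
  var rate = Number(form.rate);
  var volume = Number(form.volume);
  return {
    text: form.text,
    lang: form.lang || 'en-US',
    rate: Math.min(10, Math.max(0.1, isNaN(rate) ? 1 : rate)),
    volume: Math.min(1, Math.max(0, isNaN(volume) ? 1 : volume))
  };
}

// Browser usage (illustrative):
// var s = collectSettings({ text: textBox.value, lang: langSelect.value,
//                           rate: rateSlider.value, volume: volSlider.value });
// var utterance = new SpeechSynthesisUtterance(s.text);
// utterance.lang = s.lang;
// utterance.rate = s.rate;
// utterance.volume = s.volume;
// window.speechSynthesis.speak(utterance);
```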

Speech Synthesis Markup Language (SSML)

SSML allows speech voices and content to be expressed in XML, giving direct control over a variety of speech characteristics. You can try this by pasting SSML-derived text into the Speech Demo.

Here’s an example of JavaScript SSML from the W3C spec:
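The original listing hasn’t survived in this copy; the snippet below is an illustrative reconstruction in the spirit of the spec’s example rather than its verbatim text, and the voice name and content are made up:

```javascript
// Build an SSML 1.0 document as concatenated JavaScript strings.
var ssml =
  '<?xml version="1.0"?>' +
  '<speak version="1.0" ' +
  'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
  '<voice name="Microsoft Zira">Hello, ' +
  '<prosody rate="slow">this part is spoken slowly.</prosody>' +
  '</voice>' +
  '</speak>';

// Per the specification, the utterance text may be a well-formed SSML
// document instead of plain text:
// var utterance = new SpeechSynthesisUtterance(ssml);
// window.speechSynthesis.speak(utterance);
```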

If we concatenate the SSML content, we get:
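The concatenated result is not preserved in this copy; an SSML 1.0 document of the kind the demo accepts looks like this (voice name and text are illustrative):

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <voice name="Microsoft Zira">Hello,
    <prosody rate="slow">this part is spoken slowly.</prosody>
  </voice>
</speak>
```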

Copy and paste this into the Speech Synthesis Demo text box to see how the voice selections affect the synthesized output.

Languages

The language setting in our Test Drive demo works with any voice language pack installed in Windows 10. A system comes with a primary language installed by default; others must be added. Here’s how to add an input language to your PC:

  • Go to Settings > Time & language > Region & language.
  • Select Add a language.
  • Select the language you want to use from the list, then choose which region’s version you want to use. Your download will begin immediately.

Once installed, the language pack will be used to alter the pronunciation of foreign language text.
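Once a pack is installed, a page can route text to it by matching the desired language against the installed voices. A hedged sketch (findVoiceForLang is our own helper; the voice objects mirror SpeechSynthesisVoice):

```javascript
// Return the first voice whose BCP 47 tag starts with the requested
// language prefix (e.g. 'fr' matches 'fr-FR'), or null if no matching
// language pack is installed.
function findVoiceForLang(voices, lang) {
  var prefix = lang.toLowerCase();
  for (var i = 0; i < voices.length; i++) {
    if (voices[i].lang.toLowerCase().indexOf(prefix) === 0) {
      return voices[i];
    }
  }
  return null;
}

// Browser usage (illustrative):
// var voice = findVoiceForLang(window.speechSynthesis.getVoices(), 'fr');
// if (voice) {
//   utterance.voice = voice;
//   utterance.lang = voice.lang;
// }
```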

We’re excited to share this release of HTML5 speech capabilities in Microsoft Edge. We prioritized Speech Synthesis based on feedback from users and developers, and we look forward to refining our speech support in the future with speech synthesis feature enhancements and speech recognition capabilities.

Try it out and let us know what you think!

– Steve Becker, Senior Software Engineer
– Jerry Smith, Senior Program Manager

Join the conversation

  1. re: test page – https://developer.microsoft.com/en-us/microsoft-edge/testdrive/demos/speechsynthesis/

    When testing with Canary (I don’t have the latest build yet), the dropdown voices combo is not being cleared between each call to getVoices, resulting in duplicate listings of voices.

    I also expected that getVoices would be filtered by the selected language code (e.g. filter for language values that begin with 'en' to list en-US or en-UK local voices), so that only locally installed voices that support the selected language code are displayed in the voice dropdown.

    Language codes offered do not match those used by chromium. Here is a transcription of narrator voices with the matching language codes from chromium getVoices.
    var narratorVoices = [
        { 'lang': 'en-US', 'gender': 'Female', 'name': 'Zira', 'displayname': 'Microsoft Zira (en-US, female)' },
        { 'lang': 'en-US', 'gender': 'Male', 'name': 'David', 'displayname': 'Microsoft David (en-US, male)' },
        { 'lang': 'en-GB', 'gender': 'Female', 'name': 'Hazel', 'displayname': 'Microsoft Hazel (en-GB, female)' },
        { 'lang': 'en-IN', 'gender': 'Female', 'name': 'Heera', 'displayname': 'Microsoft Heera (en-IN, female)' },
        { 'lang': 'fr-FR', 'gender': 'Female', 'name': 'Hortense', 'displayname': 'Microsoft Hortense (fr-FR, female)' },
        { 'lang': 'de-DE', 'gender': 'Female', 'name': 'Hedda', 'displayname': 'Microsoft Hedda (de-DE, female)' },
        { 'lang': 'es-ES', 'gender': 'Female', 'name': 'Helena', 'displayname': 'Microsoft Helena (es-ES, female)' },
        { 'lang': 'zh-CN', 'gender': 'Female', 'name': 'Huihui', 'displayname': 'Microsoft Huihui (zh-CN, female)' },
        { 'lang': 'zh-HK', 'gender': 'Female', 'name': 'Tracy', 'displayname': 'Microsoft Tracy (zh-HK, female)' },
        { 'lang': 'zh-TW', 'gender': 'Female', 'name': 'Hanhan', 'displayname': 'Microsoft Hanhan (zh-TW, female)' },
        { 'lang': 'jp-JP', 'gender': 'Female', 'name': 'Haruka', 'displayname': 'Microsoft Haruka (ja-JP, female)' },
        { 'lang': 'ko-KR', 'gender': 'Female', 'name': 'Heami', 'displayname': 'Microsoft Heami (ko-KR, female)' },
        { 'lang': 'es-MX', 'gender': 'Female', 'name': 'Sabina', 'displayname': 'Microsoft Sabina (es-MX, female)' },
        { 'lang': 'it-IT', 'gender': 'Female', 'name': 'Elsa', 'displayname': 'Microsoft Elsa (it-IT, female)' },
        { 'lang': 'ru-RU', 'gender': 'Female', 'name': 'Irina', 'displayname': 'Microsoft Irina (ru-RU, female)' },
        { 'lang': 'pl-PL', 'gender': 'Female', 'name': 'Paulina', 'displayname': 'Microsoft Paulina (pl-PL, female)' },
        { 'lang': 'pt-BR', 'gender': 'Female', 'name': 'Maria', 'displayname': 'Microsoft Maria (pt-BR, female)' }
    ];

    Regards.

  2. Cool!
    One thing that bothers me is that “HTML5 speech capabilities” in the last paragraph – how is the Web Speech API related to HTML5? Other than being published by the W3C website (as a community group report versus as a recommendation), I do not remember seeing a connection to HTML5…

  3. Thanks for sharing your solution.
    One small issue: Chrome keeps loading voices into the dropdown list when onvoiceschanged is triggered.
    Here is a fix:
    // Variables
    var voices = []; // I use this in more than one place

    var loadVoices = function () {
        var voicesLoaded = alreadyLoaded();
        // Chrome keeps reloading the voices; load only once
        if (!voicesLoaded) {
            voices = Speaker.getVoices();

            voices.forEach((voice) => {
                var option = document.createElement('option');
                option.value = voice.name;
                option.innerHTML = voice.name;
                voiceSelect.appendChild(option);
            });
        }
    };
    var alreadyLoaded = function () {
        // Any options in the select element?
        var optionCount = voiceSelect.length;
        return optionCount > 0;
    };