February 14, 2017 10:03 am

Cognitive Services APIs: Speech

Speech recognition is in many ways at the heart of Artificial Intelligence. The 18th Century essayist Samuel Johnson captured this beautifully when he wrote, “Language is the dress of thought.” If the ultimate goal of AI research is a machine that thinks like a human, a reasonable starting point would be to create a machine that understands how humans think. To understand how humans think, in turn, requires an understanding of what humans say.

In the previous post in this series, you learned about the Cognitive Services Vision APIs. In this post, we’re going to complement that with an overview of the Speech APIs. The Cognitive Services Speech APIs are grouped into three categories:

  • Bing Speech—convert spoken audio to text and, conversely, text to speech.
  • Speaker Recognition—identify speakers and use speech recognition for authentication.
  • Custom Speech Service (formerly CRIS)—overcome speech recognition barriers like background noise and specialized vocabulary.

A good way to understand the relationship between the Bing Speech API and the other APIs is this: Bing Speech takes raw speech and turns it into text without knowing anything about the speaker, while Custom Speech Service and Speaker Recognition go further, applying extra processing to clean up the raw speech or to compare it against other speech samples.

Bing Speech for UWP

As a UWP developer, you have several options for accessing speech-to-text capabilities. You can access the UWP Speech APIs found in the Windows.Media.SpeechRecognition namespace. You can also integrate Cortana into your UWP app. Alternatively, you can go straight to the Bing Speech API which underlies both of these technologies.

Bing Speech lets you do text-to-speech and speech-to-text through REST calls to Cognitive Services. The Cognitive Services website provides samples for iOS, Android and JavaScript. There’s also a client library NuGet package if you are working in WPF. For UWP, however, you will use the REST APIs.

As with the other Cognitive Services offerings, you first need to pick up a subscription key for Bing Speech in order to make calls to the API. In UWP, you then need to record microphone input using the MediaCapture class and encode it before sending it to Bing Speech. (Gotcha Warning — be sure to remember to check off the Microphone capability in your project’s app manifest file so the mic can be accessed, otherwise you may spend hours wondering why the code doesn’t work for you.)


// set up the microphone for audio-only capture
var CaptureMedia = new MediaCapture();
var captureInitSettings = new MediaCaptureInitializationSettings();
captureInitSettings.StreamingCaptureMode = StreamingCaptureMode.Audio;
await CaptureMedia.InitializeAsync(captureInitSettings);
// encode the recording as WAV
MediaEncodingProfile encodingProfile = MediaEncodingProfile.CreateWav(AudioEncodingQuality.Medium);
// record into an in-memory stream (AudioStream is declared elsewhere so it can be read back later)
AudioStream = new InMemoryRandomAccessStream();
await CaptureMedia.StartRecordToStreamAsync(encodingProfile, AudioStream);
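
Before the audio can be uploaded, the recording has to be stopped and the captured data copied out of the in-memory stream. Here is a minimal sketch of that step, assuming the MediaCapture and InMemoryRandomAccessStream objects from the snippet above. The RecordingHelper class and its StopAndReadAsync method are hypothetical names used for illustration and are not part of any SDK.

```csharp
using System;
using System.Threading.Tasks;
using Windows.Media.Capture;
using Windows.Storage.Streams;

// Hypothetical helper for illustration only.
public static class RecordingHelper
{
    // Stops an in-progress recording and returns the captured audio as a byte array.
    public static async Task<byte[]> StopAndReadAsync(
        MediaCapture captureMedia, InMemoryRandomAccessStream audioStream)
    {
        await captureMedia.StopRecordAsync();

        var bytes = new byte[audioStream.Size];
        using (var reader = new DataReader(audioStream.GetInputStreamAt(0)))
        {
            // Load the whole stream into the reader's buffer, then copy it out.
            await reader.LoadAsync((uint)audioStream.Size);
            reader.ReadBytes(bytes);
        }
        return bytes;
    }
}
```

The returned array can then be used wherever a byte array of recorded audio is needed.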

Once you are done recording, you can use the standard HttpClient class to send the audio stream to Cognitive Services for processing, like so…


// build REST message
cookieContainer = new CookieContainer();
handler = new HttpClientHandler() { CookieContainer = cookieContainer };
client = new HttpClient(handler);
// authenticate the REST call
client.DefaultRequestHeaders.TryAddWithoutValidation("Authorization", _subscriptionKey);
// pass in the Bing Speech endpoint
request = new HttpRequestMessage(HttpMethod.Post, uri);
// pass in the audio stream
request.Content = new ByteArrayContent(fileBytes);
// Content-Type is a content header, so set it on the content rather than on the client defaults
request.Content.Headers.TryAddWithoutValidation("Content-Type", "audio/wav; samplerate=16000");
// make REST call to CogSrv
response = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead, cancellationToken);
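
Once the call returns, the recognition result still has to be read from the response. The sketch below only checks the status code and returns the raw body as a string. It makes no assumption about the shape of the payload, which you should confirm against the API documentation. SpeechResponseReader and ReadBodyAsync are hypothetical names.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class SpeechResponseReader
{
    // Returns the raw response body; throws HttpRequestException on a non-success status code.
    public static async Task<string> ReadBodyAsync(HttpResponseMessage response)
    {
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```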

Getting these calls right may seem a bit hairy at first. To make integrating Bing Speech easier, Microsoft MVP Gian Paolo Santopaolo has created a UWP reference app on GitHub with several useful helper classes you can incorporate into your own speech recognition project. This reference app also includes a sample for reversing the process and doing text-to-speech.

Speaker Recognition

While the Bing Speech API can figure out what you are saying without knowing anything about who you are as a speaker, the Speaker Recognition API in Cognitive Services is all about figuring out who you are without caring about what you are specifically saying. There’s a nice symmetry to this. Using machine learning, the Speaker Recognition API finds qualities in your voice that identify you almost as well as your fingerprints or retinal pattern do.

This API is typically used for two purposes: identification and verification. Identification allows a voice to be compared to a group of voices in order to find the best match. This is the auditory equivalent to how the Cognitive Services Face API matches up faces that resemble each other.

Speaker verification allows you to use a person’s voice as part of a two-factor login mechanism. For verification to work, the speaker must say a specific, pre-selected passphrase like “apple juice tastes funny after toothpaste” or “I am going to make him an offer he cannot refuse.” The initial recording of a passphrase to compare against is called enrollment. (It hardly needs to be said but—please don’t use “password” for your speaker verification passphrase.)

There is a client library that supports speaker enrollment, speaker verification and speaker identification. Per usual, you need to sign up for a Speaker Recognition subscription key to use it. You can add the client library to your UWP project in Visual Studio by installing the Microsoft.ProjectOxford.SpeakerRecognition NuGet package.

Using the media capture code from the Bing Speech sample above to record from the microphone, and assuming the user has already enrolled a passphrase under her Speaker Id (a Guid), verification is as easy as calling the client library’s VerifyAsync method with the audio stream and the Speaker Id as parameters.


string _subscriptionKey;
Guid _speakerId;
Stream audioStream;

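// async Task (rather than async void) lets callers await the call and observe exceptions
public async Task VerifySpeaker()
{
    var serviceClient = new SpeakerVerificationServiceClient(_subscriptionKey);
    // compare the recorded passphrase against the speaker's enrollment
    Verification response = await serviceClient.VerifyAsync(audioStream, _speakerId);

    if (response.Result == Result.Accept)
    {
        // verification successful
    }
}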

Sample projects are available showing how to use Speaker Recognition with Android, Python and WPF. Because of the close similarities between UWP and WPF, you will probably find the last sample useful as a reference for using this Cognitive Service in your UWP app.

Custom Speech Service

You already know how to use the Bing Speech speech-to-text capability introduced at the top of this post. That Cognitive Service is built around generalized language models to work for most people most of the time. But what if you want to do speech recognition involving specialized jargon or vocabulary? To handle these situations, you might need a custom language model rather than the one used by the speech-to-text engine in Bing Speech.

Along the same lines, the generalized acoustic model that Bing Speech relies on may not work well for you if your app is likely to be used in an atypical acoustic environment like an aircraft hangar or a factory floor.

Custom Speech Service lets you build custom language models and custom acoustic models for your speech-to-text engine. You can then set these up as custom REST endpoints to call from your app. These endpoints can also be used from any device and any software platform that can make REST calls. In effect, the service lets you tune speech recognition to your own vocabulary and acoustic conditions. And because only the endpoint changes, any code you have already written against the Bing Speech API should work without alteration other than the Uri you are targeting.
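
One way to take advantage of this is to keep the endpoint out of the request-building code and pass it in as a parameter. The sketch below is a hypothetical helper, not an SDK API. It assumes the HttpClient has already been configured with whatever authentication header the service expects, and it reuses the WAV content type from the Bing Speech sample.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class SpeechEndpointClient
{
    // Posts WAV audio to whichever speech-to-text endpoint the caller supplies.
    public static Task<HttpResponseMessage> SendAudioAsync(
        HttpClient client, Uri endpoint, byte[] wavBytes, CancellationToken cancellationToken)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, endpoint)
        {
            Content = new ByteArrayContent(wavBytes)
        };
        // Content-Type belongs on the content headers, not on the client defaults.
        request.Content.Headers.TryAddWithoutValidation(
            "Content-Type", "audio/wav; samplerate=16000");
        return client.SendAsync(request, cancellationToken);
    }
}
```

Switching from the general-purpose recognizer to a custom model then comes down to passing a different Uri to SendAudioAsync, with no other change to the calling code.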

Wrapping Up

In this post, we went over the Bing Speech APIs for speech-to-text and text-to-speech as well as the extra APIs for cleaning up raw speech input and doing comparisons and verification using speech input. In the next post in the Cognitive APIs Series, we’ll take a look at using the Language Understanding Intelligent Service (LUIS) to derive meaning from speech in order to figure out what people really want when they ask for something. In the meantime, here are some additional resources so you can learn more about the Speech APIs on your own.
