September 26th, 2016

Speech Recognition in iOS 10

Pierce Boggan
Senior Program Manager

Speech is increasingly becoming a big part of building modern mobile applications. Users expect to be able to interact with apps through speech, so much so that speech is developing into a user interface itself. iOS contains multiple ways for users to interact with their mobile device through speech, mainly via Siri and Keyboard Dictation. iOS 10 vastly improves developers’ ability to build intelligent apps that can be controlled not only via a typical user interface, but by speech as well through the new SiriKit and Speech Recognition APIs.

Prior to iOS 10, Keyboard Dictation was the only way for developers to enable users to interact with their apps through speech. It came with many limitations: it only worked through user interface elements that support TextKit, was limited to live audio, and didn’t expose attributes such as timing and confidence. Speech Recognition in iOS 10 doesn’t require any particular user interface elements, supports both prerecorded and live speech, and provides lots of additional context for transcriptions, such as multiple interpretations, confidence levels, and timing information. In this blog post, you will learn how to use the new iOS 10 Speech Recognition API to perform speech-to-text in a mobile app.

Introduction to Speech Recognition

The Speech Recognition API is available as part of the iOS 10 release from Apple. To ensure that you can build apps against the new iOS 10 APIs, confirm that you are running the latest release from the Stable updater channel in Visual Studio or Xamarin Studio. Speech recognition can be added to our iOS applications in just a few steps:

  1. Provide a usage description in the app’s Info.plist file using the NSSpeechRecognitionUsageDescription key.
  2. Request authorization to use speech recognition by calling SFSpeechRecognizer.RequestAuthorization.
  3. Create a speech recognition request and pass the speech recognition request to a SFSpeechRecognizer to begin recognition.

Providing a Usage Description

Privacy is a big part of building mobile applications; both iOS and Android have recently revamped the way apps request user permissions such as access to the camera or microphone. Because audio is temporarily transmitted to and stored on Apple’s servers to perform transcription, user permission is required. Be sure to take other privacy considerations into account when deciding to use the Speech Recognition API.

To use the Speech Recognition API, open Info.plist and add an entry with NSSpeechRecognitionUsageDescription as the Property, String as the Type, and the message you would like to display to the user when requesting permission to use speech recognition as the Value.

Info.plist for requesting user permissions.

Note: If the app will be performing live speech recognition, you will need to add an additional entry with the key `NSMicrophoneUsageDescription`.
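
For reference, if you edit the raw XML of Info.plist rather than using the property list editor, the two entries might look something like the sketch below; the description strings are placeholders you should replace with your own wording:

<key>NSSpeechRecognitionUsageDescription</key>
<string>This app converts your speech to text.</string>
<key>NSMicrophoneUsageDescription</key>
<string>This app uses the microphone for live speech recognition.</string>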

Request Authorization for Speech Recognition

Now that we have added our key(s) to Info.plist, it’s time to request permission from the user with the SFSpeechRecognizer.RequestAuthorization method. This method takes a single `Action<SFSpeechRecognizerAuthorizationStatus>` parameter, which lets us handle the various outcomes when we ask the user for permission (a short sketch follows the list below):

  • SFSpeechRecognizerAuthorizationStatus.Authorized: Permission granted from the user.
  • SFSpeechRecognizerAuthorizationStatus.Denied: Permission denied from the user.
  • SFSpeechRecognizerAuthorizationStatus.NotDetermined: Awaiting permission approval from the user.
  • SFSpeechRecognizerAuthorizationStatus.Restricted: The device does not allow usage of SFSpeechRecognizer.
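
Here is a minimal sketch of requesting authorization from a view controller. The switch simply illustrates handling the four statuses; because the callback may not arrive on the main thread, any UI work is dispatched back with InvokeOnMainThread:

// Ask the user for permission to use speech recognition.
SFSpeechRecognizer.RequestAuthorization(status =>
{
    // The callback may arrive on a background thread, so hop to the UI thread
    // before touching any user interface elements.
    InvokeOnMainThread(() =>
    {
        switch (status)
        {
            case SFSpeechRecognizerAuthorizationStatus.Authorized:
                // Safe to start recognizing speech.
                break;
            case SFSpeechRecognizerAuthorizationStatus.Denied:
            case SFSpeechRecognizerAuthorizationStatus.Restricted:
            case SFSpeechRecognizerAuthorizationStatus.NotDetermined:
                // Recognition isn't available yet; keep speech features disabled.
                break;
        }
    });
});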

Recognizing Speech

Now that we have permission, let’s write some code to use the new Speech Recognition API! Create a new method named RecognizeSpeech that takes in an NSUrl as a parameter. This is where we will perform all of our speech-to-text logic.

public void RecognizeSpeech(NSUrl url)
{
    var recognizer = new SFSpeechRecognizer();

    // Is the default language supported?
    if (recognizer == null)
        return;

    // Is recognition available?
    if (!recognizer.Available)
        return;
}

SFSpeechRecognizer (found in the Speech namespace) is the main class for speech recognition in iOS 10. In the code above, we “new up” an instance of this class. If speech recognition is not supported for the current device language, the recognizer will be null. We then check that speech recognition is currently available before using it.
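
If you need a recognizer for a specific language rather than the device default, SFSpeechRecognizer also offers a constructor that takes an NSLocale; for example (assuming US English is the language you want):

// Create a recognizer for a specific locale instead of the device default.
var englishRecognizer = new SFSpeechRecognizer(NSLocale.FromLocaleIdentifier("en-US"));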

Next, we’ll create and issue a new SFSpeechUrlRecognitionRequest with a local or remote NSUrl to select which prerecorded audio to recognize. Finally, we can use the SFSpeechRecognizer.GetRecognitionTask method to issue the speech recognition call to the server. Because recognition is performed incrementally, we can use the callback to update our user interface as results return. When speech recognition is completed, SFSpeechRecognitionResult.Final will be set to true, and we can use SFSpeechRecognitionResult.BestTranscription.FormattedString to access the final transcription.

// Create recognition task and start recognition
var request = new SFSpeechUrlRecognitionRequest(url);
recognizer.GetRecognitionTask(request, (SFSpeechRecognitionResult result, NSError err) =>
{
    // Was there an error?
    if (err != null)
    {
        var alertViewController = UIAlertController.Create("Error", $"An error recognizing speech occurred: {err.LocalizedDescription}", UIAlertControllerStyle.Alert);

        // Add an OK button so the user can dismiss the alert.
        alertViewController.AddAction(UIAlertAction.Create("OK", UIAlertActionStyle.Default, null));
        PresentViewController(alertViewController, true, null);
    }
    else
    {
        // Update the user interface with the speech-to-text result.
        if (result.Final)
            SpeechToTextView.Text = result.BestTranscription.FormattedString;
    }
});

That’s it! Now we can run our app and perform speech-to-text using the new Speech Recognition APIs as part of iOS 10.
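
As a quick usage example, we might kick off recognition of a prerecorded file like this (the file name is hypothetical and assumes an audio file bundled with the app):

// Hypothetical call site: recognize a prerecorded audio file shipped with the app.
var audioUrl = NSUrl.FromFilename("recording.m4a");
RecognizeSpeech(audioUrl);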

Performing More Complex Speech & Language Operations

The Speech Recognition APIs from iOS 10 are great, but what if we need something a bit more complex? Microsoft Cognitive Services has a great set of language APIs for handling speech and natural language, from speaker recognition to understanding speaker intent. For more information about Microsoft Cognitive Services language and speech APIs, check out the Microsoft Cognitive Services webpage.

Wrapping Up

In this blog post, we took a look at the new Speech Recognition APIs that are available to developers as part of iOS 10. For more information on the Speech Recognition APIs, visit our documentation. Mobile applications that want to build conversational user interfaces should also check out the documentation on iOS 10’s SiriKit. To download the sample from this blog post, visit my GitHub.

Author

Pierce Boggan
Senior Program Manager

Pierce is a Senior Program Manager on the Mobile Developer Tools team at Microsoft. He is responsible for IDE tooling for mobile developers in Visual Studio (Xamarin) and Visual Studio Code (React Native and Cordova). In his free time, Pierce enjoys playing ultimate, backpacking, and spending way too much time on side projects he will never finish.
