Determining Speech Intent with Cognitive Services and LUIS

From the early days of punch cards and the command line, the ways in which we interact with computers have continued to evolve. The mouse made things a little easier by ushering in graphical user interfaces. Only recently have more natural human-computer interactions become prevalent through touch and speech. Digital personal assistants (e.g. Cortana, Siri, Google Now, Alexa) show how we can interact with computers more naturally using speech. Cognitive Services, used with the Windows 10 Speech APIs, forms a platform that supports a wide range of speech scenarios and applications for developers of all backgrounds. This code story walks through how we used Cognitive Services in combination with the Language Understanding Intelligent Service (LUIS) to interpret voice commands and determine the intent of what a user said.

The Problem

In collaboration with an automobile manufacturer, we developed a ‘smart’ center console for a vehicle. Among other capabilities, the console would continually listen for commands from the driver or a passenger and perform an action based on what was said. For instance, “Hey [insert-car-name-here], navigate me to the cheapest gas station around here” or “Hey [insert-car-name-here], let my next meeting know that I am running late”. With any speech interaction, it’s important to enable natural language commands. The challenge is that there are many ways of conveying the same intent. For example, to ask your vehicle to play a piece of music, you can say:

Play Thriller by Michael Jackson
Play Michael Jackson’s Thriller
I want to listen to Michael Jackson’s Thriller
Thriller by Michael Jackson, play it

Despite the many variants, the speaker wants the same result in every case: playing the song. How does an application handle the different ways a user can request an action?

Overview of the Solution

Architecture

For simplicity, the following solution will focus on how we handled the intent of playing music.

Implementation

Cognitive Services is a set of machine learning libraries developed by Microsoft Research. These services are used internally at Microsoft across a range of products, including Cortana and Skype Translator. As a colleague, Mike Lanzetta, mentions, “[Cognitive Services] differs from Azure ML in that these are pre-trained/pre-built libraries for specific but common ML tasks”. Cognitive Services exposes cloud-based APIs that enable applications to easily integrate recognition capabilities for input such as speech, faces, or objects in images.

The Bing Speech APIs are accessible through a REST endpoint and a variety of client libraries. A benefit of these client libraries is that they allow for partial recognition results as the microphone data is streamed to Cognitive Services. Since our application was in .NET, we opted for the C# client library.

Start Recording

To start a microphone recording session, we need three configuration values:

Bing Speech API Subscription Key – Obtained through the Cognitive Services dashboard, where you can create a Speech API subscription and view its keys.
LUIS App ID – Available after publishing the LUIS model
LUIS Subscription ID – Available after publishing the LUIS model

The MicrophoneRecognitionClient sends data from the microphone to the speech recognition service.

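// Create a microphone client that passes recognized speech to the LUIS model.
MicrophoneRecognitionClient _client = SpeechRecognitionServiceFactory.CreateMicrophoneClientWithIntent(
    "en-us",
    recognitionConfig.SpeechSubscriptionId,
    recognitionConfig.LuisAppId,
    recognitionConfig.LuisSubscriptionId);

// Subscribe before starting so that no intent result is missed.
_client.OnIntent += (s, e) => {
    // inspect the intent result carried by e and act on it
};

// Begin streaming microphone audio to the service.
_client.StartMicAndRecognition();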

In the SpotifySearch project, we followed the factory design pattern and created a separate ProviderFactory assembly to handle the creation of providers. This lets clients of the assembly create a speech recognition client using:

ProviderFactory.Create(new SpeechRecognitionConfig(...))

The factory then instantiates the MicrophoneRecognitionClient in much the same way as shown above. The WithIntent clients of the SpeechRecognitionServiceFactory require a trained model in order to assess intent on the recognition results. We train the models using Cognitive Services LUIS.
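The factory arrangement described above might look roughly like the following sketch. Only the names ProviderFactory, SpeechRecognitionConfig, and the three configuration properties come from this article. The ISpeechProvider interface and the MicrophoneSpeechProvider class are hypothetical stand-ins, and the call into the speech client library is left as a comment so that the sketch compiles on its own.

```csharp
using System;

// Configuration values needed to start a recognition session.
public sealed class SpeechRecognitionConfig
{
    public string SpeechSubscriptionId { get; }
    public string LuisAppId { get; }
    public string LuisSubscriptionId { get; }

    public SpeechRecognitionConfig(string speechSubscriptionId, string luisAppId, string luisSubscriptionId)
    {
        // Fail fast on missing values instead of at the first service call.
        SpeechSubscriptionId = Require(speechSubscriptionId, nameof(speechSubscriptionId));
        LuisAppId = Require(luisAppId, nameof(luisAppId));
        LuisSubscriptionId = Require(luisSubscriptionId, nameof(luisSubscriptionId));
    }

    private static string Require(string value, string name) =>
        string.IsNullOrWhiteSpace(value)
            ? throw new ArgumentException("A value is required.", name)
            : value;
}

// Hypothetical abstraction over a speech provider (not from the article).
public interface ISpeechProvider
{
    SpeechRecognitionConfig Config { get; }
    void Start();
}

// Hypothetical provider that would wrap MicrophoneRecognitionClient.
public sealed class MicrophoneSpeechProvider : ISpeechProvider
{
    public SpeechRecognitionConfig Config { get; }

    public MicrophoneSpeechProvider(SpeechRecognitionConfig config) => Config = config;

    public void Start()
    {
        // In the real project, this is where the client returned by
        // SpeechRecognitionServiceFactory would be created and started.
    }
}

public static class ProviderFactory
{
    // Callers depend only on ISpeechProvider, not on the concrete client.
    public static ISpeechProvider Create(SpeechRecognitionConfig config)
    {
        if (config == null) throw new ArgumentNullException(nameof(config));
        return new MicrophoneSpeechProvider(config);
    }
}
```

Keeping construction behind a factory like this means the rest of the application never references the speech client type directly, so a different provider could be swapped in without touching the callers.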

Training the Model

First, create a LUIS account. Once you have access, log into LUIS and create a new application.

[Screenshot: creating a new application in LUIS]

LUIS offers a graphical interface to train a model. The core concepts behind LUIS are utterances, entities, and intents. An utterance is the text of what the user said, entities are the subjects you wish to identify in the utterance, and the intent is the final intention of the utterance. Using the utterance “Play Thriller by Michael Jackson” as an example, the entities we wish to identify are the artist (Michael Jackson) and the song (Thriller), and the final intent is to play music.

Entities

We can add an entity and train our model to identify this newly created entity. LUIS also offers a set of pre-built entities. To show both use cases, we will manually define a song entity and leverage the ‘encyclopedia’ pre-built entity to identify the artist.

[Screenshot: the encyclopedia pre-built entity in LUIS]

Intent

All applications include a pre-defined intent of “None”. If no other intent is recognized, LUIS will return “None”. Continuing with the example of playing music, we will create a new intent of “PlayMusic”.

Train and Publish

With each utterance, LUIS will attempt to recognize the relevant intent and entities. In the graphic shown below, LUIS correctly identified “Michael Jackson” as an encyclopedia entity, “Thriller” as a song entity, with “PlayMusic” as the intent.

[Screenshot: LUIS labeling the intent and entities of an utterance]

In order to improve the accuracy of LUIS, continue to seed the system with more utterances and ensure proper labeling of entities and intent. LUIS will generalize the seeded examples and develop the necessary model to recognize the relevant intents and entities. Once you think the system has been seeded with sufficient data, publish the model to expose an HTTP endpoint.

Submitting the query “I want to listen to Thriller by Michael Jackson” to the endpoint returns the following JSON response:

{
  "query": "i want to listen to thriller by michael jackson",
  "intents": [
    {
      "intent": "PlayMusic",
      "score": 0.9999995
    },
    {
      "intent": "None",
      "score": 0.07043537
    }
  ],
  "entities": [
    {
      "entity": "thriller",
      "type": "song",
      "startIndex": 20,
      "endIndex": 27
    },
    {
      "entity": "michael jackson",
      "type": "builtin.encyclopedia.people.person",
      "startIndex": 32,
      "endIndex": 46,
      "score": 0.9995551
    },
    {
      "entity": "thriller",
      "type": "builtin.encyclopedia.tv.program",
      "startIndex": 20,
      "endIndex": 27
    },
    {
      "entity": "michael jackson",
      "type": "builtin.encyclopedia.music.artist",
      "startIndex": 32,
      "endIndex": 46
    },
    {
      "entity": "michael jackson",
      "type": "builtin.encyclopedia.film.actor",
      "startIndex": 32,
      "endIndex": 46
    },
    {
      "entity": "michael jackson",
      "type": "builtin.encyclopedia.film.producer",
      "startIndex": 32,
      "endIndex": 46
    },
    {
      "entity": "michael jackson",
      "type": "builtin.encyclopedia.film.writer",
      "startIndex": 32,
      "endIndex": 46
    }
  ]
}
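A response of this shape can be reduced to the values the application needs with a few lines of code. The following is a minimal sketch, not the project's actual parsing code: LuisResponseReader and its methods are hypothetical names, it uses System.Text.Json rather than the Json.NET library seen elsewhere in this article so that it has no external dependency, and the 0.5 score threshold is an arbitrary assumption you would tune for your own model.

```csharp
using System;
using System.Linq;
using System.Text.Json;

public static class LuisResponseReader
{
    // Returns the highest-scoring intent, or "None" when the response
    // lists no intents or the best score falls below minScore.
    public static string TopIntent(string json, double minScore = 0.5)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        var best = doc.RootElement.GetProperty("intents")
            .EnumerateArray()
            .Select(i => new
            {
                Name = i.GetProperty("intent").GetString(),
                Score = i.GetProperty("score").GetDouble()
            })
            .OrderByDescending(i => i.Score)
            .FirstOrDefault();

        return best != null && best.Score >= minScore ? best.Name : "None";
    }

    // Returns the text of the first entity of the given type,
    // or null when the response contains no such entity.
    public static string FirstEntity(string json, string type)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        foreach (JsonElement entity in doc.RootElement.GetProperty("entities").EnumerateArray())
        {
            if (entity.GetProperty("type").GetString() == type)
            {
                return entity.GetProperty("entity").GetString();
            }
        }
        return null;
    }
}
```

With the response above, LuisResponseReader.TopIntent(json) yields the PlayMusic intent, while FirstEntity(json, "song") and FirstEntity(json, "builtin.encyclopedia.music.artist") yield the song and artist text. Note that the same span of text can be reported under several encyclopedia types, so the caller should ask for the specific type it cares about.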

Queries made to the HTTP endpoint are tracked within LUIS. You can periodically log in to LUIS to review the history of queries along with their predicted entities and intents, and make adjustments to improve the predictions.

With the intent of the spoken audio in hand, the application can now use it to drive further actions.
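One way to let the recognized intent drive further actions is a small lookup from intent name to handler, as in the sketch below. IntentRouter, Register, and Dispatch are hypothetical names chosen for illustration and are not part of LUIS or the speech client library. Any intent without a registered handler, including “None”, goes to a fallback handler.

```csharp
using System;
using System.Collections.Generic;

public sealed class IntentRouter
{
    // Intent names are matched case-insensitively.
    private readonly Dictionary<string, Action<IReadOnlyDictionary<string, string>>> _handlers =
        new Dictionary<string, Action<IReadOnlyDictionary<string, string>>>(StringComparer.OrdinalIgnoreCase);

    private readonly Action<IReadOnlyDictionary<string, string>> _fallback;

    public IntentRouter(Action<IReadOnlyDictionary<string, string>> fallback)
    {
        _fallback = fallback ?? throw new ArgumentNullException(nameof(fallback));
    }

    // Associates an intent name with the action to run for it.
    public void Register(string intent, Action<IReadOnlyDictionary<string, string>> handler)
    {
        _handlers[intent] = handler;
    }

    // Runs the handler registered for the intent, or the fallback
    // when the intent is null or has no registered handler.
    public void Dispatch(string intent, IReadOnlyDictionary<string, string> entities)
    {
        if (intent != null && _handlers.TryGetValue(intent, out var handler))
        {
            handler(entities);
        }
        else
        {
            _fallback(entities);
        }
    }
}
```

An application would register one handler per intent at startup, for example a PlayMusic handler that reads the song and artist entities, and then call Dispatch each time a recognition result arrives. Adding a new voice command then means training a new intent and registering one more handler.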

Handling the Intent

Once we have obtained the intent and entities detected in the voice command, our application can perform the required action. In our SpotifySearch application, given the PlayMusic intent, we obtain the song and artist from the response payload and make a query to the Spotify web API to retrieve a sample of the song.

// RecoModel carries the song and artist extracted from the LUIS response
var model = TempData["model"] as RecoModel;

// query the Spotify web API for the song and artist
var client = new HttpClient();
string result = await client.GetStringAsync(string.Format(
    "https://api.spotify.com/v1/search?q=track:{0}%20artist:{1}&type=track",
    Uri.EscapeDataString(model.Song),
    Uri.EscapeDataString(model.Artist)));

dynamic json = JsonConvert.DeserializeObject(result);

// retrieve a preview and update the model, if the search returned any track
if (json.tracks.items.Count > 0)
{
    model.SpotifyLink = json.tracks.items[0].preview_url;
}


Opportunities for Reuse

With the source code of the sample project available on GitHub, our solution serves as an example of how to leverage Cognitive Services LUIS to enable natural language commands in your own application.
