{"id":2152,"date":"2015-12-16T04:35:13","date_gmt":"2015-12-16T04:35:13","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/index.php\/2015\/12\/16\/determining-speech-intent-with-cognitive-services-and-luis\/"},"modified":"2020-03-15T07:56:27","modified_gmt":"2020-03-15T14:56:27","slug":"determining-speech-intent-with-cognitive-services-and-luis","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/determining-speech-intent-with-cognitive-services-and-luis\/","title":{"rendered":"Determining Speech Intent with Cognitive Services and LUIS"},"content":{"rendered":"<p>From the early days of punch cards and command line, the ways in which we interact with computers have continued to evolve. The mouse made things a little easier by ushering in graphical user interfaces. Only recently have more natural human-computer interactions become prevalent through touch and speech. Digital personal assistants (e.g. Cortana, Siri, Google Now, Alexa) are examples of how we are able to interact with computers in a more natural way using speech. Cognitive Services, used with the Windows 10 Speech APIs, forms a complete and comprehensive platform that supports a wide range of speech scenarios and applications for developers of all backgrounds. This real-life code story will walk through how we used <a href=\"https:\/\/www.microsoft.com\/cognitive-services\/\">Cognitive Services<\/a> in combination with <a href=\"https:\/\/www.microsoft.com\/cognitive-services\/language-understanding-intelligent-service-luis\">Language Understanding Intelligent Service (LUIS)<\/a> to interpret voice commands and determine the final intent of what a user said.<\/p>\n<h2 id=\"the-problem\">The Problem<\/h2>\n<p>In collaboration with an automobile manufacturer, we developed a \u2018smart\u2019 center console for a vehicle. 
Among other capabilities, the console would continually listen for the driver\/passenger\u2019s commands and perform an action based on what was said. For instance, <em>\u201cHey [insert-car-name-here]<\/em>, navigate me to the cheapest gas station around here\u201d or <em>\u201cHey [insert-car-name-here]<\/em>, let my next meeting know that I am running late\u201d. With any speech interaction, it\u2019s important to enable natural language commands. The challenge is that there are many ways of conveying your intent. For example, to ask your vehicle to play a piece of music, you can say:<\/p>\n<ul>\n<li>Play Thriller by Michael Jackson<\/li>\n<li>Play Michael Jackson\u2019s Thriller<\/li>\n<li>I want to listen to Michael Jackson\u2019s Thriller<\/li>\n<li>Thriller by Michael Jackson, play it<\/li>\n<\/ul>\n<p>Despite the many variants, the speaker wishes to elicit the same effect \u2013 playing <a href=\"https:\/\/www.youtube.com\/watch?v=sOnqjkJTMaA\">this<\/a>. How does an application handle the different ways a user can request an action?<\/p>\n<h2 id=\"overview-of-the-solution\">Overview of the Solution<\/h2>\n<p>Powered by a Windows 10 PC, our solution uses Cognitive Services LUIS to determine the intent of the user\u2019s request.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2015\/12\/architecture.png\" alt=\"Architecture\" \/><\/p>\n<p>For simplicity, the following solution will focus on how we handled the intent of playing music.<\/p>\n<h3 id=\"implementation\">Implementation<\/h3>\n<p>Cognitive Services is a set of machine learning libraries developed by Microsoft Research. Its APIs are used internally at Microsoft across a range of products, including Cortana and Skype Translator. 
As our colleague <a href=\"http:\/\/www.mikelanzetta.com\/\">Mike Lanzetta<\/a> puts it, \u201c[Cognitive Services] differs from Azure ML in that these are pre-trained\/pre-built libraries for specific but common ML tasks\u201d. Cognitive Services exposes cloud-based APIs that enable applications to easily integrate recognition capabilities for input such as speech, faces, or objects in images.<\/p>\n<p>The Bing Speech APIs are accessible through a <a href=\"https:\/\/www.microsoft.com\/cognitive-services\/en-us\/Speech-api\/documentation\/API-Reference-REST\/BingVoiceRecognition\">REST endpoint<\/a> and a variety of client libraries. A benefit of these client libraries is that they allow for partial recognition results as the microphone data is streamed to Cognitive Services. Since our application was in .NET, we opted for the C# client library.<\/p>\n<h3 id=\"start-recording\">Start Recording<\/h3>\n<p>To start a microphone recording session, we need three configuration values:<\/p>\n<ol>\n<li><em>Bing Speech API Subscription Key<\/em> &#8211; Obtained through the <a href=\"https:\/\/www.projectoxford.ai\/Subscription\">Cognitive Services dashboard<\/a>, where you can create a Speech API Subscription and view the keys.<\/li>\n<li><em>LUIS App ID<\/em> &#8211; Known after publishing the LUIS model<\/li>\n<li><em>LUIS Subscription ID<\/em> &#8211; Known after publishing the LUIS model<\/li>\n<\/ol>\n<p>The <code class=\"highlighter-rouge\">MicrophoneRecognitionClient<\/code> will send data from the microphone to the Speech Recognition Service.<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>MicrophoneRecognitionClient _client = SpeechRecognitionServiceFactory.CreateMicrophoneClientWithIntent(\r\n    \"en-us\",\r\n    recognitionConfig.SpeechSubscriptionId,\r\n    recognitionConfig.LuisAppId,\r\n    recognitionConfig.LuisSubscriptionId);\r\n\r\n_client.StartMicAndRecognition();\r\n\r\n_client.OnIntent += ((s, e) =&gt; {\r\n    \/\/ do 
stuff\r\n});\r\n<\/code><\/pre>\n<\/div>\n<p>In the SpotifySearch project, we followed the factory design pattern, and created a separate <code class=\"highlighter-rouge\">ProviderFactory<\/code> assembly to handle creation of providers. This allows clients of the assembly to create a <a href=\"https:\/\/github.com\/jpoon\/SpotifySearch\/blob\/master\/ProviderFactory\/Speech\/ISpeechRecognition.cs\">SpeechRecognition<\/a> client using:<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>ProviderFactory.Create(new SpeechRecognitionConfig(...))\r\n<\/code><\/pre>\n<\/div>\n<p>The MicrophoneClient is then <a href=\"https:\/\/github.com\/jpoon\/SpotifySearch\/blob\/master\/ProviderFactory\/Speech\/SpeechRecognition.cs#L47\">instantiated<\/a> in much the same way as shown above. The <code class=\"highlighter-rouge\">WithIntent<\/code> clients of the <code class=\"highlighter-rouge\">SpeechRecognitionServiceFactory<\/code> require a trained model in order to assess intent on the recognition results. We train the models using Cognitive Services LUIS.<\/p>\n<h3 id=\"training-the-model\">Training the Model<\/h3>\n<p>Create an account with LUIS <a href=\"https:\/\/www.luis.ai\/\">here<\/a>. Once you have access, log into LUIS and create a new application.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2015\/12\/luis_new_application.png\" alt=\"Create a new application\" \/><\/p>\n<p>LUIS offers a graphical interface to train a model. The core concepts behind LUIS are: utterances, entities, and intents. Entities refer to the subjects you wish to identify in your utterance and intent refers to the final intention of the utterance. 
Using the utterance \u201cPlay Thriller by Michael Jackson\u201d as an example, the entities we wish to identify are the artist (Michael Jackson) and the song (Thriller), with the final intent being to play music.<\/p>\n<h4 id=\"entities\">Entities<\/h4>\n<p>We can add an entity and train our model to identify this newly created entity. LUIS also offers a set of <a href=\"https:\/\/www.luis.ai\/Help#PreBuiltEntities\">pre-built entities<\/a>. To show both use cases, we will manually define a song entity and leverage the \u2018encyclopedia\u2019 built-in entity to identify the artist.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2015\/12\/luis_entity_1.gif\" alt=\"encyclopedia entity\" \/><\/p>\n<h4 id=\"intent\">Intent<\/h4>\n<p>All applications include a pre-defined intent of \u201cnone\u201d. If no intents are recognized, LUIS will return \u201cnone\u201d. Continuing with the example of playing music, we will create a new intent of \u201cPlayMusic\u201d.<\/p>\n<h4 id=\"train-and-publish\">Train and Publish<\/h4>\n<p>With each utterance, LUIS will attempt to recognize the relevant intent and entities. In the graphic shown below, LUIS correctly identified \u201cMichael Jackson\u201d as an encyclopedia entity, \u201cThriller\u201d as a song entity, with \u201cPlayMusic\u201d as the intent.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2015\/12\/luis_intent_1.gif\" alt=\"intent entity\" \/><\/p>\n<p>In order to improve the accuracy of LUIS, continue to seed the system with more utterances and ensure proper labeling of entities and intent. LUIS will generalize the seeded examples and develop the necessary model to recognize the relevant intents and entities. 
Once you think the system has been seeded with sufficient data, publish the model to expose an HTTP endpoint.<\/p>\n<p>Submitting the query: \u201cI want to listen to Thriller by Michael Jackson\u201d, we receive the following response JSON:<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code><span class=\"p\">{<\/span>\r\n  <span class=\"nt\">\"query\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"i want to listen to thriller by michael jackson\"<\/span><span class=\"p\">,<\/span>\r\n  <span class=\"nt\">\"intents\"<\/span><span class=\"p\">:<\/span> <span class=\"p\">[<\/span>\r\n    <span class=\"p\">{<\/span>\r\n      <span class=\"nt\">\"intent\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"PlayMusic\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"score\"<\/span><span class=\"p\">:<\/span> <span class=\"mf\">0.9999995<\/span>\r\n    <span class=\"p\">},<\/span>\r\n    <span class=\"p\">{<\/span>\r\n      <span class=\"nt\">\"intent\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"None\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"score\"<\/span><span class=\"p\">:<\/span> <span class=\"mf\">0.07043537<\/span>\r\n    <span class=\"p\">}<\/span>\r\n  <span class=\"p\">],<\/span>\r\n  <span class=\"nt\">\"entities\"<\/span><span class=\"p\">:<\/span> <span class=\"p\">[<\/span>\r\n    <span class=\"p\">{<\/span>\r\n      <span class=\"nt\">\"entity\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"thriller\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"type\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"song\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"startIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">20<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"endIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">27<\/span>\r\n    <span class=\"p\">},<\/span>\r\n    <span 
class=\"p\">{<\/span>\r\n      <span class=\"nt\">\"entity\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"michael jackson\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"type\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"builtin.encyclopedia.people.person\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"startIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">32<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"endIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">46<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"score\"<\/span><span class=\"p\">:<\/span> <span class=\"mf\">0.9995551<\/span>\r\n    <span class=\"p\">},<\/span>\r\n    <span class=\"p\">{<\/span>\r\n      <span class=\"nt\">\"entity\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"thriller\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"type\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"builtin.encyclopedia.tv.program\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"startIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">20<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"endIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">27<\/span>\r\n    <span class=\"p\">},<\/span>\r\n    <span class=\"p\">{<\/span>\r\n      <span class=\"nt\">\"entity\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"michael jackson\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"type\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"builtin.encyclopedia.music.artist\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"startIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">32<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"endIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">46<\/span>\r\n    <span class=\"p\">},<\/span>\r\n    
<span class=\"p\">{<\/span>\r\n      <span class=\"nt\">\"entity\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"michael jackson\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"type\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"builtin.encyclopedia.film.actor\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"startIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">32<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"endIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">46<\/span>\r\n    <span class=\"p\">},<\/span>\r\n    <span class=\"p\">{<\/span>\r\n      <span class=\"nt\">\"entity\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"michael jackson\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"type\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"builtin.encyclopedia.film.producer\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"startIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">32<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"endIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">46<\/span>\r\n    <span class=\"p\">},<\/span>\r\n    <span class=\"p\">{<\/span>\r\n      <span class=\"nt\">\"entity\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"michael jackson\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"type\"<\/span><span class=\"p\">:<\/span> <span class=\"s2\">\"builtin.encyclopedia.film.writer\"<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"startIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">32<\/span><span class=\"p\">,<\/span>\r\n      <span class=\"nt\">\"endIndex\"<\/span><span class=\"p\">:<\/span> <span class=\"mi\">46<\/span>\r\n    <span class=\"p\">}<\/span>\r\n  <span class=\"p\">]<\/span>\r\n<span class=\"p\">}<\/span>\r\n<\/code><\/pre>\n<\/div>\n<p>Queries made to the HTTP endpoint will be 
tracked within LUIS. You can periodically log in to LUIS and view the history of the queries, predicted entities, and intents, and make any adjustments to improve the predictions.<\/p>\n<p>With the intent of the spoken audio determined, we can now use it to drive further actions from the application.<\/p>\n<h3 id=\"handling-the-intent\">Handling the Intent<\/h3>\n<p>Once we have obtained the intent and entities detected in the voice command, our application can perform the required action. In our SpotifySearch application, given the <code class=\"highlighter-rouge\">PlayMusic<\/code> intent, we obtain the song and artist from the <a href=\"https:\/\/github.com\/jpoon\/SpotifySearch\/blob\/master\/SpotifySearch\/Controllers\/HomeController.cs#L84\">response payload<\/a> and <a href=\"https:\/\/github.com\/jpoon\/SpotifySearch\/blob\/master\/SpotifySearch\/Controllers\/HomeController.cs#L122\">make a query<\/a> to the Spotify web API to retrieve a sample of the song.<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>var model = TempData[\"model\"] as RecoModel;\r\n\r\n\/\/ query Spotify web API for the song and artist\r\nvar client = new HttpClient();\r\nTask&lt;string&gt; spotifySearch =\r\n    client.GetStringAsync(string.Format(\"https:\/\/api.spotify.com\/v1\/search?q=track:{0}%20artist:{1}&amp;type=track\", Uri.EscapeDataString(model.Song), Uri.EscapeDataString(model.Artist)));\r\n\r\nvar result = await spotifySearch;\r\ndynamic json = JsonConvert.DeserializeObject(result);\r\n\r\n\/\/ retrieve a preview and update the model\r\nmodel.SpotifyLink = json.tracks.items[0].preview_url;\r\n<\/code><\/pre>\n<\/div>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2015\/12\/spotify_search.gif\" alt=\"intent entity\" \/><\/p>\n<h2 id=\"opportunities-for-reuse\">Opportunities for Reuse<\/h2>\n<p>With the source code of the sample project available on <a 
href=\"https:\/\/github.com\/jpoon\/spotifysearch\">GitHub<\/a>, our solution serves as an example of how to leverage Cognitive Services LUIS to enable natural language commands in your own application.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this real-life-code story, we show how we used Cognitive Services and LUIS to build a vehicle center console that can listen and respond to user&#8217;s commands, specifically focusing on determining intent.<\/p>\n","protected":false},"author":21365,"featured_media":11144,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[19],"tags":[103,231,239,250],"class_list":["post-2152","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-bing-speech-api","tag-language-understanding-intelligent-service-luis","tag-machine-learning-ml","tag-microsoft-cognitive-services"],"acf":[],"blog_post_summary":"<p>In this real-life-code story, we show how we used Cognitive Services and LUIS to build a vehicle center console that can listen and respond to user&#8217;s commands, specifically focusing on determining 
intent.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2152","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21365"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=2152"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2152\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/11144"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=2152"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=2152"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=2152"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}