{"id":3512,"date":"2023-09-28T09:44:39","date_gmt":"2023-09-28T16:44:39","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/surface-duo\/?p=3512"},"modified":"2024-01-03T16:19:25","modified_gmt":"2024-01-04T00:19:25","slug":"android-openai-chatgpt-20","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-20\/","title":{"rendered":"Speech-to-speech conversing with OpenAI on Android"},"content":{"rendered":"<p>\n  Hello prompt engineers,\n<\/p>\n<p>\n  Just this week, OpenAI announced that their <a href=\"https:\/\/openai.com\/blog\/chatgpt-can-now-see-hear-and-speak\">chat app and website can now \u2018hear and speak\u2019<\/a>. In a huge coincidence (originally inspired by this <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/openai-speech?tabs=windows&amp;pivots=programming-language-csharp\">Azure OpenAI speech to speech<\/a> doc), we\u2019ve added similar functionality to our Jetpack Compose LLM chat sample based on Jetchat.\n<\/p>\n<p>\n  The screenshot below shows the two new buttons that enable this feature:\n<\/p>\n<ul>\n<li><strong>Microphone <\/strong>\u2013 press to start listening and then speak your query\n  <\/li>\n<li><strong>Speaker-mute<\/strong> \u2013 when the app is speaking the response, press this button to stop.\n  <\/li>\n<\/ul>\n<p>\n  <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica-1.png\" class=\"wp-image-3513\" alt=\"Image showing speech recognition and synthesis with the Jetchat AI sample app\" width=\"600\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica-1.png 1343w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica-1-300x206.png 300w, 
https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica-1-1024x701.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica-1-768x526.png 768w\" sizes=\"(max-width: 1343px) 100vw, 1343px\" \/><br\/><em>Figure 1: The microphone and speaker-mute icons added to Jetchat<\/em>\n<\/p>\n<p>\n  The transcribed speech is added to the chat as though it had been typed and sent directly to the LLM. The LLM\u2019s response is then automatically spoken back through the speakers\/headset. Both the speech-in and speech-out features use built-in Android APIs.\n<\/p>\n<h2>Speech in<\/h2>\n<p>\n  To listen to the user\u2019s question and create a prompt for the LLM, we\u2019re going to use the Android <a href=\"https:\/\/developer.android.com\/reference\/android\/speech\/SpeechRecognizer\"><code>SpeechRecognizer<\/code><\/a> API. We don\u2019t want the phone to be in permanent listen-mode, so the user will have to tap the microphone icon before speaking. 
This requires us to:\n<\/p>\n<ul>\n<li>\n    Configure permissions in <strong>AndroidManifest.xml<\/strong>\n  <\/li>\n<li>\n    Check\/ask for permission in code\n  <\/li>\n<li>\n    Initialize the API with a context\n  <\/li>\n<li>\n    Add API methods to listen for speech and send text to the view model\n  <\/li>\n<li>\n    Wire up the UI button to start listening\n  <\/li>\n<\/ul>\n<h3>Set &amp; check permissions<\/h3>\n<p>\n  In order for the app to access the microphone, it must request the <code>RECORD_AUDIO<\/code> permission in <strong>AndroidManifest.xml<\/strong>:\n<\/p>\n<pre>&lt;uses-permission android:name=\"android.permission.RECORD_AUDIO\"\/&gt;<\/pre>\n<p>\n  and then call this <code>checkPermission<\/code> function from <code>NavActivity.onCreate<\/code>:\n<\/p>\n<pre>private fun checkPermission() {\r\n    if (Build.VERSION.SDK_INT &gt;= Build.VERSION_CODES.M) {\r\n        ActivityCompat.requestPermissions(\r\n            this,\r\n            arrayOf(Manifest.permission.RECORD_AUDIO),\r\n            RecordAudioRequestCode\r\n        )\r\n    }\r\n}<\/pre>\n<p>\n  Assuming the user agrees, the app will be able to listen to their spoken questions.\n<\/p>\n<h3>Add listening code<\/h3>\n<p>\n  On the <code>ChannelNameBar<\/code> composable, there\u2019s a new parameter <code>onListenPressed<\/code>, which is called when the icon is clicked (see the \u201cWire up\u2026\u201d section below). This function is implemented in the <code>ConversationFragment<\/code> and passed through the <code>ConversationContent<\/code> composable.\n<\/p>\n<p>\n  The <code>speechToText<\/code> object is initialized as part of the fragment\u2019s <code>onCreate<\/code> function and takes a number of parameters, including the expected language. 
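One note on the permission step above: the request completes asynchronously via the activity's <code>onRequestPermissionsResult<\/code> callback. A minimal, platform-free sketch of the decision that callback makes (the helper name is hypothetical, not the sample's exact code; <code>0<\/code> is <code>PackageManager.PERMISSION_GRANTED<\/code>):

```kotlin
// Returns true when the RECORD_AUDIO request we issued came back granted.
// In the activity, this logic would live inside onRequestPermissionsResult.
fun isRecordAudioGranted(
    requestCode: Int,
    grantResults: IntArray,
    recordAudioRequestCode: Int
): Boolean =
    requestCode == recordAudioRequestCode &&
        grantResults.isNotEmpty() &&
        grantResults[0] == 0 // PackageManager.PERMISSION_GRANTED
```

If the user denies the request, the listening code should simply never be started.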
Initializing it in the fragment also gives the API access to the <code>context<\/code> required for its constructor:\n<\/p>\n<pre>speechToText = SpeechRecognizer.createSpeechRecognizer(this.context)\r\nspeechToText.setRecognitionListener(this)\r\nrecognizerIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)\r\nrecognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_PREFERENCE, \"en-US\")<\/pre>\n<p>\n  Once that\u2019s initialized, the <code>listen<\/code> function gets called via the view model and starts listening for speech to transcribe:\n<\/p>\n<pre>speechToText.startListening(recognizerIntent)<\/pre>\n<p>\n  The rest of the functions required to support the API are also implemented in the fragment. When some text is successfully transcribed, the result is passed to the view model via <code>setSpeech<\/code>.\n<\/p>\n<p>\n  The <code>MainViewModel.setSpeech<\/code> function inserts the text into the existing \u2018workflow\u2019 for new messages, so it appears in the user interface and the existing code sends it to the LLM.\n<\/p>\n<pre>fun setSpeech(text: String) {\r\n    onMessageSent(text)\r\n}<\/pre>\n<p><strong>Once <code>onMessageSent<\/code> is called, the app behaves as if the query was input via the keyboard.<\/strong>\n<\/p>\n<h2>Speech out<\/h2>\n<p>\n  To read the model\u2019s responses aloud, we\u2019ve used the Android <a href=\"https:\/\/developer.android.com\/reference\/android\/speech\/tts\/TextToSpeech\"><code>TextToSpeech<\/code><\/a> API. Like the <code>SpeechRecognizer<\/code> instance, it gets initialized in the fragment\u2019s <code>onCreate<\/code> function:\n<\/p>\n<pre>tts = TextToSpeech(this.context, this) \/\/ context, fragment\r\nactivityViewModel.setSpeechGenerator(tts)<\/pre>\n<p>\n  Notice that the second constructor parameter is a reference to the fragment itself \u2013 this is because we\u2019ve implemented the <code>TextToSpeech.OnInitListener<\/code> interface on the fragment. 
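For illustration, here is the shape of that contract: a JVM-friendly sketch mirroring <code>android.speech.tts.TextToSpeech.OnInitListener<\/code>. The interface and constant are redefined locally so the snippet stands alone, and the <code>ttsReady<\/code> flag is an assumed name, not the sample's exact code:

```kotlin
// Mirror of the Android contract: the engine calls onInit once start-up
// finishes, passing TextToSpeech.SUCCESS (0) or TextToSpeech.ERROR (-1).
interface OnInitListener { fun onInit(status: Int) }

val TTS_SUCCESS = 0 // TextToSpeech.SUCCESS

class SpeechHost : OnInitListener {
    var ttsReady = false
        private set

    override fun onInit(status: Int) {
        // Only treat the engine as usable (set language, allow speak calls)
        // when initialization succeeded.
        ttsReady = status == TTS_SUCCESS
    }
}
```

In the real fragment, the fragment itself plays the role of <code>SpeechHost<\/code>, which is why it is passed as the second constructor parameter above.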
The mute button\u2019s state (enabled only while a response is being spoken) is driven by those callback functions.\n<\/p>\n<p>\n  In the <code>onMessageSent<\/code> function, after a response has been received from the <code>OpenAiWrapper<\/code> and displayed in the UI, <strong>the code will <em>also<\/em> read it aloud<\/strong>:\n<\/p>\n<pre>textToSpeech.speak(chatResponse, TextToSpeech.QUEUE_FLUSH, null, \"\")<\/pre>\n<p>\n  As with the speech recognition implementation, there is minimal change to the existing logic of the app.\n<\/p>\n<h2>Access the Android context<\/h2>\n<p>\n  As mentioned above, both APIs need to be configured with a <code>Context<\/code>, so they are created and initialized in the <code>ConversationFragment<\/code> with references set in the <code>MainViewModel<\/code>. The <code>MainViewModel<\/code> encapsulates the speech-related APIs and exposes methods that are called from the Jetpack Compose UI.\n<\/p>\n<p>\n  The other reason for implementing the functionality on the fragment is that we need to implement the interfaces <code>RecognitionListener<\/code> and <code>TextToSpeech.OnInitListener<\/code>, for instance to capture and update the mutable state <code>speechState<\/code>.\n<\/p>\n<blockquote><p>NOTE: You will also see a method <code>setContext<\/code> on the <code>MainViewModel<\/code> \u2013 it is used by the SQLite implementations for vector caching and history embedding \u2013 and is not related to the speech features.<\/p><\/blockquote>\n<h2>Wire up the Jetpack Compose UI<\/h2>\n<p>\n  The above snippets show how the speech recognition and text-to-speech APIs are wired up. The next code snippet shows how the user interface triggers the functionality from the view model via composables in <strong>Conversation.kt<\/strong>. The speech UI controls (record and mute) are added as icons to the <code>ChannelNameBar<\/code> composable. 
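The <code>speechState<\/code> value those icons key off suggests a small enum along these lines. This is a sketch: the actual definition lives in the sample, and since only <code>LISTENING<\/code> and <code>SPEAKING<\/code> appear in the snippets here, the idle-state name is assumed:

```kotlin
// Assumed shape of the state driving the two icons: idle, capturing audio,
// or reading a response aloud.
enum class SpeechState { READY, LISTENING, SPEAKING }
```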
In the composable where the icons are declared, you can see how these functions and state trigger events on the view model:\n<\/p>\n<ul>\n<li>\n    <code>onListenPressed()<\/code>\n  <\/li>\n<li>\n    <code>speechState<\/code>\n  <\/li>\n<li>\n    <code>onStopTalkingPressed()<\/code>\n  <\/li>\n<\/ul>\n<pre>\/\/ \"Microphone\" icon\r\nIconButton(onClick = { onListenPressed() }) {\r\n    Icon(\r\n        imageVector = when (speechState) {\r\n            SpeechState.LISTENING -&gt; Icons.Filled.KeyboardVoice\r\n            else -&gt; Icons.Outlined.KeyboardVoice\r\n        },\r\n        tint = when (speechState) {\r\n            SpeechState.LISTENING -&gt; MaterialTheme.colorScheme.primary\r\n            else -&gt; MaterialTheme.colorScheme.onSurfaceVariant\r\n        },\r\n        modifier = Modifier.clip(CircleShape),\r\n        contentDescription = stringResource(id = R.string.enable_mic)\r\n    )\r\n}\r\n\/\/ \"End speaking\" icon\r\nIconButton(\r\n    onClick = { onStopTalkingPressed() },\r\n    enabled = speechState == SpeechState.SPEAKING\r\n) {\r\n    Icon(\r\n        imageVector = Icons.Outlined.VolumeOff,\r\n        modifier = Modifier.clip(CircleShape),\r\n        contentDescription = stringResource(id = R.string.mute_tts)\r\n    )\r\n}<\/pre>\n<p>\n  These are set in the <code>ConversationContent<\/code> composable and implemented in the <code>ConversationFragment<\/code>, calling into functions or mutable state (e.g. <code>speechState<\/code>) defined in the fragment.\n<\/p>\n<h2>One last thing\u2026<\/h2>\n<p>\n  After adding this feature, having every response read aloud started driving me crazy while testing other features. 
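The fix is a compile-time switch. A minimal sketch of the gating: the <code>Constants.ENABLE_SPEECH<\/code> flag is the one in the sample's <strong>Constants.kt<\/strong>, while the helper function and lambda-based wiring here are illustrative, not the sample's exact code:

```kotlin
// Illustrative gate: when the flag is off, the speak call is skipped and
// responses stay text-only.
object Constants { const val ENABLE_SPEECH = true } // from Constants.kt

fun speakIfEnabled(chatResponse: String, speak: (String) -> Unit) {
    if (Constants.ENABLE_SPEECH) {
        // e.g. textToSpeech.speak(chatResponse, TextToSpeech.QUEUE_FLUSH, null, "")
        speak(chatResponse)
    }
}
```

Because it is a compile-time constant, toggling it means rebuilding the app.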
I added a flag <code>Constants.ENABLE_SPEECH<\/code> that should be set to <code>true<\/code> to use these features, and <code>false<\/code> when you are testing or otherwise don\u2019t want to use the <code>SpeechRecognizer<\/code> and <code>TextToSpeech<\/code> functionality. This could become an application preference in the future.\n<\/p>\n<p>\n  If you\u2019re running the sample and find that the speech features aren\u2019t working, check that <code>ENABLE_SPEECH = true<\/code> in <strong>Constants.kt<\/strong> and that you agreed to the permissions dialog when the app started.\n<\/p>\n<h2>Resources and feedback<\/h2>\n<p>\n  See the <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/\">Azure OpenAI documentation<\/a> for more information on the wide variety of services available for your apps.\n<\/p>\n<p>\n  We\u2019d love your feedback on this post, including any tips or tricks you\u2019ve learned from playing around with ChatGPT prompts.\n<\/p>\n<p>\n  If you have any thoughts or questions, use the <a href=\"http:\/\/aka.ms\/SurfaceDuoSDK-Feedback\">feedback forum<\/a> or message us on <a href=\"https:\/\/twitter.com\/surfaceduodev\">Twitter @surfaceduodev<\/a>.\n<\/p>\n<p>\n  There will be no livestream this week, but you can check out the <a href=\"https:\/\/youtube.com\/c\/surfaceduodev\">archives on YouTube<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello prompt engineers, Just this week, OpenAI announced that their chat app and website can now \u2018hear and speak\u2019. In a huge coincidence (originally inspired by this Azure OpenAI speech to speech doc), we\u2019ve added similar functionality to our Jetpack Compose LLM chat sample based on Jetchat. 
The screenshot below shows the two new buttons [&hellip;]<\/p>\n","protected":false},"author":570,"featured_media":3513,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[741],"tags":[734,733],"class_list":["post-3512","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-chatgpt","tag-openai"],"acf":[],"blog_post_summary":"<p>Hello prompt engineers, Just this week, OpenAI announced that their chat app and website can now \u2018hear and speak\u2019. In a huge coincidence (originally inspired by this Azure OpenAI speech to speech doc), we\u2019ve added similar functionality to our Jetpack Compose LLM chat sample based on Jetchat. The screenshot below shows the two new buttons [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3512","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/users\/570"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/comments?post=3512"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3512\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media\/3513"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media?parent=3512"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/categories?post=3512"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dev
blogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/tags?post=3512"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}