{"id":16451,"date":"2025-10-31T00:00:00","date_gmt":"2025-10-31T07:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=16451"},"modified":"2025-10-31T07:28:46","modified_gmt":"2025-10-31T14:28:46","slug":"azure-speech-to-text-optimization","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/azure-speech-to-text-optimization\/","title":{"rendered":"Tuning and Optimization of Speech-to-Text (STT), Text-to-Speech (TTS), and Custom Keyword Recognition in Azure Speech Services"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>Optimizing Speech-to-Text (STT) and Text-to-Speech (TTS) is essential for developers building voice-enabled applications, as it enhances recognition accuracy and overall user experience. Additionally, custom keyword recognition enables hands-free activation for voice assistants.<\/p>\n<p>This blog outlines:<\/p>\n<ul>\n<li>Methods for improving STT accuracy and tuning<\/li>\n<li>How to enhance TTS pronunciation using custom lexicons<\/li>\n<li>Approaches for implementing custom keyword recognition efficiently<\/li>\n<\/ul>\n<h2>Speech-to-Text (STT) Optimization<\/h2>\n<h3>Obtaining STT Models<\/h3>\n<p>Azure provides several STT models:<\/p>\n<ul>\n<li><strong>Default STT Models<\/strong>: Available via <a href=\"https:\/\/ai.azure.com\/\">Azure AI Foundry portal<\/a> for general-purpose speech recognition.<\/li>\n<li><strong>Custom Speech Models<\/strong>: Trainable via <a href=\"https:\/\/ai.azure.com\/\">Azure AI Foundry portal<\/a> for specific accents and terminology.<\/li>\n<li><strong>Embedded Speech Models<\/strong>: Designed for edge deployment, enabling <strong>low-latency offline processing<\/strong>. 
This is a Limited Access offering, and applications can be submitted via <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/embedded-speech?tabs=android-target%2Cjre&amp;pivots=programming-language-java\">Azure AI Speech embedded speech limited access review<\/a>.<\/li>\n<\/ul>\n<h3>STT Locale Support<\/h3>\n<h4>Default STT Models<\/h4>\n<p>Azure Speech-to-Text (STT) supports a wide range of languages and locales using default models. You can find the complete list of supported languages and features in the official documentation: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/language-support?tabs=stt#custom-speech\">Supported languages for Speech-to-Text<\/a>.<\/p>\n<h4>Custom Speech Models<\/h4>\n<p>To improve speech-to-text recognition accuracy, customization is available for some languages and base models. Depending on the locale, you can upload audio + human-labeled transcripts, plain text, structured text, and pronunciation data. 
By default, <strong>plain text customization<\/strong> is supported for all available <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/language-support?tabs=stt#custom-speech\">base models<\/a>.<\/p>\n<h4>Embedded Speech Models<\/h4>\n<p>The following Embedded STT models are currently available:<\/p>\n<ul>\n<li>Danish (da-DK)<\/li>\n<li>German (de-DE)<\/li>\n<li>English (en-AU, en-CA, en-GB, en-IE, en-IN, en-NZ, en-US)<\/li>\n<li>Spanish (es-ES, es-MX)<\/li>\n<li>French (fr-CA, fr-FR)<\/li>\n<li>Italian (it-IT)<\/li>\n<li>Japanese (ja-JP)<\/li>\n<li>Korean (ko-KR)<\/li>\n<li>Portuguese (pt-BR, pt-PT)<\/li>\n<li>Chinese (zh-CN, zh-HK, zh-TW)<\/li>\n<\/ul>\n<p>You can check the official documentation for the most up-to-date list of supported languages and models:\n<a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/embedded-speech?tabs=android-target%2Cjre&amp;pivots=programming-language-csharp#models-and-voices\">Embedded Speech Models and Voices<\/a>.<\/p>\n<h3>Improving Recognition Accuracy<\/h3>\n<p>Several techniques enhance STT accuracy by correcting misrecognitions and adapting the model to domain-specific vocabulary.<\/p>\n<h4>Phrase List (For both Embedded and Cloud models)<\/h4>\n<p>A phrase list improves recognition for specific words and phrases. This is particularly useful for proper nouns, brand names, and technical terms. 
For more details, refer to <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/improve-accuracy-phrase-list?tabs=terminal&amp;pivots=programming-language-java\">Improve recognition accuracy with phrase list<\/a>.<\/p>\n<h5>Implementation Example (Java)<\/h5>\n<pre><code class=\"language-java\">import com.microsoft.cognitiveservices.speech.PhraseListGrammar;\r\n\r\nPhraseListGrammar phraseList = PhraseListGrammar.fromRecognizer(recognizer);\r\nphraseList.addPhrase(\"Microsoft\");\r\nphraseList.addPhrase(\"Azure\");\r\nphraseList.addPhrase(\"Teams\");<\/code><\/pre>\n<h5>Additional Features<\/h5>\n<ul>\n<li>To remove all added phrases and reset the phrase list, use:\n<pre><code class=\"language-java\">phraseList.clear();<\/code><\/pre>\n<\/li>\n<li>The Azure Speech SDK supports over 1,000 phrase list entries, but initialization time increases with the number of phrases.<\/li>\n<\/ul>\n<h4>Custom Correction Logic (For both Embedded and Cloud models, <strong>en-US<\/strong> only)<\/h4>\n<p>Phrase lists alone may not be sufficient for certain words. Custom logic can be implemented to force corrections of frequently misrecognized words.<\/p>\n<h5>Preparing <code>corrections.json<\/code><\/h5>\n<p>To refine the correction logic, two methods can be used to identify misrecognition patterns:<\/p>\n<ol>\n<li>Analyze results after applying the phrase list to find words that are still misrecognized.<\/li>\n<li>Use N-best results from Word Level Details to check alternative recognition candidates.<\/li>\n<\/ol>\n<h5>What is N-best?<\/h5>\n<p>N-best refers to a ranked list of recognition hypotheses, where the STT engine provides not only the top result but also alternative candidates. 
By analyzing lower-ranked alternatives, it is possible to identify and correct systematic misrecognitions.<\/p>\n<h5>Enabling N-best in Word Level Details<\/h5>\n<p>To retrieve N-best results and detailed word-level recognition information, configure <code>speechConfig<\/code> with the following properties:<\/p>\n<pre><code class=\"language-java\">speechConfig.setSpeechRecognitionOutputFormat(OutputFormat.Detailed);\r\nspeechConfig.setProperty(\"SpeechRecognition_RequestWordLevelCorrections\", \"true\");<\/code><\/pre>\n<h5>Parsing N-best Results<\/h5>\n<p>Once the STT engine provides N-best results, the application can analyze them to refine recognition accuracy. Below is an example of how to parse N-best results from the JSON response:<\/p>\n<pre><code class=\"language-java\">import org.json.JSONArray;\r\nimport org.json.JSONObject;\r\n\r\n\/\/ Fetch JSON result containing N-best\r\nString jsonResult = e.getResult()\r\n                     .getProperties()\r\n                     .getProperty(PropertyId.SpeechServiceResponse_JsonResult);\r\nSystem.out.println(\"JSON result: \" + jsonResult);\r\n\r\n\/\/ Parse JSON for N-best recognition candidates;\r\n\/\/ optJSONArray returns null (instead of throwing) when \"NBest\" is absent\r\nJSONObject json = new JSONObject(jsonResult);\r\nJSONArray nbestArray = json.optJSONArray(\"NBest\");\r\n\r\nif (nbestArray != null &amp;&amp; nbestArray.length() &gt; 0) {\r\n    for (int i = 0; i &lt; nbestArray.length(); i++) {\r\n        JSONObject candidate = nbestArray.getJSONObject(i);\r\n        System.out.println(\"Candidate \" + (i + 1) + \": \" + candidate.getString(\"Display\") +\r\n                           \" (Confidence: \" + candidate.getDouble(\"Confidence\") + \")\");\r\n    }\r\n}\r\n<\/code><\/pre>\n<h5>N-best Example with Sentence Input<\/h5>\n<p>When an entire sentence is processed instead of a single word, N-best alternatives focus on specific words within the sentence. 
Below is an actual example where the word <strong>&#8220;tune&#8221;<\/strong> has multiple alternatives:<\/p>\n<pre><code class=\"language-json\">\"NBest\": [\r\n    {\r\n      \"Confidence\": 0.867123,\r\n      \"Lexical\": \"tune goggles\",\r\n      \"Display\": \"Tune Goggles.\",\r\n      \"Words\": [\r\n          {\r\n              \"Word\": \"tune\",\r\n              \"Confidence\": 0.867123,\r\n              \"Offset\": 12500000,\r\n              \"Duration\": 4500000\r\n          },\r\n          {\r\n              \"Word\": \"goggles\",\r\n              \"Confidence\": 0.867123,\r\n              \"Offset\": 17500000,\r\n              \"Duration\": 7000000\r\n          }\r\n      ]\r\n      ...\r\n    }\r\n],\r\n\"Corrections\": {\r\n    \"CorrectionCandidates\": [\r\n        {\r\n            \"Alternates\": [\r\n                { \"AlternateWords\": [\"toon\"], \"Id\": 0, \"SourceSpan\": [0] }\r\n            ],\r\n            \"Confidence\": \"High\",\r\n            \"Id\": 0,\r\n            \"Span\": [0]\r\n        }\r\n    ]\r\n}<\/code><\/pre>\n<p>In this example, &#8220;toon&#8221; is an alternative candidate for &#8220;tune&#8221; (SourceSpan = 0), making it a possible correction in the phrase &#8220;tune goggles.&#8221;<\/p>\n<p>This insight helps refine correction rules, ensuring misrecognitions like &#8220;tune goggles&#8221; \u2192 &#8220;ToonGoggles&#8221; are automatically corrected. By analyzing lower-confidence alternatives, developers can improve STT accuracy for domain-specific terms.<\/p>\n<h5>Example Correction Rules (<code>corrections.json<\/code>)<\/h5>\n<pre><code class=\"language-json\">{\r\n  \"ToonGoggles\": [\"tune goggles\", \"toon googles\", \"tune google\"]\r\n}<\/code><\/pre>\n<h4>Implementing Correction Logic<\/h4>\n<p>To efficiently apply correction rules, a dedicated correction utility class (<code>CorrectionUtils<\/code>) is used. 
This class loads misrecognition patterns from <code>corrections.json<\/code> and automatically corrects recognized text.<\/p>\n<h5>Key Features of <code>CorrectionUtils<\/code><\/h5>\n<ul>\n<li>Loads correction rules dynamically from a JSON file.<\/li>\n<li>Converts all misrecognition patterns to lowercase to ensure <strong>case-insensitive correction<\/strong>.<\/li>\n<li>Replaces misrecognized words with the correct phrase while preserving the rest of the sentence.<\/li>\n<li>Can be updated by modifying <code>corrections.json<\/code> without changing application code.<\/li>\n<\/ul>\n<h5>Code Snippet of <code>CorrectionUtils<\/code> (Java)<\/h5>\n<pre><code class=\"language-java\">import com.fasterxml.jackson.core.type.TypeReference;\r\nimport com.fasterxml.jackson.databind.ObjectMapper;\r\n\r\nimport java.io.BufferedReader;\r\nimport java.io.FileReader;\r\nimport java.io.IOException;\r\nimport java.util.HashMap;\r\nimport java.util.List;\r\nimport java.util.Map;\r\nimport java.util.regex.Matcher;\r\nimport java.util.regex.Pattern;\r\nimport java.util.stream.Collectors;\r\n\r\npublic class CorrectionUtils {\r\n    private static final Map&lt;String, List&lt;String&gt;&gt; correctionMap = new HashMap&lt;&gt;();\r\n\r\n    static { loadCorrections(\"corrections.json\"); }\r\n\r\n    private static void loadCorrections(String filePath) {\r\n        try {\r\n            \/\/ Load JSON file\r\n            ObjectMapper mapper = new ObjectMapper();\r\n            Map&lt;String, List&lt;String&gt;&gt; corrections = mapper.readValue(\r\n                new BufferedReader(new FileReader(filePath)),\r\n                new TypeReference&lt;Map&lt;String, List&lt;String&gt;&gt;&gt;() {}\r\n            );\r\n\r\n            \/\/ Convert all misrecognition lists to lowercase\r\n            for (Map.Entry&lt;String, List&lt;String&gt;&gt; entry : corrections.entrySet()) {\r\n                correctionMap.put(entry.getKey(),\r\n                    entry.getValue().stream()\r\n                        .map(String::toLowerCase)\r\n                        .collect(Collectors.toList())\r\n                );\r\n            }\r\n        } catch (IOException e) {\r\n            e.printStackTrace();\r\n        }\r\n    }\r\n\r\n    public static String correctRecognition(String recognizedText) {\r\n        String lowerCaseText = recognizedText.toLowerCase(); \/\/ Convert input to lowercase\r\n        for (Map.Entry&lt;String, List&lt;String&gt;&gt; entry : correctionMap.entrySet()) {\r\n            String correctPhrase = entry.getKey();\r\n            List&lt;String&gt; misrecognitions = entry.getValue();\r\n\r\n            for (String misrecognition : misrecognitions) {\r\n                if (lowerCaseText.contains(misrecognition)) {\r\n                    \/\/ Correct the recognition result; quote the pattern so any\r\n                    \/\/ regex metacharacters in the misrecognition are matched literally\r\n                    recognizedText = recognizedText.replaceAll(\r\n                        \"(?i)\" + Pattern.quote(misrecognition),\r\n                        Matcher.quoteReplacement(correctPhrase));\r\n                }\r\n            }\r\n        }\r\n        return recognizedText;\r\n    }\r\n}<\/code><\/pre>\n<p>To avoid case-sensitivity issues, all entries in <code>corrections.json<\/code> should be written in lowercase, ensuring consistent application across different inputs.<\/p>\n<h5>Usage Example in Speech Recognition<\/h5>\n<pre><code class=\"language-java\">String recognizedText = e.getResult().getText();\r\nrecognizedText = CorrectionUtils.correctRecognition(recognizedText);\r\nSystem.out.println(\"Corrected: Text=\" + recognizedText);<\/code><\/pre>\n<h2>Text-to-Speech (TTS) Optimization<\/h2>\n<h3>Obtaining TTS Models<\/h3>\n<p>Azure provides several TTS models:<\/p>\n<ul>\n<li><strong>Cloud Neural Voices<\/strong>: Highly natural out-of-the-box voices available via <a href=\"https:\/\/speech.microsoft.com\/portal\/customvoice\/overview\">Speech Studio portal<\/a>.<\/li>\n<li><strong>Custom Neural Voice (CNV)<\/strong>: Enables brand-specific voice training for online use via <a href=\"https:\/\/speech.microsoft.com\/portal\/customvoice\/overview\">Speech Studio portal<\/a>. It can also be trained as an <strong>Embedded Neural Voice<\/strong> at an additional cost for offline use. 
For more details, refer to the <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/custom-neural-voice\">Custom Neural Voice documentation<\/a>.<\/li>\n<li><strong>Embedded Neural Voices<\/strong>: Designed for edge deployment, allowing TTS processing without cloud dependency, ideal for low-latency applications. This is a Limited Access offering, and applications can be submitted via <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/embedded-speech?tabs=android-target%2Cjre&amp;pivots=programming-language-java\">Azure AI Speech embedded speech limited access review<\/a>.<\/li>\n<\/ul>\n<h3>TTS Locale Support<\/h3>\n<h4>Cloud Neural Voices<\/h4>\n<p>Cloud Neural Voices support <strong>150+ languages and accents<\/strong> with <strong>500+ voice models<\/strong>. You can find the complete list of supported languages and features in the official documentation: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/language-support?tabs=tts#supported-languages\">Supported languages for Text-to-Speech<\/a>.<\/p>\n<h4>Custom Neural Voice (CNV)<\/h4>\n<p>CNV includes several feature types (CNV Pro, CNV Lite, Cross-lingual voice source and target, and Multi-style voice), and the supported locales vary depending on the feature selected. For more details, refer to the <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/language-support?tabs=tts#custom-neural-voice\">Supported languages for Custom Neural Voice<\/a>.<\/p>\n<h4>Embedded Neural Voices<\/h4>\n<p>In contrast to Speech-to-Text (STT), all <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/language-support?tabs=tts#supported-languages\">Text-to-Speech (TTS) locales<\/a> (except Persian (fa-IR)) are available out of the box with at least <strong>one selected female and\/or male voice per locale<\/strong>. 
If a required voice model is not available in Embedded Neural Voices, <strong>additional cost may be required<\/strong> to train a new model.<\/p>\n<h3>Custom Lexicon for Pronunciation<\/h3>\n<p>To correct mispronunciations, a custom lexicon can be created using <strong>SSML<\/strong> or an <strong>external XML-based lexicon file<\/strong>. For more details, refer to the <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/speech-synthesis-markup-pronunciation#custom-lexicon\">Custom Lexicon documentation<\/a>.<\/p>\n<h4>Defining Pronunciation Using SSML<\/h4>\n<p>SSML supports inline phoneme correction using <code>&lt;phoneme&gt;<\/code> and <code>&lt;sub&gt;<\/code> elements.<\/p>\n<ul>\n<li><code>&lt;phoneme&gt;<\/code> allows pronunciation adjustments using <strong>International Phonetic Alphabet (IPA)<\/strong>, <strong>Speech API (SAPI)<\/strong>, <strong>Universal Phone Set (UPS)<\/strong>, or <strong>X-SAMPA<\/strong>.<\/li>\n<li><code>&lt;sub&gt;<\/code> is used for aliasing text.<\/li>\n<\/ul>\n<h5>Example<\/h5>\n<pre><code class=\"language-xml\">&lt;speak version=\"1.0\" xmlns=\"http:\/\/www.w3.org\/2001\/10\/synthesis\"\r\n       xmlns:mstts=\"http:\/\/www.w3.org\/2001\/mstts\"\r\n       xml:lang=\"en-US\"&gt;\r\n    &lt;voice name=\"en-US-JennyNeural\"&gt;\r\n        &lt;phoneme alphabet=\"ipa\" ph=\"\u02c8ma\u026akr\u0259\u02ccs\u0252ft\"&gt;Microsoft&lt;\/phoneme&gt;\r\n        &lt;sub alias=\"MSFT\"&gt;Microsoft&lt;\/sub&gt;\r\n    &lt;\/voice&gt;\r\n&lt;\/speak&gt;<\/code><\/pre>\n<p>For more details, refer to:\n<a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/speech-synthesis-markup-pronunciation#phoneme-element\">SSML Pronunciation Guide<\/a><\/p>\n<h4>Using an External Custom Lexicon File<\/h4>\n<p>For multiple custom pronunciations, an <strong>XML-based pronunciation lexicon file<\/strong> can be used.\nThis file should follow the <strong>Pronunciation Lexicon Specification (PLS) 
Version 1.0<\/strong>.<\/p>\n<h5>Example Custom Lexicon File (PLS XML)<\/h5>\n<pre><code class=\"language-xml\">&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\r\n&lt;lexicon version=\"1.0\"\r\n        xmlns=\"http:\/\/www.w3.org\/2005\/01\/pronunciation-lexicon\"\r\n        alphabet=\"ipa\" xml:lang=\"en-US\"&gt;\r\n    &lt;lexeme&gt;\r\n        &lt;grapheme&gt;Microsoft&lt;\/grapheme&gt;\r\n        &lt;phoneme&gt;\u02c8ma\u026akr\u0259\u02ccs\u0252ft&lt;\/phoneme&gt;\r\n    &lt;\/lexeme&gt;\r\n    &lt;lexeme&gt;\r\n        &lt;grapheme&gt;Azure&lt;\/grapheme&gt;\r\n        &lt;phoneme&gt;\u02c8\u00e6\u0292\u0259r&lt;\/phoneme&gt;\r\n    &lt;\/lexeme&gt;\r\n    &lt;lexeme&gt;\r\n        &lt;grapheme&gt;Power BI&lt;\/grapheme&gt;\r\n        &lt;alias&gt;PowerBI&lt;\/alias&gt;\r\n    &lt;\/lexeme&gt;\r\n    &lt;lexeme&gt;\r\n        &lt;grapheme&gt;PowerBI&lt;\/grapheme&gt;\r\n        &lt;phoneme&gt;\u02c8pa\u028a\u0259r\u02ccbi\u02d0\u02cca\u026a&lt;\/phoneme&gt;\r\n    &lt;\/lexeme&gt;\r\n    &lt;lexeme&gt;\r\n        &lt;grapheme&gt;Teams&lt;\/grapheme&gt;\r\n        &lt;phoneme&gt;ti\u02d0mz&lt;\/phoneme&gt;\r\n    &lt;\/lexeme&gt;\r\n&lt;\/lexicon&gt;<\/code><\/pre>\n<h5>Key Elements:<\/h5>\n<ul>\n<li><code>&lt;lexicon&gt;<\/code>: Root element for defining pronunciation rules.<\/li>\n<li><code>&lt;lexeme&gt;<\/code>: Defines pronunciation for each word.<\/li>\n<li><code>&lt;grapheme&gt;<\/code>: Represents the original text.<\/li>\n<li><code>&lt;phoneme&gt;<\/code>: Defines the pronunciation of the word.<\/li>\n<\/ul>\n<p><strong>Supported Phonetic Alphabets:<\/strong><\/p>\n<ul>\n<li><code>ipa<\/code>: International Phonetic Alphabet<\/li>\n<li><code>sapi<\/code>: Microsoft Speech API<\/li>\n<li><code>ups<\/code>: Universal Phone Set<\/li>\n<li><code>x-sampa<\/code>: Extended SAMPA<\/li>\n<\/ul>\n<p>For more details: <a 
href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/speech-synthesis-markup-pronunciation#custom-lexicon-file-examples\">Custom Lexicon Documentation<\/a><\/p>\n<h3>Implementing Custom Lexicon<\/h3>\n<h4>Saving and Storing the Lexicon File<\/h4>\n<p>The custom lexicon should be stored as <code>customlexicon.xml<\/code> and must be in <strong>UTF-8<\/strong> encoding.\n<strong>Embedded Neural Voices only support offline custom lexicons<\/strong>, so the lexicon file must be stored locally.<\/p>\n<ul>\n<li><strong>For Cloud TTS<\/strong>, the lexicon file should be uploaded to a <strong>publicly accessible location<\/strong> such as Azure Blob Storage.<\/li>\n<li><strong>For Embedded TTS<\/strong>, the lexicon file should be stored locally on the device.<\/li>\n<\/ul>\n<h4>Configuring SSML to Use a Custom Lexicon<\/h4>\n<p>To reference the custom lexicon in SSML, use the <code>&lt;lexicon&gt;<\/code> element with the appropriate file path.<\/p>\n<h5>Example SSML for Cloud TTS (Publicly Hosted Lexicon)<\/h5>\n<pre><code class=\"language-xml\">&lt;speak version=\"1.0\" xmlns=\"http:\/\/www.w3.org\/2001\/10\/synthesis\"\r\n       xmlns:mstts=\"http:\/\/www.w3.org\/2001\/mstts\"\r\n       xml:lang=\"en-US\"&gt;\r\n    &lt;voice name=\"en-US-JennyNeural\"&gt;\r\n        &lt;lexicon uri=\"https:\/\/example.com\/customlexicon.xml\"\/&gt;\r\n        Please launch Teams.\r\n    &lt;\/voice&gt;\r\n&lt;\/speak&gt;<\/code><\/pre>\n<h5>Example SSML for Embedded TTS (Local Lexicon)<\/h5>\n<pre><code class=\"language-xml\">&lt;speak version=\"1.0\" xmlns=\"http:\/\/www.w3.org\/2001\/10\/synthesis\"\r\n       xmlns:mstts=\"http:\/\/www.w3.org\/2001\/mstts\"\r\n       xml:lang=\"en-US\"&gt;\r\n    &lt;voice name=\"en-US-JennyNeural\"&gt;\r\n        &lt;lexicon uri=\"file:\/\/\/\/home\/customlexicon.xml\"\/&gt;\r\n        Please launch Teams.\r\n    &lt;\/voice&gt;\r\n&lt;\/speak&gt;<\/code><\/pre>\n<h5>Example SSML for Hybrid TTS (Cloud and 
Embedded)<\/h5>\n<pre><code class=\"language-xml\">&lt;speak version=\"1.0\" xmlns=\"http:\/\/www.w3.org\/2001\/10\/synthesis\"\r\n       xmlns:mstts=\"http:\/\/www.w3.org\/2001\/mstts\"\r\n       xml:lang=\"en-US\"&gt;\r\n    &lt;voice name=\"en-US-JennyNeural\"&gt;\r\n        &lt;lexicon uri=\"https:\/\/example.com\/customlexicon.xml\"\/&gt;\r\n        &lt;lexicon uri=\"file:\/\/\/\/home\/customlexicon.xml\"\/&gt;\r\n        Please launch Teams.\r\n    &lt;\/voice&gt;\r\n&lt;\/speak&gt;<\/code><\/pre>\n<p><strong>Important Notes:<\/strong><\/p>\n<ul>\n<li>The path format for local files uses <code>file:\/\/\/\/<\/code> (four slashes) in SSML.<\/li>\n<li>For Cloud TTS, the lexicon file must be accessible via <strong>HTTPS<\/strong>.<\/li>\n<li><strong>Hybrid TTS Configuration<\/strong>:\n<ul>\n<li>By default, <strong>HybridSpeechConfig<\/strong> prioritizes the cloud speech service.<\/li>\n<li>If the cloud connection is restricted or slow, the system automatically falls back to <strong>embedded speech<\/strong>.<\/li>\n<li>For text-to-speech synthesis, <strong>both embedded and cloud synthesis<\/strong> run in parallel, with the faster response selected.<\/li>\n<li>Each synthesis request dynamically chooses the optimal result, minimizing latency and maintaining high-quality output.<\/li>\n<li><strong>Reference<\/strong>: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/embedded-speech?tabs=android-target%2Cjre&amp;pivots=programming-language-java#hybrid-speech\">Hybrid Speech in Azure Speech Services<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Considerations and Best Practices<\/h3>\n<h4>File Size Limits<\/h4>\n<ul>\n<li>The <strong>maximum lexicon file size is 100 KB<\/strong>.<\/li>\n<li>If the file exceeds this limit, split the lexicon into multiple files and reference them separately in SSML.<\/li>\n<\/ul>\n<h4>Caching Behavior<\/h4>\n<ul>\n<li>Custom lexicons are <strong>cached for 15 minutes<\/strong> after being loaded. 
Changes may take up to <strong>15 minutes<\/strong> to reflect.<\/li>\n<\/ul>\n<h4>Lexicon Validation<\/h4>\n<p>Microsoft provides a validation tool to ensure that custom lexicons are correctly formatted:\n<a href=\"https:\/\/github.com\/Azure-Samples\/Cognitive-Speech-TTS\/tree\/master\/CustomLexiconValidation\">Custom Lexicon Validation Tool<\/a><\/p>\n<h4>Recommended Phonetic Conversion Tools<\/h4>\n<ul>\n<li><a href=\"https:\/\/www.phonetizer.com\/downloads.html\">Phonetizer<\/a> \u2013 Convert words to IPA format.<\/li>\n<li><a href=\"http:\/\/ipa-reader.xyz\/\">IPA Reader<\/a> \u2013 Check IPA pronunciation.<\/li>\n<\/ul>\n<h2>Custom Keyword Recognition<\/h2>\n<h3>Self-Serve Portal Model<\/h3>\n<p>Developers can train custom keywords using the <strong>Custom Keyword portal in Azure AI Speech Studio<\/strong>. This is a self-serve experience that allows customers to create models without requiring Microsoft&#8217;s intervention.<\/p>\n<h4>Implementation Steps<\/h4>\n<ol>\n<li>Create a custom keyword in <a href=\"https:\/\/speech.microsoft.com\/portal\/customkeyword\">Speech Studio<\/a>.<\/li>\n<li>Download the generated <code>.table<\/code> file.<\/li>\n<li>Integrate the file with Azure Speech SDK to enable custom keyword recognition.<\/li>\n<\/ol>\n<p>For more details, refer to the <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/custom-keyword-basics?pivots=programming-language-csharp#create-a-keyword-in-speech-studio\">official documentation<\/a>.<\/p>\n<h4>System Requirements and Performance Considerations<\/h4>\n<ul>\n<li>The model is adapted with Text-to-Speech data and trained in Azure Speech Studio. Models are enabled in your application using the Azure Speech SDK.<\/li>\n<li>Depending on the device, custom keyword recognition may impact CPU usage. 
Optimization strategies should be tested on target hardware.<\/li>\n<\/ul>\n<p>This self-serve approach is ideal for standard keyword recognition needs but has some limitations, such as limited control over phonetic tuning and background noise resilience.<\/p>\n<h2>Conclusion<\/h2>\n<p>Optimizing Speech-to-Text (STT), Text-to-Speech (TTS), and custom keyword recognition is crucial for delivering seamless and accurate voice experiences. By leveraging Azure Speech Services&#8217; powerful capabilities, developers can enhance recognition accuracy, improve pronunciation, and implement hands-free activation tailored to their needs.<\/p>\n<p>To maximize performance:<\/p>\n<ul>\n<li><strong>Enhance STT accuracy<\/strong> by combining <strong>phrase lists, N-best analysis, and correction logic<\/strong> for better recognition of domain-specific terms and proper nouns.<\/li>\n<li><strong>Leverage custom lexicons and SSML<\/strong> to refine TTS pronunciation and ensure consistent voice output.<\/li>\n<li><strong>Adopt the custom keyword recognition model<\/strong> to enable hands-free activation, balancing ease of implementation with application requirements.<\/li>\n<\/ul>\n<p>Whether developing cloud-based or embedded voice applications, applying these strategies ensures robust and efficient speech processing. 
Take full advantage of Azure Speech Services to enhance your voice-enabled solutions today.<\/p>\n<p><em>The feature image was sourced from <a href=\"https:\/\/unsplash.com\/\">Unsplash<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog outlines best practices for optimizing Speech-to-Text (STT), Text-to-Speech (TTS), and Custom Keyword Recognition in Azure Speech Services, helping developers build more accurate and responsive voice-enabled applications.<\/p>\n","protected":false},"author":184616,"featured_media":16452,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,3451],"tags":[124,238],"class_list":["post-16451","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-ise","tag-cognitive-services","tag-machine-learning"],"acf":[],"blog_post_summary":"<p>This blog outlines best practices for optimizing Speech-to-Text (STT), Text-to-Speech (TTS), and Custom Keyword Recognition in Azure Speech Services, helping developers build more accurate and responsive voice-enabled 
applications.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16451","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/184616"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=16451"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16451\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/16452"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=16451"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=16451"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=16451"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}