October 31st, 2025

Tuning and Optimization of Speech-to-Text (STT), Text-to-Speech (TTS), and Custom Keyword Recognition in Azure Speech Services

Introduction

Optimizing Speech-to-Text (STT) and Text-to-Speech (TTS) is essential for developers building voice-enabled applications, as it enhances recognition accuracy and overall user experience. Additionally, custom keyword recognition enables hands-free activation for voice assistants.

This blog outlines:

  • Methods for improving STT accuracy and tuning
  • How to enhance TTS pronunciation using custom lexicons
  • Approaches for implementing custom keyword recognition efficiently

Speech-to-Text (STT) Optimization

Obtaining STT Models

Azure provides several STT models:

STT Locale Support

Default STT Models

Azure Speech-to-Text (STT) supports a wide range of languages and locales using default models. You can find the complete list of supported languages and features in the official documentation: Supported languages for Speech-to-Text.

Custom Speech Models

To improve speech-to-text recognition accuracy, customization is available for some languages and base models. Depending on the locale, you can upload audio + human-labeled transcripts, plain text, structured text, and pronunciation data. By default, plain text customization is supported for all available base models.

Embedded Speech Models

The following Embedded STT models are currently available:

  • Danish (da-DK)
  • German (de-DE)
  • English (en-AU, en-CA, en-GB, en-IE, en-IN, en-NZ, en-US)
  • Spanish (es-ES, es-MX)
  • French (fr-CA, fr-FR)
  • Italian (it-IT)
  • Japanese (ja-JP)
  • Korean (ko-KR)
  • Portuguese (pt-BR, pt-PT)
  • Chinese (zh-CN, zh-HK, zh-TW)

You can check the official documentation for the most up-to-date list of supported languages and models: Embedded Speech Models and Voices.

Improving Recognition Accuracy

Several techniques enhance STT accuracy by correcting misrecognitions and adapting the model to domain-specific vocabulary.

Phrase List (For both Embedded and Cloud models)

A phrase list improves recognition for specific words and phrases. This is particularly useful for proper nouns, brand names, and technical terms. For more details, refer to the Improve recognition accuracy with phrase list.

Implementation Example (Java)
import com.microsoft.cognitiveservices.speech.PhraseListGrammar;

PhraseListGrammar phraseList = PhraseListGrammar.fromRecognizer(recognizer);
phraseList.addPhrase("Microsoft");
phraseList.addPhrase("Azure");
phraseList.addPhrase("Teams");
Additional Features
  • To remove all added phrases and reset the phrase list, use:
    phraseList.clear();
  • Azure Speech SDK supports over 1,000 phrase list entries, but initialization time increases with the number of phrases.

Custom Correction Logic (For both Embedded and Cloud models; en-US only)

Phrase lists alone may not be sufficient for certain words. Custom logic can be implemented to force corrections of frequently misrecognized words.

Preparing corrections.json

To refine the correction logic, two methods can be used to identify misrecognition patterns:

  1. Analyze results after applying the phrase list to find words that are still misrecognized.
  2. Use N-best results from Word Level Details to check alternative recognition candidates.

What is N-best?

N-best refers to a ranked list of recognition hypotheses, where the STT engine provides not only the top result but also alternative candidates. By analyzing lower-ranked alternatives, it is possible to identify and correct systematic misrecognitions.

Enabling N-best in Word Level Details

To retrieve N-best results and detailed word-level recognition information, configure speechConfig with the following properties:

speechConfig.setOutputFormat(OutputFormat.Detailed);
speechConfig.setProperty("SpeechRecognition_RequestWordLevelCorrections", "true");
Parsing N-best Results

Once the STT engine provides N-best results, the application can analyze them to refine recognition accuracy. Below is an example of how to parse N-best results from the JSON response:

import com.microsoft.cognitiveservices.speech.PropertyId;
import org.json.JSONArray;
import org.json.JSONObject;

// Fetch the JSON result containing the N-best list
String jsonResult = e.getResult()
                     .getProperties()
                     .getProperty(PropertyId.SpeechServiceResponse_JsonResult);
System.out.println("JSON result: " + jsonResult);

// Parse the JSON for N-best recognition candidates
JSONObject json = new JSONObject(jsonResult);
JSONArray nbestArray = json.optJSONArray("NBest");

if (nbestArray != null && nbestArray.length() > 0) {
    for (int i = 0; i < nbestArray.length(); i++) {
        JSONObject candidate = nbestArray.getJSONObject(i);
        System.out.println("Candidate " + (i + 1) + ": " + candidate.getString("Display") +
                           " (Confidence: " + candidate.getDouble("Confidence") + ")");
    }
}
N-best Example with Sentence Input

When an entire sentence is processed instead of a single word, N-best alternatives focus on specific words within the sentence. Below is an actual example where the word “tune” has multiple alternatives:

"NBest": [
    {
      "Confidence": 0.867123,
      "Lexical": "tune goggles",
      "Display": "Tune Goggles.",
      "Words": [
          {
              "Word": "tune",
              "Confidence": 0.867123,
              "Offset": 12500000,
              "Duration": 4500000
          },
          {
              "Word": "goggles",
              "Confidence": 0.867123,
              "Offset": 17500000,
              "Duration": 7000000
          }
      ]
      ...
    }
],
"Corrections": {
    "CorrectionCandidates": [
        {
            "Alternates": [
                { "AlternateWords": ["toon"], "Id": 0, "SourceSpan": [0] }
            ],
            "Confidence": "High",
            "Id": 0,
            "Span": [0]
        }
    ]
}

In this example, “toon” is an alternative candidate for “tune” (SourceSpan = 0), making it a possible correction in the phrase “tune goggles.”

This insight helps refine correction rules, ensuring misrecognitions like “tune goggles” → “ToonGoggles” are automatically corrected. By analyzing lower-confidence alternatives, developers can improve STT accuracy for domain-specific terms.
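To mine these candidates programmatically when building correction rules, the Corrections block can be parsed alongside NBest. The sketch below reuses jsonResult from the parsing example above and the field names shown in the sample payload; it simply logs each alternate so you can decide which corrections to add:

// Parse the Corrections block (field names follow the sample payload above).
JSONObject json = new JSONObject(jsonResult);
JSONObject corrections = json.optJSONObject("Corrections");
if (corrections != null) {
    JSONArray candidates = corrections.getJSONArray("CorrectionCandidates");
    for (int i = 0; i < candidates.length(); i++) {
        JSONObject candidate = candidates.getJSONObject(i);
        JSONArray alternates = candidate.getJSONArray("Alternates");
        for (int j = 0; j < alternates.length(); j++) {
            JSONObject alternate = alternates.getJSONObject(j);
            JSONArray words = alternate.getJSONArray("AlternateWords");
            System.out.println("Alternate for span " + candidate.getJSONArray("Span") +
                               " (confidence " + candidate.getString("Confidence") + "): " + words);
        }
    }
}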

Example Correction Rules (corrections.json)
{
  "ToonGoggles": ["tune goggles", "toon googles", "tune google"]
}

Implementing Correction Logic

To efficiently apply correction rules, a dedicated correction utility class (CorrectionUtils) is used. This class loads misrecognition patterns from corrections.json and automatically corrects recognized text.

Key Features of CorrectionUtils
  • Loads correction rules dynamically from a JSON file.
  • Converts all misrecognition patterns to lowercase to ensure case-insensitive correction.
  • Replaces misrecognized words with the correct phrase while preserving the rest of the sentence.
  • Can be updated by modifying corrections.json without changing application code.
Code Snippet of CorrectionUtils (Java)
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CorrectionUtils {
    private static final Map<String, List<String>> correctionMap = new HashMap<>();

    static { loadCorrections("corrections.json"); }

    private static void loadCorrections(String filePath) {
        try {
            // Load the correction rules from the JSON file
            ObjectMapper mapper = new ObjectMapper();
            Map<String, List<String>> corrections = mapper.readValue(
                new BufferedReader(new FileReader(filePath)),
                new TypeReference<Map<String, List<String>>>() {}
            );

            // Convert all misrecognition lists to lowercase for case-insensitive matching
            for (Map.Entry<String, List<String>> entry : corrections.entrySet()) {
                correctionMap.put(entry.getKey(),
                    entry.getValue().stream()
                        .map(String::toLowerCase)
                        .collect(Collectors.toList())
                );
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static String correctRecognition(String recognizedText) {
        String lowerCaseText = recognizedText.toLowerCase(); // Convert input to lowercase for matching
        for (Map.Entry<String, List<String>> entry : correctionMap.entrySet()) {
            String correctPhrase = entry.getKey();
            List<String> misrecognitions = entry.getValue();

            for (String misrecognition : misrecognitions) {
                if (lowerCaseText.contains(misrecognition)) {
                    // Replace the misrecognized phrase (case-insensitively) with the correct one
                    recognizedText = recognizedText.replaceAll(
                        "(?i)" + Pattern.quote(misrecognition),
                        Matcher.quoteReplacement(correctPhrase));
                }
            }
        }
        return recognizedText;
    }
}

To avoid case-sensitivity issues, all entries in corrections.json should be written in lowercase, ensuring consistent application across different inputs.

Usage Example in Speech Recognition
String recognizedText = e.getResult().getText();
recognizedText = CorrectionUtils.correctRecognition(recognizedText);
System.out.println("Corrected: Text=" + recognizedText);

Text-to-Speech (TTS) Optimization

Obtaining TTS Models

Azure provides several TTS models:

TTS Locale Support

Cloud Neural Voices

Cloud Neural Voices support 150+ languages and accents, with 500+ voice models available. You can find the complete list of supported languages and features in the official documentation: Supported languages for Text-to-Speech.

Custom Neural Voice (CNV)

CNV includes several feature types—CNV Pro, CNV Lite, Cross-lingual voice source and target, and Multi-style voice—and the supported locales vary depending on the feature selected. For more details, refer to the Supported languages for Custom Neural Voice.

Embedded Neural Voices

In contrast to Speech-to-Text (STT), all Embedded Text-to-Speech (TTS) locales (except Persian (fa-IR)) are available out of the box, with at least one selected female and/or male voice per locale. If a required voice model is not available among the Embedded Neural Voices, additional cost may be required to train a new model.

Custom Lexicon for Pronunciation

To correct mispronunciations, a custom lexicon can be created using SSML or an external XML-based lexicon file. For more details, refer to the Custom Lexicon documentation.

Defining Pronunciation Using SSML

SSML supports inline phoneme correction using <phoneme> and <sub> elements.

  • <phoneme> allows pronunciation adjustments using International Phonetic Alphabet (IPA), Speech API (SAPI), Universal Phone Set (UPS), or X-SAMPA.
  • <sub> is used for aliasing text.
Example
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts"
       xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <phoneme alphabet="ipa" ph="ˈmaɪkrəˌsɒft">Microsoft</phoneme>
        <sub alias="MSFT">Microsoft</sub>
    </voice>
</speak>

For more details, refer to: SSML Pronunciation Guide

Using an External Custom Lexicon File

For multiple custom pronunciations, an XML-based pronunciation lexicon file can be used. This file should follow the Pronunciation Lexicon Specification (PLS) Version 1.0.

Example Custom Lexicon File (PLS XML)
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
        xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
        alphabet="ipa" xml:lang="en-US">
    <lexeme>
        <grapheme>Microsoft</grapheme>
        <phoneme>ˈmaɪkrəˌsɒft</phoneme>
    </lexeme>
    <lexeme>
        <grapheme>Azure</grapheme>
        <phoneme>ˈæʒər</phoneme>
    </lexeme>
    <lexeme>
        <grapheme>Power BI</grapheme>
        <alias>PowerBI</alias>
    </lexeme>
    <lexeme>
        <grapheme>PowerBI</grapheme>
        <phoneme>ˈpaʊərˌbiːˌaɪ</phoneme>
    </lexeme>
    <lexeme>
        <grapheme>Teams</grapheme>
        <phoneme>tiːmz</phoneme>
    </lexeme>
</lexicon>
Key Elements:
  • <lexicon>: Root element for defining pronunciation rules.
  • <lexeme>: Defines pronunciation for each word.
  • <grapheme>: Represents the original text.
  • <alias>: Substitutes replacement text for the grapheme (used above to map “Power BI” to “PowerBI”).
  • <phoneme>: Defines the pronunciation of the word.

Supported Phonetic Alphabets:

  • ipa: International Phonetic Alphabet
  • sapi: Microsoft Speech API
  • ups: Universal Phone Set
  • x-sampa: Extended SAMPA

For more details: Custom Lexicon Documentation

Implementing Custom Lexicon

Saving and Storing the Lexicon File

The custom lexicon should be stored as customlexicon.xml and must be in UTF-8 encoding. Embedded Neural Voices only support offline custom lexicons, so the lexicon file must be stored locally.

  • For Cloud TTS, the lexicon file should be uploaded to a publicly accessible location such as Azure Blob Storage.
  • For Embedded TTS, the lexicon file should be stored locally on the device.

Configuring SSML to Use a Custom Lexicon

To reference the custom lexicon in SSML, use the <lexicon> element with the appropriate file path.

Example SSML for Cloud TTS (Publicly Hosted Lexicon)
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts"
       xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <lexicon uri="https://example.com/customlexicon.xml"/>
        Please launch Teams.
    </voice>
</speak>
Example SSML for Embedded TTS (Local Lexicon)
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts"
       xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <lexicon uri="file:////home/customlexicon.xml"/>
        Please launch Teams.
    </voice>
</speak>
Example SSML for Hybrid TTS (Cloud and Embedded)
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts"
       xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <lexicon uri="https://example.com/customlexicon.xml"/>
        <lexicon uri="file:////home/customlexicon.xml"/>
        Please launch Teams.
    </voice>
</speak>

Important Notes:

  • The path format for local files uses file://// (four slashes) in SSML.
  • For Cloud TTS, the lexicon file must be accessible via HTTPS.
  • Hybrid TTS Configuration:
    • By default, HybridSpeechConfig prioritizes the cloud speech service.
    • If the cloud connection is restricted or slow, the system automatically falls back to embedded speech.
    • For text-to-speech synthesis, both embedded and cloud synthesis run in parallel, with the faster response selected.
    • Each synthesis request dynamically chooses the optimal result, minimizing latency and maintaining high-quality output.
    • Reference: Hybrid Speech in Azure Speech Services
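To make the SSML examples above concrete, here is a minimal sketch of passing the lexicon-referencing SSML to a SpeechSynthesizer with the cloud service; the subscription key, region, and SSML file path are placeholders, and the same SSML can equally be used with an embedded or hybrid configuration:

import com.microsoft.cognitiveservices.speech.ResultReason;
import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisResult;
import com.microsoft.cognitiveservices.speech.SpeechSynthesizer;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Placeholders: replace with your own subscription key and service region.
SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

// Load the SSML that references the custom lexicon (see the Cloud TTS example above).
String ssml = new String(Files.readAllBytes(Paths.get("ssml-with-lexicon.xml")), StandardCharsets.UTF_8);

// Synthesize to the default speaker; the lexicon is fetched and applied by the service.
SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig);
SpeechSynthesisResult result = synthesizer.SpeakSsmlAsync(ssml).get();

if (result.getReason() == ResultReason.SynthesizingAudioCompleted) {
    System.out.println("Synthesis completed with the custom lexicon applied.");
} else {
    System.out.println("Synthesis did not complete: " + result.getReason());
}

result.close();
synthesizer.close();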

Considerations and Best Practices

File Size Limits

  • The maximum lexicon file size is 100 KB.
  • If the file exceeds this limit, split the lexicon into multiple files and reference them separately in SSML.

Caching Behavior

  • Custom lexicons are cached for 15 minutes after being loaded. Changes may take up to 15 minutes to reflect.

Lexicon Validation

Microsoft provides a validation tool to ensure that custom lexicons are correctly formatted: Custom Lexicon Validation Tool

Recommended Phonetic Conversion Tools

Custom Keyword Recognition

Self-Serve Portal Model

Developers can train custom keywords using the Custom Keyword portal in Azure AI Speech Studio. This is a self-serve experience that lets customers create keyword models without any involvement from Microsoft.

Implementation Steps

  1. Create a custom keyword in Speech Studio.
  2. Download the generated .table file.
  3. Integrate the file with Azure Speech SDK to enable custom keyword recognition.

For more details, refer to the official documentation.
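A minimal sketch of step 3, assuming the generated .table file has been saved locally (the file name below is a placeholder). This uses the KeywordRecognizer class; alternatively, the same model can be passed to a SpeechRecognizer via startKeywordRecognitionAsync.

import com.microsoft.cognitiveservices.speech.KeywordRecognitionModel;
import com.microsoft.cognitiveservices.speech.KeywordRecognitionResult;
import com.microsoft.cognitiveservices.speech.KeywordRecognizer;
import com.microsoft.cognitiveservices.speech.ResultReason;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;

// Load the .table file generated in Speech Studio (file name is a placeholder).
KeywordRecognitionModel model = KeywordRecognitionModel.fromFile("your-keyword.table");

// Listen on the default microphone until the keyword is spotted.
AudioConfig audioConfig = AudioConfig.fromDefaultMicrophoneInput();
KeywordRecognizer keywordRecognizer = new KeywordRecognizer(audioConfig);

KeywordRecognitionResult result = keywordRecognizer.recognizeOnceAsync(model).get();
if (result.getReason() == ResultReason.RecognizedKeyword) {
    System.out.println("Keyword detected: " + result.getText());
}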

System Requirements and Performance Considerations

  • The model is adapted with Text-to-Speech data and trained in Azure Speech Studio. Models are enabled in your application using the Azure Speech SDK.
  • Depending on the device, custom keyword recognition may impact CPU usage. Optimization strategies should be tested on target hardware.

This self-serve approach is ideal for standard keyword recognition needs but has some limitations, such as limited control over phonetic tuning and background noise resilience.

Conclusion

Optimizing Speech-to-Text (STT), Text-to-Speech (TTS), and custom keyword recognition is crucial for delivering seamless and accurate voice experiences. By leveraging Azure Speech Services’ powerful capabilities, developers can enhance recognition accuracy, improve pronunciation, and implement hands-free activation tailored to their needs.

To maximize performance:

  • Enhance STT accuracy by combining phrase lists, N-best analysis, and correction logic for better recognition of domain-specific terms and proper nouns.
  • Leverage custom lexicons and SSML to refine TTS pronunciation and ensure consistent voice output.
  • Adopt the custom keyword recognition model to enable hands-free activation, balancing ease of implementation with application requirements.

Whether developing cloud-based or embedded voice applications, applying these strategies ensures robust and efficient speech processing. Take full advantage of Azure Speech Services to enhance your voice-enabled solutions today.

The feature image was sourced from Unsplash.

Author

Ayaka Hara is a software engineer based in Tokyo, Japan. She develops AI- and cloud-powered applications, drawing on her background in machine learning and audio technology. Outside of work, she enjoys playing the violin, traveling, and volunteering to support programming education for children.