August 31st, 2023

Audio Alchemy: Transcribing and Translating with Azure SDKs

Mario Guerra
Senior Product Manager

Azure Cognitive Services offers a wide range of AI-powered services that can be utilized to enhance applications and services. In this article and associated demo project, I’m using Azure SDK client libraries to transcribe and translate audio files. By applying Azure Cognitive Services, you can easily convert audio files in one language to text in another language.

Prerequisites

Before diving into the tutorial, ensure that you have the following prerequisites:

  1. Python 3.6 or higher installed on your system.
  2. An Azure account with access to the Speech Service and Translator Service.
  3. The Azure Cognitive Services Speech library, Azure Cognitive Services Translator library, and python-dotenv package installed. You can install them using the following command:
    pip install azure-cognitiveservices-speech azure-ai-translation-text python-dotenv
  4. Get the demo project with the full script. You can clone the repo to your local machine using git:
    git clone https://github.com/mario-guerra/azure-speech-translator.git

Set up the environment

To set up your environment, retrieve the keys and endpoints for both the Speech Service and Translator Service from your Azure account by following these steps:

  1. Sign in to the Azure Portal: Visit the Azure portal and sign in with your Azure account credentials.
  2. Access the Speech Service:
    • In the left-hand menu, select “All services.”
    • In the search box, type “Speech” and select “Speech” from the results.
    • Choose the Speech Service you have created or create a new one if you haven’t already.
  3. Retrieve the Speech Service key and endpoint:
    • Once you have accessed your Speech Service, navigate to “Keys and Endpoint” under the “Resource Management” section in the left-hand menu.
    • Copy the Key1 or Key2 value (both are valid) and “Location/Region” value. These values are used as your AZURE_SPEECH_KEY and AZURE_SERVICE_REGION, respectively.
  4. Access the Translator Service:
    • Go back to the “All services” menu and search for “Translator.”
    • Select “Translator” from the results.
    • Choose the Translator Service you have created or create a new one if you haven’t already.
  5. Retrieve the Translator Service key and endpoint:
    • Once you have accessed your Translator Service, navigate to “Keys and Endpoint” under the “Resource Management” section in the left-hand menu.
    • Copy the Key1 or Key2 value (both are valid) and the “Text Translation” endpoint value under the “Web API” tab. These values are used as your AZURE_TRANSLATOR_KEY and AZURE_TRANSLATOR_ENDPOINT, respectively.

Once you have your keys, region, and endpoint, create a .env file in the same directory as your script and add the following environment variables:

AZURE_SPEECH_KEY=<your_speech_service_key>
AZURE_SERVICE_REGION=<your_speech_service_region>
AZURE_TRANSLATOR_KEY=<your_translator_service_key>
AZURE_TRANSLATOR_ENDPOINT=<your_translator_service_endpoint>

Replace the placeholder values with the appropriate keys and endpoints from your Azure account.
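
A missing or blank value in the .env file tends to surface later as a confusing authentication error, so it can help to check the settings up front. The following is a minimal sketch, not part of the demo script: the helper name missing_settings is hypothetical, and it assumes the values are already in the process environment (for example, after calling load_dotenv() from python-dotenv).

```python
import os

# The four settings described above.
REQUIRED_SETTINGS = (
    "AZURE_SPEECH_KEY",
    "AZURE_SERVICE_REGION",
    "AZURE_TRANSLATOR_KEY",
    "AZURE_TRANSLATOR_ENDPOINT",
)

def missing_settings(env=os.environ):
    """Return the names of required settings that are unset or blank."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]

# Hypothetical example: one value is set, one is blank, and two are absent.
example_env = {"AZURE_SPEECH_KEY": "abc123", "AZURE_SERVICE_REGION": "  "}
print(missing_settings(example_env))
# → ['AZURE_SERVICE_REGION', 'AZURE_TRANSLATOR_KEY', 'AZURE_TRANSLATOR_ENDPOINT']
```

Calling missing_settings() with no argument checks the real environment, so the script could exit early with a clear message when the returned list isn't empty.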

Audio translation script overview

The demo project features an audio translation script that processes input audio files in WAV format. It transcribes the audio using Azure Speech Service and translates the resulting text into the desired target language with Azure Translator Service.
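
Because the script expects WAV input, it can be useful to inspect a file before sending it to the Speech Service. The sketch below uses only Python's standard wave module and isn't part of the demo script; describe_wav is a hypothetical helper name. As far as I know, the Speech SDK's default file input expects 16-bit mono PCM at 16 kHz or 8 kHz, so treat those figures as an assumption to confirm against the Speech documentation.

```python
import wave

def describe_wav(path):
    """Return basic format details read from a WAV file's header."""
    # wave.open raises wave.Error if the file is not an uncompressed PCM WAV.
    with wave.open(path, "rb") as wav_file:
        frame_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate_hz": frame_rate,
            "bits_per_sample": wav_file.getsampwidth() * 8,
            "duration_seconds": wav_file.getnframes() / frame_rate,
        }
```

For example, describe_wav("Spanish test.wav") would report the channel count, sample rate, bit depth, and length of that recording, which makes it easier to spot a file that needs converting first.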

This powerful tool can be invaluable for various applications, such as language learning, content localization, and accessibility services.

Customize the audio translation script

The provided audio translation script can be customized to suit specific requirements. For example, you can modify the script to:

  • Add support for more languages by updating the language_codes and translator_language_codes dictionaries.
  • Adjust the timeout settings for speech recognition by modifying the values of the SpeechServiceConnection_InitialSilenceTimeoutMs, SpeechServiceConnection_EndSilenceTimeoutMs, and Speech_SegmentationSilenceTimeoutMs properties.
  • Implement extra error handling and logging to improve the script’s robustness and maintainability.
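
As an illustration of the first customization, here's a minimal sketch of what the two dictionaries might look like and how a new language could be added. The dictionary contents and the add_language helper are assumptions for illustration; check the demo script for the actual entries. The Speech Service uses locale codes such as en-US, while the Translator Service uses shorter language codes such as en.

```python
# Assumed shape: friendly language name -> Speech Service locale.
language_codes = {
    "english": "en-US",
    "spanish": "es-ES",
    "estonian": "et-EE",
    "french": "fr-FR",
    "italian": "it-IT",
    "german": "de-DE",
}

# Assumed shape: friendly language name -> Translator Service language code.
translator_language_codes = {
    "english": "en",
    "spanish": "es",
    "estonian": "et",
    "french": "fr",
    "italian": "it",
    "german": "de",
}

def add_language(name, speech_locale, translator_code):
    """Register a language in both lookup tables so they stay in sync."""
    language_codes[name] = speech_locale
    translator_language_codes[name] = translator_code

# Hypothetical addition: Brazilian Portuguese.
add_language("portuguese", "pt-BR", "pt")
print(language_codes["portuguese"], translator_language_codes["portuguese"])
# → pt-BR pt
```

Updating both dictionaries through one helper avoids the case where a language can be transcribed but has no matching Translator code.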

Transcribe with continuous recognition vs. one-shot recognition

In speech recognition, two main approaches can be used: continuous recognition and one-shot recognition. Each has its own use cases and benefits. In my demo audio translation script, I chose continuous recognition for real-time translation, better handling of pauses, and greater flexibility.

Continuous recognition

Continuous recognition is a speech recognition approach that processes audio input in real time and continuously recognizes speech as it is spoken. This method is useful when dealing with long audio files or live audio streams, as it provides real-time feedback and can handle pauses or interruptions in speech.

In continuous recognition, the speech recognizer listens for speech and generates results as it recognizes words and phrases. It also raises events when it recognizes speech, allowing you to perform actions, such as translating the recognized text, as demonstrated in my script.

Here’s how I set up continuous recognition using the Azure Cognitive Services Speech library in my script:

  1. Configure the Speech Service by creating a SpeechConfig object and setting the speech recognition language and other properties:
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_config.speech_recognition_language = speech_recognition_language
    speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_InitialSilenceTimeoutMs, "15000")
    speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, "10000")
    speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, "5000")
  2. Define event handlers for the recognized and session_stopped events:
    def on_recognized(recognition_args, in_lang, out_lang):
        source_text = recognition_args.result.text
        print(f"Transcribed text: {source_text}")
    
        # Write the transcribed text to the transcription output file if specified
        if cmd_line_args.transcription:
            with open(cmd_line_args.transcription, 'a', encoding='utf-8') as f:
                f.write(f"{source_text}\n")
    
        # Translate the transcribed text using the Azure Translator SDK
        try:
            source_language = translator_language_codes[in_lang]
            # Translator service supports translation to multiple languages in one pass,
            # so it expects a bracketed list even when translating to only one language.
            target_languages = [translator_language_codes[out_lang]]
            input_text_elements = [ InputTextItem(text = source_text) ]
            response = text_translator.translate(content = input_text_elements, to = target_languages, from_parameter = source_language)
            translation = response[0] if response else None
    
            if translation:
                for translated_text in translation.translations:
                    print(f"Translated text: {translated_text.text}")
                    # Write the translated text to the output file
                    with open(cmd_line_args.output_file, 'a', encoding='utf-8') as f:
                        f.write(f"{translated_text.text}\n")
    
        except HttpResponseError as exception:
            print(f"Error Code: {exception.error.code}")
            print(f"Message: {exception.error.message}")
    
    def on_session_stopped(args):
        print("Continuous recognition session stopped.")
        global session_stopped
        session_stopped = True
  3. Create a SpeechRecognizer object using the SpeechConfig object and an AudioConfig object that specifies the input audio file:
    audio_input = speechsdk.audio.AudioConfig(filename=input_audio_file)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
  4. Connect the event handlers to the corresponding events of the SpeechRecognizer object and start the continuous recognition process asynchronously:
    speech_recognizer.recognized.connect(lambda recognition_args: on_recognized(recognition_args, cmd_line_args.in_lang, cmd_line_args.out_lang))
    speech_recognizer.session_stopped.connect(on_session_stopped)
    speech_recognizer.start_continuous_recognition_async().get()
  5. Wait for the session_stopped event to be triggered before proceeding to the next audio file or terminating the script:
    while not session_stopped:
        time.sleep(0.5)

By following these steps, I’ve set up continuous recognition using the Azure Cognitive Services Speech library. This approach allows the script to process audio input in real time, handle pauses or interruptions in speech, and perform actions, such as translation, as soon as speech is recognized.
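
The waiting pattern in step 5 polls a global flag. One alternative, which isn't what the demo script does, is to signal completion with a threading.Event from the standard library. The sketch below is a minimal illustration under that assumption: a timer stands in for the Speech SDK, which would call the handler from its own thread when the session ends.

```python
import threading

session_done = threading.Event()

def on_session_stopped(args):
    """Event handler: signal the main thread that recognition has finished."""
    print("Continuous recognition session stopped.")
    session_done.set()

# Stand-in for the SDK: fire the handler from another thread after 0.2 seconds.
threading.Timer(0.2, on_session_stopped, args=(None,)).start()

# Block until the handler fires. The timeout guards against waiting forever
# if the event never arrives.
finished = session_done.wait(timeout=5)
print(f"Finished: {finished}")
```

With this approach there's no global statement and no sleep interval to tune. To reuse the same event for the next audio file, call session_done.clear() before starting recognition again.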

One-shot recognition

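One-shot recognition, also known as single-utterance recognition, processes a single utterance and returns the recognition result once that utterance is complete. The end of the utterance is determined by listening for silence at the end or by reaching a maximum length of processed audio, so a long audio file isn't transcribed in full by a single call. This approach is suitable for short audio clips or situations where real-time feedback isn’t necessary.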

To perform one-shot recognition, you would create a SpeechRecognizer object, just like in continuous recognition, and then call the recognize_once_async() method:

result = speech_recognizer.recognize_once_async().get()

The recognized text can be accessed using the result.text property. One-shot recognition is easier to implement, as it requires only a single function call, but it lacks the real-time feedback and flexibility of continuous recognition.

Translate the transcriptions

Once the audio files are transcribed, the script translates the transcriptions into the desired output language using the Azure Translator library. In the on_recognized event handler, the translation process is performed as follows:

  1. Retrieve the source and target language codes from the translator_language_codes dictionary:
    source_language = translator_language_codes[in_lang]
    target_languages = [translator_language_codes[out_lang]]
  2. Create a list of InputTextItem objects containing the transcribed text:
    input_text_elements = [ InputTextItem(text = source_text) ]
  3. Call the translate method of the TextTranslationClient object, passing the input text elements, target languages, and source language:
    response = text_translator.translate(content = input_text_elements, to = target_languages, from_parameter = source_language)
  4. Process the translation response and write the translated text to the output file:
    translation = response[0] if response else None
    
    if translation:
        for translated_text in translation.translations:
            print(f"Translated text: {translated_text.text}")
            # Write the translated text to the output file
            with open(cmd_line_args.output_file, 'a', encoding='utf-8') as f:
                f.write(f"{translated_text.text}\n")
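
Step 4 can also be pulled into a small helper so that it's easy to test without calling the Translator Service. The following is a minimal sketch with a hypothetical function name, append_translations. It assumes only the response shape used above: a list whose first item has a translations list of objects with a text attribute. It writes the text of each translation rather than the enclosing response object.

```python
import os
import tempfile
from types import SimpleNamespace

def append_translations(response, output_path):
    """Append the text of each translation in the first response item to a file."""
    translation = response[0] if response else None
    if translation is None:
        return []
    texts = [item.text for item in translation.translations]
    with open(output_path, "a", encoding="utf-8") as f:
        for text in texts:
            f.write(f"{text}\n")
    return texts

# Hypothetical stand-in for a Translator response, used here instead of a real call.
fake_response = [SimpleNamespace(translations=[SimpleNamespace(text="Hola, mundo.")])]
demo_path = os.path.join(tempfile.mkdtemp(), "output.txt")
print(append_translations(fake_response, demo_path))
# → ['Hola, mundo.']
```

Opening the file once per response, instead of once per translated item, also keeps the file handling in a single place.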

Run the audio translation script

To run the script, use the following command:

python audio_translation.py --in-lang <input_language> --out-lang <output_language> <input_audio_pattern> <output_file> [--transcription <transcription_output_file>]

Replace the placeholders with the appropriate values:

  • <input_language>: The input language (currently supported: english, spanish, estonian, french, italian, german)
  • <output_language>: The output language (currently supported: english, spanish, estonian, french, italian, german)
  • <input_audio_pattern>: The path to the input audio files with a wildcard pattern (for example, ./*.wav)
  • <output_file>: The path to the output file containing the translations
  • <transcription_output_file> (optional): The path to the output file containing the transcriptions

For example:

python audio_translation.py --in-lang english --out-lang spanish ./input_audio/*.wav output.txt --transcription transcription.txt

This command transcribes and translates all .wav files in the input_audio directory from English to Spanish. The translations are saved in output.txt, and the transcriptions are saved in transcription.txt.
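
For readers curious how a command line like this can be parsed, here's a minimal sketch using argparse and glob from the standard library. It's an approximation, not the demo script's actual code: the attribute names in_lang, out_lang, output_file, and transcription match those used in the handlers earlier, while input_audio_pattern and build_parser are names I'm assuming for illustration.

```python
import argparse
import glob

def build_parser():
    """Build a parser that mirrors the command line shown above."""
    parser = argparse.ArgumentParser(description="Transcribe and translate WAV files.")
    parser.add_argument("--in-lang", required=True, help="input language, e.g. english")
    parser.add_argument("--out-lang", required=True, help="output language, e.g. spanish")
    parser.add_argument("input_audio_pattern", help="path or wildcard pattern, e.g. ./*.wav")
    parser.add_argument("output_file", help="file that receives the translations")
    parser.add_argument("--transcription", default=None,
                        help="optional file that receives the transcriptions")
    return parser

# Parse an explicit argument list here; the real script would parse sys.argv.
cmd_line_args = build_parser().parse_args(
    ["--in-lang", "english", "--out-lang", "spanish", "./input_audio/*.wav", "output.txt"]
)

# Expand the wildcard in Python, since not every shell expands it for you.
input_files = sorted(glob.glob(cmd_line_args.input_audio_pattern))
print(cmd_line_args.in_lang, cmd_line_args.out_lang, cmd_line_args.output_file)
# → english spanish output.txt
```

With a parser shaped like this, a shell that expands an unquoted wildcard itself (bash, for example) would pass several file names instead of one pattern, so quoting the pattern is the safer habit there.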

Output:

python .\azure_translator.py --in-lang spanish --out-lang english '.\Spanish test.wav' .\translation.txt
Processing audio file: .\Spanish test.wav
Transcribed text: Esta es una prueba del sistema de transmisión de emergencia. Solo es una prueba si esto fuera una emergencia real, estaría corriendo para salvar mi vida.
Translated text: This is a test of the emergency transmission system. It's just a test if this was a real emergency, I would be running for my life.
Continuous speech recognition session stopped.

Conclusion

By following this tutorial and using the provided audio translation script, you can efficiently transcribe and translate audio files into different languages. This powerful tool, powered by Azure Cognitive Services, opens up numerous possibilities for language learning, content localization, and accessibility services. With the added customization options and a choice between continuous recognition and one-shot recognition, the script becomes an even more versatile solution for your audio translation needs. I encourage you to explore further and experiment with the script to discover its full potential and adapt it to various use cases.

Happy coding!

Author

Mario Guerra
Senior Product Manager

Product Manager for Microsoft’s Azure SDK products. I help customers build great things in the cloud, and I help the Azure team determine what amazing things they should build next.
