Azure Cognitive Services offers a wide range of AI-powered services that can enhance applications and services. In this article and the associated demo project, I’m using Azure SDK client libraries to transcribe and translate audio files. With Azure Cognitive Services, you can easily convert audio files in one language to text in another.
Prerequisites
Before diving into the tutorial, ensure that you have the following prerequisites:
- Python 3.6 or higher installed on your system.
- An Azure account with access to the Speech Service and Translator Service.
- The Azure Cognitive Services Speech library, the Azure Cognitive Services Translator library, and the `python-dotenv` package installed. You can install them using the following command:

```bash
pip install azure-cognitiveservices-speech azure-ai-translation-text python-dotenv
```
- The demo project with the full script. You can clone the repo to your local machine using git:

```bash
git clone https://github.com/mario-guerra/azure-speech-translator.git
```
Set up the environment
To set up your environment, retrieve the keys and endpoints for both the Speech Service and Translator Service from your Azure account by following these steps:
- Sign in to the Azure Portal: Visit the Azure portal and sign in with your Azure account credentials.
- Access the Speech Service:
  - In the left-hand menu, select “All services.”
  - In the search box, type “Speech” and select “Speech” from the results.
  - Choose the Speech Service you have created, or create a new one if you haven’t already.
- Retrieve the Speech Service key and endpoint:
  - Once you have accessed your Speech Service, navigate to “Keys and Endpoint” in the left-hand menu under the “Resource Management” section.
  - Copy the `Key1` or `Key2` value (both are valid) and the “Location/Region” value. These values are used as your `AZURE_SPEECH_KEY` and `AZURE_SERVICE_REGION`, respectively.
- Access the Translator Service:
  - Go back to the “All services” menu and search for “Translator.”
  - Select “Translator” from the results.
  - Choose the Translator Service you have created, or create a new one if you haven’t already.
- Retrieve the Translator Service key and endpoint:
  - Once you have accessed your Translator Service, navigate to “Keys and Endpoint” under the “Resource Management” section in the left-hand menu.
  - Copy the `Key1` or `Key2` value (both are valid) and the “Text Translation” endpoint value under the “Web API” tab. These values are used as your `AZURE_TRANSLATOR_KEY` and `AZURE_TRANSLATOR_ENDPOINT`, respectively.
Once you have your keys, region, and endpoint, create a `.env` file in the same directory as your script and add the following environment variables:

```
AZURE_SPEECH_KEY=<your_speech_service_key>
AZURE_SERVICE_REGION=<your_speech_service_region>
AZURE_TRANSLATOR_KEY=<your_translator_service_key>
AZURE_TRANSLATOR_ENDPOINT=<your_translator_service_endpoint>
```
Replace the placeholder values with the appropriate keys and endpoints from your Azure account.
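With the `.env` file in place, the script can load these values at startup using `python-dotenv`. Here’s a minimal sketch, assuming the variable names match those used in the snippets later in this article:

```python
import os
from dotenv import load_dotenv

# Read the .env file and populate the process environment
load_dotenv()

speech_key = os.getenv("AZURE_SPEECH_KEY")
service_region = os.getenv("AZURE_SERVICE_REGION")
translator_key = os.getenv("AZURE_TRANSLATOR_KEY")
translator_endpoint = os.getenv("AZURE_TRANSLATOR_ENDPOINT")
```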
Audio translation script overview
The demo project features an audio translation script that processes input audio files in WAV format. It transcribes the audio using Azure Speech Service and translates the resulting text into the desired target language with Azure Translator Service.
This powerful tool can be invaluable for various applications, such as language learning, content localization, and accessibility services.
Customize the audio translation script
The provided audio translation script can be customized to suit specific requirements. For example, you can modify the script to:
- Add support for more languages by updating the `language_codes` and `translator_language_codes` dictionaries (see the sketch after this list).
- Adjust the timeout settings for speech recognition by modifying the values of the `SpeechServiceConnection_InitialSilenceTimeoutMs`, `SpeechServiceConnection_EndSilenceTimeoutMs`, and `Speech_SegmentationSilenceTimeoutMs` properties.
- Implement extra error handling and logging to improve the script’s robustness and maintainability.
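The exact dictionary contents live in the demo script, but extending language support might look something like this (the Portuguese entries are hypothetical additions, not part of the demo):

```python
# Friendly name -> Speech Service recognition locale (BCP-47)
language_codes = {
    "english": "en-US",
    "spanish": "es-ES",
    "portuguese": "pt-BR",  # hypothetical new entry
}

# Friendly name -> Translator service language code
translator_language_codes = {
    "english": "en",
    "spanish": "es",
    "portuguese": "pt",  # hypothetical new entry
}
```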
Transcribe with continuous recognition vs. one-shot recognition
In speech recognition, two main approaches can be used: continuous recognition and one-shot recognition. Each has its own use cases and benefits. In my demo audio translation script, I chose continuous recognition for real-time translation, better handling of pauses, and greater flexibility.
Continuous recognition
Continuous recognition is a speech recognition approach that processes audio input in real time and continuously recognizes speech as it is spoken. This method is useful when dealing with long audio files or live audio streams, as it provides real-time feedback and can handle pauses or interruptions in speech.
In continuous recognition, the speech recognizer listens for speech and generates results as it recognizes words and phrases. It can also raise events when it recognizes speech, allowing you to perform actions, such as translating the recognized text, as demonstrated in our script.
Here’s how I set up continuous recognition using the Azure Cognitive Services Speech library in my script:
- Configure the Speech Service by creating a `SpeechConfig` object and setting the speech recognition language and other properties:

```python
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
speech_config.speech_recognition_language = speech_recognition_language
speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_InitialSilenceTimeoutMs, "15000")
speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, "10000")
speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, "5000")
```
- Define event handlers for the `recognized` and `session_stopped` events:

```python
def on_recognized(recognition_args, in_lang, out_lang):
    source_text = recognition_args.result.text
    print(f"Transcribed text: {source_text}")

    # Write the transcribed text to the transcription output file if specified
    if cmd_line_args.transcription:
        with open(cmd_line_args.transcription, 'a', encoding='utf-8') as f:
            f.write(f"{source_text}\n")

    # Translate the transcribed text using the Azure Translator SDK
    try:
        source_language = translator_language_codes[in_lang]
        # Translator service supports translation to multiple languages in one pass,
        # so it expects a list even when translating to only one language.
        target_languages = [translator_language_codes[out_lang]]
        input_text_elements = [InputTextItem(text=source_text)]

        response = text_translator.translate(
            content=input_text_elements,
            to=target_languages,
            from_parameter=source_language,
        )
        translation = response[0] if response else None

        if translation:
            for translated_text in translation.translations:
                print(f"Translated text: {translated_text.text}")
                # Write the translated text to the output file
                with open(cmd_line_args.output_file, 'a', encoding='utf-8') as f:
                    f.write(f"{translated_text.text}\n")
    except HttpResponseError as exception:
        print(f"Error Code: {exception.error.code}")
        print(f"Message: {exception.error.message}")

def on_session_stopped(args):
    print("Continuous recognition session stopped.")
    global session_stopped
    session_stopped = True
```
- Create a `SpeechRecognizer` object using the `SpeechConfig` object and an `AudioConfig` object that specifies the input audio file:

```python
audio_input = speechsdk.audio.AudioConfig(filename=input_audio_file)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
```
- Connect the event handlers to the corresponding events of the `SpeechRecognizer` object and start the continuous recognition process asynchronously:

```python
speech_recognizer.recognized.connect(
    lambda recognition_args: on_recognized(recognition_args, cmd_line_args.in_lang, cmd_line_args.out_lang)
)
speech_recognizer.session_stopped.connect(on_session_stopped)
speech_recognizer.start_continuous_recognition_async().get()
```
- Wait for the `session_stopped` event to be triggered before proceeding to the next audio file or terminating the script:

```python
while not session_stopped:
    time.sleep(0.5)
```
By following these steps, I’ve set up continuous recognition using the Azure Cognitive Services Speech library. This approach allows the script to process audio input in real time, handle pauses or interruptions in speech, and perform actions, such as translation, as soon as speech is recognized.
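For reference, here’s a minimal sketch of a driver loop that ties these steps together across multiple input files. The demo repo’s actual structure may differ; `speech_config`, `on_recognized`, and `on_session_stopped` are assumed to be defined as in the steps above:

```python
import glob
import time
import azure.cognitiveservices.speech as speechsdk

def translate_audio_files(pattern, in_lang, out_lang):
    """Run one continuous-recognition session per matching WAV file (sketch)."""
    global session_stopped
    for input_audio_file in sorted(glob.glob(pattern)):
        print(f"Processing audio file: {input_audio_file}")
        session_stopped = False
        audio_input = speechsdk.audio.AudioConfig(filename=input_audio_file)
        recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
        recognizer.recognized.connect(lambda args: on_recognized(args, in_lang, out_lang))
        recognizer.session_stopped.connect(on_session_stopped)
        recognizer.start_continuous_recognition_async().get()
        # Block until the session_stopped event fires for this file
        while not session_stopped:
            time.sleep(0.5)
        # Stop the session cleanly before moving on to the next file
        recognizer.stop_continuous_recognition_async().get()
```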
One-shot recognition
One-shot recognition, also known as single-utterance recognition, processes an entire audio file or a single utterance and returns the recognition result once the audio input is complete. This approach is suitable for short audio clips or situations where real-time feedback isn’t necessary.
To perform one-shot recognition, you would create a `SpeechRecognizer` object, just like in continuous recognition, and then call the `recognize_once_async()` method:

```python
result = speech_recognizer.recognize_once_async().get()
```

The recognized text can be accessed using the `result.text` property. One-shot recognition is easier to implement, as it requires only a single function call, but it lacks the real-time feedback and flexibility of continuous recognition.
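For completeness, here is a minimal one-shot sketch that also checks the result reason. The reason checks are standard Speech SDK usage rather than part of the demo script, and it reuses the `speech_config` and `audio_input` objects from earlier:

```python
# One-shot recognition sketch; reuses speech_config and audio_input from above
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
result = speech_recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized from the audio.")
elif result.reason == speechsdk.ResultReason.Canceled:
    # Cancellation usually indicates a configuration or network problem
    print(f"Recognition canceled: {result.cancellation_details.reason}")
```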
Translate the transcriptions
Once the audio files are transcribed, the script translates the transcriptions into the desired output language using the Azure Translator library. In the `on_recognized` event handler, the translation process is performed as follows:
- Retrieve the source and target language codes from the `translator_language_codes` dictionary:

```python
source_language = translator_language_codes[in_lang]
target_languages = [translator_language_codes[out_lang]]
```
- Create a list of `InputTextItem` objects containing the transcribed text:

```python
input_text_elements = [InputTextItem(text=source_text)]
```
- Call the `translate` method of the `TextTranslationClient` object, passing the input text elements, target languages, and source language:

```python
response = text_translator.translate(content=input_text_elements, to=target_languages, from_parameter=source_language)
```
- Process the translation response and write the translated text to the output file:

```python
translation = response[0] if response else None

if translation:
    for translated_text in translation.translations:
        print(f"Translated text: {translated_text.text}")
        # Write the translated text to the output file
        with open(cmd_line_args.output_file, 'a', encoding='utf-8') as f:
            f.write(f"{translated_text.text}\n")
```
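One piece the snippets above take for granted is the `text_translator` client itself. Assuming the beta release of `azure-ai-translation-text` that the demo uses (its API matches the `content`/`from_parameter` keyword names seen above), the client can be constructed roughly like this:

```python
from azure.ai.translation.text import TextTranslationClient, TranslatorCredential
from azure.ai.translation.text.models import InputTextItem
from azure.core.exceptions import HttpResponseError

# Build the client from the .env values loaded earlier. TranslatorCredential
# pairs the resource key with a region; here it's assumed to be the same
# region as the Speech resource.
credential = TranslatorCredential(translator_key, service_region)
text_translator = TextTranslationClient(endpoint=translator_endpoint, credential=credential)
```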
Run the audio translation script
To run the script, use the following command:
```bash
python audio_translation.py --in-lang <input_language> --out-lang <output_language> <input_audio_pattern> <output_file> [--transcription <transcription_output_file>]
```
Replace the placeholders with the appropriate values:
- `<input_language>`: The input language (currently supported: `english`, `spanish`, `estonian`, `french`, `italian`, `german`)
- `<output_language>`: The output language (currently supported: `english`, `spanish`, `estonian`, `french`, `italian`, `german`)
- `<input_audio_pattern>`: The path to the input audio files, with wildcard patterns allowed (for example, `./*.wav`)
- `<output_file>`: The path to the output file containing the translations
- `<transcription_output_file>` (optional): The path to the output file containing the transcriptions
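Under the hood, the script parses these arguments into the `cmd_line_args` object used throughout the snippets above. A hypothetical argparse setup matching the usage line might look like this (argument names are assumed from the command syntax, not copied from the demo):

```python
import argparse

# Hypothetical argument parser matching the usage line above
parser = argparse.ArgumentParser(description="Transcribe and translate WAV audio files.")
parser.add_argument("--in-lang", dest="in_lang", required=True, help="Input language, e.g. english")
parser.add_argument("--out-lang", dest="out_lang", required=True, help="Output language, e.g. spanish")
parser.add_argument("input_audio_pattern", help="Input audio file path or wildcard pattern")
parser.add_argument("output_file", help="File to append translations to")
parser.add_argument("--transcription", help="Optional file to append transcriptions to")
cmd_line_args = parser.parse_args()
```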
For example:
```bash
python audio_translation.py --in-lang english --out-lang spanish ./input_audio/*.wav output.txt --transcription transcription.txt
```

This command transcribes and translates all `.wav` files in the `input_audio` directory from English to Spanish. The translations are saved in `output.txt`, and the transcriptions are saved in `transcription.txt`.
Output:
```
python .\azure_translator.py --in-lang spanish --out-lang english '.\Spanish test.wav' .\translation.txt
Processing audio file: .\Spanish test.wav
Transcribed text: Esta es una prueba del sistema de transmisión de emergencia. Solo es una prueba si esto fuera una emergencia real, estaría corriendo para salvar mi vida.
Translated text: This is a test of the emergency transmission system. It's just a test if this was a real emergency, I would be running for my life.
Continuous speech recognition session stopped.
```
Conclusion
By following this tutorial and using the provided audio translation script, you can efficiently transcribe and translate audio files into different languages. This powerful tool, powered by Azure Cognitive Services, opens up numerous possibilities for language learning, content localization, and accessibility services. With the added customization options and a choice between continuous recognition and one-shot recognition, the script becomes an even more versatile solution for your audio translation needs. I encourage you to explore further and experiment with the script to discover its full potential and adapt it to various use cases.
Happy coding!