Speech-to-Text Conversion with Azure
Objective:
Create a system that leverages Azure services for converting spoken language into written text. This project focuses on using Azure Speech Services to perform speech-to-text conversion.
Technologies and Services Used:
- Azure Speech SDK: To interact with Azure Speech Services.
- Azure Speech Services (Speech-to-Text): For converting spoken language into text.
- Azure Storage (Optional): To store the converted text data.
Steps:
1. Azure Speech Service Setup:
- Create an Azure Speech resource on the Azure Portal.
- Obtain the necessary API keys and endpoint for authentication.
2. Development Environment:
- Use a programming language of your choice (e.g., Python, C#).
- Install the Azure Speech SDK for your chosen language.
3. Integration with Azure Speech Services:
- Use the Azure Speech SDK to connect to Azure Speech Services.
- Implement a method to send audio data for speech-to-text conversion.
4. Speech-to-Text Conversion:
- Develop a script or application that utilizes Azure Speech Services to convert spoken language into text.
- Handle different audio file formats (e.g., WAV, MP3); see the compressed-audio sketch after the example code below.
5. Text Data Handling:
- Process the converted text data as needed (e.g., store in a database, analyze sentiment, extract key phrases).
6. Optional: Azure Storage Integration:
- Implement Azure Storage to store the converted text data for future reference or analysis.
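For the optional storage step (6), here is a minimal sketch of writing a transcript to Blob Storage with the azure-storage-blob package. The environment variable name, the "transcripts" container, and the save_transcript_to_blob helper are illustrative placeholders, not part of the Speech SDK:
```python
import os
from azure.storage.blob import BlobServiceClient

def save_transcript_to_blob(text, blob_name):
    # Placeholder setting names: point these at your own storage account and container
    connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client("transcripts")
    # Upload the transcript as a UTF-8 text blob, overwriting any existing blob of the same name
    container_client.upload_blob(name=blob_name, data=text.encode("utf-8"), overwrite=True)

# Example usage once you have recognized text:
# save_transcript_to_blob("hello world", "example-transcript.txt")
```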
Example Code (Python — Using Azure Speech SDK):
```python
from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig, ResultReason, CancellationReason

# Set up Azure Speech Service
speech_key = "your_speech_key"
service_region = "your_service_region"
speech_config = SpeechConfig(subscription=speech_key, region=service_region)
audio_config = AudioConfig(filename="path/to/audio/file.wav")

# Initialize Speech Recognizer
speech_recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Perform Speech-to-Text Conversion (recognizes a single utterance)
result = speech_recognizer.recognize_once()

# Display the converted text
if result.reason == ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
```
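The example above reads a WAV file. For step 4's note about other formats such as MP3, the Speech SDK can accept compressed audio through a push stream. The sketch below illustrates that approach; it assumes GStreamer is installed (which the SDK requires for compressed formats), and the helper name and parameters are placeholders rather than a drop-in replacement for the code above:
```python
import azure.cognitiveservices.speech as speechsdk

def recognize_compressed_file(mp3_path, speech_key, service_region):
    # Describe the compressed container format (MP3) for the input stream
    stream_format = speechsdk.audio.AudioStreamFormat(
        compressed_stream_format=speechsdk.AudioStreamContainerFormat.MP3)
    push_stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)

    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # Feed the raw MP3 bytes into the push stream, then close it to signal end of audio
    with open(mp3_path, "rb") as f:
        push_stream.write(f.read())
    push_stream.close()

    # recognize_once() returns only the first recognized utterance
    result = recognizer.recognize_once()
    return result.text if result.reason == speechsdk.ResultReason.RecognizedSpeech else None
```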
If you want to process all of the audio files in Blob Storage with a serverless Azure Function, you can follow the steps below. This example assumes you have a collection of audio files in a storage container and want to transcribe them with Azure Speech Services.
1. Set Up Your Azure Function:
1. Create a new Azure Function App in the Azure Portal.
2. Add a new HTTP-triggered function using the Python template.
3. Configure the necessary environment variables, such as the connection string to your storage account and the API key for Azure Speech Services.
2. Install Required Packages:
In your Azure Function’s requirements.txt, add the necessary packages:
```
azure-functions
azure-cognitiveservices-speech
azure-storage-blob
```
The example in the next step uses a helper function get_audio_files_from_storage() to list the audio files. Since the audio files live in Azure Blob Storage, the code integrates the Azure Storage SDK for Python (azure-storage-blob).
3. Modify Your Function Code:
Update your function code to handle batch processing: list the blobs, download each one to a temporary file, transcribe it, and collect the results. Here's a simplified example:
```python
import os
import json
import tempfile

import azure.functions as func
from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig, ResultReason
from azure.storage.blob import BlobServiceClient


def process_audio(file_path):
    # Set up Azure Speech Service
    speech_key = os.environ["SpeechKey"]
    service_region = os.environ["SpeechRegion"]
    speech_config = SpeechConfig(subscription=speech_key, region=service_region)
    audio_config = AudioConfig(filename=file_path)
    # Initialize Speech Recognizer
    speech_recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    # Perform Speech-to-Text Conversion
    result = speech_recognizer.recognize_once()
    # Return the converted text
    return result.text if result.reason == ResultReason.RecognizedSpeech else None


def get_audio_files_from_storage(container_name):
    # Set up Azure Storage and list the blobs in the container
    storage_connection_string = os.environ["AzureWebJobsStorage"]
    blob_service_client = BlobServiceClient.from_connection_string(storage_connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    # Retrieve the list of audio blob names
    return [blob.name for blob in container_client.list_blobs()]


def download_blob_to_tempfile(container_name, blob_name):
    # Download a blob to a temporary local file and return its path,
    # since the Speech SDK reads audio from the local file system
    storage_connection_string = os.environ["AzureWebJobsStorage"]
    blob_service_client = BlobServiceClient.from_connection_string(storage_connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(blob_client.download_blob().readall())
        return tmp.name


def main(req: func.HttpRequest) -> func.HttpResponse:
    container_name = "your-container-name"  # replace with your container name
    # Retrieve the list of audio files from storage
    audio_files = get_audio_files_from_storage(container_name)
    # Process each audio file: download the blob locally, then transcribe it
    results = []
    for blob_name in audio_files:
        local_path = download_blob_to_tempfile(container_name, blob_name)
        results.append({"file": blob_name, "text": process_audio(local_path)})
        os.remove(local_path)
    # Return the results as JSON
    return func.HttpResponse(json.dumps(results), mimetype="application/json")
```
4. Batch Processing Logic:
- Use the Azure Storage SDK to list and download the audio files from your storage container.
- Iterate over the list of files, calling the `process_audio` function for each file.
- Aggregate the results and return them as JSON.
5. Testing:
Test your Azure Function locally using tools like Azure Functions Core Tools or by deploying it to Azure and triggering it through an HTTP request.
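As a quick smoke test, you can call the function's HTTP endpoint from Python. The URL below is a placeholder (locally, Azure Functions Core Tools listens on http://localhost:7071 by default; a deployed function usually also needs a ?code=<function-key> query parameter):
```python
import requests

# Placeholder endpoint: replace "batch_transcribe" with your function's route
url = "http://localhost:7071/api/batch_transcribe"

response = requests.get(url)
response.raise_for_status()

# The function returns a JSON list of {"file": ..., "text": ...} entries
for item in response.json():
    print(item["file"], "->", item["text"])
```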
6. Deployment:
Deploy your function to Azure Function App using Azure CLI or Azure DevOps.
Note:
Ensure that your storage account, Azure Speech Service, and Azure Function App are properly configured with the necessary keys and connection strings.
Learning Resources:
- Azure Speech SDK Documentation
- Azure Speech-to-Text Documentation
The following sample code comes from the Azure Speech SDK samples GitHub repository (batch transcription example). Note that `swagger_client` is the Python client generated from the Speech-to-Text REST API (v3.1) Swagger definition, as described in that sample's README; it is not a package installed from PyPI.
```python
#!/usr/bin/env python
# coding: utf-8
# Copyright © Microsoft. All rights reserved.
# Licensed under the MIT license. See LICENSE.md file in the project root for full license information.
import logging
import sys
import requests
import time
import swagger_client
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
                    format="%(asctime)s %(message)s", datefmt="%m/%d/%Y %I:%M:%S %p %Z")

# Your subscription key and region for the speech service
SUBSCRIPTION_KEY = "YourSubscriptionKey"
SERVICE_REGION = "YourServiceRegion"

NAME = "Simple transcription"
DESCRIPTION = "Simple transcription description"

LOCALE = "en-US"
RECORDINGS_BLOB_URI = "<Your SAS Uri to the recording>"

# Provide the uri of a container with audio files for transcribing all of them
# with a single request. At least 'read' and 'list' (rl) permissions are required.
RECORDINGS_CONTAINER_URI = "<Your SAS Uri to a container of audio files>"

# Set model information when doing transcription with custom models
MODEL_REFERENCE = None  # guid of a custom model

def transcribe_from_single_blob(uri, properties):
    """
    Transcribe a single audio file located at `uri` using the settings specified in `properties`
    using the base model for the specified locale.
    """
    transcription_definition = swagger_client.Transcription(
        display_name=NAME,
        description=DESCRIPTION,
        locale=LOCALE,
        content_urls=[uri],
        properties=properties
    )
    return transcription_definition

def transcribe_with_custom_model(client, uri, properties):
    """
    Transcribe a single audio file located at `uri` using the settings specified in `properties`
    using the custom model referenced by `MODEL_REFERENCE`.
    """
    # Model information (MODEL_REFERENCE) must be set above.
    if MODEL_REFERENCE is None:
        logging.error("Custom model ids must be set when using custom models")
        sys.exit()

    model = {'self': f'{client.configuration.host}/models/{MODEL_REFERENCE}'}

    transcription_definition = swagger_client.Transcription(
        display_name=NAME,
        description=DESCRIPTION,
        locale=LOCALE,
        content_urls=[uri],
        model=model,
        properties=properties
    )
    return transcription_definition

def transcribe_from_container(uri, properties):
    """
    Transcribe all files in the container located at `uri` using the settings specified in `properties`
    using the base model for the specified locale.
    """
    transcription_definition = swagger_client.Transcription(
        display_name=NAME,
        description=DESCRIPTION,
        locale=LOCALE,
        content_container_url=uri,
        properties=properties
    )
    return transcription_definition

def _paginate(api, paginated_object):
    """
    The autogenerated client does not support pagination. This function returns a generator over
    all items of the array that the paginated object `paginated_object` is part of.
    """
    yield from paginated_object.values
    typename = type(paginated_object).__name__
    auth_settings = ["api_key"]
    while paginated_object.next_link:
        link = paginated_object.next_link[len(api.api_client.configuration.host):]
        paginated_object, status, headers = api.api_client.call_api(link, "GET",
                                                                    response_type=typename, auth_settings=auth_settings)
        if status == 200:
            yield from paginated_object.values
        else:
            raise Exception(f"could not receive paginated data: status {status}")

def delete_all_transcriptions(api):
    """
    Delete all transcriptions associated with your speech resource.
    """
    logging.info("Deleting all existing completed transcriptions.")

    # get all transcriptions for the subscription
    transcriptions = list(_paginate(api, api.get_transcriptions()))

    # Delete all pre-existing completed transcriptions.
    # If transcriptions are still running or not started, they will not be deleted.
    for transcription in transcriptions:
        transcription_id = transcription._self.split('/')[-1]
        logging.debug(f"Deleting transcription with id {transcription_id}")
        try:
            api.delete_transcription(transcription_id)
        except swagger_client.rest.ApiException as exc:
            logging.error(f"Could not delete transcription {transcription_id}: {exc}")

def transcribe():
    logging.info("Starting transcription client...")

    # configure API key authorization: subscription_key
    configuration = swagger_client.Configuration()
    configuration.api_key["Ocp-Apim-Subscription-Key"] = SUBSCRIPTION_KEY
    configuration.host = f"https://{SERVICE_REGION}.api.cognitive.microsoft.com/speechtotext/v3.1"

    # create the client object and authenticate
    client = swagger_client.ApiClient(configuration)

    # create an instance of the transcription api class
    api = swagger_client.CustomSpeechTranscriptionsApi(api_client=client)

    # Specify transcription properties by passing a dict to the properties parameter. See
    # the Speech-to-Text REST API documentation for supported parameters.
    properties = swagger_client.TranscriptionProperties()
    # properties.word_level_timestamps_enabled = True
    # properties.display_form_word_level_timestamps_enabled = True
    # properties.punctuation_mode = "DictatedAndAutomatic"
    # properties.profanity_filter_mode = "Masked"
    # properties.destination_container_url = "<SAS Uri with at least write (w) permissions for an Azure Storage blob container that results should be written to>"
    # properties.time_to_live = "PT1H"

    # uncomment the following block to enable and configure speaker separation
    # properties.diarization_enabled = True
    # properties.diarization = swagger_client.DiarizationProperties(
    #     swagger_client.DiarizationSpeakersProperties(min_count=1, max_count=5))

    # uncomment the following block to enable and configure language identification prior to transcription
    # properties.language_identification = swagger_client.LanguageIdentificationProperties(["en-US", "ja-JP"])

    # Use base models for transcription. Comment this block if you are using a custom model.
    transcription_definition = transcribe_from_single_blob(RECORDINGS_BLOB_URI, properties)

    # Uncomment this block to use custom models for transcription.
    # transcription_definition = transcribe_with_custom_model(client, RECORDINGS_BLOB_URI, properties)

    # Uncomment this block to transcribe all files from a container.
    # transcription_definition = transcribe_from_container(RECORDINGS_CONTAINER_URI, properties)

    created_transcription, status, headers = api.transcriptions_create_with_http_info(transcription=transcription_definition)

    # get the transcription Id from the location URI
    transcription_id = headers["location"].split("/")[-1]

    # Log information about the created transcription. If you should ask for support, please
    # include this information.
    logging.info(f"Created new transcription with id '{transcription_id}' in region {SERVICE_REGION}")

    logging.info("Checking status.")

    completed = False

    while not completed:
        # wait for 5 seconds before refreshing the transcription status
        time.sleep(5)

        transcription = api.transcriptions_get(transcription_id)
        logging.info(f"Transcriptions status: {transcription.status}")
        if transcription.status in ("Failed", "Succeeded"):
            completed = True

        if transcription.status == "Succeeded":
            pag_files = api.transcriptions_list_files(transcription_id)
            for file_data in _paginate(api, pag_files):
                if file_data.kind != "Transcription":
                    continue

                audiofilename = file_data.name
                results_url = file_data.links.content_url
                results = requests.get(results_url)
                logging.info(f"Results for {audiofilename}:\n{results.content.decode('utf-8')}")
        elif transcription.status == "Failed":
            logging.info(f"Transcription failed: {transcription.properties.error.message}")


if __name__ == "__main__":
    transcribe()
```
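Each results file returned by the batch transcription API is itself JSON. If you only want the final display text rather than the raw payload logged at the end of the sample, a small post-processing sketch like the following could help; the field names ("combinedRecognizedPhrases", "display") reflect the v3.1 result schema, so verify them against the documentation for your API version:
```python
import requests

def extract_display_text(results_url):
    # Download one transcription results file and join the combined display text.
    # Field names follow the v3.1 batch transcription result schema; adjust if
    # your API version differs.
    data = requests.get(results_url).json()
    phrases = data.get("combinedRecognizedPhrases", [])
    return " ".join(p.get("display", "") for p in phrases)
```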