Speech-to-Text Conversion with Azure

Think Different
7 min read · Dec 9, 2023



Create a system that leverages Azure services for converting spoken language into written text. This project focuses on using Azure Speech Services to perform speech-to-text conversion.

Technologies and Services Used:

- Azure Speech SDK: To interact with Azure Speech Services.

- Azure Speech Services (Speech-to-Text): For converting spoken language into text.

- Azure Storage (Optional): To store the converted text data.


1. Azure Speech Service Setup:

- Create an Azure Speech resource on the Azure Portal.

- Obtain the necessary API keys and endpoint for authentication.

2. Development Environment:

- Use a programming language of your choice (e.g., Python, C#).

- Install the Azure Speech SDK for your chosen language.

3. Integration with Azure Speech Services:

- Use the Azure Speech SDK to connect to Azure Speech Services.

- Implement a method to send audio data for speech-to-text conversion.

4. Speech-to-Text Conversion:

- Develop a script or application that utilizes Azure Speech Services to convert spoken language into text.

- Handle different audio file formats (e.g., WAV, MP3).

5. Text Data Handling:

- Process the converted text data as needed (e.g., store in a database, analyze sentiment, extract key phrases).

6. Optional: Azure Storage Integration:

- Implement Azure Storage to store the converted text data for future reference or analysis.
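Steps 5 and 6 above mention processing the converted text and storing it for later analysis. As a minimal sketch, the recognized text can be packaged into a JSON document before it is written to a database or Blob Storage. The field names below (`id`, `audio_file`, `word_count`, `transcribed_at`) are illustrative choices, not an Azure schema:

```python
import json
import hashlib
from datetime import datetime, timezone

def build_transcript_record(audio_file, text):
    """Package one recognition result as a JSON document ready for storage.

    The field names here are illustrative, not part of any Azure SDK.
    """
    record = {
        # Stable short id derived from the file name and text
        "id": hashlib.sha256(f"{audio_file}:{text}".encode()).hexdigest()[:16],
        "audio_file": audio_file,
        "text": text,
        # Simple whitespace word count; text may be None if nothing was recognized
        "word_count": len(text.split()) if text else 0,
        "transcribed_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)
```

A record in this shape is easy to upload as a blob, insert into a database row, or feed to sentiment or key-phrase analysis downstream.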

Example Code (Python — Using Azure Speech SDK):


from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig, ResultReason, CancellationReason

# Set up Azure Speech Service
speech_key = "your_speech_key"
service_region = "your_service_region"
speech_config = SpeechConfig(subscription=speech_key, region=service_region)
audio_config = AudioConfig(filename="path/to/audio/file.wav")

# Initialize Speech Recognizer
speech_recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Perform Speech-to-Text Conversion
result = speech_recognizer.recognize_once()

# Display the converted text
if result.reason == ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
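Step 4 above mentions handling different audio formats. `AudioConfig` works most reliably with mono 16-bit PCM WAV input; compressed formats such as MP3 need extra configuration or transcoding first. As a minimal, SDK-free sketch, the standard-library `wave` module can inspect a file before it is sent for recognition (the helper names and the accepted sample-rate range are my own illustrative choices):

```python
import wave

def inspect_wav(path):
    """Return basic format information for a WAV file."""
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate": w.getframerate(),
            "sample_width_bytes": w.getsampwidth(),
            "duration_seconds": w.getnframes() / w.getframerate(),
        }

def is_speech_friendly(info):
    # Mono, 16-bit samples, and a common speech sample rate are a safe envelope;
    # files outside this envelope are better transcoded before recognition
    return (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2
        and 8000 <= info["sample_rate"] <= 48000
    )
```

Running `inspect_wav` on each file before calling the recognizer gives an early, local failure instead of an opaque recognition error.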


If you want to process all the audio files in Blob Storage with a serverless Azure Function, you can follow the steps below.

To batch-process audio files with a Python Azure Function (serverless), follow these general steps. This example assumes you have a collection of audio files in a storage container and want to process them with Azure Speech Services.

1. Set Up Your Azure Function:

1. Create a new Azure Function App in the Azure Portal.

2. Add a new HTTP-triggered function using the Python template.

3. Configure the necessary environment variables, such as the connection string to your storage account and the API key for Azure Speech Services.

2. Install Required Packages:

In your Azure Function’s requirements.txt, add the necessary packages (these match the imports in the code below):

azure-functions
azure-cognitiveservices-speech
azure-storage-blob
The example below uses a get_audio_files_from_storage() helper to retrieve the list of audio files. Since the audio files live in Azure Blob Storage, it integrates the Azure Storage SDK for Python (azure-storage-blob) to list and download them.

3. Modify Your Function Code:

Update your function code to handle batch processing. Here’s a simplified example:


import os
import json
import tempfile

import azure.functions as func
from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig, ResultReason
from azure.storage.blob import BlobServiceClient

def process_audio(file_path):
    # Set up Azure Speech Service
    speech_key = os.environ['SpeechKey']
    service_region = os.environ['SpeechRegion']
    speech_config = SpeechConfig(subscription=speech_key, region=service_region)
    audio_config = AudioConfig(filename=file_path)

    # Initialize Speech Recognizer
    speech_recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # Perform Speech-to-Text Conversion
    result = speech_recognizer.recognize_once()

    # Return the converted text
    return result.text if result.reason == ResultReason.RecognizedSpeech else None

def get_audio_files_from_storage(container_name):
    # Set up Azure Storage
    storage_connection_string = os.environ['AzureWebJobsStorage']
    blob_service_client = BlobServiceClient.from_connection_string(storage_connection_string)
    container_client = blob_service_client.get_container_client(container_name)

    # Retrieve list of audio files
    audio_files = [blob.name for blob in container_client.list_blobs()]
    return audio_files

def download_blob_to_temp(container_name, blob_name):
    # The Speech SDK reads from local files, so download each blob before recognition
    storage_connection_string = os.environ['AzureWebJobsStorage']
    blob_service_client = BlobServiceClient.from_connection_string(storage_connection_string)
    blob_client = blob_service_client.get_blob_client(container_name, blob_name)
    fd, local_path = tempfile.mkstemp(suffix=os.path.splitext(blob_name)[1])
    with os.fdopen(fd, 'wb') as f:
        f.write(blob_client.download_blob().readall())
    return local_path

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Retrieve the list of audio files from storage (replace 'your-container-name')
    container_name = 'your-container-name'
    audio_files = get_audio_files_from_storage(container_name)

    # Process each audio file
    results = []
    for blob_name in audio_files:
        local_path = download_blob_to_temp(container_name, blob_name)
        try:
            text_result = process_audio(local_path)
        finally:
            os.remove(local_path)
        results.append({"file": blob_name, "text": text_result})

    # Return the results as JSON
    return func.HttpResponse(json.dumps(results), mimetype="application/json")


4. Batch Processing Logic:

- Use Azure Storage SDK to list and retrieve the audio files from your storage container.

- Iterate over the list of files, calling the `process_audio` function for each file.

- Aggregate the results and return them as JSON.

5. Testing:

Test your Azure Function locally using tools like Azure Functions Core Tools or by deploying it to Azure and triggering it through an HTTP request.

6. Deployment:

Deploy your function to Azure Function App using Azure CLI or Azure DevOps.


Ensure that your storage account, Azure Speech Service, and Azure Function App are properly configured with the necessary keys and connection strings.
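One way to catch misconfiguration early is to validate the settings at startup instead of failing midway through a batch. A small sketch follows; the setting names mirror the os.environ lookups used in the function code above, and validate_settings is an illustrative helper, not part of any Azure SDK:

```python
import os

# Settings the examples above read from the environment
REQUIRED_SETTINGS = ["SpeechKey", "SpeechRegion", "AzureWebJobsStorage"]

def validate_settings(environ=None):
    """Fail fast if any required setting is missing or empty.

    Accepts a mapping for testing; defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_SETTINGS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_SETTINGS}
```

Calling this once at function startup turns a vague mid-run authentication error into an explicit message naming the missing setting.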

Learning Resources:

- Azure Speech SDK Documentation

- Azure Speech-to-Text Documentation

The following batch transcription sample is from the Azure samples GitHub repository:


#!/usr/bin/env python
# coding: utf-8

# Copyright © Microsoft. All rights reserved.
# Licensed under the MIT license. See LICENSE.md file in the project root for full license information.

import logging
import sys
import requests
import time
import swagger_client

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
        format="%(asctime)s %(message)s", datefmt="%m/%d/%Y %I:%M:%S %p %Z")

# Your subscription key and region for the speech service
SUBSCRIPTION_KEY = "YourSubscriptionKey"
SERVICE_REGION = "YourServiceRegion"

NAME = "Simple transcription"
DESCRIPTION = "Simple transcription description"

LOCALE = "en-US"
RECORDINGS_BLOB_URI = "<Your SAS Uri to the recording>"

# Provide the uri of a container with audio files for transcribing all of them
# with a single request. At least 'read' and 'list' (rl) permissions are required.
RECORDINGS_CONTAINER_URI = "<Your SAS Uri to a container of audio files>"

# Set model information when doing transcription with custom models
MODEL_REFERENCE = None  # guid of a custom model

def transcribe_from_single_blob(uri, properties):
    """
    Transcribe a single audio file located at `uri` using the settings specified in `properties`
    using the base model for the specified locale.
    """
    transcription_definition = swagger_client.Transcription(
        display_name=NAME,
        description=DESCRIPTION,
        locale=LOCALE,
        content_urls=[uri],
        properties=properties
    )

    return transcription_definition

def transcribe_with_custom_model(client, uri, properties):
    """
    Transcribe a single audio file located at `uri` using the settings specified in `properties`
    using the custom model identified by MODEL_REFERENCE.
    """
    # Model information (MODEL_REFERENCE) must be set above.
    if MODEL_REFERENCE is None:
        logging.error("Custom model ids must be set when using custom models")
        sys.exit()

    model = {'self': f'{client.configuration.host}/models/{MODEL_REFERENCE}'}

    transcription_definition = swagger_client.Transcription(
        display_name=NAME,
        description=DESCRIPTION,
        locale=LOCALE,
        content_urls=[uri],
        model=model,
        properties=properties
    )

    return transcription_definition

def transcribe_from_container(uri, properties):
    """
    Transcribe all files in the container located at `uri` using the settings specified in `properties`
    using the base model for the specified locale.
    """
    transcription_definition = swagger_client.Transcription(
        display_name=NAME,
        description=DESCRIPTION,
        locale=LOCALE,
        content_container_url=uri,
        properties=properties
    )

    return transcription_definition

def _paginate(api, paginated_object):
    """
    The autogenerated client does not support pagination. This function returns a generator over
    all items of the array that the paginated object `paginated_object` is part of.
    """
    yield from paginated_object.values
    typename = type(paginated_object).__name__
    auth_settings = ["api_key"]
    while paginated_object.next_link:
        link = paginated_object.next_link[len(api.api_client.configuration.host):]
        paginated_object, status, headers = api.api_client.call_api(link, "GET",
            response_type=typename, auth_settings=auth_settings)

        if status == 200:
            yield from paginated_object.values
        else:
            raise Exception(f"could not receive paginated data: status {status}")

def delete_all_transcriptions(api):
    """
    Delete all transcriptions associated with your speech resource.
    """
    logging.info("Deleting all existing completed transcriptions.")

    # get all transcriptions for the subscription
    transcriptions = list(_paginate(api, api.transcriptions_list()))

    # Delete all pre-existing completed transcriptions.
    # If transcriptions are still running or not started, they will not be deleted.
    for transcription in transcriptions:
        transcription_id = transcription._self.split('/')[-1]
        logging.debug(f"Deleting transcription with id {transcription_id}")
        try:
            api.transcriptions_delete(transcription_id)
        except swagger_client.rest.ApiException as exc:
            logging.error(f"Could not delete transcription {transcription_id}: {exc}")

def transcribe():
    logging.info("Starting transcription client...")

    # configure API key authorization: subscription_key
    configuration = swagger_client.Configuration()
    configuration.api_key["Ocp-Apim-Subscription-Key"] = SUBSCRIPTION_KEY
    configuration.host = f"https://{SERVICE_REGION}.api.cognitive.microsoft.com/speechtotext/v3.1"

    # create the client object and authenticate
    client = swagger_client.ApiClient(configuration)

    # create an instance of the transcription api class
    api = swagger_client.CustomSpeechTranscriptionsApi(api_client=client)

    # Specify transcription properties by passing a dict to the properties parameter. See
    # https://learn.microsoft.com/azure/cognitive-services/speech-service/batch-transcription-create?pivots=rest-api#request-configuration-options
    # for supported parameters.
    properties = swagger_client.TranscriptionProperties()
    # properties.word_level_timestamps_enabled = True
    # properties.display_form_word_level_timestamps_enabled = True
    # properties.punctuation_mode = "DictatedAndAutomatic"
    # properties.profanity_filter_mode = "Masked"
    # properties.destination_container_url = "<SAS Uri with at least write (w) permissions for an Azure Storage blob container that results should be written to>"
    # properties.time_to_live = "PT1H"

    # uncomment the following block to enable and configure speaker separation
    # properties.diarization_enabled = True
    # properties.diarization = swagger_client.DiarizationProperties(
    #     swagger_client.DiarizationSpeakersProperties(min_count=1, max_count=5))

    # uncomment the following block to enable and configure language identification prior to transcription
    # properties.language_identification = swagger_client.LanguageIdentificationProperties(["en-US", "ja-JP"])

    # Use base models for transcription. Comment this block if you are using a custom model.
    transcription_definition = transcribe_from_single_blob(RECORDINGS_BLOB_URI, properties)

    # Uncomment this block to use custom models for transcription.
    # transcription_definition = transcribe_with_custom_model(client, RECORDINGS_BLOB_URI, properties)

    # Uncomment this block to transcribe all files from a container.
    # transcription_definition = transcribe_from_container(RECORDINGS_CONTAINER_URI, properties)

    created_transcription, status, headers = api.transcriptions_create_with_http_info(transcription=transcription_definition)

    # get the transcription Id from the location URI
    transcription_id = headers["location"].split("/")[-1]

    # Log information about the created transcription. If you should ask for support, please
    # include this information.
    logging.info(f"Created new transcription with id '{transcription_id}' in region {SERVICE_REGION}")

    logging.info("Checking status.")
    completed = False

    while not completed:
        # wait for 5 seconds before refreshing the transcription status
        time.sleep(5)

        transcription = api.transcriptions_get(transcription_id)
        logging.info(f"Transcriptions status: {transcription.status}")

        if transcription.status in ("Failed", "Succeeded"):
            completed = True

        if transcription.status == "Succeeded":
            pag_files = api.transcriptions_list_files(transcription_id)
            for file_data in _paginate(api, pag_files):
                if file_data.kind != "Transcription":
                    continue

                audiofilename = file_data.name
                results_url = file_data.links.content_url
                results = requests.get(results_url)
                logging.info(f"Results for {audiofilename}:\n{results.content.decode('utf-8')}")
        elif transcription.status == "Failed":
            logging.info(f"Transcription failed: {transcription.properties.error.message}")

if __name__ == "__main__":
    transcribe()


My name is Dhiraj Patra, a software architect for AI, ML, and IoT microservices cloud applications. I love to learn and share. https://dhirajpatra.github.io