Fine-Tuning a Small Language Model, Explained
Here's a concise Azure-based architecture for fine-tuning a small language model using Hugging Face datasets:
🔷 Azure Architecture for Fine-Tuning a Small LLM with Hugging Face Data
1. Data Preparation Layer
- Azure Blob Storage
  - Store raw Hugging Face datasets (`.json`, `.csv`, etc.).
  - Can integrate directly with Hugging Face `datasets.load_dataset()` by downloading locally and uploading to Blob Storage.
2. Compute Layer
- Azure Machine Learning (AzureML) Workspace
  - Manage training jobs, compute targets, and experiment tracking.
- AzureML Compute Cluster (GPU)
  - Use NC-series (e.g., NCasT4_v3) or ND-series VMs for training (cost-effective for small models).
  - Supports distributed training with Hugging Face + DeepSpeed if needed.
3. Training Environment
- Custom Docker Image (Optional)
  - Based on PyTorch/TensorFlow plus the Hugging Face `transformers` and `datasets` libraries.
  - Built as an AzureML Environment or stored in Azure Container Registry (ACR).
4. Model Fine-Tuning
- Training script runs via AzureML using the Hugging Face `Trainer` API.
- Dataset is loaded from a local copy of the Hugging Face dataset.
- Uses the `transformers` and `datasets` libraries.
5. Model Storage
- AzureML Model Registry
  - Register trained models for versioning and inference.
  - Optionally push to Azure Blob Storage or the Hugging Face Hub.
6. Inference or Deployment (Optional)
- Azure Kubernetes Service (AKS) or Azure App Service
  - Serve the model as a REST endpoint using FastAPI/Flask (a minimal sketch follows this list).
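For that optional serving path, here is a minimal FastAPI sketch. It assumes the fine-tuned model and tokenizer were exported with `save_pretrained()` to a local `./model` folder; that path, the port, and the route names are placeholders, not values taken from this walkthrough.

```python
# Minimal sketch: serve the fine-tuned model as a REST endpoint with FastAPI.
# "./model" is a placeholder for wherever save_pretrained() wrote the model/tokenizer.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

app = FastAPI()

tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained("./model")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 128

@app.post("/score")
def score(prompt: Prompt):
    # Generate a completion for the incoming prompt and return it as JSON
    output = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"completion": output[0]["generated_text"]}
```

Run it with `uvicorn app:app --port 5001` inside a container image deployed to AKS or App Service.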
🧩 Optional Enhancements
- Azure DevOps / GitHub Actions: CI/CD pipeline for training
- Azure Key Vault: Secure API keys such as Hugging Face tokens (see the sketch after this list)
- Azure Monitor: Logs, metrics, GPU utilization
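For the Key Vault enhancement, a small sketch of reading a Hugging Face token at runtime instead of hard-coding it. The vault URL and the secret name `hf-token` are placeholders, not values from this walkthrough.

```python
# Sketch: fetch a Hugging Face token from Azure Key Vault at runtime.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()
secret_client = SecretClient(
    vault_url="https://<your-vault-name>.vault.azure.net",  # placeholder vault URL
    credential=credential,
)
hf_token = secret_client.get_secret("hf-token").value  # "hf-token" is a placeholder secret name
# e.g., pass hf_token to huggingface_hub.login(token=hf_token) before calling load_dataset / from_pretrained
```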
🔹 Context:
This Python script loads a small portion of the UltraChat 200k dataset from Hugging Face for fine-tuning or prototyping purposes. It prints the dataset size and a random sample from it.
from datasets import load_dataset
📌 Imports the `load_dataset` function from the Hugging Face `datasets` library to fetch datasets from the Hub.
from random import randrange
📌 Imports `randrange` to randomly pick an index from the dataset.
# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split='train_sft[:2%]')
📌 Loads the first 2% of the `train_sft` split of the `ultrachat_200k` dataset, a small slice for quick testing/training.
print(f"dataset size: {len(dataset)}")
📌 Prints the total number of records in the sampled dataset.
print(dataset[randrange(len(dataset))])
📌 Prints a random sample from the dataset to inspect its structure/content.
🔹 Context:
This code splits the previously loaded dataset into train/test sets and saves them as `.jsonl` files for downstream tasks like model training and evaluation.
dataset = dataset.train_test_split(test_size=0.2)
📌 Splits the dataset into 80% training and 20% test sets using Hugging Face's built-in method.
train_dataset = dataset['train']
📌 Extracts the training portion from the split dictionary.
train_dataset.to_json(f"data/train.jsonl")
📌 Saves the training set in `.jsonl` (JSON Lines) format inside the `data/` directory.
test_dataset = dataset['test']
📌 Extracts the test portion from the split dictionary.
test_dataset.to_json(f"data/eval.jsonl")
📌 Saves the test set to `eval.jsonl` for evaluation purposes.
🔹 Context:
This code imports essential Azure ML SDK modules to build, run, and manage machine learning workflows using Azure Machine Learning services (v2 SDK).
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
📌 Imports two authentication methods:
- `DefaultAzureCredential`: used in automated environments (e.g., CLI login, managed identity)
- `InteractiveBrowserCredential`: used for local interactive browser login
from azure.ai.ml import MLClient, Input
📌 Imports:
- `MLClient`: the main interface for interacting with the AzureML workspace (submit jobs, register models, etc.)
- `Input`: defines inputs for components and pipelines
from azure.ai.ml.dsl import pipeline
📌 Imports the `@pipeline` decorator used to define ML pipelines (multi-step workflows)
from azure.ai.ml import load_component
📌 Imports the function to load pre-defined or custom ML components from YAML or a registry
from azure.ai.ml import command
📌 Imports the `command` function to define a custom component using a command-line script
from azure.ai.ml.entities import Data
📌 Imports the `Data` entity to register and manage datasets in the AzureML workspace
from azure.ai.ml import Input
📌 (Redundant) Already imported earlier. Used to specify component/pipeline inputs.
from azure.ai.ml import Output
📌 Imports the `Output` class to define outputs for ML components
from azure.ai.ml.constants import AssetTypes
📌 Imports constants like `AssetTypes.URI_FILE`, `AssetTypes.MLTABLE`, etc., to describe input/output data types
🔹 Context:
This code initializes an Azure ML client (`MLClient`) using credentials. It first tries to load the config from a local file (e.g., `config.json` or `.azureml/config`). If that fails, it manually creates the client with explicit workspace details.
credential = DefaultAzureCredential()
📌 Authenticates using the default method (CLI, environment, managed identity, etc.)
workspace_ml_client = None
📌 Initializes the ML client variable.
try:
    workspace_ml_client = MLClient.from_config(credential)
📌 Tries to load the Azure ML workspace from a local `config.json` file using the provided credential.
except Exception as ex:
    print(ex)
📌 If loading fails, it prints the exception for debugging.
    subscription_id = "Enter your subscription_id"
    resource_group = "Enter your resource_group"
    workspace = "Enter your workspace name"
📌 Defines the workspace manually by specifying your Azure subscription ID, resource group, and workspace name.
    workspace_ml_client = MLClient(credential, subscription_id, resource_group, workspace)
📌 Creates the `MLClient` manually using the credential and workspace details.
🔹 Context:
This code creates and registers an Azure ML Environment using a prebuilt Docker image and a custom Conda environment. It is used to run training or pipeline jobs with specific dependencies.
from azure.ai.ml.entities import Environment, BuildContext
📌 Imports classes to define a custom ML environment, optionally including a Docker build context (not used directly here).
env_docker_image = Environment(
    image="mcr.microsoft.com/azureml/curated/acft-hf-nlp-gpu:latest",
    conda_file="environment/conda.yml",
    name="llm-training",
    description="Environment created for llm training.",
)
📌 Creates a new AzureML environment:
- `image`: uses a curated GPU-compatible image preconfigured for Hugging Face NLP
- `conda_file`: adds additional dependencies from `conda.yml`
- `name` & `description`: metadata for easy identification
workspace_ml_client.environments.create_or_update(env_docker_image)
📌 Registers (or updates) the environment in the Azure ML workspace so it can be reused in jobs or pipelines.
🔹 Context:
This is a Conda environment definition (`conda.yml`) to be used in Azure ML for fine-tuning language models with Hugging Face, logging via MLflow/W&B, and tracking via AzureML.
📌 Explained `conda.yml`:
name: model-env
📌 Name of the environment.
channels:
  - conda-forge
📌 Uses `conda-forge` as the source for non-pip dependencies.
dependencies:
  - python=3.8
  - pip=24.0
📌 Installs Python 3.8 and pip version 24.0 via conda.
  - pip:
      - bitsandbytes==0.43.1   # For 8-bit/4-bit quantization
      - transformers~=4.41     # Hugging Face Transformers for model APIs
      - peft~=0.11             # Parameter-Efficient Fine-Tuning (e.g., LoRA)
      - accelerate~=0.30       # Hugging Face performance optimizer
      - trl==0.8.6             # Transformer Reinforcement Learning (e.g., SFT, PPO)
      - einops==0.8.0          # Tensor operations (used in many transformer libs)
      - datasets==2.19.1       # Hugging Face Datasets
      - wandb==0.17.0          # Weights & Biases for experiment tracking
      - mlflow==2.13.0         # MLflow logging & tracking
      - azureml-mlflow==1.56.0 # AzureML integration with MLflow
      - torchvision==0.18.0    # Vision utilities (needed if vision-LM or multi-modal models are used)
✅ Ready to be used in AzureML pipelines or `Environment` creation.
This is "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a technique for reducing the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works:
- Quantize the pretrained model to 4 bits and freeze it.
- Attach small, trainable adapter layers (LoRA).
- Fine-tune only the adapter layers, while using the frozen quantized model for context.
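As a rough sketch of those three steps with `transformers` and `peft` (the base checkpoint and LoRA hyper-parameters below are illustrative; the training script later in this post wires the LoRA config through `SFTTrainer` rather than `get_peft_model`):

```python
# Illustrative QLoRA setup: 4-bit quantized, frozen base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize the frozen base model to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",    # example checkpoint
    quantization_config=bnb_config,
)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)  # only the adapter weights remain trainable
model.print_trainable_parameters()
```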
%%writefile src/train.py
import os
#import mlflow
import argparse
import sys
import logging

import datasets
from datasets import load_dataset
from peft import LoraConfig
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig

logger = logging.getLogger(__name__)

###################
# Hyper-parameters
###################
training_config = {
    "bf16": True,
    "do_eval": False,
    "learning_rate": 5.0e-06,
    "log_level": "info",
    "logging_steps": 20,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 4,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.2,
}

peft_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}

train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)

###############
# Setup logging
###############
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = train_conf.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

# Log a small summary on each process
logger.warning(
    f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
    + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
)
logger.info(f"Training/evaluation parameters {train_conf}")
logger.info(f"PEFT parameters {peft_conf}")

################
# Model Loading
################
checkpoint_path = "microsoft/Phi-3-mini-4k-instruct"
# checkpoint_path = "microsoft/Phi-3-mini-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # load the model with flash-attention support
    torch_dtype=torch.bfloat16,
    device_map=None,
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 2048
tokenizer.pad_token = tokenizer.unk_token  # use unk rather than eos token to prevent endless generation
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'

##################
# Data Processing
##################
def apply_chat_template(
    example,
    tokenizer,
):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example


def main(args):
    train_dataset = load_dataset('json', data_files=args.train_file, split='train')
    test_dataset = load_dataset('json', data_files=args.eval_file, split='train')
    column_names = list(train_dataset.features)

    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )

    processed_test_dataset = test_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )

    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        peft_config=peft_conf,
        train_dataset=processed_train_dataset,
        eval_dataset=processed_test_dataset,
        max_seq_length=2048,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True,
    )
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()

    #############
    # Evaluation
    #############
    tokenizer.padding_side = 'left'
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(processed_test_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

    #############
    # Save model
    #############
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, "model.pt"))


def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--train-file", type=str, help="Input data for training")
    parser.add_argument("--eval-file", type=str, help="Input data for eval")
    parser.add_argument("--model-dir", type=str, default="./", help="output directory for model")
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    parser.add_argument(
        "--batch-size",
        default=16,
        type=int,
        help="mini batch size for each gpu/process",
    )
    parser.add_argument("--learning-rate", default=0.001, type=float, help="learning rate")
    parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
    parser.add_argument(
        "--print-freq",
        default=200,
        type=int,
        help="frequency of printing training statistics",
    )

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()
    # call main function
    main(args)
Your training script is mostly well-structured for QLoRA-based fine-tuning using Phi-3. Here are some quick improvements and confirmations:
✅ Key Confirmations
- QLoRA logic is correct: frozen 4-bit model + LoRA adapters trained via `peft`.
- `trust_remote_code=True` is needed for Phi-3 ✅
- `flash_attention_2` enabled ✅
- `padding_side = 'right'` for training and `'left'` for eval ✅
- `SFTTrainer` from `trl` is used correctly ✅
⚠️ Minor Suggestions
- `BitsAndBytesConfig` is missing
  Add this to apply actual 4-bit quantization:
  quantization_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_use_double_quant=True,
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      checkpoint_path,
      quantization_config=quantization_config,
      **model_kwargs,
  )
- TrainingArguments Overridden?
  The `--learning-rate` and `--batch-size` args exist but are unused in `TrainingArguments`. If you want to override dynamically:
  training_config.update({
      "learning_rate": args.learning_rate,
      "per_device_train_batch_size": args.batch_size,
  })
  train_conf = TrainingArguments(**training_config)
- Model Save Bug
  You saved the model using:
  torch.save(model, os.path.join(args.model_dir, "model.pt"))
  Use the Hugging Face-style save instead:
  trainer.model.save_pretrained(args.model_dir)
  tokenizer.save_pretrained(args.model_dir)
Here is the clarified version of your code with detailed comments explaining each part:
from azure.ai.ml.entities import AmlCompute

# Define the compute size to be used. You can change this value if needed.
# By default, this example uses "Standard_NC24ads_A100_v4", which corresponds to 1 x A100 (80GB).
compute_cluster_size = "Standard_NC24ads_A100_v4"  # A specific GPU machine type with 80GB of memory

# Specify the name of the compute cluster. If you already have a GPU cluster, provide its name here.
# If it doesn't exist, a new cluster with this name will be created.
compute_cluster = "gpu-a100"

try:
    # Try to retrieve the existing compute cluster by its name.
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    # If the compute cluster does not exist (an exception is raised), create a new one.
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        # Attempt to create a new compute cluster with the provided settings.
        # This uses the AmlCompute class to define the properties of the new cluster.
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,        # The name for the compute cluster.
            size=compute_cluster_size,   # The size (type) of the compute cluster (e.g., GPU-based machine).
            tier="Dedicated",            # The tier for the compute ("Dedicated" or "LowPriority").
            max_instances=1,             # Limit to 1 instance. Increase this for multi-node training.
        )
        # Begin the creation of the compute cluster and wait for it to complete.
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        # If an error occurs while creating the compute cluster, print an error message.
        print(f"Error creating compute cluster: {e}")
Key Concepts:
- `AmlCompute`: An Azure Machine Learning (AML) compute resource; it represents a virtual machine or cluster used to run ML workloads.
- `compute_cluster_size`: Specifies the hardware configuration of the compute resource, particularly the GPU type and memory. Here, `"Standard_NC24ads_A100_v4"` indicates a high-performance GPU (A100 with 80GB memory).
- `workspace_ml_client.compute.get()`: Checks whether a compute cluster with the specified name (`compute_cluster`) already exists.
- `workspace_ml_client.compute.begin_create_or_update()`: Starts the creation of a new compute cluster if one doesn't already exist.
Error Handling:
- If the compute cluster already exists, it is reused.
- If it does not exist, the code attempts to create a new compute cluster using the specified GPU configuration.
Here is the clarified version of your code with detailed comments explaining each part:
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

# Define the job that will be run on the Azure ML compute resource
job = command(
    # Define the inputs required for the job.
    inputs=dict(
        # Training data provided as a URI pointing to a file in a remote storage location.
        train_file=Input(
            type="uri_file",          # The input is a file URI.
            path="data/train.jsonl",  # Path to the training data file.
        ),
        # Evaluation data similarly provided as a URI.
        eval_file=Input(
            type="uri_file",          # The input is a file URI pointing to the evaluation data file.
            path="data/eval.jsonl",   # Path to the evaluation data file.
        ),
        # Additional hyperparameters for training.
        epoch=1,             # Number of training epochs.
        batchsize=64,        # Batch size used during training.
        lr=0.01,             # Learning rate for optimization.
        momentum=0.9,        # Momentum for optimization.
        prtfreq=200,         # Frequency of printing status during training.
        output="./outputs",  # Path where the model's output (e.g., trained weights) will be saved.
    ),
    # Path to the local code directory that contains the training script (e.g., `train.py`).
    code="./src",
    # Specify the compute target where the job will run.
    compute="gpu-a100",  # Use the 'gpu-a100' cluster (GPU-based compute).
    # The command to run on the compute instance. It launches the training script with the specified inputs.
    command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    # Specify the environment that contains the dependencies for the job (e.g., specific library versions).
    environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/52",
    # Distribution settings for the job, indicating that this is a PyTorch-based job.
    distribution={
        "type": "PyTorch",                # The job uses PyTorch.
        "process_count_per_instance": 1,  # Run one process per compute instance.
    },
)

# Submit the job to the Azure ML workspace and return the job object.
returned_job = workspace_ml_client.jobs.create_or_update(job)

# Stream the output of the job to the console to monitor the training process.
workspace_ml_client.jobs.stream(returned_job.name)
Key Concepts:
- `command`: Defines a command job that runs a script with specified inputs and compute configuration.
- `Input`: Represents an input that the command job will consume; this can be a file or a parameter passed to the script.
- `command` (argument): The exact command to run within the Azure environment (using `accelerate` to launch the `train.py` script).
- `compute`: The compute cluster (`gpu-a100`) that will be used to run the training job.
- `environment`: The environment in Azure ML containing the dependencies needed to run the job.
- `distribution`: How the job is distributed across compute resources, in this case a single PyTorch process per instance.
Workflow:
- Inputs: Define the files (training and evaluation data) and hyperparameters (epochs, batch size, etc.).
- Code: Point to the local directory where the training script (`train.py`) resides.
- Compute: Use a GPU-based compute cluster (`gpu-a100`).
- Command: Run the training script with the provided arguments.
- Environment: Use a pre-defined Azure ML environment with the necessary libraries and dependencies.
- Distribution: Specify that the job uses a single process per instance (for distributed training, this can be adjusted).
- Job Submission: Submit the job to the Azure ML workspace, and stream the job logs to monitor the training process.
Here is the updated version of your code with a detailed comment explaining how to check whether the `trained_model` output is available:
# Get the job name of the returned job
job_name = returned_job.name
# Print the outputs of the job to check if the trained model is available
print("Pipeline job outputs: ", workspace_ml_client.jobs.get(job_name).outputs)
Key Concepts:
- `returned_job.name`: Retrieves the job name from the job object returned after submission.
- `workspace_ml_client.jobs.get(job_name)`: Fetches the job details using the job name.
- `.outputs`: Contains the outputs of the job, including model artifacts, logs, and other generated files. Printing this lets you check whether the `trained_model` output is present.
What this does:
- This code checks the outputs of the pipeline job after it has completed to verify whether the `trained_model` output is available.
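If the output shows up, one way to pull the artifacts locally is `MLClient.jobs.download`. This is a sketch only: the output name `trained_model` and the local download path are assumptions, so adjust them to whatever the printed outputs actually list.

```python
# Sketch: download a named job output to a local folder.
workspace_ml_client.jobs.download(
    name=job_name,
    download_path="./job_artifacts",  # local folder to write into (placeholder)
    output_name="trained_model",      # assumed output name; check the printed outputs first
)
```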
Here is the updated version of your code with detailed comments explaining each part:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
# Define the model to be registered in the Azure ML workspace
run_model = Model(
# Define the path to the model output from the job. This points to the location where the model is saved.
# It uses the job's output path to locate the model artifact.
path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/outputs/mlflow_model_folder",
# Provide a name for the model that will be registered in the workspace.
name="phi-3-finetuned",
# Provide a description of the model.
description="Model created from run.",
# Specify the type of model asset. In this case, it is an MLflow model.
type=AssetTypes.MLFLOW_MODEL,
)

# Create or update the model in the Azure ML workspace.
# This will register the model if it doesn't already exist or update it if it does.
model = workspace_ml_client.models.create_or_update(run_model)
Key Concepts:
- `Model`: Represents a model artifact in Azure Machine Learning, which can be a trained model or any machine learning asset.
- `path`: The location of the model in the Azure ML job's output. In this case, it points to the model's saved location in the job output artifacts.
- `name`: The name you assign to the model in Azure ML to identify it.
- `description`: A brief description of the model for better context.
- `AssetTypes.MLFLOW_MODEL`: Indicates that the model is an MLflow model, a popular format for managing machine learning models.
- `create_or_update`: Registers a new model or updates an existing one if a model already exists under the same name.
What this does:
- This code registers the model output from the Azure ML job as an asset in the Azure ML workspace, allowing you to manage, version, and deploy the model.
Here's the clarified version of your code with detailed comments:
from azure.ai.ml.entities import (
ManagedOnlineEndpoint,
IdentityConfiguration,
ManagedIdentityConfiguration,
)
# Check if the endpoint already exists in the workspace
try:
    # Attempt to retrieve the online endpoint by its name
    endpoint = workspace_ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except Exception:
    # If the endpoint doesn't exist, create a new online endpoint.
    # Define the online endpoint configuration.
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,  # Name of the endpoint to create
        description=f"Test endpoint for {model.name}",  # Description for the endpoint
        identity=IdentityConfiguration(
            type="user_assigned",  # Identity type (user-assigned managed identity)
            user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],  # Attach a managed identity if provided
        )
        if uai_id != ""
        else None,  # Only set the identity if a valid `uai_id` is provided
    )

# Trigger the endpoint creation or update process
try:
    # Begin creating or updating the endpoint, and wait for the operation to complete
    workspace_ml_client.begin_create_or_update(endpoint).wait()
    print("\n---Endpoint created successfully---\n")
except Exception as err:
    # If the creation or update fails, raise a detailed error
    raise RuntimeError(
        f"Endpoint creation failed. Detailed Response:\n{err}"
    ) from err
Key Concepts:
- `ManagedOnlineEndpoint`: Represents an online endpoint in Azure ML, which lets you deploy and manage machine learning models as a web service.
- `IdentityConfiguration`: Defines the identity configuration for the endpoint; in this case, a user-assigned managed identity.
- `ManagedIdentityConfiguration`: Specifies the details of the user-assigned managed identity that the endpoint will use.
- `uai_id`: A variable that stores the resource ID of the user-assigned managed identity, used for authentication.
Workflow:
- Check for Existing Endpoint: It first checks whether an endpoint with the specified `endpoint_name` already exists in the workspace.
- Create Endpoint if Missing: If the endpoint doesn't exist, it creates a new managed online endpoint with the specified name and description. The identity configuration is added only if a valid `uai_id` is provided.
- Trigger Creation/Update: It then triggers the creation or update of the endpoint. If the endpoint already exists, the existing one is updated; if not, a new one is created.
- Error Handling: If any part of the process fails, a `RuntimeError` is raised with the detailed error message.
What this does:
- This code manages the creation or updating of an online endpoint for serving models. It ensures that the endpoint either exists or is created, with the possibility of associating a user-assigned managed identity for authentication.
Here's the clarified version of your code with detailed comments explaining each part:
# Initialize deployment parameters

# Name of the deployment, used to uniquely identify it
deployment_name = "phi3-deploy"

# SKU (Stock Keeping Unit) name defines the size and capability of the deployment; here, a "Standard_NCs_v3" type.
sku_name = "Standard_NCs_v3"

# Request timeout in milliseconds (90 seconds)
REQUEST_TIMEOUT_MS = 90000

# Set environment variables for the deployment.
# These variables will be used for authentication and other configuration during the deployment process.
deployment_env_vars = {
    "SUBSCRIPTION_ID": subscription_id,     # Azure subscription ID where the resources are located
    "RESOURCE_GROUP_NAME": resource_group,  # Resource group name that holds the resources
    "UAI_CLIENT_ID": uai_client_id,         # Managed identity (UAI) client ID for authentication
}
Key Concepts:
- `deployment_name`: The unique identifier for the deployment of the model, used to refer to this specific deployment in the Azure ML workspace.
- `sku_name`: The SKU name determines the resources (e.g., CPU, GPU) available for the deployment. `Standard_NCs_v3` is a SKU that includes GPU resources.
- `REQUEST_TIMEOUT_MS`: The maximum time, in milliseconds, for a request to complete. This avoids hanging requests and bounds the time for a scoring call.
- `deployment_env_vars`: A dictionary of environment variables needed by the deployment, typically used for authentication and configuration.
What this does:
- This code initializes the parameters required for deploying a model. It sets the deployment name, the SKU for the deployment, the timeout duration for the request, and necessary environment variables (subscription ID, resource group, and managed identity client ID). These parameters will be used later in the deployment process.
Here's the clarified version of your code with detailed comments explaining each part:
from azure.ai.ml.entities import Model, Environment
# Define the environment for the deployment
env = Environment(
# The Docker image used for the environment, in this case, a curated Azure ML image for foundation model inference
image='mcr.microsoft.com/azureml/curated/foundation-model-inference:latest',
# Inference configuration, specifying routes for liveness, readiness, and scoring
inference_config={
"liveness_route": {"port": 5001, "path": "/"}, # Health check route to check if the model is alive
"readiness_route": {"port": 5001, "path": "/"}, # Health check route to check if the model is ready
"scoring_route": {"port": 5001, "path": "/score"}, # The route used for model inference (scoring requests)
},
)
Key Concepts:
- `Environment`: Represents the environment in which the model will be deployed. It specifies the Docker image and the inference routes used for health checks and scoring.
- `image`: The Docker image to use; here, a curated Azure ML image designed for foundation model inference.
- `inference_config`: A dictionary that defines the routes for health checks and model inference.
  - `liveness_route`: The port and path used to check if the model service is alive (responsive).
  - `readiness_route`: The port and path used to check if the model is ready to handle inference requests.
  - `scoring_route`: The route used for prediction (scoring) requests to the model.
What this does:
- This code defines an environment for model deployment using a curated Azure ML image for foundation model inference. It specifies routes for liveness and readiness checks, as well as a route for scoring requests (model inference). These settings help ensure that the deployed model can be monitored and used for predictions once deployed.
Here's the clarified version of your code with detailed comments explaining each part:
from azure.ai.ml.entities import (
OnlineRequestSettings,
CodeConfiguration,
ManagedOnlineDeployment,
ProbeSettings,
Environment
)
# Define the deployment configuration for the online model deployment
deployment = ManagedOnlineDeployment(
# The name of the deployment, used to identify it in the Azure ML workspace
name=deployment_name,
# The name of the endpoint where the model will be deployed
endpoint_name=endpoint_name,
# The model ID for the model being deployed
model=model.id,
# The instance type (SKU) for the deployment, determining the resources available
instance_type=sku_name,
# Number of instances (nodes) to use for the deployment. Here, we use 1 instance.
instance_count=1,
# Environment to use for the deployment (previously defined)
environment=env,
# Environment variables for the deployment, used for configuration
environment_variables=deployment_env_vars,
# Request settings, such as the timeout for each request to the model (in milliseconds)
request_settings=OnlineRequestSettings(request_timeout_ms=REQUEST_TIMEOUT_MS),
# Liveness probe configuration to monitor if the model is alive and responsive
liveness_probe=ProbeSettings(
failure_threshold=30, # Number of consecutive failures before considering the model as unhealthy
success_threshold=1, # Number of successful checks before considering the model as healthy
period=100, # Frequency (in seconds) of the health checks
initial_delay=500, # Time to wait before starting health checks
),
# Readiness probe configuration to monitor if the model is ready to handle requests
readiness_probe=ProbeSettings(
failure_threshold=30, # Number of consecutive failures before considering the model as unhealthy
success_threshold=1, # Number of successful checks before considering the model as ready
period=100, # Frequency (in seconds) of the readiness checks
initial_delay=500, # Time to wait before starting readiness checks
),
)

# Trigger the deployment creation and wait for it to complete
try:
    workspace_ml_client.begin_create_or_update(deployment).wait()  # Create or update the deployment in the workspace
    print("\n---Deployment created successfully---\n")  # Notify successful deployment
except Exception as err:
    # If an error occurs, raise a RuntimeError with the error details
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err
Key Concepts:
- `ManagedOnlineDeployment`: Configures an online deployment in Azure ML, specifying the model, environment, instance type, request settings, probes, etc.
- `OnlineRequestSettings`: Defines request settings, such as the timeout for a request to the deployed model.
- `ProbeSettings`: Sets up liveness and readiness probes that monitor the health and readiness of the deployed model.
  - `liveness_probe`: Checks whether the model is still running and responsive.
  - `readiness_probe`: Checks whether the model is ready to handle requests (e.g., after initialization).
- `environment_variables`: Environment variables passed to the deployment, used for configuration (e.g., authentication).
What this does:
- This code defines and configures an online deployment for the model in Azure ML. It specifies the model, instance type, environment, request settings, and probes for monitoring the deployment's health and readiness.
- It then triggers the creation or update of the deployment in the Azure ML workspace. If successful, a success message is printed; otherwise, an error is raised.
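Once the deployment is up, you can sanity-check it through the same client. This is a sketch only: `sample_score.json` is a hypothetical request payload file whose format depends on what the scoring container expects.

```python
# Sketch: send a test request to the deployed endpoint and print the raw response.
response = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name=deployment_name,
    request_file="sample_score.json",  # hypothetical request payload file
)
print(response)
```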
Here's a clarified version of your code with detailed comments:
# Deleting the online deployment from the Azure ML workspace
workspace_ml_client.online_deployments.begin_delete(
name=deployment_name, # The name of the deployment to delete
endpoint_name=endpoint_name # The endpoint associated with the deployment
)
# Deleting the online endpoint from the Azure ML workspace
workspace_ml_client._online_endpoints.begin_delete(
name=endpoint_name # The name of the endpoint to delete
)
Key Concepts:
- `begin_delete`: Initiates the deletion of resources such as deployments or endpoints in the Azure ML workspace. The `begin` prefix indicates that the operation is asynchronous and may take time to complete.
- `online_deployments.begin_delete`: Deletes an online deployment, the specific model deployment tied to an endpoint.
- `_online_endpoints.begin_delete`: Deletes an online endpoint, the URL/interface through which models are served for inference.
What this does:
- The first line deletes the specified online deployment (identified by `deployment_name`) that is associated with the provided `endpoint_name`.
- The second line deletes the specified online endpoint (identified by `endpoint_name`) from the Azure ML workspace, which essentially removes the API endpoint for serving the model.
Notes:
- The use of `_online_endpoints` with a leading underscore suggests it is an internal or less publicly exposed API, though it works in this context. The typical approach is to use `online_endpoints.begin_delete` (see the sketch below); if `_online_endpoints` is what works in your case, it is acceptable.
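For reference, a sketch of the same cleanup through the public attribute, with `.wait()` added to block until the asynchronous delete completes:

```python
# Sketch: delete the endpoint via the documented online_endpoints attribute.
workspace_ml_client.online_endpoints.begin_delete(name=endpoint_name).wait()
```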
The code explained here can be found at https://github.com/Azure/azure-llm-fine-tuning