Abstract

In this project, we use AWS SageMaker to fine-tune a pretrained model for image classification. We use SageMaker profiling, the debugger, hyperparameter tuning, and other good ML engineering practices to complete the project.

The dataset used is a dog breed classification dataset.

Dataset

The dataset contains images from 133 dog breeds divided into training, testing and validation datasets.

The images are pictures of different dog breeds (Golden Retriever, Akita, etc.).

The data is stored in S3.

# Command to download and unzip data
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip
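After unzipping, it can help to confirm that each split contains the expected number of breeds before uploading anything. The following is a minimal sketch, assuming the archive unpacks to dogImages/train, dogImages/valid and dogImages/test with one subfolder per breed; that layout and the helper names summarize_split and summarize_dataset are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

# File extensions counted as images (an assumption about the archive contents).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarize_split(split_dir):
    """Return {class_folder_name: image_count} for one split directory."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def summarize_dataset(root="dogImages", splits=("train", "valid", "test")):
    """Return {split: (num_classes, num_images)}, skipping splits that are missing."""
    summary = {}
    for split in splits:
        split_dir = Path(root) / split
        if split_dir.is_dir():
            counts = summarize_split(split_dir)
            summary[split] = (len(counts), sum(counts.values()))
    return summary
```

Calling summarize_dataset() in the notebook's working directory should report 133 classes per split if the layout assumption holds.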

Session setup

import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()
role = sagemaker.get_execution_role()
region = session.boto_region_name

Store data

import os

# Expose the bucket name to the shell so the CLI command below can use it
os.environ["DEFAULT_S3_BUCKET"] = bucket
!aws s3 sync ./dogImages s3://${DEFAULT_S3_BUCKET}/dogImages

Hyperparameter Tuning

import sagemaker
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)



hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch-size": CategoricalParameter([32, 64, 128, 256, 512]),
    "epochs": IntegerParameter(2, 4)
}

objective_metric_name = "accuracy"
objective_type = "Maximize"
# Raw string so the backslashes reach the regex engine unchanged
metric_definitions = [{"Name": "accuracy", "Regex": r"\((\d+)%\)"}]
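The metric definition works by applying the regex to the training job's log output and taking the first capture group as the metric value. The sketch below applies the same pattern to a sample log line; the log line format is an assumption for illustration, since the actual output of hpo.py is not shown here.

```python
import re

# Same pattern as in metric_definitions: an integer percentage in parentheses.
ACCURACY_REGEX = r"\((\d+)%\)"


def extract_accuracy(log_line):
    """Return the integer percentage captured by the regex, or None if absent."""
    match = re.search(ACCURACY_REGEX, log_line)
    return int(match.group(1)) if match else None


# Hypothetical log line; the real script's wording may differ.
sample = "Test set: Average loss: 0.6931, Accuracy: 642/835 (77%)"
print(extract_accuracy(sample))  # 77
```

Note that the pattern only matches whole-number percentages: a line ending in (76.9%) would not produce a metric value.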

Estimators for Hyperparameters

The hyperparameters we are tuning are lr, epochs, and batch-size.
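In script mode, SageMaker passes hyperparameters to the entry point as command-line arguments, so the training script needs to parse flags with the same names. The contents of hpo.py are not shown here; the following is a sketch of how such a script might read the three tuned values, with the default values and the function name parse_hyperparameters being assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the tuned hyperparameters from command-line style arguments."""
    parser = argparse.ArgumentParser(description="Hyperparameters for a tuning job")
    # Flag names match the keys in hyperparameter_ranges; defaults are hypothetical.
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=2)
    # Ignore any extra arguments the platform may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(["--lr", "0.005", "--batch-size", "128", "--epochs", "3"])
print(args.lr, args.batch_size, args.epochs)
```

argparse converts the hyphen in --batch-size to an underscore, so the value is read as args.batch_size.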

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="hpo.py",
    role=role,
    py_version="py36",
    framework_version="1.8",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,
    max_parallel_jobs=4,
    objective_type=objective_type,
)

Fit Hyperparameter Tuner

train_images_path = f"s3://{bucket}/dogImages/train"
print(train_images_path)
validation_images_path = f"s3://{bucket}/dogImages/valid"
print(validation_images_path)

# Note: the "test" channel is fed the validation images
tuner.fit({"train": train_images_path, "test": validation_images_path})

s3://sagemaker-us-east-1-820131057864/dogImages/train
s3://sagemaker-us-east-1-820131057864/dogImages/valid
.........................!
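Each channel passed to fit is made available inside the training container as a local directory, with its path exposed through an environment variable named SM_CHANNEL_ followed by the upper-cased channel name. A minimal sketch of how an entry point might resolve those directories follows; the helper name channel_dir is hypothetical, and the fallback path reflects the usual /opt/ml/input/data/ location.

```python
import os


def channel_dir(channel_name, environ=os.environ):
    """Return the local directory for a SageMaker input channel."""
    env_key = f"SM_CHANNEL_{channel_name.upper()}"
    # Fall back to the conventional mount point if the variable is not set.
    default = f"/opt/ml/input/data/{channel_name}"
    return environ.get(env_key, default)


# The channel names match the keys used in tuner.fit above.
train_dir = channel_dir("train")
test_dir = channel_dir("test")
```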

Get the best estimator and the best hyperparameters

best_estimator = tuner.best_estimator()

# Get the hyperparameters of the best trained model
best_estimator.hyperparameters()


{'_tuning_objective_metric': '"accuracy"',
 'batch-size': '"256"',
 'epochs': '4',
 'lr': '0.002564503867458704',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"pytorch-training-2023-02-14-07-57-50-222"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-820131057864/pytorch-training-2023-02-14-07-57-50-222/source/sourcedir.tar.gz"'}
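The returned dictionary mixes the tuned values with SageMaker bookkeeping entries, and some values are JSON-encoded strings (for example '"256"'). A small sketch of pulling out just the tuned values in a usable form follows; the helper name extract_tuned_hyperparameters is hypothetical.

```python
import json

# Keys that were part of hyperparameter_ranges.
TUNED_KEYS = ("lr", "batch-size", "epochs")


def extract_tuned_hyperparameters(raw, keys=TUNED_KEYS):
    """Return only the tuned keys, with JSON-encoded strings decoded to numbers."""
    cleaned = {}
    for key in keys:
        value = raw[key]
        # Values such as '"256"' are doubly encoded, so decode until stable.
        while isinstance(value, str):
            try:
                value = json.loads(value)
            except json.JSONDecodeError:
                break  # Not JSON; keep the plain string as-is.
        cleaned[key] = value
    return cleaned


raw = {
    "_tuning_objective_metric": '"accuracy"',
    "batch-size": '"256"',
    "epochs": "4",
    "lr": "0.002564503867458704",
}
print(extract_tuned_hyperparameters(raw))
```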

Model Profiling and Debugging

Set up debugging and profiling rules and hooks

from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
]

from sagemaker.debugger import DebuggerHookConfig, ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(num_steps=10),
)
debugger_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "10", "eval.save_interval": "1"}
)
# Best hyperparameters found by the tuning job above
hyperparameters = {
    "batch-size": "256",
    "epochs": 4,
    "lr": "0.002564503867458704",
}

Create an estimator

from sagemaker.pytorch import PyTorch


estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    source_dir=".",
    entry_point="train_model.py",
    framework_version="1.8",
    py_version="py36",
    hyperparameters=hyperparameters,
    profiler_config=profiler_config,
    debugger_hook_config=debugger_config,
    rules=rules,
)

estimator.fit({"train": train_images_path, "test": validation_images_path}, wait=True)


2023-02-14 09:18:46 Completed - Training job completed
VanishingGradient: NoIssuesFound
Overfit: NoIssuesFound
LowGPUUtilization: NoIssuesFound
ProfilerReport: IssuesFound
Training seconds: 1251
Billable seconds: 1251
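The rule statuses printed at the end of the job can also be inspected programmatically. The sketch below filters a list of rule summaries for those that reported issues. It assumes each summary is a dictionary with RuleConfigurationName and RuleEvaluationStatus keys, as returned by estimator.latest_training_job.rule_job_summary() in the SageMaker Python SDK. The sample data mirrors the statuses shown above, and the helper name rules_with_issues is hypothetical.

```python
def rules_with_issues(rule_summaries):
    """Return the names of rules whose evaluation status is IssuesFound."""
    return [
        summary["RuleConfigurationName"]
        for summary in rule_summaries
        if summary.get("RuleEvaluationStatus") == "IssuesFound"
    ]


# Sample data mirroring the job output above; in the notebook this list would
# come from estimator.latest_training_job.rule_job_summary() (assumed shape).
sample_summaries = [
    {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "LowGPUUtilization", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "ProfilerReport", "RuleEvaluationStatus": "IssuesFound"},
]
print(rules_with_issues(sample_summaries))
```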

Debugger Report

from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys

trial = create_trial(estimator.latest_job_debugger_artifacts_path())


[2023-02-14 11:15:04.165 datascience-1-0-ml-g4dn-xlarge-94fad2f4401e538ca1255dfa1e84:125 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-02-14 11:15:04.176 datascience-1-0-ml-g4dn-xlarge-94fad2f4401e538ca1255dfa1e84:125 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-1-820131057864/pytorch-training-2023-02-14-08-54-36-852/debug-output
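Once the trial is loaded, saved tensors can be read back step by step. The following is a minimal sketch using the smdebug trial interface (trial.tensor, tensor.steps, tensor.value). The helper name get_tensor_series is hypothetical, and it assumes the chosen tensor holds one scalar per step, as a loss tensor typically does.

```python
def get_tensor_series(trial, tensor_name, mode):
    """Return (steps, values) for a scalar tensor saved during the given mode."""
    tensor = trial.tensor(tensor_name)
    steps = tensor.steps(mode=mode)
    # Convert each saved value to a plain float for easy plotting.
    values = [float(tensor.value(step, mode=mode)) for step in steps]
    return steps, values


# Example use in the notebook (the tensor name is an assumption; list the
# available names with trial.tensor_names() first):
# steps, losses = get_tensor_series(trial, "CrossEntropyLoss_output_0", ModeKeys.TRAIN)
```

With the hook parameters set earlier, training tensors would be saved every 10 steps and evaluation tensors every step, so the two series have different lengths.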

Model Deployment

from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=estimator.model_data,
    role=role,
    entry_point='deply.py',
    py_version='py36',
    framework_version='1.8',
)


predictor = pytorch_model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
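With the endpoint running, a prediction can be requested through predictor.predict. The sketch below wraps that call and picks the highest-scoring class. It assumes the inference script returns one list of per-class scores for each input, which depends on the entry point and is not confirmed by this notebook. The helper name predict_breed_index is hypothetical.

```python
def predict_breed_index(predictor, payload):
    """Send one input to the endpoint and return the index of the top score."""
    response = predictor.predict(payload)
    # Assumed response shape: a batch of size one holding per-class scores.
    scores = list(response[0])
    return max(range(len(scores)), key=scores.__getitem__)


# Example use in the notebook (payload format depends on the inference script):
# breed_index = predict_breed_index(predictor, payload)
# predictor.delete_endpoint()  # remove the endpoint when finished to stop charges
```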

Project Repository

Image Classification using AWS SageMaker