MLOps 101

Building a machine learning model is tedious. Yet the payoff is worth it, especially when the system you’re building aligns with business objectives and adds intelligence to decision-making processes.

Even if the model doesn’t meet its intended objectives, the journey yields valuable learnings that can guide future improvements. See MLOps Introduction.

Technologies used:

  • pytorch-lightning:: Framework built on top of PyTorch
  • DVC:: Data Management Tool
  • Hydra ⚙️:: Configuration Management Tool
  • Tensorboard:: Visualization Tool
  • Docker 🐳 and Compose:: Shipping Containers
  • Gradio:: UI
  • GitHub Actions:: Continuous Integration and Continuous Deployment
  • AWS ☁️::
    • IAM:: Manage Permissions and Roles
    • S3:: Store Artifacts
    • EC2:: Compute Resource to Train the Model
    • ECR:: Container Registry, similar to DockerHub
    • Lambda:: Serverless Compute
    • CDK:: Cloud Development Kit, an easy way to deploy CloudFormation

Pre-requisites:

Pretty much the ability to code in Python and train MNIST in PyTorch.

Check out the code here:: DogBreedsClassifier 🐕

Setup 1: Setting up DVC🗃️

Git is not intended to store large binary files, but git-lfs is. There are subtle differences with DVC, though, especially for binary files like images, audio, and tensors.

DVC stores the data itself on a remote server (here an AWS S3 backend) and keeps only MD5 hash pointer files, which are managed by the SCM, Git. See DVC Intro.

git init && dvc init                          # initialize Git and DVC in the repo
dvc add path/data_folder/                     # track the data folder with DVC
dvc remote add -d myremote s3://bucket-name   # set the S3 bucket as the default remote
dvc push                                      # upload the data to the remote
dvc pull                                      # fetch the data from the remote
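
After dvc add, DVC writes a small pointer file next to the data; Git tracks the pointer while the data itself lives in S3. A sketch of what that pointer looks like (the hash and sizes below are illustrative):

# path/data_folder.dvc (committed to Git)
outs:
- md5: 1a2b3c4d5e6f7890a1b2c3d4e5f67890.dir
  size: 123456789
  nfiles: 1000
  path: data_folder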

Setup 2: AWS☁️

Managing AWS cost is an art in itself.

Configure AWS and CDK on the command-line terminal and grant the required permissions.

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip -y
unzip awscliv2.zip
sudo ./aws/install

aws --version

# configuration
aws configure list

export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export CDK_DEFAULT_REGION=$(aws configure get region)

The IAM user must have certain permission policies, so it can push artifacts to S3 and pull images from ECR:

- AdministratorAccess
- AmazonElasticContainerRegistryPublicFullAccess
- AmazonS3FullAccess
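
As a sketch, a policy can be attached from the CLI like this (the user name is illustrative):

aws iam attach-user-policy \
    --user-name mlops-user \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess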

Setup 3: GitHub Secrets & Configure EC2

GitHub is the place you track all your code, and GitHub Actions kicks off on every push to your repo.

For the train and test pipelines we need an EC2 compute machine. Run the following on your self-hosted machine so it listens for workflow jobs:

# OS:: Linux
# Architecture:: X64
# Name: g4dn.xlarge
mkdir actions-runner && cd actions-runner              ## create a folder
curl -o actions-runner-linux-x64-2.321.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.321.0/actions-runner-linux-x64-2.321.0.tar.gz  ## Download latest runner
echo "HASH-VALUE  actions-runner-linux-x64-2.321.0.tar.gz" | shasum -a 256 -c  ## validate the hash
tar xzf ./actions-runner-linux-x64-2.321.0.tar.gz      ## extract the installer
./config.sh --url https://github.com/Muthukamalan/repo --token TOKENS   ## Create the runner and start the configuration experience
./run.sh &  ## Last step, run it!

GitHub Action

Add this to your GitHub workflow so jobs run on the self-hosted runner:

runs-on: self-hosted

GitHub Action
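
A minimal sketch of such a workflow file (the file name and steps are illustrative):

# .github/workflows/train.yml
name: train
on: push
jobs:
  train:
    runs-on: self-hosted            # picks up the g4dn.xlarge runner configured above
    steps:
      - uses: actions/checkout@v4
      - run: make train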

Setup 4: Hydra configs

Keeping configuration apart from your source code and making it usable from the command line is best practice. Hydra does this for us, and configs can be overridden through the CLI too.

configs/
├── callbacks
│   └── default.yaml        # early-stop, lr-monitor, model-checkpoint
├── data
│   └── dogs.yaml
├── experiment
│   ├── hparams.yaml        # optimizing search grid by `Optuna` TPEsampler
│   └── pretrain.yaml       # make torchscript model `script=true`
├── hydra
│   └── default.yaml
├── logger
│   └── tensorboard.yaml    # csv, mlflow
├── model
│   └── mamba.yaml          # mamba with different hparams
├── paths
├── trainer
│   └── default.yaml
├── __init__.py
├── eval.yaml
└── train.yaml
hydra:
  mode: "MULTIRUN" 
  launcher:
    n_jobs: 1   
  sweeper:
    sampler:
      _target_: optuna.samplers.TPESampler
      seed: 32
      n_startup_trials: 3 # number of random sampling runs before optimization starts
    direction: minimize   # we use test_metrics['test/loss_epoch']
    study_name: optimal_searching
    n_trials: 2     # number of hyperparameter combinations to try
    n_jobs: 16      # threads
    params:
      # https://github.com/facebookresearch/hydra/discussions/2906
      model.dims: choice([6,12,24,36],[12,24,48,72])  
      model.depths: "[3,3,15,3],[3,4,27,3], [3,3,9,3]"     
      model.head_fn: choice('norm_mlp','default')
      model.conv_ratio: choice(1,1.2,1.5)        
      data.batch_size: choice(16,32,64)

Hydra helps run the program with multiple combinations of hyperparameters.
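
For example, a single value can be overridden for one run, or the whole Optuna sweep launched in multirun mode (the src/train.py entry point is an assumption):

python src/train.py data.batch_size=32                 # override one hyperparameter
python src/train.py --multirun experiment=hparams      # launch the sweep defined above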

Setup 5: Tensorboard

Most of data science work is data clean-up and visualising things all along the way. There are a ton of tools out there.

Training

PyTorch Lightning wraps your training loop, gradient handling, and optimization step.
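
A minimal sketch of what that looks like (the class and its hyperparameters are illustrative, not the repo’s actual module):

import torch
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self, net: torch.nn.Module, lr: float = 1e-3):
        super().__init__()
        self.net = net
        self.lr = lr
        self.criterion = torch.nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.criterion(self.net(x), y)
        self.log("train/loss", loss)
        return loss  # Lightning runs backward() and optimizer.step() for us

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)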

If you’re on an EC2 g4dn.xlarge machine (a single T4 instance) with 256 GB storage, run

docker compose up train

or, in a local setup with a suitable GPU:

make hparams    # run the hyperparameter search
make train      # train the model

Keep an eye on the hardware to make sure the instance is properly utilised during training and inference (see the callback sketch after this list):

  • io operations
  • CPU usage
  • GPU usage
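
Lightning ships a callback that logs device utilisation to the configured logger; a minimal sketch:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import DeviceStatsMonitor

# logs CPU/GPU stats alongside the training metrics
trainer = Trainer(callbacks=[DeviceStatsMonitor()])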

Hardware Setup Runner - g4dn.xlarge

While doing the hyperparameter search, we monitor:

  • Loss
  • Accuracy
  • Learning rate
  • ..

A parallel-coordinates plot shows the possible combinations of hparams we tried during the Hparams Search.

Test-Pytest

Code coverage helps visually show how much of the model is covered by test cases:: unit tests, input and output size checks, and more.
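
A sketch of the kind of input/output size check meant here (the import path and constructor are assumptions):

# tests/test_model.py
import torch
from src.models.mamba import MambaClassifier  # hypothetical import path

def test_output_shape():
    model = MambaClassifier(num_classes=10)   # hypothetical constructor
    x = torch.randn(2, 3, 224, 224)           # batch of two RGB images
    assert model(x).shape == (2, 10)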

Evaluation

We test the torch-scripted model and compute confusion matrices for the training, testing, and validation datasets.

import os
from pathlib import Path

# evaluate the best checkpoint on the test set
trainer.test(model, datamodule, ckpt_path=trainer.checkpoint_callback.best_model_path)

# save confusion matrices under assets/
plot_confusion_matrix(
    model=model,
    datamodule=datamodule,
    path=os.path.join(Path(cfg.paths.root_dir), "assets", ""),
)
Confusion matrices: training, testing, and validation datasets.

Inference

Hugging Face offers free hosting to deploy our model with an interactive UI and a minimal amount of boilerplate.

Added a workflow on GitHub Actions to push the model to Gradio Spaces:

gradio deploy

Gradio Interface

Once the model has been trained, we save a copy of the torchscripted model in the gradio/ folder; it will be loaded by torch later.

import torch

# load the torchscripted model; `device` and `inputs` come from the app context
model   = torch.jit.load('best_model.pt', map_location=device)
outputs = model(inputs)
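
A minimal sketch of the Gradio app wrapped around it (the preprocessing and labels are illustrative):

import gradio as gr
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model  = torch.jit.load("best_model.pt", map_location=device)

def predict(image):
    # preprocessing is illustrative; match whatever the training pipeline used
    x = torch.from_numpy(image).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        probs = model(x.to(device)).softmax(dim=-1)
    return {f"class_{i}": p.item() for i, p in enumerate(probs[0])}

gr.Interface(fn=predict, inputs=gr.Image(), outputs=gr.Label()).launch(server_port=8000)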
Deploy on Lambda Function
# requirements to run CDK
aws-cdk-lib==2.170.0 
constructs>=10.0.0,<11.0.0

To make it simple, we use the Lambda Web Adapter to ease deployment on AWS Lambda; the adapter binary lives in the /opt/extensions/ directory.

Check out gradio-web-adapter.

FROM public.ecr.aws/docker/library/python:3.11.10-slim
# Install AWS Lambda Web Adapter
COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.8.4 /lambda-adapter /opt/extensions/lambda-adapter
ENV PORT=8000
WORKDIR /code
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
# make sure the app listens on port 8000
CMD [ "python", "app.py" ]

Lambda Function

{ "app": "python3 cdk.py"}

To deploy the Lambda function on AWS:

cdk deploy --verbose    # provision the stack
cdk destroy             # tear everything down when finished

Note that cdk destroy won’t delete resources that aren’t empty; here we pushed the Docker image to ECR, so that repository must be cleaned up manually. Make sure to check for unnecessary credit bills.

DEMO

This post is licensed under CC BY 4.0 by the author.