MLOps 101

Building a machine learning model is tedious. Yet the payoff is worth it, especially when the system you’re building aligns with business objectives and adds intelligence to decision-making processes.

Even if the model doesn’t meet its intended objectives, the journey yields valuable learnings that can guide future improvements. See MLOps Introduction.

Technologies used:

  • pytorch-lightning:: Framework built on top of PyTorch
  • DVC:: Data Management Tool
  • Hydra ⚙️:: Configuration Management Tool
  • Tensorboard:: Visualization Tool
  • Docker 🐳 and Compose:: Shipping Containers
  • Gradio:: UI
  • GitHub Actions:: Continuous Integration and Continuous Deployment
  • AWS ☁️::
    • IAM:: Manage Permissions and Roles
    • S3:: Store Artifacts
    • EC2:: Compute Resource to Train the Model
    • ECR:: Container Registry, similar to DockerHub
    • Lambda:: Serverless Compute
    • CDK:: Cloud Development Kit, an easy way to deploy CloudFormation

Pre-requisites:

Pretty much the ability to code in Python and train MNIST in PyTorch.

Check out the code here:: DogBreedsClassifier 🐕

Setup 1: Setting up DVC🗃️

Git is not intended to store large binary files, but git-lfs is. There are subtle differences with DVC, though, especially for binary files like images, audio, and tensors.

DVC stores the data itself on a remote server (here an AWS S3 backend) and keeps only MD5 hash pointer files, which are managed by the SCM, Git. See DVC Intro.

git init && dvc init                          # initialize Git and DVC in the repo
dvc add path/data_folder/                     # track the data folder with DVC
dvc remote add -d myremote s3://bucket-name   # set the S3 bucket as the default remote
dvc push                                      # upload the data to the remote
dvc pull                                      # fetch the data from the remote
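
After dvc add, DVC writes a small pointer file next to the data; Git tracks the pointer while the data itself lives in S3. A sketch of what that pointer looks like (the hash and sizes below are illustrative):

# path/data_folder.dvc (committed to Git)
outs:
- md5: 1a2b3c4d5e6f7890a1b2c3d4e5f67890.dir
  size: 123456789
  nfiles: 1000
  path: data_folder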

Setup 2: AWS☁️

Managing AWS cost is an art in itself.

Configure AWS and CDK on the command-line terminal and grant the required permissions.

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip -y
unzip awscliv2.zip
sudo ./aws/install

aws --version

# configuration
aws configure list

export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export CDK_DEFAULT_REGION=$(aws configure get region)

The IAM user must have certain permission policies, so it can push artifacts to S3 and pull images from ECR:

- AdministratorAccess
- AmazonElasticContainerRegistryPublicFullAccess
- AmazonS3FullAccess
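
As a sketch, a policy can be attached from the CLI like this (the user name is illustrative):

aws iam attach-user-policy \
    --user-name mlops-user \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess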

Setup 3: GitHub Secrets & Configure EC2

GitHub is the place you track all your code, and GitHub Actions kicks off on every push to your repo.

For the train and test pipelines we need an EC2 compute machine. Run the following on your self-hosted machine so it listens for workflow jobs:

# OS:: Linux
# Architecture:: X64
# Name: g4dn.xlarge
mkdir actions-runner && cd actions-runner              ## create a folder
curl -o actions-runner-linux-x64-2.321.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.321.0/actions-runner-linux-x64-2.321.0.tar.gz  ## Download latest runner
echo "HASH-VALUE  actions-runner-linux-x64-2.321.0.tar.gz" | shasum -a 256 -c  ## validate the hash
tar xzf ./actions-runner-linux-x64-2.321.0.tar.gz      ## extract the installer
./config.sh --url https://github.com/Muthukamalan/repo --token TOKENS   ## Create the runner and start the configuration experience
./run.sh &  ## Last step, run it!

GitHub Action

Add this to your GitHub workflow so jobs run on the self-hosted runner:

runs-on: self-hosted

GitHub Action
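
A minimal sketch of such a workflow file (the file name and steps are illustrative):

# .github/workflows/train.yml
name: train
on: push
jobs:
  train:
    runs-on: self-hosted            # picks up the g4dn.xlarge runner configured above
    steps:
      - uses: actions/checkout@v4
      - run: make train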

Setup 4: Hydra configs

Keeping configuration apart from your source code and making it usable from the command line is best practice. Hydra does this for us, and configs can be overridden through the CLI too.

configs/
├── callbacks
│   └── default.yaml        # early-stop, lr-monitor, model-checkpoint
├── data
│   └── dogs.yaml
├── experiment
│   ├── hparams.yaml        # optimizing search grid by `Optuna` TPEsampler
│   └── pretrain.yaml       # make torchscript model `script=true`
├── hydra
│   └── default.yaml
├── logger
│   └── tensorboard.yaml    # csv, mlflow
├── model
│   └── mamba.yaml          # mamba with different hparams
├── paths
├── trainer
│   └── default.yaml
├── __init__.py
├── eval.yaml
└── train.yaml
hydra:
  mode: "MULTIRUN" 
  launcher:
    n_jobs: 1   
  sweeper:
    sampler:
      _target_: optuna.samplers.TPESampler
      seed: 32
      n_startup_trials: 3 # number of random sampling runs before optimization starts
    direction: minimize   # we use test_metrics['test/loss_epoch']
    study_name: optimal_searching
    n_trials: 2     # number of hyperparameter combinations to try
    n_jobs: 16      # threads
    params:
      # https://github.com/facebookresearch/hydra/discussions/2906
      model.dims: choice([6,12,24,36],[12,24,48,72])  
      model.depths: "[3,3,15,3],[3,4,27,3], [3,3,9,3]"     
      model.head_fn: choice('norm_mlp','default')
      model.conv_ratio: choice(1,1.2,1.5)        
      data.batch_size: choice(16,32,64)

Hydra helps run the program with multiple combinations of hyperparameters.
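
For example, a single value can be overridden for one run, or the whole Optuna sweep launched in multirun mode (the src/train.py entry point is an assumption):

python src/train.py data.batch_size=32                 # override one hyperparameter
python src/train.py --multirun experiment=hparams      # launch the sweep defined above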

Setup 5: Tensorboard

Most of data science work is data clean-up and visualising things all along the way. There are a ton of tools out there.

Training

PyTorch Lightning wraps your training loop, gradient handling, and optimization step.
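
A minimal sketch of what that looks like (the class and its hyperparameters are illustrative, not the repo’s actual module):

import torch
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self, net: torch.nn.Module, lr: float = 1e-3):
        super().__init__()
        self.net = net
        self.lr = lr
        self.criterion = torch.nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.criterion(self.net(x), y)
        self.log("train/loss", loss)
        return loss  # Lightning runs backward() and optimizer.step() for us

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)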

If you’re on an EC2 g4dn.xlarge machine (a single T4 instance) with 256 GB storage, run

docker compose up train

or, in a local setup with a suitable GPU:

make hparams    # run the hyperparameter search
make train      # train the model

Keep an eye on the hardware to make sure the instance is properly utilised during training and inference (see the callback sketch after this list):

  • io operations
  • CPU usage
  • GPU usage
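
Lightning ships a callback that logs device utilisation to the configured logger; a minimal sketch:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import DeviceStatsMonitor

# logs CPU/GPU stats alongside the training metrics
trainer = Trainer(callbacks=[DeviceStatsMonitor()])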

Hardware Setup Runner - g4dn.xlarge

While doing the hyperparameter search, we monitor:

  • Loss
  • Accuracy
  • Learning rate
  • ..

A parallel-coordinates plot shows the possible combinations of hparams we tried during the Hparams Search.

Test-Pytest

Code coverage helps visually show how much of the model is covered by test cases:: unit tests, input and output size checks, and more.
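
A sketch of the kind of input/output size check meant here (the import path and constructor are assumptions):

# tests/test_model.py
import torch
from src.models.mamba import MambaClassifier  # hypothetical import path

def test_output_shape():
    model = MambaClassifier(num_classes=10)   # hypothetical constructor
    x = torch.randn(2, 3, 224, 224)           # batch of two RGB images
    assert model(x).shape == (2, 10)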

Evaluation

We test the torch-scripted model and compute confusion matrices for the training, testing, and validation datasets.

import os
from pathlib import Path

# evaluate the best checkpoint on the test set
trainer.test(model, datamodule, ckpt_path=trainer.checkpoint_callback.best_model_path)

# save confusion matrices under assets/
plot_confusion_matrix(
    model=model,
    datamodule=datamodule,
    path=os.path.join(Path(cfg.paths.root_dir), "assets", ""),
)
Confusion matrices: training, testing, and validation datasets.

Inference

Hugging Face offers free hosting to deploy our model with an interactive UI and a minimal amount of boilerplate.

Added a workflow on GitHub Actions to push the model to Gradio Spaces:

gradio deploy

Gradio Interface

Once the model has been trained, we save a copy of the torchscripted model in the gradio/ folder; it will be loaded by torch later.

import torch

# load the torchscripted model; `device` and `inputs` come from the app context
model   = torch.jit.load('best_model.pt', map_location=device)
outputs = model(inputs)
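
A minimal sketch of the Gradio app wrapped around it (the preprocessing and labels are illustrative):

import gradio as gr
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model  = torch.jit.load("best_model.pt", map_location=device)

def predict(image):
    # preprocessing is illustrative; match whatever the training pipeline used
    x = torch.from_numpy(image).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        probs = model(x.to(device)).softmax(dim=-1)
    return {f"class_{i}": p.item() for i, p in enumerate(probs[0])}

gr.Interface(fn=predict, inputs=gr.Image(), outputs=gr.Label()).launch(server_port=8000)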
Deploy on Lambda Function
# requirements to run CDK
aws-cdk-lib==2.170.0 
constructs>=10.0.0,<11.0.0

To make it simple, we use the Lambda Web Adapter to ease deployment on AWS Lambda; the adapter binary lives in the /opt/extensions/ directory.

Check out gradio-web-adapter.

FROM public.ecr.aws/docker/library/python:3.11.10-slim
# Install AWS Lambda Web Adapter
COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.8.4 /lambda-adapter /opt/extensions/lambda-adapter
ENV PORT=8000
WORKDIR /code
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
# make sure the app listens on port 8000
CMD [ "python", "app.py" ]

Lambda Function

{ "app": "python3 cdk.py"}

To deploy the Lambda function on AWS:

cdk deploy --verbose    # provision the stack
cdk destroy             # tear everything down when finished

Note that cdk destroy won’t delete resources that aren’t empty; here we pushed the Docker image to ECR, so that repository must be cleaned up manually. Make sure to check for unnecessary credit bills.

DEMO

This post is licensed under CC BY 4.0 by the author.