Final project for the DevOps Upskill course.
This project addresses an issue I encountered while working on my bachelor's thesis. I was developing ML models and had to compare their results and performance. The process was predominantly manual, involving extensive use of Excel sheets for results analysis. This approach was not only time-consuming but also prone to errors and inefficiencies.
This is an approximation of how working on my thesis went. The goal was to compare the performance of several different models on a specific task. The total time available for me to develop and write the thesis was 4 months. According to the Value Stream Map above, developing 3 models (the number that were developed in the end) would take 130 days, or about 4.33 months. The main bottleneck was tracking the performance of the models: collecting the results and comparing them manually.
To address the issue of tracking and comparing models, this project introduces an automated process for ML model training and result tracking on a local Kubernetes cluster. It allows a Python developer to submit their training jobs to the local cluster and to track and compare their performance in the MLflow UI. Additionally, a CI pipeline is available for the Python code, which also builds the Docker image and pushes it to Dockerhub.
Topics from the DevOps course that are covered:
- Value stream mapping
- Source control
- Branching strategies: Trunk-based development
- Building Pipelines
- Continuous Integration
- Security
- Docker
- Kubernetes
- Infrastructure as code
- Secrets management
## Repository structure

- The `infrastructure` directory contains the Terraform code for setting up the infrastructure. It assumes that a Minikube cluster has already been started locally.
  - `main.tf` contains the provider versions.
  - `mlflow.tf` contains the resources related to the MLflow tracking server.
  - `postgres.tf` contains the resources related to the Postgres DB, which is needed by the tracking server.
  - `variables.tf` contains the variables.
- The `training` directory contains the Python code for training and registering a model with MLflow and sklearn. It also contains one example manifest for running the example training code (a minimal sketch of what such a training script might look like is shown below).
- The `.github/workflows` directory contains 3 different GitHub Actions pipelines:
  - `infrastructure-commit.yaml` runs whenever changes are made to the `/infrastructure` directory. It checks the Terraform formatting, lints the Terraform code, validates it, and performs a Checkov check.
  - `python-commit.yml` runs whenever changes are made to the `/training` directory or the main `Dockerfile`. It runs black, pylint, gitleaks, and SonarCloud, builds the Docker image, checks it for vulnerabilities with Trivy, and then pushes it to Dockerhub.
  - `mlflow-commit.yml` runs whenever changes are made to `mlflow.Dockerfile` or its `requirements.txt`. It builds the Docker image for the MLflow tracking server, checks it for vulnerabilities with Trivy, and pushes it to Dockerhub.
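For illustration, here is a minimal sketch of what such a training script might look like. The dataset, hyperparameter values, and the `elasticnet-demo` model name are hypothetical; the actual code in `training` may differ:

```python
# Hedged sketch of an MLflow + sklearn training job, not the repo's actual script.
# Assumes the MLFLOW_TRACKING_URI env var points at the tracking server
# (inside the cluster this would be set in the job manifest).
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical hyperparameters -- the real job may use different values.
ALPHA, L1_RATIO = 0.5, 0.5

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = ElasticNet(alpha=ALPHA, l1_ratio=L1_RATIO)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

    # Log parameters and metrics so runs can be compared in the MLflow UI.
    mlflow.log_param("alpha", ALPHA)
    mlflow.log_param("l1_ratio", L1_RATIO)
    mlflow.log_metric("rmse", rmse)

    # Register the model; the registry is backed by the Postgres DB above.
    mlflow.sklearn.log_model(model, "model", registered_model_name="elasticnet-demo")
```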
## Requirements

- Minikube cluster running on the local machine, with the Minikube ingress addon enabled, and kubectl installed.
- Terraform >= 1.7.0.
- Choose a password for the database and set it as a Terraform variable in your development environment (Terraform reads any `TF_VAR_<name>` environment variable as the value of the variable `<name>`):

  ```shell
  export TF_VAR_db_password=<password>
  ```
## Setup

- Make sure you have everything from the 'Requirements' section set up already.
- From the `infrastructure` directory, run:

  ```shell
  terraform init
  terraform plan
  ```

- If the plan looks good to you, run:

  ```shell
  terraform apply
  ```

- Get the name of the MLflow server pod; it is the one whose name starts with `mlflow-server`. We need it for the next step.

  ```shell
  kubectl get pods -n mlflow
  ```

- Forward the port to be able to view the MLflow UI:

  ```shell
  kubectl port-forward <mlflow-server-pod-name> -n mlflow 5000:5000
  ```

  You can now access the MLflow UI at http://localhost:5000/.

- Apply the demo training manifest from `/training` in the `mlflow` namespace. The job finishes in several seconds, and then you can see the results in the MLflow UI.

  ```shell
  kubectl apply -f training/elasticnet_manifest.yml -n mlflow
  ```
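Beyond the UI, you can also compare finished runs programmatically while the port-forward from the previous step is active. A minimal sketch, assuming runs landed in the default MLflow experiment and logged the hypothetical `rmse`, `alpha`, and `l1_ratio` names from the training sketch above:

```python
# Hedged sketch: list runs and rank them by a metric instead of comparing by hand.
# Assumes the port-forward to the tracking server is active on localhost:5000.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# search_runs returns a pandas DataFrame with params and metrics as columns.
runs = mlflow.search_runs(experiment_names=["Default"])
best = runs.sort_values("metrics.rmse").head(5)
print(best[["run_id", "params.alpha", "params.l1_ratio", "metrics.rmse"]])
```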
