Welcome to my talk, entitled Reducing Technical Debt for ML Platforms. A little about me before we hop into the talk: I work as a senior data engineer at Episodes on the NLP team. I love architecting software and designing data-handling pipelines, and I strive for simple architectures, though, as they say, designing simple architectures is a difficult job. I have also shared my social handles here, in case you want to reach out after the talk to discuss any of the content I've presented.

So let's move to the agenda. What is the agenda for today's talk? We'll start by defining technical debt, then move on to understanding CD for ML, that is, continuous delivery for machine learning. After that we'll discuss some CI/CD best practices and how they translate to MLOps. Finally, I'll present an MLOps solution that we have developed at Episodes.

So what is technical debt? Whenever you find yourself in a situation saying "let's release it now, we can fix it later" or "we can improve the process later on," those are instances where you may be introducing technical debt into your systems. Technical debt is the coding that you must do tomorrow, or that you will ultimately have to do in the near future, because you took a shortcut in order to deliver the software today. That shortcut may have been taken to meet a deliverable or a deadline, or maybe the team lacked the skills to do it properly at that point in time, but it is work you ultimately have to do in the future. Here we'll discuss the aspects of technical debt that are closely related to automation, and what a lack of automation can do to your software cycles.

One of the major impacts of having technical debt in your system is that it slows down your development. If you don't have enough automation built into your system, then releasing a simple feature or a simple change can take you a lot of time. Later I'll talk about how we have automated our whole testing pipeline; if that is not automated, the developer may have to perform manual checks, or there may be human interventions in between, and those things can make your whole development and deployment cycle really slow.

If you want to know whether you have this kind of technical debt in your system, ask yourself some simple questions. Now that your whole pipeline is set up, how much time does it take to create copies of that pipeline? Can you create an environment similar to the production environment? If you want to do A/B testing, or some particular type of release, is that possible? And does your data scientist have to worry about the whole ML infrastructure required to host the pipeline? These are some questions to ponder to see whether you have the type of technical debt we are discussing here.

So, to reduce the technical debt in a machine learning system and to add more automation to it, in comes continuous delivery. It is essential, and if you don't have it from scratch, when your project has just started, and you keep ignoring it, you may end up in a situation like this: adding it at a later stage of the project increases the effort by a huge amount.
At Episodes, we decided to build our system with CI/CD as a priority, and we set some standards for the whole CI/CD system. First, it should act as a gatekeeper for the production and dev environments: if someone pushes something to the dev environment, it should not break anything, and if a change moves from dev to the production environment, nothing should break production either. It should be fully automated, so that we can improve developer experience and productivity; if you find your developers doing repeated jobs, that is where productivity really takes a hit. We wanted to build an environment where we can ship faster, ship consistent code, and ship fearlessly: if my change goes out today, I should not have to worry that it might affect the production environment, because if I test my changes in a production-like environment, I get more confidence that they will work in production. That was the whole vision behind building the CI/CD system around our ML systems.

Now, when ML models are involved, you can have different releases and multiple versions of a model being deployed, and with such diverse environments involved, there is a possibility that something goes wrong. If anything does go wrong, the rollback should be very easy. For us, rollback is a single-click event: click a button, and the rollback happens automatically. Essentially, this whole system should bridge the gap between the data science and the data engineering tasks. There should be very little friction when moving from the data, to modeling it, to pushing the ML models to production; and if we have to repeat the cycle again, that should not introduce any friction either, and should be pretty automated.

Those aspects relate to CI/CD in general, but when you move from CI/CD to MLOps, there are some extra steps involved. The first is building the model. Once the developer has built the model, the next step is to save the model artifacts; artifacts are the outputs we receive once the training of the model is done. He has to push those artifacts to a long-term store. We are on AWS Cloud, so we use AWS S3 to store the artifacts. Once the model is prepared, he runs some basic checks to see whether the model is even worth testing in the test environment, and generates reports to get an overview of how the model looks. Once he is sure he can take this model into a test environment, he deploys it there. That environment should be like live production, and automated tests should be in place. Once the model is deployed into the test environment, tests are run, a performance report is generated, and it is emailed or otherwise notified to the stakeholders. The test environment should also give the developer the flexibility to test his deployed model locally or manually.
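To make those extra steps concrete, here is a minimal sketch of what the artifact-and-checks portion could look like as a GitHub Actions workflow. This is an illustration only: the bucket name, region, and script names (train.py, basic_checks.py) are hypothetical placeholders, not our actual setup.

```yaml
# Hypothetical sketch: train, store artifacts in S3, run basic checks.
# Bucket, region, and script names are illustrative placeholders.
name: model-artifacts-and-checks
on:
  pull_request:
    branches: [develop]

jobs:
  build-and-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Train the model and produce artifacts
        run: python train.py --output-dir ./artifacts
      - name: Push artifacts to the long-term store (S3)
        run: aws s3 cp ./artifacts "s3://example-model-artifacts/${{ github.head_ref }}/" --recursive
      - name: Run basic checks and generate an overview report
        run: python basic_checks.py --artifacts ./artifacts --report report.html
```

Keying the S3 prefix on the branch name keeps each developer's artifacts separate, in the same spirit as the branch-tagged Docker images later in the talk.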
And suppose the developer has different versions of the model prepared: the test environment should have the flexibility to deploy those models in parallel through the test pipeline. It should not be limited to deploying one model at a time. That's how the test environment should look. Once the testing is done, the model is deployed to the production environment, and this is again fully automated, because the steps involved in the test environment and in production, in terms of rolling out, are quite similar; so this should be automated, and the model gets deployed.

Now, coming to how MLOps looks at Episodes: we have built a production-like environment with stateless provisioning of resources. Even if someone comes along and destroys some particular resource in the infrastructure, we can rebuild that resource and the architecture will come back up by itself, automatically. In our NLP pipeline we have five to six machine learning projects working together to produce the final inference output, so as the number of ML projects increases, the pipeline we build should scale by itself; that is, it should be easy to replicate across ML projects, and we have built the pipeline that way. All the infrastructure we have written is written in Terraform. Terraform lets you express your infrastructure as code, so whatever resources we deploy on AWS, we provision through Terraform. The whole architecture and environment should be foolproof, automated, and flexible enough to make developers' lives easy.

Here you see the whole ML model lifecycle. The first step is developing the model. Once the developer has developed the model, he prepares it for production, that is, he runs some basic tests and decides whether he can push the model for testing. Once he pushes the model for testing, it goes through CI/CD pipelines, the artifacts are pushed to a test location, and there are deployment strategies associated with it. We host our ML models as APIs, so containerization happens, the API is hosted through the container, and the deployment can be scaled up easily. After this is deployed to a test environment, once the automated tests have passed and we are ready to deploy to production, the same two steps happen: CI/CD pipelines run to push that model to production, followed by a similar deployment step. If you notice, these two boxes appear in the test environment and also in the production environment, which means this is the place where we should automate, so that we are not doing the repeated tasks manually; a sketch of one way to share those steps follows below. Once the model is in production, it is monitored to keep an eye on its performance, and based on whatever feedback the developers receive, they change the model, push it out again for testing, and then release the change to production. That's what the lifecycle looks like, and we have automated all the human steps, all the manual steps involved, to ease the whole development lifecycle.
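One way to avoid duplicating those two shared boxes is a reusable GitHub Actions workflow that both the test and production paths call with different inputs. This is only a sketch of the idea, not our actual pipeline; the file name, input, and deploy script are assumptions for illustration.

```yaml
# Hypothetical reusable workflow (.github/workflows/deploy.yml):
# the shared "CI/CD pipeline + deployment" steps, parameterized
# by target environment so test and production don't duplicate them.
name: deploy
on:
  workflow_call:
    inputs:
      environment:          # e.g. "test" or "production"
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build, push, and roll out for the target environment
        run: ./ci/deploy.sh "${{ inputs.environment }}"

# A caller pins the environment (in a separate workflow file):
#   jobs:
#     test-deploy:
#       uses: ./.github/workflows/deploy.yml
#       with:
#         environment: test
```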
This is the basic architecture of how the APIs are hosted in our environment. We make use of AWS Elastic Container Service (ECS), a service that allows you to host your Docker containers. A basic requirement for that is that you have pushed your Docker image to the Elastic Container Registry (ECR), which is a registry for storing Docker images. Once that image is pushed, the whole infrastructure is provisioned to start up your Docker container, and that container is put behind a load balancer, so that if someone wants to query the API or send POST requests to it, the request goes through the load balancer.

I have shown three stages here. The first is the production stage, the main stage; you can call it the master stage. The second is the staging stage, and the third, the feature stage, is essentially the test environment. If you want to host a Docker container on ECS, the basic steps are: push your image to ECR, build a task definition, put that task definition behind a service, and put the service behind a load balancer. So essentially, if I push my code to a feature branch, the Docker image corresponding to the code base I pushed into the feature branch should be pushed to ECR, a task definition should be created, a service should be created from it, and the service should be put behind the load balancer, so that we can query the API built from this feature branch code base. Similarly, once the feature branch is merged into the staging branch, the same process is followed, and the same again when it is pushed to production. That's what the whole lifecycle, the architecture pipeline, looks like for hosting our APIs on AWS ECS.

Now, to automate the steps I've shown here, we have built two GitHub Actions. The first GitHub Action is triggered whenever a pull request is opened from a feature branch against the develop branch. On that open-PR event, a GitHub Action runs that takes the feature branch code base, builds the Docker image, and pushes it to ECR. It creates a task definition on AWS, creates a service on AWS ECS, and then puts that ECS service behind a load balancer. All these provisioning steps on AWS are performed using Terraform, and that Terraform is executed inside the GitHub Action. Similarly, we have built one more GitHub Action, which is executed whenever there is a PR close event, whether the PR is closed with merged or unmerged commits. Once the PR is closed, what we want is that whatever test environment we had deployed gets destroyed. So whenever a PR is closed, we delete the task definition we had provisioned, we delete the service, and we delete the rule we had created on the ALB. These two simple GitHub Actions let us put the whole automation in place: the data scientist just has to worry about pushing his code to a feature branch and opening a pull request, and once that pull request is closed, the whole test environment is destroyed automatically.
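As a sketch of that second action, the close-PR teardown could look something like the following; the directory layout, variable name, and secrets are illustrative assumptions rather than our exact code.

```yaml
# Hypothetical sketch of the teardown action: when a PR is closed
# (merged or not), destroy that feature branch's test environment.
name: destroy-test-environment
on:
  pull_request:
    types: [closed]   # fires for both merged and unmerged closes

jobs:
  destroy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      - name: Destroy the branch's task definition, service, and ALB rule
        working-directory: ci/service
        run: |
          terraform init
          # The branch variable scopes the destroy to this feature
          # branch's resources only.
          terraform destroy -auto-approve -var "branch=${{ github.head_ref }}"
```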
Now I'll walk you through the first GitHub Action I've discussed here, and the steps involved in building it. Like I told you, the first GitHub Action is triggered whenever a pull request is raised against the target branch, which here we take to be develop; so it runs whenever a pull request is raised against the develop branch.

In this job, from whatever files and code the developer has pushed, we have to build a Docker image and push it to ECR. So first, we fetch the code base of the feature branch into the build job. We configure the AWS credentials needed to push images and build resources in the AWS environment. For provisioning resources on AWS we use Terraform, to have the whole infrastructure as code in place, so we set up Terraform here. The next step is to log in to Amazon ECR, because once we build the Docker image we will push it to ECR; and once that image is pushed, we log out of Amazon ECR. The next step is to deploy the resources on AWS, that is, build the task definition and the service. What we assume here is that your whole Terraform code base is placed in the service folder under the CI folder. The basic steps when you deploy through Terraform are: initialize Terraform in that directory, with the state saved somewhere, either locally or in S3 (we use S3); validate your Terraform code; plan, to see what resources would be deployed in response to this execution; and apply the changes, that is, deploy the task definition, the service, and the rule. Once that is done, a message is posted on the pull request saying that your test environment is deployed. That's how the first GitHub Action workflow works; we have seen what a GitHub Action should look like if you want to deploy your resources using Terraform from inside the GitHub Action environment. Here is a minimal sketch of such a workflow.
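This sketch mirrors the steps just described, but it assumes a ci/service Terraform folder, an image repository called my-ml-api, and a branch Terraform variable; all of those names, and the secrets, are placeholders rather than our exact code.

```yaml
# Hypothetical sketch of the first action: on an open PR against
# develop, build and push the image, then provision the test
# environment with Terraform.
name: deploy-test-environment
on:
  pull_request:
    branches: [develop]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      # Fetch the feature branch code base into the build job.
      - uses: actions/checkout@v4
      # Configure AWS credentials for pushing images and provisioning.
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      # Set up Terraform for the infrastructure-as-code steps.
      - uses: hashicorp/setup-terraform@v3
      # Log in to ECR so the built image can be pushed.
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - name: Build the Docker image and push it, tagged with the branch
        run: |
          IMAGE="${{ steps.ecr.outputs.registry }}/my-ml-api:${{ github.head_ref }}"
          docker build -t "$IMAGE" .
          docker push "$IMAGE"
      - name: Provision the task definition, service, and ALB rule
        working-directory: ci/service
        run: |
          terraform init        # state kept remotely, e.g. in S3
          terraform validate
          terraform plan -var "branch=${{ github.head_ref }}"
          terraform apply -auto-approve -var "branch=${{ github.head_ref }}"
      - name: Post a message on the PR that the test environment is up
        env:
          GH_TOKEN: ${{ github.token }}
        run: gh pr comment ${{ github.event.pull_request.number }} --body "Test environment deployed"
```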
I have prepared a short demo to show you how the action runs. In this demo we will just be creating a task definition version: whenever a pull request is opened against master, a new task definition version should be deployed. Let's look at the ECR environment first. Here we have the repository that holds the Docker images; whenever we run the action, an image tagged with my feature branch name should be pushed into this repository. And then we have the task definitions, so a task definition with my branch tag should also be created here.

Let me first walk you through the repository I've prepared. In this repository, I have an application file, which is a simple Flask application, a hello world application. I have a Dockerfile that wraps this application into a Docker container: it takes the application, adds app.py, and builds the Docker image from this Dockerfile. Also, to provision the resources on AWS, I have put the Terraform code in place, but we'll not go into its details; the scope here is not to get familiar with Terraform, but to see how the process is carried out. So what I'll do is push a simple change to app.py; in response to that, a new Docker image should be pushed to the ECR environment, and a new task definition should be created. Let's start with making a change: I'm changing the greeting message in app.py. Before that, let me create a new branch for the demo, and then I will change the greeting message.

So I have just saved app.py, and I'll push this change to upstream. Done, pushed to upstream. Next, I'll open a pull request, and in response to that pull request the action will run, the image will be pushed to the AWS environment, and our task definition will be created. So let's open the pull request: "update to greet message," and I am opening it against the master branch in this case. Okay, the pull request is open; let's see if we have our action. Yes, a new action is triggered in response to the pull request we have opened. In this action I have defined the same steps: building and pushing the Docker image, and running the Terraform code that deploys the task definition. It can take several seconds, sometimes a minute, for your GitHub Action to run. Here you can see it has multiple steps: fetching the ECR repositories, configuring your credentials, all the steps we saw in the slides a while ago. It builds your Docker image, pushes it to ECR, and then initializes, validates, plans, and applies your Terraform changes. Let this complete, and then we'll move to the AWS console.

For better visibility, I'll show you on the AWS console that these resources are deployed; all those resources should carry a branch tag, which is what actually keeps each feature branch's resources distinct from everyone else's. This run is marked green, which means our GitHub Action has completed. Let's check ECR: yes, we have our own version of the Docker image pushed here. The branch name for us was demo-5.1, so the image tag is the same. Similarly, a task definition should be created with the same tag; as you can see, our task definition was created with the branch tag, in response to what we just pushed. This is how we use branch names to keep each feature branch's resources distinct from one another. And this was simple: the developer only had to push his code base and open a pull request; after that, the whole deployment of the infrastructure was automated, and we made use of Terraform. Setting up Terraform in the action is pretty easy.

So let's head back to the slides and look at the takeaways. What we have seen here is that GitHub Actions plus infrastructure as code is the key to automation in this whole pipeline we've shown. And it doesn't have to be GitHub Actions: if you are using any other CI tool, like Travis CI or Jenkins, to perform your CI/CD, you can use that, but make sure you have the full infrastructure written as code so that you can replicate it across environments.

Now, once the deployment to test is done, what are the steps you have to take care of while building out the rest of the system? Here, whenever there is a push commit to the main branch, that is, when we want to deploy to the production environment, a similar GitHub workflow is applied. In our case, the scaling of these ML APIs is on demand: depending on the number of requests the API gets, it scales out by itself. We are not scaling up ahead of time, so we don't have to pay for resources that are not actually being utilized.
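As a rough sketch, and assuming the same build-and-apply steps as the test workflow, the production trigger differs mainly in the event it listens for:

```yaml
# Hypothetical sketch: production reuses the same steps as the
# test workflow, but is triggered by a push to the main branch
# rather than by a pull request.
name: deploy-production
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ...the same Docker build/push and terraform init/validate/
      # plan/apply steps as the test workflow, pointed at the
      # production ECS service and listener rule.
```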
The only job for a data scientist or ML engineer is to push a Dockerfile, push his changes to a feature branch, and open a pull request; after that, all the steps for deploying the ML infrastructure are fully automated. The way we have built the system helps us keep the test and production environment configurations pretty similar, and it is pretty easy to replicate this whole environment across ML projects: you just have to set up similar GitHub Actions in your projects, and then you can replicate this across your repositories.

We have seen how to deploy the resources to a test environment, but one major point is what happens afterwards: once your changes are pushed to production, you have to make sure that whatever resources you deployed to the test environment are destroyed as soon as they are out of use or no longer required. For that we have the PR close event, which makes sure you don't have stale resources lying around in the AWS environment; it is our responsibility to keep the AWS environment clean, so such a close event handler should be there.

So this was a quick presentation on how we have automated deployments in our ML infrastructure, how it is making the data scientists' lives easier, and how it is also reducing work on the data engineering side. That's all for today. Thank you for joining in, and let me know if you have any questions.