So hi, I'm Shivay, I'm a developer advocate at Meilisearch and I've also been a contributor to Kubeflow. Hello, I'm Rishit, I'm an undergraduate student and researcher at the University of Toronto, where I work on academic machine learning research. I've also been a contributor to various open source projects including Kubeflow, PyTorch and more.

Alright, so some of you who are familiar with machine learning might be aware of the typical machine learning lifecycle and pipelines, so I won't go too deep into it. Primarily, whenever you're dealing with any machine learning workflow, you start with data collection, then there's the machine learning code itself, which is usually the smallest part of the entire pipeline, and once you have trained and fine-tuned your model it goes into production. As part of the production process you'll have a lot of serving infrastructure used to launch your model, and that involves a lot of CPUs; depending on the size of the model, especially today when we're looking at very large models, you may need a complex architecture involving multiple GPUs or TPUs at the same time. And once the model goes into production, you'll also be handling monitoring.

Since there are so many different components to this machine learning pipeline, it can be very complex when you're setting it up for the first time, and I like to divide it into four different parts. Starting with version control: whenever you're dealing with a lot of data, the dataset can be very unstructured and there can be a lot of complexity, especially in removing the different issues the data might have, so the data cleaning process itself takes a lot of time. And when you have to label this data, especially for very large datasets, there will be a lot of heterogeneous data as well, so data labeling itself is a huge challenge that you usually have to solve as one of the first steps in the entire machine learning pipeline. Then comes model training and fine-tuning: as a data scientist or machine learning engineer, one of your biggest tasks is to fine-tune the model so that it fits well with what you're training it on, so you can expect it to give good results once it has been launched, or is about to be launched, into production.
Now, once the model is fine-tuned enough, the next big thing is to deploy it, and especially for some of the larger models we're looking at today, deployment strategies have changed drastically. Whereas some of the small models could run on a single CPU, today you need strategies where you might run them across multiple GPUs. I gave a talk at KubeCon about how you can effectively use multi-GPU strategies, so feel free to talk to us if you want to learn more about that, but the point is that scaling up your machine learning models has become a lot more complex, and tools like Kubeflow or Flyte need special configuration that you might have to provision in order to get multiple GPUs working together to run some of these larger models. Things like data parallelism and model parallelism add even more complexity to the deployment.

Finally, when it comes to monitoring: once a model goes into production, it typically degrades over time as it is used, so closely monitoring the performance of models in production is very important. You have to maintain these models and continuously update them as the dataset and the nature of the model in production change, so ensuring the consistency and reliability of models in production takes up a lot of time.

So what we'll look at is how GitOps principles can help with some of these things, especially data versioning, infrastructure provisioning, and changes to your manifest files when your model is in production. Rishit will walk through some of the benefits of using GitOps.

Sure, so let's get right into it. This is a very deceiving title, "Machine Learning Loves GitOps". We want that, but let's change it to "well, sometimes". I'll also start by saying that I don't use any GitOps principles in most of my academic machine learning research; okay, I'll be honest, in all of my academic research. But why don't I, if this talk is centered around machine learning and GitOps? Because at the heart of GitOps, patches need to be declarative, and in a lot of academic scenarios that just isn't possible. But both of us do use GitOps in the machine learning development workflow, so let's start by walking through an example, and we'll look at a couple of these examples to see how it can be used effectively in some parts of the development workflow.

Okay, the experiment I have at the moment: I'm trying to train a simple ResNet model, I want to do this on the Tiny ImageNet dataset, which is 100,000 images, and I have a few GPUs for it. So that's the experiment we have set up, and ideally, during the model development part, you want to make a hypothesis. So I start by making one: I update the batch size to be larger, because a larger batch size almost always (almost always) gives you better model performance, and because we can handle it, I think a larger batch size might improve things. This is what my change looks like in the repository: just a declarative change, I simply update the batch size. The key thing to note here is that the hypothesis I make is declarative, and I make it as a pull request, at least for this example, so it's versioned and immutable.
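To make that concrete, here is a minimal sketch of what such a declarative, versioned hypothesis could look like; the file name, field names, and values are hypothetical and not taken from the actual repository used in the talk.

```python
# experiment_config.py -- hypothetical training configuration kept in Git.
# A "hypothesis" is expressed as a pull request that edits one of these fields,
# for example bumping batch_size from 64 to 256; the change is declarative,
# versioned, and reviewable before any training job picks it up.
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the committed configuration is treated as immutable
class ExperimentConfig:
    model: str = "resnet50"
    dataset: str = "tiny-imagenet"  # roughly 100,000 training images
    batch_size: int = 64            # the pull request in this example changes only this value
    learning_rate: float = 1e-3
    epochs: int = 30


CONFIG = ExperimentConfig()
```

A CI job or a controller can then pull the repository and reconcile the running training job against whatever configuration is at HEAD.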
That starts sounding a lot like I can use some of the GitOps principles, now that my patches are put in declaratively, so I'd say we can move on to the pull and reconcile part. What I do next is see how this change affects model performance, and that's where pull and reconcile comes in: I get a report, at least for the experiment we have here. What I also do is store and compare multiple experiments. This is the experiment I ran yesterday: I took quite some compute power and trained 100 models on Tiny ImageNet with the GitOps workflow I was just talking about, and it made training those 100 models and comparing them a lot easier. This is TensorBoard, so a lot of the comparisons happen in TensorBoard, but orchestrating all of these 100 training jobs in such a way that all of them are well versioned means I can see exactly how a given patch impacts my whole training job, and I finally get it all as a nice TensorBoard report. That's what's in this report: the link we get is a TensorBoard link, and all of the runs are added to it. This seems pretty interesting for our use case, because you're now able to easily do some of the things we wanted to do during the modeling process. As a side note, I wasn't trying to get particularly good model performance, but using this kind of workflow actually allowed me to do pretty well on Tiny ImageNet.

Then there's infrastructure provisioning: this also provisions the infrastructure declaratively, and a lot of it, for this example at least, is taken from Keras.io, which has a beautiful guide on it. We also have a talk about this at KubeCon, so I won't go into provisioning infrastructure declaratively here. Finally, what I want to do is close the GitOps loop: I look at all of these TensorBoard experiments and comparisons across multiple runs, I say this model, this patch, looks good, it satisfies everything I want, and then I just promote it, and there you have it. By the way, all of this is open source and you can try it out; these examples use Tiny ImageNet, but of course you can make it work for anything you want.

This also brings me to a really apt image from Codefresh showing the entire GitOps loop. It might look very familiar to most of you, since this is cdCon plus GitOpsCon, and it's pretty similar to your usual process when developing machine learning models; with a few changes you can start leveraging the power of GitOps and all of the GitOps principles, like the example I just showed. We'll also talk about a few more silos inside the machine learning workflow where we've been able to use and implement GitOps pretty effectively. So yes, you can do a lot more inside the machine learning development workflow with GitOps.
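As a rough illustration of the run comparison step described above, here is a minimal sketch, assuming a Keras/TensorFlow training setup; the file layout, helper names, and hyperparameters are my own illustration, not the talk's actual repository.

```python
# train_run.py -- illustrative only: tag each training run's TensorBoard logs
# with the Git commit that produced it, so many runs can be compared side by side.
import subprocess

import tensorflow as tf


def current_commit() -> str:
    # Ask Git for the short SHA of HEAD (the reviewed, merged hypothesis).
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()


def build_model(num_classes: int = 200) -> tf.keras.Model:
    # Tiny ImageNet has 200 classes; an untrained ResNet is used, as in the experiment above.
    return tf.keras.applications.ResNet50(weights=None, classes=num_classes)


def train(train_ds: tf.data.Dataset, val_ds: tf.data.Dataset) -> None:
    model = build_model()
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    # One log directory per commit, so each merged patch shows up as its own run.
    log_dir = f"logs/{current_commit()}"
    tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
    model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=[tensorboard_cb])
```

Because the log directory is keyed by the commit SHA, every merged patch appears as its own run, which is what makes comparing a hundred training jobs in a single TensorBoard report practical.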
Shivay will now talk about data versioning, which we didn't really cover in this experiment.

So, as a side note, DVC stands for Data Version Control, and it's a tool that allows you to very easily manage both your data and your models. As you might know, when you're dealing with machine learning, especially in the fine-tuning phase, you might have to run a lot of different experiments with different models, different model parameters, or different datasets, and as you fine-tune the model you want to make sure you can effectively track all of these changes along with your experimentation. To make machine learning experimentation easier, and so that you can easily manage your model and data, both data versioning and model versioning can be done with the help of DVC, which is an open source tool. This directly uses the principle behind GitOps, where we're essentially looking at different versions of code, the standard principle employed by Git.

Here, what I've done is run a very quick experiment; because of the time constraint I ran it on the MNIST dataset with a very simple machine learning model. If you look here, I've run dvc exp show, which keeps track of all the different experiments you have run, so you can see when each experiment was run and switch between them very efficiently. Every experiment you run and log with DVC keeps track of all the model changes you've made, and also the changes to the dataset. There are dedicated commands that closely mirror how Git works: for example, dvc add helps you track your data and model changes, similar to git add, and dvc checkout lets you move between these different experiments, similar to how you use git checkout to move between branches. Here's an example where I made two different Git commits: I logged one model under one commit, which was one experiment, and then a second model trained on a larger number of images under another commit, since I had changed the number of images, and I can track that change very easily with DVC and Git. So that's one aspect of being able to use GitOps principles directly in the realm of machine learning, with the help of DVC.
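For anyone following along, here is a rough sketch of the DVC flow just described, driven from Python purely for illustration (the same commands are normally typed in a shell); the paths and commit messages are placeholders, not the real experiment.

```python
# dvc_workflow.py -- rough sketch of the DVC commands described above, driven
# from Python purely for illustration; the same commands are normally typed in
# a shell. Paths and commit messages are placeholders, not the real experiment.
import subprocess


def run(*cmd: str) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Track the dataset and the trained model with DVC (analogous to `git add`).
# DVC stores the content in its cache and leaves small .dvc pointer files
# that get committed to Git alongside the training code.
run("dvc", "add", "data/mnist")
run("dvc", "add", "models/model.pkl")
run("git", "add", "data/mnist.dvc", "models/model.pkl.dvc", ".gitignore")
run("git", "commit", "-m", "Experiment 1: baseline model")

# After changing the dataset (for example, adding more images) and retraining,
# track the new versions and commit again.
run("dvc", "add", "data/mnist")
run("dvc", "add", "models/model.pkl")
run("git", "commit", "-am", "Experiment 2: more training images")

# List the tracked experiments, then jump back to the earlier one.
run("dvc", "exp", "show")
run("git", "checkout", "HEAD~1")
run("dvc", "checkout")  # restores the data and model versions for that commit
```

Checking out an older Git commit and then running dvc checkout restores exactly the data and model files that were tracked at that commit, which is what makes switching between experiments cheap.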
But over to Rishit to look at some of the other aspects, with the help of Kubeflow.

Sure. Most of what we've been talking about, at least in my demo, used Kubernetes clusters for the infrastructure provisioning of the quick experiment we saw, with Argo CD in the backend, and I just wanted to show you how the GitOps pieces work together. But GitOps is more than that; you can certainly use other tools, this setup just makes it easier for a demo, and it's also what I use. Another thing I wanted to quickly demonstrate, and we're almost at time so I'll make this quick, is this: I have Kubeflow over here, and all of these are kustomize directories for installing Kubeflow. Another thing that's pretty common is having custom Kubeflow components, so you'd probably make some changes; one of the changes I've made is that I'm now using MetalLB for the networking part. Think of this as adding a new custom component to Kubeflow, and let's do it ourselves with GitOps. This is also set up against a GitHub repository, and I have GitHub Actions doing it, and essentially the idea of showing this is how easy it is to use GitOps principles to deploy Kubeflow; I did that with Argo. Because all of it is configured against my GitHub repository, you can feel free to check it out as well; this is for the latest Kubeflow release. I can just start it, and it will fetch HEAD and start provisioning the resources.

We're already at time, and I see a stop sign, but you should see all of these in the Argo CD UI, which is pretty straightforward. Essentially, there you have it: any new Kubeflow component you add or customize is deployed for you right out of the way, the patch is deployed and Kubeflow is being deployed. This Kubeflow deployment will take around 30 to 45 minutes, and you'll see all of these resources being deployed in the Argo CD UI, so you can make use of the powers of GitOps here as well.

So that's it for the talk. Thank you, I am Rishit, and I am Shivay, and we're open to any questions if we have time.