Hello everyone. Thank you all for attending my talk today. I'm Hima Virabi, and I'm working as a software engineer here in the AI Ops team in the AI Center of Excellence at Red Hat. In today's talk I mainly want to go over what exactly DevOps is from a data science perspective, what some of the GitOps principles are that can be applied to a machine learning model development lifecycle, a quick demo giving an overview of a machine learning use case and how we can integrate GitOps into it, and finally leave some time for any Q&A that you may have.

With that, I also want to mention that I have a couple of poll questions that I've created. I think Hima and Eric will probably be dropping them in, so please do take some time to answer those poll questions. I would definitely love to hear what kind of audience is interested in this kind of topic. And also drop any other questions that you may have, and I'll try to get them answered by the end of this presentation. So with that, let's go ahead and get started.

So what exactly is DevOps? DevOps draws on the Agile Manifesto, and one of the principles which I believe is highly relevant is stated as: "Our highest priority is to satisfy the customer through early and continuous delivery of valuable software." In the case of machine learning, it would basically be continuous delivery of valuable insights from data.

So how exactly is this different from your traditional DevOps approach? When you look at your traditional software development lifecycle, we know that it comprises a series of steps. For example, you start off with creating the source code for your software or application. You go ahead and build out the software or application. You create your integration tests to make sure the application is meeting all the requirements. Once you're happy with these tests, you go ahead and deploy it into a production environment. And lastly, you have monitoring in place to observe the performance of this software in your production environment. Typically, all of this takes place in a couple of minutes, meaning that this entire end-to-end lifecycle is pretty rapid.

Now, this is not entirely the case when you consider a machine learning development lifecycle. Machine learning model deployments are quite different in nature compared to your software deployments. A recent survey conducted by Algorithmia actually found that 55% of companies have not even been able to deploy their machine learning models. And of those which managed to get their machine learning models deployed, very few of them, around 15%, managed to do it within a week. A majority of them, around 50%, managed to do it within a time frame of one week to three months. And then you had a small portion who took more than three months, and some even took a couple of years. So clearly we can see that there is a discrepancy between the fast release of software applications versus your machine learning deployments.

So why is this so? When you look at the machine learning model deployment process, the core component is basically training your machine learning model. And when you're training this machine learning model, one of the challenges we see is that the code that you have for model training purposes is independent from the data itself.
And also, the machine learning model is the resulting artifact that comes out of this model training process. The code of your machine learning model is basically where you define all the algorithms that you're looking into for training it. This could be deep neural networks, for instance, it could be clustering algorithms, or it could be regression-based models that you're looking into. This is where you're actually defining the architecture of your model, along with configuring the various parameters required to train it.

The data is of course a very important aspect for your machine learning model. You need to validate your data, you need to clean it up, and also ensure that you're able to shuffle it accordingly and extract all the relevant features that are needed for training. And finally, you'll have the machine learning model that ultimately comes out of this. We look at all its outputs, we look at the predictions of the model, we try to understand the performance and the accuracy, and we look at the key metrics for observing the performance of this model. Once we're happy with it, we go ahead and try to get it deployed into production.

So as you can see, the machine learning code is only a small part of this entire solution. There are multiple components involved, acting interdependently throughout this process. And often we also see various teams involved in this entire machine learning model development process. For instance, you have your data engineers, who are looking at consolidating data from various sources and making it easily available and accessible for your data scientists. You have data scientists, who are mainly concerned with extracting this data and building the machine learning model for it. In our team, we use a tool called Jupyter Hub, or Jupyter Notebooks, for actually writing up the code for this machine learning model. And once this machine learning model is available from the data scientists, they typically work with DevOps engineers, or with other developers on the team, to get this machine learning model running on your production or OpenShift platforms, for example.

Now, having worked with data scientists myself, and in fact having been involved in machine learning projects myself, I have often seen that data scientists are now also expected to be responsible for managing the deployments of their models as well. So it's equally important that the data scientists are also very familiar with the DevOps principles and with the GitOps principles, so that it's easier for them to understand this end-to-end process.

So with that comes the importance of GitOps. Like I said, in our team we have multiple data scientists who are collaborating and working together on the code for your machine learning models, and hence we need a central location to consolidate all of these models and data, for which we use GitHub or GitLab. This basically allows you to track and modify any of the changes that you might need. Similar to how your software applications need to be version controlled in Git, we also want the machine learning models to be version controlled and tracked inside Git. And this also allows us to have external contributors, so they can submit their PRs.
Somebody who wants to improve the performance of the model can just create a pull request against the repository with their changes, and we have our continuous integration pipelines in place. This can be, for example, a Jenkins pipeline, which is basically checking whether your code successfully goes through. If so, it containerizes your entire application, creates a container image out of it, and you can then push this into your Quay repository. And then comes the aspect of continuous deployment. You have your application manifests, which again reside within Git, and our team uses Argo CD as the deployment tool to have these applications deployed and running within the OpenShift platform. So essentially you see that GitOps is also highly relevant from a data science perspective.

So now that we know these GitOps principles and how they are applicable, let's look at what the entire workflow looks like for an end-to-end machine learning model deployment process. First of all, we have the data scientists, who are basically looking at the data, trying to figure out some insights from it, performing all of their analysis and creating the machine learning model. And like I said, we use something called Jupyter Hub as the primary tool here; you'll also see this in the demo, which I have in the next couple of minutes. Once they have this created and developed, they push all of their code into Git. And once it's in Git is where we want to have these containerized images pushed into Quay, for which we have what we call a Tekton pipeline. This Tekton pipeline is nothing but an open source end-to-end framework which allows you to create your CI/CD systems, allowing developers to build, test, and deploy these applications across various cloud platforms. We work with our Thoth team, which is basically the AI DevSecOps team within our organization, and they primarily work a lot on setting up these kinds of continuous integration and continuous deployment pipelines. So we were able to get their help to have this configured for our use case.

Essentially, what this Tekton pipeline does is this: for all the changes that we push into the GitHub repository, you can associate your changes with a new tag. Just like how you have new releases and new tags for software development, we follow the same process; we associate all our changes with a tag. And when you create this new tag inside your GitHub repository, it triggers this Tekton pipeline. So the pipeline detects that there's a new tag created for your repository, it creates the container for your entire machine learning application code, and it pushes this container image into Quay. Quay is nothing but your container image registry, which allows you to store, build, distribute and deploy your containers. In the pipeline configuration we specify which Quay repository we want this image to go into, and there are certain other configurations that we need to set up inside the repository. Once that's done, it pushes it on its own. It takes a couple of minutes to run through all the checks and then makes that image available inside Quay. And this image can now be used inside your production environment, which is the third and final step. So you can use this to expose your machine learning model; let's say you want to expose it as a web service.
You want to publish all the predictions of your machine learning model into some kind of web application. You can basically do all of that and have that final deployment customized for your use case; I'll show a small sketch of what such a service could look like in a moment. And like I said, in our team we use Argo CD for managing these deployment manifests. Ultimately, let's say your goal is to have some kind of dashboard, or some kind of insights that you want to share with your stakeholders about what this model is doing; you could have that exposed as well as part of your entire pipeline. So this is the overall workflow. In the demo, I want to focus particularly on step one and step two, basically seeing what it is like for a data scientist to interact with this CI pipeline.

So now that we know this entire workflow, you can get an idea that most of the customer questions, or requirements, from an end-to-end machine learning perspective are being addressed here. Starting with: how do we take our code and get it deployed to production? Our solution to this is a container-based CI/CD approach. How do we maintain and manage all these different configurations? Our approach to that is solved with the end-to-end GitOps framework that we have set up. And of course, how do we do all of our data processing? We're using the open source AI/ML stacks which are available; like I mentioned, Jupyter, or Jupyter Hub and Jupyter Notebooks, is one of the primary open source tools we use for all of our data wrangling and data processing. And finally, where do we do all of our machine learning model training and deployment, and push it into production? This is where we actually use the Open Data Hub platform, which has CI/CD enabled with it.

If you had the chance to attend the keynote today, you would have heard about Open Data Hub and the Operate First initiative. If not, we also had a prior talk, I believe by Michael and Tom, who gave a walkthrough of what exactly the Open Data Hub platform is. Before I dive into the demo, since I will be using one component of Open Data Hub, for those who may not have heard about it: Open Data Hub is an end-to-end AI and machine learning platform available here at Red Hat. It basically integrates a collection of multiple open source projects which can be used from a data scientist perspective, from a data engineer perspective, and from a monitoring perspective. Some of the tools, like I said, are Jupyter Hub and Argo CD; we have monitoring tools like Prometheus and Grafana; we have visualization tools like Apache Superset; and various other open source tools are available. So if you're interested to learn more about this, I would encourage you to go to opendatahub.io. We also have a community over there, and we're looking for any contributions as well. And if you're simply interested to play around with this, I would definitely suggest having a look at that.

So with that, I think I'll go ahead and move on to the demo. Let me go ahead and quickly switch screens. So this is the Jupyter Hub instance we have publicly available as part of our Operate First initiative. You don't need to be connected to a VPN or anything of that sort; it's publicly available.
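Before walking through the Jupyter Hub demo, here is the small sketch I mentioned of the deployment step described earlier: exposing a trained model as a web service so that a web application can request predictions over HTTP. This is a minimal, hypothetical illustration; the file name, endpoint, and port are assumptions and not the project's actual setup.

```python
# Hypothetical sketch: expose a previously trained, serialized model as a web service.
# Assumes a model saved as "model.joblib" that supports predict() on numeric features.
from flask import Flask, jsonify, request
from joblib import load

app = Flask(__name__)
model = load("model.joblib")  # load the trained artifact at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[0.98, 2, 0, 1, 0.1, 14], ...]}
    payload = request.get_json()
    predictions = model.predict(payload["features"])
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

In the workflow described above, a container image holding something like this service would be what the Tekton pipeline builds and pushes to Quay, and what Argo CD then rolls out to OpenShift.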
I'll also try to drop the link in chat in case you're interested later on in seeing where this is. When you go to this URL, you see this button to sign in with OpenShift, so you can go ahead and click over here. This entire Jupyter Hub instance is hosted on the Mass Open Cloud cluster; that's where it's currently hosted. So this is the login for that. You can go ahead and click on the Google account option, and I'll go ahead and log in with my Red Hat account. Once that's done, Jupyter Hub shows you this spawner UI. It has a bunch of notebook images that you can spawn from. For each of our projects, we have actually created a customized notebook image, and we've made them available here. Apart from that, we also have some default notebook images that you can play around with: if you're interested in playing around with TensorFlow, or if you're interested in playing around with Spark, these are some notebook images that are supported. For this demo, I'll go ahead and just spawn a simple default, minimal notebook. Once that's done, you click on start, and it will take a few minutes to get the PVC attached for your particular server; once that's done, you should be successfully into the Jupyter Hub UI.

While that's taking time to spin up, I want to go through the particular machine learning use case that I'm showing in my demo today. So this is the repository; I created a fork of it. This is the upstream repository within our team's organization, and I've just taken a fork of it for the purpose of this demo. What we're essentially doing in this project is trying to identify whether we can apply AI and machine learning to identify flaky tests, or to predict any test failures that we see in our continuous integration testing pipelines. The ultimate goal of this is to alleviate developers from the manual time spent debugging why exactly a certain test failed. We just want to understand the CI pipeline a little bit more in depth and see if we can come up with some kind of intelligent solution, giving developers more insights into what the CI tests look like. The dataset that we're actually looking at is the TestGrid dataset. The TestGrid data is made publicly available via Google Cloud Platform, and it basically gives us a collection of tests that have run on different OpenShift platforms, or on different Red Hat platforms, along with the timestamps of when these tests were run and whether the tests passed, failed, or were flaky in nature. That's essentially what this project is trying to do. It's still in a very early exploratory phase right now, but I thought it would be an interesting project to look at from a data science perspective.

So let's see if we can have it up. Yes, the server came up. Once that's up, you see a bunch of folders; I've basically cloned all my forks and repositories in here, so you can see the OCP CI analysis repo, which I've already done over here. You also have a terminal, which again allows you to easily interact with the repository that you want to base off of. You can see I have already set up my OCP repo and I have my local fork available as well. So let's go ahead and look at some of the notebooks. As I mentioned, we're looking at the TestGrid dataset.
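To give a rough sense of the shape of the data being described, here is a small, made-up example of a TestGrid-style grid: tests as rows, run timestamps as columns, and an encoded status per run. The status codes, test names, and timestamps are assumptions for illustration, not the project's actual schema.

```python
# Toy illustration of a TestGrid-style grid (not the project's real data).
# Rows are individual tests, columns are run timestamps, and each cell holds
# an assumed status code: 1 = passed, 0 = failed, 2 = flaky / not run.
import pandas as pd

grid = pd.DataFrame(
    {
        "2021-05-01 03:00": [1, 1, 0],
        "2021-05-01 09:00": [1, 0, 2],
        "2021-05-01 15:00": [1, 1, 1],
    },
    index=["test_deploy_cluster", "test_upgrade", "test_networking"],
)

# A grid like this lets us ask the analytical questions from the talk,
# for example the per-test pass rate across runs:
pass_rate = (grid == 1).mean(axis=1)
print(pass_rate)
```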
So I have this notebook here for TestGrid clustering, with the .ipynb extension. That's the extension your notebooks get when you create this code within Jupyter Hub, and typically it's all based on Python, so most of the packages and everything are Python dependencies. In this notebook, what we're doing is basically going over the TestGrid dataset and trying to see if we can find some interesting features in it. Like I said, we have a bunch of Python packages that we're using throughout this notebook, so we go ahead and start importing them; we define these in our Pipfile as well, which reminds me to install all of these dependencies so that the notebook runs without any dependency issues.

While that's installing all of these packages, I'll just go over what the notebook is doing. As you can see here, we have the TestGrid dataset: we have the individual tests in this column on the left side, and here we have all the timestamps associated with every test and all the different runs executed for each of these tests. Whether the tests failed, whether they were successful, or whether they had any sort of flaky nature, all of these are indicated in this TestGrid dataset. We are basically looking at these different grids that we obtain from TestGrid, and trying to see if there are any duplicate grids based on the different platforms we're collecting this data from. We're answering simple questions like: are there any subgroups within these tests, which tests do all of these groups share in common, and so on; just looking at this data from an analytical point of view. Then we have all of these key metrics that we want to track, like what the test coverage percentage was, what the total number of tests was, and so on, and how we can compare these tests to see if there was any common behavior among them. So those are more the analytical steps that we're doing within this notebook, and we're also trying to plot it in a more visually appealing manner.

And the main part is basically: what are the features that we want to look at? For example, I want to look at what my test pass rate is. I want to look at how many failures we've been having, and what sort of trend we see in the failures. Is there a failure streak commonly observed for these tests? What is the frequency of these kinds of tests? These are some features we came up with, or key metrics we thought would be relevant for our analysis of these tests. And hence we consolidate them into this neat data frame, which has the six key features that we want to further build our machine learning model on top of.

So now that we have these features ready, and we know what exactly we want to observe, we can now visualize these data points. Like I said, here we have six dimensions, but of course if you're looking at a 2D plot, you need to cut your dimensions down so that it's suitable; so from six dimensions we go down to two dimensions. For that we have a Python function, or basically a technique, called PCA. I don't want to go into too much technical detail, but it's basically a method where you can cut down these dimensions: you say that you want to cut it down to two dimensions, and then you go ahead and plot your data points accordingly.
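For readers following along, here is a minimal sketch of that dimensionality-reduction step, assuming a pandas DataFrame of per-test features like the ones just described. The column names and values are illustrative, not the notebook's exact ones.

```python
# Minimal sketch: reduce a six-feature table to two dimensions with PCA and plot it.
# The feature names and values are illustrative stand-ins for the notebook's features.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame({
    "pass_rate": [0.98, 0.45, 0.88],
    "failure_count": [2, 40, 9],
    "failure_streak": [1, 12, 3],
    "flake_rate": [0.01, 0.20, 0.05],
    "run_count": [140, 130, 150],
    "mean_time_between_failures": [60.0, 3.5, 15.0],
})

scaled = StandardScaler().fit_transform(features)       # put features on a comparable scale
embedding = PCA(n_components=2).fit_transform(scaled)   # 6 dimensions -> 2

plt.scatter(embedding[:, 0], embedding[:, 1])
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```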
So the purpose of this plot was to actually see that, as you can see, these points are scattered: some seem to be clustered in one group, some seem to be outliers, and so on. This was the motivation for us to go ahead and train a clustering model, so that it gives us labels for the different types of clusters based on the features we've defined, and shows how each of these individual data points gets grouped into those clusters. For that we have this clustering model, DBSCAN. We've basically invoked it to train this DBSCAN clustering model and then finally give us the output of the clustering. The output is just an array: for every data point, it tells us which cluster it belongs to, and so on. And finally, we want to save this cluster model. You can store it in a pickle file or you can store it in a joblib file, depending upon your use case; a small sketch of this save-and-reload check follows after this part of the walkthrough. Once we do that, I'll go ahead and run all of these cells above. And once you do that, you can basically load the stored pickle or joblib file, whatever it is, and we can check to see if it's the same. Yes, we see that it results in the same output as what we've seen over here.

Now let me go ahead and just change this name, just for the purpose of the demo. I'm just going to load it again; you can do it here as well. So I go ahead, store it with a different name, and then load it under that different name. Now you should be able to see this new pickled model file available. Now I want to actually push these changes, right? You can see that there's this new model file available. I do not want to modify my notebook, so I'm going to ignore that particular change that happened, and I just want to add this new pickle file. I'm going to commit this as "this is my updated trained model", and now you should see my latest commit over here on top. And now I'm just going to push this to my branch. All right, so this is basically pushing all my changes into my repository. If we go here and do a quick refresh, you should see this new commit being pushed, and we should see our new trained model over here, with this "updated trained model" commit. Awesome.

Now, how do we want to trigger the Tekton pipeline? Like I said, there are tags that can be configured in your repository. If we go to the repository here, you would see a bunch of tags. So let's create a new tag, version 0.0.5.3, and I'm going to push this new tag. When you go to your repository and do a quick refresh, you see this new tag has been created in my repository. Now, the most important thing for setting up the CI pipeline is this YAML file where you're actually setting up the entire CI pipeline, which the Thoth team were kind enough to show us how to do; they also have documentation on how you can set this up. The main thing that you need to provide here is the details of where you want to push the container image of your application. We are using Quay in our case. I am using my own personal account in Quay, and I have a project created within Quay into which I want to push all my images. So this is what the Quay image registry looks like: I have this CI analysis project created for my project, and these are the various tags associated with it.
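As promised above, here is a minimal sketch of the model-persistence step: fitting a DBSCAN model on a small stand-in feature matrix, writing it to disk, reloading it, and checking that the reloaded artifact carries the same cluster labels. The data, file name, and DBSCAN parameters are assumptions for illustration.

```python
# Minimal sketch of the persistence step: fit DBSCAN, save it, reload it,
# and verify the reloaded artifact carries the same cluster labels.
# The data, file name, and DBSCAN parameters are illustrative assumptions.
import numpy as np
from joblib import dump, load
from sklearn.cluster import DBSCAN

# Stand-in for the standardized six-feature matrix built in the notebook.
features = np.array([
    [0.9, 0.1, 0.0, 1.0, 0.2, 0.1],
    [0.8, 0.2, 0.1, 0.9, 0.3, 0.2],
    [0.1, 0.9, 0.8, 0.2, 0.9, 0.7],
])

model = DBSCAN(eps=0.5, min_samples=2).fit(features)
print(model.labels_)  # one cluster id per row; -1 marks outliers/noise

dump(model, "testgrid_dbscan.joblib")   # could equally be pickle.dump(...)
reloaded = load("testgrid_dbscan.joblib")

# The reloaded artifact should give back the same cluster assignments.
assert np.array_equal(model.labels_, reloaded.labels_)
```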
So for every tag that happens, once the pipeline runs successfully, the image gets pushed into this repository over here. We also have a Tekton dashboard. Like I said, all of these runs are triggered by the Tekton pipeline: since you saw that I pushed a new tag, it got triggered, and this is the latest pipeline run for that tag release. If we go into more detail, you get to see all the tasks associated with it, the status of these tasks, and the steps within those tasks. It's not necessarily the case that all the tasks will pass; some of them can also get skipped. Like you can see here, four of them were skipped because they were not relevant for our particular tag release. And finally, this is the last step that's currently running; this is the final part, where it's actually pushing the image into Quay. Once this is completely green and all of these steps have completed, we should eventually see this new image tag created in our repository here. So those are pretty much the steps that we need to do to configure the CI pipeline. And this is the repository the Thoth team uses; they have great documentation on how this entire architecture is set up for them and how you can use it for your own purposes.

That's about it in terms of my demo, while this is taking a few more seconds, or a few more minutes, to run. I just want to go back and conclude by saying that there are many other tools out there which give the same kind of results. It doesn't have to be Tekton; it doesn't have to be any of the tools that we have used in our team. There are also some great tools that we want to explore more in the future, such as Kubeflow Pipelines or Elyra, which is basically a JupyterLab extension. Like you saw in Jupyter Hub, Elyra allows you to create a drag-and-drop pipeline for your entire machine learning model development, and you can allocate resources for every single step in your machine learning deployment process. So this is something new that we definitely want to try out and integrate, making it easier for data scientists to get hands-on with setting up these sorts of pipelines and workflows for their use cases.

So with that, thank you so much. If you have any questions, feel free to have them answered now. I'll have a quick look so that I can stop sharing. Okay, one question, please: how many ML containers are typically generated by the pipeline? One with... I think I lost that question. There we go: one with all the ML, or is it broken out into different model containers? So in this particular use case, it's just one machine learning container that's being created by the pipeline. But you could definitely break it down if, say, you have a project which has multiple models that you're training. I would say it's left to your use case and the customization that you can do. In this example, it's just one simple container that we are looking at. We also have another question, from William Henry: how many ML containers are typically generated by the pipeline? One with all the ML, or is it broken out into different model containers? Yes, I think I just answered that question.
I think the other question was: how do you handle data versioning and hyperparameter tuning in your architecture, using Kubeflow as part of it or in some other way? So yes, hyperparameter tuning is definitely something we have not really invested in a specific tool for right now. In the past we've used the open source tool MLflow. MLflow was pretty useful for us: it gives you a UI kind of interface where you can log your hyperparameters and compare runs while you tune your model. But like I mentioned, we now have these Kubeflow Pipelines and we now have the Elyra project, which I think would definitely be good tools to invest in for this kind of hyperparameter tuning use case. But yeah, as of now, we are not really looking too much into that.
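For context on the MLflow workflow mentioned in that answer, here is a minimal, hypothetical sketch of logging hyperparameters and a resulting metric for several runs, so that they can be compared in the MLflow UI. The parameter values and the silhouette metric are assumptions for illustration, not the team's actual setup.

```python
# Minimal sketch: track hyperparameters and a metric for several tuning runs
# with MLflow, so the runs can be compared side by side in the MLflow UI.
# The parameter values and silhouette metric are illustrative assumptions.
import mlflow
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

features = np.random.rand(50, 6)  # stand-in for the six TestGrid features

for eps in (0.3, 0.5, 0.8):
    with mlflow.start_run():
        mlflow.log_param("eps", eps)
        mlflow.log_param("min_samples", 3)

        labels = DBSCAN(eps=eps, min_samples=3).fit_predict(features)
        if len(set(labels)) > 1:  # silhouette needs at least two distinct labels
            mlflow.log_metric("silhouette", silhouette_score(features, labels))
```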