All right, great. So, hi everyone, my name is Anish Asthana and I'm a senior software engineer with the AI Services organization at Red Hat. And I'm Humair Khan, I'm a senior software engineer working under the Office of the CTO at Red Hat, with the AI Ops team. Today we're here to talk to you about modernizing machine learning workflows with GitOps.

To give a brief overview of our talk, I'll start off with an introduction to the Open Data Hub, which is our ML workflows platform. I'll then talk about some of our main sources of toil, our journey towards automation, and finally how GitOps and Argo CD solve many of our problems. Then Humair will run through a short demo showcasing what I talked about.

So, for any AI/ML project you have, there are always going to be teams of data scientists, data engineers, DevOps, product owners, and business developers that need to collaborate and work together, right? Sharing and collaboration around AI/ML development is difficult; historically it's been pretty manual and sometimes error-prone. Another important challenge is compute resources. AI/ML workflows are compute-heavy, and contrary to popular belief, CPU, memory, and storage are not unlimited in the cloud. A final challenge, one that most people don't think about initially but is still very important, is the production development lifecycle. How do you make that workflow efficient and keep it running correctly?

So, let's discuss some of the main personas an admin of an ML platform would have to deal with. If you look at the diagram, you can see there are really three main personas, or sets of requirements, that we're concerned with. The first is that of the data engineer: all your data pipelines, right? The AI/ML workflow starts with prepping and ETL-ing the data into a data lake or storage system. This data needs to be stored efficiently and needs to be easily accessible for your data scientists. The next phase is the actual model development, your ML pipeline. This includes exploration and analysis of incoming data, feature selection, model creation, training, and validation. Once the model is created, the last phase is serving the model in a production environment. This phase is not a static fire-and-forget deploy, but a constant series of optimizations. Once the model is served, you need to keep monitoring its performance and making adjustments where necessary, sometimes even scrapping it altogether. This cycle of monitoring, optimizing, and serving requires input and collaboration from DevOps, data scientists, data engineers, and the business function.

The Open Data Hub is a meta-project integrating open source tools and technologies to provide an end-to-end AI/ML platform addressing all of these personas. To break that down further, a meta-project integrates multiple open source projects into one project that can be easily deployed and managed by users. Now, the Open Data Hub is great for initial setup, but there's still some missing functionality in regards to things that folks like SREs need. I'll talk through this via two of the most commonly used applications on the Open Data Hub, which are JupyterHub for our data scientists and Argo Workflows for our data engineers. So, JupyterHub is a cloud native service for deployment and management of Jupyter notebooks.
JupyterHub supports different kernels and languages such as Python and R, and it's almost the standard across the industry nowadays for data science use cases. One of the cool things about JupyterHub is that it supports custom images. What this means is that users can spin up notebook pods with different packages or different repositories available right from the get-go. You can see how this would make onboarding of new data scientists and team members, or even collaboration, a lot easier. While very useful, the process of creating custom images and making them available on JupyterHub can be a little heavy on both the operations team as well as the data science team. There's a lot of communication, sometimes offline and asynchronous, and a lot of back and forth, right? Like, "hey, I need this package", and then someone (myself included, I've definitely done it) screws it up and goes back and gives them something wrong. So being able to come up with a workflow that makes it easier to make these things available on JupyterHub goes a long way.

Next up, I'll be talking about Argo Workflows. I know this diagram is very similar to the previous one. I promise it's not because I was lazy, but because they function very similarly. At the end of the day, you have a central server, in this case Argo (I guess you can see my mouse), that users are accessing via Firefox, Chrome, whatever your browser is, or via your CLI to do something, right? For JupyterHub, it's spinning up notebook pods; for Argo, it's creating workflows. A workflow can be thought of as any sort of job that you need to run in general. An example job could be some sort of data transformation job that you have running every 12 hours to convert CSV files to parquet, which is much more efficient in terms of storage (there's a rough sketch of what such a job might look like below). Argo has this really nifty UI that users can use to create jobs, and there aren't that many safeguards around it, so you can create as many or as few jobs as you like. The problem there is twofold, right? One, the jobs are transient. If they're created via the UI, they're not backed up anywhere, so if the cluster explodes, or if someone just accidentally deletes a lot of stuff, all their workflows are gone. Secondly, it makes it challenging for us as platform SREs to keep track of everything that's running on the cluster. How do we make sure someone isn't abusing the system? We don't want to have to constantly check the Argo UI or the OpenShift console, and monitoring and alerting based on usage are reactive approaches. It'd be nice if we could have some safeguards available right from the get-go.

Now that I've talked a little bit about our two main sources of toil, I'll take a step back and describe the journey we've taken over time. Initially, like I imagine most projects, we just had a lot of engineers applying manual commands, you know, `oc apply -f some-template.yaml` or whatever. Sometimes you'd store stuff in Git, sometimes you wouldn't. Sometimes we would just modify objects directly on the cluster; you can see how well that scales. We created a lot of silos, which made collaboration a lot more challenging than it needed to be, but it worked. Luckily, we had someone on our team who was very familiar with Jenkins, so they just set up a Jenkins server for us. How this was structured was that we had jobs managing applications. So if you wanted to update an application, you'd just trigger the corresponding Jenkins job, which could be slow.
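To make the earlier Argo Workflows point concrete, here's a rough sketch of what that recurring CSV-to-parquet job might look like as an Argo CronWorkflow. This isn't from the talk's repo; the image name, script, and storage paths are all hypothetical:

```yaml
# Hypothetical sketch of a recurring data transformation job in Argo Workflows.
# The container image and S3-style paths below are placeholders for illustration.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: csv-to-parquet
spec:
  schedule: "0 */12 * * *"        # run every 12 hours
  workflowSpec:
    entrypoint: convert
    templates:
      - name: convert
        container:
          image: example.io/etl/csv-to-parquet:latest   # placeholder image
          command: [python, convert.py]                 # placeholder script
          args: ["--src", "s3://raw-data/csv/", "--dst", "s3://data-lake/parquet/"]
```

Stored in Git and synced by a GitOps tool, a manifest like this survives cluster mishaps and stays visible to the platform team, which is exactly the gap with UI-created jobs described above.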
No one really liked working with Jenkins; writing the jobs wasn't fun, so to speak. There were nice people who would kind of skip that step: instead of modifying the Jenkins job in Git, they'd make a change directly in the cluster and say, oh yeah, it's working fine, but that wasn't always the case. Full credit to Jenkins though, it was super stable and pretty much always did what we told it to do. It just didn't have enough safeguards around us telling it the wrong things. This kind of came to a head one day when a job sneaked through that wasn't fully reviewed. This job ended up deleting everything in every single namespace of the cluster, as opposed to a single namespace, which, yeah, well, meant we had an empty cluster available for users to not use. Luckily, the Jenkins jobs were good enough that, with some extra manual intervention, we were able to get things back up and running. But it was enough for us to say, okay, screw it, no more Jenkins.

Our next step was using Bash scripts. Everyone had commands they were running, and we had the templates that needed to exist on the cluster, right? So why not just use Bash scripts? Just throw it all into a small, simple script, one more thing for the client to run. This ended up looking like one Bash script per application repository. So if a PR was merged in, you'd just run the Bash script and it would update the deployment. This felt a little better for the team than Jenkins, because you didn't have to go through that clunky UI or write all this, you know, funny-looking code. But having these manual processes didn't help us that much, right? People still sometimes forgot to do things, or what have you. A couple of major pain points with Bash for us were that variable parsing and managing deployments across different clusters and namespaces were becoming very complicated. We ended up with these massive, massive Bash scripts that just kept getting harder and harder to maintain. We also did not have a good solution for managing secrets. What we ended up doing was just sending GPG-encrypted files to other team members and saying, okay, make sure there's a decrypted copy available in this specific location. And obviously, no one wants to store stuff in plain text, right? It was just very fragile.

So with that in mind, we started moving over to Ansible. This was actually really good for us, for the most part. Ansible's support for templating was much better, so repetitive things like Kafka topics or image streams for JupyterHub were easy to create. Secret management is, well, not quite built in in the Ansible world, but very easy to incorporate. And just because of the variable and templating features, it was actually fairly easy to manage different environments, which meant that our manual error rate fell a lot. The problem here just ended up being that people were still being a little lazy. Once a pull request was merged, you'd go to your terminal, log in to the cluster, and run the Ansible playbook, and you wouldn't always pay full attention to the output, because it takes forever to run; these were massive playbooks. And sometimes they'd fail, leaving landmines for other people.

So with that in mind, we started thinking about what we could do to improve things in the future. Really, we had a couple of key goals for our next solution, right? Deploying to different environments should take minimal effort. Secret management needs to be built in. People may be great, but they're not infallible.
How do we improve the process to remove people from these tedious things as much as possible? And it would be nice if we had some sort of self-healing capability on the cluster, so that if someone changes something, it gets changed right back.

With that in mind, GitOps is the answer, right? One of the core ideas behind GitOps is that your platform engineers, your SREs, do not directly interact with the cluster unless absolutely necessary. Instead, the source of truth must exist in Git repositories. So to walk through a quick workflow here: you have your engineers raising PRs and merging code into the repos with whatever fancy CI they have. You then have CI, or engineers manually, building images and pushing them up to some image registry. You then have your SRE operations folks update the config repository to pull in the latest image, or you could again have CI doing that for you. Once this repo is updated, you have some deployment tool, GitOps engine, whatever you want to call it, reading these Git repos and deploying them onto the cluster. One of the key points here is that operations teams should not be accessing the cluster directly during normal operations. Obviously, this isn't always the case. You may have emergencies, things may break, and in that case you can always disable your GitOps tool and just fix things directly in the cluster.

At the time, and this is still the case, there were a number of GitOps tools out there: Flux, Argo CD, Jenkins X, and I'm sure many others. We found that Argo CD fit our use case best back then, and it was the easiest to get started with. Win-win, so that's what we picked. Argo CD is super easy to install manually. The GitHub repository has tons of manifests and files that you can apply to your cluster to deploy it. The cool thing about these manifests is that you can structure your Argo CD such that it is responsible for its own deployment, which is a little scary, but also really super cool. Argo CD is actually even easier to install via something called the GitOps Operator; I'm not sure if that's available on plain Kubernetes, but it's definitely a thing on OpenShift. The reason we are not using the GitOps Operator just yet is that it currently lacks some features around extending Argo CD. So earlier I mentioned secrets support, right? This isn't a feature that's included by default with Argo CD, but there are plugins that you can integrate with Argo CD to do that, which is what we ended up doing with the KSOPS plugin. In the interest of time, we've had to keep it short, so we won't talk too much about it, but if you have questions about it, please put them in the chat or hang out in the breakout room afterwards. With all of that done, I'll hand things off to Humair for the demo.

Awesome, thank you, Anish. Let me just get this shared... so you see the video. What you're looking at right now is a fresh OpenShift cluster with only Argo CD installed. I'm not going to go over how to install Argo CD, in the interest of time, but there are many tutorials and what have you out there that explain how to do this. You can use the GitOps Operator, you can use the Argo CD Operator to install it, or you can use the upstream manifests. We use the upstream manifests because currently the operators don't allow you to customize Argo CD, and we like to add some additional custom tooling for secrets management and what have you.
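As a rough illustration of that kind of setup (assuming a Kustomize-based install; the ref and patch file name are placeholders, not the demo repo's actual layout), pulling in the upstream manifests and layering customizations on top might look something like this:

```yaml
# kustomization.yaml - hypothetical sketch of installing Argo CD from the
# upstream manifests while layering on local customizations, e.g. a
# repo-server patched with extra secrets tooling.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argocd
resources:
  # Remote target pointing at the upstream Argo CD install manifests
  - github.com/argoproj/argo-cd//manifests/cluster-install?ref=stable
patchesStrategicMerge:
  # Placeholder patch file carrying the custom tooling changes
  - argocd-repo-server-patch.yaml
```

Running `kustomize build` on a directory like this produces the full install, which is one way to keep a customized Argo CD deployment itself declarative and reviewable.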
All of these manifests can be found in our demo repo over here, which we will link at the end of the slides. There's also a README that explains some of the configuration decisions and how you can go ahead and replicate this in your own setting. So the Argo CD manifests are found here; we're using Kustomize to source the majority of them from the upstream repo, with added configuration for the custom tooling. They're all deployed in this namespace over here. If we go ahead and take a look, you can see that the pods are running. And if we go to routes, we can go to the Argo CD server route and see that it's deployed over here. We have OpenShift auth enabled in this configuration, using Dex. And if we log in, right now anybody that authenticates in this configuration has access to deploy apps, but that can be changed using Argo CD's RBAC to add some additional multi-tenancy, attached to OpenShift groups and what have you. Cool.

So, typically in an Argo CD tutorial, you'll have people creating new apps via the UI, and those apps deploy your applications. Here, we like to store all our applications in Git, because again, we want to source everything we do via Git; that's kind of the core of the GitOps paradigm. So we try to avoid creating applications via the UI, right? Because then they're not really tracked in Git, or declaratively anywhere else, and they're not subject to peer review. But to deploy all of these, there are two ways I can do this, right? I can go and clone this repo, run `kustomize build`, and `oc apply` everything that's in this folder. But I can also have an Argo CD application deploy all of this, and I could put that in Git as well. So why would I do that? That's called the app-of-apps pattern, and what it is, is an app that deploys other apps, right? So I'll create an application like this, going somewhat against our GitOps philosophy just to show you what it looks like; just remember that everything I'm doing here can be done via a manifest and stored in Git. I'm pointing this application at this repo and this path, which contains our Argo CD applications. And I'll deploy it to the Argo CD namespace. You can see that it's creating two applications: Open Data Hub and kfdefs. So let's go back, sync this, and see what happens when we do. You see that two new applications have popped up. And this application that we created via the UI can be created in much the same way these were created, right? So this is the app that's deploying all of these other apps, and we can store this app as a manifest too. Awesome.

So let's go ahead and sync Open Data Hub. What this does is deploy a subscription for the Open Data Hub operator, which would be analogous to going into the OpenShift console, going to Operators, going to OperatorHub, and installing Open Data Hub that way. But then, again, that information isn't stored anywhere in Git, and that is kind of the ideal setup that we want to go for. So I'll go ahead and sync this, and what we expect is for the operator to show up. All of these manifests are located in this repo: once again, the Open Data Hub application is deploying everything here, and the kfdefs application is deploying everything in this directory. Cool.
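For reference, a minimal sketch of what the app-of-apps Application just described might look like as a manifest; the repo URL and path here are placeholders, not the actual demo repo:

```yaml
# Hypothetical app-of-apps: one Argo CD Application whose source directory
# contains the other Application manifests (Open Data Hub, kfdefs, ...).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-demo.git   # placeholder repo
    targetRevision: HEAD
    path: argocd/apps           # directory holding the child Applications
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  # syncPolicy:
  #   automated: {}             # autosync left off here, as in the demo
```

Commit this one manifest and Argo CD discovers and manages every child Application under that path, which is why creating it via the UI in the demo is just for show.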
So let's go ahead and take a look at the openshift-operators namespace. It's a global operator, so that's where it would be located, and we can see that Open Data Hub is deployed, right? So let's now look at this application over here, which deploys the custom resources that are managed by the operator. We can see that there is a namespace over here that's being deployed, a plain OpenShift/Kubernetes namespace, right? So again, we can track everything, really, that the cluster itself manages, like the namespaces and what have you, and we can store that information in Git as well. That's the ideal we strive towards. The KfDef is the custom resource that is acted on by the Open Data Hub operator, which will then go ahead and deploy the services that Anish spoke about: JupyterHub and Argo Workflows and what have you, right? So if you go ahead and take a look, all of this is being deployed to the opendatahub namespace, which was itself created and deployed by Argo CD. This entire namespace was created by Argo CD, which is what's located over here. Cool.

And we can see that the Argo server has deployed. We're now waiting for JupyterHub. While we do that, we can take a look to see if the Argo route has come live, and it has. Let's just give JupyterHub a second over here. It's initializing, it's running, it'll take one second for the route to come active. It's not available yet; just refresh it a couple of times, and there it is. It's behind an OAuth proxy, so we have to log in via OpenShift to get access to JupyterHub, and there it is.

So now, one workload that Anish spoke about, that we often encounter, is adding new notebook images, which generally consists of just creating an OpenShift ImageStream. We would like to store this information in Git as well and have it be subject to peer review, right? So we'll make this part of the kfdefs application and have it deploy the ImageStream, which will then show up as a notebook image here, which we can run and use as a typical Jupyter notebook. So over here I have a PR ready to go, which adds a pet image detection image, just a regular Jupyter notebook image that has some data science code in it. And you can see that it's basically just an ImageStream, if we go ahead and take a look. So this action is now subject to peer review, right? Somebody can go in and say "looks good to me", comment, and then it gets merged, whether by a human or a bot, whatever. It's going to show up now in Argo CD, right? Because the state in Git has changed, and it shows up here. Typically you would have autosync enabled, but just for demo purposes I have it disabled so you can see it. Cool. So now we expect the image to show up here, and it does, and then we can spin it up and see the notebook in action. Awesome. We can go ahead and take a look at what the notebook looks like and do whatever data science work one might do on here.

Another common tool is Argo Workflows, and one common practice is to add workflow templates, which can be submitted declaratively, and we would like to track these in Git too. So I have another PR here ready to go, and all it's really doing is adding a workflow. (This is from before; I have rebased, so it's fine, we'll ignore that for now.) We can see that there's a workflow here that makes use of a secret.
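A rough sketch of what a workflow like that might look like; the template, secret, and namespace names are placeholders rather than the exact manifests from the PR:

```yaml
# Hypothetical WorkflowTemplate that reads a Kubernetes Secret into an
# environment variable and prints it, mirroring the demo's example.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: print-secret
  namespace: opendatahub          # placeholder namespace
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.14
        command: [sh, -c]
        args: ["echo $SECRET_VALUE"]      # prints the secret's contents
        env:
          - name: SECRET_VALUE
            valueFrom:
              secretKeyRef:
                name: demo-secret         # placeholder Secret name
                key: password
```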
Basically, this workflow will just print the contents of the secret over here, and we're actually adding the secret right here in the repo, which is something that we probably shouldn't do in plain text. This is why we customize our Argo CD to leverage a kustomize plugin called KSOPS, which we mention in the README here, since we can't cover it in this demo because time is limited. But we'll go ahead and merge this PR, and we can refresh Argo CD and see that the new manifests, the new workflows, have shown up. We'll go ahead and sync it, and we can refresh Argo Workflows and see that our workflow template is now available. We can submit this job, and what we expect is for it to print the contents of the secret that we merged, which is "super secret password", found in this workflow's secret over here. And it's done, and that is indeed the contents of the secret. So that's it for the demo. Thank you all for watching. I hope you found this demo somewhat meaningful, and if you have any questions, we'll stick around to answer them. Thank you once again, and take care.

All right, thanks for that, Humair. I'll quickly share my slides again. This is one slide, I think. So, to kind of summarize our talk, right: people may be great, but they're not infallible. Teams are made up of people, and the best way to improve things is to enforce processes that remove them from the equation. Everyone means well, but sometimes mistakes happen. You can find a link to Humair's repository in the slides, and I'll paste it in the chat again. There's also a link to Open Data Hub, which is the ML workflows platform we've been using, and the last thing I've dropped in is for Operate First, which is an open source operations initiative. You can check out the website, it's really cool, and they're driving forward a lot of this work, again all out in the open, so that anyone in the community can benefit from it. So yeah, thank you, we'll take questions now.

Thanks, guys. I guess I'll show the schedule on my screen. Yeah, thanks folks, that was a really cool presentation. I don't see any questions in the Q&A tab, but we're going to hang around for a minute or two, if you guys are okay with it.

That's from me. I'm trying to figure out how to copy-paste links... there, I've copied them over; that's all the links anyway. You'll find a lot of configuration setups, examples, and what have you in Operate First. That's where we further customize our Argo CD and add extensions to it.

I guess it's noon now, so, lunch hour now? Yeah, lunch hour starts now. There's also a coffee hour happening right now, which will be discussing today's keynote, so some people might be interested in that. All right. Okay, thanks a lot folks, I'll see you around. Thanks guys, bye. Bye.