My name is Charles, and Alex is here with me on the stage. I'm an MLOps engineer; we do a lot of things around MLOps and deploying applications on Kubernetes, and we've been using Argo a lot over the last few months and years. We're from Texas, and we do a lot of work around provisioning scalable ML applications on cloud-agnostic infrastructure. We basically help our customers operationalize and deploy their models at scale, and we do a lot of work around data lakes, feature stores, and model management.

So the agenda for today: what it takes to build an enterprise machine learning platform, implementing cloud-agnostic ML workflows on K8s, ML model data prep and feature engineering, event-driven ML model training with Argo Workflows and Events, and implementing a model deployment and serving pipeline with KServe.

I'm going to take the first few sections of the slides: what it takes to build a cloud-agnostic ML infrastructure. The goal of any production-grade machine learning project is to build a statistical model using well-curated data sets, and the main artifacts, as we all know, are the data, the model, and the code. The data will keep changing, your code will keep changing, and you keep producing model output artifacts. So a typical ML workflow looks like this: you acquire the data, you do data prep, you do feature engineering, then you run through multiple iterations of model training, then you serve the model, and you keep monitoring the model for drift.

Typically, a lot of ML projects start on cloud services like Google, AWS, or Azure; we have some customers that do on-prem as well. But ML, like we said, is very difficult even if you go with the managed services on Google, Amazon, or Azure. You go through the process of identifying the data you need, connecting to the data source, and preparing the data, then you go through multiple iterations of training the model before you serve it. There are a lot of moving parts, and that makes it really difficult.

One thing we've seen a lot consulting in this space: if you're trying to build an ML model, you want to make sure you can reproduce the entire pipeline, which is something we've been able to do successfully with a DevOps process, where you have a process you can reproduce. You also want components you can reuse, so if a data scientist or ML engineer is working on a particular component, they can share it with other team members. Then there's manageability, being able to manage all the output artifacts for audit trails, and finally automation, so that the whole process runs by itself.

So to get started, a lot of companies use managed services on AWS, and you can do the same thing with Azure or Google; they all have managed services that will get you up and going. But in reality, whenever you're trying to build an ML pipeline, the first thing you always need to solve is the data: you ingest it, you do data prep, you do feature engineering, and you can then run on any of these major cloud platforms, at least the three major ones. Then you train your model, you do model scoring to see if the model is good enough, and you do your inferencing.
In an enterprise it gets a little more complex, especially with a lot of enterprises going with a multi-cloud strategy, where you have some teams working on Google, some on Azure, some on Amazon. Different teams work on different infrastructure, and sometimes it's much more difficult to transfer knowledge between teams, especially if a team is all baked into a particular cloud provider.

So one thing we've done is identify the two main things you deal with whenever you're trying to train a model. You have the storage part, where you're ingesting all this data and curating it for model training, and you have the compute part, which is where your GPUs or CPUs run in the cloud. On the storage side, we basically think of it as a feature store: all the work you do to get the data ready for the ML engineer or the data scientist, we curate it and keep everything in a feature store. And everything that has to do with compute, basically running your pipelines and your workflows, we abstract as containerized applications that run on Kubernetes. Because of this, we have a cloud-agnostic layer that can run on any of the cloud providers, as long as they're running Kubernetes. So I'm going to hand it over to Alex, who's going to talk through the next slides.

Yeah, so in order to create these feature stores, we use Argo Workflows to do the data transformation and fill the feature store with the data that's necessary to train the models. But we want to do this in a cloud-agnostic way, so that if our customers need to lift and shift from GCP to AWS, they're able to do so. We deploy with Argo and Argo CD to give them that flexibility and to bring customers up to speed. A lot of our customers are first-time Kubernetes users, and we like to introduce them in a way that's going to be good for the long run. We introduce all the Argo tools, Workflows, Events, and CD, because it gives them a tool set that's going to last, and it's very efficient.

Some of the ways we did this: we allow different teams to use Argo Workflows to do whatever they need. Some teams are working on feature extraction, some teams are working on training, but they're all using the same common platform, which is Argo, and this has been very successful for us. In general, we need a data lake, a feature store, all that good stuff, and Argo has basically been our go-to for building these workflows. We've been using DAGs for some of the more complicated pipelines we build, and it's worked very well for us. Some customers are using TensorFlow, some are using other machine learning frameworks, but it doesn't really matter: as long as they know how to build an Argo workflow, it works for them.

So again, we use Argo CD, Workflows, and Events. Some of the ways we use Events are file drops and cron-based time triggers, which are very common, and Argo CD is definitely how we've been selling Kubernetes to our customers. Being cloud agnostic has been very important to us. We make sure to stay open source and not lock our customers into any hosted solution beyond open-source Argo CD, and this has been a great solution for us, with lots of help from the Argo community.
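To make that concrete, a pipeline like the ones described above, where feature extraction feeds into training and scoring, can be expressed as an Argo Workflow DAG along these lines. This is a minimal sketch; the image, script, and step names are hypothetical placeholders rather than the actual manifests from the talk.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: feature-extraction
            template: run-step
            arguments:
              parameters:
                - name: step
                  value: feature-extraction
          - name: train-model
            dependencies: [feature-extraction]
            template: run-step
            arguments:
              parameters:
                - name: step
                  value: train
          - name: score-model
            dependencies: [train-model]
            template: run-step
            arguments:
              parameters:
                - name: step
                  value: score
    # Every stage runs the same containerized entrypoint with a different argument,
    # so different teams can reuse the template for their own steps.
    - name: run-step
      inputs:
        parameters:
          - name: step
      container:
        image: registry.example.com/ml-steps:latest   # hypothetical image
        command: [python, run.py]
        args: ["--step", "{{inputs.parameters.step}}"]
```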
So some of the steps we use to do this: we provision on the cloud offering of their choice, we deploy node pools, we use Argo CD ApplicationSets to define the base of the cluster plus the serving and event components, and then we deploy it. Here's an example of how we use an ApplicationSet; we use it to define what the base of a cluster is for a given client. In this particular use case we have all the namespaces, the role bindings, Argo CD managing itself, and everything needed to have a machine learning platform. And here's what our Argo CD dashboard looks like after it's deployed. We have three ApplicationSets in this example; the base gives all the necessary infrastructure to deploy custom resources such as KServe or Argo Events. This has worked very well for us, and ApplicationSets have been a nice move away from the app-of-apps pattern; they're a lot better.

So again, here's a look at how we use an ApplicationSet to deploy all the namespaces needed. We keep our namespaces in one general application so that any app that needs to reuse them can do so easily, without worrying about whether the namespaces have been created or not. Here's a look at what the baseline Argo CD dashboard looks like, and with that I'll hand it off to Charles.

Yeah, thanks Alex. So basically we use Argo CD to provision everything we need in a cluster to get the ML process started whenever we have a new project.
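Here is a minimal sketch of the ApplicationSet pattern Alex walked through above, assuming a hypothetical Git repo and component layout; the real manifests on the slides will differ.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-base
  namespace: argocd
spec:
  generators:
    # One Application per base component of the ML platform.
    - list:
        elements:
          - component: namespaces
          - component: rbac
          - component: argo-events
          - component: kserve
  template:
    metadata:
      name: "base-{{component}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/example/ml-platform-manifests   # hypothetical repo
        targetRevision: main
        path: "base/{{component}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: argocd
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```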
The next thing we do is data preparation: making sure we're curating the right data set for data scientists to get started whenever they need to train a model, and making it easy for them to do that in a consistent and repeatable way. We want an agile and efficient process so that once I've prepared the data I can share it with multiple data scientists on the team, and if we need more features in that data set we can always recreate it without slowing down the data scientists or ML engineers already working with it. At the end of the day, we want a shareable feature set that authorized users and teams can use to train their models.

A typical ML workflow for a data scientist looks like this: you pull all your data into your Jupyter notebook and try to use it to train your model. The challenge is that if you're trying to do everything at scale, that's not going to work, because you can't easily transition from what you're doing locally to a production environment. So in our case, we create a data lake where you land all your raw data, and we have an Argo Workflows pipeline that transforms the data into a feature store. In the feature store we have a schema with all the feature vectors and all the attributes you need to train your model, and all of this runs on the Kubernetes infrastructure. Because we're trying to be cloud agnostic, any solution we build should be able to run on any of the major cloud providers. From there, the data scientists can go in, start pulling all the data they need, and start training their models.

So just to zoom in on the data lake part, which is where we keep all the raw data: every time we ingest data we have a new delta, so we keep ingesting. It could be streaming, it could be batch, and once we ingest the data we trigger a workflow to convert it to the feature store requirements. For the batch stream, it could be data that we're landing in an S3 bucket or Google Cloud Storage; we have some customers that are on-prem, not in the cloud, so we use something similar there, and for the streaming data we have a setup like this. Then we have a pipeline that converts and transforms this into feature groups, and those feature groups can be used to train the model in offline mode, or you can have the online mode, where we create key-value pairs that map to greater detail about the data set, which you can then use whenever you're serving the model.

So basically in all of these places we're leveraging Argo Workflows, all the way from landing the data. I think Alex mentioned it earlier: we have Argo Events, so whenever a file drops we can trigger a workflow. But we're not going to trigger for every file in the batch process, so we have a trigger file, and whenever that trigger file lands we trigger processing of the whole batch of files and curate the feature store. This way the data scientists always have new data, they can always go back and retrain their model, and if we notice any model drift we have a pipeline that goes back, retrains the model, and checks what's going on. So we're using Argo Workflows in all of these use cases, and these are some of the pipelines we're working on; I think Alex has some more details on that.

Yeah, so we use batch workflows, streaming, feature store creation, and the online and offline feature stores, all built out with Argo Workflows. Here's an example of how we built a small data lake. We basically do data ingestion: once a message is fired off we consume the message, we do schema validation on it to make sure it has the correct shape, and then we write it to the data store. Same example but in a streaming situation: we consume a message from Kafka, we transform the message, and then we write to the feature store. And here's the feature store creation, where we read the data, transform it, create the key-value pairs, and then write to ScyllaDB. The benefit of having a feature store is that it makes it easier to migrate the workflows, we're not tied to any particular cloud provider, and in general it gives a common data lake for any machine learning engineer to come in and build a model on. Charles?

Yeah, sure. So to operationalize all of this we need to react to events, and an event could be a code change whenever you commit to the Git repo, a data change whenever your data sets change and you want to trigger retraining if necessary, or a drifting model, where you basically want to retrain the pipeline. So when we want to operationalize our workflows in production, we're leveraging Argo Events. Argo Events is kind of like the listener that automatically triggers all these workflows based on code changes, feature updates, model drift, or whenever we update our pipelines. We also trigger these events based on Kubernetes resource deployments, basically monitoring the state of a resource, especially with KServe; Alex is going to talk about that in a little bit.
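The ingestion flow described above, consume, validate the schema, then write to the feature store, could be captured as an Argo WorkflowTemplate roughly like this. It's a sketch with hypothetical image and script names, not the actual pipeline from the slides.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: feature-store-ingest
spec:
  entrypoint: ingest
  templates:
    - name: ingest
      steps:
        # Consume the raw message or batch file from the landing zone.
        - - name: consume
            template: run-task
            arguments:
              parameters:
                - name: task
                  value: consume
        # Validate the schema so only well-formed records reach the feature store.
        - - name: validate-schema
            template: run-task
            arguments:
              parameters:
                - name: task
                  value: validate
        # Transform into feature groups / key-value pairs and write them out.
        - - name: write-features
            template: run-task
            arguments:
              parameters:
                - name: task
                  value: write
    - name: run-task
      inputs:
        parameters:
          - name: task
      container:
        image: registry.example.com/feature-pipeline:latest   # hypothetical image
        command: [python, pipeline.py]
        args: ["--task", "{{inputs.parameters.task}}"]
```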
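And for the file-drop side with a trigger file, an Argo Events S3-compatible event source along these lines could fire only when the trigger file lands; the bucket name, prefixes, and credentials secret are illustrative assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: raw-data-bucket
spec:
  minio:
    batch-trigger:
      bucket:
        name: raw-data-landing            # hypothetical landing bucket
      endpoint: s3.amazonaws.com
      events:
        - s3:ObjectCreated:Put
      filter:
        prefix: "daily/"
        suffix: "_TRIGGER"                # only the batch trigger file fires the event
      insecure: false
      accessKey:
        name: s3-credentials              # hypothetical secret holding bucket credentials
        key: accesskey
      secretKey:
        name: s3-credentials
        key: secretkey
```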
So a typical workflow looks like this: you have the data engineer or ML engineer working on the batch processing code. You check it in, and if you're using Git it triggers a Git action, we build a container, and we push it to the container registry. Then we have Argo CD checking to see if there's a new container that needs to be deployed; if the pipeline needs to be retrained, it retrains and deploys the models and creates a version tag for it. Once the model training is done, we create another event. We have a metrics table where we're storing all the metrics, which could be the accuracy or the precision of the model, and we save the model output artifacts, the frozen model, in a bucket. Once the model drops into that bucket we can trigger another workflow as well, but before we deploy the pipeline we want to check the metrics to see if there's any improvement in the model. So the Argo Events workflow process will check whether there's any significant improvement in the model we trained and determine whether we need to redeploy or not. The same thing with the feature store: every time we curate a new data set into the feature store, we have a listener that triggers a workflow event for us to retrain the model.

Some of the Argo Events sensors that we've built and used: file drops in a bucket, so if there's a file drop or any change in the bucket from adding new files or new data sets, we can trigger a workflow based on that; we check our message queues to see if there's any new message for streaming data; and for Kubernetes resources, we can trigger a workflow on a new event in the Kubernetes cluster, for example sending a notification about whether a model training run was successful or not. We also use webhooks to connect back to something like Slack, so that someone geeky like Alex can retrain a model from his Slack message. So some of the workflows that we're currently working on are the feature store creation workflow, the model training status notification workflow, the metrics and model drop workflow, and the model deployment workflow. So, Alex, I think you have an example you want to show?
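A sensor tying one of those bucket events to a retraining workflow might look roughly like this, assuming the event source sketched earlier and a hypothetical model-retrain WorkflowTemplate.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: retrain-on-data-drop
spec:
  dependencies:
    # Waits on the bucket event defined in the event source sketch above.
    - name: new-training-data
      eventSourceName: raw-data-bucket
      eventName: batch-trigger
  triggers:
    - template:
        name: submit-retraining
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: model-retrain-
              spec:
                # Hypothetical WorkflowTemplate that retrains and scores the model.
                workflowTemplateRef:
                  name: model-retrain
```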
Yeah, so here's an example of the bucket drop. It lets us drop files into a cloud storage bucket, and once a certain number of files has arrived and the threshold is met, an Argo event is triggered that spins up a workflow. The workflow takes all the files from the input directory and moves them over to the training location, making sure there's the correct number of files needed to train; if not, it waits until another event is fired to do the counting. This is one way to do it, and sometimes we use a trigger file, but this is one of the more basic ways to get it done. Once the threshold is met, the training data set is created and we archive it, we zip it up, so it can be used for a new model version.

So now, once the training zip is created, we need to do the training and serve the new model, provided the model's accuracy and precision meet the score we want to see. We do this with an Argo workflow that looks at the feature store and checks whether the metrics for the model that just got trained meet our threshold. If they do, we use KServe to deploy a new version. KServe uses these InferenceServices: you basically just point it at a location, an S3 or GCS bucket, with a TensorFlow Serving type of model. We created a DAG where we do the validation, looking into the metrics store; if the metrics meet our threshold, we deploy the new model and we send a notification on Slack saying that a new model is being deployed.

For a little toy example, we used the cars data set to run KServe and deploy a model that can determine whether a given car is of a certain type. Here's what the InferenceService looks like; we use Argo CD to make sure we point everything to the right location in GCS. Here's what the InferenceService fans out to, and Argo CD gives you a nice view into that. Once that's done, we check the status of the InferenceService, see that it's available at our URL, check the metadata on the InferenceService, see that it's ready to go, and then we fire off a sample request with a shell script to confirm that we're able to predict accurately.
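The InferenceService described above boils down to a very small manifest; here is a hedged sketch with a hypothetical name and GCS path rather than the exact one from the demo.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: cars-classifier                   # hypothetical model name
spec:
  predictor:
    tensorflow:
      # Points at a frozen TensorFlow SavedModel exported by the training workflow.
      storageUri: "gs://example-models/cars/v2"   # hypothetical GCS location
```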
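And the metrics-gated deployment could be a two-step Argo Workflow DAG along these lines, assuming a check step that prints "deploy" or "skip" to stdout; images and scripts are hypothetical, and in practice the deploy step would apply the InferenceService above and send the Slack notification.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: model-deploy-gate-
spec:
  entrypoint: gate
  templates:
    - name: gate
      dag:
        tasks:
          - name: check-metrics
            template: check-metrics
          - name: deploy-model
            dependencies: [check-metrics]
            template: deploy-model
            # Only deploy when the new model beats the current threshold.
            when: "{{tasks.check-metrics.outputs.result}} == deploy"
    - name: check-metrics
      container:
        image: registry.example.com/metrics-check:latest   # hypothetical image
        command: [python, check_metrics.py]                # prints "deploy" or "skip"
    - name: deploy-model
      container:
        image: registry.example.com/deployer:latest        # hypothetical image
        command: [python, deploy.py]                       # applies the InferenceService, notifies Slack
```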
Yeah, Charles? So in conclusion, leveraging Argo for our ML projects, we've been able to improve our team's delivery efficiency. It's easy to ramp up new team members because everybody's working in the same environment; once you bootstrap the resources with Argo CD, whether on your local machine running Minikube or on a Kubernetes cluster in the cloud, you can get going. We have blueprints and templates, so if you want to create a new workflow you're not starting from scratch, you're just leveraging the existing templates that we have. And in terms of cost, cloud is not cheap, so with this setup we're able to reduce the amount of money our customers spend on cloud. So that's it, thanks everyone for coming.

I have a question from the virtual audience: what technology are you using for the feature store? Yeah, so for the online feature store we're using ScyllaDB, which is a NoSQL store, but the memory footprint is very low, so we can easily use that. And for the offline feature store we basically use storage buckets, and in some cases we're using traditional databases. That's the only question I have from the virtual audience; is there anybody else with a question?

Sure, what do you use Argo CD for? I didn't quite understand that part; how is that helpful when you're managing these workflows that you're running? Yeah, so, oh, you want to take that? Yeah, so we use Argo CD to essentially stand up the cluster for our customers. A lot of them are first-time Kubernetes users, and we want to introduce them to K8s in a repeatable manner; Argo CD is a good way to introduce them to K8s and keep them organized. It also enables us to deploy new inference services and different types of Argo Events resources, so we can create new sensors and have them deployed in a nice manner where you can visually see what's going on. It's nice to see the fan-out when you create a sensor, basically the pods that get spun up from that sensor. And just to add to that: it's easier for us to manage the deployment manifests for all the components we're running, so every time there's a new release of, say, the TensorFlow operator or KServe, we can manage the release through a GitOps process, and Argo CD helps us do that easily. Thank you everyone, thank you.