Hi, everyone. I'm Andrey. I'm a software engineer at Apple, and I'm also the co-chair of the AutoML and Training working groups in Kubeflow. I'm Johnu, a staff engineer at Nutanix, leading efforts in Training and AutoML in Kubeflow. Today we will be presenting on how to build AutoML pipelines with Argo Workflows and Katib.

So before diving into the main topic, we'll just talk about the general automated machine learning area, or AutoML for short. This area is very vast and has lots of research happening, and it can be considered as multiple domains: for example, hyperparameter optimization (or HP tuning), neural architecture search (or NAS), feature engineering, model compression, data preparation, and data validation.

When we consider it from a components point of view, we can think of it as multiple components. One is the main configuration space. When we talk about configurations, extending from what we discussed on the previous slide, the feature engineering domain deals with features, hyperparameter tuning deals with hyperparameters, and neural architecture search deals with model architectures. Once training data is passed in, new metrics are emitted from the models based on the configuration we passed, and using these metrics the optimizer creates new configurations. This loop continues until some success criteria is met. That is the core of the AutoML field, and the end result is that you get a well-optimized ML model.

So coming to Katib, which is the main focus of this talk: it's a Kubernetes-native open source project for general AutoML needs, and it is one of the core Kubeflow components. It can be deployed independently of the main Kubeflow installation, or, by default, it comes tied to the main Kubeflow deployment. Of all the domains we have talked about in AutoML, it supports hyperparameter tuning and neural architecture search, with an early stopping mechanism. Out of the box, it supports almost all popular AutoML algorithms, for example Bayesian optimization, Hyperband, TPE, CMA-ES, et cetera. The platform is very extensible, in the sense that users can deploy a custom AutoML algorithm at runtime without even restarting the deployment.

Katib is agnostic to ML frameworks and programming languages, which means the user is free to write code in any programming language or ML framework, whether it is TensorFlow, PyTorch, scikit-learn, or XGBoost. And since it is Kubernetes-native, it takes advantage of the underlying Kubernetes infrastructure, so it is portable: it can be deployed on-prem, on-cloud, or in a hybrid mode. We can tune basically any custom resource in Kubernetes using Katib. And since it is baked into Kubeflow, it also supports advanced training configurations like distributed training through the Kubeflow operators, like the PyTorch operator or the TensorFlow operator. By default you can use that, or you can deploy it totally independently of the Kubeflow installation.

So this is a very high-level Katib architecture. From a user's point of view, they create an experiment in the form of a spec, a standard Kubernetes manifest in YAML format. The experiment controller, which is responsible for the Experiment custom resource, picks it up and creates a Suggestion custom resource. The suggestion controller, looking at the user spec, figures out which algorithm the user is interested in.
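[To make that concrete, here is a minimal sketch of what such an Experiment manifest looks like. The field names follow the Katib v1beta1 API, but the specific names and values are illustrative rather than taken from the talk.]

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: example-experiment       # illustrative name
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.99                   # success criteria for the optimization loop
    objectiveMetricName: validation-accuracy
  algorithm:
    algorithmName: random        # any supported suggestion algorithm
  maxTrialCount: 12              # the user's budget
  parallelTrialCount: 3
  parameters:                    # the search space
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
  trialTemplate:                 # the worker job definition, covered later in the talk
```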
The suggestion controller talks to the right algorithm service and produces a set of hyperparameters. Now, for each set of hyperparameters, the trial controller creates trials, and you can think of a trial as a worker job. This can be distributed, in the case of distributed training, or it can be single-pod training as well. Each trial produces the metrics we talked about on the earlier slide, the evaluation metrics, and they are stored in an external metrics DB. The experiment controller looks at these metrics and figures out whether the experiment has to be continued or not. So in short, the experiment, or the optimization loop, continues until your success criteria is met or the user's budget is exhausted. Once it is completed, the user gets to know the best hyperparameters for that experiment, and these can be used for finding the best model. On to Andrey for the Argo Workflows integration part.

Thank you, Johnu. So before jumping into the implementation of how we use Argo Workflows in Katib, I just want to briefly talk about what Argo Workflows is. Basically, it is a container-native workflow engine built on top of Kubernetes, and I would say the de facto platform, which is widely used across all industries; in particular, in Kubeflow we are already using Argo in Kubeflow Pipelines and Katib. For us it's very important to have an orchestration tool that is able to run AutoML jobs, because in the next slides I will explain how we integrate this internally in our infrastructure. A lot of different projects are using Argo underneath, and it's very powerful to be able to run this workflow engine inside other CRDs.

So speaking about the Katib integration: on the previous architecture diagram, as you can see, our trials spawn the workers, and since we built the architecture by providing abstractions on top of the worker, our users can deploy almost anything that has the appropriate statuses. In this particular example, we can integrate Argo Workflows directly into our trial jobs, and users are able to run their training code inside one Argo Workflow.

So why is this actually useful for us? Because by running a training job as an Argo Workflow, we can perform very complex ML pipelines inside a training job. Our users can do a pre-processing step and a post-processing step, and they can also train the model in multiple different ways; for example, this is very useful in multi-objective optimization. Also, since Argo provides a great UI and great artifact tracking, users can track the metrics results and analyze them across the various workflows. It's also very efficient if you need to train your model on different datasets or with different volume resources. All of this can be easily integrated inside the Katib infrastructure, and then you just analyze the results and do your hyperparameter optimization.

So speaking about the integration steps, how you can integrate Argo Workflows inside Katib: it's very straightforward. Before I explain, I just want to say that any Kubernetes CRD can be implemented in Katib, as long as this CRD can create pods and allows you to inject sidecars, because our metrics collector is just a sidecar container, and as long as your CRD has a succeeded or failed status. As Johnu mentioned before, this CRD can be almost anything: we have examples for the Kubernetes Job, for all of the Kubeflow distributed training operators, and also for Tekton Pipelines and Argo Workflows, because a Kubernetes Job sometimes does not fit all of your needs when you want to run your training pipeline.

If you want to integrate Argo Workflows inside Katib, you just need to update the container arguments of your Katib controller, and you also need to give the Katib controller the appropriate access so it is able to reconcile the Argo Workflow resources.
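[Based on the Katib documentation for integrating custom resources, those two changes look roughly like the sketch below; the exact flag value and RBAC verbs may differ between Katib versions.]

```yaml
# 1) In the katib-controller Deployment, add the Argo Workflow CRD
#    to the list of trial resources the controller watches:
containers:
  - name: katib-controller
    args:
      - "--trial-resources=Workflow.v1alpha1.argoproj.io"

# 2) In the katib-controller ClusterRole, allow the controller
#    to reconcile Argo Workflow resources:
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflows
    verbs:
      - "*"
```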
After updating these two things, you just need to restart your Katib controller, and after that you can easily integrate Argo Workflows inside your Katib experiment.

Now, speaking about how you define your Argo Workflow in your Katib experiment: we have an API called the trial template, where our users define the trial template and all the APIs required to run this template during the experiment run. In this template, the user has to define four different APIs. The first is the primary pod labels, which identify, by its labels, the pod into which the metrics collector will be injected. Because imagine that in your Argo Workflow you can have two different pods, and each pod can do a different workflow operation, like a preprocessing step and the actual training; the metrics collector needs to be injected only into the training step, and I will show you this in the demo as well. Then, in the primary container name, you need to define the name of the container which the metrics collector will wrap. And finally, the success and failure conditions, in JSON format; when your resource meets one of these conditions, the controller will automatically reconcile this custom resource during the experiment run.

And speaking about how users actually define the Argo Workflow inside the experiment spec: we have the trial spec API, where users can define the whole workflow inside their experiment. If you're familiar with Argo Workflows, this is a very simple workflow that we're going to use, and I'm going to show it to you in the demo. In this workflow, we have two simple steps. The first one is preprocessing, where we simply generate a random value by dividing 60,000 by a random number between 10 and 100, and then we take this value and push it to the next step, which is the actual model training. You can see in this example that we have the output parameter from the first step, and the input parameter from our optimization algorithm, which is the learning rate. So we're going to pass the learning rate from the Katib algorithm, and we're going to pass the number of examples from our previous step. Each trial is going to run the whole workflow, which has these two different steps, and we're going to use this example in our demo, as sketched below.
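[Here is a condensed sketch of that trial template, adapted from the Argo example in the Katib repository; the container image, script, and label are illustrative of the pattern rather than a verbatim copy of the demo manifest.]

```yaml
trialTemplate:
  # Inject the metrics collector only into the model-training pod
  primaryPodLabels:
    katib.kubeflow.org/model-training: "true"
  # Argo runs each step's user code in a container named "main"
  primaryContainerName: main
  successCondition: status.[@this].#(phase=="Succeeded")#
  failureCondition: status.[@this].#(phase=="Failed")#
  trialParameters:
    - name: learningRate
      reference: lr                       # maps to the "lr" search-space parameter
  trialSpec:
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    spec:
      entrypoint: hpo-pipeline
      templates:
        - name: hpo-pipeline
          steps:
            - - name: data-preprocessing  # step 1: pick a random dataset size
                template: gen-num-examples
            - - name: model-training      # step 2: the actual training
                template: train
                arguments:
                  parameters:
                    - name: num-examples
                      value: "{{steps.data-preprocessing.outputs.result}}"
        - name: gen-num-examples
          script:
            image: python:alpine3.6
            command: [python]
            source: |
              import random
              # 60,000 divided by a random number between 10 and 100
              print(60000 // random.randint(10, 100))
        - name: train
          metadata:
            labels:
              katib.kubeflow.org/model-training: "true"
          inputs:
            parameters:
              - name: num-examples
          container:
            image: docker.io/kubeflowkatib/mxnet-mnist   # illustrative MNIST training image
            command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --lr=${trialParameters.learningRate}      # substituted by the Katib controller
              - --num-examples={{inputs.parameters.num-examples}}
```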
So with that, I just want to jump to the demo and show you how you can easily run an Argo Workflow inside a Katib experiment. I'm going to use exactly the same experiment that I just described, and I'll copy it to our frontend. This is the Katib UI, where our users can define new experiments. They can define all of the APIs using these UI forms, such as the metadata and the trial thresholds: for example, how many trials they want to run in parallel, what is the maximum number of trials, and what is the maximum number of failed trials.

Also, very importantly, users can define the objective, where they specify which metric they want to optimize and what kind of additional metrics they want to collect. In this particular example, we're going to optimize the validation accuracy, and in the meantime we're going to collect the training accuracy from our training process. Then there are the search algorithms; as Johnu mentioned, we support a lot of AutoML algorithms here, and users can easily choose them based on their needs. They can also specify early stopping techniques. Then there are the hyperparameters that users can specify: they can change the domain, the search space definition, and the different types of hyperparameters. There is also the metrics collector, where users can choose among the different metrics collector APIs for how they want to parse the metrics from the training process. And then there is the trial template; again, this is very important, and I just explained how users can use the trial template API to define the Argo Workflow inside the experiment run.

So we're going to paste the whole YAML inside this edit form, and we're going to change only the maximum number of trials, to eight, and we're going to use two trials in parallel. Again, as I mentioned before, each trial just runs an Argo Workflow, so that is two Argo Workflows in parallel for each evaluation round.
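[Putting those UI settings together, the top of the demo experiment's spec corresponds roughly to the following; the metric names and the exact search-space bounds are a best reading of the demo, not a verbatim copy.]

```yaml
spec:
  maxTrialCount: 8              # changed in the edit form
  parallelTrialCount: 2         # two Argo Workflows running at a time
  objective:
    type: maximize
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy          # collected but not optimized
  algorithm:
    algorithmName: random
  parameters:
    - name: lr                  # the only tuned hyperparameter in the demo
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
```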
After creating the experiment, the user can click create, and you can see the experiment status. If we jump to another namespace, we can analyze other users' experiments; there is the optimal trial, from which users can analyze and understand the best hyperparameters. They can click into an experiment with some results, but we'll come back to this later.

If we jump to the Argo namespace, our experiment is currently being created, and we're going to jump to a terminal to analyze the results from our experiment run. In my namespace, I'm currently running the Argo Workflow controller, which basically reconciles all of the workflows, and also the Katib controller and the UIs for Argo and Katib. First of all, we want to check the trials: once the experiment is created, two trials are also created, and they are currently in the running state. We can also check the workflows in the Argo namespace; as we can see here, two trials have been created and two workflows have also been created. If we describe one of the trials, we see some interesting results: here we can see the learning rate that was generated by our Katib algorithm; we're using just the random algorithm here, and we're only tuning the learning rate, as I mentioned before. And we can check the pods: inside our Argo namespace, we see that each workflow creates two different pods, where the first pod is the data preprocessing pod and the second pod is the actual model training. If we analyze the results, for example from this trial, first of all we can check the logs from our preprocessing step: it generated a value of 1508, and it passes this value to the second step, the actual model training, where this value is used during the training run. And we also pass the learning rate into our training job.

So this is the simple MNIST example, just a simple CNN model. During the training run, and after the training is finished, our metrics collector parses all of these results and sends them to the DB, so the algorithm can make new predictions and generate new hyperparameters.

And what is very cool is that, since we're using Argo underneath our trial job, we can use all of the richness of Argo, all of the features of Argo Workflows. For example, we can jump to the Argo UI and analyze all of these workflows: how they're running, how values are passed between the workflow steps, and what their current statuses are. We can also jump into a particular workflow to see all of the steps. As I mentioned, this is a very simple two-step workflow, but we can analyze the input parameters and output parameters, and we can check the logs. And of course, you can define a very sophisticated workflow here, with different volumes that your workflow will be using, and, as I mentioned before, you can have multiple training runs during your hyperparameter tuning run. You can also check the manifest from this UI, where you can see that the learning rate was generated and substituted by the Katib controller. Now we need to wait until all of the workflows are finished, so that all of the trials are also finished, and then we can see the results.

If we jump to the Katib UI, here our users can analyze the experiment run. They can see the experiment name, its current status, the best trial, the current best hyperparameters, some information about running trials, and some information about experiment conditions. If they click on the trials page, they can see all of the trials and their statuses, and we also highlight the best trial for the user. As you can see, it automatically keeps adding new trials, and users can also click into a trial to see the metrics that our metrics collector parsed from the training step; and since we are collecting two metrics, we see the validation accuracy and the training accuracy in this graph. Depending on the results, the best hyperparameters are produced. If you click on the details, you will see more detailed information about the experiment, and of course, if you're familiar with Kubernetes, you can always see the whole YAML for the experiment run for more detailed information.

So again, this is a very simple example, but it's very powerful, because using Argo in Katib has a lot of advantages, as I mentioned before, and users can define very sophisticated workflows inside the optimization step. It helps a lot for us to achieve things like multi-objective optimization, which a simple distributed run doesn't support. So it's very efficient, and it has a lot of potential for hyperparameter optimization. Once the experiment is finished, we can see the best trial, which generated this learning rate, and users can just take this learning rate, put it into the final model, and push this model to production. So this is a very simple example that I really wanted to show you, of how you can easily integrate Argo Workflows in Katib. And again, as I said, it has a lot of potential, and we are at a very early stage of adopting workflows inside AutoML experiments.
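[One detail worth spelling out from the demo: with Katib's StdOut metrics collector, which is the default kind, the sidecar parses metrics from the training logs in a `name=value` format, so the training code only needs to print lines whose metric names match the objective spec. A minimal sketch, with illustrative values:]

```yaml
# In the experiment spec (StdOut is the default collector kind):
metricsCollectorSpec:
  collector:
    kind: StdOut

# The training step then just prints lines such as:
#   Validation-accuracy=0.9748
#   Train-accuracy=0.9812
# and the sidecar parses them and stores them in the Katib DB.
```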
And just before we wrap up, I really want to mention some very important things about the community. All of this would not be possible without our great community, and I really encourage you to follow this guide to run these Argo Workflows examples by yourself; you can even use a local kind cluster, so you don't need any on-premise or public cloud to be able to run this example. If you want to join our regular meetings, please feel free to check these links for Argo Workflows and Katib; we have regular meetings. You can also check our GitHub repositories, and if you have any questions, please feel free to ask us in the Slack channels: we have the Kubeflow Slack channel and the Argo Workflows Slack channel, and we are happy to answer any of your questions. If you're using Katib, please update the adopters list; we really want to interact with our users to understand the problems in our current infrastructure. And if you want to contribute, please check the developer guides and our issues with this label. You can also submit your own proposal, and if you want to learn more about Katib, please feel free to check our presentations. And with that, thank you so much for listening to us. We are more than happy to answer all of your questions. Thank you.