Hi everyone, I'm Yuan from Akuity. I'm a maintainer of Argo Workflows and co-chair of the Kubeflow Training working group. Hi everyone, I'm Andrei, I'm a software engineer at Apple and also a co-chair of the AutoML and Training working groups in Kubeflow. Today we're going to talk about how we leverage Argo Workflows in Katib and how we can manage thousands of automated machine learning experiments with this integration.

So let me first jump to Kubeflow. Katib is part of the Kubeflow umbrella. If you don't know it, Kubeflow is an open source project for MLOps on top of Kubernetes. Kubeflow contains different components for different kinds of ML activities: it has a solution for notebooks and JupyterLab, it has distributed training operators with wide support for open source ML frameworks such as TensorFlow, PyTorch, MXNet, XGBoost and MPI, it provides functionality for ML metadata, and it has its own component for ML pipelines, which I think many of you know about. Kubeflow also has a component for AutoML, specifically Katib, for hyperparameter tuning and neural architecture search, and a serving component for model serving in the cloud with a lot of very unique functionality. Kubeflow can be easily deployed on any public cloud or on-prem, and you can interact with these components through UIs, SDKs or kubectl.

Let me jump to Katib, because this will be our main focus in this presentation. Katib, as I mentioned before, is one of the Kubeflow components, and it's the project for AutoML, specifically hyperparameter tuning, early stopping and neural architecture search. In the meantime we are working on adding support for feature engineering and model compression, so you can do other AutoML tasks on top of the cloud as well. You can also use Katib to run your own custom AutoML algorithms; we provide a platform to do that in a cloud native way. We also have a new feature to orchestrate any Kubernetes custom resource, and I will show in the next couple of slides how we do that. And since we are on top of Kubernetes, we are agnostic to ML frameworks and have native integrations with Kubeflow components such as Training, Notebooks and Pipelines.

Jumping to the Katib architecture, it's quite straightforward. When a user submits an experiment, the experiment controller reconciles it, and the suggestion controller is responsible for spawning the algorithm service. The algorithm service produces hyperparameters based on the experiment specification; these hyperparameters are passed to the trial controller, and the trial controller spawns trials in parallel. We have a new feature to support any type of worker as a trial, whether it's a simple Kubernetes Job, a TFJob or even an Argo Workflow. In this worker you basically run the training, and a metrics collector parses the necessary metrics from the workers and sends them to the DB. These metrics are passed back to the experiment controller, and the evaluation results go to the algorithm service to produce new hyperparameters. This process repeats again and again, and when the hyperparameter tuning job is finished, the user can get the best hyperparameters and use them in production training.
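For orientation, a minimal Katib Experiment that drives this loop might look roughly like the following sketch; the image, metric names, parameter ranges and trial counts are illustrative placeholders, not values from the slides:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy     # placeholder metric name
  algorithm:
    algorithmName: random
  parallelTrialCount: 2
  maxTrialCount: 5
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training job
        reference: lr
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/example/train:latest   # placeholder training image
                command:
                  - "python3"
                  - "train.py"
                  - "--lr=${trialParameters.learningRate}"
            restartPolicy: Never
```

Katib repeats the suggest, train and collect loop described above until the maximum trial count is reached or the objective goal is met, and then reports the best hyperparameters.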
So, jumping to why we actually need Argo and what the current problems are. The main problems are in the evaluation step of hyperparameter tuning. Usually the training process is not so straightforward that you can just run a single job for your training: maybe you need to run some data pre-processing, maybe you want to run some post-processing, and all of these steps have to happen during your evaluation. A simple Kubernetes Job doesn't give us the functionality to cover all of that, and that is why we are moving towards more sophisticated workflows such as Argo to resolve these issues. We also have the problem of multi-objective optimization, when we want to tune an experiment against different objectives, and we can do some parallel training, which Yuan will talk about in the next couple of slides. So next let me pass it to Yuan, and he will speak about Argo Workflows and how we can solve these problems.

With all those challenges in real-world machine learning pipelines, I'm going to talk about how Argo Workflows makes this easy, and then introduce a few common use cases for machine learning pipelines. Argo Workflows is a container-native workflow engine for Kubernetes. The main use cases for Argo Workflows include machine learning pipelines, data processing, ETL, infrastructure automation, and continuous delivery and integration. On the right-hand side is a screenshot of what the Argo Workflows UI looks like. The diagram at the bottom shows some example ecosystem projects that use Argo Workflows; more can be found in the awesome-argo GitHub repository linked below.

Let's first talk about the memoization (cache) functionality in Argo Workflows, which we will leverage when dealing with the pre-processing challenges that Andrei mentioned previously. The Argo Workflows controller creates a cache that can save the output of a step so it can be used in later steps. For example, here step B requires the output from the previous step A. When the workflow is executed for the first time, Argo Workflows will create a cache entry for step A. The cache contains the result of step A and is saved as a key-value pair in a Kubernetes ConfigMap. Once step A finishes, step B will be executed. The next time the same workflow executes, it will check whether a cache entry for step A already exists and whether it's still fresh. For example, if the cache entry was created 10 seconds ago and that is considered fresh, the workflow will retrieve the saved output from the cache and use it directly in step B, without wasting resources and time re-executing step A.

Here's how to use the memoization functionality in Argo Workflows. In the template spec on the left-hand side we can specify the memoize spec. Here we set the cache key to "cache-key". The max age represents the maximum duration before we consider a cache entry stale when future workflows or steps try to use it. We also specify the name of the ConfigMap that we want to save the cache to. On the right-hand side is what the ConfigMap looks like: the key is "cache-key" and the data contains the output parameter produced by this particular step, which is the parameter "hello" with the value "world".
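As a rough sketch, assuming an illustrative two-step workflow (the step contents, cache key and ConfigMap name are placeholders, not the exact YAML from the slides), the memoize configuration might look like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: memoize-example-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: step-a
            template: produce
        - - name: step-b
            template: consume
            arguments:
              parameters:
                - name: message
                  value: "{{steps.step-a.outputs.parameters.hello}}"
    - name: produce
      # Output of this template is cached; a repeat run within maxAge skips it.
      memoize:
        key: "cache-key"            # key of the cache entry
        maxAge: "1h"                # entries older than this are considered stale
        cache:
          configMap:
            name: my-cache-config   # ConfigMap that stores the cached outputs
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo -n world > /tmp/hello"]
      outputs:
        parameters:
          - name: hello
            valueFrom:
              path: /tmp/hello
    - name: consume
      inputs:
        parameters:
          - name: message
      container:
        image: alpine:3.19
        command: [echo, "{{inputs.parameters.message}}"]
```

On a repeat run within the max age, the produce step is skipped and its cached output parameter is injected directly into the consuming step.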
Let's take a real-world machine learning workflow as an example to see how memoization can be leveraged. Assume we triggered a Katib experiment that executes a machine learning pipeline using Argo Workflows. A simple machine learning workflow may look like this. First, there's a data ingestion step that's responsible for ingesting data from the data source. You may have a cache in place, using Argo or Kubernetes, to check whether the data has been updated recently, so you can skip this data ingestion step if nothing has changed in the data set. Otherwise you would have to execute the data ingestion from scratch, which costs a lot of computational resources. After we've ingested the data we start the model training step. Model training can have multiple workers and multiple data shards, depending on the selected distributed training strategy; here, for example, we are running the distributed model training step using all-reduce. The model training may consist of code written in frameworks such as TensorFlow or PyTorch, and you can use Kubeflow to submit a distributed TensorFlow training job so that algorithm developers and data engineers don't have to worry about the infrastructure side of things. Kubeflow will communicate with Kubernetes to request the necessary computational resources for each of the workers and parameter servers, so that people can just focus on the algorithms and the models. We can also use Katib for more complicated model training that leverages hyperparameter tuning, neural architecture search, early stopping and so on.

Let's take a look at how this can be achieved with Argo Workflows. On the left-hand side we define the entry point of the workflow, which consists of sequential steps for data ingestion and distributed TensorFlow training. The data ingestion step takes a parameter that represents the location we will save the data to once ingestion is finished. In the data ingestion step we save the data set to the specified S3 path and cache the location with a max age of one hour; then, in the distributed model training step, we train a TensorFlow model using Kubeflow's TFJob with the data set we just saved. When this workflow gets executed again within an hour, the data ingestion step will be skipped and the training step will reuse the previously generated data set.

Next, let's take a look at a more complex pipeline that involves multi-objective optimization in order to achieve better overall performance for a machine learning problem. Here we'd like to build three different models with three different model architectures, such as logistic regression, neural networks and decision trees, and with different objectives; here we're using accuracy, AUC and loss. There are two different data ingestion steps that ingest two different data sets: one model uses a different data set than the other models. After we've finished training these three models, we trigger a Katib experiment that collects the metrics and suggests an optimized set of hyperparameters. Once the suggestion is made, we trigger a new workflow that uses the suggested hyperparameters.

Here's how to implement this pipeline in Argo Workflows. First we construct the DAG that consists of the major components shown in the previous diagram. The data ingestion step consists of steps that execute data ingestion from two different data sources; we use the withSequence syntax to loop through the data sources. The model training steps consist of step templates for the different model types, data sources and objectives. We assume that the single model training template used in these model training steps supports these different parameters.
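As a rough sketch of such a DAG, assuming illustrative task names, images, loop items and parameters (and omitting the final step that triggers the Katib experiment for brevity), the structure might look like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: multi-objective-pipeline-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: data-ingestion
            template: ingest
            withSequence:
              start: "1"
              end: "2"                       # two data sources
            arguments:
              parameters:
                - name: source-id
                  value: "{{item}}"
          - name: model-training
            dependencies: [data-ingestion]
            template: train-model
            withItems:                        # three model types with different objectives
              - { model: logistic-regression, objective: accuracy }
              - { model: neural-network, objective: auc }
              - { model: decision-tree, objective: loss }
            arguments:
              parameters:
                - name: model
                  value: "{{item.model}}"
                - name: objective
                  value: "{{item.objective}}"
    - name: ingest
      inputs:
        parameters:
          - name: source-id
      container:
        image: docker.io/example/ingest:latest        # placeholder image
        command: [python, ingest.py, "--source={{inputs.parameters.source-id}}"]
    - name: train-model
      inputs:
        parameters:
          - name: model
          - name: objective
      container:
        image: docker.io/example/train:latest         # placeholder image
        command:
          - python
          - train.py
          - "--model={{inputs.parameters.model}}"
          - "--objective={{inputs.parameters.objective}}"
```

The single train-model template takes the model type and objective as parameters, which is exactly the assumption mentioned above.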
Next, Andrei will give a live demo. Thanks, Yuan. I'm going to give you a demo of how we can leverage the cache in Katib. Let me quickly jump to the UI first of all. This is the Kubeflow UI, which I hope you're familiar with. I'm going to jump directly to the Katib UI, which is part of the Kubeflow umbrella. As you can see here, this is the Katib UI where we can submit new experiments. We can specify the necessary information for your hyperparameter experiment, such as the metadata and the trial thresholds: for example, how many trials you want to run in parallel, what the maximum number of trials is, and what the maximum number of failed trials is. You can also specify the objective: the main metric you want to tune, the additional metrics you want to collect, the goal for your objective metric and other necessary information. For the search algorithm, Katib out of the box supports a variety of algorithms; we keep adding new ones, and, as I mentioned before, we even provide an option to deploy custom algorithms. We can also specify early stopping techniques to avoid overfitting in your hyperparameter experiment. Then we can set the hyperparameters: we can add a new parameter, with support for various distributions, categorical, double, integer and discrete; we can specify the range of the hyperparameters, set the step, and also edit the hyperparameters. Then we can jump to the metrics collector specification and to the trial template, which is what actually executes the training during your hyperparameter evaluation.

In this particular example, I want to take the example with an Argo Workflow as the trial template. Let me first copy the whole YAML into the UI and submit this experiment, and later we can analyze the results. Before I jump into this UI, I just want to quickly introduce what kind of experiment we are running. We are running a simple Katib experiment with an Argo Workflow as the trial. In terms of the objective, we are going to tune validation accuracy, and the additional metric we are going to collect is the training accuracy. For the algorithm, we just select a simple random algorithm, and we are going to run two parallel trials with a maximum of five trials. Basically, in this example, each trial is an Argo Workflow: we spawn separate Argo Workflows in parallel, and the evaluation is executed inside the Argo Workflow. We are going to tune the learning rate within these ranges.

Let me jump directly to the trial template. What do you need to specify in a trial template to be able to run an Argo Workflow? You just need to set the primary pod labels, the primary container name, the success condition for when your workflow has finished, and the failure condition for when your workflow has failed. In terms of the workflow, if you are familiar with Argo, it will be very easy to understand what it looks like. Basically, we have two steps: the first step is data pre-processing and the second step is model training. In the first step, data pre-processing, we generate the number of examples. This is a super simple toy example, but it shows the power of using Argo and Katib, because you can have very complicated pre-processing here. As Yuan mentioned before, we can store the result of the pre-processing in the cache: basically, we generate a random value and store this value in the cache.
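To make this concrete, here is a hedged sketch of what the trialTemplate section of such a Katib Experiment might look like with an Argo Workflow as the trial; the labels, success and failure conditions, images, cache names and parameter names are illustrative placeholders rather than the exact YAML from the demo:

```yaml
trialTemplate:
  primaryPodLabels:
    katib.kubeflow.org/model-training: "true"   # label on the pod Katib collects metrics from
  primaryContainerName: main
  successCondition: status.[@this].#(phase=="Succeeded")#
  failureCondition: status.[@this].#(phase=="Failed")#
  trialParameters:
    - name: learningRate
      description: Learning rate suggested by Katib
      reference: lr
  trialSpec:
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    spec:
      serviceAccountName: argo        # assumes a service account with Workflow permissions
      entrypoint: main
      templates:
        - name: main
          steps:
            - - name: data-preprocessing
                template: preprocess
            - - name: model-training
                template: train
                arguments:
                  parameters:
                    - name: num-examples
                      value: "{{steps.data-preprocessing.outputs.parameters.num-examples}}"
        - name: preprocess
          # Memoized: the generated value is cached and reused by later trials.
          memoize:
            key: "num-examples"
            maxAge: "1h"
            cache:
              configMap:
                name: preprocess-cache
          container:
            image: bash:5.2
            command: [bash, -c]
            args: ["echo -n $((RANDOM % 1000)) > /tmp/num-examples"]
          outputs:
            parameters:
              - name: num-examples
                valueFrom:
                  path: /tmp/num-examples
        - name: train
          metadata:
            labels:
              katib.kubeflow.org/model-training: "true"
          inputs:
            parameters:
              - name: num-examples
          container:
            name: main
            image: docker.io/example/train:latest     # placeholder training image
            command:
              - python
              - train.py
              - "--num-examples={{inputs.parameters.num-examples}}"
              - "--lr=${trialParameters.learningRate}"
```

The success and failure conditions tell Katib how to read the Workflow status to decide whether a trial finished or failed, while the primary pod labels and primary container name tell the metrics collector which pod and container to scrape the objective metrics from.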
The cached random value from the pre-processing step is then reused in the next workflows so that we don't run pre-processing again and again. We then pass this number of examples to our second step, which is model training. As you can see here, we get the number of examples from the previous step, and we get the learning rate from the Katib-suggested parameters. Then we run the training with these two parameters: the first one is the number of examples, the second one is the learning rate. Again, as I mentioned before, memoization is very important because you don't want to run pre-processing every time; maybe you want to run it only once. In the following steps of the evaluation, you only want to run the training and collect the metrics, which is what matters for the hyperparameter tuning job.

Let me jump back to the Katib UI. As you can see here, the experiment is currently running; this is the experiment I ran before. In this UI we can analyze it: we can get the optimal trial and see some of the metrics that we collected. We can open this view just to see the name of the experiment, the current status, the best trial so far and the best trial's performance. We can also see the experiment conditions. We can jump to the trials to see their metrics, the metrics that we collect and the best hyperparameters, and we can also see some distributions in this UI.

Again, as I mentioned, each trial here is a separate Argo Workflow, so we can even jump to the Argo Workflows UI. For example, let me take one of the Katib trials, this one. We can see that it is represented as a whole Argo Workflow, and inside the Argo Workflow we have the different steps. Let me jump to one of the workflows here. Basically, we have the data pre-processing step, which stores a value in the cache, and then we have the model training, which also reuses it from the cache. As you can see here, we basically have the number 769, and if we check the other workflows, we are going to see the same number for each step; if we click here, we can see 769 again. If I click on the model training, I can see the exact training which is happening; we are just collecting the results from the training, basically. This is very powerful.

Again, jumping back to the Katib UI, we can click on a trial and see which metrics have been collected and how the metrics were produced. You can also analyze the data based on these trials, so you can see what performance was produced, and you can collect more metrics if you need; you can just use this UI for the metrics tracking process. Overall, you can see the details of the experiment. This is a very simple example, but in the end you can create more sophisticated ones. Again, as I mentioned, you can even create multi-objective experiments with a DAG, where you have more than one model being trained in parallel and one evaluation step, and you can run whatever else Argo Workflows offers.

Let me jump back to our presentation. At the end, I really want to quickly mention a couple of slides regarding the community, because all of these amazing features wouldn't be available without the great work from the open source community. If you want to check this experiment and try to run it by yourself, you can follow this guide. I also strongly encourage you to join the Argo Workflows and Katib community meetings; we meet almost every week, and we're pretty open to new contributors.
We're pretty open to new proposals and feature requests that we can integrate into our projects. Also, please check our GitHub repositories and our Slack channels, and if you're using Katib or Argo, please update the adopters list; we really want to interact with users to understand what kind of pain points you have and what our next roadmap should be. If you want to learn more about Katib, please check this list of presentations to learn more about AutoML and how to use it with Argo. Just at the end, please reach out if you have any questions; we're happy to answer all of them. With that, thank you so much for listening to us.

Hello. Can you hear me? Okay. Thank you, everyone, for watching our pre-recorded video today. Andrei will be on the Argo Slack for any offline discussions and questions, and I'll be here for Q&A. Let me know if you have any questions and I can answer them now. Raise your hand if you have any questions. No questions? So, how many of you are working on machine-learning-related applications? Okay, I see a couple of hands. Are you using Argo Workflows? Okay. What do you use for distributed training, for example? Sorry, I can't hear you. SageMaker? Okay. Do you find it easy enough to run all sorts of experiments? Okay. Yeah. I hope you will try out some of the sub-projects available in Kubeflow; we have distributed training operators, and the project we just presented is for managing AutoML experiments, with a lot of built-in algorithms, for example for hyperparameter tuning, neural architecture search and so on.

Yes. I have an intern pursuing a master's degree, and we have him doing a machine learning research project, but I don't think he's picked his tech stack yet, actually. So, more generally, I was looking for something that I could take back to him and show him, because this probably gives him most everything he needs, I think; he probably doesn't know it any better than I do. So, some of the same stuff you showed here, could I get your Slack information or something and introduce him so he can maybe get some help with his project? My recommendation is that if you are running things in the cloud or on Kubernetes already, Argo Workflows would be the de facto choice for workflow orchestration, since it's really scalable and easy to use. And if you are running distributed training, especially on Kubernetes, then the Kubeflow training operator is definitely something you want to look into, because you can describe a distributed training job as a CRD; it's very easy to use once you install the operators.

Yes? Is it possible to run Katib without the Kubeflow infrastructure? For example, we have, say, Argo and Metaflow integrated for distributed training. Can I leverage Katib independently from Kubeflow? But you still want to use Katib, right? Yeah, so Katib is independent of the rest of the Kubeflow ecosystem. You can run any custom CRD as an experiment, as a trial, and then you can spin up a lot of experiments using Katib. And with Katib you can also automatically start different experiments with different parameters if needed. So if you can describe your SageMaker job in terms of a CRD or a script, then I think you can use it directly.
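To illustrate the point about describing a distributed training job as a CRD, here is a hedged, minimal sketch of a Kubeflow training operator PyTorchJob; the name, image, command and replica counts are placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-example
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                                   # operator expects this container name
              image: docker.io/example/pytorch-train:latest   # placeholder image
              command: ["python", "/opt/train.py", "--epochs=5"]
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/example/pytorch-train:latest   # placeholder image
              command: ["python", "/opt/train.py", "--epochs=5"]
```

The operator handles pod creation, service discovery and restarts for each replica, so the job definition stays declarative.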
So for somebody developing this, how skilled do they need to be in Argo Workflows, writing all the workflows and the DAGs and all that stuff? How much skill do they need when writing the Argo Workflows and DAGs? How much skill do they need? It's just like Kubernetes CRDs: once you've installed the controllers, you can just write everything in YAML. There are also Python, Java and Go SDKs that you can use. I think it's pretty easy to use, so if you're a Python developer, if you are one of the data scientists in your company, you can just use one of our SDKs. Do they need to know Argo Workflows? Yes, you would need to know the basic concepts, but there are also integrations. I know you mentioned Metaflow, right? They also added an integration with Argo Workflows so that you don't have to understand all the concepts behind workflows: you can just write regular Metaflow steps, and underneath it will invoke and create Argo Workflows without the users having to worry about it. Does that help? Okay. Any other questions? Any questions from here? No? Okay, we can take additional questions offline on the ArgoCon Slack, and I'll be at the Akuity booth if you want to stop by. Thank you.