So it's going to feel like we've been thrown into the deep end of the pool on the first day of the conference. Some of us are on Pacific time, so we're still waking up, and suddenly it's Kubernetes, which is complex, then Argo on top of it, then multiple clusters and configuration across those clusters, and so on. It's quite a lot to take in, so I'll try to keep this as simple as possible. If there are any questions, feel free to raise your hand and we can have a conversation about it.

All right, so hi everyone. My name is Shri Zawdekar. I am an engineer at Outerbounds, one of the main companies contributing to the open source Metaflow project. Today we're going to talk about cross-cluster execution of Argo workflows: how we can run Argo workflows across clusters, what that even means, and specifically what it means from the point of view of Metaflow. Metaflow can create Argo workflows for you, especially for data science workloads, so we'll look at this from Metaflow's perspective and have a conversation about it.

Before we get into the details of what cross-cluster Argo workflows and cross-cluster Argo Events mean, let's take a quick look at what Metaflow is and take it from there. Metaflow, as I mentioned, is an open source project. It makes it easy to access the resources needed for every data science and data-intensive application: compute, workflows, data, and versioning. And it makes this possible through simple Python APIs. You can find it at github.com/Netflix/metaflow. Netflix is where the project began; now there are multiple companies contributing to it, Outerbounds being one of the main ones.

So what does it mean when we say Metaflow is a simple Python API for data science workloads? Here I have an example of what a Metaflow flow looks like. On the left-hand side is the simplest possible flow you can think of. Given that we're all here at ArgoCon, I'm sure you all know what Argo Workflows is: basically a set of steps, some sequential, some parallel. This is the same logical concept expressed through a Python API. In this example we have a flow called HelloFlow that subclasses FlowSpec; every Metaflow flow has to subclass FlowSpec. It has three steps, with start and end as the entry and exit points, and every flow has to have a start step and an end step. In this case the start step simply says the next step to execute is self.work, so execution goes to the work step, and work says go to end, and that's when the workflow completes. This is how data scientists, or anyone writing data processing applications, can write Python code using Metaflow.

Metaflow makes it super easy to build locally and then run remotely, and the right-hand side shows how you would actually run this. If you want to run the flow on your own laptop, you simply run hello.py run, and it executes the flow on your local machine. Metaflow takes care of maintaining the sequencing and parallelism of the steps, and of passing the output of one step to the next.
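To make this concrete, here is a minimal sketch of the kind of flow being described; the actual code on the slide may differ in its details.

```python
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):
    @step
    def start(self):
        # Entry point: every Metaflow flow begins with a start step.
        print("starting")
        self.next(self.work)

    @step
    def work(self):
        # The middle step of the three-step DAG described in the talk.
        print("doing the work")
        self.next(self.end)

    @step
    def end(self):
        # Exit point: every Metaflow flow finishes with an end step.
        print("done")


if __name__ == "__main__":
    HelloFlow()
```

Running it locally is then just python hello.py run, as described above.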
When you're ready to run this on a Kubernetes backend, say a cluster where you have special access to GPUs, or a cluster with access to special data that you want to process as part of this flow for building your models or doing other data processing work, you can run it as an Argo workflow. You do that with the two commands shown here. The first one, hello.py argo-workflows create, creates a workflow template in the Kubernetes cluster. What Metaflow does is convert each of the three steps we mentioned into a step in an Argo workflow, and that's how a three-step DAG gets created. Then hello.py argo-workflows trigger actually runs the flow from that template, so the three-step flow executes. If you want to run it multiple times, you keep calling argo-workflows trigger, and each invocation runs another instance of the flow. We'll see an example of this a little later.

So that's the basics of how Metaflow works, how it integrates with Argo Workflows, and what it means to run Metaflow with Argo. Now let's get to the main topic at hand: cross-cluster Argo workflows.

As you can tell, the name cross-cluster Argo can be a little ambiguous. What does it even mean to run Argo workflows across clusters? Does it mean that, given a particular flow with, say, three steps, someone decides at every step whether it should run on cluster A versus cluster B? That is one way to interpret cross-cluster Argo workflows. It's not what we're going to do today, but it is one interpretation.

The second interpretation is that you schedule an entire flow on a specific cluster. The user submits an Argo workflow, and someone makes the choice: okay, let's take this entire flow and run it on cluster A. The next time someone submits the same workflow, let's run it on cluster B. That is another potential way of interpreting cross-cluster Argo workflows, and it has a lot of very interesting use cases around multi-cloud and cost-effective execution of flows. We'll look at this second interpretation again a little later, in the context of Argo Events.

And then there's the third, and interesting, interpretation. To execute Argo workflows, Argo has its own control plane. Argo Workflows is built around a custom resource definition: the Argo workflow controller sits in a Kubernetes cluster and watches for Workflow objects. When a Workflow object is submitted, the controller is the one that looks at it and executes the steps one by one, honoring the sequencing and parallelism involved and making sure the output of one step goes to the next. So you can imagine cross-cluster Argo also meaning a setup where the Argo control plane runs in one cluster and the actual workflow pods run in a different cluster. That is one of the main themes of what we're going to talk about today.
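To make that theme concrete: as explained in a moment, the workflow controller fundamentally just needs a kubeconfig for the cluster it should manage, so in principle it can run anywhere, in a control-plane cluster or even on your laptop. A rough sketch of the idea follows; the binary name is real, but exactly how the kubeconfig gets wired into the controller depends on how you build and deploy it, so treat this as an illustration rather than a recipe.

```bash
# Sketch only: assumes the controller falls back to the standard kubeconfig
# loading rules when it is not using in-cluster config. Check the Argo
# Workflows docs for the exact wiring in a real deployment.
export KUBECONFIG=$HOME/.kube/execution-plane.kubeconfig   # credentials for the execution-plane cluster
workflow-controller                                        # the Argo Workflows controller, running "anywhere"
```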
This particular way of running the Argo control plane in one cluster and the workflow pods in a different cluster is what we're going to focus on. Why does it matter, though? It has some very interesting implications once you think about it.

Imagine you have an execution-plane cluster. One of the main audiences Metaflow caters to is data science, and a lot of the time data science teams have their own Kubernetes clusters, or can provision their own, and say: this is the cluster where I have access to special hardware like GPUs, or this is the cluster where I have access to the special data I want to use for my machine learning models. They have that one cluster and they only want to run data science workflows on it. Anything running in that cluster that isn't doing data science work is, from their point of view, overhead, and that's a fair argument. So they want this execution-plane cluster, where they have all the access to special hardware and special data, to run only their workflows. In that case, you can imagine a separate control-plane Kubernetes cluster whose only job is to run your Argo workflows controller, the Argo control plane.

And it is possible to do this today. One of the fundamental things about Kubernetes and the controllers built around it over the years is the notion of a kubeconfig. If you're going to run kubectl commands locally, or build anything against a cluster, having the kubeconfig means having access to the cluster. So today you can literally take the Argo workflows controller and run it anywhere, even on your laptop, as long as you have the kubeconfig of the cluster you want it to manage. With that, you can run the Argo workflows controller in one cluster and have the workflows actually executing in a different one.

This has several other useful side effects. There's the notion of managed Kubernetes environments and the shared responsibility model, one of the more modern ways in which applications get deployed and operated. In our case, the data science team is in charge of the execution cluster, while the platform team is in charge of the managed Kubernetes environment. The management cluster takes care of running all the management services. I just showed the Argo workflows controller as an example, but you can imagine other things running there too: a UI, monitoring components, anything else of that sort can live on the control-plane cluster, whereas the execution-plane cluster is strictly used for running flows, which means it is strictly used for doing the work the user asked for.

This has other interesting implications as well. One of the main reasons this came about was that data science teams would say: I build a machine learning model maybe once a week, my job takes a lot of compute instances, but it runs once a week for six hours.
But then I look at my Kubernetes cluster and my AWS bill, and I see that three instances are running all the time. Why am I paying for those three instances 24/7 when I'm only doing work for six hours a week? And we would go to them and say: well, we're running the Argo workflows controller, the Argo Events controller, and a few other things. People understand that, but it's a compromise at best. What they really want is to run their cluster for just those six hours a week, using all the compute resources they need, and nothing more. This execution model, with a control plane separate from the execution plane, makes that possible. Of course, the control-plane cluster still runs 24/7, because it hosts the Argo workflows controller, but you can imagine consolidating a bunch of these control-plane components into one cluster that runs 24/7 while the other clusters scale down when they're not needed.

Running everything together can also pose manageability challenges. One incident that actually happened: the workflow controller and user flows were running in the same cluster, at some point the workflow controller died, and there were enough pods waiting to be executed that they took up all the capacity in the cluster. The workflow controller had no way to run; there simply wasn't enough CPU and memory left for it. When you separate these into two clusters, the team in charge of the execution plane can happily run their workloads, and your workflow controller is almost always guaranteed to be running, or at least you can monitor it and make sure it has enough capacity in the control-plane cluster. So this separation between an execution plane and a control plane has real benefits.

This slide says the same thing: the execution plane is a user-provided account or cluster, and it runs user-submitted workloads. It needs carefully created RBAC. One interesting side effect of this setup is that you need carefully constructed role-based access control: a service account with just enough privileges to watch Argo workflows, run pods, watch the status of those pods, and so on. That service account is the one you create the kubeconfig for and hand over to the control-plane cluster so the whole thing works. Creating the service account, creating the kubeconfig, and handing it over from the execution plane to the control plane takes some work and some automation.

Having said that, let's quickly take a look at how this actually works with Metaflow. I have two windows here. The one at the top is one Kubernetes cluster; this is its name, some randomized name tagged with CP, which stands for control plane. The cluster at the bottom is tagged EP, the execution plane. If we look at all the pods in the execution plane and search for a workflow controller, we will not find one running in that cluster. But if we do the same thing in the cluster above, the control plane, that is where the Argo workflow controller pod is actually running.
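For reference, the checks in this part of the demo boil down to something like the following; the cp and ep context names are made up for illustration.

```bash
# Execution-plane cluster: no workflow controller, and nothing deployed yet.
kubectl --context ep get pods --all-namespaces | grep workflow-controller   # expect no output
kubectl --context ep get workflowtemplates,workflows --all-namespaces       # expect nothing yet

# Control-plane cluster: this is where the Argo workflow controller actually runs.
kubectl --context cp get pods --all-namespaces | grep workflow-controller
```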
So we have the workflow controller running in the control plane, and it is watching for objects in the execution plane. Right now there are no workflow templates and no workflows, so nothing is there in the execution plane.

Now let's switch over to this other window. We have our good old hello flow here, the one I showed you earlier: a simple three-step flow with start, work, and end. Let's try to run it. The first command, as I mentioned, just creates an Argo workflow template. Metaflow looks at the code and creates the template, and here it says that no triggers were defined, so you have to launch this flow manually, which is what we'll do. Now if we go back to the execution plane and look at the workflow templates, we find that a new template called hello flow has been created. Then if we go back here and run trigger, that actually runs the flow. Back in the execution plane, there is still only one workflow template, but if we look at workflows, we see that this workflow has started executing. And if we go over to the Argo UI, we can see this particular workflow in execution: the start step runs, then the work step, then the end step.

That shows that all of the work required to execute this flow was done by a workflow controller running in a different cluster, which is exactly the use case we had in mind. And this part is relatively simple: Argo Workflows is just one controller, so running that pod in a different cluster is not that hard, and as long as you have the kubeconfig, things are great.

But what about Argo Events? Running the flow interactively is fine, and it has its use cases, but it's not the productionized way of doing this. No data scientist is going to stand up every Saturday evening at 9 PM and run argo-workflows trigger to train their model. They want to run flows either on a specific schedule or based on events. Events are the most powerful option: you want a model-building flow triggered when new data gets uploaded to S3, or when some other event happens that should kick off a workflow. For that we want to use Argo Events.

Argo Events has a much bigger control-plane footprint. For Argo Events to run, you have the Argo Events controller, similar in spirit to the Argo workflow controller. There is a webhook deployment, which is the HTTP endpoint to which you send events. There is an event bus that persists those events. And there is a sensor deployment, the pod that actually triggers the downstream flow. So say you want flow A's completion to trigger flow B. When flow A completes, it sends an event to the webhook; the webhook writes the event to the event bus, where it is persisted; and the sensor deployment, which is watching the event bus, sees that a flow A completed event has arrived.
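To illustrate the chain just described: the hand-off into Argo Events is conceptually just an HTTP POST to the webhook event source. Everything in the sketch below, the service name, port, and payload, is made up for illustration and is not the actual wire format Metaflow or Argo Events uses.

```bash
# Conceptual sketch only: names, port, and payload are illustrative.
curl -X POST http://webhook-eventsource-svc.argo-events:12000/event \
     -H 'Content-Type: application/json' \
     -d '{"name": "FirstFlow.completed"}'
# The webhook writes this event onto the event bus, and the sensor watching the
# bus reacts by triggering the downstream workflow in whichever cluster its
# kubeconfig points at.
```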
When it sees that event, it triggers flow B, because that's what it was told to do earlier. So the sensor deployment is the one that actually triggers the flow. Similar to the earlier case, the sensor deployment just needs a kubeconfig: as long as it has the kubeconfig of the appropriate cluster, it can trigger a flow in that target cluster. And this is a very powerful thing, because instead of the sensor deployment pointing at one execution plane, you can imagine it holding multiple kubeconfigs and therefore triggering flows in cluster A or cluster B. This goes back to one of the earlier interpretations we mentioned: cross-cluster Argo can mean run this flow on cluster A, run the next one that comes in on cluster B, or some more intelligent way of routing flows to different clusters. Argo Events takes care of that.

Let's take a quick look at how we can do this with Metaflow. I have two flows here, first.py and second.py, the example I just gave. First.py is very straightforward; there is nothing interesting in it, it's just a flow. Second.py is more interesting: it uses one of Metaflow's decorators, trigger_on_finish. What second.py says is: trigger the second flow when the first flow completes. Only when the first flow completes does this trigger fire. A rough sketch of this pair appears at the end of this section. And we can do all of this with the Argo Events and Argo Workflows control plane running in a separate cluster.

Let's take a quick look at a demo. Just like we did argo-workflows create for hello flow, let's do it for first.py, so we get a workflow template for first.py, and let me quickly create one for second.py as well. Now if we look at the workflow templates in the execution plane, we should find a first flow and a second flow. Oops, let me redo that. Okay, so we have two different workflow templates, first flow and second flow, and for second.py it says: this workflow triggers automatically when the upstream first flow completes.

Now let's go to first.py and actually execute it. When we trigger first.py, it should run, and that should emit an event which then triggers second.py, so the second flow should execute based on that. Let's quickly go back here. Looking at the workflows being executed: the hello flow is the one that ran earlier, and the first flow is the one running now. It should complete any minute, and as soon as it does, the second flow should start. In the Argo UI you can see that the first flow completed the workflow, and, although the UI needs a little time to refresh, the second flow actually got triggered based on the event that the first flow completed. And this should complete fine as well.

I'm being told that we are at time. If you have questions, I'm happy to take them. There are many other implications of this, so if you have more questions or want to discuss it further, come find me; I'll be around here during KubeCon, or at the Outerbounds booth.
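For reference, the first.py / second.py pair from the demo looks roughly like this on the Metaflow side. This is a sketch: first.py is just an ordinary flow like HelloFlow earlier, so only second.py is shown, and the flow and step names are illustrative.

```python
# second.py -- a sketch of the downstream flow
from metaflow import FlowSpec, step, trigger_on_finish  # trigger_on_finish ships with recent Metaflow versions


@trigger_on_finish(flow="FirstFlow")   # run this flow whenever FirstFlow completes
class SecondFlow(FlowSpec):
    @step
    def start(self):
        print("running because FirstFlow just finished")
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    SecondFlow()
```

When this is deployed with argo-workflows create, Metaflow turns the decorator into the Argo Events plumbing described above: a sensor watching for the first-flow-completed event, which then triggers SecondFlow's workflow template.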
Questions? Thank you, thank you. Good, good question. So the question was: are the execution-plane clusters ephemeral, and how does the control plane know when you're spinning them up and down? At this point, setting up these clusters and setting up the RBAC so that the control plane actually has access to them is part of some automation that is done separately. In this case, of course, the two clusters were created by hand and set up for this specific demo, but you can imagine doing it fairly easily: create a cluster, create a service account, create the kubeconfig, and pass it on to the control plane.

Yes, good question. The question was: is there still a one-to-one relationship between control plane and execution plane, or is there a way to submit from one control plane to multiple clusters? In this demo, no, but especially with Argo Events it should be fairly straightforward to have multiple kubeconfigs and submit to different clusters. The actual event that gets generated might play a part there: the event says I am event A, so run the workflow for it in cluster A; the event says I am event B, so submit the workflow to cluster B. The mechanism by which Argo Events triggers the flow is fairly straightforward, so it should be possible to do this, as long as you have multiple kubeconfigs. I can talk to you offline about how it would actually work.

Yes, good question: what about the Argo workflow controller? You're right, there is a one-to-one relationship there: you need one Argo workflow controller for every execution-plane cluster. But you can run multiple Argo workflow controllers in one cluster, each with a kubeconfig pointing to a different cluster. So you can imagine a hub-and-spoke model where you have one golden control plane holding fifteen kubeconfigs, and now you can drive fifteen clusters. Going back to the original diagram we had here: one control plane running Argo, talking to fifteen different execution-plane clusters and triggering flows in them at various points.

Anyway, thank you so much everyone. This was great. If you have more questions, come talk to me. I'll hang out here, or maybe right outside, and also be at the booth. Thank you so much. Thank you.