Thank you everyone for joining this session. I know it's the last one. I hope you had a great day here at KubeDay Israel; I know that I enjoyed most of it and talking to many of you. I'm actually really excited to be giving the closing session with all of you. Today we'll be speaking about how to simplify your multi-cluster management using a tool called Karmada.

A little bit about myself. My name is Eliran Bivas. I'm the architect for the platform group at AppsFlyer, where I've worked for almost four years now. I have a real passion for technology, so if you look me up on LinkedIn or anywhere else, you will see that I am a self-proclaimed tech junkie.

A little bit about AppsFlyer. I'll let the numbers speak for themselves, but I'll read out some of them. AppsFlyer is the market leader for mobile attribution. We operate with 14,000 customers and 65% of the global market share. We have about 1,500 employees across the globe. A little bit about the engineering organization that I'm part of. The engineering group has roughly 400 engineers, divided into squads, that operate around 1,200 microservices handling roughly 3 million events per second. And we operate an infrastructure that is close to 250,000 cloud resources, plus dozens of SaaS integrations and more. The reason these are rough numbers is that our infrastructure, as in any other large organization, grows or shrinks based on demand and customer requirements.

A little bit about today's agenda. We'll start with the challenges of working with large-scale clusters. We'll introduce Karmada, understand its architecture, and cover a little bit of its API. I'll go over several use cases to demonstrate how Karmada can simplify multi-cluster management, and we'll conclude with what we discussed in this talk.

So let us begin. I'm pretty sure all of you are familiar with this logo. If not, this is Kubernetes, of course. And I guess over the years we've learned to love this tool. Love or hate, it's a matter of perspective. We've learned to love it because it has a decent scheduling mechanism. It has a simple yet very robust API for describing our requirements. It has a lot of built-in goodies like service discovery and handling of node failures, and there are countless integrations that target Kubernetes specifically. It's enough to look at the CNCF landscape to understand how many different organizations, startups, and frameworks target Kubernetes as their infrastructure operating system.

But when you reach a certain scale, operating Kubernetes forces you to rethink your entire architecture for managing that scale. And when it comes to large scale, there are basically two options for scaling your clusters. Either you go with vertical scaling, having one very large disco ball that you keep growing and growing and growing. Or you use many clusters: basically have as many balls as possible and juggle them. Each approach has, of course, its advantages and some drawbacks, and I'll try to list some of them in the following slides.

First, let's start with vertical scaling, the very large disco ball. A very large disco ball is basically a single point of failure for your organization. If something happens to that cluster, your entire production might be in jeopardy, and a single point of failure is a big no-no in most of our organizations. Another drawback is scalability limitations.
You can't really have a ball that big. Kubernetes itself has limits per cluster: you can't really go beyond 5,000 nodes, and with some cloud providers you can't even reach 3,000 nodes. There is also a limit on pods per node, and other limits that prevent you from growing larger and larger. Another issue is resource underutilization. When you schedule large workloads alongside smaller ones, you get a lot of fragmentation in your bin packing and the cluster becomes underutilized. Another thing is the complexity of managing a very large cluster. Just think of a simple task: say you would like to upgrade the cluster's API version. How long does a rolling upgrade of a 3,000-node cluster take? Quite a while.

Next we have horizontal scaling. Now, instead of managing a single cluster, we're going to manage multiple clusters. And just by introducing many clusters, we get the complexity of managing all of them. For example, at AppsFlyer we manage hundreds of Kubernetes clusters, in different flavors: clusters for Kafka, clusters for Airflow, Spark, services, and many, many more, each requiring a different type of management and a different kind of workload. So the complexity of managing them becomes an issue on its own. Resource allocation is another difficulty you need to deal with. When you have so many clusters, which workload do you target at which cluster? How do you manage it? Your resource allocation becomes a problem you have to work through. And when you're working with many clusters, your entire network becomes a complexity of its own. How do you do interconnectivity between clusters? Which cluster is allowed to talk to which? This becomes a very big challenge when it comes to service discovery, because now we need service discovery across multiple clusters. And of course, the last drawback of working with multiple clusters is the potential for configuration inconsistency. Did you install the right controller in all of your clusters? Do we have the same API in all of them? Have we deployed the right deployment to each of our regions? That's a real drawback when working with so many clusters.

So this is where Karmada comes into the picture. I'm going to read out loud what Karmada is. Karmada, which is short for Kubernetes Armada, is a Kubernetes management system that enables you to run cloud-native applications across multiple Kubernetes clusters and clouds, with no changes to your applications. Just from this short statement, we're talking about not just multi-cluster, but multiple regions, even running on multiple cloud providers. Now, this is a very strong statement, but basically Karmada aims to manage the entire global availability of your production.

So let's do a short dive into Karmada's architecture. Karmada's control plane tries to mimic the Kubernetes API, so you're going to be familiar with most of the examples I'm going to show you if you've been working natively with Kubernetes. Karmada's API server mimics the Kubernetes API server. We have a scheduler, but now instead of scheduling to nodes, we are scheduling to clusters. And there are multiple controllers that make up the control plane. One of them, just as an example, is the cluster controller.
If you're familiar with how Kubernetes itself operates, there is a node controller that works at the node level; here we are working at the cluster level. There are two options for integrating a cluster with the control plane. If you have a cluster that you want Karmada to integrate with directly, meaning push mode, Karmada will talk directly to that cluster's API server and push the workloads to it. The other option is, of course, pull mode, using an agent. The agent connects to Karmada's control plane and fetches all the workloads destined for its cluster.

A little bit about the primary concepts Karmada works with. Basically there are three, and I'll demonstrate later how they come into action. First, we have the resource template. Resource templates are the native Kubernetes APIs you're already familiar with: a Deployment, a Service, a Secret, a ConfigMap, whatever. Each becomes available in Karmada's API as a resource template, so any existing tool that you currently use with Kubernetes remains available to you when working with Karmada. There's no need to change anything. Next, we have the propagation policy. The propagation policy is the multi-cluster scheduling piece: it allows you to do one-to-many scheduling, propagating the template to whichever clusters you would like. And last, we have the override policy, which is cluster-specific configuration that lets you adjust the propagated resource individually for each cluster or group of clusters.

So let's look at the API flow. It starts with a resource template, and as I mentioned, everything that's native to Kubernetes is native in Karmada: a Deployment, a ConfigMap, and so on. You submit it to Karmada's API alongside the propagation and override policies, and everything comes into action: Karmada knows how to push that workload to the designated clusters. Internally, it creates an object called a Work. This Work object is responsible for reconciling, or synchronizing, the resource to each specific cluster. Now, Karmada doesn't schedule on the cluster itself; the member cluster knows how to schedule a Deployment or a ConfigMap, and it knows how to handle a Service. Karmada just communicates with the designated cluster through the API and tells it: you need to run this object. Internally, everything keeps working as it should.

One note about the diagrams, here and in the examples that follow. Usually in Kubernetes diagrams, squares are nodes. Pretty simple: we see a lot of nodes, a lot of squares. In Karmada's diagrams, each square is actually a cluster. So it's a much larger scale than you're accustomed to seeing.
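To make that flow a bit more concrete, here is a rough sketch of the kind of Work object Karmada creates internally for each target cluster. This wasn't on the slides: the names (an nginx Deployment, a cluster registered as eu-west-1) are my own illustrations, you never create Work objects by hand, and the exact fields can differ between Karmada versions.

```yaml
# Illustrative sketch only: roughly what Karmada generates per target cluster.
# Work objects live in a per-cluster "execution space" namespace on the
# Karmada control plane; the push controller or the pull agent syncs the
# embedded manifest into the member cluster.
apiVersion: work.karmada.io/v1alpha1
kind: Work
metadata:
  name: nginx-work                  # hypothetical generated name
  namespace: karmada-es-eu-west-1   # assumed execution space for eu-west-1
spec:
  workload:
    manifests:
      - apiVersion: apps/v1         # the resource template, embedded as-is
        kind: Deployment
        metadata:
          name: nginx
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
                - name: nginx
                  image: nginx:1.25
```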
So let's go over several use cases to demonstrate the multi-cluster API that Karmada provides. I have to say that due to some limitations, specifically space on a slide, I won't be able to show you a complete armada of clusters; I'm going to demonstrate with only two. But just for the sense of it, imagine we had multiple clusters in the US, in Europe, in Asia, maybe in Africa: the entire slide would be a lot of squares. So I'm going to show a few initial use cases just to demonstrate the idea.

Let's start with the simplest thing Karmada supports, which is, of course, scheduling: how to do multi-cluster scheduling. In this example, as I mentioned earlier, we're working with the native Kubernetes API. This is a Deployment; I hope you're familiar with it. There's nothing Karmada-specific about it. It's a simple Deployment that we submit to Karmada's API, and at that point it is just a resource template: nothing happens yet on either of the clusters we are connected to.

The next step is the propagation policy. This is the multi-cluster API that Karmada provides, part of the principles Karmada operates with. We define a propagation policy, and just for the sake of the example I use static placement; you usually wouldn't use a static configuration, but it keeps the demonstration short. In this case we have two clusters, one in the EU and one in the US, called eu-west-1 and us-east-1. Pretty simple. In this static configuration I say: propagate the Deployment I defined earlier to the EU. And this is exactly what Karmada does: it schedules it on eu-west-1, and nothing happens on the US cluster. Simple. And again, every square you see here is a cluster, not a node; each of the EU and US clusters is a fully functional cluster.

Next, more advanced scheduling: cluster affinity. Again, this resembles what you're already familiar with from Kubernetes. I add a label to each of my clusters, a label called location, based on the continent; in this case we have two. And I'm going to deploy my resource template to the US clusters. So if we had multiple US clusters, not in this example, the resource template would be deployed to all of our US-based clusters. A slightly modified scheduling mechanism uses match expressions with labels. Now I'm targeting both the US and the EU; our Asia and Africa clusters would not get this propagation.

Next, let's see how override policies come into action. Again, this is part of the multi-cluster API that Karmada provides. Maybe due to privacy issues or other business constraints, you need to change the way services are deployed to your US-based clusters. You use the same label selector as before, and now you say: for the clusters matching this selector, apply these overriders, for example a different environment variable or a different image. Karmada will then schedule the same Deployment, but with the overrides applied on the US clusters. I'll show rough sketches of what these policy manifests might look like right after this walkthrough.
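Here is a minimal sketch of the static propagation from the first example, assuming the Deployment is named nginx and the member cluster was registered as eu-west-1 (both names are mine, not from the slides):

```yaml
# A minimal static PropagationPolicy: send the nginx Deployment
# only to the eu-west-1 member cluster.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-to-eu
spec:
  resourceSelectors:          # which resource templates this policy matches
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:           # a static list: fine for a demo, brittle in real life
        - eu-west-1
```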
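The label-based variants could look roughly like this, assuming each registered cluster carries a location label; the second document uses a match expression to target two continents at once:

```yaml
# Target every cluster labeled location=us...
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-to-us
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      labelSelector:
        matchLabels:
          location: us
---
# ...or use a match expression to target both the US and the EU;
# clusters labeled asia or africa are simply left out.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-to-us-and-eu
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      labelSelector:
        matchExpressions:
          - key: location
            operator: In
            values: [us, eu]
```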
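And a sketch of the override policy, under the same assumed labels. This uses the overrideRules shape of the v1alpha1 API; since it's an alpha API, field names have shifted between releases, so treat the structure rather than the letter as the point:

```yaml
# For clusters labeled location=us, rewrite the image registry
# and one environment variable of the propagated Deployment.
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: nginx-us-overrides
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  overrideRules:
    - targetCluster:
        labelSelector:
          matchLabels:
            location: us
      overriders:
        imageOverrider:
          - component: Registry     # swap only the registry part of the image
            operator: replace
            value: us.registry.example.com   # hypothetical registry
        plaintext:                  # JSON-patch-style override for anything else
          - path: /spec/template/spec/containers/0/env/0/value
            operator: replace
            value: us-specific-value
```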
Next, failover. This is another mechanism that comes with Karmada's API. I've updated the example a little: the Deployment now has replicas set to three, meaning that if you scheduled it on a regular Kubernetes cluster, you would get three pods. That's the definition. Again, I submit it to Karmada's API and it becomes a resource template. And note that I'm going to talk about cluster failover, not node failover; node failover is something Kubernetes knows how to do without Karmada, so Karmada doesn't need to handle it. This failover example uses a slightly more advanced propagation policy, and again I'm using a static configuration just for the sake of the example. In this case I use a weighted cluster preference, meaning I want to control how everything is split across my clusters. The scheduling type is called Divided, meaning I want to take all the replicas and divide them across my clusters; I could also use Duplicated and other strategies. Just for the sake of the example I've used weighted division, because it's much, much easier to understand, with static weights of two for the US and one for the EU: two replicas to the US, one to the EU. And again, Karmada doesn't schedule the pods themselves; Karmada schedules only the Deployments, and each cluster's Deployment gets a slightly different replica count based on the propagation policy we defined. When they reach the designated clusters, two pods appear in the US and one pod appears in the EU.

And now catastrophe strikes: maybe an entire region has failed, or maybe something we did caused our cluster in the EU to shut down. In this case, Karmada will identify it and simply reschedule the Deployment (the Deployment, not the pods), adjusting the replica counts so that the desired state we wanted, three replicas, is preserved: now all of them run in the US. A sketch of this weighted placement follows below.

Another example of Karmada's multi-cluster API is service discovery, and this is something I personally find very interesting: it's the area of the project I'm contributing to, and I wish to keep contributing to enhance its features. We'll start again with a very simple example: a Deployment and a Service. I'm going to omit the propagation policy for brevity. We propagate them through the Karmada API onto the eu-west-1 cluster. Now everything runs very simply: we have a Deployment and a Service that points to it, so anyone who accesses the Service locally reaches the Deployment. Simple, still plain Kubernetes. When we employ the multi-cluster services API, we say that we want to export that Service from the EU cluster and import it into the US cluster: we export from one cluster and import into another. Again, I'm omitting the propagation and everything else. But once everything is running, we'll have a Service defined in the US such that whoever reaches it locally will actually reach the EU cluster, without having to know that there is no local Deployment behind it. There are other possibilities and strategies for service discovery: you can do failover, and there are other ways to do service discovery across multiple clusters. But this is the simplest example: a service that basically makes a direct connection to another cluster.
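Going back to the weighted failover example: a sketch of that placement, with the same assumed cluster names, could look like this:

```yaml
# Divide the Deployment's three replicas across clusters at a 2:1 US:EU ratio.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-weighted
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx                      # assumed to declare replicas: 3
  placement:
    clusterAffinity:
      clusterNames: [us-east-1, eu-west-1]
    replicaScheduling:
      replicaSchedulingType: Divided   # split replicas, don't duplicate them
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames: [us-east-1]
            weight: 2                  # two replicas land here
          - targetCluster:
              clusterNames: [eu-west-1]
            weight: 1                  # one replica lands here
```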
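And for the service discovery example, here is a minimal sketch using the upstream Multi-Cluster Services objects that Karmada builds on. The assumption is a Service named my-service running in eu-west-1; you would propagate the ServiceExport to the exporting cluster and the ServiceImport to the consuming one, and Karmada surfaces a derived service in the importing cluster:

```yaml
# Export my-service from the cluster that runs the backend (eu-west-1)...
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-service        # must match the exported Service's name
---
# ...and import it in the consuming cluster (us-east-1), where callers
# can then reach the EU backend as if it were local.
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: my-service
spec:
  type: ClusterSetIP
  ports:
    - port: 80
      protocol: TCP
```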
Beyond everything I just demonstrated, you have to understand that all of the examples I showed you carry some risk. And the risk is this: if you read the YAMLs closely, you probably saw that the API version was v1alpha1. While some companies do use Karmada in production, you have to understand that this is a CNCF sandbox project, which means the API is not finalized and should not be considered stable. So if you're planning on using it, you should definitely weigh the risk of adopting a sandbox project. And as a sandbox project, like any other open source project, it needs a lot of contribution from its audience and users. First, visit the website at karmada.io; there's a lot of documentation the team has provided, well-documented code and a well-documented feature set. You can also open issues and, of course, contribute code on their GitHub repository; it's highly appreciated. And join the discussion: there are a lot of conversations on the CNCF Slack in the Karmada channel, and any contribution is very helpful.

So let's conclude what we've discussed in the past twenty-or-so minutes. First we discussed Kubernetes and how it eases our operational work. We mentioned that once we reach a certain scale, we need to rethink our operating strategy. We talked about vertical scaling versus horizontal scaling, the one big disco ball versus many balls, and the downsides of each approach. And last, we talked about Karmada, its architecture, and some very initial use cases, just to get a glimpse of its multi-cluster API. Thank you very much.

Thank you very much, Eliran. We have a few minutes for questions. Any questions from the audience? Okay, I see a few hands; let's start here.

My question is regarding your experience scaling at AppsFlyer. When you made the decision to scale horizontally, what was the limit you discovered in practice on vertical scaling of a single cluster before you decided you had to split workloads across multiple clusters?

At AppsFlyer we decided right away that we were going to go with horizontal scaling, because of the variety of workloads that we have. Since we have workloads targeting Kafka, Airflow, services, all different shapes and sizes, we immediately started with horizontal scaling.

Next question, here we go.

Can you say how Karmada is different from OCM, Open Cluster Management, and why one would choose one over the other?

I guess this is a topic for another conversation, so maybe we can talk afterwards. It's a big subject, and I could easily fill another 30 minutes with it.

More questions? Here we go.

When you deploy an application across multiple clusters, and let's say everything deployed successfully, who's responsible for the deployment if it changes somehow? Say someone changed it: does Karmada watch it, check whether it changed, and revert it to the initial state? Because if I use Argo CD to manage my applications and I want to deploy them across multiple clusters, I can use both, but since Argo CD is the manager of the application, it might change something, and if Karmada changes it back, they will go back and forth.

I think your question answers itself, because the key here is GitOps. You basically need the entire API to be driven through GitOps. So if today you manage your clusters with Argo CD or Flux, you keep the same notion of GitOps: the source of truth is still your Git repository, and if something changes in a cluster, it will be propagated again and reconciled back to your source of truth.
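To make that answer concrete, one common shape for this (my sketch, not something shown in the talk) is to point an Argo CD Application at the Karmada API server instead of at an individual workload cluster, so Git holds the templates plus the policies and Karmada does the fan-out:

```yaml
# Hypothetical Argo CD Application whose destination is the Karmada
# API server rather than a single workload cluster. The repo path is
# assumed to contain resource templates plus Propagation/Override
# policies; Karmada handles the fan-out to member clusters.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nginx-multicluster
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deployments.git  # placeholder repo
    targetRevision: main
    path: nginx
  destination:
    server: https://karmada-apiserver.example.com  # Karmada, not a member cluster
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true    # reverts drift back to Git, as discussed above
```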
More questions? Okay, we have one here in the corner.

A couple of questions about Karmada. First, can it also handle things like changing a cluster's Kubernetes version, or does it only handle deployments? And following up on that: does Karmada have GitOps of its own? Because if I understand correctly, instead of writing infrastructure as code that goes directly to the Kubernetes cluster, you write infrastructure as code that goes to Karmada to handle all of it, so it should have the same tooling for managing code and configuration on its own.

I'll start with the second one, actually. No, it doesn't have its own GitOps; it relies on being reconciled itself. This is why it exposes the Karmada API, and you're supposed to bring in all your YAMLs and configuration from another source, in this case maybe Argo CD or maybe Flux. Does that answer that part of your question? Okay. And for the first part, I'll need you to ask it again, sorry.

You mentioned a few drawbacks of using multiple clusters, like configuration inconsistencies between clusters. Does Karmada resolve those issues, and if so, how?

It actually doesn't. This is part of being a sandbox project: it doesn't really target that yet.

Any other questions? Just a second, here we go.

In the example you gave, with a deployment of multiple replicas that got divided between two clusters, how is a pod disruption budget handled?

If you know how to work with a pod disruption budget in a single cluster, it's the same with Karmada's API: it's submitted and propagated to the clusters, and it keeps working the same way.

But if my deployment says, for example, that I want a maximum of one unavailable, and I have one cluster with one replica and another cluster with two replicas, what would be the PDB that gets set on each of those?

So, I showed a static configuration, but usually you wouldn't pin a specific number of pods; that contradicts working with autoscaling, with budgets, and so on. If you're working with budgets, there is another API for that. I didn't show it because it's much more advanced, but Karmada knows how to handle it; there's a separate API.

There's a question on this side of the room, here we go.

First question: CRDs, are they supported?

Yep.

Okay, that's good. And the second thing: some infrastructure providers like Pulumi and Terraform rely on the status of deployments. In this case, because one deployment is actually multiple deployments, how does the status field change? I'm sure it's not the same as a single deployment.

So Karmada's API is the source of truth for you. Usually, when you deploy with Terraform, Terraform looks at a single cluster, right? But now we are working with multiple clusters, so Karmada is your entry point, and Terraform operates at the Karmada API level.

When you apply a deployment to a cluster with Terraform, it waits for the deployment status to be complete. So in this case, what's the status when there are ten clusters? Is it a different syntax, or do you even have a status?

Yeah, it's the native status. If you do kubectl describe on the deployment, you get the status field as if it were running on the Karmada API.

Oh, so the completion reflects every single cluster combined. Thank you very much.

Thank you very much, Eliran, and a great round of applause. Great talk.