Welcome. The next session is going to be about load balancing VMs in Kubernetes. I have one small request: when there are questions, please do not leave the room, because you will disturb the speakers. If you can stay, there's enough time to leave the room after the session. So Janir, the stage is yours. Thank you very much. So, hi guys. My name is Janir Quinn. I've been working at Red Hat for the last two interesting years. I'm part of the oVirt SLA team in Red Hat Virtualization, and currently I'm mostly working around KubeVirt. So, let's start. Basically, the topic today is the need for load balancing in Kubernetes. To talk about it, we first need to go over what Kubernetes is in a nutshell, what KubeVirt is, and how scheduling works in Kubernetes, and then understand why we even need load balancing. After understanding the need, we'll see some of the projects already out there and the concepts in Kubernetes, and we'll focus on one major project in the Kubernetes Incubator called the descheduler, for load balancing. After that, we'll see what else might be done in that area, with an emphasis on scheduling simulation, and finally all of that will come down to load balancing virtual machines in our KubeVirt world with an experimental load balancing algorithm. We'll have an example at the end. So, first of all, Kubernetes. Kubernetes, as you know, is an open source platform designed to automate deployment, scaling, and orchestration of containerized applications, doing all that at the cluster level. At a minimum, Kubernetes can schedule and run containerized applications across multiple virtual or physical hosts, up to around 5,000 of them, in a manner that maximizes the host resources. And Kubernetes' containerized infrastructure allows you to control and automate application deployments.
In addition, it also allows you to mount and add storage to keep these applications stateful. Because containers usually have ephemeral disk files that will be lost after they're gone, we need persistent volumes to keep stateful data and to allow containers to share data among themselves. It should be easy to scale containerized applications and resources on the fly. Kubernetes is also a declarative platform, meaning I tell Kubernetes, using a specification file, usually a YAML file, how I want applications to run, and it's guaranteed that the applications will run that way. It also provides monitoring and health tools for health checking, self-healing, etc. To understand a bit about how scheduling works, I'll talk about some of the components in Kubernetes; I won't go over the entire architecture. Basically, Kubernetes runs on top of an operating system and interacts with pods running on nodes. Nodes, for that matter, are our hosts. So let's zoom in a bit on a Kubernetes node and what it contains. Our scheduling unit in Kubernetes is a pod. A pod is a collection of containers, one or more containers that share storage and networking. Kubernetes being declarative, you create a pod with a specification file. The specification file will contain image data, storage data, and the resources I want to allocate for that specific pod, such as CPU and memory. These pods run on the hosts I mentioned, either physical or virtual. On a Kubernetes node, you have services that keep the node running and communicating with the outer world. One of them is the kubelet, which is some sort of agent for the node: it communicates with other components in Kubernetes and keeps an eye on the state of the world, changes, and commands. A node might also contain services such as Docker or other container runtimes, and a kube-proxy.
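To make the idea of a declarative pod specification concrete, here is a minimal, hypothetical manifest expressed as a Python dict (the name `demo-pod`, the image, and the resource numbers are all illustrative; real manifests are usually written in YAML with exactly these fields):

```python
import json

# A minimal pod manifest, expressed as a Python dict for illustration.
# The fields mirror a Kubernetes YAML spec: image data plus the CPU and
# memory the pod requests (minimum) and is limited to (maximum).
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo-pod"},
    "spec": {
        "containers": [{
            "name": "app",
            "image": "nginx:latest",
            "resources": {
                "requests": {"cpu": "250m", "memory": "128Mi"},
                "limits":   {"cpu": "500m", "memory": "256Mi"},
            },
        }],
    },
}

if __name__ == "__main__":
    # Render the manifest; in practice you would feed the equivalent
    # YAML to `kubectl create -f`.
    print(json.dumps(pod_manifest, indent=2))
```

The `requests` values are what the scheduler uses when deciding placement; the `limits` cap what the container may actually consume.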
Some of the master components relevant to this lecture: first of all, the API server. The API server is the gateway to Kubernetes. It's basically the front end. It receives commands from the outer world, from DevOps, for example, and relays them down to Kubernetes. This is how we communicate. Another major component is etcd. etcd is the backend store of Kubernetes. Kubernetes runs on resources; one resource, for example, is our pods. etcd keeps, as key-value pairs, all the information needed for our world in a stateful way. So a pod will be kept in etcd with its identity, and we can always examine its state. One of the components in Kubernetes examining the state of the world is the scheduler. But before I talk about the scheduler: the scheduler is one of the background threads in Kubernetes called controllers. Each controller looks at the state of the world, meaning in etcd, and acts upon it, running in the background at a defined interval for examining etcd. The basic job of the scheduler is to look for a pod that was created and doesn't have a node assigned to it, take that pod, and schedule it on top of one of the nodes in Kubernetes. So, we talked very briefly about Kubernetes, and now we get to KubeVirt. KubeVirt basically comes to converge two infrastructures. We have our Kubernetes infrastructure, which is about containers, but we still want to manage VMs. So KubeVirt is an add-on for Kubernetes that allows us to manage virtual machines alongside containers. This is a classical example. The container is the strong buzzword around the world; it's considered the strongest way to go right now. It's faster, it's stronger, it's scalable, and it can soar really high, like Superman, for example. On the other hand, we still have our virtual machines. They have their old bulky utility belt with a full hardware stack, starting from BIOS and network adapters, through storage virtualization, CPU, and more.
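The controller pattern described above, a loop that watches the recorded state and acts on anything out of shape, can be sketched like this. Everything here is a simplified stand-in (real controllers watch the API server rather than poll a list), but the shape of the scheduler's job is the same:

```python
# Sketch of the controller pattern: examine the state of the world and
# act on it. For the scheduler, "act" means assigning a node to every
# pod that doesn't have one yet. `pick_node` stands in for the real
# scheduling logic (predicates and priorities).

def scheduler_controller(pods, nodes, pick_node):
    """Assign a node to every pod that is still pending."""
    for pod in pods:
        if pod.get("node") is None:              # pending pod found
            pod["node"] = pick_node(pod, nodes)  # one-time, final decision
    return pods

# Hypothetical state: pod "a" is pending, pod "b" is already placed.
pods = [{"name": "a", "node": None}, {"name": "b", "node": "node-1"}]
result = scheduler_controller(pods, ["node-1", "node-2"],
                              pick_node=lambda p, ns: ns[0])
```

After one pass, pod "a" is bound to a node and pod "b" is untouched, which mirrors the one-shot nature of scheduling decisions discussed later in the talk.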
But the world is still not ready to get rid of virtual machines. The world still needs Batman too, for that matter. So that's the main reason for us to have a converged infrastructure: we don't want to manage two infrastructures, which would give us a large overhead, duplicated resources, etc. For that, we have KubeVirt. KubeVirt comes on top of Kubernetes and works with it. It drops really easily into a Kubernetes cluster; you don't need a whole additional setup, it's really simple to install. And it takes all of the Kubernetes benefits and capabilities, such as scheduling, and manages things in a declarative way. So, I talked about pods, but KubeVirt is about virtual machines. In KubeVirt, we represent virtual machines the way pods are represented in Kubernetes, declaratively, and they are managed the same way a pod is managed in Kubernetes. To understand the concept of load balancing in Kubernetes and virtual machines, think about it this way: a virtual machine runs on top of one or more pods. So when I talk about scheduling pods, it's basically scheduling virtual machines, with a twist. Now, a quick, simple overview of how scheduling works in Kubernetes. You have the client side, where you can run commands using kubectl. With kubectl, Kubernetes being a declarative platform, you initiate a command giving the specification file for the pod. That file will contain, as I said, the image, the storage, and also, for scheduling matters, how many resources I want it to consume, or the maximum it may consume and the minimum it needs. So, we created a pod, and now we have a pod definition. The pod is not running on any host yet. Once we created the pod, its data is stored in etcd, but without any host. So the scheduler, being a controller that looks at the state of the world, notices: hey, I have a pod right now, and it doesn't have any host. So I'll assign that pod to node X.
The scheduler works according to scheduling policies and some thresholds you want to define. Examples from our oVirt world can be power saving or even distribution. It works with filters, which in the Kubernetes world are predicates, and also weights, which in the Kubernetes world are priorities. Doing all of these calculations, it finally comes out with the best host for that specific pod. Once it decides what the target host or target node is, it updates the state in etcd. So the pod now has a state saying it's assigned to a specific node, node X for that matter. The agent running on node X notices that something has changed in etcd: the pod is now assigned to node X. So it says, hey, I'm supposed to run that pod, and finally starts a process which runs the pod with all of its containers on top of it. That's a very simple flow of scheduling in Kubernetes. So, now that we've talked about scheduling, let's talk a bit about load balancing. Here's a blast from the past, the disk defragmenter. It will help us a bit to understand why we even need load balancing. In your classical old hard disk, when you create files, delete files, and add capacity, you create holes, and the state of the hard disk is not as good as it was at the start. The defragmenter, for that matter, does the defragmentation work to get your disk into a more consistent state. The same thing goes for a Kubernetes cluster. You can look at pods as files. Pod creation, pod termination, and adding more nodes will get the cluster into a, I won't say unstable, but not optimized state. One of the main reasons for that is that scheduling decisions in Kubernetes are made only once. Once you create a pod, and once it's assigned to a node, it won't be moved again. The decision is final until the pod is terminated, and then a replacement one will come, and the scheduler will just schedule it to a new node.
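The predicate-then-priority flow described earlier can be sketched as follows. This is a toy model, not the real kube-scheduler: the node structure, the resource units, and the scoring function (prefer the node left with the most headroom) are all made up for illustration:

```python
# Sketch of the predicate/priority scheduling flow: filter out nodes
# that can't fit the pod (predicates), score the rest (priorities),
# and bind the pod to the best-scoring node.

def schedule(pod, nodes):
    # Predicates: keep only nodes with enough free CPU and memory.
    feasible = [n for n in nodes
                if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]]
    if not feasible:
        return None  # no fit: the pod stays pending

    # Priorities: here, simply prefer the node with the most remaining
    # free resources after placement (an illustrative weight function).
    def score(n):
        return (n["free_cpu"] - pod["cpu"]) + (n["free_mem"] - pod["mem"])

    return max(feasible, key=score)["name"]

nodes = [
    {"name": "node-1", "free_cpu": 2, "free_mem": 4},
    {"name": "node-2", "free_cpu": 8, "free_mem": 16},
]
best = schedule({"cpu": 1, "mem": 2}, nodes)
print(best)  # node-2, the node with the most headroom
```

Real predicates also cover taints, affinity, ports, and volumes, and the real priority functions are weighted sums over many such scores, but the two-phase structure is the same.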
But, again, pods are killed, and then they don't live anymore, and a replacement pod comes instead of them. Because of that, we need a way to optimize the cluster layout. We want to look at the situation and decide: okay, I have some pods running on certain nodes, and maybe shifting them to other nodes will optimize the cluster layout and give me a more evenly distributed cluster, or maybe I can move some pods to other nodes and save some power by disabling one node. A complete framework for that doesn't exist yet in Kubernetes, but there's work being done, and work that has been done so far. So, in the Kubernetes world, let's take a look at a few of the examples that currently exist. We have the classical scaling, autoscaling, so that's the cluster autoscaler. It simply notices the state of the Kubernetes cluster, and if it sees that the resources are almost exhausted, it will automatically bring up a node. Bear in mind that it brings up the node but doesn't move existing pods to the new node; the new node is for newly created pods in the future, so the state of the cluster won't get worse. Another thing is the eviction policy. The eviction policy, for balancing right now, you can look at as a concept for a specific node and not for the entire system, meaning I have a node and I set defined thresholds for that node. Once I reach these thresholds, after a grace period you defined, the node will start evicting, preempting, sorry, evicting pods from that node until it reaches a stable state again. Again, it's in the context of one node. From other virtualization concepts, and maybe other cluster concepts, you might have heard about affinity, which basically means a pod can run alongside other pods or a pod can run on certain hosts, and you also have anti-affinity, which means the complete opposite. In Kubernetes, you also have the concept of taints and tolerations.
A taint means that a node will repel certain pods, according to defined labels and definitions, but you can put a certain toleration on a pod, which means: okay, the node will reject this and that pod, but if the pod has a toleration, it can still be scheduled on that node. An example from the load balancing world, or maybe for damage control: one taint can be that once a certain node runs out of disk space or out of memory, it immediately triggers eviction of the pods running on that node, because it cannot tolerate being out of disk. And finally, another solution in Kubernetes is pod priority and preemption. Basically, it's scoring your pods, meaning I want to grade the pods from one to 1,000, for example, 1,000 being the highest rank, so critical pods will get 1,000 and not-so-critical pods will get something in the middle, maybe less. Critical pods, which have a high score, are favored over pods that have a lower score. So, for that matter, say I want to create a new pod which is critical, or I have a critical pod that went down, was created again, and needs to run. Pods with low priority will be preempted; they will be removed, and the preemption will allow the critical pods to be scheduled on a node. These are partial solutions that do relieve the state of the system, but it's still not a complete load balancing solution. So this was Kubernetes. Kubernetes also has a great repository called the Kubernetes Incubator, with interesting projects in it, and two of the projects related to load balancing are cluster capacity and the descheduler. So, before I talk about the descheduler, a quick overview of cluster capacity.
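The priority-and-preemption idea just described can be sketched as a small decision function: when a high-priority pod cannot fit on a node, pick victims among the lower-priority pods, lowest score first, until there is room. The capacities and priority numbers below are made up, and real preemption considers much more (pod disruption budgets, affinity), so treat this as a sketch of the concept only:

```python
# Sketch of pod priority and preemption: evict the lowest-priority
# pods on a node until an incoming higher-priority pod fits.

def preempt_for(node_pods, node_capacity, incoming):
    """Return the names of victims to evict so `incoming` fits,
    choosing lowest-priority pods first, or None if it can't fit."""
    used = sum(p["cpu"] for p in node_pods)
    victims = []
    for victim in sorted(node_pods, key=lambda p: p["priority"]):
        if used + incoming["cpu"] <= node_capacity:
            break  # enough room already
        if victim["priority"] < incoming["priority"]:
            victims.append(victim["name"])
            used -= victim["cpu"]  # pretend the victim is gone
    return victims if used + incoming["cpu"] <= node_capacity else None

# Hypothetical node with 4 CPUs fully used by two pods.
pods = [{"name": "batch", "cpu": 2, "priority": 100},
        {"name": "web",   "cpu": 2, "priority": 500}]
critical = {"name": "critical", "cpu": 2, "priority": 1000}
victims = preempt_for(pods, node_capacity=4, incoming=critical)
print(victims)  # ['batch']: the lowest-priority pod gives way
```

Note that only pods with strictly lower priority than the incoming pod are candidates, which is why the priority-1000 critical pod displaces the priority-100 batch pod and not the priority-500 web pod.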
Cluster capacity means I give it a pod specification with, for that matter, the desired resources, and it takes a snapshot of the state of the cluster and then tries to see on how many nodes the pod could be deployed or assigned, until it reaches a certain limit, and then it will say: okay, I scheduled this number of pods and I cannot schedule any more. The downside is that it works on the specification of one specific pod, and again, it's not a complete solution. So now we come to the descheduler. The descheduler currently looks like the thing closest to a complete load balancing solution. Like classical load balancers for workloads, with the descheduler you set a certain strategy or policy. It can be high utilization, low utilization, maybe power saving in the future. Having set that policy, it looks for a trigger, for a state in which it can do load balancing in the system, meaning: I got to a certain state, and now I can see that I can take down one pod, kill it, and let the scheduler reschedule it again. So once it gets to that situation, it kills the pod, evicts it. A pod, this being a Kubernetes concept, that is part of a deployment, replica set, or replication controller will be recreated in place of the killed pod, and the scheduler, with the new state of the cluster, will probably schedule it on a new, better node than it was on before. It does all of that with an emphasis on minimal disturbance to the cluster: I don't want to start killing pods all over and evicting them over and over again, because they might not be reassigned, or it might not improve the state of the cluster. So, one existing policy, for example, is low utilization. You can see it at the top. Once I see that a node has a lower number of pods than something I defined, and low memory and CPU, it means I might want to move some pods onto that node so I can level the system.
This can occur, for example, if an autoscaler brought up a node and it doesn't have a lot of pods running on it, and I want to balance the system. So it will basically kill a pod, and hopefully the scheduler will reschedule that pod on top of the new, idle node. Again, repeating that Kubernetes is declarative, the descheduler also gets a declarative file with that specific policy. For the example of low utilization, I'll zoom in a bit: we define specific thresholds for low utilization. You can see in the top example that I defined that I want 20% CPU, and, maybe 20% for memory is not the best example, but a simple memory threshold that wasn't reached, and also a number of pods. If a node is under all of these thresholds, it triggers the descheduler to start searching for other nodes it can evict pods from. So we can also define target thresholds for the nodes you want to evict pods from. That's the other example: values above 50% CPU, 50% memory, or 50 pods will cause pods to be evicted from that specific node. But you can't just start evicting pods from every node. You can't evict critical pods, pods that are not expected to terminate, or pods that have local storage, for example. Again, they need to be part of a deployment or replica set so they will be recreated; otherwise, killing them won't mean anything. And we want to evict, of course, best-effort pods before critical or burstable pods. So, something is still missing here that might also contribute to the descheduler project. We want to know, before we evict a pod and let Kubernetes create a new one, whether the new pod will even be scheduled on some node. Maybe we're doing all of this work in vain. We want a more accurate simulation of the system state and reassignment, not only taking a snapshot, but maybe doing it in real time. And finally, actual movement of pods to better nodes. For that, we want the ability to do scheduling simulation.
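The low-utilization policy just described can be sketched with two threshold checks. The 20/50 numbers mirror the example from the talk; whether the target check should require all resources to be over threshold or just one is a detail of the real descheduler configuration, so the `any` below is an assumption of this sketch:

```python
# Sketch of the descheduler's low-utilization idea: a node below ALL
# of the lower thresholds is underutilized and triggers rebalancing;
# a node above the target thresholds (here: ANY of them) is a
# candidate to have pods evicted from it.

thresholds        = {"cpu": 20, "memory": 20, "pods": 20}  # percent / count
target_thresholds = {"cpu": 50, "memory": 50, "pods": 50}

def is_underutilized(node):
    return all(node[k] < thresholds[k] for k in thresholds)

def is_eviction_candidate(node):
    return any(node[k] > target_thresholds[k] for k in target_thresholds)

# Hypothetical cluster: one nearly idle node, one busy node.
nodes = [
    {"name": "idle", "cpu": 5,  "memory": 10, "pods": 3},
    {"name": "busy", "cpu": 80, "memory": 70, "pods": 60},
]
underutilized = [n["name"] for n in nodes if is_underutilized(n)]
candidates    = [n["name"] for n in nodes if is_eviction_candidate(n)]
print(underutilized, candidates)  # ['idle'] ['busy']
```

When both lists are non-empty, the descheduler evicts eligible pods from the candidates and lets the scheduler, seeing the new state, place the replacements on the underutilized node.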
Scheduling simulation, or dry run functionality for that matter, doesn't exist yet in Kubernetes. It's an open issue, and we are also working to contribute on that matter, to have the ability to do dry run scheduling. Scheduling simulation basically means you have a dry run endpoint that takes a pod and eventually tells you where the pod would be created, or which node the pod would be bound to. Or even before that: I want to see whether it can even be bound to a specific node before getting the results. So these are two very simple outputs for a dry run functionality. As I said, the use cases can be very good for rescheduling and load balancing: if I can know in advance where I would schedule that pod, on which specific node, that will really help me before I start changing the cluster state and load balancing the system. For cluster capacity analysis, for example, instead of taking a whole snapshot, I can just run dry scheduling on each pod and then see what happens, and maybe it would be more accurate than having time elapse since I took that snapshot. Also for autoscaling in Kubernetes: if I scale up a node, I want to see what happens next, et cetera. You also have the capability in Kubernetes to use replacement schedulers, meaning you can create your own scheduler with maybe stricter predicates or filters. Stricter predicates, logically speaking, might not allow you to assign pods to nodes as easily as the stock Kubernetes scheduler, so we want to know better what will happen. There are some obstacles to creating this dry run functionality in Kubernetes. First of all, we don't want to affect the state of the world; we don't want to affect etcd. So we need to leverage the caching mechanism in the existing Kubernetes scheduler, which won't be easy. And we also don't want to just evict a pod and then run it back again, because that changes the state of the world. It should be something done in memory instead of changing etcd.
So it all comes down to load balancing virtual machines, with all of the concepts I mentioned before: taking Kubernetes and KubeVirt, using dry run functionality for load balancing with the descheduler, maybe another algorithm, and eventually migrating VMs. But before we talk about just running the algorithm: migrating VMs, what does it mean? Let's look at a virtual machine in an abstract way. Virtual machines run on pods, one or more pods. And we don't want to kill virtual machines; we don't want to evict the pod that a virtual machine runs on. Killing it, let's say, doesn't really give us a live migration. So, for that matter, what we want to do before killing the pod with the virtual machine is to create a replacement pod and then transfer the workload of the virtual machine to the new pod, as the result of load balancing and shifting a pod to another node. KubeVirt has a new object called a migration, which defines the destination node, defines which VM it is, and the status of the migration. Once a migration object is created, it schedules a new pod. When that pod starts, it triggers the migration, and at the end the VM moves to the new pod. So how can it work with KubeVirt? As I said, we have the descheduler. The descheduler evicts the first pod it sees on a specific node and then lets a new pod be created: it kills it, and the replacement pod is rescheduled by Kubernetes. But we don't want to kill the pod yet. So instead, we will just block pod evictions or pod deletions, and the virtual machine migration will happen in the background: we will create a migration object, a new pod will be created, and then the virtual machine will finally move there. I also talked about scheduling simulation. So we might be able to skip the descheduler project and do explicit load balancing based on migration objects, utilizing the dry run functionality. So this is an example of a brute force algorithm that would make use of such dry run functionality.
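The migration flow just described, replacement pod first, then the workload transfer, then retiring the old pod, can be sketched as follows. The object fields and status values here are simplified stand-ins, not the exact KubeVirt API, and `create_pod`/`live_migrate` are hypothetical hooks:

```python
# Sketch of the KubeVirt migration flow from the talk: create a
# migration object naming the VM and destination node, bring up a
# replacement pod there, then live-migrate the VM into it. Only after
# the transfer succeeds can the old pod go away.

def migrate_vm(vm, destination_node, create_pod, live_migrate):
    migration = {                 # the "migration" object
        "vm": vm["name"],
        "destination": destination_node,
        "status": "Scheduling",
    }
    new_pod = create_pod(vm, destination_node)  # replacement pod first
    migration["status"] = "Running"
    live_migrate(vm, new_pod)                   # transfer the workload
    vm["pod"] = new_pod                         # old pod is now disposable
    migration["status"] = "Succeeded"
    return migration

vm = {"name": "vm-a", "pod": "pod-old"}
m = migrate_vm(vm, "node-2",
               create_pod=lambda v, n: f"pod-{v['name']}-{n}",
               live_migrate=lambda v, p: None)  # stubbed-out transfer
```

The key ordering, visible in the sketch, is that the VM is never without a running pod: eviction is blocked until the migration object has carried the VM to its new home.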
I will go over all the pods in the cluster, taking them from all the nodes, dry run them with the scheduler, and see what the destination node is. For each pod where I found a destination node, I want to score it, give it some sort of score. I don't want to score the node itself; I will score placing the pod onto that node, according to its memory consumption, according to CPU, and maybe other aspects you want to add later on using priorities. So I have an overall score. For that matter, let's say that a higher score means a better migration. So I've scored the migration. Once I've gone over all possible migrations, I'll get a list of all migration results, and then, like in MapReduce, I will reduce it to the best score that I have, and according to that, I will initiate the migration. The best migration will be the one with the best overall score, and then we'll trigger a migration event and the pod will be rescheduled. This could be very CPU consuming, since it's brute force, but it's another example of how you can utilize dry run functionality in Kubernetes without using other projects, or maybe by creating your own load balancer. Okay, so we talked about Kubernetes, KubeVirt, load balancing, scheduling, dry run scheduling, but it all comes down to virtual machines and balancing virtual machines. We are leveraging Kubernetes concepts using KubeVirt, we are leveraging the descheduler, and finally we might want and wish to add more capabilities to Kubernetes, such as dry run functionality. Kubernetes, on its side, is still adding more concepts for optimizing the cluster, and in each version you'll see a new concept; the eviction policy is fairly new, and pod priority is new as well. So there's ongoing work in that area, and there's still more to come, still work in the virtual machines world, but it's all very interesting and exciting, and I think in the following year you'll see more concrete results in converged infrastructure for virtual machines and containers.
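The brute force algorithm walked through above can be sketched in a few lines. `dry_run_schedule` is a stub standing in for the dry run functionality the talk proposes (it does not exist in Kubernetes), and the scoring weights are made up; the map/reduce shape, score every candidate migration and keep the best one, is the point:

```python
# Sketch of the brute-force load-balancing algorithm: dry-run every
# pod through the scheduler, score each candidate migration, then
# reduce the list of results to the single best migration.

def score_migration(pod, node):
    # Illustrative scoring: favour moves onto nodes with more free
    # CPU and memory. A higher score means a better migration.
    return node["free_cpu"] * 1.0 + node["free_mem"] * 0.5

def best_migration(pods, nodes, dry_run_schedule):
    candidates = []
    for pod in pods:                      # map: score every possible move
        target = dry_run_schedule(pod, nodes)
        if target is not None and target["name"] != pod["node"]:
            candidates.append(
                (score_migration(pod, target), pod["name"], target["name"]))
    if not candidates:
        return None                       # nothing worth migrating
    return max(candidates)                # reduce: best overall score wins

# Hypothetical cluster: one VM pod on a loaded node, one roomy node.
nodes = [{"name": "node-1", "free_cpu": 1, "free_mem": 2},
         {"name": "node-2", "free_cpu": 6, "free_mem": 8}]
pods = [{"name": "vm-pod-a", "node": "node-1"}]
result = best_migration(
    pods, nodes,
    dry_run_schedule=lambda p, ns: max(ns, key=lambda n: n["free_cpu"]))
print(result)  # (10.0, 'vm-pod-a', 'node-2')
```

As the talk notes, this is O(pods × dry runs) and therefore expensive, but it only ever triggers the single best migration per round, which matches the minimal-disturbance goal.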
So that was it from me, and if you have any questions, I'll be happy to answer. Yes. Again, can you repeat the question? You don't want to isolate the problem. So you're asking if it's about isolation and load balancing; are you talking about throughput? No, okay. So if I got it correctly, you think the problem is isolation. Basically, again, the load balancing concept is about the state of the cluster, the state of the nodes, and resource consumption. It means that I have certain containers and pods running on a node, and the node got to a certain state where other nodes can take some of the pods. Again, if you're talking about isolation, you have the affinity concept for that, and taints and tolerations, meaning I want certain containers to run on specific nodes, if I understand correctly. Those are other concepts, affinity and taints and tolerations. The balancing concept is just about stabilizing the system in terms of workloads and resources. In VC? You said about VC. Open VC is not supported by Kubernetes. It's a similar approach. Again, it doesn't exist in Kubernetes, but I'll be happy to discuss that approach with you later on. Again, is that a naive approach for live migration on top of Kubernetes? Okay, so you asked if the descheduler approach is only for KubeVirt. The descheduler is a project for Kubernetes, not KubeVirt. But it can also contribute to KubeVirt, for virtual machines. As I said at the end of the presentation, we can leverage it: see where we would evict a pod, and then, instead of letting it be evicted, block it; in the Kubernetes world, we're talking about virtual machines running on pods. The descheduler evicts a pod, a pod is recreated, and the Kubernetes scheduler just schedules the replacement pod on a new node. That's the Kubernetes concept.
So, integrating the KubeVirt concept means blocking the pod deletion, creating a replacement pod by creating a migration object, which is part of KubeVirt, and then moving the VM to the new pod. So, are there non-KubeVirt applications for that concept? Live migration? Nope. You cannot do live migration for containers, because, again, for pods, once they're terminated, it's final. Do we have room for my question? Yeah. What's the difference between Virtual Kubelet and KubeVirt? I'm less familiar with the Microsoft project; maybe it has the same concept. So it's an integration for Microsoft Azure. I'm less familiar with that, but maybe we can talk about the differences later on. Do you have time for another question? Yeah. So the question was about load balancing network load, if I understand correctly, or disk, or concepts other than CPU and memory, on top of Kubernetes. I gave the example of CPU and memory just to grasp the idea. In Kubernetes, you can also apply labels to a pod. Other than labeling pods, I know that in terms of throughput Kubernetes does have a load balancer, which is not the load balancing I talked about; that's only about throughput. But you can also apply extra predicates. I don't know if you've read about the Kubernetes scheduler itself, but you can create your own scheduler with stricter predicates and maybe add filtering for networking, and other priorities, for that matter, also for I/O. So the concept is plain, but you can always add on top of it. More questions? We're out of time.