Hi everyone. Good morning, good afternoon, or good evening, depending on where you're tuning in from around the world today. Welcome to today's CNCF webinar, and thank you to everyone who's joining us. Today's title is "MicroK8s HA Under the Hood: Kubernetes with Dqlite", with Canonical. I'm Kristi Tannen, and I'll be moderating today's webinar. We would like to welcome our presenter, Konstantinos Tsakalozos, senior software engineer at Canonical.

A few housekeeping items before we get started. During the webinar, attendees are muted, but there is a Q&A box at the bottom of your screen; please feel free to drop your questions in there and we'll get to as many as we can at the end. This is an official webinar of the CNCF and as such is subject to the CNCF Code of Conduct, so please do not add anything to the chat or questions that would violate it; in short, please be respectful of all your fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF webinars page at cncf.io/webinars. With that, I'll hand it over to Konstantinos to kick off today's presentation. Take it away.

Thank you, Kristi. Hi everyone. I'm Konstantinos Tsakalozos. I work for Canonical on MicroK8s, and today I will show you how high availability (HA) works in MicroK8s.

First, let's see what MicroK8s is. MicroK8s is an opinionated Kubernetes distribution: opinionated, as you will see, in the sense that we try to minimize operational friction in both single-node and multi-node clusters. At the same time, we are always CNCF conformant, meaning we do not drop any features; everything that is supported upstream is also available in MicroK8s. We target IoT devices and the edge, and I would add that we are also comfortable on developer workstations and in CI/CD pipelines. The two hardware architectures we support right now are x86 and ARM64.

MicroK8s is built on solid foundations. It is packaged as a snap, which means that when you deploy MicroK8s on your machine you get the Kubernetes binaries along with all of their dependencies and everything needed to set up a cluster. At runtime, MicroK8s remains isolated, very similar to the OCI containers you already know. The benefit of this setup is that, since we ship the whole software stack Kubernetes runs on, we have full control over updates. These updates are delivered automatically, whenever the administrator decides to have them installed on the cluster. This is particularly important, and I mention it here because, as you know, Kubernetes is a project that evolves at a very fast pace: every few weeks there is a new patch release, and that patch needs to be delivered and installed on your cluster if you want to keep it up to date. The focus on this slide is security, but the same update mechanism also delivers bug fixes, new features, and performance enhancements, all transparently to the user. The user doesn't need to do anything; the cluster just updates.

Along with the Kubernetes binaries, MicroK8s packages a set of curated addons. Each addon can be enabled with a single command: microk8s enable plus the name of the addon. These addons range from basics such as DNS, MetalLB, ingress, and local storage, to addons for monitoring and logging such as Prometheus and Jaeger, and service meshes such as Istio. I would also like to point out that we have an addon for machine learning, Kubeflow, which pairs very well with GPU enablement in a single command: microk8s enable gpu.
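To make that workflow concrete, here is a minimal sketch of enabling a few of the addons just mentioned; the addon names are those used around the time of this webinar, and newer MicroK8s releases may rename some of them:

```
# Enable curated addons, one command each.
microk8s enable dns          # CoreDNS for in-cluster name resolution
microk8s enable ingress      # NGINX ingress controller
microk8s enable metallb      # bare-metal LoadBalancer (asks for an IP range)
microk8s enable gpu          # NVIDIA GPU support for workloads

# List every addon and whether it is enabled:
microk8s status
```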
Okay, so now we will discuss how high availability is delivered in MicroK8s with zero ops. The first thing we need to do, however, is build some understanding of what Kubernetes is, how it's deployed, and how it works.

This figure, which I guess many of you have already seen, is an overview of a Kubernetes deployment. There are two types of nodes: the Kubernetes master, the green one, and the Kubernetes nodes, the blue ones. On the Kubernetes master you expect to find the cluster management services that form the control plane. The Kubernetes nodes, on the other hand, are where the user workloads run. The management services you see in this figure include the API server, which serves the API that administrators talk to; the scheduler and the controller manager, which among other things assign pods to the worker nodes; and etcd, the data store. The data store essentially keeps the state of the Kubernetes services. It doesn't need to be inside the Kubernetes master, but it is depicted like this in the figure because of its importance to the control plane. On the Kubernetes nodes, where the user workloads are hosted, you expect to find the kubelet, the container runtime, and a networking layer that spreads across all the nodes and allows pods to talk to each other. Note here that users who want to reach their workloads talk directly to the Kubernetes nodes, while administrators talk to the Kubernetes master.

To reach high availability in this setup, we essentially have to eliminate single points of failure: we want to quickly detect failures and have failover solutions if something goes wrong. Before we look at how we got to that HA state, it is important to understand that high availability is perceived differently by each entity that interacts with Kubernetes. For the end user, HA means that the workload they are running on Kubernetes is always available. For the administrator, things are a bit more complicated, as the administrator needs to make sure the HA expectations of the users are fulfilled, and they also need the API to always be responsive. That doesn't mean a specific API server instance needs to be up all the time; it just means there has to be an endpoint, some API server, through which the admins can talk to the cluster's management services. Administrators also care that the cluster has more than one node, so that workloads are spread across different nodes and, in case one of them fails, its pods are respawned on the nodes that remain online. As a consequence, there has to be reliable storage so that user data survives these pod migrations, and proper networking so that users can reach the workloads and the Kubernetes nodes. This list is not exhaustive.
Of course, administrators need to take care of many other things that are sometimes outside the scope of Kubernetes itself, for example reliable network links or load balancers external to the cluster.

One level down is Kubernetes itself, and Kubernetes, with the help of the Kubernetes distributions, needs to satisfy the admins' needs in terms of HA. For Kubernetes to be HA, there has to be a data store that is always available, the cluster must support clustering so that, again, you have more than one node, and there has to be a way to configure things like persistent storage and load balancers. In this presentation we focus mostly on provisioning an HA data store, while staying true to our goal of minimal operational friction.

The first thing we have to do is build a distributed data store, because for the cluster to be highly available there have to be multiple API servers running at the same time. So what do we do? We take SQLite, the widely used embedded database, and we put it into the API server. However, there has to be only one logical data store in the cluster at any time, so we also provide a layer that supplies the distributed part of SQLite. We call this Dqlite, and the "D" in Dqlite stands for distributed. Dqlite is an embedded component; it lives inside the API server, and each API server potentially plays the role of a node of that distributed data store. I say potentially because Dqlite transparently selects a subset of nodes to maintain data store copies; we will talk about this on a later slide. Here, we focus only on the API server.

Having solved the data store issue in this way allows us to freely replicate the API server. In effect, if we take this to an extreme, we can stop worrying about where the control plane of the Kubernetes cluster is, because all of the nodes run the API server. And this is a good thing, because by replicating the API server we also improve the reliability of the data store: the data store has more options for migrating its data in case of a failure. Again, all of this is transparent; Dqlite, the layer we added, takes care of it. The administrator only needs to make sure there are at least three nodes for an HA cluster to form.

The next thing we did, towards our goal of zero ops, was to make every node a Kubernetes worker node, a node that serves user workloads. This way, we do not need to care about where our workers and masters are, and we don't need to keep track of the master nodes' endpoints so that the workers can talk to the right masters, because the master and worker roles are collocated: the master is essentially on the same host as the worker. As you see, we have fewer moving parts, and fewer moving parts means less operational friction.

At this point, I'd like us to have a look at what the user experience of MicroK8s is. Practically, the only thing you need to know in order to operate MicroK8s is how to form a cluster, and this is done with two commands: on one node you call microk8s add-node, and then you cycle through the nodes you want to join to the cluster and call microk8s join on each.
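As a rough sketch of that flow, assuming two fresh Ubuntu machines (the IP and token below are illustrative; the real ones are generated for you by add-node):

```
# On both machines: install MicroK8s (the snap carries Kubernetes
# plus all of its dependencies).
sudo snap install microk8s --classic

# On the first node: ask for a join command for a new member.
sudo microk8s add-node
# ...which prints something like:
#   microk8s join 10.0.0.1:25000/<token>

# On each node you want to add: run the printed command.
sudo microk8s join 10.0.0.1:25000/<token>

# From any member: the new node should appear shortly.
sudo microk8s kubectl get nodes
```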
Let's see a demo of this. I have here two nodes on AWS; let me make this a bit bigger. The first thing we will do is install MicroK8s. What happens here is that we download about 200 MB of data, the MicroK8s snap package. As soon as the download completes, the Kubernetes services start, and MicroK8s makes sure the CNI, the network layer that is needed, is also deployed. So the installation on this node is done; let me do the same on the other node.

I was saying about the CNI: the default CNI we ship is Calico. If we run microk8s kubectl get all -A here, you will see that we have Calico running. At this point we have two completely separate nodes; microk8s kubectl get node shows one node on each machine, and we want to join these nodes together. So what are we going to do? We call microk8s add-node on the first of these nodes. You see that this command prints out the command you should run on the second node. Let's look at what that command is. It says sudo microk8s join, followed by a connection string. The connection string is made of two parts: the first part is an IP and a port on the first node, and the second is a token, a magic token. As soon as the first node sees this magic token, it says, okay, this new node is authorized to join the cluster. The two nodes then exchange all the information required to form a cluster, including certificates, service configurations, and user tokens; they basically form a cluster. As this happens, let's go to the first node and see if we start seeing the new node. There it is: that node registered four seconds ago. It's not ready right now, but it will soon get into the ready state. As I told you, the API server runs on both nodes, so you can now go to the second node and call the same things: microk8s kubectl get node, and you see exactly the same nodes.

Now, there is a command that reports the status of the cluster. With microk8s status we see that MicroK8s is indeed running, but we are not in an HA setup: there is only one node that holds the data store, and it's the first node, this one here. We also see the state of each addon, whether it is enabled or not. In order to reach an HA setup, we have to have at least three of these nodes connected. I won't do that here because I already have a cluster prepared with four nodes. If I run microk8s status there, you will see that three nodes hold the data store; there is also one standby node, which we will discuss on a later slide.

So this is the formation of the cluster. At this point it's a four-node Kubernetes cluster, and you can enable things like sudo microk8s enable dns from the curated set of addons. This is the DNS service; enabling it applies the manifest and also goes through all of the nodes and reconfigures them.
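For reference, on an HA-enabled cluster like the four-node one just shown, the status output looks roughly like this (trimmed, with illustrative IPs; the exact format varies between MicroK8s releases):

```
$ microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 10.0.0.1:19001 10.0.0.2:19001 10.0.0.3:19001
  datastore standby nodes: 10.0.0.4:19001
addons:
  enabled:
    dns        # CoreDNS
  disabled:
    ingress    # Ingress controller for external access
    ...
```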
Now, let me go to another node and show you what happens when we want to remove a node. Let's do a microk8s status there. Okay, this node is part of the data store master nodes; let's get it out of the cluster with microk8s leave. That's how easy it is to exit the cluster. If we run this, you will see that this node restarts, and it restarts as if it were a single-node cluster. The rest of the nodes pick up this update. And remember we had the standby node? That standby node has now become part of the data store master nodes, the nodes that keep a copy of the data store, and the cluster remained in an HA setup. It might also be interesting here to do a quick microk8s kubectl get all -A -o wide to show you that we do indeed have pods running on all of the nodes.

Okay, cool. So this is more or less the path the administrator has for managing HA: join nodes and leave nodes. Of course, there are variations, depending on what might happen in the cluster: for example, how you would remove a node that has crashed (you have to remove it forcibly), or how you would add multiple nodes with the same token; this is also supported (see the sketch after this section).

Okay, so let's focus now a bit more on Dqlite. You might be wondering why we went with Dqlite and not etcd. SQLite is possibly the most widely used embedded database out there, so its value is undisputed: it is light, reliable, and performant, and it is a great fit for zero ops, because since it is embedded we can take over all the operations the database needs. Moreover, you may have noticed that Dqlite is also used in other Canonical products, so we are committed to its success, and we feel we get multiple gains from using it in MicroK8s as well.

At this point we can dig a bit deeper into Dqlite; I will give you a very high-level view of how it works. Our contribution, the part we added to SQLite, is the distributed part: we implemented a consensus protocol based on Raft, so that all copies of the data store stay consistent. In Raft there has to be a single node that plays the role of the leader; to simplify things, this leader keeps the golden copy of the data store. Of course, if you want a leader you need a way to elect one, so you need a number of nodes that vote for that leader. This is where the requirement of at least three nodes to form an HA cluster comes from: with three voters, a majority of two survives the loss of any one node. If you keep adding nodes, the fourth and fifth become standby nodes, like the one we just saw. Standby nodes keep a copy of the data store but do not participate in the voting process; their role is that if a voter leaves, or the leader leaves, one of them can quickly be promoted to become a voter. Any nodes beyond that, the sixth, seventh, and onwards, are spare nodes: they do not vote, and they do not keep a copy of the data store; the only thing they do is proxy any database requests to the leader of the cluster.
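Here is a rough sketch of those node-management variations, assuming the flags available in recent MicroK8s releases (check microk8s add-node --help and microk8s remove-node --help on your version):

```
# Graceful departure: run on the node that is leaving.
microk8s leave

# A crashed node can no longer run `leave`, so ask a surviving node
# to drop it forcibly (by the departed node's IP or hostname):
microk8s remove-node 10.0.0.3 --force

# Reusable join tokens: a token is single-use by default, but you can
# mint one with a time-to-live and hand it to several joining nodes:
microk8s add-node --token-ttl 3600
```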
So, at this point we are ready to discuss what happens in more interesting scenarios. For example, here we have two data centers, DC1 and DC2, with MicroK8s running across them. In this deployment we have, on DC1, one voter and two standby nodes; on DC2, two voters and one spare node. So what happens if the two data centers get disconnected?

Let's first examine what happens if the leader is on DC2; the leader in this figure is marked with an asterisk. If the two data centers get disconnected, DC1 basically freezes, because it cannot reach the leader and it cannot elect a new leader: it has only one voter, so it does not have the majority of votes. DC2, on the other hand, already has the leader, so it doesn't need to go into an election phase, but it doesn't have enough voters, so the spare node is promoted, first to standby and then to voter. There is practically no disruption in this part of the cluster.

The variation of this is what happens if the leader is on DC1, where there is only one voter. In this case, the leader on DC1 realizes it has lost connection to the majority of the voters, so it stops serving requests; this way you don't risk any data corruption, and you don't risk any split-brain or inconsistency situations. However, DC1 is not able to elect a new leader, because it does not have enough voters, so this DC again freezes. The other DC, DC2, has two voters, so it is able to elect a new leader, and it does; it also, again, promotes a spare node to become a voter. So this is, very briefly, how Dqlite works.

So what's on our roadmap; what are we working on right now? We want to give MicroK8s a good user interface for failure-domain awareness, and we also want to provide a way for the administrator to hint which nodes are suitable for hosting Dqlite. Of course, we're also working on CPU and memory improvements.

Summing up: MicroK8s is built on solid foundations, we are growing production-grade features, our vision is to minimize operational friction, and we are very pleased with the direction we are heading. I would like to take the opportunity here to thank our great community: it's been great to have you, and the amount of work you're doing is much appreciated. MicroK8s and Dqlite are open source on GitHub. You can reach us through microk8s.io, our website, and if you want real-time conversations, we are also in the #microk8s channel on the Kubernetes Slack. That's it from me; I'll take any questions you may have.

Thanks so much for the presentation, that was great. We have a couple of questions coming in here; the first one I'll read off to you. Is it possible to have a MicroK8s cluster across a private network (behind NAT) and a public network, with mTLS, assuming MetalLB and CoreDNS etc. are enabled?

Okay, so this is an interesting question. I cannot tell you for sure right now, because this is not really a matter of MicroK8s; it's a Kubernetes question.
The thing I'm not sure about, based on this question, is what would happen with the CNI, the network layer that has to spread across the different Kubernetes nodes; there are certain requirements for that to work. I cannot answer this right now; we would have to look at how exactly this network is set up.

Okay, the next question is: what components of K8s are not part of MicroK8s? So, we don't drop anything from upstream. Perhaps federation is one thing you might call out, but that is not part of core upstream Kubernetes either. In MicroK8s we don't drop anything: the day upstream releases, we go and build the binaries and release the same day, so we don't have the luxury of carrying a huge patch set against the Kubernetes services and dropping features and such. I can tell you which services we run; there is an inspection script that reports which services are running in MicroK8s. On a node we run kube-proxy, the kubelet, the scheduler, the controller manager, the API server, containerd, and a couple more services that are just there for setting up MicroK8s itself. So these are the services we have; I'm not aware of any upstream Kubernetes service that we do not ship.

Okay, those were the questions submitted. Last call for questions... oh, got one right away. It says: I am new to MicroK8s; is there an example or use case with IoT and developer workstations, I mean real-time applications?

So the best place to start looking for this kind of example is tutorials.ubuntu.com; there is a tutorial there that sets up MicroK8s. I don't know, though, whether it matches the part about IoT and developer workstations. Another source of examples is the documentation at microk8s.io/docs. Real-time applications, now, that's a very broad topic; again, the best thing is for you to come into the MicroK8s channel and ask us directly, as this is a big subject.

Okay, great. Looks like there's a comment in here from someone not seeing the presentation; that's totally fine. And I think that is it. A reminder that the recording and slides will be posted later today to the CNCF webinars page, cncf.io/webinars. Thanks, Konstantinos, again for a great presentation, and thank you all for joining us. Stay safe, and hope to see you all at KubeCon + CloudNativeCon North America next week. Take care, everyone. Thank you. Bye bye.