Good morning. How often do you get to speak in your own native language at KubeCon? Welcome to our story about taming tactical Kubernetes cluster federation under edge conditions. My name is Stefan van Gastel and I am proud to be working for the Dutch Ministry of Defence. Hi, and my name is Anna Kosak. I'm a lead developer at Helendeter right now, but last year we did this work together with Stefan, when I worked at TNO. Thanks. And credit where credit is due: we tried and built the developments and experiments we're going to talk about as a team. Some of these team members can probably be spotted here at KubeCon on the solution showcase floor, replenishing their endless supply of socks and t-shirts. A small disclaimer before we start: we're going to talk about future developments that we are still experimenting with, researching and testing. At times we will be deliberately vague and incomplete, for obvious reasons.

Moving on. Before we get into the nerdy candy, I want to briefly take you through the rationale for this story. The Dutch Ministry of Defence, or MOD for short, is a diverse and versatile organization. We focus on high technology and low labor intensity. We're a small country, as you might have seen, but we have an Air Force, Army, Navy and what we call the Marechaussee. Unfortunately there is no English translation for that word, but think of it as, amongst other tasks, a military police and security force. I also want to challenge the English-speaking guests to try to pronounce "Marechaussee" after the talk, if possible. With these armed forces we cover the land, sea, air and digital domains. The Dutch MOD has three primary tasks: first, defending our national territory and that of our allies; second, enforcing the national and international rule of law; and third, providing assistance during disasters and crises. And we're not spared modern-day threats and the problems that come with them.
You don't have to read them all, but a volatile geopolitical situation, increasing cyber attacks and influence operations, and unpredictable natural threats are some examples. In addition, we must look for answers to problems such as hybrid warfare, a shortage of personnel and the risk of drowning in a sea of information. So, with the current state clearly defined, we want to direct our attention to achieving a desired state. A set of principles has already been drawn up for us, and in our efforts we focused on the goals of achieving information superiority, providing multi-domain and integrated operations, and being able to deploy both software and personnel quickly, scalably and self-supportingly.

With a clear vision of our desired state, we enlisted the help of TNO to bridge the gap between the current state and the desired state, and to help us determine which steps need to be taken and which technologies need to be explored. TNO is an independent research organization whose goal is to make knowledge applicable for companies and governments. Our goal was to improve situational awareness and decision support through information superiority. The most important parts of the research program that followed were: a clear plan of what we would, but also would not, investigate and research; the assumption that future command and control systems will be built and run cloud natively; the assumption that every vehicle, be it land, sea or air, will run some form of Kubernetes to facilitate this; and finally the premise that we want to experiment as much as possible. So, in a nutshell: how do we improve on collecting data and drawing a snapshot of a situation, all the way to deploying modern high-tech autonomous sensor platforms like you see here, leading to this kind of common operational picture? It's quite a challenge, even if you only look at the technological aspect of it.
Because what we want is technology that helps us create a concept in which software flows out, is sent out, and data, for example sensor data coming from vehicles returning from a patrol or a reconnaissance mission, is brought back in. Where small adjustments based on fast analysis, for example by on-site data scientists, can be made at the edge, resulting in retrained models that are then redeployed as software. Where joint, multinational and public organizations can work and federate together within their edge or connected cloud context. Where beyond that edge, at the edge of the edge, or what you could call the far edge, independent units are self-sufficient in their information gathering and analysis. And finally, where we have a tamed and controlled edge situation of chaotic but autonomous systems leaving and joining the collective, which we refer to as a federation of clusters and systems.

Edge computing is a broad concept with different interpretations, but the usage conditions and environments in which it takes place are generally similar. Systems are designed on the premise that there is a stable connection to the cloud, while taking into account the situations in which this is not the case. This is where, in our opinion, cloud native software excels, with its resilient and adaptive properties. But we have to turn this around and use these properties from the assumption that we usually have a complete lack of connectivity, taking into account the sporadic occasions in which we have a possibly small-bandwidth connection during physical deployment, away from a compound or home base. And while these principles are likely to be similar for non-military and military edge applications, the circumstances, conditions and environments are likely not, and tend to be more extreme and constrained in the military case. Some examples. You could imagine that both of these edge locations would benefit from running computer vision models.
On the one side, computer vision models run to detect vehicles coming into a drive-through, enabling staff to help them; on the right side, computer vision models detect vehicles approaching a checkpoint or a gate, alerting personnel to inspect them. Same use case, vastly different locations: there's no cell reception far away in the desert. Another one: slow or missing information about traffic, routing, positioning, weather and so on could endanger a happy customer and profit on the left side, but could endanger lives and a successful mission on the right side. And both of these bad boys want to detect vehicles and other relevant objects in their facility, or remote facility. So again, the conceptual idea of the use case is the same, but the environmental circumstances, requirements and constraints are most definitely not.

Now that you hopefully have a good idea of the background, challenges and ambitions for this story, we can fast forward to Kubernetes cluster federation and how it helped us realize this future picture. And when we talk about federation, we're not talking about the Federation. Sorry, Jean-Luc. We talk about Kubernetes cluster federation, and I quote: "a multi-cloud or multi-region implementation for centralized deployment and management of applications and services across multiple Kubernetes clusters." As with everything within the Kubernetes ecosystem, there is plenty of choice in tools and projects that meet this need. Anna will now take you down the rabbit hole of Kubernetes tactical cluster federation and show you what we found. Thank you, Stefan. So let's think a little bit about what kind of federation we would like.
We call this tactical federation, because you really want a set of devices that are somewhere in the field, in the geographical vicinity of each other, to create an ad hoc cloud that is resilient, distributed and collaborative. For this kind of cloud we have a few wishes. First of all, we would like these clouds to be able to regroup, so that any vehicle can come or go from the federation without affecting the rest of the federation. We'd like each vehicle and cloud to be aware of its surroundings, whether that is the distance to the others or the networking between them. And we would really like to be able to observe the federation, but from the point of view of a specific vehicle, not a bird's-eye view. Exactly as it is in the field: you can only see what you can see.

Okay, so how would that work? We have a simple scenario we want to take you through. Here we have three clusters in the federated cloud: clusters A, B and C. Cluster A has quite a lot of compute, so it can run computations; perhaps it's a bigger vehicle. But this cluster doesn't really have a data source, so it's looking for a camera in the area to collect some data and observe the environment. There are two cameras, B and C, and they are pretty similar; however, the networking conditions to one are better and to the other are worse. Cluster A would like to deploy a distributed streaming application, so network conditions are important. We want to deploy this streaming application from cluster A to B or C and stream, which means we need good networking conditions, so we would choose cluster B. Now cluster B is chosen and the network is amazing, but we know we are at the edge, so we make the application stateful, just in case. B starts streaming the data back: we have the application running on cluster A and cluster B to facilitate the connection.
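The selection step in this scenario can be sketched in a few lines of Python. This is a toy sketch, not code from the actual testbed; the cluster names, cost values and the threshold are purely illustrative.

```python
# Toy sketch: cluster A picks the camera cluster with the lowest
# network cost, refusing candidates whose link is too degraded.
# Names, costs and the threshold are illustrative, not from the testbed.

def choose_target(candidates, max_cost=600):
    """Return the candidate with the lowest link cost, or None if
    every candidate is above the acceptable threshold."""
    usable = {name: cost for name, cost in candidates.items() if cost <= max_cost}
    if not usable:
        return None
    return min(usable, key=usable.get)

# Cluster A sees two similar cameras; B happens to have the better link.
cameras = {"cluster-b": 120.0, "cluster-c": 480.0}
print(choose_target(cameras))  # cluster-b
```

The same idea, expressed as a scheduler metric rather than an ad hoc function, is what the rest of the talk builds toward.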
Alright, as we expected, cluster B's scanning pattern took it away somewhere behind a mountain and we lost the connection. Now the federated cluster is much smaller. We still have cluster A, which deployed the application, but what should cluster A do? How long should it wait for cluster B to return? Should it maybe reschedule the application? What we want is an agreement between the two clusters in the federation about how long they may be disconnected, and how long is too long. If that time passes, cluster A can decide: okay, I really need this application, I'm going to reschedule it to one of the cameras that is still available. And if the cluster comes back within the specified time, we just continue streaming like nothing happened, because we are at the edge. So these are not really high expectations; this is totally reasonable for an edge application. Let's see how we can make it happen.

The first wish we talked about was being able to join and leave the federation, and we want this to be natural and expected. Here we have a red federation, and the green car is part of that federation; then the green car would like to leave, join another federation, and never come back. That is our wish and how it would look. We first tried KubeFed for this. KubeFed creates hierarchical federations and is quite well known in the community. It has two types of clusters: a host cluster and, here, two joined clusters. Hierarchical means that one cluster is more important than the others: the host cluster schedules applications on the other clusters, using them as resources.
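Looping back to the disconnection scenario for a moment: the "how long is too long" agreement can be sketched as a simple timer check. The 90-second window and the return values are invented for illustration; the real agreement is negotiated between clusters.

```python
# Toy sketch of the disconnection agreement between two federated
# clusters: wait out an agreed grace period before rescheduling, and
# resume as if nothing happened when the peer returns in time.
# The 90 s window is illustrative, not a value from the testbed.

GRACE_PERIOD_S = 90

def decide(last_seen, now, grace_period=GRACE_PERIOD_S):
    """'wait' while the peer is still within the agreed window,
    'reschedule' once it has been gone too long."""
    if now - last_seen <= grace_period:
        return "wait"        # cluster B may still come back from behind the mountain
    return "reschedule"      # too long gone: move the app to another available camera

print(decide(last_seen=0, now=30))   # wait
print(decide(last_seen=0, now=120))  # reschedule
```

Because the application is stateful, the "wait" branch is what lets streaming continue as if nothing happened when the peer reappears in time.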
Joining this federation as a blue or a green vehicle is fine; as many as want to can join. But leaving the federation is slightly more difficult, because if it's the host, the blue one, that leaves, there's no federation anymore; everything is broken. And if another vehicle then comes along and would like to schedule applications somewhere, that's also not allowed, because we would have two heads on this beast, and that is not allowed. So this doesn't really fit our requirements; we don't have this easy-come, easy-go functionality. So we looked for something else.

We found a technology called Liqo, with which we can create federation constellations. That means you create a federation with just the one cluster next to you: you make an agreement about which direction you want to federate in, whether you want to provide resources or consume resources, and it can be bi-directional. Then another cluster can create the same kind of agreement with a different cluster, and you get a constellation of agreements, making a bigger cloud. Joining this is fine. If any cluster disappears, that's also fine, because the federation is peer-to-peer, so it doesn't really matter. And if another cluster appears that wants to join, everybody's welcome, and everybody can schedule whatever they want, as long as the peer-to-peer agreement is there.
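That peer-to-peer property can be sketched with a toy membership model. This is heavily simplified: real Liqo peerings are directional agreements carrying certificates and resource offers, not bare edges, and the names here are made up.

```python
# Toy sketch: why a peer-to-peer constellation survives member loss
# where a hierarchical (host/joined) federation does not. Each peering
# is modeled as an unordered pair of cluster names; in real Liqo a
# peering is a directional agreement with certificates and offers.

class Federation:
    def __init__(self):
        self.peerings = set()  # frozensets of two cluster names

    def peer(self, a, b):
        self.peerings.add(frozenset((a, b)))

    def leave(self, cluster):
        # Only the agreements involving the leaving cluster disappear;
        # there is no "host" whose departure breaks everything.
        self.peerings = {p for p in self.peerings if cluster not in p}

    def members(self):
        return set().union(*self.peerings) if self.peerings else set()

fed = Federation()
fed.peer("A", "B")
fed.peer("B", "C")
fed.peer("C", "A")

fed.leave("B")                 # a vehicle drives off behind the mountain
print(sorted(fed.members()))   # ['A', 'C'] -- the rest keeps federating
```

In the KubeFed model, by contrast, every peering runs through the host, so removing it empties the membership set in one step.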
That fits quite well with what we want to do, so check out this technology on liqo.io. Perhaps there's even somebody from Liqo in the audience? Maybe not this time, alright.

There are several major Liqo concepts that are relevant here and that actually help create these constellations. The first one is peering: as I said, the bi-directional agreement between two clusters about what to provide, how to connect, which certificates to use, everything. The second one is the network fabric, which allows pod-to-pod and pod-to-service communication via secure channels. The storage fabric allows you to defer the creation of persistent storage to another cluster, so you don't always need persistent storage on your own cluster but can defer it to a remote one. On the other side there is also data gravity: if a PV already exists on a specific cluster, pods will gravitate to that PV first, which helps stateful applications. The last thing is offloading, which is simply the offloading of applications to other clusters. One thing that doesn't really fit our use case is that Liqo was created to connect many computers, say in a university lab, to boost computation in that lab, so it assumes a reliable connection, and that we don't have. You have to figure out how to deal with that part of Liqo, but everything else fits quite well.

Okay, let's consider our next wish, which is to schedule pods, tasks, with network conditions taken into consideration. As I said, we have two possible places to schedule tasks in the federation, and we simply want to choose the better network. That means we have to do something about scheduling. Many of you know the kube-scheduler and have figured out how pods get scheduled; you can also read about it. There are two major steps that are important for scheduling, filtering and then scoring, and there are several plugins that allow
this scoring to happen. Now, none of these considers network conditions, so you would have to do something about the scheduler yourself. So how do you actually make a custom-metric scheduler? Well, you don't have to, because, first of all, the kube-scheduler allows extensions: you can hook into the plugins and affect filtering and scoring, the two most important steps. And second, somebody already made telemetry-aware scheduling, and that was Intel, thank you very much. That means we can hook something up to the scheduler and give it a metric, and all we really need is for this metric to reflect the network conditions; that's the only thing we have to do ourselves. For this we use the Optimized Link State Routing protocol, OLSR. That's basically an algorithm that tells you the state of all the links that exist from your machine to the other machines in the network. Not very many hops, only two, but that is sufficient for our application. So now we have a network cost that we can hook up to the Telemetry Aware Scheduler and use for scheduling.

Basically, all you have to do is create a custom policy resource, and you only have to supply three things. A scheduleonmetric rule saying you want to minimize your cost: lower is better, so it's a minimization. And then two rules for when not to schedule and when to de-schedule: we don't want to schedule when the network cost is above 600, and we want to de-schedule if we had the network metric but the cluster left the federation, which we annotate as minus one. That's all you need to do. It took some time to figure this out, so we have contributions from Alco and Johan, here in the room, who put it all together so that the scheduler actually works and we can use it in our application.

So how do we put all these things together? We basically needed some
kind of testbed to run this on, to figure out: all of these pieces that we selected, would they even work if we mess up the networking? So what we did is build this federation of clusters: we built three clusters, set up bi-directional federation between them, deployed the streaming application and then, yeah, messed with the network.

I will show you how it's built from the ground up. We have two VMs for each cluster; we build them with Multipass. Then we use Terraform to install Kubernetes, initially with the ordinary kube-scheduler as scheduler: vanilla Kubernetes, nothing special. Then we install Liqo and OLSR. Now Liqo starts working, and the way it works is that it creates a virtual kubelet on each node, so it looks like you have some very big extra nodes available on this cluster. A normal scheduler would then just choose one of these nodes to schedule applications on, and that would of course also happen on the other clusters, but there's too little space to visualize all of this. Alright, next Terraform replaces the scheduler and adds the extensions, so now we have the Telemetry Aware Scheduler. What happens then is that from each OLSR instance we get the network cost, and we also implemented a custom metrics adapter that takes this cost and makes it available to the scheduler, so it always knows what the networking conditions between the clusters are. Okay, now we want to mess with the network, so we use Chaos Mesh, which allows us to degrade it. Then we deploy our applications, also with Terraform, because we want everything in this experiment to be automated; we don't want to click anything.

So now the application goes, as we said, to the green cluster, and you can see that on the virtual node we actually have shadows of the objects we want to deploy; this is done by the Liqo resource reflection. The PV is actually created on the green cluster, and we have a reflection of the PVC here on this cluster. So from the point of view of the blue cluster we know exactly where the application is and where the PV is on the other cluster: we have full observability of where our applications are. Alright, now we want to change the network again: we want to disconnect green, and we do that with Chaos Mesh. While green is disconnected we can still see the shadow of the application, and the application will not be rescheduled, because we can still see it and we have this deferred unjoining, which you can set up for any amount of time. It basically tells the blue cluster not to de-schedule anything yet, but to wait until the time has passed. The cluster connects again, the Liqo resource reflection kicks in again, and everything is fine; the application works as expected. And this testbed was made right here, from the ideas to the implementation, together with the team.

Okay, one more wish; I'm almost through my wishes. I really want to have a view of the federation from the point of view of one cluster, so: all that one cluster can see, I want to see in a visualization. We asked Clarcia to make this, and she did, so now we'll show you a video. On the left side you can see a command line; at some point something will happen there. On the right side you can see a view of the cluster. We are cluster blue, so we see ourselves, and soon we'll see other clusters. Here on the timeline we can see what is actually happening on the cluster: all the events, all the pods, will be visible here. And we have the OLSR network cost, which is also shown as a timeline. Okay, so we start, and... alright, yes, it started. We start with nothing; then we can use this command-line API to find out what
exactly we have. Now we have one cluster; now another cluster appears and we start peering with it; another cluster appears and we peer with it too. You can see we already have virtual kubelets on our cluster. Now we also start getting the network information from the clusters, and we can see it on the graph. Then we decided we want to change the network cost to one of them, the yellow one, so that is changed, and we start deploying our application. Now the application goes to the orange cluster and starts up: part of it started on blue and part on orange, because the network conditions to orange were better. So we started the application, and we have data flowing from our stream; slow streaming, but still okay. Now what we're going to do is disconnect the orange cluster; we want it gone. You can see the network cost went quite high. And then we put it back, so the cluster came back, and you can see here it's back to being connected, and the data is flowing again. Yes, so that's it, that was the demo. Thank you very much.

Yeah, we're coming to the end of this story. We also learned a thing or two; all team members have some wise insights to share. Feel free to download the slides, they're already on the Sched app, so you can read them in detail. The bottom line is that there is a lot of potential for federated cloud beyond the obvious. In preparation for this talk I asked my new best friend ChatGPT, you know: what are some common uses of Kubernetes cluster federation, which companies use it, and why? And it kept coming up with lists of, you know, the big players, all with the same use cases: connecting multiple clusters over multiple geolocations or multiple physical locations. We have clearly seen, tested and experimented with a different approach, one that doesn't assume what's considered the natural behavior for cluster federation, so
that's really interesting for us to see and to continue on. Another interesting point, and it's not very common for us, is that we can take ideas we experimented with and tested in mission-critical IT environments and translate them back to our non-mission-critical IT environments. Usually it's the other way around: we take something off the internet, test it out, and then apply it to mission-critical IT.

Finally, we would like to give you a glimpse of our way forward. New knowledge-building and research programs have already been started. The one that resulted in this talk essentially ended to our satisfaction, so we started some new ones, with new policies already being drawn up, and we're seeing that our efforts are being continued to take this story to the next level. At the bottom of this iceberg you see some topics we will be researching in the near future. Trust and isolation are pretty important, obviously, and observability, and power- and temperature-aware scheduling, or actually anything-aware scheduling, that's the term we're going with right now. Because you can imagine that in a vehicle, especially on an operation or a mission, temperatures can fluctuate; even when a vehicle is, for example, damaged, temperatures can fluctuate, but so can power consumption and power supply. So those are things we're looking into.

If you got motivated in some way by this talk, feel free to reach out to me or to Johan, also present here; he will be replacing Anna, since she unfortunately left TNO. So feel free to reach out if you liked the talk, if you think you can contribute, or if you want to share some experiences. And finally, please rate this talk and give us some feedback; it's our first time speaking at KubeCon, so we can only get better from this, I hope. And absolutely finally: we're also
recruiting: around a thousand job opportunities are coming up in May for the Dutch MOD in IT. So any Dutch people here, feel free. Yeah, thank you. Any questions? I see we have three minutes.

How do you deal with data federation and replication across clusters?

We try to replicate as little as possible; full replication doesn't really work, so, yeah, we try not to use the network. That's why we use data gravity quite a lot: if the data is already on a cluster, we keep it on that cluster. You can have replication within that cluster, but not across clusters. Perhaps some applications would carry their data with them while being deployed to another cluster; yeah, definitely. Sounds cool.

Hi, thank you for the presentation, first of all. I was wondering how you handle the situation in which your edge device, the one holding the PV, gets compromised. If somebody gets access to this device, how do you revoke the key for accessing your main cluster, and how do you protect the PV that is on there? Thank you.

Yeah, like we showed you in the iceberg, this research is far from complete, so that's actually an open question. Obviously we would be cryptographically securing things, but yeah, that's definitely an open question; we're first checking the concepts. We do use certificates and keys between the clusters, so when there's a federation, there's a clear agreement based on keys and certificates, and those could be revoked, but then it has to wait for the revocation check, right? There's some other research being done in other parts of the world that would actually allow instant revocation, where you have two levels of certificates: when the first one is revoked you cannot use the second one, so you get instantly removed from the network.

Yeah, and I'm from Canada, so our federal government also faces the same challenges. We
have a couple of ways of doing it, and it's still being tested right now. Experimenting, exactly, yeah. And then it's based on some kind of trust metric, right, that you could evaluate, and then decide: in or out? Yeah, super cool, thanks. Alright, that's it. Thanks, and have a great lunch. Thank you.