Hello and welcome to the "Managing Kubernetes Edge Fleets Prone to Network Faults" talk for the Open Networking and Edge Summit. I'm Mark Abrams, a field engineer and edge specialist with Rancher Labs. I work with pre- and post-sales teams as a technical resource, helping teams learn how to build things out on the edge using Kubernetes, learning about how they're doing it before we get involved, and helping the Rancher team develop the right product by interfacing with those teams.

I have an agenda for us today, and I want to do things a little differently. Normally I save my demos for the end of a presentation, but because of the way we're presenting today, I want to make sure I get the demo in, so we'll demo first. You can always come back to it if you're thinking about it during the presentation — it's pre-recorded, so you can roll back and take another look at what I was doing — and we'll talk about it a little afterward. After that, I want to cover some of the foundations for this presentation, and then containers and container orchestration. Many of you already know about containers and container orchestration, but some of you don't, so I want to make sure I cover it for everybody who's interested in this information. I'll do it briefly — it's a summary. If you don't already know about containers and container orchestration, this is not the only talk you should be watching; there's a lot more to it, and you should go check it out. And of course, I'm going to talk about network fault tolerance — what it is — and give some examples of where we see it on the edge, with our customers and the people I talk to day in and day out.
Then I'll talk about K3s in practice — lightweight Kubernetes — and how it can be used at the edge with network fault tolerance. "Fleet" is part of the title: fleet refers to managing large numbers of anything, and in this presentation I'm really talking about edge fleets of Kubernetes clusters. So let's get on with it and move right along to the demo. For that, I'm going to go to a pre-recorded terminal session.

If you look up at the top left here, you'll see I've already loaded a command, `watch kubectl get pods`, and that's the first command I'm going to run. I'll play this and talk through it as it goes. I run that command and there's nothing there — I get an error. It doesn't find any resources, because there are none. So I move down to the bottom part of the terminal and apply this speed-limit YAML, which is the configuration file for some resources I want to put into Kubernetes. Those resources include the fault-tolerance namespace and a DaemonSet with the speed-limit application in it. You can see up in the top-left watch that the pod is now actually running. I try to look at the logs for this speed-limit container that popped up, but nothing comes up — there are no logs at first. That's because there's an initialization container — an init container — that is holding back the main container from starting while it checks that the network speeds are acceptable. It checks, figures out that, yep, the network speeds are good, my main container pops up, and it starts downloading content. I'm just looping through and pulling down cncf.io again and again, until I go over to my network control and start throttling the network for this device.
This is something I can do to demonstrate what it looks like when my bandwidth goes sour — bad bandwidth. Rather than stopping the speed-limit container — you can see I didn't; the container is still running — the app just stops downloading. We're in another part of the app where it stops pulling down the CNCF content. Then I go back and throttle the network back up, and you'll see it starts downloading content again. So this is just one representation of what you can do using Kubernetes to manage network faults.

What I have here — and I'll pause the video, because this is static information — is a view of what's actually deployed into this Kubernetes cluster, this K3s cluster. The first thing is an init container, and that init container is really just running Speedtest. This is something you've probably run in your browser before to test your own home network, or the network in a coffee shop that isn't running well. It's also available as a command-line tool, so I'm able to run it in the container, get the results, and decide what to do with my workload. The init container prevents the workload from starting. Once the workload has started, I have a liveness check that runs periodically: you can see it delays 20 seconds before the first check, allows a timeout of 30 seconds for each run, and then repeats every five seconds. The liveness check is almost identical to the init container, but it's part of the functionality I get from container orchestration. So I'm managing my application — managing how much it accesses the network — using Kubernetes facilities. Well, that's the demo. Let's talk about how I got there and the significance of the other parts of network fault tolerance.
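As a rough sketch, the manifest from the demo might look something like this. The image names, script path, and labels here are hypothetical stand-ins — only the namespace, the DaemonSet shape, and the probe timings come from what was shown:

```yaml
# Hypothetical reconstruction of the demo's DaemonSet; images and the
# check script are illustrative, not the actual demo artifacts.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: speed-limit
  namespace: fault-tolerance
spec:
  selector:
    matchLabels:
      app: speed-limit
  template:
    metadata:
      labels:
        app: speed-limit
    spec:
      initContainers:
        - name: check-bandwidth            # blocks the main container until the link is fast enough
          image: example/speedtest:latest  # hypothetical image wrapping the speedtest CLI
          command: ["/check-speed.sh"]     # exits non-zero while bandwidth is below threshold
      containers:
        - name: speed-limit
          image: example/speed-limit:latest
          livenessProbe:                   # same speed check, re-run by the kubelet
            exec:
              command: ["/check-speed.sh"]
            initialDelaySeconds: 20        # the timings mentioned in the demo
            timeoutSeconds: 30
            periodSeconds: 5
```

Because a DaemonSet schedules one pod per node, every device in the cluster gets its own copy of this bandwidth-gated workload.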
Why use Kubernetes on the edge, and how can it help us with problems like this? Here are some of the foundations for this presentation — constraints I saw when looking at network fault tolerance with our customers and prospects, across lots of different edge scenarios. It turns out one of the most common problems for edge devices is network fault tolerance. It's extremely common: low-bandwidth connectivity, spotty connectivity. Very often, when we have conversations with customers working on the edge, those are issues they have. Sometimes it's even an air-gapped network — they've built a network that doesn't allow any interaction with the outside at all, so it's not really a network fault. Often the edge is disconnected for long periods of time.

The other common theme for the edge stack is hundreds to tens of thousands of devices — just massive numbers of devices at the edge. That's what I call a fleet: a large number of devices. I did this demo on a single device, so it applies to a single device, but we need to be able to handle this across many devices. I'm not going to talk a lot about fleet management today — there wasn't enough time — but the idea of network fault tolerance in fleets is really common, and there will be other talks. I have some upcoming talks about fleet as well, so hopefully that will be coming to a Linux Foundation event near you soon.

I've also seen non-homogeneous resources at the edge. In the data center, the idea is that every server has CPU and RAM, and that's what our applications need. At the edge, that's not the case — what we find is that there's more than just CPU and RAM.
In the data center it's sometimes GPUs; at the edge it's GPUs, but also actuators and sensors — hardware right on the device. For example, in one of my demos, which I won't show you today, I actually control the NIC right from Kubernetes. We may have other types of devices too: Bluetooth, Zigbee, things like that. We'll talk about that later.

So why container orchestration? Why should I use container orchestration at the edge? I want systemic consistency across the enterprise. I want my development teams who are building apps to have consistency even as we go out to the edge. Traditionally, the edge has been embedded systems: my app is tightly coupled with my operating system, and to update that package I need to do a firmware upgrade. It's an onerous, problematic task, and it doesn't happen often. Whereas for the application life cycle in the data center, we're approaching momentary updates — we'll never get to zero minutes, but we're approaching it. Not everybody is doing that today, but that was the goal; in part, that was the impetus behind distributed computing.

Container orchestration brings me availability, scalability, and at a minimum, reliability. It allows me to separate my concerns at the edge: I can take that embedded model, where my app is built in, and separate the hardware from the operating system from the application, update each independently, and keep my app on a more consistent update cadence — an application life cycle more similar to what we see in the cloud. These are all things we see our edge customers wanting, and they can achieve them through Kubernetes and container orchestration.
In addition, I get remote management. Through Kubernetes I can even update the hardware — my OS and other parts of the device — depending on how it's operating. It's possible to do that using Kubernetes with things like the system-upgrade-controller. And I want to manage tens to hundreds of thousands of devices.

A typical scenario on the edge — this is a typical network layout, just really superficial — is that I've got device 1 through device N, and they talk back to the data center or the cloud, what have you. Then there are other scenarios where I have these devices but I insert a gateway device in the middle: my devices don't talk directly to the cloud; they talk to the gateway, and the gateway talks to the cloud. So, pretty much these two scenarios. These networks can get much more complex downstream from the data center, but this is the general gist. The issue — for network fault tolerance, anyway — is when we lose connectivity at any one of these points. If any of these things loses connectivity to its partner, that has an impact, but it shouldn't cause total failure of the system. Again, Kubernetes can help with that.

I'll leave you with that as the foundations for this talk. We did the demo already, so next I'll touch on containers and container orchestration. I'm going to zip over it — for those of you who know it, please bear with me; for those of you who don't, this is very superficial, but I want to make sure you're with me for the rest of the talk. On the right-hand side, what you see is the traditional process layout: I've got my device, then my operating system, then some processes, and my processes have dependencies. On the left-hand side, I've containerized those processes.
They're sitting on top of this Docker runtime — you've heard of Docker; it's a container runtime — and it allows me to run these containers with their processes, so that we can fully contain the process and its dependencies in most cases. In the data center, resources are homogeneous, so I can pretty much fully contain them and depend on every machine having the resources I need, or I can use things like taints and tolerations in Kubernetes to target specific hardware. At the edge, however, resources are absolutely not homogeneous. But they are finite and known for any given scenario, which means at some point I must decide what hardware, operating system, sensor, actuator — whatever it is — my edge device will have, and I can start targeting processes to the resources available there.

With that, let's look at what we get when we insert container orchestration. Containers alone aren't enough: they can contain my app, sort of, but some of the dependencies are in the hardware, not in the container itself. With container orchestration, I'm adding another layer. You can see I have K3s here, which is Kubernetes, and it has a CRI — the container runtime interface. K3s happens to use containerd, not Docker, but it runs my pods with my containers, so it looks very similar to what we had; it's just that my container runtime sits inside the container orchestration application. This allows me to do things like networking and scheduling. And when I talk about scheduling, I'm not talking about putting things on your calendar; I'm talking about scheduling resources — what workloads need what resources, and where are they? Kubernetes can help me with that. I can manage app and service life cycles. I can get scalability and reliability of the services. And if I have multiple nodes clustered together, I can get availability.
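To give a flavor of the hardware targeting mentioned above, here's a minimal sketch. The label key, taint, and image are hypothetical — the mechanics (nodeSelector plus a toleration against a taint) are standard Kubernetes:

```yaml
# Hypothetical example of steering a workload onto nodes with specific
# edge hardware: a node label selects the hardware, and a toleration lets
# the pod land on nodes tainted to repel everything else.
apiVersion: v1
kind: Pod
metadata:
  name: sensor-reader
spec:
  nodeSelector:
    hardware/zigbee: "true"    # only schedule where this label was applied
  tolerations:
    - key: "dedicated"         # matches a taint like: dedicated=sensors:NoSchedule
      operator: "Equal"
      value: "sensors"
      effect: "NoSchedule"
  containers:
    - name: reader
      image: example/sensor-reader:latest   # illustrative image
```

The taint keeps general workloads off the specialized node; the toleration plus nodeSelector ensures this pod runs only where the sensor hardware actually exists.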
I can have things that keep functioning even when one of the nodes goes out. It works like a resource pool — this is really what we see in the cloud, where we pool our resources of CPU and RAM — and then the four pods from the previous slide can be spread across those resources. Now, in the example I showed you earlier, we actually had a DaemonSet, so that's one pod per node: a DaemonSet means I'm going to run it on every node. In the situation I'm simulating, every node is networked in some way, and every node needs access to that network information, so we use a DaemonSet.

Let's talk a little about container orchestration on the edge. One scenario that's actually very common is single-node clusters: a single device where people just want to take advantage of some of the orchestration capabilities, or they're using container orchestration as a stand-in for a supervisor — I'll talk about that a bit further on in the presentation. Underneath the container orchestration layer, of course, sits the container runtime. And with the edge, we have device size limitations. K3s will run well in about one CPU with a gig of RAM. We often get requests to run it in under a quarter of that — about 256 megabytes of RAM — but that's not something we can really do well today. I've run it in about 512 megabytes of RAM, but we do recommend one gig.

All right, let's get on to the good stuff: network fault tolerance. Some types of network faults. I showed you a network fault that had to do with low bandwidth — a bandwidth limit was hit. But in my diagrams I was showing total loss of communication: the gateway just blows out and can't communicate. And then, if we have single nodes, what happens when they lose connectivity?
With a single-node cluster, of course, that's it — it's out, just like a total loss. With a multi-node HA cluster in Kubernetes or another container orchestrator — if the cluster itself is HA — I might be able to move a workload over to another node. So there are some advantages with multi-node clusters, and we do see scenarios where it's possible to have multiple nodes and that HA capability. It varies with your use case, of course. And then with multi-node non-HA clusters, we need to do something; it's a problem, but I don't know exactly what the problem is until you tell me more about your use case.

From here, I want to look at specific examples from customers I've talked to. Obviously I can't share customer names, but here's one. We've talked to a number of customers and prospects that have this issue: they have service vehicles. This is pretty common in the energy industry. Think of a large truck — like a dump truck or garbage truck — but instead of a garbage mechanism in the back, it's basically got a data center in the back. They have onboard high-performance computing. They manage fleets of hundreds of these vehicles, and with that, multiple devices on each of these computers. The people on board are not IT; they're not trained in networking or data center practices — they're trained in running and operating that truck. Because they're mobile vehicles, each of these trucks has the potential to exit the network area. They're often connected only by cellular or satellite, so they already have low-bandwidth scenarios. And during a network fault, they need to continue to operate: if they're doing some sort of drilling or geodetection — mapping the earth — they need to continue operating in that capacity even without the connectivity uplink.
They recognize they can't receive updates and can't send data out, but they can handle the outage — they're aware it's going to happen. So they take action: they stop communicating upstream, they store data locally, and when the network comes back, they flush the data. That's one example.

Another example is retail stores — they could be food chains; conglomerates often own multiple retail stores. These scenarios get to a situation where they have a lot of devices. They tend to be small one-to-three-node clusters — maybe the type of device I would put under my desk, a little home device. Sometimes it's a NUC; sometimes these are Raspberry Pi-sized. The store uses business-grade or consumer-grade network service from the same providers you and I get our home network from. Network outages are generally isolated to individual stores, and there are a lot of clusters, as I was saying. Again, during a network fault, the store has to continue to operate in every capacity it can — people still need to be able to pay. Things have of course changed during the pandemic; in fact, we're seeing these chains increase the amount of technology they put into stores, not decrease it, trying to give customers the best experience possible. They take similar actions: stop communicating upstream, store data locally, transmit when the network comes back up.

The last example is a factory — an assembly line. In this scenario, they often have gateways, so everything downstream of the gateway doesn't even talk outside; it only talks to the gateway. They're often less concerned about loss of connectivity between the smaller devices — or the robots, which aren't necessarily smaller — and the gateway. These gateways often have AI capability, or are often highly available clusters, and the edge devices flow through the gateway.
But the gateway has the potential to disconnect. So again, during a network fault, their assembly lines can't go down. Fleet management can't see the clusters anymore — it can't even know about the downstream devices, and it doesn't know about the gateways — but that's okay, as long as they take the right actions. Similarly: stop trying to communicate, continue operating the assembly line, flush data on return. Maybe what you want to do is tag the gateway as unavailable — in Kubernetes, you can create labels and say, hey, this thing's not available. Sometimes this is multiple buildings on a site, each building with its own gateway, and one building could connect through another building's gateway if the networking on the site is appropriate. So you can actually keep some functionality, depending on the scenario. And often there's socket communication directly between the robotics systems that doesn't go through the gateway, so the line can continue operating.

We have about five minutes left, and I'd like to save time for questions. So, the general considerations: stop transmitting, store your data, flush, tag the device. We saw that through all those examples, and I've laid out the details on each of these slides — I'm going to breeze through them quickly, but you can pause on any of these pages if you want. The next one is storing data: we buffer and flush — flush when the network comes back up. This will vary depending on a bunch of things; there are different strategies for how you might do it. And then tag the device if there's something else that can be used in its place — maybe these devices are redundant in some way, or there's another device that can take over.
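The buffer-and-flush pattern above can be sketched in a few lines of Python. The connectivity check and the upstream sender are hypothetical stand-ins you'd replace with real I/O:

```python
from collections import deque


class StoreAndForward:
    """Buffer readings while the uplink is down; flush in order on recovery."""

    def __init__(self, send, is_connected):
        self.send = send                  # callable that transmits one item upstream
        self.is_connected = is_connected  # callable returning True when the uplink is up
        self.buffer = deque()             # local store for items awaiting transmission

    def submit(self, item):
        """Queue an item, then opportunistically flush."""
        self.buffer.append(item)
        self.flush()

    def flush(self):
        """Drain the buffer in FIFO order while connectivity holds."""
        while self.buffer and self.is_connected():
            item = self.buffer.popleft()
            try:
                self.send(item)
            except OSError:
                # Transmission failed mid-flush: put the item back, retry later.
                self.buffer.appendleft(item)
                break
```

During an outage, `is_connected` returns False, so `submit` just accumulates data locally; once the link returns, the next `submit` (or a periodic `flush`) drains the backlog in order. A production version would persist the buffer to disk so data survives a device restart.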
Container orchestration can help by recognizing network faults, managing the device hardware, bringing up an alternative network, and allowing us to do things like change default routes. Container orchestration lets us create a network. I can cordon and drain my nodes — that's where I take the workloads from one node and move them over to another. So if one device in a group of gateway devices loses connectivity, cordon and drain it, and the other devices keep operating. Everything's still running.

Let's talk about K3s practically — the practical implications of K3s. You can truly own your hardware, and in the case of network faults that means possibly owning the network interface controller. I have an example — I'll share the URL with you, though we won't have time to demonstrate it — where you use one network device to bootstrap the Kubernetes system, and then a different network device for actual communication. So there are things you can do with Kubernetes to run lower-level parts of the operating system and the devices connected to it. Other devices are possible as well.

In my first attempt, when I first started trying to solve some of these problems, I thought: I'll just run systemctl, right? Let me just take over and drive things the way the supervisor does. What I found was that to do that, I had to give privileged access — really serious access — to this container that was trying to control the NIC. And honestly, it was terrible: too much control, and I didn't like the way I was doing it. So then: what if I just take ownership from the supervisor? The supervisor is useful, but I have this Kubernetes control plane that can do pretty much everything the supervisor can do.
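That privileged-access problem can often be narrowed with Linux capabilities instead of full privileged mode. Here's a hedged sketch — the pod name and image are hypothetical; `hostNetwork`, `hostPID`, and the capability grant are standard pod-spec fields:

```yaml
# Hypothetical pod for NIC management: host networking plus a single
# Linux capability (NET_ADMIN, e.g. for changing routes or link state)
# rather than the blunt hammer of privileged: true.
apiVersion: v1
kind: Pod
metadata:
  name: nic-manager
spec:
  hostNetwork: true          # share the node's network namespace
  hostPID: true              # optional: see the node's processes
  containers:
    - name: manager
      image: example/nic-manager:latest   # illustrative image
      securityContext:
        capabilities:
          add: ["NET_ADMIN"] # far narrower than a fully privileged container
```

Whether you need each of these flags depends on your device and what you're managing; the point is that capabilities let you grant only the slice of host control the workload actually requires.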
So I can get scalability and reliability of my process, whatever it is, and I can do dependency management, which is one of the things a supervisor does. With things like the DaemonSet I showed, I can have something that runs everywhere and make sure it's always running. I can use Kubernetes Jobs to run something once, to prepare for something else that needs to run. I can use init containers — I showed you a demonstration of that — and use liveness and readiness probes to see if things are working as expected.

Container orchestration itself needs to run as a service, so the supervisor runs my K3s — my container orchestration — and that means K3s is always on: if it goes down, the supervisor makes sure my Kubernetes comes back up. Then I can use my cluster to do things like manage my devices, for network interface management. So yes, it runs under the supervisor. As for capabilities: I can access all sorts of things in Linux through Kubernetes — the process IDs, the host network, the host processes — and there are a ton more. Run `man capabilities` on Linux, or look at Kubernetes capabilities, to find out more.

With that, I'll share my related projects. The first project is the one we didn't demo today — the turnkey project, an example of owning the network interface. The second one watches for throttled bandwidth, and that's the example we did see today. What's next is up to you: define a controller. We could do all of this through a controller rather than one-off DaemonSet-type things.

Thanks for joining me today — I think my time is up, and I hope to present to you again sometime soon.

Hi. There's just one question in the Q&A. If you have any others, go ahead and put them in, and we'll probably have to switch over to Slack. The question that came up: is Fleet some offering from Rancher? Well, "fleet" refers to a fleet of devices.
The first time I heard the fleet term was probably 20 years ago, in terms of a fleet of trucks: using credit cards, companies would manage which gas stations the trucks could go to and what they could purchase on the card. So here it's a fleet of devices. A single device may be a cluster in its own right, or that device may be part of a larger cluster — it can't be both; it's either its own cluster or part of a larger one. And there is a project from Rancher, which will be released in early October — it's already out in open source on GitHub — and that is a fleet manager: it manages multiple clusters, and it's designed to manage tens of thousands of clusters in a GitOps manner. We're going to be doing demos of that in other sessions — not at a Linux Foundation event this year, but at upcoming summits. To keep the conversation going, head over to the #2-cloud-native-networking channel on our Slack workspace after the session ends.