Hi, welcome to KubeCon 2022 in Valencia. This session is about supporting edge devices using Kubernetes along with other open-source tools. I'm Steve Wong of VMware, representing the Kubernetes IoT Edge Working Group, and I'm joined by Kate Goldenring of Microsoft. Unfortunately, Kilton Hopkins of Edgeworx was also going to join us, but he can't due to a death in the family; our deepest sympathies go out to him. I'm going to cover a process for onboarding a host device to run Kubernetes. Then Kate will address onboarding of connected devices. Finally, we believe that recent technology is going to be disruptive at the edge. It's going to be big, really big, and you can be a part of a community that changes the world, and at the end we're going to tell you how to get involved.

Devices don't just have a purchase price; there's an added operational aspect. Onboarding is the process of taking a new device and initializing it to begin using it. I'm going to talk about Secure Device Onboard. This open-source tool is specifically for putting a device in service at a remote location, where physical access can be challenging and staff at the edge can be untrained and untrusted. What you'd like is to use existing logistics to ship a new or replacement device and put it in service using a random person trusted only to plug in the power and network cables and turn it on. No log-on or credentials needed. Secure Device Onboard does this even with headless or keyboardless devices.

I want to make it clear what problem space this covers and what it doesn't. This is intended to do a one-time initial update or install of software. The scope is the software running after a boot, so from the OS level on up through a container runtime, a Kubernetes distribution, and other things that run above the OS. I suppose that if you had an executable that updated firmware without triggering an immediate reboot, a one-time firmware update might be possible too.
The open-source reference implementation of this has support for Linux devices and a couple of embedded ARM devices. Using the spec, I believe you could write an implementation for other platforms, but this diagram covers what is out there today on GitHub. These are the players in the history of the project. Intel originated this work and contributed it to the LF Edge Foundation. Secure Device Onboard and FIDO Device Onboard are two names related to the same thing. The FIDO Alliance is shepherding the spec. The LF Edge org is publishing reference open-source implementations of the spec.

This requires a manufacturer to host a couple of servers. The manufacturer server is for in-house use. The rendezvous server is used by device owners and the devices themselves after a sale, and when this happens, the use of that rendezvous server is limited to a couple of brief one-time transactions. There's also an owner server, not shown in this diagram, that I'll cover in a moment. Open-source reference implementations of these servers are published on GitHub. They might not be sufficient for your production needs, but the spec is open if you want to build your own. I'm not going to cover how to deploy these servers; I just don't have time to do it in 35 minutes. But I'll tell you that Docker versions of these are out there, and I'm going to present a demo that already has these things deployed, without showing you how they were deployed.

After a device is made, it runs code that interacts with the manufacturer's server over a connection. A device GUID is generated, and the manufacturer's server makes a record of the new device's model and serial number. The device itself records a crypto hash of the manufacturer's key and the address of the manufacturer-operated rendezvous server that it contacts upon first power-up. This is stored in a TPM or other form of restricted operating environment on the device.
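To make that concrete, here is a tiny Python sketch of what gets established at manufacturing time. All field and function names here are illustrative inventions, not the real FDO wire format or client SDK:

```python
import hashlib
import uuid

def manufacture_device(manufacturer_pubkey: bytes, rendezvous_addr: str,
                       model: str, serial: str):
    """Toy model of SDO/FDO device initialization: returns what the device
    stores in restricted storage and what the manufacturer's server records.
    Illustrative only; the real protocol is defined by the FDO spec."""
    device_guid = str(uuid.uuid4())
    device_credentials = {              # held in a TPM or similar on-device
        "guid": device_guid,
        "manufacturer_key_hash": hashlib.sha256(manufacturer_pubkey).hexdigest(),
        "rendezvous_addr": rendezvous_addr,
    }
    manufacturer_record = {             # kept on the manufacturer server
        "guid": device_guid,
        "model": model,
        "serial": serial,
    }
    return device_credentials, manufacturer_record
```

The stored hash of the manufacturer's key is what later lets the device validate an ownership voucher whose signing chain starts with the manufacturer.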
Now, for a demo, you could leave it on the file system, but that's probably not a safe thing to do in production. I'm going to be doing a demo using what I'd consider a representative edge device, a Dell Edge Gateway 5000. This is a picture of it, the actual unit in my home lab on a workbench with the cover removed. It's a four-core Intel Atom CPU with 8GB of RAM, passively cooled, with an M.2 SSD and a bunch of ports suitable for industrial I/O devices: CAN bus, RS-485, serial, etc.

After sale, a digital voucher for the device is generated and given to the owner. In the voucher, the current owner is at the end of a chain and can demonstrate ownership to the device through use of a private key. The device GUID is in the voucher, which is signed by all the owners in this chain; the manufacturer is first in the list. Remember, the device also has a securely held hash of the manufacturer's key, so it will be able to validate the owner when presented with the voucher after sale. The device is not involved in voucher generation. During this generation, it can stay unpowered, sealed in a box, until it goes into operation.

The next step is that the device owner calls an API on the rendezvous server that causes the rendezvous server to host a temporary name lookup service, kind of like DNS. You do this just before you're about to power up the device. The owner provides the device GUID, an IP where the device gets redirected to the owner for configuration instructions, and a time limit, for example one hour, after which the lookup service for this device will terminate. The device is still powered off here. The device will be able to find the rendezvous server when it powers up because, remember, this was one of the elements burned into it when it was manufactured. Also observe that the owner is disclosing very little to this rendezvous server, and that's intentional.
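As a rough sketch of that registration and redirect flow, here is a toy in-memory model. The method names are invented; in the FDO spec these correspond roughly to the TO0 step (owner registers with the rendezvous server) and the TO1 step (device looks itself up and gets redirected):

```python
import time

class RendezvousServer:
    """Toy model of the rendezvous role: the owner registers a redirect
    address for a device GUID with a time limit; the device later looks
    itself up and is redirected to the owner server. Illustrative only."""

    def __init__(self):
        self._entries = {}

    def register(self, guid, owner_addr, ttl_seconds, now=None):
        # Owner-side call, made just before the device is powered on.
        now = time.time() if now is None else now
        self._entries[guid] = (owner_addr, now + ttl_seconds)

    def lookup(self, guid, now=None):
        # Device-side call on first boot: returns the owner's address,
        # or None if the registration expired or never existed.
        now = time.time() if now is None else now
        entry = self._entries.get(guid)
        if entry is None or now > entry[1]:
            return None
        return entry[0]
```

After a successful lookup, the device would contact the returned owner address and receive its onboarding payload, which in the Linux reference implementation is ultimately a bash script.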
The rendezvous server does not know who this customer is, and the IP address provided can be a temporary one, perhaps a burner address rented on a public cloud, or even an address inside a private range that the device is going to be able to reach. So this allows a retailer that doesn't want people to know where stores are opening to keep this information private.

Finally, the device is powered on. It first contacts the rendezvous server, gets quickly redirected to the owner address, which is where an owner server is being hosted, and then it connects to the owner server, where it gets a device profile that drives the initialization. The owner-to-device transactions support mutual establishment of trust over a secure channel, followed by delivery of a series of key-value pairs that are kind of open-ended in terms of how you use them to initialize the device. The reference implementation for the Linux client I'm showing here is pretty simple: the device gets a bash script and executes it. This connection isn't intended to support bulk transfers of large blobs such as an OS image. Instead, you use the connection to set up credentials and instructions that would allow you to set up a secondary channel for delivery of bulk artifacts. I haven't seen anybody do it, but I actually think, if you wanted to, you could conceivably ask the person onboarding to plug in a USB key with an encrypted, signed file, and perhaps utilize that to cut the amount of network traffic involved if you had bulk artifacts involved in initialization during onboard.

This is a recorded demo, sped up a little, showing you that manufacturer process when the device was manufactured. I started with Ubuntu and installed Docker. To accelerate this, I used a published Docker image that has the client SDK, so I didn't have to build it.
I invoke a script on that client SDK that generates keys for the device, contacts the manufacturer server, gets the credentials burned into the TPM module, and makes the record of this device being manufactured; it records it on the manufacturer server. So that was an 18-second video. I do want to shout out to Anthony Lapenna of Portainer for helping me out with this demo. I used a Docker container that he built, so I was able to avoid compiling and building my own client SDK. You don't have to use Portainer. You're going to see me using it in this demo because it has a nice GUI, but the Open Horizon project also supports Secure Device Onboard, and I suspect a number of other projects and commercial products do too.

In the interest of time, I'm moving now from videos to screenshots, but the process for getting a voucher is pretty quick. You can look at the one-liner using a Docker container here. This doesn't have to be the device, and shouldn't be the device; it's just you, the manufacturer, pulling a voucher and sending it to the customer. So here is what I pointed out with the Portainer service, what's involved there. If you're going to use Portainer, you can set it up to configure where this owner server is as an initial step of using it to drive the process as an owner. You could also do this with code you write yourself, or I believe you could do it with the Postman tool to just call the REST APIs associated with these tools. I haven't ever tried to use Postman, but I believe it would work.

You first create that device profile. Like I say, that's a bash script. If you're using Portainer, they keep a catalog of these so you can reuse one for multiple devices. You press the button, it brings up an in-browser editor, you write your bash script, save it with a name, and then you can reuse it for multiple devices. This is the example of the profile I'm using in my demo, with one omission.
Just to fit it on this single slide, I cut something out. Oh, I apologize: it's kind of small print for the in-house audience, but when you download the deck it is legible. I left out the part that installs Helm, but this is updating Ubuntu, installing kubectl and a Kubernetes distro, instantiating a single-node Kubernetes cluster on the node, and then at the very end installing Helm, which I omitted, and a Portainer agent.

At this point, we've got the profile ready and we're ready to use it. If you're using Portainer, you would upload your voucher as a customer, then pick a device name and choose the device profile from a dropdown list. Okay, this is the actual device onboard. It takes about 25 minutes to bring up, unattended, a Kubernetes cluster that is in operation. Obviously, I'm speeding this up. The Ubuntu upgrade took a long time; I intentionally wanted to challenge this, so I installed a back-level Ubuntu and took it up to current, and that takes many minutes. It's now downloading kubectl and a Kubernetes distro, and untarring the distro. This distro happens to be Tanzu Community Edition because I work on the engineering team for that, but it could, I think, be any Kubernetes distro appropriate for edge: MicroK8s, K3s, MicroShift. I don't know any reason why this would be locked into anything in particular; so long as you could install it from a bash script and maybe do a curl download, you should be good to go. At this point, the CLI for this distro is invoked and a Kubernetes cluster is starting up. I'm going to skip, in this recording, showing the Helm install and the Portainer agent install, but you get the idea if you were watching. Oh, and by the way, this is headless, but what I did here, just to record this, is I did not auto-invoke the client.
You should set it up so it triggers on device boot, but I left it manual, so I invoked it over SSH and was able to get this recording. You don't have to do it manually like that, but that's what this showed. Finally, we've got a Kubernetes cluster stood up. We want to use this remotely without going there, so you need some means of remotely managing it. This is the Portainer UI, which has a nice dashboard for this. In one element of the dashboard you can see the running pods, and I can click on a container to get the logs out of it. They also have a button you click to get a kubectl shell, but an alternative would be to install SSH and just log in there and use kubectl remotely. I want to avoid giving commercial promo appearances; many of you might be aware that there are commercial offerings made to manage and federate Kubernetes clusters at scale, so those are out there, but I'm not calling out any names because it moves into promo territory and it's an open-source conference. This got you a running Kubernetes cluster, but you probably also need devices connected to it, like cameras and other I/O. So at this point, Kate is going to talk about the tech that handles that.

Great. So Steve just talked about those servers that are on the edge, but oftentimes around these servers on the edge there's a bunch of IoT devices of a variety of kinds, gathering really important data. So the question becomes, how can we easily deploy workloads to access this data and manage these devices? In the Kubernetes context, oftentimes these devices are on the smallest end of the user edge, so they're either smart devices like IP cameras, or maybe they're constrained devices that have no extra compute to put anything else on them. So the question becomes, if we can't add them to the cluster in the traditional way as a node, how can we easily bring them into the cluster? A CNCF Sandbox project that I work on called Akri aims to provide a solution for this.
So Akri stands for A Kubernetes Resource Interface, because it's an interface that abstracts away the details of discovering a variety of IoT devices, whatever protocol they speak, and it brings them into the cluster by representing them as native Kubernetes resources. And then once Akri has discovered those resources, it can automatically deploy your workloads to use those devices. It was built in Rust because we really are focused on those small edge scenarios, and as an aside, Akri also means edge in Greek, because we're in the Kubernetes world.

Looking a little bit at the details of this: as I mentioned, we discover a device, then we create a Kubernetes resource to represent it. What that means is that in your pod spec, just as you could declaratively request CPU and memory, requiring that the workload is only scheduled to a node that has those resources, Akri extends that to the IoT space. So you can also request an IP camera. Say you have some monitoring application that you're deploying to the edge, and it needs a USB thermometer, an IP camera, and some robot arm. Well, with Akri, after we've discovered those devices and created those resources, you can add those to your pod spec, requesting them along with your compute resources. And then Akri ensures that it's scheduled only to nodes that can see those devices; maybe they're plugged into those nodes, maybe they're on the local network of those nodes. As you can see, I've bolded the protocols there for each of those devices. Currently, Akri discovers these devices via our discovery handlers, and we have support for three different protocols that we can discover devices for: udev for devices in the local Linux device file system, ONVIF for IP cameras, and OPC UA for industrial machinery. But our discovery handler interface is just a simple gRPC interface.
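As a loose illustration of how small that pluggable shape can be, here is a toy, non-gRPC Python sketch of a discovery-handler registry. Every name here is invented for illustration; the real interface is the gRPC one in the Akri repo, and the project itself is written in Rust:

```python
from typing import Callable, Dict, List

# A discovery handler takes protocol-specific settings and returns the
# devices it found on this node's network or bus.
DiscoveryHandler = Callable[[dict], List[dict]]

HANDLERS: Dict[str, DiscoveryHandler] = {}

def register_handler(protocol: str, handler: DiscoveryHandler) -> None:
    """Plug in a handler for a new protocol, even a proprietary one."""
    HANDLERS[protocol] = handler

def discover(protocol: str, settings: dict) -> List[dict]:
    """Run the registered handler for a protocol and return what it found."""
    return HANDLERS[protocol](settings)

# A fake "onvif" handler that pretends two cameras answered a network probe.
register_handler("onvif", lambda settings: [
    {"uri": "rtsp://10.1.1.10", "firmware": "1.0"},
    {"uri": "rtsp://10.1.1.11", "firmware": "1.0"},
])
```

Registering a new handler is, conceptually, how you would extend discovery to a protocol Akri doesn't ship with.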
So you can implement that in any language you choose, for any protocol you choose, maybe even proprietary ones. There are some community members working on Zeroconf for mDNS-based discovery and CoAP for more constrained environments, and we're currently having discussions around Bluetooth and MQTT, which are also very popular on the edge. So once you add a discovery handler to Akri, you're basically extending Kubernetes to be able to declaratively request a new set of IoT devices.

When we talk about workloads that are deployed on the edge to manage devices, they fall into roughly two categories. The first one is device use: you want to gather data from these devices, maybe do some pre-processing and only send some final aggregated data up to the cloud; or maybe you want to quickly pass some information on to a new workload; or maybe you're doing some protocol translation, say gathering data from a USB camera and translating that data to MQTT. The other category is managing these devices, and oftentimes managing devices means short-lived, single tasks that perform some action on a device: for example, configuring the frame rate of an IP camera, or performing a whole firmware upgrade. In Kubernetes terms, these fall into the definitions of long-running pods and Kubernetes jobs, and those are the two forms of workload Akri currently has support for deploying automatically.

Digging into that second one, the management scenarios, I wanted to walk through a scenario of using Akri to upgrade the firmware of IP cameras. In this scenario, we have a two-node cluster on the edge; we'll say it's running K3s, like Steve mentioned, though it could be any of these distros. We want to upgrade the firmware of these cameras from 1.0 to 2.0. The first thing you do when using Akri is deploy our Helm chart.
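Before the walkthrough, here is a small sketch of what the resource-request idea from a moment ago might look like in a pod spec, with a discovered camera requested right next to CPU and memory. The resource name, image name, and helper function are all illustrative; in practice Akri derives resource names from your Configuration (along the lines of akri.sh/&lt;configuration-name&gt;), so check the Akri docs for the exact form:

```python
def camera_pod_spec(akri_resource: str) -> dict:
    """Build a (simplified) pod spec dict that requests one Akri-advertised
    camera alongside compute resources. Illustrative sketch only."""
    return {
        "containers": [{
            "name": "firmware-upgrade",
            "image": "example.com/upgrade-job:latest",   # hypothetical image
            "resources": {
                "limits": {
                    "cpu": "250m",
                    "memory": "64Mi",
                    akri_resource: "1",   # request one discovered camera
                }
            },
        }]
    }
```

The scheduler then treats the camera like any other extended resource, so the pod only lands on a node that can actually see that device.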
When you install our Helm chart, you can also install a Configuration, and that's Akri's custom resource where you say what you want to find and what you want to deploy to what you've found. When you're saying what you want to find, you're specifying what protocol we're using to find that device. In this scenario, we'll say we're using ONVIF to find these cameras, and that we want to deploy a job to each camera to perform a firmware upgrade. So after you've applied your Configuration to the cluster, Akri's agent is going to look out across the network. (I think we're seeing a microphone device onboard live. Really cool. Okay. Great. I hope people have been able to hear me, but we'll probably have questions.)

Great. So you've applied your Configuration, and now Akri's agent, which runs on every node in the cluster, sees that. So now it's told: go look for these IP cameras across the network. There's the Configuration. For every camera that it discovers, it'll create our second custom resource, which is an Instance, and that represents the device: its current usage, some metadata about it. When I say current usage: with Akri, you can also constrain the number of workloads that can simultaneously use a device, and that's also represented in the Instance and maintained there. Once the Instances have been created, our controller sees that, and it will automatically deploy the workload you specified. So now we have our two firmware upgrade jobs deployed to the cluster. They're going to reach out to these IP cameras' firmware upgrade endpoint; ONVIF's device management service does have an UpgradeSystemFirmware endpoint. So in this scenario, those jobs are just going to reach out to that camera and signal that upgrade. However, not every IoT device has its own firmware upgrade endpoint. So maybe in your current device management solution you have some agents that run on your nodes on the edge.
Well, you could just change your workflow so that the job we deploy on your behalf reaches out to your local upgrade agent, and then that upgrades the camera. Or maybe you have one upgrade agent across your cluster, or it's on a different server that you want your job to reach out to; well, then you could change that as well. The point of these three different variations within this scenario is to say that Akri is really a platform to plug in your current solutions, but it makes them more declarative, just like Kubernetes is supposed to be in general. At the bottom of the slide, I've linked to a CNCF webinar we gave a couple of months ago that shows a real demo of this, performing a firmware upgrade on an IP camera with Akri, as we didn't have time to do it here.

So to summarize, the goal of Akri is really to be an open-source, standard way to connect workloads to your IoT devices and to manage your IoT devices from Kubernetes clusters. But in order to become that cohesive standard, it really needed to be as pluggable and extensible as possible, and I mentioned some of the ways we've done that. With our discovery handler interface, just that gRPC interface, you can really bring anything, including your own proprietary protocols; you can extend Akri into a kind of generic service discovery if that's your wish. And then you really do bring your own workloads, which is the second part of it. And we're really interested, as I've focused on these management scenarios with Akri: how are you bringing in your own current device registries, your own certificate store solutions for IoT device management? Are there headaches when doing that with Akri, and how can we ease those? Those are some really powerful solutions that we want to make sure are supported, so that's a call to action to let us know how we can help there. And some resources to learn more: we have our documentation site.
We have a demo that even helps you set up some fake devices, in case you want to try it out. We are a CNCF project; we have monthly community meetings, and we have our Slack channel, the Akri channel on the Kubernetes Slack. And once again, this is a link to that demo of IP camera upgrades.

So, stepping back again: in this whole talk, we've talked about first provisioning servers on the edge, and then giving these servers access to those IoT devices. There's a variety of scenarios that fit this, and a variety of devices that fit this, and so the working group is really trying to build cohesion around the heterogeneity of the edge. The cloud is very homogeneous; at the edge, we have so many different devices, so many different scenarios. And one thing that we've found in particular over the past few years of the working group is that we extend beyond Kubernetes. Of the technology we've talked about today: Akri is not in the Kubernetes source tree, it's an add-on, and SDO doesn't need to be provisioning Kubernetes. So there really are edge solutions that go beyond Kubernetes, and our working group is transitioning from Kubernetes to the broader CNCF as a working group. We're going to fall under TAG Runtime, and we're going to redraft our charter as a part of this process. So if this piqued your interest and you want to be a part of that process, please join our working group meetings. Here are those resources: we have bimonthly meetings, in both Europe time and US time, and we also have all of our previous meetings recorded, if you're curious about what we've talked about in the past. The Slack is also a great place to plug in as well. With that being said, this is about us and how you can reach out to us if you have any questions outside of this Q&A, for which we're actually going to have about 10 minutes, which is great to see. And with that being said, we're happy to take any questions that you have.
And I saw you taking pictures; the deck is available there. And usually at these KubeCons, the actual video gets posted on YouTube in a couple of months, would be my guess. So if anybody's got questions, please raise your hand, and let's get the question recorded on the mic so that the remote audience hears it. Don't be shy; someone has to be the first one asking.

How does this collide with the Matter project? Because I saw that this edge node somehow collides with the border router concept there. I don't know if those border routers fall behind the edge node that you are designing in this working group, or if it's really the same concept.

I'm not sure I understand the question. Are you talking about the onboarding of the device as it first powers up, or something related to Akri?

It's more about the concept of the edge node in this working group versus the border router concept in Project Matter, connected home over IP. They are also working on secure commissioning and onboarding and provisioning of devices, and I guess they are probably also basing their design on the FIDO Alliance. I don't know if there is some working together with those kinds of people or not.

Yeah, I've never noticed anybody working on any border routing group showing up to the meetings of this group. There are people who support scenarios that they call air-gapped, where they don't connect up to the public internet, on purpose, for security, or just because it's problematic. Right now, for the Secure Device Onboard spec that I talked about, the reference implementation sort of implies that this manufacturer server is out there in public, at a location you can get to. But I'll tell you one way around that: in my demo, effectively, I made myself the manufacturer.
I put up my own manufacturer server, and in that case I could have had the whole system behind my organization's confines, with no outside access whatsoever. So that would be one thing that I believe would work. But if you go read the spec, they initially, I think, targeted a scenario of the main manufacturers, Dell, HP, whoever, running these servers. There's no reason you can't just get one; the thing I used in my demo I just bought on eBay, a generic Linux box that happened to have a TPM module in it. I blew it out and essentially remanufactured it by starting from scratch at the BIOS level. The other thing that's in the spec but not working now is a plan to implement protocols like CoAP for devices that wouldn't even have IP connectivity. I don't believe that has been implemented right now, but it is in the plan, and if you were really eager, you could probably help out by helping build that out; it's been architected to allow it. So I'm not sure if that answered your question, but it's probably the best I can do. Can you hand the mic to the person right behind you?
Hello, very nice presentation, very nice projects. I wanted to ask if you have explored the problem of, at the end of onboarding, providing a cryptographic device identity that you can trust, that you can use, for example, to trust that device or to have devices trust each other.

There's no reason you can't have that be part of the owner onboard. In the spec, where the device connects to the owner server, it exchanges key-value pairs, so part of that payload could easily be keys, certificates, whatever, addresses for where you establish secure tunnels. So I think that might be the process that you'd use.

I was wondering, because that is something that I think is not really defined, for example, in the FIDO standard, when you onboard.

Yeah, they specifically didn't try to do things like day-2 operations, just to keep it simple enough. But the contention is that they allow you to install tooling that supports day-2 maintenance operations, and there are plenty of people out there in enterprises that already have sort of a legacy system for managing those securely. So the thought is that maybe a lot of those users aren't going to switch to something new anyway, so if you boot it up to the point where they can use what they're used to and already have in place, that would probably be the most popular solution.

Yeah, I was wondering, it's a problem that everybody will need to solve at that point, so maybe also try to bring some sort of standard or tools. The other day, at the cert-manager project, they pointed me to the SPIFFE project, which maybe could be interesting for this.
Yeah, the other thing is, you know, you say, oh, it's easy with these key-value pairs, but if you look at a threat model, you really probably ought to put those into the TPM module, because these edge devices, for the common use case, are probably not physically secured. So you have to worry about people stealing them, breaking into them undetected, and all these scenarios. So you clearly want to have some kind of tamper-resistant storage in there to hold keys and things, and just the fact that you got these key-value pairs didn't get them into the tamper-resistant storage. But if you're executing a bash script, presumably you can come up with a means to put it where it needs to go: a TPM, or they have these PUF modules that are lower cost. I mean, that's a very rapidly moving area, and FIDO Device Onboard really didn't constrain you as to what secure store you have; it's kind of open-ended, but you realistically ought to have one. Any other questions? We've still got time for one or two. Here, let me get you the mic.

Thank you. Thank you, guys, great presentation, and thanks a lot. I really enjoyed the live mic onboard during the session, too; that was great as well. So, yeah, I'm coming from the telco space, and when we hear about edge, immediately we go into our 5G MEC solutions, if you know, multi-access edge computing. I was wondering if you do things in these areas as well, or if it's totally unrelated.

In this group, we've had a history of telcos, but I have to say that the bulk of the conversations are not telco-specific. There are other groups out there where pretty much all the audience is telco-interested, and even within telcos, I think you mentioned MEC, but software-defined radio, there are so many niches in that stuff that I think it's even somewhat forked in terms of where those things go. The biggest organization I'm aware of was something that started way back in the OpenStack days, when a lot of telcos were using OpenStack technology rather than Kubernetes, but it's
morphed into covering a bit of Kubernetes, and I think it has the broadest user base, if you wanted to find that out. It's not under the CNCF, but I think a lot of that telco conversation, they have so many things to worry about that certainly things going on in this group are relevant, but it's by no means telco-focused. This group is pretty broad, covering kind of the general things that everybody needs, I would say, and historically it's been general things everybody needs related to Kubernetes, but when we move over to the CNCF, maybe we'll go a little bit broader than that. The latest cool-kid talk that's been leaking into our group is WebAssembly, where there's a lot of discussion. WebAssembly is a cool technology that is kind of container-like, like Docker, but with smaller packaging, a tighter sandbox, lower startup latency, less variability in the latency, and it's CPU-agnostic. Having been a member of this group for years, I've seen it kind of transition into cool technologies as they're evolving, at a general what-if level, and then we move on to other things. So it's kind of a general-interest thing, not specific to one use case like telco. Any other? Oh, follow up, go ahead.

Thank you very much. We have time probably for one more, and while the mic's being passed: if you're trying to find a space for that 5G community, it might be worth joining the Slack and asking if anyone else is there having the same question, because there is a fair user base already in there.

Thanks for the talk, it's quite interesting. Are you doing any collaboration on the firmware side, on the boot loader side, with UEFI or EDK2, this TianoCore thing?

Well, this group is not, and I really dug through that Secure Device Onboard spec to do this demo, and I don't think that they have an interface at a level that would expect to go there; they didn't anticipate doing firmware patches. Like I mentioned in the talk, once you can run a binary, if you could get a binary to
run at a command line to do a firmware patch, I don't see why it wouldn't work, although if it triggered an immediate reboot, it would probably halt the rest of the script, and that would not be desirable. In terms of going at the management interfaces that are built in, or the boot process, no, it doesn't; it assumes that the thing got booted okay. That's outside the scope of that project.

But do you not see a need there? Because you have things like, you know, the Broadcom boot-rescue thing, where if a firmware flash fails, it comes back up on a rescue image, and then you need to go down to the Ethernet level and actually send some magic packets. In the IP camera use case, this is important, because you're not going to get a ladder and go up and do some funky stuff with the reset button to flash it.

I think, as a general principle, in all devices, whether they're compute or I/O devices like cameras, the bulletproof ones essentially need to spend the money to double up the flash storage there, so that when you're doing an upgrade the original stays untouched, maybe with a watchdog timer, so that if the new stuff doesn't work and it doesn't come up, you can revert to the old one unattended. But that isn't something that is really in the scope of our group so far. You know, we started out being under the Kubernetes project, so kind of at a higher level of the stack than firmware. I've got nothing against people coming in and discussing it, but I suspect there are better homes for those talks than our group. I don't know what those would be, but we've got plenty of people who have that problem; we just don't have ownership of any code that solves that problem. If you want to come to the group and you're aware of something and want to give a presentation and share it, I'd welcome that.

They're doing something in this direction as well. Okay, thank you very much.

Okay, at this point we're past time, so we'll break. Kate can probably hang around; I have another
talk I have to go to, so I've only got a couple of minutes. If you want to say hello and set up a meeting later, I'm willing to do that. But thanks for coming; it's been great. Like I say, if you liked it, start coming to the working group meetings. Look for the space, because as we move, maybe some of our URLs and things for this group are going to change; that should happen in the next month or two. But once again, thank you. Thanks, everyone.