Hi everyone, thanks for joining us at Kubernetes on Edge Day. Welcome to our session on bridging IoT leaf devices to your edge clusters with Akri and dynamic resource allocation. My name is Eujin and I'm a product manager at Microsoft under Azure Edge. And I'm Nicolas, I work at SUSE in the Edge BU and I'm also a maintainer of Akri. All right, so today we'll go over what Akri is, what dynamic resource allocation is, and how the two can work together to optimize connecting to IoT leaf devices at the edge. We'll also do a little demo and talk about our roadmap for getting to an Akri version 1.0. So I think Eric actually did a really great job summarizing the challenges at the edge in Kubernetes today, but I'm going to focus specifically on IoT leaf devices, which include actuators, sensors, and MCU-class devices. These usually speak different protocols, and they all have different topologies. They also have heterogeneous requirements for authentication and storing secrets. They may have intermittent availability and downtime, and they're constantly scaling up and down. And most importantly, they're usually too small, too old, or too locked down to run Kubernetes themselves today. So how can we dynamically make these IoT leaf devices available to Kubernetes workloads running in edge clusters? Well, in Kubernetes we actually have the device plugin system. If you're not aware, the device plugin system facilitates the discovery, advertisement, and allocation of specialized hardware and external devices. This might include GPUs, FPGAs, high-performance NICs, et cetera. But there are a few limitations with the device plugin system today. For example, there is no real resource sharing, so devices can't be shared among containers. It also relies on a very specific Device Manager API, so you have to constantly track its versioning and compatibility.
And it's also limited to hardware that's on-node, so it doesn't really support network-based devices. So I want to talk to you about Akri, which is our CNCF Sandbox project. Akri actually stands for "A Kubernetes Resource Interface"; it also means "edge" in Greek. Akri extends the device plugin framework to make connections to these IoT leaf devices via their protocols: for example, we have OPC UA, ONVIF, udev, et cetera. And it enables resource sharing by registering these as Kubernetes resources using custom resource definitions. That means workloads can be assigned to specific devices or groups of devices, even if they're attached to other nodes. And if a node goes down, these configurations and properties remain on the cluster so that other nodes can pick up any of your lost workloads. Akri is architected with extensibility in mind, so that developers can easily add new discovery handlers and brokers. It's also built with Rust, to optimize for the edge and for resource-constrained clusters and devices. So I'll quickly go over the architecture of Akri. First, you have the Akri Configuration, as you can see on the right side. This is a CRD, and this is where the user tells Akri what kind of device to look for. So you might say, "Hey Akri, use the OPC UA protocol," and then you also specify what kind of workload you want to deploy to all the devices that are found. Then you have the discovery handlers. Akri takes in the Configuration and deploys the discovery handler that you've specified, and this goes and looks for these devices using their protocols. It then informs the agent of the discovered devices. The Akri Agent is the one that handles resource availability changes and enables the sharing of these devices. The Agent creates an Akri Instance, which is another custom resource definition, and this is made to track the availability and usage of these devices.
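A Configuration for the OPC UA case described above might look roughly like this. This is a sketch based on Akri's `akri.sh/v0` CRD; the discovery URL and broker image are placeholders, not real endpoints:

```yaml
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: opcua-devices
spec:
  # Which discovery handler to use, plus its protocol-specific details
  discoveryHandler:
    name: opcua
    discoveryDetails: |
      opcuaDiscoveryMethod:
        standard:
          discoveryUrls:
          - "opc.tcp://10.0.0.50:4840/"   # placeholder server address
  # The workload to deploy for each device that is found
  brokerSpec:
    brokerPodSpec:
      containers:
      - name: broker
        image: "example.registry/opcua-broker:latest"   # placeholder image
  # Maximum number of nodes that may use one device at the same time
  capacity: 2
```

The `capacity` field is what enables the device sharing mentioned above: the Agent advertises each discovered device with that many usage slots.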
And finally, you have the Akri Controller. The Controller sees each Akri Instance and deploys a broker pod that helps you utilize that device. So now I'll go over what dynamic resource allocation is. Dynamic resource allocation (DRA) is a fairly new thing in Kubernetes that aims to improve on what device plugins do. Basically, it has much better support for network-based devices, and far fewer of the shortcomings of device plugins, because here you have a driver with its own control plane that is able to intervene directly in the scheduling of pods. So with your driver you have more control over which pods are scheduled and where they are scheduled, and you have the ability to have devices that are shared between containers and shared between pods. You basically have far more control than you have with device plugins. As for how this works: it's very similar to how container storage currently works in Kubernetes. You define a resource class, with optional class parameters that are up to the driver, and then you use a resource claim or a resource claim template to use that resource or device within pods. You can reference them from one pod or from multiple pods. If you use a template, it gets instantiated every time you reference it; if you reference a claim, it will be the same claim everywhere. So how does it go when you apply this to Akri? It means we have to make some small changes. Currently, we have a single Configuration object that does everything: it holds the discovery information, like what protocol you want to talk and the details about that, and it also handles workload deployment, as in "for every device, I want to deploy that workload," which is something Akri does. And currently, it's a single object for all of this.
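The class/claim split described above can be sketched in YAML. This targets the `resource.k8s.io/v1alpha2` alpha API mentioned in this era of DRA, which has since evolved; the driver name and parameter reference here are illustrative, not a real driver:

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: example-devices
# The DRA driver that handles allocation for this class
driverName: example.driver.io
# Optional driver-specific class parameters (an illustrative custom resource)
parametersRef:
  apiGroup: example.driver.io
  kind: ClassParameters
  name: example-params
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: example-claim-template
spec:
  spec:
    # Each pod referencing this template gets its own claim instance
    resourceClassName: example-devices
```

As the talk notes, referencing the template yields a fresh claim per use, whereas referencing a single ResourceClaim shares one allocation.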
So first, we have to split that object in two, so we can use the discovery part as a parameter for the resource class. Here, you see in the resource class I use a parameter that is the discovery configuration, and that's how the Akri driver knows what to do. Then, when you make your claim to use a device that was discovered, you basically refer to the resource class, and you can add another parameter here: an instance filter, which is a new resource in Akri that allows you to further filter which of the discovered devices are going to be used for the claim. So it's pretty standard usage there. Another change that is needed is in the architecture of Akri. Currently, we have the Controller, which handles scheduling the workloads and monitoring node health so that it can prune a node's devices, and we have the Agent, which does everything else. The Agent is deployed on every node, so every node basically queries the discovery handlers, manages the Instances, talks with the kubelet, and manages all the device plugins. And there are many of them, because to make Akri work the way we wanted it to work, we had to create one device plugin per discovered device. So that makes a lot of device plugins to manage. The Agent also has to guess when a slot gets freed, because with device plugins you don't get any notification when a pod stops using a device. So we switch from that to splitting the Controller in two: the workload controller, which still schedules the workloads, and the driver controller, which basically does all the dynamic resource allocation work. The driver controller allocates and deallocates resource claims, watches node health so it can prune devices or nodes that are no longer available from the list of schedulable nodes, and keeps track of device usage.
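The two parameters described above, the discovery configuration on the class and the instance filter on the claim, might be wired up roughly like this. The `DiscoveryConfiguration` and `InstanceFilter` kinds, names, and the `akri.sh` driver name are hypothetical shapes based on the talk, not finalized APIs:

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: akri-opcua
driverName: akri.sh
parametersRef:
  apiGroup: akri.sh
  kind: DiscoveryConfiguration   # the discovery half of the old Configuration
  name: opcua-discovery
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: opcua-claim
spec:
  spec:
    resourceClassName: akri-opcua
    parametersRef:
      apiGroup: akri.sh
      kind: InstanceFilter       # narrows which discovered devices qualify
      name: only-precise
```

Splitting the object this way lets the same discovery setup back many claims, each with its own filter.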
So, to ensure that you don't overuse your devices, you specify a capacity in the discovery configuration, which is the maximum number of users of a device. And then the Agent does much less, so it's much simpler as well. It just queries the discovery handlers and manages the Instances, which are the discovered devices. It talks with the kubelet to do the DRA plugin part, and it manages what are called CDI entries. CDI stands for Container Device Interface, I think; I never remember the I. It's basically a description of how a device is composed, and that is passed to the container runtime via the CRI so that the device gets passed correctly to the underlying container. So basically, when you want to use a device, you have your pod, and you add your resource claim there. You add it in your container to say "this container wants to use that claim." If it's a resource claim template, you have your template that references your resource class, and then your resource class. You can also have an instance filter here that further narrows the discovered devices. The example here is about, say, a robot, where you have a property that says this one is more precise; then your container can say, "OK, I want to use any robot that is available," or "I want to use only the ones that are more precise." That's how you can filter. And these properties are exposed by the discovery handlers, so if you need to extend that to expose more properties, you can write your own discovery handler quite easily. Now we're going to try to do a live demo. So, can I let you present the architecture? Yes, so we have an edge cluster with three nodes, and you can see that we have Akri deployed on there, with the workload controller and driver controller on one of the nodes, as well as a web service for taking in the print job requests.
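Before the demo, here is roughly what the robot example above looks like from the pod's side, using the DRA pod fields from the `v1alpha2` alpha API. The template name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: robot-user
spec:
  # Instantiate a claim from the (illustrative) filtered template for this pod
  resourceClaims:
  - name: robot
    source:
      resourceClaimTemplateName: precise-robot-template
  containers:
  - name: app
    image: "example.registry/robot-app:latest"   # placeholder image
    resources:
      # This container wants to use that claim
      claims:
      - name: robot
```

If the template's class or filter only matches the "more precise" robots, the pod stays pending until one of them has a free slot.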
And we're using the mDNS discovery handler to connect to our Raspberry Pi, which is connected to our printer, and also a udev discovery handler, which is connected to our display. So basically, what we're going to do, if it works (the Wi-Fi has been a bit flaky today), is show a QR code, and you can scan the QR code to put in a print job request, and hopefully that will then print something. I will try and see if it works. So I'll go here and refresh it. OK, no, we are out of range for the Wi-Fi. It's quite flaky where I'm set up here. Maybe, as it's flaky, it comes back sometimes. Yeah... OK. No luck, sorry. So this demo will also be available in different places; I can go to the slide that lists them. OK, go back. Slideshow, this one. So yeah, we have the SUSE booth outside, where we're going to set up the demo after the talk as well, so you can see it there. Basically, we have this printer that is going to print things when you enter the data you want to print into the form you glimpsed before. We also have this set up at the Akri kiosk at the project booth during the main event, it will also still be at the SUSE booth, and we'll have a presentation on Friday at the Microsoft booth that will also feature the demo. So I'm going to get back to the roadmap and hand it back to you. All right, so before we finish up, I want to quickly talk about our roadmap for getting to an Akri v1.0. We're hoping to get there very soon. These are the three areas we're working on. First, we want to add features on top of the DRA work we're doing. We also want to enable arbitrary workload scheduling: right now you can schedule broker workload pods to any device that is found, and we want to enable scheduling other types of resources as well. Maybe you want to schedule, or create, a custom resource for every device that is found.
We also want to implement a status field to get more information about the Akri Configuration and Instance resources. For example, is the broker scheduled? Is it ready? Et cetera. We also want to add an external query service, which would enable device gating and requesting additional information such as device metadata or credentials. And then we want to do other production-readiness work, such as a security assessment. Currently, on GitHub you can find some of the threat modeling that we did. We also want to do some performance benchmarking and stress testing, just to know what the limits of Akri deployments are. And we also want to improve the end-to-end test coverage of our releases. Currently we do have a few tests, like testing OPC UA and ONVIF, but we really want to flesh these out and make sure we cover all the edge cases. And finally, we also want to make it really easy for everyone to contribute. First, we want to define a clearer maintainer policy. What does it mean to be a maintainer? Being part of our issue triage rotation and aiming to solve three bugs every quarter, or something like that. We also want to split up the repositories. Right now, the Agent and the discovery handler samples are all in one big repo, but we're going to try to split up the repositories so that we move out some of the samples, discovery handlers, and brokers. That way, if you just want to contribute to the core components you can do that, or if you want to add a discovery handler for another protocol, you can do that as well. And obviously, we want to improve the documentation, not only for using Akri but also for contributing and testing. So we have a new release, version 0.12.20; check our release notes, and you can also read our docs. We also have a bunch of community proposals on some of the work we mentioned adding for v1.0, and you can even add one yourself.
And we hope to see you in our Slack channel, #akri on the Kubernetes Slack workspace, and join our community. We have community meetings on the first Tuesday of every month at 5pm CEST or 8am PST. So yeah, again, I think Nicolas went over some of these, but we'll have the Akri demo running in a bunch of these places, so we really hope you can check it out, and if you have any questions, feel free to swing by any of these places. And we hope you leave some feedback on our session. Thank you so much for joining us today, and if you have any questions, let us know. So, the demo seems to be working now. It might work. So again, I'll try and see. Yeah. So here I have this form, and you see here, if I go back: if you scan this QR code, you get to the form. Here we see we have a few pods that are scheduled; the ones that are pending are requests that have been sent to the printer, and here you see that it prints. It basically prints what you entered into the form. So here we see the different pods getting scheduled, and here we have the resource claims that are getting assigned for every pod that is created. These resource claims are pending, except one here that is allocated, which is the one that is currently printing. And then we see the different jobs here on the specific pods, and those get scheduled and trigger a print. So yeah, a bit quicker than expected, but yeah. Yeah, so if you put in a request, please come up and grab your sticker after. All right, any questions? Hi, yeah. So if we have multiple printers with different... I'm just making a metaphor: if we have multiple GPUs with different architectures, is it still possible? Yes, well, as long as you have a discovery handler that can handle them. A configuration can only be linked to a single discovery handler, so if you need multiple discovery handlers, then they will be seen as different classes.
But if you have a single discovery handler that can handle all your devices, then they can be seen as the same class of device, so they can all be picked up and scheduled that way. So as long as we get the configuration right, we can choose which actual device we schedule a job to, right? Yes. OK, thank you very much. Again, thank you for your presentation. Based on slide 13, if I'm not mistaken, you said that you divide the controller into the driver controller and also the workload controller, right? Yeah. And you repeat some parts of the controller's tasks, like the health check and tracking device usage. Were any changes applied to those, or did you just say, "OK, we need to separate this," and the modified part is the workload controller? Well, we decided to split it to have clearer ownership. Basically, everything related to scheduling the workloads is in the workload controller, and everything linked to scheduling the devices is in the driver controller. This way, from a permission point of view, the workload controller is the only one that has the right to create the workload resources, and it is the only one that has access to the workload configuration, while the driver controller doesn't need access to this. So from a security standpoint, it's better to split them that way. So there was no need to change any of the tasks you had before, for node health or tracking of device usage? They are the same scenario, yes? Yes, they are the same scenario. It's just that we can have them in simpler controllers and agents, and we can better track the usage. Basically, with the current architecture of Akri there is a rare case where you can overuse a device, because the scheduler and the Akri Controller enter a race. This new architecture with DRA basically removes that possible race. Thank you.
Hi, thanks for the presentation. Could you maybe go to the PLC robot example, the slide with the OPC UA server, I think? Yeah, in that example, for instance, say the OPC UA server has an IP address. How would that get exposed to the workload, to the pod that's scheduled? All the properties that are discovered by the discovery handler will be exposed as environment variables to the containers. Oh, very nice. Thanks.
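The exposure described in that last answer might look roughly like this in the effective container spec after injection. The variable name and address are illustrative, not Akri's actual naming scheme:

```yaml
# Illustrative: discovered properties surfaced as container environment variables
containers:
- name: app
  env:
  - name: OPCUA_DISCOVERY_URL            # illustrative variable name
    value: "opc.tcp://192.168.1.50:4840/"  # placeholder server address
```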