Thank you very much for attending. It's one of the last sessions of the conference, so I know it's hard for all of you. My name is Ricardo Noriega. I'm a principal software engineer working in the emerging technologies organization in the office of the CTO at Red Hat. I'm Alex Mevick. I work over at Lockheed Martin as a senior infra-ops engineer working primarily on edge AI devices.

In this presentation we are going to talk about how to run AI models efficiently at the edge. Of course, you've seen Kubernetes. This talk is meant for infrastructure people rather than data scientists, but we'll explain what we've been doing, and I hope you find it interesting.

To give you a little bit of background, what is edge computing? That term can mean a lot of things to different people. In our industry we've been trying to centralize workloads for decades: we have built data centers with thousands of servers, and we have built distributed computing platforms that we call the cloud. However, more and more devices are connected to the internet, smart lights, IP cameras, sensors, et cetera, and they are all generating huge amounts of data that has to be transferred to the cloud across the globe. All the industries you see on the slide, medical, automotive, industrial, defense (for example, that one is a Lockheed Martin goodie), are generating huge amounts of data.

So for us, the definition of edge computing is basically putting computing power closer to where the data is being generated. Especially when we talk about running AI models at the edge, we want to do the inferencing process close to where the data is generated. Imagine you want to run an anomaly detection algorithm: we cannot afford to send video streams back to the cloud for that process, so we need to put the GPUs closer to the data.

As you can see on the slide, the devices we are talking about are usually single-board computers or systems-on-chip that have certain limitations and characteristics that need to be taken into account: limited resources in terms of CPU, memory, and storage; they are not extensible the way servers are (we are used to servers with PCI Express slots to plug in accelerators or more memory); they usually don't have out-of-band management interfaces, which can make them hard to access; and they are placed in remote locations where the network is sometimes absent or intermittent.

So the edge is not a data center, but it comes with similar expectations. We all want ease of management, security, scalability, all the features we have managed to build in the cloud. And here at KubeCon + CloudNativeCon we tend to focus on how we package applications into containers and how we run those applications. However, we believe the operating system is a critical piece for running a network of edge computing devices in a scalable way.

So we are trying to build what we call an edge-optimized operating system, and repeatability is one of its main features. That means moving to an image-based operating system approach rather than a package-based one: a user should be able to create their own customized operating system image and flash it onto the device, or send it to the hardware manufacturer.
So whenever you deploy thousands of devices in the field, they all have the exact same configuration, the exact same image that you built locally. On top of that, there is the onboarding process. Once you deploy the devices in the field, you need a way to onboard them into your device management systems securely. There is a project called FDO, FIDO Device Onboard, that uses keys and ownership vouchers to let you claim those devices and register them into your own systems.

Another key feature of an edge-optimized operating system is doing updates and rollbacks efficiently. With the image-based approach, an update is just a new image that is exposed somewhere: devices recognize that an update is available and download only the data that is necessary, so you don't download the full image all over again. And of course, most of these devices are deployed in the field and you probably don't have a way to SSH into them, so getting reports on the health of the fleet is very important. What happens if a device gets bricked, or the configuration in the update is not right? We need a way to do automatic rollbacks, so we always end up with a device that is fully operational.

We get these capabilities by using a technology called OSTree, in this case rpm-ostree. OSTree ships in certain flavors of Red Hat Enterprise Linux and provides a sort of version control for the operating system, like git but for the file system. Most of the file system is immutable; you cannot change it unless you do an update, as I mentioned before. You see a read-only file system, and if you want to apply a CVE fix or any other configuration change, you build a new image of the OS. The nice thing is that you download the new image, stage those changes, and on the next reboot you move to the new state fully transactionally. You have two atomic states on your device and it's all or nothing: you are either in state A or state B. This makes updates and management of devices very easy, because you can track changes along the way.

We also provide a service called Greenboot. It runs a series of health checks, and you can add your own, for the operating system or for the applications running on top. If any of the health checks fail, there is a counter and a number of attempts, and if Greenboot determines that the device is unhealthy, it automatically does a rollback: it points back to the previous working state that rpm-ostree keeps and forces a reboot, so the next time the device boots it is operational again.

It's also worth mentioning that this image-based approach is very good for CI and for testing, because you can build the image in your infrastructure, test it there, and be sure that the image you tested in CI is exactly the same one deployed on those devices.

Finally, we have talked about how the operating system is a crucial part of managing a network of edge computing devices, but we want to run things on top. We want to run our workloads, in this talk AI and machine learning models. To run these workloads we are used to the cloud, and we want to use the same kind of tooling that we use in the cloud.
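Before moving on to the workload layer, here is a minimal sketch of the kind of custom Greenboot health check mentioned above. It assumes the usual Greenboot convention of dropping an executable into /etc/greenboot/check/required.d/; the file name and the endpoint being probed are illustrative, not taken from the talk. Greenboot treats a non-zero exit code from a required check as a failed boot and, together with the boot counter, triggers the automatic rollback described here.

```python
#!/usr/bin/env python3
# Hypothetical Greenboot health check, e.g. saved as
# /etc/greenboot/check/required.d/50-model-server.py (path/name are assumptions).
# A non-zero exit code marks the boot as unhealthy; after the configured number
# of failed attempts, Greenboot rolls back to the previous ostree state.

import sys
import urllib.request

# Illustrative endpoint: a local readiness probe for the inference service.
HEALTH_URL = "http://127.0.0.1:8080/healthz"

def main() -> int:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            if resp.status == 200:
                print("model server healthy")
                return 0
    except Exception as err:
        print(f"health check failed: {err}", file=sys.stderr)
    return 1  # tells Greenboot this boot is not healthy

if __name__ == "__main__":
    sys.exit(main())
```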
And for that, we needed to provide a lightweight Kubernetes, and this is where MicroShift was born. I'm very happy that I was part of the team that created MicroShift since its inception. And I don't know if you know, but last week MicroShift 4.14 became GA, which is a big milestone for the team.

The benefit of having a lightweight Kubernetes distribution is basically standardizing on Kubernetes for the life cycle of our applications. We get the orchestration benefits that Kubernetes provides and consistency across footprints. One of the mantras that I love is: develop in the cloud, deploy anywhere. This is a very cool feature. And you can use off-the-shelf AI frameworks, in this case KServe, Kubeflow, and so on.

This slide shows the difference between OpenShift and MicroShift in terms of architecture. For those who don't know, OpenShift is Red Hat's Kubernetes distribution, and it's a vertically integrated platform: OpenShift is responsible for managing the infrastructure below it, the operating system, all the components and versions of the Kubernetes cluster itself, and the applications running on top. For MicroShift we have taken a different approach. We rely on the capabilities of the edge-optimized operating system I mentioned before, and MicroShift is just an application sitting on top. It's basically a runtime, not a vertically integrated platform. And Alex will talk about the sexy stuff now.

There you go. All right, so Ricky talked a little bit about running MicroShift on top of Red Hat and exactly what that looks like. At Lockheed Martin, we're trying to figure out how to run AI workloads on these edge devices, and there are some improvements you need over the standard AI models and serving architectures you'd run on x86-64 systems when you make the transition to these low-power Arm devices. In particular, we're looking to lower resource utilization: we don't have an A100 with all of those gibibytes of RAM, we don't have the speed you're used to on those systems, and some of the dependencies can be difficult to manage. What we're using today, and demonstrating, is memory caching, TensorRT, and MLC LLM; that's mostly how we're improving running AI at the edge.

From there, we obviously have to figure out how to host that AI and let the pieces talk to each other. Again, we're looking for speed, we want a variety of models running at one time, we have to manage the dependencies for these models because they don't all use the same things, and we're trying to lower our overall resource footprint. To do that, we're using MicroShift as the Kubernetes offering at the edge. By using microservices and containers, we can manage our services independently and restart things as needed on a per-service basis. So if one model needs to be spun down to maintain a certain power profile, and we're going to pass data off between the two of them, we can spin one down and spin the other up as needed, with the benefits of these Kubernetes containers. For that, we're using gRPC with protobuf, MicroShift, and a little Flask server. The two main demo models we're looking at today are YOLOv8 and Vicuna, the seven-billion-parameter model.
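As a rough illustration of the TensorRT side of that workflow (not the exact code from the demo), the Ultralytics YOLOv8 package can export a model to a TensorRT engine and then run inference from that engine. The model file names below are just the stock pretrained weights, used purely as an example.

```python
# Sketch: converting YOLOv8 to a TensorRT engine with the Ultralytics API,
# then running inference from the optimized engine on the device's GPU.
# Model names are the stock pretrained weights, not necessarily the demo's.
from ultralytics import YOLO

# Export the PyTorch weights to a TensorRT engine (run on the target device,
# since the engine is built for the local GPU).
model = YOLO("yolov8n.pt")
model.export(format="engine", half=True)  # writes yolov8n.engine

# Load the TensorRT engine and run inference on a frame.
trt_model = YOLO("yolov8n.engine")
results = trt_model("frame.jpg")

# Summarize detected classes, e.g. to hand off to the language model.
for r in results:
    labels = [r.names[int(c)] for c in r.boxes.cls]
    print(labels)
```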
YOLOv8 is a tracking and classification model. We're using it to determine what's in the room, and we pass that information over to our large language model, Vicuna, which has been optimized with MLC LLM.

Sure. And just to backtrack a little bit and show off the graphs, to explain what we're doing with those: with TensorRT, we're getting around 95% accuracy with YOLOv8 and around a three-times speed improvement. It lowers the memory footprint as well, which is obviously important on these edge devices, which can be incredibly limited. And with the LLM, we're seeing a 44-times speed improvement on prefill, around four times on decode, and less than 10% RAM utilization. In fact, when we were running these tests, we weren't even able to run the Transformers version of Vicuna on the Jetson NX, and by changing to this MLC LLM quantized version we're actually able to run it on significantly more limited hardware. And from there, we're probably going to show the demo.

Yep. So what we're looking at here is a video stream of everybody off this cheap webcam running on our little Jetson NX. On the bottom left-hand side, hopefully you can read it, we have the different pods running our services: a web server hosting this simple Flask app on top, a Vicuna server running the LLM in the back, and a YOLO server running our YOLOv8 inferencing. On the bottom right we have jtop running, which shows the GPU memory utilization per process, so you can see we're sitting at around the two-to-three-gigabyte mark.

So I can actually ask the model here. This is a quick chat to Vicuna, asking it how many people are here. I guess we could give it a shot. It's a little confused. As another example, I could ask it how many chairs are here; it might get that a little bit better. Too many boxes, right? Yeah, it might be that there are too many boxes. Oh, it thinks there's a motorcycle. Oh, down over here it sees a motorcycle. But it's getting that information from the YOLOv8 engine through those gRPC connections, as a microservice between the two models.

And then I can ask it something else simple: can you give me three example points this presentation might cover? It's kind of got something. You can see the GPU usage in the bottom right part of it. Yeah, the GPU spikes up a bit, but with that MLC LLM quantization we're sitting at the five-to-six-gigabyte mark on our RAM, which is really helpful on these edge devices.

And then we can do this just to show that it's actually... this is the funny part of the presentation. We can just ask: you are Boromir from Lord of the Rings; you will answer questions in the character of Boromir, in line with the movies. And we'll just ask Boromir where he's from and see what he has to say about it. He's from the kingdom of Gondor, in the land of Mordor. Interesting. But we aren't doing any work here to cut down the LLM or the vision model; as you can see, it's finding motorcycles and other silly stuff that clearly isn't here. Most of what we've been doing is quantizing these models and cutting them down so they work at the edge through our hosting methods, so TensorRT and MLC LLM. It's all running on the NX, on this device. Actually, let me point the camera at the device to show what we are using. Let me see. Can you see?
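To give a feel for how the scene information flows from the YOLO pod to the LLM pod over gRPC, here is a minimal sketch of the client side. The service, message, and stub names (detector_pb2, DetectorStub, GetDetections, and so on) are hypothetical stand-ins for whatever the generated protobuf code actually looks like in the demo; only the general pattern, querying the detection service and folding the result into the LLM prompt, reflects what is described above.

```python
# Hypothetical sketch of the gRPC hand-off between the YOLOv8 service and the
# Vicuna service. The proto module and RPC names are invented for illustration;
# the real generated stubs in the demo will differ.
import grpc

import detector_pb2        # hypothetical generated protobuf module
import detector_pb2_grpc   # hypothetical generated gRPC stubs

def current_scene(host: str = "yolo-server:50051") -> str:
    """Ask the detection pod what it currently sees."""
    with grpc.insecure_channel(host) as channel:
        stub = detector_pb2_grpc.DetectorStub(channel)
        reply = stub.GetDetections(detector_pb2.Empty())
        # e.g. reply.labels == ["person", "chair", "chair", "motorcycle"]
        return ", ".join(reply.labels)

def build_prompt(question: str) -> str:
    """Fold the detections into the context the LLM sees."""
    scene = current_scene()
    return (
        f"The camera currently sees the following objects: {scene}.\n"
        f"Using that information, answer: {question}"
    )

if __name__ == "__main__":
    print(build_prompt("How many people are here?"))
```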
So there's a battery pack, and the actual compute module is just this little thing with a fan on top. And we are not making any external API calls or anything. Maybe show the frequency? Sure. Also to show really quickly, a feature of the Jetson: we're running in the max-frequency mode right now. If we really want to, I can jump over to my little jtop application here, let me see if I can get to jtop, and we can turn off the constant boost clock, so we can run even more efficiently. And I'll just ask Boromir where he's from again. So now, instead of running at the maximum possible clock at all times, it's running a little bit lower than that, and that makes it a little bit more power efficient. I'd say that overall, on this NX we're probably running at 30 to 35 watts total to do the full LLM inferencing and the YOLOv8 all at once at this speed. I guess that's probably it; we can pass it over to questions at this point. Yeah. If you have any questions... Does anyone have questions?
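For reference, the clock and power-mode controls being toggled in this part of the demo are exposed on Jetson devices by the stock nvpmodel and jetson_clocks command-line tools. The small sketch below just wraps those from Python; the specific mode number is illustrative and device-dependent, not a value from the talk.

```python
# Sketch: querying and adjusting Jetson power/clock settings by wrapping the
# stock nvpmodel and jetson_clocks CLIs. The mode number used in the comment
# is illustrative; valid modes differ per Jetson module.
import subprocess

def show_power_mode() -> None:
    # Print the currently active nvpmodel power mode.
    subprocess.run(["nvpmodel", "-q"], check=True)

def set_power_mode(mode: int) -> None:
    # Switch to a different predefined power profile (requires root).
    subprocess.run(["sudo", "nvpmodel", "-m", str(mode)], check=True)

def show_clocks() -> None:
    # Display current clock settings; jetson_clocks with no args maxes clocks.
    subprocess.run(["jetson_clocks", "--show"], check=True)

if __name__ == "__main__":
    show_power_mode()
    show_clocks()
    # set_power_mode(1)  # example only: pick a lower-power profile
```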