So, hello, hello everyone. My name is Zvonko Kaiser, I'm from NVIDIA, with the cloud native team. My current main responsibilities are working on Kata Containers and Confidential Containers, and in this talk I want to give you some history about sandbox environments, what we've done in the Kata space to enable our use cases, and how we can apply all those features that we added to RAG LLMs, but also to any AI/ML pipeline.

The agenda: how we came to confidential computing; why we have chosen Kata as the main driver for enabling sandbox environments with the GPU; a little bit of explanation about our GPU enablement stack, because it's important for our lift-and-shift strategy, where we said we don't want any code modifications to run our GPU workloads on Kata or on Confidential Containers. We also added a virtualization reference architecture to support advanced use cases like GPUDirect Storage (GDS) or RDMA in virtualized sandbox environments. Then a small stop on Confidential Containers, and finally I'm going to talk a little bit about confidential RAG LLMs.

But let me first set the stage: why are we even doing this? If you look at a container, it's just a process with namespace separation and cgroup resource management. Containers share the host kernel, meaning if I have a container breakout, it can take over the complete node, it can take over the complete cluster. A container is just a modern way of packaging, sharing, and deploying applications. It's a user-space abstraction only, and the red part shows what we are most concerned about: the weak security boundary between the container runtime and the host operating system. We have things like SELinux and AppArmor to fortify it, but we don't have any hard isolation, and we are highly dependent on the kernel. What we are mostly worried about is that host changes may break our container stack, because we are deploying drivers and other components that we need to enable the GPU. As I said before, container escapes can take over the whole node, and we need to trust the container images and the manifests that we are pulling into our cluster.

In the past, there were some techniques people looked into to create sandbox environments. Way back, there were unikernels. A unikernel is a way to package parts of the kernel and parts of your user space into one binary without a memory protection unit. So you get fast boot times, small attack surfaces, and really nice latency, but nobody wants to recompile their applications, package them, and link various libraries to them. It just wasn't easy enough to create unikernels from existing applications. IBM Nabla took this idea of unikernels and ran it as a process. Essentially, what they are doing is reducing the number of syscalls that the container runtime issues to the host operating system. They added hypercalls for privileged operations, and they even added a complete OCI-compatible runtime, so you could run containers with it, and Kubernetes. Another sandbox environment people are aware of is gVisor. It's essentially a user-space kernel implementing syscalls in user space — as far as I know, they re-implemented around 70% of the syscalls. But gVisor has no device model, so you cannot run device drivers; you cannot easily run a GPU inside of gVisor. There's also KubeVirt. KubeVirt is a VM in a pod, where the pod is mainly used as the deployment vehicle. It's mostly used for legacy VM applications, where you are still interacting with the VM.
AWS took a step and created Firecracker, a micro virtual machine, to run sandboxed environments. It's minimal, but there's no emulation for devices, so GPUs are currently not working with Firecracker. And lastly, Kata Containers — this is where we invested a lot, and I will go over the reasons why, and how, and what we're doing in the Kata space. These are essentially also micro, lightweight virtual machines that we're leveraging to run sandbox environments, especially GPU workloads. There's a nice talk on Friday by our friends in the community about running unikernels in Kubernetes, so check it out, especially for serverless computing in the Knative space.

Okay, what's Kata? Kata is essentially a container in a VM. Kata supports a broad spectrum of hypervisors, be it QEMU, ACRN, Cloud Hypervisor, and currently we are working with the upstream community on adding a Rust micro-VMM called Dragonball. It seamlessly plugs into orchestration platforms like Kubernetes and into container runtimes. Containers and workloads are now kernel and user space independent, meaning host kernel updates cannot easily break the stack. We can run untrusted code in a container; virtualization is a second line of defense. The outer runtime is mainly responsible for the life cycle of the VM, and we have an inner runtime, which is an OCI-compliant runtime. It adheres to CNI, CSI, and CRI, so running a container or a Kata container in Kubernetes is completely transparent. All the functionality that Kubernetes provides, Kata will just pick up, be it storage, be it networking, or whatever the CRI gives you.

Not only are we interested in fortifying the isolation between the container runtime and the host operating system; the other thing we are really interested in is fortifying the isolation between applications that are running on the cluster. There are several features or projects trying to enable or fortify this isolation. One is homomorphic encryption, which essentially enables computation on encrypted data without decryption. We have secure multi-party computation, aka federated learning, meaning you can allow parties to jointly compute a function over their inputs without revealing those inputs to the other parties. And the third big feature that hardware vendors added is trusted execution environments. In the past years, we got very good solutions for protecting data at rest, meaning we have encrypted databases. We have good solutions for data in transit, meaning encryption on the network, be it IPsec or TLS. But as soon as you decrypt your database and it's running on a host, it's completely vulnerable, because you don't have any encryption on the node. This is where confidential computing comes into place: trusted execution environments provide an environment to run your workload in a VM which is completely encrypted. Not only is the memory encrypted, the hypervisor also has no access to your register state, because the register state would expose the frame pointer and stack pointer, so the hypervisor could deduce what you're doing in the VM. Interrupts are also obfuscated, so that essentially the hypervisor does not have access to any part of the VM. So if we are breaking out of a container, we still have the VM as a second line of defense. But if an attacker is also able to escape out of the VM, he has no access to the other VMs. He can still do denial-of-service attacks on the VM, like shutting it down.
But the confidential data inside of the VMs is still protected.

Just a small history: trusted execution environments are nothing completely new. Already in 2004 we had the first trusted execution environments. All the major CPU vendors have trusted execution environments right now. If you have a mobile phone, you're running a trusted execution environment for VPay, Apple Pay, Google Pay — they're all running in some secure enclave. And all major architectures provide trusted execution environments.

So let me talk a little bit about why, how, and what we're doing with Kata. As I mentioned before, the why — why we have chosen it: containers are now kernel and user space independent, so host changes do not affect us very much. Container breakouts cannot compromise the whole node or the complete cluster, because we are still running in a VM. We can seamlessly plug into existing orchestration platforms like Kubernetes and other container runtimes. We have full OCI runtime and image support, and we can run containers without modification. The other point is that we can run untrusted code in a container, because we have virtualization as a second line of defense. And Kata supports a wide range of trusted execution environments, like Intel TDX, AMD SEV-SNP, Arm CCA, and s390x Secure Execution.

We are very active in the upstream community. We are working with many companies, attending the architecture committee meetings of Kata and Confidential Containers. We are providing the reference architecture for virtualized environments using accelerators. We are trying to reuse all the parts of the cloud native stack that we have. You may have heard the talks about CDI, DRA, NFD, the GPU Operator — all those parts we are using in Kata and trying to integrate as much as possible, because eat your own dog food, right? We have already established a good enablement picture with the GPU on bare metal, and we're just following this path to enable it in Kata as well. So we are extending the cloud native stack for a new sandbox environment, be it KubeVirt, Kata, Firecracker, or any other sandbox environment.

So what we have done in Kata: we enabled GPU and NIC passthrough — in general VFIO passthrough — and also VFs. We are extending Kata's PCI implementation to support host topology replication or side channels to provide meta information; I will talk about this a little later in the virtualization reference architecture. The use case is really GPUDirect RDMA and GDS in virtualized environments. We enhanced NFD, which is Node Feature Discovery, to expose features to the cluster so that you can schedule those confidential or Kata containers on the right node. We are also creating various runtime classes to support all of these use cases. The nice thing is that for each pod you can define how the PCI topology is going to look. So you can run one pod with GPUDirect RDMA, you can run another pod with GDS, or you have virtual GPUs, or you have a complete GPU passthrough. By setting configurations or runtime classes on your pod YAML spec, you can decide what PCI implementation or PCI Express topology you want to run in your pod. We are also adding some new features like inter-VM communication and vTPMs, and, as I said, the end goal is really to run GPUDirect RDMA and GDS in Kata containers.
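To make that scheduling idea concrete, here is a minimal sketch of how a runtime class and NFD labels can work together. The handler name and the label key are assumptions for illustration; the actual names depend on how Kata and NFD were deployed on your cluster:

```yaml
# Minimal sketch: a RuntimeClass that maps pods onto a TEE-backed Kata
# runtime and, via its scheduling section, onto nodes that NFD has
# labelled as capable. Handler name and label key are illustrative.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-qemu-snp            # example name for a confidential variant
handler: kata-qemu-snp           # must match a runtime configured in containerd/CRI-O
scheduling:
  nodeSelector:
    feature.node.kubernetes.io/cpu-security.sev.snp.enabled: "true"  # assumed NFD label
```

A pod then only needs to reference this class by name, and the scheduler automatically places it on a node with the right hardware features.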
So I just want to show you a brief overview of the GPU container enablement stack on bare metal, what we've done, and how we integrated it into Kata, because this is important for the lift-and-shift characteristics: when we set our premise, we said we don't want to do any code modification — we just want the container running inside of Kata the same as it runs on bare metal. The stack is pretty simple. There are features like CDI that we are using to modify the GPU container and bind the needed files into the container, because we need to make sure that user space and kernel space are in sync. This enablement stack works with all major runtimes, be it Docker, containerd, or CRI-O. There are other features, like cgroup v2, where we need to add a BPF program to enable devices. All those dirty details should really be hidden from the user; it should be a seamless integration. There was just a talk about how to manage device drivers by one of my colleagues in the cloud native team, and there's another talk tomorrow about how the GPU Operator works and how to life-cycle GPUs in a cluster. So we have the bare-metal enablement and, as I said before, the GPU Operator for the Kubernetes enablement. There is also another talk on Friday about how you can use operator patterns to manage hardware life cycles in a cluster. I'm not going too much into detail, but the point is: we are using our proven and working stack that has been running in production for many years. We didn't want to reinvent the wheel; we want to reuse what we have, and the goal is really to run GPU containers unmodified. Users should have the very same experience no matter what the underlying enablement mechanism is.

So we are taking this bare-metal enablement stack and just putting it in the VM. Since the Kata agent running inside is OCI-compliant, it supports all the things you are used to from bare-metal enablement, meaning you take the CUDA container you run on bare metal, and you can run it one-to-one in your Kata container or in your confidential container. One thing we need to do, of course, is provide all the guest operating system artifacts like the kernel, firmware, guest FS images, and configurations, but the main point is really: no code modification, just run it as you would on your bare-metal system. For the Kata use case, we enabled GPU PF passthrough and VF passthrough, meaning all the virtualized variants like time-sliced vGPU and MIG-backed vGPU are working, and our current use case to enable is GPUDirect RDMA inside a virtualized environment. How do you choose between those configurations? It's easy: by setting a runtime class in your pod YAML. You can set your runtime class to vGPU or to GPU Kata; it's just a matter of changing the runtime class to enable any of those use cases.
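As a sketch of what that looks like in practice — the runtime class names and the image tag below are illustrative stand-ins, not the exact names from our deployment:

```yaml
# Sketch: the same unmodified CUDA container, switched between
# enablement modes purely by changing the runtime class.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-workload
spec:
  runtimeClassName: kata-qemu-nvidia-gpu        # e.g. full GPU passthrough (example name)
  # runtimeClassName: kata-qemu-nvidia-gpu-snp  # e.g. confidential variant (example name)
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1        # resource name advertised by the device plugin / GPU Operator
```

Everything inside the container stays identical; only that one line in the spec changes.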
Let me go a little bit into the virtualization reference architecture — what I said before, that we have a PCI topology per pod. This is a brief overview of the use cases that we want to enable. There are a lot of combinations of PFs, VFs, and MIG slices, and we are adding the NIC into the mix. The main point is that the driver stack will disable peer-to-peer communication if the PCI Express topology is not suitable. There are various factors like IOMMU, ACS, ATS, PCI root ports, switch ports — all factors that can influence peer-to-peer capability. We have hardware constraints; we have NUMA, VM CPU sockets. Essentially there are two modes that can be reused for any virtualized environment.

Start from this host PCI topology — everything here is running on the same NUMA node, so we're excluding NUMA considerations for now. We have a PCI switch with a Mellanox NIC and a GPU. With most of the CSPs providing you a VM, you will get a flat hierarchy, so you don't know which GPU can talk to which Mellanox NIC: you've lost your PCI topology information. Usually the VM will get a side channel, such as a file like topology.xml, to set, for example, NCCL peer-to-peer levels if you're running on InfiniBand, and higher-level libraries like UCX can read this. But there's a problem with lower-level libraries, like the InfiniBand RDMA libraries — verbs, or GDS: they don't know anything about that. And recently we also added coldplug support to Kata; usually all this stuff is hotplugged.

But how do we provide additional information transparently, in a cloud native way, that is not tied to the pod but rather to the hardware used? The GPU driver stack can read a specific PCI Express virtual peer-to-peer approval capability, which needs to be set by the user, to tell which Mellanox NIC and which NVIDIA GPU form a group capable of doing peer-to-peer, or which two GPUs can do peer-to-peer, based on my host topology. And you may have heard of CDI, the Container Device Interface, which we are using: it's a DSL to provide additional meta information to any container runtime. That's what we leverage here. So we can say: okay, this PCI Express device belongs to clique ID 0, and this NIC with this PCI address also belongs to clique ID 0. This is picked up by the Kata runtime, which builds a properly configured QEMU — or any other hypervisor lying around that has a PCI Express topology implementation — to enable peer-to-peer between the two devices that are capable of it.

The other mode we can use is host topology replication. We are not replicating the complete host; we are only taking the main parts that we need, like the two PCI switches where we know, okay, there's a Mellanox NIC and an NVIDIA GPU. We can easily replicate those and create in the VM the very same architecture, where it's easier for the driver stack to deduce which devices can be used. So this is on the host, and this is how it stands in the virtual environment. The GPU driver and the NIC driver can easily deduce the topology and enable peer-to-peer far more easily than having the side channel to add more meta information, as explained earlier.

Of course, we also have some hypervisor limitations. I'm not going into too much detail — this is mainly based on QEMU — but you cannot attach an indefinite number of GPUs and NICs to the VM; you need to make sure that you attach only what you need. Another feature we added: if you don't care where your device is going to be attached — does it need to be a PCI root port? — you can attach it to a PCI-to-PCI bridge. Meaning you can say: okay, my GPUs are important, I need to have them on a high-speed PCI Express link, but the Mellanox NICs you can just attach to a PCI-to-PCI bridge, if that's needed and the constraints on the host tell you so. Again, with CDI we can tell where to attach those devices, so that we can easily enable P2P, GPUDirect RDMA, or GDS.
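To illustrate the idea — not the exact schema the Kata runtime consumes — a hypothetical CDI spec could tag a GPU and a NIC with the same clique, so the runtime knows to place them peer-to-peer-capable in the guest topology. The annotation key and device paths here are invented for the sketch:

```yaml
# Hypothetical CDI sketch: a GPU and a Mellanox NIC tagged with the
# same clique ID, signalling that the hypervisor's guest PCIe topology
# should allow peer-to-peer between them. Keys and paths are illustrative.
cdiVersion: "0.6.0"
kind: nvidia.com/pgpu
devices:
- name: gpu0
  annotations:
    example.com/pcie-clique: "0"      # assumed key: same clique => same P2P group
  containerEdits:
    deviceNodes:
    - path: /dev/vfio/71              # VFIO group of the passed-through GPU
- name: nic0
  annotations:
    example.com/pcie-clique: "0"
  containerEdits:
    deviceNodes:
    - path: /dev/vfio/72              # VFIO group of the passed-through NIC
```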
So we have the confidential CPUs, we have the runtime, we have the virtualization — the one piece missing is the confidential GPU. The confidential GPU didn't happen from one day to the next; it's a longer journey where the GPU got more and more features. It started with firmware modifications: encrypted firmware, measured boot, secure boot on a root of trust. And a couple of days ago we also announced the Blackwell architecture, which is the first accelerator that supports TDISP. TDISP is a new standard in the PCIe Gen6 specification; everything is done on the PCI Express bus — the attestation is done on the PCI Express bus — so with the Blackwell architecture you get full performance on any workload. The H100 was using bounce buffers to exchange information between the CPU and the GPU, so if your workload has a lot of CPU-to-GPU communication, you may get some performance degradation, because we are limited by the capability of the CPU to encrypt data for the GPU; the GPU can encrypt at full line rate, so it's only the CPU-to-GPU direction. But with Blackwell this is all gone.

Now we have the PCI topology in the VM, we have the confidential GPU, and we have Kata as the runtime, which all leads us to Confidential Containers. And again, the premise that we also set for Confidential Containers: we don't want to do any code modification here either, and we are using the very same stack here as well — nothing changes. We are still reusing our container enablement stack inside of it; again, it's hypervisor independent, and we are supplying the confidential parts of the GPU artifacts here as well.

One important part of running a confidential environment is that you want to make sure your components are trustworthy, meaning you expect that your kernel is running and that your components — the firmware, the guest image, the memory — are all in a specific state that you are expecting. That's essentially what you're doing during attestation: you measure your artifacts and compare them to some reference values that you're expecting, and you, as the workload owner, can then decide what you're going to do with the attestation report. You may want to release secrets into your VM, and you only want to do that if your attestation succeeds. NVIDIA and all the major providers of confidential computing follow the RATS architecture, which is an IETF standard. There are more things, like how you provide reference values, and other standards, which you can read up on at the RATS working group. But essentially the workflow is the very same: you set up a confidential environment, you get measurements of all your components, you send them out to some remote entity which compares the measured values against the reference values, and you, as the workload owner, can then decide: I want to release my secrets into the VM. With the secret in the VM you can, for example, decrypt your encrypted container, or do any of the other things you may want to do in the VM. This is just an overview of how secure key release works in Confidential Containers, but the simple version is what I just explained: do the attestation; if it is okay, release your keys into the confidential VM, and then decrypt whatever you need.
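From the pod's point of view, that whole flow is invisible. A minimal sketch, assuming a TEE-backed runtime class named kata-qemu-snp and an image whose layers were encrypted at build time (both names are illustrative):

```yaml
# Sketch: a confidential pod pulling an encrypted image. The key to
# decrypt the layers is only released into the guest by the key broker
# after attestation succeeds; nothing about that appears in the spec.
apiVersion: v1
kind: Pod
metadata:
  name: confidential-workload
spec:
  runtimeClassName: kata-qemu-snp               # example TEE-backed runtime class
  containers:
  - name: app
    image: registry.example.com/app:encrypted   # hypothetical encrypted image
```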
For the deployment of confidential containers — again, with the GPU Operator. On the left-hand side is the stack for the traditional container that you all know. We can configure the GPU Operator to deploy a confidential container with GPU passthrough, or you can configure a Kata container with, let's say, a virtualized GPU on the same cluster. So you can tell on which node you want to run, let's say, Kata, on which node you want to run a confidential container, or on which node you want to run a traditional container. This is all on top of the Confidential Containers operator, which provides us the CPU artifacts, while the GPU Operator provides you the GPU artifacts. And as I said before, it's just a matter of changing the runtime class — whether it's an AMD system, or you run on a TDX system, or you're running on an Arm system. That's the characteristic we want to achieve, the lift-and-shift characteristic: take your workload, change the runtime class, and decide how you want to run it.

Okay, now we have confidential environments and we have a confidential GPU — what can we do with it? I took RAG LLMs as an example, and I'm listing here the potential threats — this comes from the OWASP Top 10 threats for LLMs — and the potential mitigation strategies: which threats can be mitigated, limited, or eliminated with confidential computing. We cannot eliminate all threats with confidential computing. For example, if we have model overreliance, that's nothing confidential computing can help with; if your model is hallucinating, confidential computing can't do anything about it — that part is on you. But where resources are exhausted, where breakouts are possible — like running the insecure plugins that many LLMs have — or denial-of-service things, this is where virtualization or confidential computing can help you. For prompt injection, this is more a topic of how your API can validate, sanitize, or check what the user is doing; with confidential computing we can limit the attack surface, or we can enable secure execution of plugins.

So if we look at a very simple LLM pipeline — a front-end, an API server, a model server, a vector DB — the question comes up: which of those parts do you want to run inside a confidential container? If you remember what we talked about at the beginning, I said that each container breakout can take over the complete node. So I would say: just run every one of your containers in a confidential environment, because if, for example, an attacker breaks out of the front-end, he has no access to anything in the containers running the API server — for man-in-the-middle attacks — or the vector DB. A vector DB is an abstract representation of your training data, but there are still some ways to extract confidential data, and again, with secure key release and attestation you can make sure that you only release your data sources into your vector DB if attestation passes and your confidential VM is in the state you are expecting. The same goes for the models you are running in your model server: you only want to deploy your confidential model — the one you trained — to the model server if the confidential VM is in the state you are expecting. As shown in the sketch below, applying this is a one-line change per component.
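A minimal sketch of that per-component change, using the model server as the example; the same single runtimeClassName line goes into the front-end, API server, and vector DB deployments (all names are illustrative):

```yaml
# Sketch: one RAG component moved into a confidential VM by adding
# runtimeClassName. Repeat for front-end, API server, and vector DB.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      runtimeClassName: kata-qemu-snp          # confidential VM per pod (example name)
      containers:
      - name: server
        image: registry.example.com/model-server:encrypted  # decrypted only after attestation
        resources:
          limits:
            nvidia.com/gpu: 1
```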
Some closing remarks. RAG LLMs are not even a special example: looking at all AI/ML pipelines and personas, there's always one persona who wants to protect data, and there's always some stage in the AI/ML pipeline that you want to protect, where attacks can be partially mitigated or eliminated, and where confidential containers can help. But nothing will protect your data if you're running a random shell script from the internet, or a random model, inside of your confidential environment, and this thing opens a reverse shell to some attacker and leaks your data. And this would be the end of my session — any questions?

Thank you for your presentation, I have a question: wouldn't it make sense to do that on a smaller-sized GPU, for example the smaller NVIDIA Jetson boards, where it's not necessarily a GPU but maybe an ASIC on the Jetson board? I'm not sure if it's the same sort of GPU as the A100, for example.

Only the H100 and the Blackwell architecture have the hardware support for confidential computing. If you have virtualization enabled on your platform, you can do GPU passthrough of your Jetson devices and use Kata Containers, but confidential computing is only available on Hopper and Blackwell.

Great, thank you.

Thanks a lot for the presentation. A more generic question: how about peer-to-peer communication between the GPUs with NVLink — is that going to work or not?

I cannot say anything about that.

Thank you.