I'm Peter Solanke, VP of Engineering at CoreWeave. Who is CoreWeave? We are a specialized cloud provider. We focus only on high-performance workloads: anyone who needs a lot of compute power, predominantly machine learning training or inference, but also molecular dynamics and similar use cases. We're a bit younger cloud than your traditional hyperscalers. We started building our stack in 2018, and unlike when Amazon and the like started building their clouds, we had some new, very interesting technologies available to us, namely containers and Kubernetes. We also realized that all of the major supercomputers in the world were not running VMs; they're running bare metal. So the choice for us to start building with Kubernetes on bare metal was a pretty easy one. It made a lot of sense: not much overhead, and really great patterns to manage and deploy software. The road to running Kubernetes on bare metal is not necessarily as easy as it seems, and we'll get into that right now.

In this presentation, we're going to focus on a specific cluster that we built as part of our cloud fleet. It's the one that we used in May and June to set the MLPerf record on the LLM training benchmark. The project started in earnest in January, when we started building one of the absolute first NVIDIA H100 training clusters together with our customer Inflection AI. The MLPerf benchmark itself was run over 3,500 GPUs and finished in 11 minutes. The MLPerf LLM benchmark uses a GPT-3 model structure with a model size of 175 billion parameters. It's not trained to convergence; that obviously takes a bit longer than 11 minutes, at least on this cluster, maybe not the next one. But it gives you an approximation of a true LLM training workload.

The cluster itself, at the time of training, had 3,500 NVIDIA H100 GPUs. These are all interconnected into a supercomputer using InfiniBand network technology. There are 400 miles of fiber inside this supercomputer, all housed inside one sector of a data center, so it's a lot of fiber in a small area. And there are 40,000 connections between the systems: a fiber goes into a switch, goes into another switch, and each of these connections terminates on optics. They all need to be cleaned before you put them in, and if any of these fail, you have performance degradation in your cluster. The results that we got were 30 times faster than the next nearest competitor. So let's dig into how this cluster is built, what its components are, what can fail, and how we handle it.

The servers themselves are 4RU, 6RU, or 8RU servers. They contain eight NVIDIA H100 GPUs, and they have eight InfiniBand NICs with eight uplinks plus two ethernet uplinks, so there's a total of 10 fibers coming out of each system. All 10 can fail, and each individual failure can be catastrophic to a job. These are all configured in what's called a rail-optimized configuration, but I won't go too much into that.

So this is how it looks in a data center. You have rows of core switching racks; this is where all the InfiniBand cables go back. This is one of our core groups, and there are actually eight of these in one of the builds. You have 400 miles of InfiniBand fiber, standard multi-mode fiber. And then you have the servers to your upper right there. You can see the eight InfiniBand cables going into the back of the GPU servers. Any failure of any of these components will tear down the job, because all these GPUs are working together.
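To put those per-server fiber numbers in perspective, here is a rough, purely illustrative calculation. The node count and what exactly gets counted as a "connection" are assumptions on my part, not figures from the talk:

```python
# Rough fiber-count estimate for a cluster of this shape (illustrative only).
gpus_total = 3_500          # H100 GPUs at the time of training
gpus_per_server = 8
ib_links_per_server = 8     # one InfiniBand uplink per GPU (rail-optimized)
eth_links_per_server = 2

servers = gpus_total // gpus_per_server                          # ~437 servers
server_fibers = servers * (ib_links_per_server + eth_links_per_server)

print(f"servers:            {servers}")
print(f"server-side fibers: {server_fibers}")   # ~4,370 uplinks out of the servers
# The ~40,000 connections quoted in the talk is much larger than this,
# presumably because it also counts the switch-to-switch links inside the
# fabric and the optics on both ends of each cable (an assumption here,
# not something stated in the talk).
```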
All 3,500-plus GPUs are working together in one single job, and a single failure will cause the job to fail: it has to restart from the last checkpoint, and you lose a lot of your training time. So ensuring that your nodes are healthy and your entire fabric is healthy is critical to not losing performance on these very expensive AI training machines.

To actually build this, get it running, and get it running faster than pretty much anyone else out there, we built a couple of core components. We chose Kubernetes to be very flexible in how we deploy our software. We run it on bare metal to skip the virtualization overhead, and we boot everything statelessly. After the nodes are booted, there's a full suite of validations that we run, both during burn-in and continuously, and we gather all the metrics and act on them immediately. Once the nodes are up and healthy, we need to actually run the workloads. We can either run them in Kubernetes, using many of the schedulers or proprietary schedulers discussed earlier this week, or we can run them using the traditional HPC scheduler Slurm. We're going to cover all three of these steps in the next couple of slides.

Starting with how we boot the bare metal nodes. The systems are delivered without any OS on them. We don't want them to come with any OS from the vendor because things change constantly, right? We have new drivers to deploy, new kernels, new CPUs; we can't really expect anything that's preloaded in the factory to work for us. Inside each server, we have an NVIDIA BlueField-2 or BlueField-3 DPU, a Data Processing Unit. What these do in our stack is help us provide VPC isolation, since everything in our cloud is multi-tenant; they provide isolation between customers, and they also serve the boot image for the node. I'm not going to go too much into the DPU stack itself in this presentation, that's a whole different talk. But what's important to note is that these DPUs, which are basically a SmartNIC with a Raspberry Pi on it, also run Kubernetes. They connect to a DPU management cluster, where there's a CRD that tells them that this specific node, node number A, should boot this image and join this Kubernetes cluster. Then when a node boots, it pulls a stateless image. It's an Ubuntu image, very tiny, but it has GPU drivers, some InfiniBand drivers, and a kubelet. It gets a join token from the DPU via cloud-init and joins a Kubernetes cluster. Whenever a node reboots, there's no persistent state on the node; it does the same process again, pulls an image like it's the first time it booted, and joins the Kubernetes cluster. This way everything is stateless: there's no manual provisioning of a node, there's no state drift, it's fully ephemeral, which means we can plug in new nodes and get them up and running, joining a Kubernetes cluster, immediately.

Moving on to the next phase, node lifecycle. Nodes come online, we need to make sure they're healthy, and we need to continuously keep them that way. We picked Kubernetes as our source of truth. This means we don't have any other databases storing information about what a node is, whether it's up, or who it belongs to; everything is in Kubernetes. I like to say that Kubernetes is a database. And it kind of is, right? It's etcd, it's an API server, and CRDs as a schema for objects. And then you can do nice watches, and there are great patterns for writing reconcilers and controllers.
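To make the "Kubernetes as a database" idea concrete, here is a minimal sketch of the watch-and-reconcile pattern: a small controller that streams Node events from the API server and reacts to a condition. It uses the official Python client; the `GpuFault` condition type and the cordon-on-fault reaction are placeholders for illustration, not CoreWeave's actual controller logic:

```python
# Minimal watch/reconcile sketch: the API server (etcd behind it) is the
# source of truth for node state. "GpuFault" is a placeholder condition type.
from kubernetes import client, config, watch

config.load_kube_config()          # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

def reconcile(node):
    """React to the observed state of a single Node object."""
    name = node.metadata.name
    for cond in (node.status.conditions or []):
        if cond.type == "GpuFault" and cond.status == "True":
            # Cordon the node so no new workloads land on it.
            v1.patch_node(name, {"spec": {"unschedulable": True}})
            print(f"{name}: cordoned because of condition {cond.type} ({cond.reason})")

w = watch.Watch()
for event in w.stream(v1.list_node):
    reconcile(event["object"])
```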
So building an ecosystem around Kubernetes makes it very easy for us to plug in new things and get metrics out into different metric stores without having to build a bunch of glue between proprietary systems and Kubernetes itself. We flow all data from Kubernetes state up to Prometheus for observability and alerting, and once the data is in Prometheus, we can take actions on it, both automatically and manually from an operations point of view. You'll see the theme of Kubernetes as our database as we go along.

We use a bunch of open source tools and a couple of proprietary controllers in our stack. Node Problem Detector, which many of you probably use, is a very core building block. It doesn't come with a lot of detections out of the box, but it allows us to easily plug in different triggers on, say, kernel messages and kernel logs. Kubernetes itself kind of expects you to have happy nodes, right? When you run Kubernetes on a VM from a cloud provider, it's usually happy. In itself, Kubernetes doesn't have a lot of health checks for nodes going bad, and it doesn't handle nodes being in half-bad states well at all. So we need to build a lot of tooling around it. We use the standard node exporter and DCGM exporter to get node and GPU metrics, Prometheus for our metric store, Loki for log aggregation and querying, and Grafana and Grafana OnCall for visualization for our operations teams. We have a proprietary Redfish exporter and Redfish controller to talk to the out-of-band management of nodes. This is used to get out-of-band metrics from our inventory and to do things like reboot the node when you don't want to reboot it from the OS. And we have our proprietary lifecycle controller and Argo Workflows, which are used to execute health checks; more on those shortly.

Let's dig into the lifecycle controller and how we handle the bare metal node lifecycle in Kubernetes. As I said earlier, Kubernetes expects happy nodes. If you plug in a node and it boots, you can probably schedule a pod on it. Does it have networking? Hopefully. If the kubelet says it's ready, pods will be scheduled on it. That doesn't really mean that your GPUs are healthy. It doesn't mean that your networking is actually working, that you can access persistent storage, and so on. This is why we developed what we call the node lifecycle controller. The node lifecycle controller acts on states, and a state can be a Kubernetes label, an annotation, or a node condition. It takes the node through its journey up until it's in production, and even when it's in production, it can take it out of production on some kind of fault or condition. It starts by verifying that the node is physically correct: it has all the disks it should have, it has all the InfiniBand connections it should have, it's connected to the right InfiniBand switch and ethernet switch, and so on. It then takes the node through a firmware upgrade process where we upgrade the firmware of all the GPUs, the BIOS, and the BMC of the node. Especially with these new platforms, there are a lot of changes in the microcontroller software running on them, so you need to be prepared and have an automated way to upgrade all these components at a very frequent cadence. Once the node is verified and upgraded, we run a battery of self-tests that runs for 24 hours, and this includes everything from testing GPUs and so on; I'll talk more about that in a little bit. Once the tests have passed, the nodes are put into production.
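As a rough illustration of that label-driven lifecycle, here is a sketch of a single step of such a state machine. The label key, the state names, and the check functions are hypothetical placeholders; the real lifecycle controller is more involved:

```python
# Sketch of one step of a label-driven node lifecycle. Label key and state
# names are invented for illustration.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

STATE_LABEL = "example.com/lifecycle-state"   # hypothetical label key

def verify_hardware(node):      # placeholder: disks, IB links, switch ports
    return True

def firmware_up_to_date(node):  # placeholder: compare against a firmware bundle annotation
    return True

def advance(node_name):
    node = v1.read_node(node_name)
    state = (node.metadata.labels or {}).get(STATE_LABEL, "onboarding")

    if state == "onboarding" and verify_hardware(node):
        new_state = "firmware-upgrade"
    elif state == "firmware-upgrade" and firmware_up_to_date(node):
        new_state = "testing"          # kicks off the 24-hour burn-in
    else:
        return

    # The label is the source of truth, so moving the node forward is
    # just a patch on the Node object.
    v1.patch_node(node_name, {"metadata": {"labels": {STATE_LABEL: new_state}}})
```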
This means the customers can schedule on them and we can run things like our big MLPerf training job. When a node is in production, the core node controller takes over. This controller is responsible for labeling nodes with metadata. Again, Kubernetes is our database, so metadata about a node's location, its placement in the fabric, and so on is all written on the node object. The node controller also handles lifecycle events like a reboot, which can be requested either by a customer or by one of our admins; it reaches out via Redfish to do so.

Here's a very detailed state diagram of the lifecycle controller. I'm not going to go into everything that's happening here, but there are a couple of important themes. As you see, the kube-apiserver is central: every action flows through Kubernetes. There's no path that doesn't go through Kubernetes. If an admin here on the left requests a reboot of a node, they do so by setting a condition on the node, and then the node controller will reach out through Redfish to that node's BMC, the out-of-band management interface, and trigger the reboot. The reason we force everything to go via Kubernetes is that we get a built-in flow for event logging. We have Kubernetes events; it's all there on the node, you can describe it and see what the last action taken was. And we can use metrics exported through, say, kube-state-metrics and kube node labels to get all of this into Prometheus automatically. By centralizing the entire management flow in Kubernetes, we get a lot of stuff for free, and we also get a programming model that different teams already know when they need to interact with this, because a lot of people know how to program against Kubernetes. You also see on the top there's a Redfish exporter. It goes to Prometheus, which goes to something called Epimetheus, which is basically a converter from Prometheus alerts to conditions, back to the API server. What this means is that if there is some kind of alert on a node, say a bad GPU, we can feed that back as a condition on the node, and then, since the node controller and node lifecycle controller act on conditions, we can take an action on it. So it's this full flow driven by labels and conditions set on nodes.

An excerpt from a node describe shows how we use Kubernetes to store all of our metadata. The first annotation shows output from our automated health tests. In this case, the GPU has a failure, and the health test even tells us that it should be submitted for RMA, so our ops team knows what to do with the node. You see some topology information; in HPC clusters, topology and location are important. When you schedule your ML training workloads, you want them close to each other on the network. So we annotate every node with where it is in the InfiniBand fabric and which switch it's connected to, so we can verify correctness. We also annotate the node with what we call a firmware bundle, which says these are the firmware versions this node should have, and the node lifecycle controller knows to reconcile the node to those versions if they differ. And you have the state down here, which is a very unassuming label but core to the entire process: it drives the node controller by saying what state this node is in. Right now it's in broken. It was automatically put in broken based on the test failure, and while it's in broken, it won't be available to customers to run workloads on. The node will be cordoned off, and so on.
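Here is a hedged sketch of the admin-reboot flow from that state diagram: the admin records the request as a node condition, and a controller that sees the condition calls the node's BMC via Redfish. The condition type and the BMC lookup are invented for illustration; the Redfish reset action shown is the standard one, but the real controller certainly does more (draining, retries, audit events):

```python
# Sketch: request a reboot by writing a condition on the Node, then act on it
# out-of-band via Redfish. "RebootRequested" and the BMC parameters are
# illustrative placeholders.
import datetime
import requests
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def request_reboot(node_name):
    """Admin side: everything flows through the API server, so the request
    is just a condition patched onto the Node's status."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    v1.patch_node_status(node_name, {"status": {"conditions": [{
        "type": "RebootRequested",            # hypothetical condition type
        "status": "True",
        "reason": "AdminRequest",
        "message": "reboot requested by operations",
        "lastHeartbeatTime": now,
        "lastTransitionTime": now,
    }]}})

def handle_reboot(bmc_host, bmc_auth):
    """Controller side: having seen the condition, trigger a standard Redfish
    ComputerSystem.Reset. The system ID ("1" here) varies by vendor."""
    url = f"https://{bmc_host}/redfish/v1/Systems/1/Actions/ComputerSystem.Reset"
    requests.post(url, json={"ResetType": "ForceRestart"},
                  auth=bmc_auth, verify=False, timeout=30)
```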
Looking at another describe of node conditions, you can see a slew of node conditions here. Some of these are set by the Node Problem Detector, and the node controller can act on them, like the GPU fault condition, which is based on a kernel log. Some of these are set from alerts, like the network alert fault, where an alert is triggered on a node and we set a condition on it so the node controller can act on it. And some are set by admins. There's the admin maintenance mode at the bottom, which one of our operations engineers can set when they want to take the node out for maintenance for some reason. The node controller will then cordon it off from the customer and make sure there's no workload running on it before it lets the admin do any work on it. And since all these states are labels or conditions, they're already exported for us, for free, to Prometheus via kube-state-metrics and kube node labels, so it's very easy for us to visualize what's going on in the cluster.

This is a visualization of the node lifecycle controller states. You can see that nodes start in onboarding and go through different test phases. Zap is when we upgrade firmware; test is the 24-hour test window we run when nodes boot for the first time. Then a bunch of them are in production, and you have post-production phases with debugging or broken/RMA nodes. Since the test times are an annotation on the node, we can very nicely graph how much time nodes have left in their self-test phase.

Talking a bit about our burn-in testing: when a node starts up for the first time, we put it through a 24-hour burn-in test cycle. These tests are run via Argo Workflows. They run continuously over the course of 24 hours; each iteration takes about six hours, so we run four of them. They test everything from PCIe performance, NVLink and NVSwitch performance for the NVIDIA GPUs, InfiniBand performance all across the fabric, CPU, and ethernet, and they look for any error counters. If any of these fails, an alert is raised and the node controller can automatically move the node from the testing state to the testing-failed state for manual intervention.

After a node is in production, we keep running these tests. The Argo workflow, as it's written, checks whether the node is available: are there any pods running on the node consuming GPUs? If the node is idle, we can run a test. So we continuously test nodes that are idle. The tests run at a lower priority class than normal jobs, which means customer jobs will preempt these tests as soon as they start running. But since we continuously test the nodes, we don't have to wait for a customer to tell us, oh, this node is broken, because we usually find that out before they do, hopefully. And since everything is in Kubernetes, we can easily graph these results and get nice state timelines for different nodes, like I'm showing here: a set of four nodes with different issues, and one of those issues started here at the end of the timeline where the tests run. All the logs from these are easily piped into Loki, and we can parse the log results from the different tests to graph them over time as well. Here are the test results from one of the performance tests over part of our fleet, and we're really looking for trends here: say a new driver version is released, we want to make sure that performance isn't degrading, or a new BMC version, or even environmental changes in the data center or part of the data center.
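As a rough sketch of the "test only when idle" idea: check whether any pod on the node is holding GPUs, and only then launch a health-check pod under a low priority class so real workloads preempt it. The priority class name, namespace, and test image are placeholders, and the talk describes driving this through Argo Workflows rather than bare pods:

```python
# Sketch: run health checks only on idle nodes, at low priority so customer
# jobs preempt them. Priority class, namespace, and image names are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def node_is_idle(node_name):
    """Idle = no running pod on this node is requesting GPUs."""
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name},status.phase=Running")
    for pod in pods.items:
        for c in pod.spec.containers:
            requested = (c.resources.requests or {}) if c.resources else {}
            if requested.get("nvidia.com/gpu"):
                return False
    return True

def launch_health_check(node_name):
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(generate_name="healthcheck-"),
        spec=client.V1PodSpec(
            node_name=node_name,
            restart_policy="Never",
            priority_class_name="health-check-low",        # placeholder priority class
            containers=[client.V1Container(
                name="gpu-burn",
                image="example.com/gpu-healthcheck:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    requests={"nvidia.com/gpu": "8"},
                    limits={"nvidia.com/gpu": "8"}))]))
    v1.create_namespaced_pod(namespace="node-tests", body=pod)   # placeholder namespace

for node in v1.list_node().items:
    if node_is_idle(node.metadata.name):
        launch_health_check(node.metadata.name)
```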
We developed dashboards for our operations teams to use when they troubleshoot nodes. This is maybe one-fifth of that dashboard, but having all this data available to us in Prometheus and Loki makes it easy to bring it all together in one place and give a really clear view to anyone who needs to dig in and troubleshoot nodes. We also use the Grafana stack for alerting. We use it together with PagerDuty, but Grafana OnCall provides really beautiful visualization of alerts in Slack, so it makes it very easy for operations technicians to go and see: okay, this node alerted; I can get a nice timeline; when did it have an alert previously? If it did, I immediately know what I might have to do with this node, if there's nothing that can be done with it automatically. The node controller also listens to alerts and can take action on them if it's a well-known issue, like a crashed GPU: we know that node needs to be moved out of the customer's environment because it probably needs an RMA, so we do that automatically as well.

Okay, so now we have a bunch of healthy nodes. They're up and running, we've weeded out the broken ones, and we have this continuous cycle of node health. Now, how do we run training jobs on our cluster? We started out building this very cloud-native: pods for everything. There's a lot of work being done in this space on running large training jobs on Kubernetes. There are schedulers, there are co-schedulers, there's the MPI Operator, and there are companies like Run:ai and projects like Kubeflow building frameworks for these. Some of them are great, and some of them are getting better every day. However, in the traditional HPC community, people like Slurm. Slurm is an HPC workload scheduler. It's a couple of decades old, and it's built for a supercomputer world where you spend two years building a supercomputer, then you test it, then all your nodes with their CPUs and GPUs are up, and it's very static. That doesn't really work in today's age of rapid AI evolution, right? We build these clusters, and as soon as there's one node or eight GPUs online, a customer wants to start running on them. So we need something that is agile and that we can upgrade quickly. We don't want to take down the entire cluster for maintenance for a week to upgrade all the components, like you could do in a traditional HPC university environment. But people coming from research, coming from these HPC environments, are used to Slurm and want to use Slurm. So how do we solve that?

We built something called Slurm on Kubernetes, codename SUNK. I mean, if you look at the Kubernetes nautical theme, maybe it means that it's sinking, I don't know; no comment on that. But we try to integrate Slurm, with its very non-cloud-native architecture, with Kubernetes. Here's a very detailed diagram of how Slurm on Kubernetes works. The important pieces are that everything is containerized. All the controller pieces are containerized: slurmctld, slurmd, and the login nodes. Login nodes are traditionally a bare metal server that hundreds of researchers SSH into, do work on, and launch jobs from, multi-tenant; here they're also containers. We can launch multiple replicas, or individual login nodes for individual researchers, instead of needing a VM or a bare metal machine that 100 people are multiplexed over. The Slurm daemon that actually runs the training jobs is also a container. Slurm itself does a bunch of things with cgroups and so on, if you're familiar with that, and all of that inherits from the pod's cgroup tree.
So most, if not all, Slurm functionality works out of the box in this pattern, with slight modifications that we have made. Then there are a couple of interesting controllers. We have what we call a node set controller, which basically acts like a daemonset. What this means is that it will run one of these Slurm daemons on every node in your cluster that should be running Slurm, and it runs them without consuming all the resources on the node. So Slurm is running, you can launch jobs in Slurm, but you can also launch jobs in Kubernetes.

Then we have the Slurm-Kubernetes scheduler integration, where the Slurm scheduler acts as a scheduler for Kubernetes. When is this useful? One example is if you have a cluster that's used by researchers to train your foundation models, you're training a huge new LLM, but you also have production inference running on it. Production inference you really want to run in Kubernetes, right? Slurm has this kind of old pattern of working: it works great for batch jobs, but it's really not a good fit for long-running critical services. So we want to run inference in Kubernetes pods, but training in Slurm. And for inference traffic, you have people coming in from the internet to your app, ChatGPT-style; traffic can fluctuate. So we want our autoscalers in Kubernetes to be able to manage our Kubernetes pods, and when there's a burst in inference traffic, we want to be able to preempt unimportant jobs in Slurm. Say you have one big pre-training job that you don't want to preempt, running at the highest priority, but you have some research jobs that you'll happily preempt for more inference capacity. So the Slurm-Kubernetes scheduler integration allows the Slurm scheduler, with its very advanced preemption and gang scheduling concepts, to manage the scheduling of these Kubernetes pods. Instead of using something like Volcano in Kubernetes, we actually leverage the Slurm scheduler even for Kubernetes. Kubernetes pods show up inside Slurm, and we can use Slurm accounting and all those Slurm features for our Kubernetes pods, without compromising on those things running in proper containers, with proper container isolation from Kubernetes.

And we also export all of our metrics from Slurm, of course, into Prometheus. Since we have all of our other metrics there, we can create these really nice, informative dashboards taking metrics from the running job itself. Here we're running the MLPerf record-breaking job. We can see that there was an interruption there where something crashed: the FLOPS counter goes to zero. And by overlaying the alerts as annotations on top, we can very quickly see that, okay, this job stopped because a GPU fell off the PCIe bus. So this doesn't only help us diagnose when there are issues with the cluster; we can also expose these to customers so they have full insight into what's happening with these big training clusters, not one opaque black box where things either work or they don't. Slurm on Kubernetes will be open sourced at the beginning of next year. We believe this is important to marry traditional HPC work with Kubernetes and how containers should be run.
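To make that pod-level integration a bit more concrete, here is a minimal sketch of what it might look like from the user's side: an inference pod that names the Slurm-backed scheduler instead of the default one, so Slurm's preemption and accounting apply to it. The scheduler name, priority class, and image are assumptions for illustration, not necessarily what SUNK actually uses:

```python
# Sketch: hand a Kubernetes pod to the Slurm-backed scheduler by setting
# schedulerName. "slurm-scheduler" and the other names are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="inference-example",
                                 labels={"app": "inference"}),
    spec=client.V1PodSpec(
        scheduler_name="slurm-scheduler",        # placeholder: the SUNK scheduler
        priority_class_name="inference-high",    # placeholder: preempts research jobs
        containers=[client.V1Container(
            name="server",
            image="example.com/llm-inference:latest",   # placeholder image
            resources=client.V1ResourceRequirements(
                requests={"nvidia.com/gpu": "1"},
                limits={"nvidia.com/gpu": "1"}))]))

v1.create_namespaced_pod(namespace="default", body=pod)
```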
Here's one more data center picture, and then we'll move on to questions. Go for it.

What advantage does Kubernetes bring to the HPC community?

So, why we like it, right? You can roll out software more piecemeal and better isolate it. We can build containers. I mean, it's really all about containers: being able to package up your dependencies or your different components as containers, roll them out quickly, and upgrade piecemeal. And then you have a whole ecosystem of tooling that is built around Kubernetes. We talked about Prometheus and Loki; yes, you can use all of these outside of Kubernetes, but by running in Kubernetes you get to take advantage of everything that all these great open source developers, who are probably all around here today, have built. Retrofitting them onto a more traditional model with, say, Chef or Ansible, you can do it, but buying in gives you a lot of stuff for free. Then we also have isolation. Slurm and these sorts of HPC environments are not really that tight on security, in my mind. It's easy to have noisy neighbor problems and to get visibility into each other's jobs. They're really built from a traditional university perspective, where, okay, we want people to keep their data apart, but if I know what my neighbor is working on, it's not really a big deal. And this is changing now that we're building these AI models that are either very expensive to build, because they take thousands of expensive GPUs to train, or critical because of the level they operate at, and we're very worried about them leaking out. So using Kubernetes for RBAC and for isolation is just better than anything that has existed in traditional HPC.

Great talk. Yuan Cheng, can you hear me? Sorry? Yeah, Yuan Cheng from Apple. First, a clarification question about SUNK. For Slurm running on Kubernetes, did you mention you run Slurm as a plugin of the native scheduler, or is it standalone, a separate scheduler?

No. So, I'm going to pull up this diagram again; maybe this diagram explains it best. It doesn't actually use the native scheduler. You set a different scheduler name on the pod, and the SUNK syncer is a scheduler implementation: it acts as the scheduler and talks to the Kubernetes API to actually schedule the pod. So it's not a plugin to the native scheduler; the native scheduler is not involved in scheduling these pods at all.

Yeah, and I think in a previous slide you mentioned you can run both Slurm jobs and Kubernetes jobs on the same node. If the Kubernetes jobs are scheduled by the native Kubernetes scheduler, how do you solve the race conditions and conflicts?

Exactly. So you can run both Slurm jobs and Kubernetes pods, but both of them are scheduled by the Slurm scheduler. You're replacing the Kubernetes scheduler with the Slurm scheduler. This means there are some things the Kubernetes scheduler can do that you won't get, but for a lot of functionality, especially for these types of workloads, the Slurm scheduler is a superset of the Kubernetes scheduler: you can do very advanced preemption, gang scheduling, bin packing based on topology. So you would use the Slurm scheduler for all your pods. Sure, daemonsets and so on that run in the background you don't use it for, but for anything that uses up resources, you would use the Slurm scheduler. Otherwise you would have very weird conflicts, and it wouldn't be good at all.

Okay, got it. Second question, about your node provisioning and node registration: have you used Cluster API? If not, is there any reason why not?
Cluster API? Yes. So, I didn't actually talk about how we instantiate these clusters. We run multiple Kubernetes clusters, obviously, since we have multiple customers, and there are a couple of different models for how we provision them. There was a talk on that yesterday by my colleague Brandon, but we have our own cluster operator, which exposes our own CRD, and that's how we instantiate clusters. It's not actually Cluster API right now; it's our own CRD because of some philosophical differences, but the principle is the same. So we instantiate new clusters with the CRD, and that happens before anything presented today: the cluster API servers and so on are created before any node is booted. They run in a separate management cluster, much like GKE or AKS, and then nodes are booted into these clusters, and that's where all the node lifecycle happens.

Great, thanks again, great talk.

Thank you for the question.

On testing: is there a balance between how exhaustively you test the nodes and the cost of power to do that testing? Because obviously an idle node doesn't draw power the same way as a loaded node. Have you done analysis on that cost and how much you heat them up?

There is, and that's a constant debate between me and our facilities team, who don't like me to run the nodes at 100% load all the time. The compromise we have right now is that the ongoing test runs once every hour, takes about 30 minutes, and then there's 30 minutes of downtime, so it's 50/50. We could probably run it less frequently, to be honest, and it would still catch issues. We're pretty early in the lifecycle of these components, especially focusing on the H100, right? There's a huge curve where most of the issues happen in the first two months and it goes down from there, so as the clusters mature we can probably decrease that even further. But right now, at the point where we are, where GPUs are so scarce and it's so important for everyone to have healthy GPUs, yes, we're going to spend some extra money on power and put some extra load on our data centers to make sure we have as many healthy GPUs as we can for our customers.

Yeah, I mean, I guess.

Yes, and since the jobs take such a long time to restart, you have to load from checkpoint and start up Megatron or whatever framework you're using, so it can easily take you half an hour to restart the job, and it's very painful.

I'm wondering how much power you are running to the racks and how you're handling cooling.

Okay, this is a great question, and very much outside of node lifecycle. It depends on the data center. We have 15-plus data centers around the US right now, and it's going to be something like 40 in a year. Our average is 17 kW per rack. Since these things are so dense to begin with, space is usually not a problem, because most buildings are built to something like eight kW per rack. This specific cluster we're running at 32 kW per rack, but most of the builds are less than that to give more space. All of it is air-cooled currently. Next year, I think we'll see a lot of things switch to direct-to-chip liquid cooling; as newer generations of GPUs come out, they're going to consume so much power that it's infeasible to air-cool them.

One more question: what are you using for the base system that you're plugging the H100s into, and how many do you get per rack?
Yes, so each HGX H100 system consumes about eight kW of power. So if you have a 16 kW rack, you get about two, and in our 32 kW racks we get four. For the base system, we work with a couple of different OEMs; one of them is Supermicro, as an example. It's all HGX H100 systems.

Great, I've seen that one, thank you. I think you just answered my question, though. Those are Supermicro chassis in the data center?

In the picture, yes.

But they weren't DGXs?

They're not DGXs, they're HGXs.

Okay, cool, thank you.

Hello, I have two questions related to the network. The first one: you showed the DPU there, and you're using the DPU to provision the hardware first, download the image, and create the node. So how do you switch from that installation to the real Kubernetes environment?

Yeah, so the DPU itself is a NIC, it has a CPU on it, and it has its own OS. So the DPU is always booted; it runs its own Ubuntu on its ARM CPU. Okay, I'm supposed to stop now, I'm going to talk for one more second. It has its own ARM CPU, so the DPU is always running. Then when the node boots, the node has nothing on it, and it PXE boots from the DPU: there's a PXE server on the DPU which serves the image to the node. When the node boots, there's no installation really; the node boots statelessly from an image served by the DPU over ethernet, so the server talks up to the DPU over ethernet, and the node boots an Ubuntu image in RAM. So the DPU acts to serve the image to the node. Did that answer your question?

Yeah, I know that part, but how do you transfer over to the network that will be used by the customer?

It's kind of a management network, right? The DPU itself has an ethernet port that connects to the management network. And then towards the node, the DPU serves the customer network: it exposes the customer's VPC, which is an EVPN VPC. So the customer's node, when it boots, is always inside the customer's VPC. And that's why the DPU is the one that serves the boot image: from inside the customer's VPC, the node can't reach anywhere the boot image is hosted. The DPU does all the network magic to make the node think it's talking to this private network, while it can still load in stuff like images and so on. I have to stop now so we can do questions around here.