So, I'm Nathan Harper from Graphcore, and with me is John Garbutt from StackHPC. In our earlier talk we had rather a lot of what we think are interesting things to talk about and not enough time to tell everybody about it. So this morning we've got an opportunity to talk in a bit more depth about some of the challenges, run through some demos, and also take any questions that you might have. So, just briefly, to set the scene again. At Graphcore we've built our own hardware alongside our own software. That is one of those machines: it looks like a server, but it isn't a server, at least not as far as the end users care. It's treated as a network appliance, something that is accessible remotely over our IPU-over-Fabric protocol. Our reference architecture is that each one of these racks is a Pod 64, and that's our reference system: you get four application servers, a 100 gigabit switch and 16 IPU machines, all connected together. They were originally designed to run effectively standalone. So you have a Pod 64, and alongside the application servers and the IPU machines it also has to run all of its own infrastructure for operating it. That includes your system services, even things like DHCP to hand out IP addresses to each of the IPU machines, because each Pod 64 had its own broadcast domain. Our virtual IPU manager, the server that is used to inform users and applications about which IPUs they should be talking to, also needed to run inside the system, generally on the first node, which creates some "first among equals" challenges that we'll swing back around to in a little bit. But, as I said on the slide, we were building these things effectively by hand. The management of the operating systems and the system builds was nicely automated and config-managed, but the management of the infrastructure that sits around it, particularly the network config and switch VLAN management, had to be operated by hand. Even though we could automate the individual pieces, there was nothing that joined them up: which switch port was connected to which thing, and which VLAN those things should sit on. As a result, the systems would generally be one big VLAN with one or more users running on it. If all the users are very nice and conscientious and make sure they don't trample all over the work their colleagues are doing, then everything's fantastic.
But the reality is that even the most well-meaning user will, at some point, end up using resources they shouldn't have access to, because someone else is supposed to be using them at the time, or unhelpfully using all the memory on our first-among-equals machine and not only taking down that particular system, but taking down the system services that the rest of the pod relied upon. So we had two drivers. How do we make it easier for users and our developers to get access to IPUs in their own small, dedicated area, so that they're not in a position to trip over each other? How do we move some of those system services outside of user space, so that a user doing development isn't going to out-of-memory kill the system and everything associated with it? And how do we give them a bit more choice in terms of configuration: operating systems, access to different numbers of IPU machines? All of that was also to help us drive utilisation. One of the challenges we've got, because we are building our own hardware, is that the pool of hardware we have access to is only so large. We have a number of different teams within Graphcore all developing different software, models and frameworks, and with a fairly static deployment system we'd end up with scenarios like: here is the Pod 64 that has been dedicated to PyTorch development on Ubuntu 22; then we have this one over here that is very, very similar, but it's on Ubuntu 20; and then we've got a couple of customers running Red Hat, so we need another system for that. What that meant was that we ended up with a lot of systems. These are not inexpensive systems, especially as this is development hardware that isn't part of a large manufacturing run. These are expensive systems to have sitting around idle, and that was something we knew we absolutely needed to improve. So that started the development of our VPOD, our virtual pod, and it was effectively an iterative process. The first stage was just carving one of our Pod 64s into smaller systems: taking different application hosts and dropping them into their own VLANs with their associated IPU machines, so we could take a Pod 64, carve it up into four Pod 16s, and each one of those could be used individually without users crashing into each other, without trampling over each other. As you can imagine, that helps some things, but it also generated a whole load of extra work, because those system services, things like DHCP and the virtual IPU manager, now had to be replicated across each of those VPODs. So the potential for scaling this, without any form of automation and orchestration, only goes so far. This is when we started down the adventure of bringing our IPUs into OpenStack.
Because at that stage we knew that if we could manage that infrastructure, manage the servers and potentially virtualise them, we could address one of the things that was missing from our reference systems: everything was very high-performance-computing focused, everything was bare metal and stripped down, with performance being the key. But the perception is that that can sometimes be the enemy of flexibility. So we started working with StackHPC and looking to bring our IPUs into OpenStack. Phase one of that was building an OpenStack cloud with our application servers baked into it, so that we could manage the virtualisation and start using OpenStack networking. By using DHCP from OpenStack, rather than having to run it on systems inside the pod, we started to move some of the system services outside of the pod. But at that point we were effectively just plugging our IPU machines into the OpenStack infrastructure; they were still outside of OpenStack. All we were doing was creating some Neutron ports associated with their network ports so that they would get addresses handed out by DHCP. OpenStack didn't have any understanding, any concept, that these things were there. The final goal was to bring those IPU machines into OpenStack and treat them as first-class citizens.

So this is where we ended up with our Loki-based IPU cloud, with three key requirements. The first was isolation: I've talked about some of the problems we had with users crashing into each other, and we wanted to be able to absolutely prevent that. The second was that we didn't want to sacrifice performance. Within Graphcore we've got a lot of high-performance computing heritage, and if you go to certain people and talk about high-performance computing and virtualisation, they'll give you dirty looks and say you can't do both. We wanted to demonstrate that we could achieve the same level of performance. The third was driving self-service: giving users the ability to request the infrastructure that they need. And so I'm going to hand over to John, who can tell you a little bit more about how we developed some of this and the process we had to go through.

Excellent, thank you, Nathan. So I'm going to go through how we use Loki to create that reconfigurable infrastructure and the isolated VPODs, how we made it performant, which as Nathan said is crucial, and how we made it more accessible to users through self-service. Let's start with the isolated, reconfigurable bit. It's a little bit like these reconfigurable conference rooms, and what I mean by that is you need to plan ahead to decide how you're going to slice up the system. For example, if you're building a big supercomputer, sometimes you need the whole machine to be the supercomputer. Sometimes you have people doing medical research who need the performance of that supercomputer, but they need to be isolated from everyone else.
That's where this requirement for slicing and dicing high-performance infrastructure comes from. Another key requirement is development environments: how do I get a little slice of exactly the system that you're running in production? Again, this is very close to the Graphcore case, where you can slice and dice it up. Sometimes the developers need a big system to test those kinds of problems; sometimes they need a small test system. So we wanted this ability to slice and dice. How do we do that? The IPU machines are physical boxes attached with network cables to a physical switch, and if you squint at that setup, it's very like what Ironic does. So we used Ironic to actually change those switch ports. To get hold of an IPU machine, you create an instance in Nova with an IPU-machine flavour, and you can have multiple types of IPU-machine flavour. When you create it, it goes off to Neutron, Neutron reaches out to the switch, actually SSHes in with Networking Generic Switch, known as NGS, and changes the access VLAN on those ports. That gives you the dynamic ability to say that these VMs, these physical x86 machines and these physical IPU machines are all connected to the same Neutron network, without really having to deal with the actual nuts and bolts of putting that all together. It's all exposed through standard OpenStack APIs.

When you're talking to the IPU machines, you generally need to use the Poplar SDK stack, and that requires RDMA connectivity: remote direct memory access. The way we do that is with SR-IOV. So what is SR-IOV? If you look at a lot of modern NICs, the cards present a physical function and potentially several virtual functions into the machine. What we do with SR-IOV is pick a virtual function and pass it into the virtual machine, so inside the virtual machine it sees a NIC. That virtual machine has NIC drivers, and pretty much everything you can do on bare metal, you can now do inside the VM. In this specific case, the physical x86 machines are connected to a 100 gigabit Ethernet bond. So rather than just doing legacy SR-IOV with that virtual function, we actually plug the virtual function into Open vSwitch. Now, some of you are thinking: what has he done, he's thrown away all the performance by going through Open vSwitch. But what happens is that when the flows are established, Open vSwitch knows to program the hardware offload, so packets start flowing through the hardware direct into the VM. And that's how you can get the IB bandwidth test running at line rate inside the VM, just like it would on a bare metal machine. That sounds simple, but there are always a few little things you have to sort out. For one example, these were AMD Zen 2 based x86 machines, and what we found was that in order to actually get that level of performance, you need to tweak the default BIOS settings to make sure that the PCIe slot your network card is in is the preferred one, and you have to make sure you get the right sort of NUMA layout. And that walks me into the next slide, which is about performant VPODs. How do you get something a bit like the Red Arrows? What I mean by that is we need to make the thing fly: it needs to be performant, and we need all the things to work together properly. So how do we do that?
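Before moving on to performance, here is a rough sketch of the slicing-and-dicing flow just described, using openstacksdk; the cloud, flavour, image and network names are illustrative rather than Graphcore's actual ones. Booting the Nova instance is what drives Neutron, and networking-generic-switch behind it, to move the IPU machine's switch ports onto the VPOD's access VLAN; the x86 application hosts join the same Neutron network in a similar way, but via SR-IOV ports created with vnic_type 'direct'.

    import openstack

    # Connect using a clouds.yaml entry (the cloud name is an assumption).
    conn = openstack.connect(cloud="graphcore-dev")

    # Illustrative names: an IPU-machine flavour, its image, and the VPOD network.
    flavor = conn.compute.find_flavor("ipu-machine-bow")
    image = conn.compute.find_image("ipu-machine-image")
    network = conn.network.find_network("vpod-16-rdma")

    # Creating the instance causes Neutron (via networking-generic-switch) to
    # SSH to the physical switch and change the access VLAN on the machine's
    # ports - no manual switch configuration involved.
    server = conn.compute.create_server(
        name="vpod-16-ipum-01",
        flavor_id=flavor.id,
        image_id=image.id,
        networks=[{"uuid": network.id}],
    )
    server = conn.compute.wait_for_server(server)
    print(server.status)  # ACTIVE once the machine is on the VPOD's VLAN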
So as I was saying, we use this VF-LAG SR-IOV, and using the OVS hardware offloads, that gives us line rate inside the VM. I was talking about the IB bandwidth benchmarks and the latency benchmarks: there is a small-ish latency penalty for using VF-LAG versus legacy SR-IOV, but that's a whole different talk. Let's talk about how we optimised the Poplar SDK stack. We took one of the ResNet MLPerf tests as the reference benchmark, and the challenge was: how close can we get the virtualised version to the bare metal version, just as a baseline? It turns out that, with lots of tinkering, we got the VM version going faster than the bare metal. And to be clear, I was definitely cheating, because we optimised the bare metal more than the previous bare metal setups: we made sure that we had consistent firmware across everything and consistent BIOS settings. Our good old friend Ansible came up trumps here, because we used Ansible to automate the enrolment of the Ironic machines and to make sure we had good firmware, good BIOS settings and good iDRAC settings across all of them, because these were Dell servers. So that's how we were able to get that performance. And because this is a technical deep dive, why not go into some of the grubby details? Because that's fun. Well, it's fun for me; sorry if you're bored. If we have a look at the AMD CPUs, they use a technique called chiplets. If you go to some of the news articles, people love to show you the picture of what the silicon looks like, and if you look really carefully you'll see there's an IO die in the middle and chiplets around the outside. That IO die is, roughly speaking, where the memory is attached, sort of; let's pretend it's attached to the top and the bottom, because there are several memory zones. And in this case it was two-socket, wasn't it? Yes, two sockets; so there were two of those. Traditionally you'd go: ah yes, two sockets, two NUMA zones, bish bash bosh, all done. Well, no, basically. In this particular case, what I didn't know at the time was that the bare metal had been set up using custom NPS settings. That's not Net Promoter Score, that's NUMA-per-socket, which will confuse people in all sorts of businesses. Anyway, NPS of four. In these particular servers, and this was Zen 2, not Zen 3, so it doesn't have the shared cache across the chiplet, but let's ignore all that nonsense, what we found is that a good rule of thumb is to set NUMA-per-socket equal to the number of chiplets in your SKU. Now, I'm sure my wife would pull me up and say, what on earth are you talking about now? What I mean is that the different SKUs of the chip, the different versions and models, have different numbers of chiplets. Let's not go into binning chiplets and everything else, but it's a cool way they've got that flexibility, and it made a difference here. So let's zoom back up out of silicon design, where I have no right to be. Why was that interesting? Well, it turns out it affects memory latency, because if you think of those chiplets, that's where the cache lives.
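As an aside before we get to what this means for the guest, here is a quick sanity check you can run on the hypervisor to confirm that the NPS setting has actually produced the expected NUMA layout; the socket and chiplet counts below are illustrative and depend on the CPU SKU.

    from pathlib import Path

    # NUMA nodes the kernel actually sees after the BIOS NPS change.
    nodes = sorted(p.name for p in Path("/sys/devices/system/node").glob("node[0-9]*"))

    sockets = 2              # two-socket Zen 2 box, as in the talk
    chiplets_per_socket = 4  # assumption: varies with the CPU SKU

    expected = sockets * chiplets_per_socket
    print(f"kernel sees {len(nodes)} NUMA nodes, expected {expected}")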
So if your VM thinks it's got some CPUs over here in chiplet one and some CPUs in chiplet two, but thinks they're all in the same place, the Linux kernel doesn't optimise for the fact that they've got completely different cache zones, and that really impacts your memory latency. And there was a particular step in the MLPerf test that was really affected by memory latency. So what we did, once we had that NUMA-per-socket setting, was pass that configuration through in Nova flavours to slice and dice it. We were able to say: Mr VM, you have eight NUMA nodes, or four NUMA nodes, and you have so many cores per NUMA node. We also made sure that the thread pairs matched up, so that it knew whether it was a real core or a pretend core, I mean a thread; I've spent too long with these HPC people. Anyway, I've had a big deep dive into CPUs because that was fun, so I've probably run out of time for all the other slides. Azimuth is how we take all of these grubby details and optimisations and make them repeatable: what is the stack that we can cookie-cutter out and actually get all of those performance optimisations? I've just given a talk on Azimuth that goes into a lot of the details, so let's focus on the Graphcore specifics here.

Thank you, John. So crucially, our developers at Graphcore are focused on AI and machine learning models. As interested as we might be in all the stuff that John has been enthusing about, they're not; they don't really care. They just want the thing to work, and to work repeatably, and they don't want to have to know what special extra things to enable to get that performance. So being able to provide our self-service VPODs through Azimuth basically means we can set sensible defaults, enable all the required gubbins to make those things work, and expose only the relevant config that users care about: things like what operating system am I running, and how many IPU machines do I have access to. We built on top of the workstation appliance, which you get out of the box with Azimuth. Because that's leveraging Ansible, which templates out some Terraform and then runs it, there was nothing stopping us from adding our own things into it, and we could go as custom and as specific as we wanted into our infrastructure. What that meant was we could really bake in a first-class developer experience, so that users get access to all the things they expect to have when logging into a machine: single sign-on, a consistent home directory, the applications and operating system they've requested, everything set up and effectively ready to go. So, rather than just talking about it, this is the bit that we unfortunately didn't quite have time to do in the earlier talk. That is interesting, why is my... you guys only have the presentation? Sorry? You're not mirroring? I was mirroring. Oh dear. All right, let's try this one instead. Sorry, guys. There we go. So, I can actually run through and build one of these. You can see we've got the selection of platforms that we have available. Kubernetes is one of the things that you get out of the box with Azimuth.
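For reference, here is a minimal sketch of the kind of Nova flavour extra specs used to pass that NUMA and thread-pair layout through to the guest; the keys are standard Nova properties, but the values are illustrative rather than Graphcore's actual flavour definitions.

    # Applied as extra specs on the VPOD flavours (values are assumptions).
    vpod_numa_extra_specs = {
        "hw:numa_nodes": "8",               # one guest NUMA node per chiplet
        "hw:cpu_policy": "dedicated",       # pin vCPUs to host cores
        "hw:cpu_thread_policy": "require",  # keep SMT thread pairs together
        "hw:mem_page_size": "large",        # hugepages for stable memory latency
    }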
So, particularly if you've got an OpenStack cloud and you want a straightforward way to give your users access to Kubernetes, Azimuth is a fantastic way to do it. But then we've created our compute appliance and our Poplar appliance, and these two have been set up very specifically for our Graphcore environment. The compute appliance is effectively the Poplar appliance but without access to any IPUs, which just means that users can get the same sort of development and working environment without actually needing to tie up any IPU machines while they're using it. So we can go through and create a system. I'm going to pick Ubuntu 20.04. Within our cloud we've got two different generations of IPU machines: Bow is our current, latest-gen system, but we also have what we refer to as our classic systems, which is the previous generation. Users can select one or the other, or they can just leave it as "any" and one will be selected for them. Then we can select the number of IPUs, all the way from 0 to 64. The advantage of picking 0 here is that this is one of the mutable fields we have access to inside our Azimuth system: we can create a system with zero IPUs on the basis that the developer just wants to get started and doesn't need them yet, and then they can come back to it and say, actually, now I want 64 IPUs, and the infrastructure will be reconfigured while their appliance is still live and running. Partitions, in this context, is how a virtual IPU system works: a partition is effectively a set of config that says these IPUs are available over on these machines over here. This is exposed so we can either create one for our users, or they can just leave it. Then we've got our set of flavours, and this is a curated list, so we've ensured that only the flavours that have all the relevant traits enabled are made available there. And particularly in this case, we've given a lifetime to the systems. One of the things we wanted to avoid was going back to the old days where systems just sat idle because people had grabbed them and said, I want this thing for the next month, please, and then used it once in a blue moon. So for our Poplar appliances we've got a maximum of 48 hours, which means we'll always get churn and bring things back into the pool. And then we can choose to associate a floating IP with it. Building one of these appliances takes a couple of minutes, because it's got to orchestrate all the infrastructure on the back end. So while we're waiting for this to happen, I figure we've got an opportunity, if anybody has any questions or wants to know anything more about what we've been doing. Oh, that'd be brilliant, actually. Thank you. Just for the recording.

So, it's basically a question regarding the VMs. You talked about the NUMA nodes and the IO die. One question is: you said you have a two-socket system. From my knowledge, your NIC normally belongs to one of the PCIe buses, which belongs to one IO die. How do you handle that with SR-IOV? Do you run two NICs and then schedule your VMs accordingly, or how do you do that? That is a very good question. So, if I were a telco chasing the absolute minimum latency, I would do what you just said. However, in this case, I was a bit more worried about bandwidth.
One of the limitations, actually, of using the VF-LAG SR-IOV is that the bond has to be on a single NIC. So it's a trade-off. If you want minimum latency, you can do the whole card-per-socket trick and make sure that you schedule the right card, or rather the right virtual function, to the right NUMA locality and everything else. Also, in this case, we were running quite large VMs that sit across both of the NUMA nodes when we were testing this, so in the very large case we didn't see so much of an impact, because I'm pretty sure the NIC came in on the right NUMA zone. That might have been luck rather than actual judgement. So, if you're using SR-IOV with VF-LAG and the OVS offload, it has to be on a single card. Okay, so basically the Infinity Fabric was fast enough for you in terms of latency and bandwidth? It was. Okay, thank you. The question was: are you sure the VF-LAG has to be on the same NIC? We don't assign two, so it's one VF that runs on top of the bond. So the question was, am I certain that VF-LAG only works on one NIC? 80%? It might depend on the NIC; it certainly probably depends on the vendor, and there are lots of different OVS offload methods. In this particular case we're using Mellanox ConnectX-5 NICs, and in that case, yes, it's one NIC: the bond is in the one NIC and you pass that through. The ConnectX-6 would be a bit better in terms of security groups and things. So, why is it a bond? Magic. If we go back to testing this feature on older generations, the VF is actually limited, as far as we can see, by the PCIe generation. Basically, the bond is in the NIC, so the NIC knows how to do the bonding, and those packets flow down the PCIe bus, so you're actually limited by the PCIe bus. If you use these ConnectX-5s in a Gen 3 server, you can only actually get about 110 gigabits a second rather than 180 or 200, because you're limited by PCIe Gen 3. But these are Zen 2 AMDs, so you've got Gen 4, so we don't have that limitation. Does that make sense? As to why it's a bond in the first place, no idea, but Nathan's thing just works, so, woo!

So, I guess to add to one of the things that John mentioned, one of the advantages we had with using the bond is that, although our focus has been RDMA access from our Poplar host to the IPU machine, that wasn't our only requirement for high-performance networking. Alongside the application access, particularly in our Graphcloud system, we've got one of the Pure Storage FlashBlade systems, which provides very high-performance NFS. So when we built one of our VPODs, one VF would be used for the RDMA access, but we'd also plug in another VF on a different VLAN dedicated to storage access. One could. So, for our ideal, maximum-performance VM, it was effectively the same size as the bare metal system, with just a little bit carved out of each NUMA zone to run the rest of the hypervisor. That was the one where we could guarantee the best performance: at that stage you're not sharing with anybody else. This was just a quick thing on deep dives: don't forget nconnect if you're in that particular scenario. The NFS nconnect mount option makes a huge difference to making use of that bond. Sorry. Oh dear, what was that?
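As a back-of-envelope check of the PCIe point above, here is a small sketch of the raw x16 link rates with 128b/130b encoding, which is roughly why a ConnectX-5 bond tops out at around 110 gigabits per second in a Gen 3 slot once protocol overhead is taken off, but has plenty of headroom in a Gen 4 slot.

    def pcie_x16_gbps(gt_per_sec: float) -> float:
        # Raw PCIe x16 throughput in Gb/s, assuming 128b/130b encoding.
        lanes = 16
        return gt_per_sec * lanes * (128 / 130)

    # Gen 3 runs at 8 GT/s per lane, Gen 4 at 16 GT/s per lane.
    print(f"PCIe Gen3 x16: {pcie_x16_gbps(8.0):.0f} Gb/s raw")   # ~126 Gb/s
    print(f"PCIe Gen4 x16: {pcie_x16_gbps(16.0):.0f} Gb/s raw")  # ~252 Gb/s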
Oh, so one of the things is that we've now been operating our IPUs in OpenStack for about two years at this point, so we've had a lot of opportunities to learn which bits have worked really well and where we've had opportunities to improve. We have actually developed different performance classes across our clouds, so that we can provide maximum, guaranteed performance for things like benchmarking; we have a Slurm HPC cluster running inside OpenStack, and that's got access to the performance-optimised VMs. But, particularly for developers, sometimes they absolutely need the performance-optimised systems, and other times it's just functional capacity. So we give access to different flavours that have perhaps some oversubscription of CPU and oversubscription of network access. That basically allows us to have our cake and eat it: we can provide full-fat performance, but in other cases we can cram as much onto a set of hypervisors as possible to help improve our utilisation.

That is a very loaded question in our discussions with our users at the moment. The Azimuth experience is extremely straightforward in terms of picking what they want. They have a persistent home directory and persistent networks, so every time you get a new system, all of your data will still be there. The only difference is that it is a new VM each time, so anything you've installed in the VM is a little bit different or might require a little bit of setup. We've been looking at different ways to help bootstrap that, such as, on first deployment, running an Ansible playbook to install all the things. We have been trying to avoid making it too easy, though. The reason is that the question is basically: can we not just have an API call that will do all of this for us, please, and set it up? The problem is, as soon as we do that, we have a lot of very capable developers who would have no problem setting up a cron job which, at seven o'clock every morning, fires up and creates a fresh system, whether or not they actually need it. Any time check? Oh crumbs, we have run out of time, unfortunately. But crucially, our system has been set up with convenience features for our users. If they've created a system and added a floating IP, they get "here is how you SSH into it". We don't have DNS integration yet, but that's something we're looking to do. And we also do things like setting convenience environment variables in those systems, so things like how to access the virtual IPU manager are there straight away; users don't have to set any of that up themselves. So, yeah, unfortunately we've run out of time to tell you more. If we had the opportunity, we'd probably keep going for a while, but thank you very much. Thank you, everyone.
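A minimal sketch of what those two performance classes could look like as flavour extra specs; the keys are standard Nova properties, but the split shown is illustrative rather than Graphcore's actual flavour definitions, and the oversubscription ratios themselves live in the hypervisor or host-aggregate configuration rather than in the flavour.

    # Full-fat, benchmarking-grade flavour: pinned cores, real NUMA layout.
    full_fat_extra_specs = {
        "hw:cpu_policy": "dedicated",
        "hw:numa_nodes": "8",
        "hw:mem_page_size": "large",
    }

    # Functional-capacity flavour: floating vCPUs on hosts where the
    # scheduler is allowed to oversubscribe CPU (cpu_allocation_ratio > 1).
    functional_extra_specs = {
        "hw:cpu_policy": "shared",
    }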