All right, our time hasn't started yet, but... oh, there we go, okay. We're good to go. Thanks for being here. I'm Frédéric Lardinois, I'm the enterprise editor for TechCrunch, and with me today is Nathan from Graphcore, who you probably saw yesterday. But why don't you introduce yourself?

Yeah, so I'm Nathan Harper. I'm a member of the cloud development team at Graphcore, and I've been working on integrating our hardware products into OpenStack for about the last two years at this point.

And for those who weren't there yesterday, what does Graphcore do?

At Graphcore we have developed our own silicon, systems, software and SDKs, all of which have been designed for AI and machine learning from the ground up.

And what does that look like in practice? What can I do right now on Graphcore?

Right now, our systems are our IPU machines. An IPU machine looks like a server, but you can't use it like a server; you can't log onto it. It's effectively a network appliance containing four of our IPUs, and a user accesses it over the network. We try to abstract as much of that away as possible, because users shouldn't need to deal with it. From a practical user's point of view, we've been pushing very hard to make sure that our SDKs and frameworks are as well supported as what you would see on other AI accelerators. So if you use PyTorch, you can use PyTorch on the IPU. If you use TensorFlow, you can use TensorFlow on the IPU. And if there's anything that doesn't work, those are things we're interested in hearing about so that we can fix them. We've been planning ahead for the next release of our SDK, and the intention is that at that stage everything should be one-to-one: if you have a model that has been trained and run on GPUs, in PyTorch for example, you should be able to take that same thing and run it on IPUs without having to make any code changes.
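For a rough idea of what running PyTorch on IPUs looks like today, the sketch below wraps an ordinary PyTorch model with Graphcore's PopTorch library, which ships with the Poplar SDK. This is a minimal sketch from memory of the public PopTorch documentation, not an official recipe; check the wrapper and option names against the SDK version you have installed, and note that actually executing it requires an IPU (or the SDK's emulator).

```python
import torch
import poptorch  # Graphcore's PyTorch integration, installed with the Poplar SDK

# An ordinary PyTorch model, written with no IPU-specific code.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

# IPU execution options live outside the model definition.
opts = poptorch.Options()

# Wrap the model for IPU execution; it is compiled for the IPU on first use.
ipu_model = poptorch.inferenceModel(model, options=opts)

out = ipu_model(torch.randn(4, 128))
print(out.shape)  # torch.Size([4, 10])
```

The point being made above is that, with the SDK release Nathan mentions, even this thin wrapping step should become unnecessary for models that already run on GPUs.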
So you're basically the Nvidia of OpenStack.

There's no point trying to... I like to think of us as sitting in our own space.

Fair enough. I had a feeling you were going to say something like that. Now, you made this bet on OpenStack pretty early for your infrastructure, right? Can you talk a little bit about that journey and why you ended up using OpenStack?

Yeah. So, as one of the things I mentioned in the keynote yesterday, we have our reference systems, our Pod64 systems: you get some application servers, some IPU machines, and the networking that connects it all together. The reference design for that was very static. All of the systems in a pod would be in a single VLAN, and each pod would be entirely separate from all the others. Most of the time that would be fine. It works, and you'd absolutely hit all the performance you'd expect. The challenge came from what the developers required: if they needed something different, an alternative operating system or different packages, but crucially if they wanted something different from what their colleagues who were also using the same machine or the same pod had. That was where we'd end up with users crashing into each other, because they were both trying to do different things with the same hardware at the same time.

So we had this real drive: how do we enable our developers, give them self-service access, and also prevent users, without any maliciousness, from affecting each other by changing the config of a system or using IPUs that weren't allocated to them at the time? When we started using OpenStack, it allowed us to automatically carve up those systems in a way we weren't really able to do before. We could do it before, but it would all be driven by hand and would require administrators to log into switches, switch VLANs, and manage which VLANs were associated with what. What we've got out of the solution we have in OpenStack now is that users don't need to know what VLAN they're on; OpenStack and Neutron deal with that for them.

Does the user actually ever see OpenStack?

It depends on the user. A lot of our users use the Azimuth self-service interface, because it abstracts them away from all the complicated questions they don't need to worry about: is their network a VXLAN or a VLAN, have they turned on the right features on their Neutron ports to make this work, and so on. So I'd say a good 80% of our users are consuming OpenStack, but through Azimuth, so they don't have to look at it directly. Part of the remaining 20% is the slightly more power users, those who actually want to drive Terraform or the OpenStack CLI and make active use of the API. And the last chunk is system-to-system comms. One of the use cases we've got is managing CI runners: systems get provisioned, connected to IPUs, run a CI job and then get torn down, so that every single CI run gets a fresh instance, there's no crossover, and no cruft is left behind. One of the drivers in that particular case is that if you've got partners, trusted or not, or if you're taking pull requests from external parties and running CI on them, you're effectively inviting code written by someone else to be executed on your infrastructure. So having a fresh deployment every single time means that anything left behind, malicious or not, is gone.
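To make that ephemeral CI-runner pattern concrete, here is a minimal sketch using the openstacksdk Python client. The cloud, image, flavor and network names are illustrative assumptions, not Graphcore's actual configuration.

```python
import openstack

# Credentials come from a clouds.yaml entry; "ci-cloud" is a made-up name.
conn = openstack.connect(cloud="ci-cloud")

# Boot a fresh runner for a single CI job (image/flavor/network are assumptions).
server = conn.create_server(
    name="ci-runner-job-1234",
    image="ubuntu-22.04",
    flavor="ipu-host.large",
    network="ci-net",
    wait=True,
)

try:
    # ...register the instance with the CI system and run the job here...
    pass
finally:
    # Tear the instance down afterwards so the next run starts from a clean
    # machine and nothing a previous job left behind can carry over.
    conn.delete_server(server.id, wait=True, delete_ips=True)
```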
Well, those IPUs might make for good crypto miners, maybe?

Well, yeah, these are problems that a lot of service providers have run into. Even if people aren't using the IPUs, if you've got some nice juicy AMD CPUs, people will always find a way to run miners on them.

There you go, that's my business plan for when I'm done here. But what does the scale look like for you right now? How many clouds are we talking about? How many servers?

At this point we've got five clouds of varying sizes. The largest ones are, I think, about 12,000 cores apiece, with hundreds of IPU machines associated with them. Rather than deploying one large multi-region system, they've generally been deployed for different purposes. We have internal systems made available to our developers. We have an external system so that we can make systems available to customers on a try-before-you-buy approach. We have another system that we're currently bringing up which is going to be focused much more on bare metal deployments. So we've got a lot of flexibility. Rather than trying to make one cloud do everything for us, we decided to build systems that are each focused on a purpose, and that way we don't have to worry about geographic complexity either.

Makes sense. And maybe just to take a step back, what's really different about AI workloads compared to other workloads, and where does OpenStack fit in there?

There's probably a difference between the AI workloads most people encounter and AI workloads in Graphcore land. For a lot of users, AI workloads are very much about access to GPUs, and those GPUs will generally be directly attached to your VMs or your bare metal systems, which drives a very particular sort of workload. Because of our disaggregated approach, there's much more of a traditional high-performance-computing element to it. Our IPU-over-fabric protocol uses RDMA over Converged Ethernet, so the performance of the networking, and being able to do line-rate RDMA inside OpenStack, is vital to making our systems work.

Got it. And to make all of that work, I think you also had to write your own Ironic drivers, right?

At the moment we are carrying a patched version of Ironic, in particular to drive our IPU machines. The intention is to make sure that gets made available upstream. That's one of the nice things about being here at the OpenInfra Summit: you get to actually talk to people face-to-face about these things. We had the opportunity to talk to some of the Ironic team about why we were doing what we were doing, and their response was: that's a great use case, let's try and make it so.

Do you open source a lot of what you're working on?

In terms of the OpenStack side of things, we've got reference architectures for how we've been doing things. There hasn't been an enormous amount of specific custom code; it's more about sharing our best-working recipes for how to make Nova fly, how to make our VMs achieve the same sort of performance that we get out of bare metal. In our keynote yesterday, John talked about running MLPerf: that was our benchmark, and that's what we were using to make sure we actually achieved the kind of performance we wanted. There were some firmware and BIOS settings that, once we'd applied them, actually let us get better performance out of our VMs than we were getting out of our reference bare metal systems at the time. And when we then took those settings and applied them to the bare metal, the bare metal got faster too. So the whole process we went through in driving this had a benefit for what we were doing in OpenStack, but also for the wider Graphcore teams that weren't using OpenStack-based systems at the time.
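As a hedged illustration of the kind of "make Nova fly" recipe being described, the sketch below shows two generic levers for getting VMs close to bare-metal behaviour and toward line-rate RoCE: flavor extra specs for CPU pinning, hugepages and NUMA placement, plus an SR-IOV ("direct") Neutron port for the RDMA-capable NIC. These are standard OpenStack tuning knobs rather than Graphcore's published configuration, the openstacksdk method names are from memory, and the cloud, flavor and network names are made up.

```python
import openstack

conn = openstack.connect(cloud="dev-cloud")  # illustrative cloud name

# Flavor extra specs commonly used to close the gap between VM and bare-metal
# performance: pinned vCPUs, hugepage-backed guest memory, one guest NUMA node.
flavor = conn.create_flavor("ipu-host.perf", ram=128 * 1024, vcpus=32, disk=100)
conn.set_flavor_specs(flavor.id, {
    "hw:cpu_policy": "dedicated",   # pin vCPUs to dedicated host cores
    "hw:mem_page_size": "1GB",      # back guest RAM with 1 GiB hugepages
    "hw:numa_nodes": "1",           # keep the guest on a single NUMA node
})

# A "direct" (SR-IOV) port gives the guest a virtual function on the
# RoCE-capable NIC, the usual route to line-rate RDMA from inside a VM.
network = conn.network.find_network("rdma-fabric")  # illustrative network name
port = conn.network.create_port(
    network_id=network.id,
    name="roce-vf-port",
    binding_vnic_type="direct",
)
```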
We were talking about this before: you've been playing around with OpenStack for quite a while, but also with some other systems before that, right? Now, if you had to go back... I think you started using OpenStack at Graphcore in '21? 2020?

Oh, sorry, in terms of personally, or...

At Graphcore.

At Graphcore, yeah, we've been using it since 2021.

And you personally have been playing around with it for much longer, right?

Yeah, I was trying to work out when my first OpenStack deployment actually was... I think it was Ocata.

Goes back a while?

Yeah.

Now, if you had to do it all over again at Graphcore, what would you do differently?

I think, with the benefit of hindsight, there were probably a lot of capabilities and features inside OpenStack that, for good reason, we chose not to use. For our first deployments we were very focused on achieving our goal, which was to take our IPU machines, put them into OpenStack, and achieve a very similar capability to what we had on our bare metal systems. As a result, there was a variety of features we chose not to turn on. With the benefit of hindsight, now that we've achieved that goal, there are a lot of things it would have been useful to turn on. A notable one is that on our very first system we didn't deploy Octavia, which then suddenly became a bit of a headache whenever we wanted to start doing Kubernetes in that cloud.

Octavia, the load balancer?

Yes. But at the same time, that really is with the benefit of hindsight, because the thing we also need to manage is that when you're building a new OpenStack cloud, especially when you're using something like Kolla-Ansible or Kayobe and you've got the full list of things you could turn on, it would be very easy to just go: yes, enable, enable, enable, I want all the things. But then you might end up deploying things you never use, or things which just confuse matters. Some things, when you turn them on, are simply on; others need post-configuration setup once they're enabled, and if you don't know exactly what you need to do, you can end up biting off more than you can chew from day one. So it's definitely a balancing act: how much do I want to enable, without making it overly complicated?

What does it look like today? Compare your OpenStack deployments from a few years ago to what you're doing today.

The scale has definitely got bigger. Our very first OpenStack deployments weren't static, they were still driven by automation, but we would generally be turning over one of our virtual pods every couple of days. Today we get maybe about 70 vpods deployed every day, and they get turned over every day, looking entirely different each time. So every day our developers get access to a fresh system that may be way larger or way smaller than the system they had the previous day.

Sure. And I always like to end on a bummer, because we only have ten seconds left: what's still your biggest challenge when it comes to deploying OpenStack right now?

Our challenge is that even though we've got a really nice process, a really nice template for how we're going to do things, when we deploy into a new geography, a new data center, we run into different challenges.
A lot of it is around the networking, the networking that has been provided to us by the colo or data center provider we're working with, where things just look very slightly different. It's the cascading effect of: oh, we'll just deploy this, it's going to be exactly the same as that last system, we're just going to change that one thing, and then the unintended consequences of changing that one thing.

I've seen that before. Awesome. Well, thank you so much, our time is up here, I'm afraid; these 15 minutes go fast.

They do indeed.

Thank you, Nathan.

Thank you.