Hi, everyone. Thanks for coming. I'm Scott, a Cloud Engineer at G Research. There's a picture here of me down at ALICE at the LHC, from OpenStack Days a couple of years ago. I'm an OpenStack engineer, a Cloud Engineer at GR, and I mainly focus on deploying OpenStack and Ironic. I'm Ross Martin. I've been at GR for a similar amount of time, about four years, working as a Cloud Engineer, and I mostly focus on OpenStack and Ceph. I imagine quite a lot of you probably haven't heard of us. We're G Research, Europe's leading quantitative research company. We're mostly based in London, in the West End, but we're actually expanding into the US this year, into Dallas, which is really exciting for us. What do we do? We create software to analyse and manage large data sets, look for patterns in that data using some of the latest machine learning techniques, and try to predict future movements in financial markets around the world. So, a little bit about open infrastructure at G Research. Historically GR was a bit of a Windows shop. We're currently going through a multi-year digital transformation, with a big shift towards using a lot more open source technology: things like OpenStack, Linux, Kubernetes, that kind of thing. To deploy our OpenStack cloud we use Kolla, Kolla-Ansible and Kayobe. For those of you who don't know what those are: Kolla lets you build OpenStack services as Docker containers; Kolla-Ansible lets you orchestrate that and deploy it across your infrastructure; and Kayobe lets you do that on bare metal nodes, and also helps with other things like generating the configuration, setting up the network, installing Docker, pip, all that kind of thing. We also have Ceph backing our OpenStack cloud, and we use ceph-ansible for that. We're currently on the OpenStack Train release. We've had a few issues getting off CentOS 7, but we're now coming over the horizon on that and moving to CentOS 8, and then hopefully we'll upgrade quite aggressively, so by the time we get to Vancouver there might be a happier story to tell. Ceph is currently on Nautilus, which we upgraded last year from Luminous. With our open infrastructure we try to keep as close as we can to the upstream stable branches. We contribute back a lot of what we do and currently have around 67 commits merged into OpenStack projects. We also get involved with the community: we go to the IRC meetings for Kolla, we get involved with the PTG, and we have a development team within GR dedicated to making open source contributions on behalf of the company. We're also members of the OpenInfra Foundation and of the CNCF. I'll just talk about a few of the challenges we have at GR. As with every cloud, there are challenges, and these are the four headings we think of them under. In terms of flexibility, we've got lots of different teams, lots of diverse use cases, and lots of different types of hardware to support them. While some teams might just need a generic virtual machine for dev work, others need specialised equipment, accelerators and things like that, for more specific work such as deep learning, as Scott mentioned. So we need to be able to run a heterogeneous cloud that can support all of this and more, and further to that we need to be able to pivot quickly when something new comes along.
When there's a new hardware accelerator or a new networking technology or something like that, we need to be able to take advantage of it quickly. Then efficiency: being a research company, we've got a fairly significant estate, and to make the most of it we need to be efficient in everything we do. It's not good enough for us to leave spare compute on the table; one or two percent really multiplies up at scale. And it's not only our compute, we also need to be efficient with our deployment and support processes. We need the number of engineers required to manage the estate to be as small as possible, while still staying on top of maintenance and making changes quickly. And while we need to be efficient, we also need performance from the estate; we need to push the boundaries of performance to give us an edge against our competition, and the quicker we can complete some research and get it to market, the quicker we can start some more. Finally, probably one of the biggest things for us is security. We need to be sure that our platforms conform to the latest security standards and maintain a high level of confidence in the integrity of the estate. One of the ways we're looking to achieve this is a monthly rebuild cadence, in which we aim to rebuild every server, every month. Obviously that means we rely really heavily on automation, but if we get it right, the estate will always be clean, patched and reproducible. By doing that, securing our deployment pipelines and taking humans out of the loop, using infrastructure as code, peer review and Jenkins automation to deploy, we can ensure a high level of security across the platform. So while it's quite straightforward to deploy OpenStack for traditional virtual machines, it becomes a lot more complicated when you take into account some of the challenges we have, and the ability to schedule instances with different hardware requirements, and different software requirements on top, can be quite overwhelming.

I'll kick off with a couple of our use cases. This is the first one. Our general compute platform traditionally consists of CPU and memory dense servers providing generic machines. They're pretty vanilla: they have access to a little bit of local SSD and to a large all-flash Ceph cluster presented to users via Cinder, which gives them access to RBD and the RADOS Gateway. One of our platform teams in GR was set the task of providing our researchers with a secure environment to test and develop the models they build before they're submitted onto our much larger compute farm. To make this possible, they wanted to build a hardened Linux virtual machine for the users which had direct access to a GPU. And when we say GPUs, we're not talking about GPUs you'd pick up on eBay; these are the data-centre cards, things like the A100, specifically designed for machine learning and deep learning, and they can be really dense: we get eight of them in a 6U chassis. So when the team approached us with this task, we had two options for presenting a GPU into a virtual machine. The first is a vGPU. This allows an administrator to split multiple workloads across a GPU, essentially sharing a portion of the GPU rather than the whole thing.
You do have the option to pass the whole thing that way, but there is a limitation in Nova which means you can only pass a single GPU or vGPU through to a guest; you wouldn't be able to do two, three, four or up to eight in that case, across eight separate instances. This can be really useful for things like VDI use cases, but for us that's not really our primary focus. The second option is to present the whole GPU to the server via PCI passthrough. This method means the guest has full control over the device and its memory and should be able to use 100% of its performance, and for HPC that's obviously really important. So, no surprise, we decided to go with passing through the entire GPU rather than a portion of it; for us, the extra configuration and the extra virtualisation didn't really provide any benefit. And as I said, our users can use more than one GPU rather than just one, so that kind of flexibility is really good. But that's just our use case, so obviously evaluate your own options.

The second task we were challenged with: a team approached us needing NVMe-attached storage, raw NVMe devices presented into the guest for things like databases and caching servers. Our users wanted to keep their usual workflow, using OpenStack and Terraform to deploy their machines, but they needed the additional hardware. On the OpenStack team's side, we wanted to make sure we reused the lessons learned from the vGPU and GPU work. PCI passthrough was actually the only option for NVMe, there's no virtual NVMe option, but when we set up the GPU passthrough we had it in mind that this would change in the future and there would be other pieces of hardware, so we laid the foundations that allowed us to do this part with very little engineering effort. And to do this we used a Nova and Placement feature called traits. We'll go into a bit more detail about what traits are and how they're applied to hypervisors a little later, but first we'll quickly take you through some of the GPU passthrough setup.

So, a quick couple of examples. PCI passthrough is super easy to configure, but it's useful just to see it. The most important thing is that you need the vendor and product ID for the device, which you can look up quickly with lspci. Once we have that, we just need a bit of configuration on the controller so that the Nova API understands what that device is: we give it an alias with the product ID and a friendly name, and we also need to enable the PCI passthrough filter in the scheduler. Additionally, on every compute node you need to add the same alias plus a passthrough whitelist; the whitelist basically allows Nova to hand that device out. In order to pass the device through we also need to do a little bit of host configuration by modifying some grub options: we essentially blacklist the nouveau driver to stop the OS picking the device up when it boots, and we enable the IOMMU, which gives better performance when moving memory between the GPU and the host. Once you've done that you can update grub and give the box a reboot to apply the changes.
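To make that concrete, here's roughly what the pieces described above look like. This is a minimal sketch rather than our exact configuration: the vendor/product IDs, alias name and boot options are illustrative and will vary with your hardware and distro.

    # Find the vendor and product ID of the device (an NVIDIA GPU in this example)
    lspci -nn | grep -i nvidia
    # ... NVIDIA Corporation ... [10de:20f1]

    # nova.conf on the controllers (nova-api / nova-scheduler)
    [filter_scheduler]
    enabled_filters = ...,PciPassthroughFilter

    [pci]
    alias = { "vendor_id": "10de", "product_id": "20f1", "device_type": "type-PCI", "name": "gpu-a100" }

    # nova.conf on each compute node: the same alias, plus the whitelist
    [pci]
    alias = { "vendor_id": "10de", "product_id": "20f1", "device_type": "type-PCI", "name": "gpu-a100" }
    passthrough_whitelist = { "vendor_id": "10de", "product_id": "20f1" }

    # Host side: keep the host OS off the device and enable the IOMMU
    # (edit /etc/default/grub, regenerate the grub config, then reboot)
    GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt rd.driver.blacklist=nouveau nouveau.modeset=0"
    grub2-mkconfig -o /boot/grub2/grub.cfg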
Once you have that, we obviously need a flavour. You can just create a basic flavour, but it does need a property: the property, as you can see here, is pci_passthrough:alias, referencing that alias so Nova knows what it's pointing at, and we also configure the number of devices we want to pass through. In this case we're passing through the whole lot, all eight GPUs in that box. So now we have a flavour and OpenStack is configured; how does the scheduling actually work? In its most basic form, OpenStack scheduling uses flavours, filters and weighers. Flavours have CPU, RAM and disk requirements and can contain additional data, as we've seen. The scheduler requests a list of resource providers from Placement (compute nodes, as we know them) and filters them down based on the requirements of the flavour. There are lots of different filters you can enable, like the availability zone filter, image properties, there's a massive list. Once we have the list of hosts suitable for this instance, they're put through a weighing algorithm. This is straight from the docs, but as you can see, the weighers are used to sort the remaining hosts into an order of precedence, and they're configurable: you can configure the multipliers and tweak them to get the outcome you're after. A clear example of this is using them to prefer certain hosts over others, which is stacking or spreading. In this case the algorithm computes the weight of each host based on its remaining PCI devices and then sorts them; the host with the largest weight gets the VM, which will be the one with the most free devices, so essentially we'd be spreading instances out across the estate, distributing them evenly. If it's configured with a negative value we get the opposite, so it will actually stack them up, and we choose to stack them up, mostly but not completely. The main reason for that is that the fewer hosts that are in play, the quicker and easier we can rebuild the estate without having to move so many things around, which helps with our monthly rebuild strategy.

The next part is traits. Placement, or the Nova placement API as it used to be called, gives us the ability to schedule workloads with a lot more granularity and to identify what kind of workload can be scheduled on what kind of hypervisor, and when you do that you're using traits. A resource provider in Nova is essentially just a hypervisor, and you can think of traits as a form of metadata that you can add to a resource provider, or hypervisor, to represent certain qualities it has. For example, if you wanted to cordon off a percentage of the hypervisors in your cloud, this would be really easy to do using traits: all you'd have to do is mark all of your production nodes with a production trait. Let's go through an example of how this works. Here we've got three compute nodes, and as you can see from the list there's no real way to determine what qualities each hypervisor has; they all look the same. If we look here, we've got the default traits that are applied to the hypervisor by Nova; these are reported up to the Placement service. To add a custom trait to your hypervisor we use an OpenStack command, openstack resource provider trait set, and all custom traits have to start with CUSTOM_.
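Concretely, the flavour side looks something like the following; the flavour name and sizes are made up for illustration, and the alias matches the one defined in the PCI configuration sketch above.

    # Flavour that requests all eight GPUs via the PCI alias defined earlier
    openstack flavor create --vcpus 48 --ram 393216 --disk 200 gpu.xlarge
    openstack flavor set gpu.xlarge --property "pci_passthrough:alias"="gpu-a100:8"

And the stacking/spreading behaviour is driven by the weigher multipliers in nova.conf on the scheduler. As a rough illustration of the idea only; our real values differ, and you should check the allowed range for each multiplier in your release's documentation.

    [filter_scheduler]
    # A negative RAM multiplier prefers already-loaded hosts, i.e. stacking
    ram_weight_multiplier = -1.0
    # The PCI weigher orders hosts by free PCI devices of the requested type
    pci_weight_multiplier = 1.0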
When setting traits you've got to specify the full list for the resource provider in one go, and you do that by providing --trait multiple times; there isn't currently a way to append or remove a single trait, so just make sure you add them all at the same time. You don't need to worry about overwriting the default ones that were there: when you run this, if you're quick enough or you're watching, you'll see the default traits disappear for a moment, but within a second they come back. In this example I've added two traits, CUSTOM_FOO and CUSTOM_BAR. Finally, we add the flavour property to make the link between what the user asked for and where it can be placed within the cloud, so we've got CUSTOM_FOO in there as required. It's also worth noting that if you do this with a PCI device, you need to set the property for both the trait and the PCI alias; just because you're using traits doesn't mean you can forget about the PCI alias. What this gives us is real flexibility when logically separating the cloud, and it's really inexpensive to add these traits: there's no configuration to push out to the cloud, no restart of Nova or anything like that, you just apply it with the CLI. To achieve this at scale and avoid human error and operational toil, it makes sense to wrap it up in an Ansible playbook or some other automation script. An interesting feature we make quite extensive use of in Kayobe is the ability to run custom playbooks, and this provides a scalable way to apply and maintain the traits across our hypervisors without the need for an admin to log in and do things manually. What we do is wrap up the commands an operator would run in an Ansible playbook, and this gives us a repeatable way of ensuring the correct traits are applied to each hypervisor, controlled using built-in Ansible features like the inventory and host groups. There are next steps we could take; it's a pretty basic implementation. All we do is run the openstack commands, and to be fair that gives us what we need: the OpenStack CLI can return results as JSON, which is really easy to work with from Ansible. So as a next step we could wrap this up in an Ansible module, but for what we've done so far, and for this presentation, this is just a small example of how you can do it really quickly with little effort.
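Roughly, the commands look like the following. The resource provider UUID, trait names and flavour are placeholders, and note that the trait commands need a recent enough Placement microversion on the CLI.

    # List resource providers (one per hypervisor) and their current traits
    openstack resource provider list
    openstack --os-placement-api-version 1.6 resource provider trait list <rp-uuid>

    # Set the full list of custom traits in one go; 'trait set' replaces what was there,
    # and the standard traits are re-reported by the compute node shortly afterwards
    openstack --os-placement-api-version 1.6 resource provider trait set \
        --trait CUSTOM_FOO --trait CUSTOM_BAR <rp-uuid>

    # Tie the flavour to the trait and, for PCI devices, keep the alias property too
    openstack flavor set gpu.xlarge \
        --property trait:CUSTOM_FOO=required \
        --property "pci_passthrough:alias"="gpu-a100:8"

And a minimal sketch of the kind of custom playbook described here; the host group, variables and lookups are ours for illustration, not something Kayobe ships.

    # apply-traits.yml: run the same operator commands via Ansible, per host group
    - hosts: gpu_hypervisors
      gather_facts: false
      vars:
        custom_traits: [CUSTOM_FOO, CUSTOM_BAR]
      tasks:
        - name: Look up the resource provider UUID for this hypervisor
          command: >
            openstack resource provider list
            --name {{ inventory_hostname }} -f value -c uuid
          register: rp_uuid
          changed_when: false
          delegate_to: localhost

        - name: Apply the full list of custom traits
          command: >
            openstack --os-placement-api-version 1.6
            resource provider trait set
            {% for t in custom_traits %}--trait {{ t }} {% endfor %}
            {{ rp_uuid.stdout }}
          delegate_to: localhost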
So if we look back to what Ross was saying at the beginning about what we set out to achieve, how did we do? On the flexibility side, traits give us a really flexible way to logically separate our hypervisors within the cloud, so when new use cases come along we have the foundations in place to get ideas to market in a reasonable amount of time. On efficiency, by having full control over where things are scheduled, in combination with things like the stacking, we make the most of our pool of hardware. On performance, as we pass the entire device through to the guest, there are no additional virtualisation layers we don't need; the impact of a vGPU probably isn't that big, but if there's no real benefit to the virtualisation layer, you could argue: why bother in the first place? And on security, we use Kayobe and Kolla-Ansible as a secure deployment method, along with Jenkins and infrastructure as code, which lets us follow a GitOps workflow and makes sure people aren't doing things manually.

Looking forward, there are a few things we still need to think about, things we haven't solved yet. We don't think they're solved, but if you believe they are, please come and tell us. The first is that we need to clean these devices. When we pass through GPUs, the memory obviously gets used, and when we hand raw NVMe devices through, people write data to them. Currently we have to silo some of those machines to different groups to reduce the risk of data contamination, and that's not very efficient, so we need a way to ensure they're clean so that the next user that picks them up gets a clean environment; that would give us more flexibility in our estate. The next problem is how to allocate them fairly, because right now it's pretty much first come, first served, as long as you have the quota and access to the flavour. Some good news on this that I've learned in the last couple of days: a concept called unified limits, which I think John Garbutt from StackHPC came up with, will hopefully help with this, so that's something we'll be looking at soon; I think it's just been merged. And lastly, performance out of the GPUs: it helps to pin the vCPU processes and the GPUs to the same socket, and we have NUMA awareness enabled, but I believe the VM currently only presents a single PCI root, which means we can't tell which socket the GPU is connected to. If we can solve that, hopefully we'll get a little bit more performance as well. So that's pretty much what we have for you today. Thanks for listening; if you see us around, come and have a chat anytime. And if there are any questions, we'd like to open the floor. Do you want to use the microphone? Is that okay? Okay. So cleaning is an interesting topic, I'll probably head over to you afterwards, because we were... did that not work?
Sorry, it's a little bit quiet. Okay, let me try this. So cleaning is an interesting topic and we should probably talk about it afterwards. The other thing is, we've found that Nova is not consistent about which PCI devices it passes through to a VM. I'm not sure if it also affects soft reboots, but I think at least during hard reboots, or during hypervisor reboots, VMs may end up with different NVMe devices, which has caused a lot of confusion with our customers, because they suddenly saw data, or didn't see it anymore, because they got a clean device instead of the one they had. Do you have anything in that regard in your infrastructure? Have you seen that? Do you have a solution, or is it something not on your radar?

As far as I'm aware that's not on our radar; I don't think we've seen that, but obviously that's an interesting issue we should look at. Do you know what version you've seen that in?

I think we've been seeing that on Train as well, but certainly on Pike.

Okay, we'll definitely have a look at that, because yeah, that's an issue. Thank you.

So I was going to wait and talk to you guys about this afterwards, but since it also came up in the previous question: have you considered using a remote, disaggregated NVMe storage service instead of passing through direct access to the NVMe devices? There are solutions today that have Cinder support and will give you local NVMe performance. And by the way, they also solve the problem the previous question was about.

Okay, that's interesting to know, and definitely something for us to look at. We know that our particular use case needs raw access to the disk, like a raw device, so I don't know whether your solution would cater for that, but it works quite well passing through the device directly. But yeah, maybe I'll have a chat and see what options there are. Cool, thank you. Is there anything else? Okay, thanks.