OK, we're at 10:02, so I'm going to kick off. Thank you everyone for coming. This is a bigger crowd than I'm used to dealing with, so this should be interesting. My name's Stephen Finucane, I'm a software developer working for Red Hat. I primarily focus on NFV and SDN kind of stuff, so anything to do with CPU pinning and NUMA, those problems tend to land on me. Today I'm going to be talking about a new feature that was introduced in Nova in the last cycle, the Rocky cycle, called NUMA-aware vSwitches. And given the size of the crowd that's here, I'm also going to cover some similar features in the same area that I think people should be aware of. I'm coming from a development background, but this talk is mostly focused on the deployment and usage side of things. I won't go into too much detail about how any of these things actually work under the hood, but I'll leave time at the end for questions, and if anyone wants to catch me afterwards, you can do so. I'm happy to answer any questions you may have.

So, the agenda for today. I'm going to give a quick overview of what NUMA is. Again, very high level; I won't go into too much detail about it, just enough that you can understand the kinds of problems it can cause for deployments and why we need to come up with solutions like NUMA-aware vSwitches to work around them. I'm going to go through the actual NUMA-aware vSwitches problem itself and a couple of the approaches we tried, before finally talking about the solution we settled on. After that, I'm going to go through some common questions people have asked me about this over the last couple of months. Then, in a bonus section, I'm going to cover some related features that I think people should care about, and then wrap up with questions.

So, starting from the top: what is NUMA? It stands for non-uniform memory access, and it's a processor architecture that has seen widespread adoption on server platforms over the last decade or two. Pretty much any modern server with multiple CPUs, multiple physical packages, is going to be dealing with NUMA unless you've explicitly disabled it. Graphically, this is essentially what it looks like: you have at least two nodes, and for processes running on one node, be it a physical socket or a die within the same package, accesses to memory associated with that node are as fast as you'd expect. If they want to access memory associated with another node, they incur a performance penalty. That penalty can vary, but 50% is quite common for something like this vSwitch use case. If you're talking about multiple sockets, and this has changed with Intel processors over the last year or so, it's more of a mesh topology now, jumping between different NUMA nodes can have an increasingly negative effect on your performance. So naturally, on a platform where NUMA is enabled, what you want for almost any process is to make sure that memory accesses are local to the node wherever possible, so you're not having to jump across these links. And this doesn't just affect your processes; it also affects anything on your host that consumes memory. That includes PCI devices, things like NICs, because they have this same issue.
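As an aside, you can see this for yourself on any Linux box, because every PCI device advertises its local node through sysfs. A quick sketch, where eth0 is just a placeholder name for a PCI-attached NIC:

    # prints the NUMA node the NIC hangs off (-1 if the platform doesn't report one)
    cat /sys/class/net/eth0/device/numa_node

    # and numactl will show you the full node/memory layout of the host
    numactl --hardware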
If you have a PCI NIC, it will be associated with a given NUMA node, and if you have processes running on different NUMA nodes, you're going to take those performance hits.

So when we talk about the kinds of networking that are available, there are three. There's a load more, because networking is weird and special, but the three that most people spend their time talking about would be kernel vhost (plain virtio), user space vhost, which is generally typified by things like a DPDK-based vSwitch, and then SR-IOV. Each of these comes with its own pros and cons. I've listed flexibility versus performance, but there's a lot more that has to be taken into consideration, including cost and things like that. The most useful comparison here is between user space vhost and SR-IOV, because those are the ones you'd most likely see in an SDN or NFV-based deployment.

So if we look at this diagram: two VNFs, two physical networks, and traffic going between the VNFs and the physical networks, plus some management stuff which we're going to forget about. Because we have this NUMA topology to be concerned about, that previous diagram ends up looking something more like this. Unless you've specifically said otherwise, each VNF instance is going to be isolated to a single host NUMA node. And what you don't want is your VNF running on a different NUMA node from your NIC, where all the traffic is ingressing and egressing from the host. Naturally, you want to co-locate them on the same NUMA node.

For something like SR-IOV, this is pretty easy to do, because Nova, ever since it added support for SR-IOV and PCI passthrough, has taken NUMA topologies into account. There's actually an uncomfortably tight coupling between NUMA and things like CPU pinning, which I'm going to talk about later; we're trying to separate that going forward. But if you boot an instance with SR-IOV interfaces, you will get NUMA affinity; it won't let you boot that instance unless it can get that affinity. That's not the case for something like OVS, because Nova doesn't have that view into which NICs your chosen vSwitch platform is actually using. And, as I'm going to get into now, we don't want to have to teach Nova how to do all of this, because Nova cares about compute; it doesn't care so much about networking.

So, how we attacked this problem and came up with a solution. The solution we landed on obviously wasn't what we arrived at on day one; it was built up iteratively by looking at the various choices we had. Starting off, we're talking about vSwitches, and Nova doesn't need to know about networking, so the simple answer would have been: let Neutron do all of this for us. The issue is that you need to think about what Nova is aware of and what Neutron is aware of. Nova knows about compute resources, basically what kind of basic resources you have available to you, whether you're using libvirt with KVM or just plain QEMU, that kind of thing. It also knows things like NUMA topology, because it has that introspection through the libvirt API or the XenAPI and so forth. Neutron doesn't know any of this, and just as we don't want Nova to start learning how to inspect OVS, we don't want to have to teach Neutron how to inspect libvirt and Xen and Hyper-V and whatever your chosen hypervisor is.
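Before moving on, it's worth seeing why SR-IOV gets this for free: Nova already consumes the PCI device details directly, because the operator whitelists the devices and ties them to a physnet in nova.conf, so Nova can look up each device's NUMA node itself. A rough sketch of that existing configuration, with a made-up device name:

    [pci]
    # ties this NIC, and implicitly its NUMA node, to a physnet;
    # "ens2f0" is just an example device name
    passthrough_whitelist = {"devname": "ens2f0", "physical_network": "physnet0"}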
Placement seems like the saviour for pretty much every problem OpenStack has, at least in the Nova and Cyborg and Neutron sphere. The problem with placement was that, at the time we were working on this, and even now, the features we needed weren't supported. We could have waited three or four cycles for those features to land in placement, but placement has already been in development for quite a long time, and 50% is quite a substantial performance hit. At Red Hat we were trying to support custom, hacky workarounds that involved restarting services and so on, and we didn't want to keep doing that. So essentially, placement wasn't a viable solution for us at the time.

Nova: again, there's this comparison of what Nova knows and what Neutron knows. The biggest issue is that Nova doesn't know how to inspect OVS and pull information out of it, and we didn't want to teach it that. But we thought about this a little longer and realised there wasn't actually any real reason to go and do that introspection; we were able to work around it. So we settled on Nova as the solution, but with some caveats, which I'm going to explore now.

When you're trying to categorise Neutron networks, there are two primary ways you'll see them categorised if you go through the Neutron documentation. There's a load of other stuff too; I found it quite interesting, while working on this, that I'd be talking to Neutron cores and asking them questions, and before long I started getting back "we don't know" as an answer. The first is provider networks versus tenant networks, which is a very Neutron-y distinction; it's more to do with who's doing the configuration of the networks and what the underlying topology is. It meshes a load of things together, but it didn't really apply here. Provider networking versus self-service networking, again, didn't quite map to what we wanted. What we actually wanted to know was whether a network was an L2 network, using something like a physnet (a physical network), or whether it was an L3, tunneled network. The reason is that whether it's an L2 network or an L3 network determines how many NICs the network is allowed to use and how they're configured and accessed.

If we look at the sample configuration, this is from one of the configuration files for Neutron. If you're using the ML2 OVS plugin, this configuration option maps an OVS bridge, which I guess you can consider your primitive, to a Neutron physnet tag. So it says: if you have traffic going over this physnet, this is where it flows in and out of the host. On the Nova side of things, we introduced a new configuration option plus dynamic config groups. The option lets you say which physnets are available or accessible on your host, and then, plainly, which NUMA node this physnet maps to, which NUMA node that physnet maps to, and so forth. The advantage of this approach is that we didn't need to teach Nova how to do that inspection: I didn't need to learn how to use the OVS DB APIs and import a load of code from Neutron or reimplement it. And any deployment tool worth its salt is doing the configuration of these networks anyway.
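Concretely, the two halves end up looking something like this; the bridge and physnet names here are just examples:

    # Neutron side: openvswitch_agent.ini on the compute node (ML2 OVS agent)
    [ovs]
    bridge_mappings = physnet0:br-physnet0

    # Nova side: nova.conf on the same node
    [neutron]
    physnets = physnet0

    [neutron_physnet_physnet0]
    numa_nodes = 0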
So that already knows all this information about which NICs you have on your host and is able to introspect and find out what the NUMA node is and so on. This moves it from something that has to happen at runtime to something you can configure as part of your deployment tooling. An interesting point that came up during this: in real production deployments, most people tend to use NIC bonding to make sure they have active-passive failover and fancy networking stuff like that. So we made sure this numa_nodes configuration option could be a list. If you had multiple NICs connected and for some reason you decided you wanted to place them on different NUMA nodes, you'd be able to do that; without this option being a list, you wouldn't.

L3 networks are a little different. Where L2 networks can be typified by having multiple physnets on a host, L3 networks all have to go through the same endpoint: you configure a single endpoint IP. You can use multiple actual interfaces, but they'll have to be bonded together. So, and again this is an example from the ML2 OVS plugin, the configuration from a Nova perspective is actually a lot simpler for the tunnel side of things. And again, we were aware that people do bonding and can spread this over multiple NUMA nodes, so we made this a list too.
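Roughly what that pairing looks like, with an example tunnel endpoint IP:

    # Neutron side: openvswitch_agent.ini; the tunnel endpoint on this host
    [ovs]
    local_ip = 192.0.2.10

    # Nova side: nova.conf; one group covers all tunneled traffic,
    # and numa_nodes can again be a list for the bonded case
    [neutron_tunnel]
    numa_nodes = 0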
And that's pretty much NUMA-aware vSwitches. It's not overly complicated, and in tests we've seen, without this feature, like I said, up to a 50% performance hit was possible in the situation from that earlier diagram, where a workload was running on one NUMA node and your NIC was located on a different NUMA node, so everything was going across that bus. This basically eliminates that as a realistic issue. From the deployment side of things, the version of TripleO, the OSP deployment tool, that supports configuring this manually is OSP 14, and the plan is for the deployment tool to just do it for you, without you needing to worry about it, in OSP 15 and 16 and so forth.

A couple of the common questions I've been asked about this. First, why is it so manual? It is a pretty low-tech solution given the solutions that had been proposed. The reason, going back, is that deployment tooling knows all this information already. We didn't feel there was a need to introduce a whole load of additional complexity into either Nova or Neutron and teach those systems about things they previously didn't know about. Placement: yes, you could automate all of this, and we're exploring doing that in TripleO going forward. Placement was explored and looked at in great detail; we made the determination that it wasn't quite ready for this, but going forward we will probably be exploring moving this to placement. On that note, there's a really interesting talk happening later this week, a demo of bandwidth-aware scheduling, all of which uses placement. If anyone here, especially the more technically focused people, is interested in this stuff, I'd highly recommend going to that.

So, outside of NUMA-aware vSwitches, there have been a couple of other changes in recent cycles, all focused on improving the performance or the determinism of instances, with an NFV focus. The first of these is a thing called configurable TX/RX queue sizes. In short, if you have an instance running and something on the host or within the instance preempts your main workload while traffic is flooding into that instance, the workload stops where it gets preempted and work happens on something else before returning to it. You have queues, and those queues can fill up pretty damn fast, which results in dropped packets. Talking to architects, dropped packets seem to be about the worst thing that could ever happen. The solution is simply to make the queues bigger. In all the testing that Red Hat and various other companies have done, bumping the default, which was 256, up to 1024 was seen as sufficient to get the performance people were demanding. But this is configurable, because we've moved from 10 gig to 40 gig, 100 gig is already in production I guess, and who knows where it goes after that. This is available in Rocky, and a sample configuration is provided there.

Another one; this is more useful for people thinking about real-time workloads. If you are running under a hypervisor, the hypervisor itself has to do a certain amount of work, whether that's basic I/O or just cleanup tasks. These can steal resources from CPUs, and if your vCPUs are running real-time workloads, you don't want anything else stomping on them. The initial solution to this was to assign a dedicated core to each instance that all this overhead work could get thrown onto, keeping it away from your main workload cores. That was introduced in Ocata, but in Rocky we've built on it so that, instead of a single extra core per instance, you have a pool of cores, and all those overhead tasks from all instances are pooled onto it. There's a sample of how you configure the former and a sample of how you configure the latter from the command line; for the latter, there's also some Nova configuration necessary.

And then the last one, which is still under development at the moment. I said earlier on that we have this unfortunate thing in Nova where NUMA and things like CPU pinning have been really closely coupled, for not very good reasons. We are moving things like bandwidth-aware scheduling into placement; we've moved VCPUs, memory, disk, all of that into placement. The next step, from the NFV perspective, is to start tracking PCPUs as well as VCPUs. We're expecting this work to be completed in Stein. From the user side of things, this will have two implications. The first is that the days of having to split your cloud into hosts that run dedicated instances, with pinned CPUs, and hosts that run non-dedicated instances, with normal shared, floating CPUs, are over: they'll be able to coexist on the same host. The other is that you'll be able to mix pinned cores and non-pinned cores within the same instance. So you'll be able to say: I've got eight cores in my instance; I want six of them to be pinned, and two that are just doing internal OS overhead stuff that don't need to be pinned, they can float. We're going to do all of this by tracking it in placement. It's not completed yet, but this is roughly what it's going to look like; anyone that's worked with placement before should recognise these kinds of commands. This will all be exposed via nova.conf configuration.
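Since those samples were on the slides, here's roughly what they look like. The queue sizes are a pair of libvirt options in nova.conf:

    [libvirt]
    rx_queue_size = 1024
    tx_queue_size = 1024

The emulator thread placement is driven by a flavor extra spec; "my-flavor" is just a placeholder name:

    # Ocata onwards: one additional dedicated host core per instance
    openstack flavor set my-flavor --property hw:emulator_threads_policy=isolate

    # Rocky onwards: emulator threads from all instances share a pool of cores
    openstack flavor set my-flavor --property hw:emulator_threads_policy=share

For the share policy, the pool itself comes from Nova configuration on the compute node:

    [compute]
    cpu_shared_set = 0,1

And since the CPU tracking work isn't finished, this last one is only a sketch of the sort of query you'd expect once PCPU exists as a resource class alongside VCPU, assuming you have the osc-placement plugin installed:

    # e.g. find hosts that can fit six pinned and two floating cores
    openstack allocation candidate list --resource PCPU=6 --resource VCPU=2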
The last one; this is the personal bane of my life, because of the number of times I've had to support issues around it. Live migration for instances with a NUMA topology or CPU pinning is, at the moment, completely and utterly broken, and it has been broken since it was implemented. There are some workarounds; more experienced operators will have their own little bag of tricks that they use. But the fact remains that it doesn't work out of the box, and it should. The solution is to fix it. The technical details of how we're going to go about that would take a session by themselves, but it's in the pipeline for this cycle.

So, to quickly recap what I've covered today before I move on to Q&A. Firstly, not accounting for NUMA can have a tremendously bad impact on your performance, and therefore you should account for it. NUMA-aware vSwitches were introduced in Rocky; this closes one of the remaining open issues we had around NUMA awareness. It's based on Nova configuration, and we think of it as mostly a deployment concern: future versions of TripleO, and I guess other deployment tools, will make this configuration almost automatic. We will be looking at moving this into placement in the long term, but I envision that being three or four cycles, so about two years down the road, before we're in a position to do that; until then, this is the way to go about it. And there are a whole load of other features released in the last cycle or two that I'd highly recommend anyone working in this area look into. With that, that's the 10,000-foot view of NUMA-aware vSwitches. If anyone has any questions, feel free, now or after the talk, and thanks for listening.

Audience: You mentioned that some of the features landed in Rocky and some landed in Ocata. When you upgrade from an earlier version to one that supports this configuration, does the OSP director upgrade take care of it automatically, or does that need to be done manually? And once it's integrated from Rocky onwards, do both single-version upgrades and fast-forward upgrades handle it?

My understanding is that it depends on the feature you're talking about. TripleO will handle migration: if you have a feature that existed in one release and changed in another, it'll handle the migration between the two. But when you're adding additional features, it doesn't tend to tweak any knobs around that.

Audience: Specifically the NUMA configuration you introduced.

Yeah, no, it wouldn't turn that on by default. I'm not sure exactly how you'd go about doing that, but an operator would need to supply input to say this is something that should be switched on now. We wouldn't do it by default, in case we broke something or somebody didn't want it.

Any other questions? If not, I can run away. Thank you very much for listening, and enjoy the rest of the summit.