Hello everybody, and welcome to this session on Neutron control plane performance improvements. Let's go through the agenda. I'm Rossella, and I will be talking about the L2 agent and performance. Then we have the distributed virtual router and security groups, which will be Brian. And then we have measuring RPC performance, with some very interesting data, by Kevin.

So let's start by introducing the L2 agent a little bit. The L2 agent runs on compute nodes, and it's in charge of configuring the virtual bridges. In the OVS implementation, we are talking about br-int and br-tun. br-int is the bridge that has the flows to tag and untag the packets coming from and going to the VMs. And br-tun, as the name says, is the bridge in charge of tunneling, so it's the one that has the flows to translate the VLAN ID assigned to the network into the segmentation ID. If you're using, for example, GRE tunnels, that would be the GRE tunnel ID. The agent's main task is to wire new devices. It basically checks periodically whether a new device was added, and by new device I mean a tap interface that is created by Nova to connect the VM. If it detects a new device, it will communicate with the Neutron server to get the device details, and it will wire the device. The L2 agent is also in charge of applying security group rules, that is, firewall rules, which in Neutron are implemented using iptables and ipset.

So if you want to improve the performance of the L2 agent, what are the areas you're going to look at? In my mind, we have two different categories of improvements. We have the ones related to what I call internal processing: the processing the agent does without involving any other process. For example, the way it detects new devices, the way it computes which flows need to be applied, and the way it handles failures. In this category, the best thing you can do to improve performance is to use better algorithms. In the other category, some external process is involved. For example, the OVS agent uses RPC a lot: to get port notifications, like when a port is updated, to get security group updates, to request the device information from the Neutron server, and to notify the Neutron server that a device is up or down. As I was saying before, the OVS agent also interacts with iptables and ipset to apply the firewall rules. And of course it communicates a lot with Open vSwitch, to apply the flows and also to get information about the interfaces. So if an external process is involved, an easy way to improve the performance is to reduce the overhead in the communication. And that's actually what we did.

So regarding the L2 agent, we can divide the performance improvements into three categories: improve the RPC calls, reduce the overhead in the communication with Open vSwitch, and provide a better way to handle agent restarts. Regarding RPC calls: when a device goes up or down, the agent sends a message to notify the Neutron server. Before Liberty, it was one message for every device. What we did was create a new call that is a bulk call, so that you can update many devices at once. We also added a parameter to this call, failed devices; this will be used in the future to improve the way the agent handles failures. We also improved another call, security_groups_provider_updated. This is the call that is sent when the provider rule needs to be refreshed. The provider rule is the rule that allows traffic coming from the DHCP server, so you need to update it when the IP or the MAC address of the port changes. Before, this message was triggering a full refresh of the firewall, so all the devices were updated. What we did was modify it to carry the list of devices that need to be updated: basically, the devices that are on the network where the DHCP port changed its IP or MAC address.
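As a rough sketch of that pattern (method and field names here are illustrative, not the exact Neutron RPC signatures), compare the per-device and bulk versions:

```python
# Sketch of replacing per-device RPC calls with one bulk call.
# Method and field names are illustrative, not the exact Neutron
# RPC signatures.

# Before Liberty: one round trip to the server per device.
def notify_devices_old(rpc_client, devices_up, devices_down):
    for device in devices_up:
        rpc_client.update_device_up(device)    # N round trips
    for device in devices_down:
        rpc_client.update_device_down(device)

# Liberty: a single bulk call carrying everything, plus failed
# devices fields in the response so the agent can later retry just
# the devices that failed instead of refreshing everything.
def notify_devices_new(rpc_client, devices_up, devices_down):
    result = rpc_client.update_device_list(devices_up=devices_up,
                                           devices_down=devices_down)
    return (result.get('failed_devices_up', []),
            result.get('failed_devices_down', []))
```

The same idea applies to the targeted security_groups_provider_updated call: the message carries only the affected device list instead of implying a full firewall refresh.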
To reduce the overhead specifically in the communication with Open vSwitch: the OVS agent was issuing a call for every interface to get the details from OVSDB. What we did, instead of issuing one call per device, was to use a bulk call, again to reduce the overhead. And the last one is not only a performance improvement, it's also a kind of correctness improvement. Before Liberty, the OVS agent was deleting all the flows at startup, so if you had a connection, the connection basically dropped. What was done in Liberty is that when the agent adds a flow, it uses a cookie: the flow is associated with a cookie, which is basically a UUID created when the agent starts. So the agent won't delete any flows at startup. It will create the new flows associated with the new cookie, and then, once the new flows are in place, it will clean up the stale flows. It can recognize them because the cookie is different. So it will delete the stale flows, but the connection stays in place, because the new flows are already installed.
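Here is a minimal sketch of the cookie technique, assuming hypothetical bridge helper methods rather than the agent's real OVS wrappers:

```python
import uuid

# Sketch of graceful flow replacement across an agent restart.
# bridge.add_flow / bridge.dump_flows / bridge.delete_flows stand in
# for the agent's real OVS wrappers, so treat them as assumptions.

def resync_flows(bridge, desired_flows):
    # A fresh 64-bit cookie identifies every flow installed by THIS
    # run of the agent.
    agent_cookie = uuid.uuid4().int & 0xFFFFFFFFFFFFFFFF

    # 1. Install the full desired flow set tagged with the new cookie.
    #    The old flows stay in place, so traffic is never interrupted.
    for flow in desired_flows:
        bridge.add_flow(cookie=agent_cookie, **flow)

    # 2. Only then remove stale flows: anything whose cookie differs
    #    was installed by a previous run of the agent.
    for flow in bridge.dump_flows():
        if flow['cookie'] != agent_cookie:
            bridge.delete_flows(cookie=flow['cookie'])
```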
So I wanted to check if that really improved the situation. I created a very small test, especially compared to what Kevin will show later: it's just 20 VMs that are booted. These are the results before these improvements, and these are the results after. Just to show you that it worked: the minimum time is 0.6% better, the average time is 4% better, and the 95th percentile is 5.9% better. But there's still a lot of work to do, so if anybody's interested, please grab me later, because you can really help a lot. Some of the things that can be improved: we still use the command line for the OVSDB monitor, and it would be nice to use the OVS Python library. We should also create a queue of events so that we can have multiple workers, and add priorities so that higher-priority events can be processed first. More generally, we should improve state convergence, so that what's in the Neutron DB reflects what's actually on the compute host. And we can reduce the overhead in the communication with Open vSwitch even further. For example, right now, when we get the interface details, we need another call to know which bridge the interface is connected to, and this can easily be avoided by adding the bridge ID to the external IDs. And one last point that's not in the slides, it was just mentioned at the design summit: having all these bulk calls for RPC is probably not an optimal solution, because especially when the agent starts up, if you have many, many devices, you can run into timeouts. So we also need to find a better way to address that. And now Brian will go on.

Thank you, Rossella. All right, I wanted to talk a little bit about the distributed virtual router, but in order to set the context, I wanted to first show what things were like before the distributed virtual router. If you look at this diagram, there was typically a network node where a centralized L3 agent would run, and you have all these compute nodes. Any time they wanted to access the external network via their floating IPs (if you look at the blue network), they would all go out over the blue network. Eventually they would all converge on the network node, and they would basically saturate that wire, saturate the network node, such that it couldn't perform as well, because it's doing SNAT traffic. With DVR, we changed that by adding these red lines here, so that every compute node was also connected to the external network. This took a lot of the load off the network node: it let all the floating IP traffic go in and out of the compute nodes, to all the local VMs on those compute nodes. So inside a DVR compute node, in addition to the OVS agent, there's now also an L3 agent that is instantiating the Neutron routers, and a metadata agent to handle Nova metadata. This was great for the data plane.

But now let's look at the control plane, now that every one of these compute nodes has an L3 agent running. The Neutron server was sending out RPC messages on an L3 fanout queue, but initially that only went out to a small number of network nodes, one to some small number. What happened, with the distributed nature of it, was that since every one of these had an L3 agent, every one of them started getting RPC messages. This started to kill the message queue; it started to kill the network. If you look at this purple line (and we noticed this a lot in the Liberty time frame), any time a floating IP create, update, delete, or other such operation was done, we could basically kill the server. So we had to do something about this. Obviously the DVR code needed to be a little bit smarter: if it knew where the port was for the floating IP, why didn't it just send the RPC message to that one compute node? Why did it fan it out? So there was a change made, actually just a couple of weeks ago in early Mitaka, that changed the floating IP create code to do exactly that. Instead of putting the message on the L3 fanout queue, it sends it directly to that one compute node, to that one L3 agent. We also have patches in flight for the update and delete paths that hopefully will be done in the next couple of weeks. And once we knew about this issue, we needed to start looking at other places where we can apply the same methodology, for example in the port update path.
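That change follows a standard oslo.messaging pattern, replacing a fanout cast with a cast targeted at one server. A minimal sketch, with illustrative topic and method names rather than the exact Neutron ones:

```python
import oslo_messaging

# Sketch of the fanout-versus-targeted-cast change. The topic and
# method names are illustrative, not the exact Neutron ones.

def get_l3_client(transport):
    target = oslo_messaging.Target(topic='l3_agent', version='1.0')
    return oslo_messaging.RPCClient(transport, target)

def notify_fanout(client, context, fip):
    # Old behavior: broadcast to every L3 agent. With DVR that means
    # every compute node, so one floating IP operation hammers the
    # whole message queue.
    cctxt = client.prepare(fanout=True)
    cctxt.cast(context, 'floating_ip_updated', floating_ip=fip)

def notify_one_host(client, context, fip, host):
    # New behavior: we know which host the floating IP's port lives
    # on, so cast directly to the one L3 agent on that host.
    cctxt = client.prepare(server=host)
    cctxt.cast(context, 'floating_ip_updated', floating_ip=fip)
```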
The other change with DVR is that there were some suboptimal DB queries being done. For example, when there was a floating IP query, it was querying for all the ports, not just the ports that were floating IP ports. Kevin actually has this on one of his later slides, so I won't go into it in too much detail.

So, security groups. To set the background on what security groups are and how they work in Neutron: security groups are all instantiated in iptables. The way the code works is that it first calls iptables-save and pulls all of the security group rules out of the kernel. It then takes its view of the world for rules and chains, merges them together such that the table is updated, and then pushes them all back. That operation can get extremely slow when you have a large number of rule sets. So while we always thought that increasing capacity on compute nodes was great, it actually turns out to be a bad thing in cases like this, because as counts increase from, say, 50 to 200 VMs on a compute node, the number of iptables rules grows very quickly. Luckily, we noticed this early enough in the Liberty cycle that Kevin here, who does a lot of the iptables patches, was able to change the code from doing a serial search of the entire iptables rule set to doing a hashed lookup, which brought this performance issue down into a manageable space.

Additionally, we have some other changes in flight, because we still have a few issues where, again, with a large number of rule sets, the worst case can still be 30-second delays in getting rules configured. And while this might not seem like a big issue, a lot of the time it impacts booting of VMs, because DHCP packets can't get out until the security group rules are added, and so DHCP can fail. It also causes the agent to be stuck doing iptables work when it could be doing a lot of other things, and it can get backlogged. So we have a change in flight at the moment: instead of building up the entire iptables table and merging in the rules we want, we're just going to start computing deltas of those rules and push only those small deltas back into the kernel so that it can apply them. The delta could be a delete or an add of a rule or a chain. Kevin has done some work on this, and on startup time, if you restart the OVS agent, he's seen the time actually cut in half. So it's a pretty good improvement.
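As a rough illustration of the idea (not the actual Neutron code), a delta can be computed with set lookups instead of repeatedly scanning the whole rule list:

```python
# Sketch: computing an iptables rule delta with set lookups instead
# of scanning the whole rule list for every rule. The rule strings
# here are simplified; real rules carry chain names and full match
# specifications.

def compute_delta(current_rules, desired_rules):
    # Membership tests against a set are O(1) on average, so the
    # whole diff is roughly O(n) instead of O(n^2) list scanning.
    current = set(current_rules)
    desired = set(desired_rules)
    to_add = [r for r in desired_rules if r not in current]
    to_remove = [r for r in current_rules if r not in desired]
    return to_add, to_remove

# The delta can then be rendered as a short iptables-restore input
# (a few '-A' / '-D' lines) instead of rewriting every chain.
```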
And then the last bullet here: some people might wonder why I'm mentioning Nova network in a Neutron talk. At the time the Quantum project was started, that iptables manager code was the same in both projects; it was basically taken line for line. Over time we've diverged in how we work, but Nova network continues to make performance improvements, and periodically I take a look at that code, do diffs with Neutron, and find some little nuggets there that we can possibly apply. The latest one Nova network added was the concept of a dirty table. There are multiple iptables tables, and if we know that we only need to touch one or two of them, then we would only operate on those and leave the other ones alone. I talked with Kevin about this the other day, and we're thinking that if we could apply that concept to chains in Neutron, it could reduce the amount of work we're doing from what it is right now, which is O(n), down to below that. Because if we know we have just a very small thing to work on, we don't have to look at the rest; we can just ignore it.

So there is some future work here. The first obvious one is that the Open vSwitch agent is going to have security group conntrack support, and when we're able to get that work in, and able to run later kernels in the gate, that should improve OVS security group performance significantly. But we still have to worry about iptables, because the Linux bridge agent is still going to be using iptables. So it might be time to start looking at alternatives like nftables, which is the follow-on to iptables and is eventually going to replace it. We don't know yet; I think it's something we'll need to talk about.

One of the other things we're starting to do is profile the code. IBM has started profiling the L3 agent to try to find places where we're spending much too much time in certain areas of the code. They found a possible issue in the plug code in the drivers, and they've also started noticing that calls to /sbin/ip are taking much too long. There have been a couple of patches proposed in the past, around privsep (privilege separation), that started using a Python library instead of shelling out to list the interfaces, using the netlink library basically right from Python. I think that found about a 4x improvement in some of these commands. Even though it's maybe a second, you're getting that second back. So I think trying to revive those patches and doing some more performance testing on them would be great.
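For a flavor of the difference, here is a small sketch comparing a shell-out to ip with the pyroute2 netlink library (assuming pyroute2 stands in for whatever bindings those patches actually used):

```python
import subprocess

from pyroute2 import IPRoute

# Old style: fork a shell, run /sbin/ip, parse its text output.
def list_interfaces_shell():
    out = subprocess.check_output(['ip', '-o', 'link', 'show'])
    return [line.split(':')[1].strip()
            for line in out.decode().splitlines()]

# Netlink style: talk to the kernel directly from Python.
# No fork/exec and no text parsing, typically several times faster.
def list_interfaces_netlink():
    with IPRoute() as ipr:
        return [link.get_attr('IFLA_IFNAME')
                for link in ipr.get_links()]
```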
The other thing, since I mentioned DVR at the beginning: we've noticed a lot of both failures and performance issues with DVR through the Liberty cycle, and we decided yesterday that we're going to start doing a weekly meeting, just to cover DVR performance, scaling, and bugs, starting next week. So if any of you are interested, I'll be sending an email to the list so that we can start doing that. I think the meeting time is 1500 UTC. And with that, I hand the baton to Kevin.

Okay. So when we started talking about giving this presentation, what I wanted to do was measure how the performance of our RPC APIs has been changing as we've been making changes throughout the cycles. The problem we have is that Rally is really good at measuring HTTP API performance; but compared to how many calls are made to Neutron from an API perspective, the agents make far more calls than any HTTP client like Nova, or tooling that users use to make calls to the server directly. One security group update can result in 50 agents asking the server for information: that's 50 calls to the RPC API versus the one that went to the HTTP API. So we have this issue where the performance of our RPC APIs isn't really visible to upstream developers while we're developing, because if there's an operation that's slow, with only a few ports (which is what developers are usually working with) it will still return fast enough that a performance degradation will go unnoticed.

So what I wanted to do was see if I could get Rally to measure the RPC performance that we have. What I initially started to do was try to make a Rally scenario that embedded the Neutron agent components in it directly. But this led to a lot of brittleness: we would refactor, those components would move around, and that would break things. They would also have conflicting Oslo configurations; Rally has its whole configuration set, and the agent has its own configuration set, and those would conflict. So I added an extra layer in between, where we just have a small process running that runs the Neutron agent components, the RPC components that make calls to the Neutron server, and then exposes those over a little HTTP interface using the bottle library. Then I can point the Rally scenarios at it and just do single HTTP requests that end up measuring the performance of our Neutron server RPC APIs.
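A minimal sketch of that translator shim, with a fake RPC object standing in for the real Neutron agent RPC classes:

```python
import json
import time

from bottle import route, run

class FakeRpcApi:
    """Stand-in for the Neutron agent's RPC client (hypothetical);
    the real shim wires up the actual agent RPC classes here."""
    def get_routers(self, router_ids=None):
        return []

rpc_api = FakeRpcApi()

@route('/get_routers')
def get_routers():
    # Rally times this HTTP request end to end, so what it records is
    # dominated by the server-side handling of the underlying RPC.
    start = time.time()
    routers = rpc_api.get_routers()
    return json.dumps({'elapsed': time.time() - start,
                       'count': len(routers)})

run(host='127.0.0.1', port=8080)
```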
For the test scenario data: because we noticed that some of our issues weren't starting to show up until we had a lot of scale, I created a whole bunch of stuff in the DB to begin with. So: 1,000 networks, almost 2,000 subnets, 1,000 security groups, 14,000 ports, 27,000 security group rules, 950 routers, and 4,000 floating IPs. A lot of these numbers weren't too specific; I just wanted to create big numbers, and I stopped when I got bored waiting. Then the RPC calls retrieve information for 400 of the ports and 50 of the routers. This would be kind of a heavily loaded agent in a deployment you might see now.

So I wanted to measure how this changed from the start of Kilo development all the way to the end of the Liberty release. I had a script that would check out a commit, load that DB test data, run the DB migrations to get the schema to the right position for that state of the Neutron code at the time, start up the Neutron server, run the Rally scenarios against it, and then start the process over again with the next commit in line, so I could get performance data for every single commit throughout the Neutron development cycle.

The first one I'm looking at is the L3 agent's get_routers call. This is the main call the L3 agent uses to retrieve all the information it needs from the Neutron server. This includes router information like extra routes, the gateway interface, the interfaces to its local subnets, floating IPs, and internal interfaces for DVR. So here's the get_routers performance from the start of Kilo all the way down to the Liberty release. I don't know if you can see the small dotted line right in the middle: that was our Kilo release, and the small dotted line right at the end was the Liberty release; it'll be the same on the next graphs. This first drop (as this goes down it's better, because this is the time it takes to fulfill one RPC call) took us from 12 seconds down to, I think, about nine seconds. This was a patch that added a DB relationship directly between a Neutron router and its interfaces. Before, this had to be inferred by looking up all the ports that had the same device ID as the router, so the change resulted in a performance improvement. Hierarchical port binding, which didn't really have anything to do with routers at all, for some reason gained us some performance here: it changed some of the internals of the port retrieval on the ML2 side, so we got a performance bump sort of for free, without even trying. Here, this little drop down below eight seconds was when we changed the query that gets the SNAT interfaces for a router into one bulk query. Before, if you had 50 routers, we were doing 50 database lookups; this bulked it up into one lookup.

And before I talk about this one: this little bouncy thing here, which you'll see on all the graphs, was the HP cloud being unstable. It didn't have anything to do with Neutron. So here we had a really big performance regression: this "add auto-address subnets to port lookup" change. It was part of a DNS feature, to match a funny thing we were doing on the dnsmasq side. But what happened was that it caused a get-subnet call for every port whenever a port lookup was being done. A get_routers call asking for 50 routers doesn't particularly care about the subnet information, it just needs all the port information, and this was resulting in a ton of subnet lookups. Combined with another bug, it actually caused a really big performance drop; I'll show on the next slide how bad it was. And then here it dropped down to about six seconds (you can't see it, but there's actually a distinct point) by only fetching the floating IP ports that were relevant to the host. That was the query Brian mentioned. So that brought it back down to six seconds. And then this last change, which brought it down to, I think, just under two seconds now, was removing that auto-address subnet lookup that had been added to mimic what was actually a bug.

So this is how bad that auto-address subnet lookup regression was. It caused us to go from eight seconds to about 270 seconds to fulfill one get_routers call to the server. And part of that was because of this query we have here. It was hard: we had a whole bunch of people staring at it, and nobody noticed a difference. The actual problem was that when we were trying to find all the router ports related to a router that was passed in, we were issuing this query, but the join condition was wrong. So what was happening was that it was returning every router interface back from the DB. With that initial data, 950 routers each attached to two subnets or something like that, it was returning some 3,000 ports. And when you combine that with a subnet lookup for every port, because of that other bug, we ended up with that huge spike going up.
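To make both of those concrete, here is a hedged SQLAlchemy sketch (the model and column names are made up for the example) showing the per-router loop versus the bulk IN-clause query, with a note on the join bug:

```python
from sqlalchemy import Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Port(Base):
    """Illustrative stand-in for the Neutron port model."""
    __tablename__ = 'ports'
    id = Column(String(36), primary_key=True)
    device_id = Column(String(36))  # the owning router's id, for router ports

# Per-router lookups: 50 routers means 50 round trips to the database.
def router_ports_slow(session, router_ids):
    ports = []
    for rid in router_ids:
        ports.extend(
            session.query(Port).filter(Port.device_id == rid).all())
    return ports

# Bulk lookup: 50 routers means one query with an IN clause.
def router_ports_bulk(session, router_ids):
    return session.query(Port).filter(
        Port.device_id.in_(router_ids)).all()

# The 270-second regression came from a join whose ON clause failed
# to constrain the router id (conceptually, a condition that was
# always true), so the query returned every router interface in the
# database instead of only the ports of the routers passed in.
```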
So the next one I want to talk about is the L2 agent's get_devices_details_list call. From the L2 agent's perspective, this gets all the information necessary to wire up a port and connect it to the network. It has all the encapsulation info, so if it's VXLAN, knowing what VXLAN tunnels are set up, or VLAN tags, that kind of stuff. It also has extension info for quality of service, and for the port security extension, to see if it should disable port security. And the network, obviously, so it can group all the ports from VMs on the same network. This one, unfortunately, is one that got worse over time, and it kind of illustrates why we need to add tooling that automates this process, so we can prevent these regressions.

The funny thing is that this first jump was from a commit that was refactoring a call to avoid issuing unnecessary database queries. So this was a case where a change was submitted to improve the performance of a completely unrelated call doing a get-subnet, and it resulted in a preemptive database join that ended up slowing down this code path while improving another one somewhere else. This time, hierarchical port binding, the driver changes, resulted in a jump from 12 seconds up to 13 seconds, and the database changes for it also resulted in a jump. With hierarchical port binding, every port lookup in ML2 now also implies lookups of network segment information and subnet information, and the L2 agent is getting all of this back even though it doesn't necessarily need it. Here we can see someone came through and did a "reduce DB calls in get_devices_details_list" change, which brought it back down to about where it was before the hierarchical port binding DB changes. Then one more change avoided eagerly loading the IP allocation ranges from the allocation pools, but that still didn't get us back down to where we were way back at the start of Kilo, and this is when Kilo was released. So Kilo released with something like a 20% performance regression on this particular RPC call.

And these two jumps here: this one was my fault, from the network RBAC work I was doing. It resulted in a new lookup against another table, so that was a two-second jump. This is something we'll have to come back and optimize, because from the agent's perspective it doesn't care about role-based access control at all. Quality of service, because it added more information that the agents need, resulted in another bump up, another second it takes to fulfill this call, which retrieves 400 devices. And the last one was another change to improve the performance of an RBAC query that ended up impacting this one. That went in right after the Liberty release, so Liberty released with another 20% regression on this particular RPC call.

The L2 agent's security_group_info_for_devices call is pretty easy: it just gets all the security group information for a given list of port IDs. This includes all the rules for each security group the port is a member of, and then all the member IPs of each security group, so if someone has a rule that says "allow everybody from this security group", the agent gets a list of IPs that need to be converted into iptables rules. This one's pretty easy. We started optimizing this back in Juno quite a bit, so a lot of the optimizations were almost completely done by the time Kilo opened for development. We just had one patch that went in that improved the performance of this quite a bit: a batched DB lookup of security group info, which turned a whole bunch of database queries into one big one. It's pretty much stayed the same since.

So, to prevent regressions in the future, what we want to do is figure out how to add native RPC support to Rally, or launch the little translator tool I had in the Rally gate job, and then add Rally SLAs: just say things like "50 ports should always come back in less than two seconds". Without consistent hardware we can't make the SLAs too strict, so we might not be able to catch 10% bumps, but we can easily block the ones that double or triple the amount of time an operation is supposed to take. And then what we can do for the smaller changes is have a periodic task that runs on the same type of hardware all the time, collect the historical data, and watch the trends: look at it every couple of weeks and ask whether the performance of these calls has been getting worse over time.
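Rally expresses that kind of check as an SLA section in the task definition. A sketch of what it could look like for the shim-based scenario, shown as the equivalent Python dict (the scenario name is hypothetical; max_seconds_per_iteration and failure_rate are standard Rally SLA keys):

```python
# Rally task definition, shown as the equivalent Python dict.
# "NeutronRPCShim.get_routers" is a hypothetical scenario name for
# the HTTP translator described above.
task = {
    "NeutronRPCShim.get_routers": [{
        "runner": {"type": "constant", "times": 50, "concurrency": 2},
        "sla": {
            # Fail the job if any iteration exceeds two seconds:
            # loose enough to tolerate noisy hardware, tight enough
            # to catch a call whose latency doubles or triples.
            "max_seconds_per_iteration": 2.0,
            "failure_rate": {"max": 0},
        },
    }],
}
```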
So that's it. Any questions? Okay. Ah, yes, can you come to the mic, just so we can hear you?

The work you've done with the performance, I think that's really great, and I'm just wondering: is there any plan to basically capture that in Neutron and use it to monitor the fixes coming in, use it as a guideline, so that a fix gets flagged if it passes a certain threshold that makes the performance go down?

Yeah, that's what I meant on this last slide with the Rally SLAs. Rally actually makes it very easy to do. Once you get something into a Rally job, you can just say "this should take no longer than this amount of time to complete", and if it takes longer than that, fail the job. So it would just be like any other failure, like a unit test failure or a functional failure: Jenkins would come back and say, hey, this Rally job failed, and then you could click on it and see that there was a big regression in this particular call.

So that would be a requirement for any fixes coming into Quantum?

That's what I want to add. It's not in there yet, but I have patches up to start.

Okay, great, thank you.

A lot of improvements have been shown by all the speakers, but there was no explicit time reference. For example, as an operator, I have seen that a lot of people are still running OpenStack on Juno, and they still have to upgrade to Kilo. But all these improvements that you have shown, do they refer to an upgrade path from Kilo to Liberty?

Yeah, I see. No, the way these graphs were generated was just from the master branch, checking out way back and then basically replaying history as we merged changes. So this isn't representative of what has been backported, so it won't be the exact performance. Whenever we have a big performance improvement, we'll usually backport it as well, so someone upgrading from Juno to Kilo might actually have better performance than what I show here for the Kilo release. This is just how the performance was right as Kilo was released; it doesn't include any backport information.

Okay, thank you.

Yeah, I know a lot of the iptables changes, I think, were backported to Kilo.

Right, late in the cycle, because we noticed them on some of our performance runs.

It would be nice, for example, since we also run Rally tests to check the shape of our infrastructure, to have some reference numbers, so that people who run Rally consistently on their infrastructure can see if they get values that are more or less in the shape they're supposed to be. That would be nice to advertise to operators.

Right, and that's another reason I want to get this stuff upstream: so other people can test it on their infrastructure and see if there are deployment cases that we're not considering correctly in our performance work.

Thank you.

Thank you. Thank you.