Hi guys, I'm John Schwartz — that's me, the last one on the slide. This is Ann Taraday; she works at Mirantis. Kevin Benton also works at Mirantis. I work at Red Hat — it's a small company, you don't know it. Today we're going to talk about the race conditions of Neutron's L3 HA solution, specifically the scheduler, under scale and performance testing. That's a lot of subject to get through in 45 minutes, so please bear with me. The talking points: we're going to cover some HA routers 101 — what are HA routers, and what kind of resources do we need in order to represent an HA router, both physically (namespaces, devices) and, more specifically, as database structures on the server side. Then we're going to continue on to some race conditions that we have encountered in Neutron, continue on to scale testing that we did on both Liberty and Mitaka, and finish up with some future work.

So what does it even mean to schedule an L3 router? Basically, scheduling means binding resources — in our case a router — to one of the network nodes that we have. In this example we have three network nodes (or more), we have created a router, and now we've gotten an API request to add a router interface: connect the router to some internal tenant subnet. This requires an actual binding, an actual representation somewhere on the network nodes, to let everyone know: hey, this is the route through which the packets should flow. So binding a router is basically scheduling it and letting everyone know where it's at. Once we have the request to add a router interface, which triggers scheduling, the scheduler runs some kind of algorithm — either the least-used network node (the least-used L3 agent) should pick this up, or a random choice among all the valid network nodes. In this case the scheduler decided to bind the router to the second network node. So that's basically the scheduler and what it's supposed to do.
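To make the "least used" idea concrete, here is a minimal sketch of that scheduling policy. All names and the in-memory structures are purely illustrative — Neutron's real scheduler works against the database, not dicts:

```python
# Illustrative "least used" scheduling: bind the router to the L3 agent
# that currently hosts the fewest routers. Names are invented for the
# example; Neutron's actual scheduler queries the database instead.
def schedule_router(router_id, agents, bindings):
    """agents: list of agent names; bindings: dict router_id -> agent."""
    load = {agent: 0 for agent in agents}
    for hosting_agent in bindings.values():
        if hosting_agent in load:
            load[hosting_agent] += 1
    # pick the least loaded agent (ties broken by list order)
    chosen = min(agents, key=lambda a: load[a])
    bindings[router_id] = chosen
    return chosen

bindings = {"router-1": "net-node-2"}
agents = ["net-node-1", "net-node-2", "net-node-3"]
print(schedule_router("router-2", agents, bindings))  # net-node-1
```

The "random choice" alternative mentioned above would simply replace the `min(...)` line with a `random.choice(agents)` over the valid candidates.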
The legacy router — that is, the non-HA case, either non-HA legacy or non-HA DVR — is simple: we have one router, we need to choose one agent among all the possible network nodes, and we just choose it and create the binding. Once it is created, the server lets the network node — the L3 agent — know: hey, this is a new router, please handle it. The L3 HA case is a bit more complicated, because we need to choose a bunch of network nodes that will handle it. In this example node 3, node 4 and node 5 are all network nodes, the router is scheduled to all of them, and it is active only on node 4. What does it mean that it's active? Active as in all the data flows through it, and the standby (passive) nodes are just waiting until node 4 fails or crashes or something happens to it, at which point they can fail over and take over the responsibility. This algorithm — who should be the active one, who should be the passive ones — is handled solely by keepalived, and keepalived uses the VRRP protocol. It's a non-OpenStack, open source project; you can look it up. The keepalived instances all sit inside HA networks that are unique per tenant, which basically means that if tenant A has two routers, tenant A will have a separate HA network from tenant B (the red one). This is apparent here: we have the blue HA network and the red HA network, and all the connectivity between keepalived instances of a given tenant's routers goes through that tenant's HA network. So basically we have a keepalived instance per router per agent: if router 1 is on L3 agent 1 and also on L3 agent 2, both agents will run a keepalived for it. keepalived uses a different VRID (virtual router ID) per router so they won't collide: the keepalived serving router 1 won't think that something happened to router 1 on the other network node in case router 2 fails for some reason.
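For a feel of what keepalived manages per router, here is a hypothetical, heavily simplified VRRP instance definition in keepalived's configuration syntax. The interface names, addresses and values are invented for illustration and are not what Neutron actually renders:

```
vrrp_instance VR_1 {
    state BACKUP              # every instance starts as backup; VRRP elects the master
    interface ha-1234abcd     # hypothetical HA port device inside the router namespace
    virtual_router_id 1       # the VRID: must not collide with other routers
    priority 50
    nopreempt
    virtual_ipaddress {
        10.0.0.1/24 dev qr-deadbeef   # the router's IP follows whichever node is active
    }
}
```

The key point for this talk is the combination of `interface` (the HA port on the per-tenant HA network) and `virtual_router_id` (the VRID) — together they give the two levels of separation described next.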
So the separation is twofold: HA networks for different tenants, and different VRIDs for different routers. That's basically all you need to know here. On the database layer this requires a lot more resources. The first resource is the router L3 agent binding: a one-to-one matching between a router and an agent, which basically says "this is the router, this is the agent it is assigned to". In the non-HA case we have a single such binding; in the HA case we have one for each agent the router is scheduled to. On top of that, we said we have a keepalived used by the HA routers, the keepaliveds sit in HA networks, and they use different VRIDs. All of these require database resources: we have the L3 HA router VRID allocation, which basically records "this is the VRID for this router". We also have the L3 HA router network, which comes in addition to the normal network we all know and love in the Neutron database, and this one holds the HA ports — because each keepalived sits on an agent and needs to be connected to some HA network, so we use HA ports, actual ports that connect the keepalived to the HA network. Next up we have the L3 HA router agent port binding — that's a really long name — which represents a triplet of a router, an agent, and an HA port: this is the router, it sits on this agent, and this is the port that connects the keepalived on that agent to the HA network. When creating an HA router this is roughly the pseudo code.
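The pseudo code can be sketched as the following runnable reconstruction, with plain dicts standing in for the database tables just described. Every name, the agent list and the two-agent scheduling policy are illustrative assumptions, not Neutron's actual code:

```python
# Illustrative reconstruction of the HA router creation flow, with dicts
# standing in for the database tables (HA networks, HA port bindings,
# VRID allocations, router <-> L3 agent bindings).
db = {"ha_networks": {}, "port_bindings": [], "vrids": {}, "agent_bindings": []}
AGENTS = ["agent-1", "agent-2", "agent-3"]
MAX_L3_AGENTS = 2  # how many agents each HA router is scheduled to

def create_ha_router(tenant_id, router_id):
    # 1. create the tenant's HA network if one doesn't exist yet
    db["ha_networks"].setdefault(tenant_id, "ha-net-%s" % tenant_id)
    # 2. create unbound HA port bindings, one per expected agent
    for i in range(MAX_L3_AGENTS):
        db["port_bindings"].append(
            {"router": router_id,
             "port": "ha-port-%s-%d" % (router_id, i),
             "agent": None})
    # 3. allocate a VRID for the router (1..255)
    used = set(db["vrids"].values())
    db["vrids"][router_id] = next(v for v in range(1, 256) if v not in used)
    # 4. schedule: determine which agents will host the router
    chosen = AGENTS[:MAX_L3_AGENTS]
    # 5. assign the unbound HA ports to the chosen agents
    unbound = [b for b in db["port_bindings"]
               if b["router"] == router_id and b["agent"] is None]
    for binding, agent in zip(unbound, chosen):
        binding["agent"] = agent
        # 6. finally, create the router <-> L3 agent binding itself
        db["agent_bindings"].append({"router": router_id, "agent": agent})
    return chosen

create_ha_router("tenant-a", "router-1")
```

Note how step 6 — the binding that actually schedules the router — comes last; the race conditions discussed next live in the gaps between these steps.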
We create the router database object as usual, and then next up is creating the HA network if one doesn't exist — if one exists we don't need to create another one. After we have created the HA network, we create some unbound HA interfaces: unbound L3 HA router agent port bindings, one per expected agent. So we create the bindings and the ports, we just don't assign them to an actual agent yet; that comes later (don't ask me why it's that way, we couldn't figure it out either). Next up we allocate the VRID, one per router, and then the scheduling comes along: we determine which agents will host the router (there's a real long algorithm that does it — it's like five lines of code), we assign the HA interfaces to agents — now that we know which agents will get the bindings, we can take the HA ports and assign them to this and that agent — and finally we can create the router L3 agent binding, which actually represents the binding.

An example race condition is this. The starting point is that we have some old HA router, and user 2 wants to delete it because they don't want it anymore, while user 1 wants to create a new router, also an HA router. The interleaving is decided by the CPU scheduler of the machine (or multiple machines); this is the timing that happens. First, user 2's flow deletes the router: you wanted to delete it, go ahead, we don't need it in the database anymore. Next we say: OK, this is basically the tenant's last HA router, the tenant doesn't have any more HA routers, so let's delete the HA network. We have made the decision to delete the HA network, but this isn't committed until later, as can be seen by the dotted line. Then the CPU scheduler says, OK, let's give user 1 some runtime: we create the new router, the router we just wanted to create, and then the next step is checking
if an HA network exists: if it doesn't exist we create it, but if it already exists we don't need to. In this case the deletion of the HA network wasn't committed yet, so it's still there, so we don't create one. The other workflow then continues into committing the actual deletion of the HA network, and then, once we try to create the HA port on the specific HA network that we detected earlier, we blow up, because the HA network was deleted from under us. So this is one example race condition; there are a few others, and I'll let Kevin go through them.

So the obvious question that comes up when you have these big, complex operations is: why can't we just put this all inside of a database transaction? We start a transaction, create an HA network, create all the L3 HA ports (depending on the number of L3 agents you have set up for HA), create the tenant-to-HA-network binding, then call commit on the database. If there are any issues — duplicate entries, that kind of stuff — you just have one exception to handle and you can start the whole process over. That would be problem solved. The issue comes when you look at the internals of how Neutron resources work. We have networks and ports and routers, and these are all distinct resources that have separate transactional semantics in their creation, and they can correspond to external resources on other systems. So if we put everything inside one giant database transaction, and we go through create a port, create a network, create a router, and then encounter an exception, the database will roll back — but if we've been communicating with external systems, we'll leave a bunch of orphan resources on them. This example of just creating a network and a port shows that there can be up to three different external systems with even just one ML2 backend configured: creating a network needs to talk to the Neutron
database, then go off to the ML2 backend; then, coming in to create the port, we allocate an IP from an external IP allocation management system, go back to the database, go off to the ML2 backend when it comes time to bind the port, and then go back to the Neutron database. These are all separate transactions that we can't roll up into one giant database transaction, otherwise we leave a bunch of stuff left over in ML2 backends or external IP allocation systems.

The next question that comes up is: well, can we use locks? This would be an easy solution — just put a giant lock around every router operation to make sure we don't have a delete happening at the same time as a create when it comes to the HA network for a tenant. There are thread and file locks, DB locks, and then a distributed lock manager — Tooz, in the case we looked at — and I'll talk about each one. Thread locks and file locks are just based on exclusive access that a process has inside of a server: either some internal exclusive resource like a thread mutex, or a file opened for exclusive access on the file system. But this can only prevent concurrent calls within the scope of a single Neutron server, and Neutron supports running a whole bunch of Neutron servers all in active mode — you have a load balancer sending requests across a bunch of servers — so thread locks and file locks won't do anything to help you, because they only protect an individual server. Then you can do a kind of fake lock with the database. You can use transaction-level locks, which would be OK for a single transaction, but in this case we're doing multiple transactions, so SELECT FOR UPDATE doesn't work — that's scoped to a specific transaction, and once you commit, the lock is released. On top of that, if we try to use a separate side transaction to hold a lock for the entire duration of other concurrent transactions, this falls apart when you're using multi-writer Galera clusters. So if you're doing HA
availability for your MySQL backend, then this doesn't block concurrent lock access; it throws an exception at the end, at which point it's too late and it didn't actually protect you from concurrent access. The other option is a basic reservation table, where you insert a record into the database that says "hey, I'm doing an operation for this tenant", it has a unique constraint, and nobody else is allowed to insert a record while they see it there. But then you have to deal with stale records: if a server acquired a lock by inserting a record and then died, all the other servers will just sit there waiting for that server to get rid of it. So you have to come up with some kind of hacked-together timeout mechanism — "well, maybe after 30 seconds it died and I'll just try to take the lock anyway" — but that's not really a good locking mechanism. That brings us to using a distributed lock manager, which is designed to solve this problem, and there's Tooz, which is part of the OpenStack big tent: a library that provides distributed coordination primitives. It has leader election, grouping and, most importantly for our use case, locking, and it has a driver architecture to use different backends for storing the locks and coordination primitives — it can work with ZooKeeper, Redis, memcached, and several other backends that you can find on its site. This would solve the problem we're running into here, but it has a couple of downsides. The first one is that there's now a constant overhead of lock acquisition across multiple servers: whenever we want to do a router operation on the Neutron server, we have to talk to Tooz, which uses a driver to contact another server or service, and if it's in an HA mode then they have to go through some kind of election algorithm to acquire the lock across the Tooz backend. And this isn't a huge deal — we
could maybe deal with that, because a tenant creating and deleting routers isn't something that happens very frequently. But the big downside we ran into is that this would be a new operator dependency for the Tooz backend. When we were originally looking at this, there weren't any OpenStack core projects that had shipped Tooz as a requirement yet, and we didn't want to be the first one to force the adoption of Tooz on all the operators. There's a mailing list thread here where you can see the discussion about deciding whether or not to use Tooz for the distributed lock manager.

So the solutions we came up with for the race conditions: use the database as much as possible — model constraints, model things as unique constraints whenever possible. For example, L3 agent plus router ID: there should only be one binding of a router to a single agent, because it doesn't make sense for an agent to host the same router multiple times. That was one example. Utilize the transactional guarantees offered by the database where we can: if it's not a multi-transaction thing, we can look to see if there's a binding; if there isn't, we can safely try to create one and commit, then catch any exceptions that are based on stale data or a DB duplicate entry if there is a race with another writer, and wrap this stuff up in retry loops — I hit some kind of transient error, start over, try again; either there will already be something there and we have a code path to return early, or the other thread encountered an error and we can safely create again. So the approach broadly looks like this. Does the resource, and the binding for it, already exist — for example, does the HA network exist, is there a binding for it? Use it. That's easy; that's the first return. If it doesn't exist, we go on, create the resource, and try to create the binding, which has a unique constraint for the tenant. If
we get a duplicate error, it means somebody else already beat us to it — there's already an HA network for that tenant — so we delete the resource we created and trigger a retry, at which point we go right back up to the start and hit the early exit of "use it". Same thing with the binding: if we think the resource exists, we'll try to create the binding under the unique constraint, and if we get a DB reference error it means somebody deleted the network out from under us, so we just loop back and start over at the create-resource phase. This essentially solved the problem. It's kind of a complex state machine to go through, but it'll retry up to 10 times, and it effectively solved the issue of racing on the tenant HA network management.

One of the other issues we had is that the L3 agent would periodically call the server to sync all the routers running on it, and it could catch an HA router right after the database object for it had been created, but still missing important HA things like the HA network or the L3 agent HA interfaces, because it was still being built on the server side. So we decided to use the status field to flag partially formed routers, so the agent could say "hey, this one's still being built, just skip over it" and get it on the next sync. The existing status values didn't map really well to this intermediary state — the server still in the process of building the resource, because it's a composite resource made of different individual resources — so we added a new status value called ALLOCATING. Now, when you first create an HA router, you'll see an ALLOCATING status, and the L3 agent won't receive any routers in this state. Once the server is done setting up the HA network, all the L3 HA ports, that kind of stuff, it'll flip it to the ACTIVE state and the agent can get it. There's a
downside to this approach: if a server dies or encounters an error while the router is in the ALLOCATING state, we can end up with a router stuck in that state, and it won't get scheduled to an agent. There's a fix for that that recently emerged, which John will talk about later.

So what was fixed in Mitaka? We fixed most of these server-side races with API requests. If you're running Rally tests hammering the server with lots of create/delete of HA routers, most of those issues have been solved in Mitaka, and a lot of them were backported to Liberty too. And then races triggered by concurrent agent operations — the agent syncing with the server at the same time and crashing when it would get an HA router that was halfway formed — those have all been fixed in Mitaka as well. So now I'll let Ann talk about testing for this.

OK, so when we found out about these errors that John and Kevin described, we decided that we needed to find out how they look in a production-like environment, so we ran some tests in the scale labs for Liberty and Mitaka. These results are published in the performance docs — the links will be at the end of the presentation — and here I will share the most important results from there. Our testing was performed on a lab that consists of three controllers and 45 compute nodes; the server hardware of each node is presented on the slide. What was actually tested? Here we speak about races, but other aspects of L3 HA were also checked. At first we check races: the best tool to trigger as many races as possible is Rally, a benchmarking tool for OpenStack that allows us to create a huge amount of resources concurrently. Another thing is that L3 HA is really about lower failover time, so we want to check and verify how, for example, a reboot impacts connectivity for a VM or a pair of VMs — manual tests can help
to measure these and to see how starting or stopping an L3 agent impacts connectivity. For a large number of VMs we used Shaker tests; Shaker is a distributed data-plane testing tool for OpenStack. I want to highlight here that stopping or starting an L3 agent won't cause connection interruption as long as the router namespaces aren't cleaned up as well.

So at first let me share the Liberty testing results, which we got on an environment deployed with Mirantis OpenStack 8.0. First, the Rally tests. Rally gives us the ability to test OpenStack's ability to perform simple operations: create/delete, create/update, create/list. For us the most important test was create-and-delete routers, and on this slide there is an example of the create_and_delete_routers YAML. This scenario creates a network with a given number of subnets and routers and then deletes all the routers. The scenario has "times" and "concurrency" parameters: "times" means how many times the scenario will be executed, and "concurrency" allows us to execute the scenario with a given number of concurrent threads. To get as many race conditions as possible around deletion and creation of HA routers, HA networks and HA interfaces, we ran this test with different "times" and "concurrency" parameters. L3 HA has a restriction of 255 HA ports per HA network per tenant, and if this limit is exceeded a new HA network is not created, so we had some tests that failed due to misconfiguration — shown in the yellow bars — because the number of tenants wasn't enough for the number of routers that were going to be created. With an increased number of tenants this test passed without those errors. Here are some more numbers on issues that we faced: the subnet-in-use error is about the race with the HA
network management, and the session rollback error happens because of the race between the L3 agent full sync and the creation of an HA router.

Now let's take a look at the data-plane side of L3 HA. The idea of the manual tests is very simple: first we boot a VM or a pair of VMs, then we start a connectivity check, then we perform some destructive action and check the packet loss. Then we perform the test again with an increased number of routers; the number of routers is increased from test to test to increase the load on the L3 agents. The most important manual test is a ping to an external network from the VM during reset of the controller with the active L3 agent. On the left there is the network scheme for this test, and on the right there is a graph of packet loss as a function of the number of routers. Rescheduling time wasn't stable above 175 routers; the problem with this issue is that it wasn't reproduced on virtual environments at all.

Now let's take a look at how things go with a large number of VMs, and first let me describe how Shaker works. Shaker deploys instances on different compute nodes and executes connectivity checks between them. Shaker contains server and agent modules: the server is responsible for deploying instances, executing tests, processing results and report generation; the agents are lightweight, run inside the VMs, pull tasks from the server and reply with results. The topology is deployed using Heat templates. We can see that there are agents in slave mode and in master mode; slave mode means those agents are used as backends for the corresponding commands — for example, they run iperf in server mode. Once all the agents reply to the server, the test execution is considered finished, and if some of the agents didn't make it within the dedicated time, such
results are marked as lost. I will share the basic Shaker scenarios here; the full set of scenarios can be found in the performance docs. The L3 east-west test measures the bandwidth between VMs located in different networks but plugged into the same router. The test starts with one pair of VMs on different compute nodes and increases the load until all available compute nodes are used. During execution of the Shaker tests we ran scripts that stopped the active L3 agent and cleaned up its router namespaces; then, when another agent became active, the first agent was started again. On the graph on the left we can see upload and download bandwidth with and without restarting the L3 agent. The same test was performed for ordinary router scheduling (L3 HA turned off), and in the table on the right you can see how many iterations ended with errors or were lost — for ordinary router scheduling a lot of results were lost or ended with errors. The L3 north-south test checks the bandwidth between VMs allocated in different networks: the instances with the master agents are allocated in one network, and the instances with the slave agents are reached via floating IPs. The test again starts with one pair of VMs and grows until all available compute nodes are used, the testing was performed in the same way, and we got similar results. So these tests show us the benefits of L3 HA in comparison with ordinary router scheduling.

And now let's take a look at the Mitaka testing results, which we got on an environment based on Mirantis OpenStack 9.0. The Rally tests were run in the same way and actually hit the same misconfiguration at first, and here are some more numbers: we can see that the subnet-in-use errors still persist, and while we fixed other errors, we got a new error — the server didn't respond in time — for create-and-delete and
create-and-update, and we suspect that this actually happens because of routers stuck in the ALLOCATING state. Checking ping to an external network from the VM during reset of a controller helped us to catch and investigate a bug with two masters after the reboot of a controller. On the left there is the same network scheme and the graph with the dependency; this problem was fixed during Newton, and on the right we can see that during reboot of the node a lot of packets were lost. Here are the Shaker tests, which were also performed with and without restart of the L3 agent, and we got similar results for L3 east-west and L3 north-south. Summarizing the Mitaka results: we see that although some things were fixed, we have new issues, and now John will describe how we managed to fix them. Thank you.

So during the testing Ann did, we found a specific bug — this is the bug number you can see — where, after rebooting a controller, there were two master nodes. We fixed that, and we fixed other bugs as well, which can all be divided into the three main types of races that Kevin talked about: concurrent API requests (create and delete a router concurrently), sending incomplete or partial data to the L3 agent, or the L3 agent issuing some kind of request that will eventually cause problems in the scheduler, like we saw before. A timeline of the bugs and the patches that fixed most of them can be found in the Etherpad. Very close to the end of Newton, at the mid-cycle in Ireland, we sat down and figured out that we can add a binding index to the router L3 agent binding. Basically, now every time we create a router L3 agent binding — the object that matches a router to a single agent — we also add a binding index, which is a field that is unique per router. Which means that when I create a binding for some router to some agent, I need to
use some kind of binding index, which means that if two different workers or two different schedulers try to create the same entry using the same binding index, one will fail, because the binding index is unique. In the non-HA case we only allow a binding index with the value of one, and in the HA case we use binding indices between one and the maximum number of L3 agents that we will schedule to. This basically prevents over-scheduling an HA router, which might happen in concurrent scheduling, because now, if two schedulers try to bind an HA router to two different agents, both will try to use binding index one and one of them will fail. This also allows the removal of the ALLOCATING status. Looking forward, what we are about to do is refactor the L3 scheduler to change the ordering: instead of creating all the HA resources and then creating the router L3 agent binding, we can first create the router L3 agent binding, which is now considered safe, and then, after we know that the assignment was successful, continue on to create all the other HA router resources. This will let us remove the ALLOCATING status, because we no longer need it — everything is safe now. Even further in the future, we are planning to use push notifications, which will simplify the communication between the server and the agent even more. One thing we want to emphasize is that the current code is safe to use; these are looking-forward improvements to the code, not bug fixes. So please use L3 HA — it's awesome. Some links: the links for Ann's benchmarks and scale tests can be found here, and we have time for a few questions. If you want, go back to the links page so anyone who wants can take a
picture of it. Yeah, if you want to take pictures — and questions. Any questions?

Did you guys ever stop for a moment to think this solution is way too complex — could we solve this problem differently? I mean, what you're trying to do is get packets from one compute node to another, right? There is no routing in the way; we know where they are to begin with. And you're saying: I will not add any intelligence at the node to know where to send the packets, I'll add another node that will do routing. Then I bottleneck it, I have a point of failure, so I need to start scaling it, so I need HA solutions, and now you're saying we have races even in the API. So when do you stop and say: is this the right solution?

So, for sending traffic between two compute nodes there is DVR. This is mainly now to handle the source NAT case, where everybody wants to share a single IP address for a whole tenant's worth of traffic, and we need to coordinate the use of that address. So your testing was east-west? Yeah, that, and north-south. Right, but if you want, you can skip the HA for east-west by using DVR plus HA. OK, so then this is just for SNAT? Then in that case it would just be SNAT for the L3 HA. But yeah, it's a valid question: we go through a lot of work to make this single source NAT IP address usable across a whole bunch of locations by putting it on the network node, and then we have to make it highly available. But people are concerned about IPv4 usage, so it's a difficult problem to overcome.

Oh hey — Chris, go first. As someone who actually ran into the multiple-actives-after-a-reboot issue: is there the option to do a policy, or some other mechanism, to actually verify whether a configuration is in an invalid state, such as multiple actives? Do you remember what the bug fix was for that? Yeah — actually, for the bug fix we started
using the port status: we check the status of the HA port. With the two-masters bug, the problem was that the L2 agent started later than the L3 agent, so the HA port wasn't active, and when the L3 agent started rescheduling routers it thought the HA port was OK, but it wasn't. So we started checking whether the HA port is active, and now this problem shouldn't be there. We fixed it in Newton, and I think we even backported it to Mitaka, so in Mitaka and later it should be fixed. We're also working on backporting it to Liberty, but at the moment there are some problems there, so that's still in progress.

Hang on — he was next. Any plans for backporting all these bug fixes to Mitaka and Liberty? Quite a few have been backported to Mitaka. I think Liberty has hit the point now, though, where it's only security and critical fixes, so I'm not sure how many of them will be able to get all the way back to Liberty. Yeah, we're trying to backport as many patches and fixes as we can, but some of the fixes depend on database changes, which can't be backported.

All right, we're actually out of time — maybe just one more that's quick. I have two questions, but they're related. One of them is: do you think — do you hope — your code is thread-safe, or have you proved the algorithm thread-safe and you're just worried about the code? So, well, it's not multi-threading; it's eventlet coroutine-based switching inside of Python. There are no shared data structures between coroutines, so we don't have to worry about concurrent access to the same data structure; it's about coordinating around the database. And so it's just through testing, right?
There's not like a formal proof of it. And what do you do when you add a new API call that might interfere — do you test it? Is there a way for you to check when someone pushes code? When you're thinking about new APIs, how do you know that they're not going to interfere? It's very ad hoc. It just requires testing to see if it breaks stuff, and then just knowledge of the database, knowing what's a distinct unit and whether it's composite. This is one of the rare resources in Neutron that's made up of other sub-resources, so it's not a very common problem; normally we can wrap everything up in a single transaction and be good to go. I'll just add here that I actually try to check all the changes that we get on L3 HA, on a virtual environment with several nodes at least — run some Rally tests and check that it doesn't get worse — and from some point I do it regularly. So, thank you. Thank you.