So we might as well kick off. My name is Dermot Tynan, and I'm joined by Tom Howley.

Hi, my name is Tom Howley. I'm a member of the block storage service team, and I've also done some work around common services shared across various OpenStack services, particularly in the area of HA. I'm currently working on the CloudOS core engineering team.

And I'm Dermot Tynan. I'm a cloud consulting principal for an OpenStack services group within HP Helion. Up until recently I was a member of the Neutron team for HP Cloud, and before that, like Tom, I was on the Cinder and block storage engineering team.

Some of the stuff we're going to cover today has a strong Neutron focus, but we're also going to cover how we HA the Neutron database and RabbitMQ. We're going to talk about the network nodes and the API servers, and then we'll get into some of the issues we've come across running this at scale in production, particularly around inter-AZ clustering, split brain, failover speed, STONITH, keepalives and cluster maintenance, and some conclusions. And I'll hand it over to Tom.

So, just to give a bit of background on this: my original work on HA for OpenStack services was in the context of our block storage project, a custom distributed scale-out block storage solution that we developed, originally as a back end for the nova-volume service and subsequently for Cinder. In that project we had HA requirements for a number of custom services, including a RabbitMQ instance, and at the time, a couple of years ago now, RabbitMQ itself was quite immature in terms of clustering and the mirroring of queues for HA. We also got recommendations from RabbitMQ experts that it wasn't a good choice at that time for an HA solution, so we decided it would be a good idea to use the more traditional Linux HA stack of Corosync and Pacemaker, both for our custom services and for the RabbitMQ instance.

This progressed later when we went to deploy Grizzly in our public cloud. We had a requirement at that time to provide HA RabbitMQ and database across a number of services, and we made a pragmatic decision, based on deadlines and on the fact that we already had a lot of experience with Pacemaker and Corosync, to use the same deployment methodology and the same set of deployment tooling. So, as for Nova and Glance and Cinder, for Neutron we use the Pacemaker, Corosync and DRBD stack.

This diagram shows an example of how we deployed it across a number of availability zones, and we'll come back to the fact that we had a Corosync cluster running across three AZs, because that brought up some issues of its own.

Just to briefly recap: as most of you are probably aware, Corosync is clustering software that provides particular guarantees around delivery of messages to upper-layer applications, specifically for implementing HA solutions. It provides notifications when a node has joined or left the cluster, and it will also tell you, for example, when you've lost quorum. Pacemaker sits on top of Corosync and basically manages a set of resources, where a resource can be anything from a process to a managed IP to a file system. In addition to that we use DRBD for block-level replication of our data stores. In our Neutron database deployment we have a very traditional deployment model: Pacemaker manages our IP, our file system and the service itself, and the file system is mounted on our DRBD device.
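As a rough illustration of that stack, not our actual configuration, an active/passive database group on DRBD in crm-shell syntax looks something like the following; the resource names, device path, mount point and VIP are all made up for the example:

    primitive p_drbd_db ocf:linbit:drbd params drbd_resource="neutron_db"
    ms ms_drbd_db p_drbd_db \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
    primitive p_fs_db ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/var/lib/mysql" fstype="ext4"
    primitive p_vip_db ocf:heartbeat:IPaddr2 params ip="10.0.0.10" cidr_netmask="24"
    primitive p_mysql ocf:heartbeat:mysql
    group g_db p_fs_db p_vip_db p_mysql
    colocation col_db_on_drbd inf: g_db ms_drbd_db:Master
    order ord_db_after_drbd inf: ms_drbd_db:promote g_db:start

The group can only run where the DRBD device is master, and on failover Pacemaker promotes DRBD on the surviving node, mounts the file system, moves the VIP and starts the service.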
Finally on that, we also use the STONITH mechanism for bringing nodes back into the cluster should we get a failure.

So, a very simple scenario: we lose a node in AZ1. Pacemaker fails over our service; our database promotes its DRBD device in AZ2, and our APIs can continue to work as normal. Meanwhile STONITH kicks in and resets the node in AZ1, and as soon as that node rejoins the cluster it begins to re-sync its DRBD devices.

That's basically the same model we've used for our Neutron RabbitMQ instance, and as it turns out, it's the very same set of deployment tools that we have for Cinder, Glance and Nova in public cloud; we just happen to have separate physical deployments in this case. So I'll hand over to Dermot to discuss the rest of the Neutron HA.

So the key piece for Neutron is the network nodes. We run multiple network nodes in each of our three AZs in each of our regions. In this picture the six network nodes are all part of the same cluster; essentially we use Pacemaker with cloned resources rather than active/passive, so all nodes are members of the same cluster within a region.

What we do is use a custom resource agent to manage the nodes themselves. Now, "resource agent" is a Pacemaker term and shouldn't be confused with the L3 agent or the DHCP agent; unfortunately it's an overloaded term. Our resource agent checks that the services are running and can start and stop the individual services. And what we actually do, rather than fail over an L3 agent to another system, is shut it down.

If we look, for example, at what Neutron sees in a typical cluster: this is a four-node cluster running a DHCP agent and an L3 agent on each node. That's the output of neutron agent-list; the numbers in parentheses, which neutron agent-list obviously doesn't show, are just a reference to how many networks or routers are on each agent.

Essentially what happens when we have an outage of one of those four nodes, say bob dies, is this: first, Corosync notices the node is missing, a token loss, and reports that to Pacemaker. Pacemaker then takes corrective action: it uses STONITH to immediately shoot the node, and then runs our resource agent, which updates the Neutron database. What our script does is first mark all of the agents on that node as down, and then take all of the networks and routers on that node and migrate them elsewhere; essentially it just reschedules all of them in a round-robin fashion.

At the end of that process, which is pretty quick, this is what neutron agent-list looks like. You can see that the two services that were running on bob, the DHCP agent and the L3 agent, are now shown with admin state down and zero routers and networks on the node, and everything has been migrated. I'm sure you didn't memorize the numbers earlier, but if you did, you'll notice that the actual workload has been migrated to the remaining nodes.
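The rescheduling is done against the Neutron API; a hedged sketch of roughly what that boils down to, expressed as the equivalent CLI calls (the agent, router and network IDs are placeholders, and our actual resource agent drives this via the API and also marks the dead node's agents administratively down):

    # see which agents exist and which ones have died
    neutron agent-list

    # for each router hosted on the dead node's L3 agent,
    # move it to one of the surviving L3 agents (round robin)
    neutron router-list-on-l3-agent <dead-l3-agent-id>
    neutron l3-agent-router-remove <dead-l3-agent-id> <router-id>
    neutron l3-agent-router-add <live-l3-agent-id> <router-id>

    # and the same for networks on the dead node's DHCP agent
    neutron net-list-on-dhcp-agent <dead-dhcp-agent-id>
    neutron dhcp-agent-network-remove <dead-dhcp-agent-id> <network-id>
    neutron dhcp-agent-network-add <live-dhcp-agent-id> <network-id>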
So that's essentially how we do HA with the Neutron network nodes. When the machine returns to the cluster it marks its own admin status back up, but we don't migrate anything back, so the workload stays distributed between the remaining nodes, the five remaining or in this case the three remaining. The returned node will take new work because it's now available, but we won't actually migrate anything over: doing so causes a slight outage and it's just not worth it. It would be nice to be able to do it, but it's an unnecessary outage, basically.

On the API server front, we upstreamed some code to use multiple worker processes; that went in with Havana, I think. We stole that code, sorry, we developed that code, following the Glance model and what Cinder and Nova use, so we combined a lot of what they did. As of Havana the API server can run multiple worker processes; we set the worker count to 20, so we have 20 processes on each node dealing with requests, with a load balancer in front. I've shown the load balancer fairly simplified there; it's actually a series of load balancers and rate limiters and so on. But it's a very straightforward failover, in the sense that the load balancer detects that an API server is gone, removes that entry, and routes all requests to the remaining API server or servers.
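A hedged sketch of the two pieces involved: the worker count is a neutron.conf setting (the option in current releases is api_workers), and if the load balancer were something like HAProxy, the detect-and-remove behaviour would come from an ordinary health check. The hostnames, addresses and timings below are illustrative, not our production configuration:

    # neutron.conf
    [DEFAULT]
    api_workers = 20

    # haproxy.cfg (illustrative)
    listen neutron-api
        bind 0.0.0.0:9696
        option httpchk GET /
        server net-api-az1 10.1.0.10:9696 check inter 2000 rise 2 fall 3
        server net-api-az2 10.2.0.10:9696 check inter 2000 rise 2 fall 3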
What's missing here, obviously, is STONITH. So it's up to somebody in the NOC, or whoever, to discover that the API server isn't available and to either restart the service, reboot the machine, or do whatever needs to be done; eventually, when the issue is corrected, the load balancer will pick the server up again. That's a much simpler, much more classic HA configuration for the API. With that, we'll talk about what can go wrong.

Yeah, so the rest of this talk will describe some of the experiences we've had. Some of them are quite specific to Pacemaker and Corosync, in my case, and some are a bit more specific to Neutron being used in that context.

The first area of interest is the fact that we have a Corosync multicast cluster running across three AZs. One thing I didn't mention earlier, which is the obvious point, is that as well as the fault tolerance of being able to lose one node, we can also lose an entire AZ and keep the service running.

One of the interesting problems that came up in one of our environments was some kind of network failure in our backbone between these AZs. The failure manifested itself such that each of the three nodes in our cluster figured that all the others were offline, so we had, in effect, a complete partitioning of our cluster. One of the first things we did was develop a plan for what to do if this actually happened in production, which is basically: quickly get one of the nodes back up in a single-node, standalone cluster mode and take the others out. Even for that kind of failure on its own, it was useful to be able to come up with that kind of plan for dealing with the problem in the future.

We then got to the task of actually figuring out what went wrong. Local checks, in terms of firewall ports and so on, didn't show anything. We ran tcpdump on all the nodes to see whether multicast packets were being seen on each node, and it appeared that they could all see multicast packets sent from each other. We got networking to have a look at the switches the nodes were directly attached to, and also some other switches in the backbone; they couldn't see any error counters or anything amiss in the logs.

What turned out was that when we restarted the Corosync service on one of the nodes, we started to realise we were actually missing packets. When a node rejoins a cluster it tends to send out larger packets while the messaging layer is negotiating the new membership of the cluster, and these were the packets being lost. It transpired that one of the switches in our backbone was silently dropping multicast packets above a certain size, and only multicast packets: it was passing unicast packets of all sizes. So that was a particularly interesting problem to come across.

I missed out a very important point here: when we had the complete partitioning of our cluster, the way it was configured at the time, which is the default configuration in Pacemaker when you lose quorum, is to shut everything down. This is the conservative approach: we want to absolutely ensure nothing is corrupted, so if we lose quorum in the cluster, we just shut everything down. But it turns out there are other options, such as freeze, which keeps things running as they are, or even ignore, which I don't think would generally be recommended. In this particular instance we decided, for the future, that freeze made sense. In addition to this new knowledge about how we should configure the cluster, we also decided it was worthwhile to add more monitoring, on the networking side but also locally, to check that multicast packets across a range of sizes are being sent and received, so that if we do see this problem again we can narrow it down much more quickly.
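For reference, the quorum policy we're talking about is a single Pacemaker cluster property; switching from the default of stopping everything to freezing in place is a one-liner in the crm shell:

    # default is no-quorum-policy=stop: shut everything down on quorum loss
    crm configure property no-quorum-policy=freeze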
So the next main issue I'd like to cover is partitioning of another form; this one is specific to DRBD. I'd say that about a year and a half ago "split brain" was my least favourite phrase in the English language. In DRBD, split brain specifically refers to the scenario where, during some temporary loss of connectivity, both of your DRBD devices try to promote themselves to master. The obvious problem is that you can get a forking of your data, and ultimately data corruption, so we want to avoid this at all costs.

To take this back to our Neutron database setup: we lose a node in AZ1. When that node rejoins the cluster after being shot by STONITH, it begins to re-synchronise its DRBD devices. The problem that sometimes occurs is that Pacemaker, based on whatever resource scores have been calculated, decides to migrate all our services back, including a promotion of the DRBD device in AZ1, and because the re-synchronisation hasn't completed, we get split brain reported in the log. This is a problem: DRBD will basically disconnect the devices and leave it at that until you manually resolve the situation, and it's very important that this doesn't happen in production.

Luckily Pacemaker and DRBD, in combination, provide a mechanism to avoid split brain, and this slide shows an example of a DRBD resource file configured with what's called DRBD resource-level fencing. What's basically involved is two handler scripts: one, called fence-peer, is invoked as soon as you lose connectivity between your DRBD devices, and the second is invoked when your DRBD devices have re-synced.
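In drbd.conf terms, resource-level fencing with the handler scripts that ship with DRBD 8.x looks roughly like this; the resource name is illustrative:

    resource neutron_db {
        disk {
            fencing resource-only;
        }
        handlers {
            # invoked when connectivity to the peer is lost
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # invoked once the devices have re-synced
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }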
So if we go back to the example where we've lost AZ1: when connectivity is lost, our fence-peer script automatically adds a rule to the Pacemaker CIB. The CIB is basically a distributed store of the current cluster configuration and current cluster state; every node holds the CIB, and every node is updated whenever there's a change. The rule is slightly confusing, a bit of a double negative, but it basically says: under no circumstances promote any node that isn't the AZ2 master. That should avoid split brain while the re-sync is in progress; it's highlighted there on the slide. So we shouldn't get that scenario now, because the rule stays in place until the devices have re-synced; when they do re-sync, the handler removes the rule from the CIB and we're back to normal. So this all seems fine.

I should also add that the AZ1 node itself will also try to invoke this rule, but there's a timeout added to our fence-peer script which ensures there's enough time for STONITH to kick in and kill that node before it does anything drastic. So, in theory, we shouldn't encounter split brain.

Then, of course, QA raised a bug, and there might be some people in here who recognise this. I'd be quite interested to know how everybody here would reply to this bug. We obviously took it seriously and did some investigation.

What's important here is, as I mentioned, that the CIB maintained in our Pacemaker cluster stores the current configuration and state, and every time there's a change we increment what's called the epoch. So, back to the same scenario: in cluster one, which is effectively our quorate cluster, we've lost the AZ1 node and we've added a rule saying don't promote any node other than AZ2. Say, for the sake of argument, that brings the epoch up to 2.0.1.

Meanwhile, the node in what I've called cluster two: by some quirk in the design of Corosync, when you bring down a network interface it actually rebinds to localhost and sets up a kind of standalone cluster. Apparently this is for some kind of test-mode operation, and unfortunately, in this standalone cluster, it typically adds a couple of property changes to the CIB, which brings its epoch to 2.0.2. I should say this doesn't always happen; it's very much a timing thing, because the node may be shot before it has had a chance to do this. When the nodes re-form the cluster, because that CIB has a higher epoch, it basically kills our rule and leaves us open to the potential of split brain occurring.

So it doesn't happen all the time. We ran a number of tests, both real tests and simulations where we could run hundreds of iterations of network loss. There was a simple one where we used iptables rules to block all incoming packets, all outgoing packets, and various combinations: hundreds of tests run in the lab on various clusters, no split brain occurrence. Similarly we disabled ports on a switch, tens of those kinds of tests, no split brain. And we also went into the lab and did some good old-fashioned cable pulls; again, no split brain.
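An iptables-based simulation is one way to fake a network loss without touching the interface itself; a hedged sketch, with 5405 being Corosync's default multicast port:

    # simulate losing the cluster network on this node:
    # silently drop corosync traffic in both directions
    iptables -A INPUT  -p udp --dport 5405 -j DROP
    iptables -A OUTPUT -p udp --dport 5405 -j DROP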
When QA ran the same basic test suite in our lab using ifdown eth0, 50% of the time it resulted in split brain, and that's consistent with what we found from asking questions on the Pacemaker mailing list and searching the internet. Can anybody guess what our response to QA was?

So, talking more specifically about the Neutron side of things: I should first point out that Jack McCann has a presentation on Thursday where he goes through operating Neutron at scale, so he'll cover a lot more of the detail, the facts and figures of how we have found Neutron when you run thousands of routers and thousands of networks. So I'll give a plug for Jack's talk at four o'clock on Thursday.

Our idea of scale here is, again, a thousand to four thousand routers. We have a pretty large sys-test environment that we run this stuff in and beat it up, and we've come up with some interesting conclusions. sudo, for example, does some interesting things; ntpd does some interesting things; failover isn't always as fast as Pacemaker and the rest of us would like; DNS and DHCP may not do what you think; and there's something we call the dreaded soft lockup, which I'll get into as well.

Starting with sudo: one of the things we noticed was that the out-of-the-box sudo, as delivered to us, scans the network node for all of the Ethernet interfaces. It's probably been doing this for years, and maybe everybody knew that, but we didn't. Unfortunately, on a network node with 2,000 tap devices, every time you run sudo it has to scan through 2,000 devices, so we now run a version of sudo that doesn't do that. Other things like ntpd also look, every time a new interface is added, at whether they need to do anything with it. So all of these processes, as we're adding networks and routers to the network node, are slowing things down on the machine.

ip netns also has to trawl through these interfaces, and it doesn't do it in a particularly fast way once you get into four digits of namespaces. If you look at the log files for Neutron (we'll ignore the use of the name quantum there), you see a lot of these ip netns execs run via sudo, and even if each one only adds three to five seconds of delay, when you're doing so many of them it adds up to quite a time penalty.
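To give a flavour of what those entries correspond to, this is the shape of command the agents run for every per-namespace operation; the router ID is a placeholder, and the agents invoke this through rootwrap rather than typing it at a shell:

    # one of thousands of per-namespace operations an L3 agent performs
    sudo neutron-rootwrap /etc/neutron/rootwrap.conf \
        ip netns exec qrouter-<router-uuid> ip -o link show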
Where that really hurts is failover speed. When you're adding networks and routers to a network host and they're taking three to eight seconds each, you don't really notice if you're just adding them sporadically; but when you need to fail two thousand of them over onto a machine, the failover can take hours. That's not something Pacemaker and Corosync were really designed for; they're designed for very fast failover of a virtual IP or an Apache server or whatever. When you have that number of routers to migrate across the cluster, it takes a while. Also, adding 500 routers to a machine that already has two thousand can take a lot longer than adding them to a machine that has just rebooted. So when you look at the mean time to repair of a network node, particularly one that's been shot by STONITH, it may come back online within 15 minutes, and sometimes it's actually better to just let STONITH shoot the machine rather than try to do anything HA, because the machine will come back up with its own payload quicker than the others can pick it up. That leaves us with a quandary as to whether we should shoot the node and let it reboot naturally, or migrate the load. We've done quite a lot of work, particularly Carl, who's in the audience here somewhere, on speeding up how long it takes to recreate that workload on a different network node, but it's still a significant amount of time.

Another thing we noticed is that when you fail over DHCP from one network node to another, every now and again we get a new port assignment. Now, I know we have fixed that; well, we fixed it internally and I believe it's upstream now. But the issue is that any VM which has the old port's IP address configured as its dnsmasq DNS server will no longer be able to resolve DNS, unless it's using its own DNS configuration, until it renews its lease. For a variety of reasons we run very long DHCP lease times, so you'll suddenly find that a machine can go quite a long time without any access to DNS, because the DNS IP address has shifted from, say, 10.0.0.2 to 10.0.0.3 or something. We have upstreamed some fixes there, but these are the types of things you suddenly run into out here.

And my favourite is the 22-second dropout. We discovered this with the version of the Linux kernel we're using, when we have a large number of namespaces on the machine. We have a performance group, and I should mention Rick and Michi, who did a lot of the work on this. When you get to 4,000 namespaces you end up with 16 million entries in the mount hash table, and the as-configured hash table has 256 buckets, because nobody really expects a machine to have that number of namespaces. What ends up happening is that you chase down these very long hash chains while holding the VFS mount lock, and the upshot is that you see soft-lockup warnings in the log file. These probably appear in the log files of all kinds of heavily loaded network nodes, but you don't really notice unless it so happens that the CPU in question was running Corosync, and now that CPU is out to lunch for 20 or 30 seconds.

The default tuning we use for Corosync, the out-of-the-box tuning we use for everything else, says Corosync needs to see the token within about a second. So if you've lost it for 22 seconds, Corosync reports the node as gone, and we end up shooting the node because of a token lost for 22 seconds. That doesn't particularly help matters, because now what we've done is transfer all that load to the remaining machines, and they're going to have to work harder, right?
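The token timeout in question lives in the totem section of corosync.conf; a hedged sketch showing where you would change it, with the raised value only illustrative of the kind of figure discussed below:

    totem {
        version: 2
        # how long (in ms) corosync waits for the token before declaring a
        # node lost; the stock tuning is on the order of a second
        token: 30000
    }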
So the conclusion about this, really, is that it's unusual for us to see actually, honest-to-God, really dead systems. We end up with this kind of zombie state, both in terms of switches and in terms of machines. The switches Tom covered: the fabric could exchange packets between the AZs, unless they happened to be Corosync packets, in which case it said, well, I'm not transferring those. So it looks like you have network connectivity, except the only thing that doesn't work is Corosync, and Corosync, as you know, doesn't like that. Likewise with servers: a 22-second or 30-second dropout on a machine isn't really going to make much difference; maybe people don't even notice it. But once you put high availability in there, now it's an issue.

Looking at some of the side effects: Pacemaker expects things to fail over pretty quickly, and if it's going to take up to six hours to fail things over, you really need to know that the machine is dead before you do it. So we have pulled back on the Corosync config for how long to wait before giving up on the token, from the default of about a second, which is what we use with the database, MySQL and RabbitMQ, out to 30 seconds and beyond for the network nodes. Because what you have to ask yourself is: will you take a 30-second hit on a machine that has just disappeared off the network, or take the six-hour hit of transferring everything?

So this brings us to STONITH, which is an interesting area, because a lot of the work within HA, particularly in Neutron, is around alternatives to Pacemaker and Corosync. We don't really need to say too much about what happens without quorum; we've seen that. To STONITH or not to STONITH is an interesting question, to misquote Shakespeare. We see a lot of application-specific HA being rolled out which wasn't necessarily available when we started doing this: for example Galera rather than MySQL on a DRBD cluster, RabbitMQ clustering, multiple DHCP servers, VRRP for L3 agents, and DVR, which in my view would be the better solution. The one thing those approaches don't do is shoot the node. They address what happens if one of your network nodes or database nodes goes down, but they won't actually take corrective action. So the real question is whether that's a good thing or a bad thing. If you don't take corrective action, you need a NOC team who are going to come in at four o'clock on a Sunday morning and fix the machine, and in the meantime you have a cluster that isn't quite ready for prime time if something else happens; you could have an outage. But you avoid the situation where STONITH just decides, I'm going to shoot that machine because I don't like it, or, as has happened in our lab environment, where we'd misconfigured iLO and suddenly shot the wrong machine. That gives you a very bad STONITH day, realising you've just shot a compute node.
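For context, STONITH devices are themselves just Pacemaker resources, which is why a misconfigured management address can end up power-cycling the wrong box. A heavily hedged sketch of the usual IPMI/iLO-style pattern, where the plugin choice, parameters and addresses are all illustrative and not our configuration:

    # one fencing device per node, forbidden from running on the node it shoots
    primitive p_fence_net1 stonith:external/ipmi \
        params hostname="net1" ipaddr="10.9.0.11" userid="fence" passwd="secret" interface="lanplus"
    location loc_fence_net1 p_fence_net1 -inf: net1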
So I'll pass it over to Tom to cover some stuff about keepalives.

Okay, so just to mention a couple of topics specific to RabbitMQ, obviously relevant to our Neutron RabbitMQ deployment. One issue is around loss of connections not being noticed by the peer. There's a simple example here where we have an RPC server listening for requests on a queue on a RabbitMQ server, and a client publishing requests to that RPC server. Say we lose the RabbitMQ server in an abrupt fashion, so that nothing is sent on the wire to let the RPC server know it has lost its TCP connection. At this point the RPC server is in blissful ignorance and, as far as it's concerned, is happily consuming away on the queue. Meanwhile the RabbitMQ server is brought back, and if it's a persistent queue it will be redeclared, or it will reappear the next time a client tries to publish a message to it. The problem is there are now zero consumers on the queue, so our server will never receive those messages; it will only receive them once it actually reconnects to the RabbitMQ instance.

It's worth pointing out here that any time you have a problem, most people's approach is, obviously, how do we fix it, and also, how do we add monitoring to detect it in the future? There's a simple case here: for some of these queues we should never have zero consumers, so something is wrong if that persists for a certain period of time.

Our solution was to use TCP keepalive, which is reasonably straightforward. The only catch is that keepalive is configured slightly differently on the server and the client side. In RabbitMQ you can only enable keepalive on all of its TCP connections; you can't actually tune the keepalive settings. So for the RabbitMQ server deployments we have to tune the system-wide keepalive settings, because otherwise you'll be waiting about two hours before RabbitMQ realises the connection has gone. On the client side, we tended to modify our RabbitMQ client libraries to tune the keepalive settings on their sockets; we did this for Pika and for py-amqplib, and some of those changes were upstreamed, certainly at least the enabling of keepalive; I'm not so sure about the specific tunings. So that works fine, but there is an alternative at the AMQP layer, which is to use the RabbitMQ heartbeat. At the time we originally solved this problem py-amqplib didn't support heartbeat, but I'd say it's just as viable a solution to the same problem.
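As a sketch of what the keepalive approach looks like in practice, with values that are illustrative rather than our production tuning: on the RabbitMQ side keepalive is simply switched on for its listening sockets, and the actual timing comes from the system-wide kernel settings.

    %% rabbitmq.config: enable keepalive on RabbitMQ's listening sockets
    [{rabbit, [{tcp_listen_options, [{keepalive, true}]}]}].

    # sysctl: tighten the system-wide keepalive timings
    # (kernel defaults are 7200 / 75 / 9, i.e. roughly two hours to notice)
    net.ipv4.tcp_keepalive_time = 5
    net.ipv4.tcp_keepalive_intvl = 1
    net.ipv4.tcp_keepalive_probes = 5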
Another quick one on RabbitMQ, and this is particularly relevant to Neutron because of the high number of compute nodes we tend to have in public cloud: this placed a heavy demand on the number of connections to our RabbitMQ server, and as a result it burned through file descriptors rapidly and quickly hit the default limit of 1024. Our solution was to modify the RabbitMQ Pacemaker resource agent so that the number of allowed file descriptors could be configured with a call to ulimit. The reason we did it this way was because of how rabbitmq-server started its process at the time, using start-stop-daemon. That also worked fine, and we designed it so that it was configurable, so different services deploying RabbitMQ could just pass that file descriptor limit down into their deploy.

So the final topic I'd like to cover, which I think is a very important one and probably gets mentioned a lot across the conference in various guises, is upgrade and maintenance. You would think that now that we have a nice HA solution, where we can migrate our services across two or more nodes, our upgrade procedure would be very straightforward. As an example here, I have a cluster of ten nodes grouped into five HA pairs. It's worth noting that the DC of a Pacemaker cluster plays a very important role, especially in this story. The DC is the designated controller: all Pacemaker policy decisions, whether to start a service, stop a service or shoot a node, go through the DC, and if the DC is lost, Pacemaker will very quickly elect a new one, so that in itself is not a problem.

For an upgrade we simply, say, take HA pair one: all we have to do is migrate services off the first node, upgrade and reboot. By upgrade I mean things like a kernel upgrade, firmware, or even a re-image of the node. Similarly we can migrate the services back and upgrade and reboot the other node. That's all fine. But as it turns out, in one of the scenarios we were testing, we had re-imaged and upgraded a couple of nodes and then moved on to the DC, just by chance. So we migrated services off the DC, upgraded and rebooted as before, and we now have a new DC. But for some strange reason this DC doesn't have an exact view of the cluster: it believes that the first half of HA pair one is no longer a member of the cluster. It was quite strange, because we were seeing messages on the DC node saying the peer is not a member of the cluster, yet the node itself figured it was still in the cluster. I should say that this seems to be a quirk of the version of Pacemaker we're running, which was the supported release with Ubuntu 12.04 Precise, but in any case it's an interesting problem. And obviously the real problem is that once the DC decided the first node was no longer a member of the cluster, it restarted that node's resources on another node, and now we have the same resources running twice, which is a serious problem.

One of the things we learned from this is that in the future, when doing these kinds of re-images or upgrades across the cluster, a good idea is to put the whole cluster into maintenance mode, so that your resources are no longer managed by Pacemaker and, if anything goes awry in the cluster, it won't have any negative impact. That assumes your upgrade goes to plan, and you don't want to leave the cluster in maintenance mode for too long, because obviously you've lost your HA during that period.
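For reference, that toggle is a single cluster property, roughly as follows in the crm shell:

    crm configure property maintenance-mode=true
    # do the re-image / upgrade work, then re-enable management
    crm configure property maintenance-mode=false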
I'll also just point out that upgrading Pacemaker and Corosync themselves, which is something we decided needed to be done when we encountered this particular problem, since it was related to a particular version of Pacemaker, is quite a different challenge, because you have to worry about compatibility between the versions of Pacemaker and Corosync and with your current configuration; you may have to upgrade your Pacemaker CIB in a separate step and upgrade Pacemaker separately. So I'll hand over to you, Dermot.

I'll quickly go through these, because I know we're running long and there are probably a few questions. The main thing with high availability, really, is that it is a difficult problem, and when you start out to do something yourself you suddenly realise why Pacemaker and Corosync are so complicated, the more things you run into. Monitoring, as we said, is really, really essential. It is sometimes easier to just reboot the node. STONITH can be a mixed blessing. Again, upgrades need careful planning. The key point for me is that DVR would save us a lot of grief in terms of avoiding high-latency failovers on the network nodes. There's always a need for a network node, because of things like default SNAT, but if we can offload the network nodes so that most of the work is done on the compute hosts, then high availability, particularly for the L3 agent, is a lot easier.

I will mention that we are hiring, like I think most places here, just to get that plug in, and I'll open it up to any questions.

No questions at all? Oh yeah, hold on, down the back.

All right, so the comment is: you noted that RabbitMQ had issues with keepalives; they've seen a similar example, where the kombu support for keepalives was reverted because no thread was implemented to take care of it, and the kernel-based keepalive appears to not always work either. Okay, that's not something I've come across. I haven't seen it in our deployment, and we have used two variants of RabbitMQ client keepalive, and obviously we did some testing, but we haven't seen it. And how have you resolved it? You haven't yet; okay, we haven't hit it yet either.

Can we make our changes available? We didn't upstream a lot of that stuff just because we weren't sure anybody else had an interest, but I suspect if the interest is there we can certainly make it available.

Sorry, I didn't catch the question, could you repeat it please? I don't think that microphone is working, is it? Yeah, it's not working. The question is about reliability and high availability: whether there's any thought of snooping packets on the active node and mirroring them to a passive standby, so the standby only tracks state without taking action, and if the active goes down the standby is up and ready at any point in time, with statistics and metering fully intact. I think we're out of time, but we can have that conversation outside; it's probably easier, because I think there's another group of people waiting to come in. But I think the answer is no. Tom?