I'm Saachi, I work for Anchor Systems, and this is how we handled RabbitMQ failures gracefully with HAProxy.

For a quick refresher, what is RabbitMQ? Simply, it's a broker. It's a queue; stuff goes through it. Why is it important? It's at the center of pretty much everything: for Nova or Cinder or Neutron to talk to their own components, everything goes through the queue.

The issue is that when one of the RabbitMQ servers dies, there's no TCP keepalive or in-app heartbeat to let the client know the server has gone away, so the client doesn't attempt to reconnect. You end up with clients publishing but nothing consuming, because both publishing and consumption are push-based: you push to the server, and the server pushes to the client. When you try to push to the server and the connection has died, you'll reconnect; but when you're just waiting for messages, you never realize it's died. So you end up with a broken cloud. Everything dies.

One way of dealing with this problem is TCP keepalive. You'd need to change the settings to make it a reasonable timeout, not three hours. Another way would be heartbeats: in-app smarts, implemented at the application level. An advantage of heartbeats is that your connections can traverse things such as a load balancer without it occasionally dropping them, and without needing ridiculous idle timeouts on connections carrying no traffic.

A typical Rabbit setup looks like this: just a clustered Rabbit. You've got an OpenStack service connecting to two RabbitMQ servers that replicate the queues between them, and when one dies, the OpenStack service effectively dies with it.

To attempt to solve this, we threw in HAProxy, and it worked pretty well. One of the Rabbits goes away; HAProxy notices, because it's doing TCP keepalive with a reasonable timeout, and sends a reset to the OpenStack client, which then gracefully reconnects to HAProxy, which connects it to the still-alive RabbitMQ.

Of course, HAProxy itself needs to be HA, and we implemented that with a service IP that would move between two HAProxies. Problem being, we'd just recreated with HAProxy the same problem we had with Rabbit: when an HAProxy dies, the IP moves over, the OpenStack consumers don't realize it, and you've got a broken cloud again.

To solve that, we moved HAProxy onto each server that runs OpenStack services. We assume that HAProxy isn't going to die, and if it does, that whole node probably has a problem, so it's fine that it died too; the node can be taken out of the cluster and dealt with as needed.

This works pretty well, other than there no longer being a central load balancer, so one Rabbit may end up with all the connections. But that was the case before as well, and it's not really a concern because we don't have enough connections for it to matter. If you did care about that, you could throw another HAProxy in the middle and have HAProxy connect to HAProxy connect to Rabbit, and it would all work. So one node dies, the OpenStack services stay alive, everything keeps working, and nobody notices the failure other than maybe monitoring, which is fine. It can fail back and forth, and it's fine.

And that's it. The slides are available at this website, along with contact details and a link to the Launchpad bug. Now we can take a look at the config snippet for HAProxy. It's not a great config; it was just pulled off of our API load balancer and then had RabbitMQ smacked into it.
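A minimal sketch of that kind of config, assuming a local bind on each OpenStack node and two Rabbit back ends; the addresses, port, and health-check intervals here are illustrative, not our production values:

    # Per-node HAProxy in front of a two-node RabbitMQ cluster (illustrative values).
    # OpenStack services on this node connect to 127.0.0.1:5672 rather than to the
    # Rabbit servers directly.
    listen rabbitmq
        bind 127.0.0.1:5672
        mode tcp
        option tcpka                  # TCP keepalive toward both client and server
        timeout client 3h             # AMQP connections can sit idle for a long time,
        timeout server 3h             # so the idle timeouts are set very high
        balance roundrobin
        server rabbit1 192.0.2.11:5672 check inter 5s rise 2 fall 3
        server rabbit2 192.0.2.12:5672 check inter 5s rise 2 fall 3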
Because RabbitMQ is not always talking and may be quiet for some time, the timeout on both client and server has been set ridiculously high; that keeps HAProxy from sporadically closing the connection. The option tcpka line makes HAProxy do TCP keepalive: it sends an ACK to the client and the server with a sequence number one before the current one, and waits for a reply. That's how it's able to tell when a connection has died, and so able to send a reset. And with that, RabbitMQ fails gracefully. Any questions?

Q: Well, now I've got two. Sorry, now I have two mics and it's confused me. You mentioned an oslo bug. What's the state of that bug? Did you have to do this because we'd let you down and not actually fixed the thing?

A: This bug has been open since about 2013, I believe, so it's a matter of it not being resolved yet. This workaround was done as a way to make our OpenStack deployment reliable even though this part of it hasn't been made reliable yet.

Q: Yeah, so I think the summary is: yes, the oslo team does suck a little bit, and should please have a look at this bug. Oh, so it's probably just miscategorized, and if we put it in the right place, people would see it?

A: No, it's in the right place. Take a look at the bug, if it'll load.

Q: Yeah, that does look like it needs to be filed against oslo.messaging as well. Oh, there it is, in progress. Cool, that was my question.

Q: Are you running mirrored queues in RabbitMQ? And if yes, how are you dealing with potential split brain and any issues there?

A: Sorry, could you say that again?

Q: Are you running mirrored queues on the two Rabbit nodes?

A: Yes.

Q: And with this HA configuration, have you had any split-brain issues with mirrored queues?

A: Nope. Our network is fairly reliable; it's an InfiniBand fabric, so we haven't had any issues there. If one goes out, I'm not sure, but hopefully it will just stop serving. I'd rather have both stop serving than have them confused. But that probably wouldn't happen.

Q: I'll ask the question, because you said your network is very reliable. So are you relying on the network being reliable? Have you tried unplugging the cable between the two Rabbit nodes to see what happens? Or has nothing happened so far because the network is reliable?

A: So far, nothing's happened because the network is reliable. We've had system failures: we've just turned systems off, we've had Rabbit crash, and it kept running perfectly because HAProxy was in the middle, able to just move the connections across.

Q: I'll talk to you offline later.

A: Any others? OK. Thank you.