Hello, everyone. Thank you for joining us today. I'm Matthew Papo, a software engineer at Cisco. And I'm Wego Sung, a tech lead at Cisco Cloud Services. Today we're going to talk to you about RabbitMQ at scale and a lot of the lessons we learned. A few of our other colleagues who contributed to this presentation but aren't here today are Wei, Scott, and Kerry.

To give you a little idea about our environment, we run an overcloud and an undercloud: we virtualize the entire control plane for our tenant cloud inside another OpenStack cloud. Our tenant cloud has over 800 Nova compute nodes, over 700 routers, 1,000-plus networks, and over 10,000 ports. Each OpenStack service we run inside the tenant cloud runs on three controller VMs, and we're supporting Neutron with the OVS and L3 agents. Our RabbitMQ nodes have four vCPUs and 8 GB of RAM each, and we run them as an active-active cluster. We're not running HAProxy or Pacemaker in front of the Rabbit nodes; all of the clients connect directly to Rabbit itself. We're running this on Red Hat 7 with RabbitMQ 3.3.5-22 and Erlang R16. One thing to note is that we pull these packages from Red Hat OSP, so although it's RabbitMQ 3.3.5, there are actually a lot of performance and stability backports that Red Hat does for us. And we're supporting Icehouse and Juno clients, so this does not include any heartbeats or QoS.

So what happens when things go wrong? When you start running into scale issues with Rabbit, the first thing you'll probably notice is your compute services start flapping up and down. Instances may fail to boot, stuck waiting for port binding. Neutron agents may start timing out, and you'll start seeing errors like "timeout waiting on an RPC response." And if you start checking the RabbitMQ queues, you'll notice that messages are backing up, which is never a good sign. The first thing you may think is: let me just go and restart everything. Unfortunately, if you're running into scale issues, this can actually compound everything; you may melt everything down and end up having to stop the whole cluster and bring the services back up one at a time.

In our early days we were using all the default settings for RPC and Rabbit, and our cluster grew a lot larger than we expected. We quickly realized that RabbitMQ was our bottleneck and that we needed to do some configuration tuning to fix that. The first thing we jumped into was the client-side configuration around Rabbit, particularly the Nova and Neutron client configurations. There were a lot of things we had to do around enlarging the RPC pools, so increasing the thread pool size and the connection pool size, and also extending the timeouts. The default RPC response timeout is 60 seconds, but with really large Neutron stacks that's not nearly enough, because Neutron needs to go and pull all of the networks, the subnets, the ports, and the bindings, and some of the calls from the DHCP agent and L3 agent for getting the active network information and syncing state can take much longer than 60 seconds to respond. So we've actually extended this much larger, to 960 seconds. There were some performance optimizations in Kilo that were partially backported and seemed to help, but we still needed to extend it. Another thing that we really needed to do was increase the number of RPC workers.
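Before getting to the worker counts, here is a rough sketch of what that client-side tuning looks like in neutron.conf (or nova.conf) with the Icehouse/Juno-era oslo.messaging option names. The pool sizes below are illustrative placeholders, not the exact values from the talk; only the 960-second response timeout is quoted above.

    [DEFAULT]
    # Thread pool used to service RPC calls (default 64).
    rpc_thread_pool_size = 128
    # Pool of connections to RabbitMQ (default 30).
    rpc_conn_pool_size = 60
    # How long a caller waits for an RPC reply before timing out (default 60 s).
    rpc_response_timeout = 960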
So we currently have this tuned to around four per controller. Like I said, we have three Neutron servers, and they'll each be running four workers. Just to give you an idea of the impact that makes: with the original three Neutron RPC workers, if there were 10,000 messages in the backlog, it could take almost an hour for Neutron server to catch up and process all the messages. When we increased that six-fold to 18, the backlog drained very quickly, all the messages caught up, and the cluster was much more stable.

Even with the RPC tuning, though, we were seeing frequent disconnects and reconnects with Rabbit. You would see these 404 errors for "queue not found," and unfortunately, when an agent got stuck in that state, we would have to go and restart it. Running OVS pre-Liberty, when you restart the OVS agent you essentially have to reload all the flows, and our clients just love it when you unplug their VMs on them. We quickly determined that we were running into a race condition with these auto-delete queues. Before Juno, auto-delete was not really an option we could even change in Neutron, so all of these queues were declared with auto-delete. What could happen is that if a Rabbit node went down and the clients went to reconnect, there would be a race between the queue declaration, the consumer actually binding to the queue, and RabbitMQ deciding, "hey, I don't see any consumers, I'm going to go and delete this queue." So we backported some oslo.messaging driver fixes that really helped stabilize that, and on the Neutron side we found some kombu driver improvements that really helped. In particular, one of the patches we found in the Ubuntu Cloud Archive added some exception handling so that if the client sees this 404 error, it re-declares the queue, on the assumption that it was likely deleted because of an HA cluster failover, and then has the consumer reconnect. However, we noticed that although the queue was getting re-declared and reconnected, the consumer wasn't actually consuming, so we had to add a few more lines there to actually make it consume again.

Even with the tuning on the client side, we still never got to the root of the problem, to what was actually causing these connection issues. So that's when we decided to start digging deeply into Rabbit itself. One of the important things we dug into was the Erlang configuration. There are a few options in the Erlang arguments that are important to add, especially the keepalive settings and the +A 128 flag, which sets the Erlang VM I/O thread pool size; Rabbit recommends you set this to at least 128. Another common configuration recommendation we had picked up was the TCP user timeout, the {raw, 6, 18, ...} socket option, which sets TCP_USER_TIMEOUT to 5,000 milliseconds, the idea being that you'll quickly detect when an established connection fails, independently of any keepalive. However, this actually caused a lot of issues. An important thing to note: if you set this TCP user timeout, it will override any keepalive timers if it's shorter. And what we discovered is that if you dropped a single packet with this set, it would trigger a socket teardown on Rabbit; you would see the inet error etimedout.
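As a sketch of the Erlang-level settings just described; the file and variable names here follow common RabbitMQ packaging and may differ in yours:

    # /etc/rabbitmq/rabbitmq-env.conf -- kernel poll, larger async I/O pool,
    # and TCP keepalive on Erlang-level connections
    RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+K true +A 128 -kernel inet_default_connect_options [{keepalive,true}]"

    # The TCP_USER_TIMEOUT entry that is often recommended looks like this
    # wherever the socket options are set (6 = SOL_TCP, 18 = TCP_USER_TIMEOUT,
    # value in milliseconds), for example in tcp_listen_options in rabbitmq.config:
    #
    #     {raw, 6, 18, <<5000:32/native>>}
    #
    # This is the setting that tore down healthy connections on a single dropped
    # packet, so the recommendation here is to leave it out.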
We were seeing this happen between Rabbit and the clients, and even within the cluster itself. Another important thing to note is that in Liberty there were some QoS prefetch options added that are really important: they let you limit the number of messages a consumer accepts off the queue at once. Otherwise, if you had a queue with many messages backed up and a consumer connects for the first time, all of those messages are going to be flushed to that client at once. And if you had this TCP user timeout setting too, the client could just be busy trying to catch up and digest everything, while from the server side it looks like "I didn't get a response, so I'm going to go whack the connection," and it would close the connection on a perfectly healthy client.

Another thing to note, since we're virtualizing the control plane in KVM, is the default transmit queue length set on the tap devices, and it's really tiny: only 500 packets. What we noticed is that with a busy Rabbit cluster this buffer could overflow and you'd start dropping packets. So if you didn't increase the transmit queue length, you'd drop a packet, and if you had the TCP user timeout set, the same thing would happen: you'd trigger disconnects that you really didn't want. So we really recommend that you increase this transmit queue length; we recommend setting it to 10,000. The amount of memory you'll actually use is negligible. One important thing to note is that you can't set this parameter in KVM or Nova; it's something you have to do on the hypervisor side. So we recommend you add a udev rule to set the tap interface transmit queue length to 10,000. You can do this on the fly, or you can set the udev rule so that when a VM is brought up, it gets set automatically. Between adjusting the transmit queue length and removing that TCP user timeout, we addressed a lot of the disconnect issues we were seeing in a large-scale cluster.

Yeah, let me make a point here. Excuse me, Matt. Between these two, realize that the TCP user timeout is actually the underlying cause, because with it you can't tolerate a single packet loss, and the tap interface setting on the KVM hosts for the virtualized Rabbit nodes is what was occasionally causing the packet drops. By tuning both of these, hopefully you avoid packet loss altogether, and the underlying TCP/IP stack can also tolerate a single packet loss, which, even in a data center network, can still happen from time to time. So certainly if you virtualize your control plane, that's something you want to take into consideration.

Yeah, so if you pull up ifconfig on the tap device, the thing you want to look for is the transmit errors: if the dropped count is greater than zero, you probably want to increase that buffer. Some other things to keep in mind if you're virtualizing the control plane: you never want to suspend a Rabbit VM, because when it resumes it thinks it's still healthy, and it will actually cause a partition. And you also want to monitor the hypervisor for any underlying issues.
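A minimal sketch of the tap-device tuning described above, run on the hypervisor; the interface name and the udev file name are illustrative:

    # Check for drops on the Rabbit VM's tap device (TX dropped > 0 means the
    # buffer is too small):
    ip -s link show dev tap-example0

    # Bump the transmit queue length on the fly:
    ip link set dev tap-example0 txqueuelen 10000

    # Or persist it with a udev rule so every new tap device gets it on creation:
    # /etc/udev/rules.d/60-tap-txqueuelen.rules
    SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", RUN+="/sbin/ip link set dev %k txqueuelen 10000"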
So pay attention to anything like a CPU soft lockup, disk or memory contention, or RAID or I/O controller resets, because all of these can cause a little skip in the underlying VM, and then all of a sudden it's out of sync with the cluster.

Moving on to some of the RabbitMQ configuration options we've also been working with. Some of the important things we set were cluster_partition_handling, which we set to pause_minority (we'll come back to that in a little bit), and the memory high watermark, which you also want to make sure you set. In our case we set it to 0.4, which allows 3.2 GB of RAM for Rabbit, and that's more than enough: if Rabbit is consuming more than a gig of RAM, we're already having issues, so increasing the high watermark is basically just delaying the inevitable. Some other important options are setting reuseaddr to true, which reuses sockets that are in TIME_WAIT, but do note that this is not really safe if you're NATing any of your connections to Rabbit. You also want nodelay set to true, which disables Nagle's algorithm for better throughput, since small RPC messages get sent immediately instead of being batched. And we also enable keepalive.

There's also some process-level tuning you want to make sure you do, mainly around the file descriptors. The defaults on some Linux distros only allow about a thousand file descriptors for Rabbit, and we want to set that to at least 65K or more. This lets a lot more connections, queues, and messages build up without hitting the limit. Basically all the other limits we just set to unlimited; we want to guarantee that on the VM, RabbitMQ has all the resources available to it. If you want to check your current process limits, you can run this command, at least on Red Hat and CentOS, and it'll tell you what everything is set to. And like we said, we recommend just maxing everything out to unlimited.

So back to partition handling, the choice between pause_minority and autoheal. It essentially comes down to the CAP theorem: we're either going to sacrifice consistency or availability of the cluster. In our findings, we really don't think an inconsistent cluster is actually useful in OpenStack; we feel consistency is more important than availability. Yes, if a client is connected to the minority node it's going to have to fail over to another one, but we feel that's more important than staying available. One thing to note if you are using pause_minority is that it requires a quorum: if there's only one node alive, it's going to pause. So if you're doing maintenance on the cluster and you bring down two of the nodes, you're going to cause an issue. Always keep that in mind if you set pause_minority. And both autoheal and pause_minority are imperfect; they both have their issues. So you really need your own kind of partition monitoring and alerting, which we'll talk about a little later. It's also really useful to have automation to restore a partition: if you can identify which node is in the minority, you can just wipe the Mnesia database on that Rabbit and restart it, and it will resync and rejoin the cluster.

So next, on to queue mirroring. This is something that's set via RabbitMQ policies. You will see in some of the legacy configurations on the client side that there's a setting called rabbit_ha_queues. This actually doesn't do anything after RabbitMQ 3.0.
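A sketch of those broker and process settings, assuming a stock rabbitmq.config layout and a limits.d drop-in for the file descriptor limit; the exact file locations and the pgrep pattern are assumptions about the packaging:

    %% /etc/rabbitmq/rabbitmq.config
    [
      {rabbit, [
        {cluster_partition_handling, pause_minority},
        {vm_memory_high_watermark, 0.4},
        {tcp_listen_options, [
          binary,
          {reuseaddr, true},   %% reuse TIME_WAIT sockets (not safe behind NAT)
          {nodelay,   true},   %% disable Nagle's algorithm
          {keepalive, true}
        ]}
      ]}
    ].

    # /etc/security/limits.d/rabbitmq.conf -- raise the file descriptor limit
    rabbitmq  soft  nofile  65536
    rabbitmq  hard  nofile  65536

    # Check what the running Erlang VM actually got:
    cat /proc/$(pgrep -f beam.smp | head -1)/limits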
All the queue mirroring policies are set via RabbitMQ policies, not in the client configuration, so just keep in mind that that client option doesn't do anything. Another really important thing to note is that queue mirroring is really not needed for RPC, and it's really expensive: there's a two to three times performance hit if you have queue mirroring turned on. You can turn it off and your cluster will work fine, because Rabbit itself will route your messages through; if you're connected to Rabbit node 3 and your queue lives on node 1, it will route your messages across the cluster. So we really recommend you only mirror the billing queues, the notification queues, anything you really can't afford to lose. There are also examples in Liberty that show this working without queue mirroring. One thing to note: if you previously mirrored and you want to change your policy to turn mirroring off, you'll likely need to restart the cluster. Although you can set the policies on the fly, we noticed we had to actually restart the cluster for them to fully take effect.

On to some operating system tuning. The default TCP settings are really not ideal: with the default TCP keepalives, we're talking over two hours before the kernel even sends a probe packet to see if the connection is still alive. We adjusted these parameters and it really, really helped the clients fail over when a Rabbit node went down. We recommend setting the keepalive time to five seconds, with five probes at a one-second interval, and also lowering the TCP retries setting. This really helped the clients figure out there was an issue with Rabbit and then decide, okay, I'm going to go connect to the next one.

Next, on to monitoring RabbitMQ. We use rabbitmqadmin to monitor health, and we make sure we query each node for the cluster health and partition status. Sometimes when a cluster gets partitioned, one node can think it's okay while the other two have a different view of the cluster, so when you do your checks, make sure you run them against all three nodes, not just one. Some of the things we measure are the Erlang memory utilization versus the high watermark, the number of file descriptors used, sockets used, process utilization, system memory, disk utilization, and the queues and number of unacked messages. Whenever messages are building up, you may want to set a threshold, say anything greater than 20 or 30 messages triggers an alarm, because messages backing up are always a sign of a problem. You also want to look through the RabbitMQ logs, because you'll see these "alarm set" messages, basically saying no connections are going to be accepted until the alarm clears, and right before that it'll tell you why it went off: maybe you're out of file descriptors or out of memory. So it's always important to look at that.

In addition to using rabbitmqadmin for a lot of the monitoring, we also run a lot of synthetic testing. We have synthetic tests constantly sitting there booting a VM, creating a router and a network, attaching them to the VM, and pinging the VM; if it works, tear everything down and do it again. Create a volume, delete a volume. Upload an image.
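Going back to the keepalive tuning above, a sketch of the sysctls involved; the keepalive values are the ones from the talk, while the retries setting is mentioned but its value isn't given, so it's left commented out here:

    # /etc/sysctl.d/99-rabbitmq-tcp.conf
    net.ipv4.tcp_keepalive_time = 5       # seconds idle before the first probe (default 7200)
    net.ipv4.tcp_keepalive_probes = 5     # probes before the peer is declared dead
    net.ipv4.tcp_keepalive_intvl = 1      # seconds between probes
    # net.ipv4.tcp_retries2 = ...         # lower the retransmission retries; value not given in the talk

    # Apply without rebooting:
    sysctl -p /etc/sysctl.d/99-rabbitmq-tcp.conf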
And a lot of times, if you see a failure in this synthetic testing, it's a hint that something could be wrong underneath with Rabbit. So it's really useful to have a cron job or something running in the background constantly doing that.

On to some miscellaneous Rabbit tips. Use rabbitmqadmin instead of the rabbitmqctl commands. Before 3.6, the rabbitmqctl list commands did not actually stream; they would sit there and buffer the entire result. So you'd notice that if you have a stuck queue or something, you'd try to run rabbitmqctl list_queues and it would sit there and hang forever. We really recommend you use the admin interface, the rabbitmqadmin command, because it talks REST to a lightweight HTTP server and returns the results, and it doesn't go through the same Erlang RPC path as rabbitmqctl, so it doesn't put as much stress on the system. Some other important things: you want to monitor the memory usage of the management stats database. It can eat up a lot of memory, and if it does, you can actually terminate it on the fly; it will not hurt a production cluster and doesn't affect anything besides statistics. If you are running a large cluster, we'd recommend disabling the management UI if you don't really need it, since you can pull everything from the API. If you can't disable the UI, you definitely want to at least adjust the statistics collection interval: it defaults to collecting every five seconds, and that's really intensive. You can adjust this on the fly as well; we recommend setting it to 60 seconds or more.

Some other tips. You definitely want to set a policy for queue TTL; you set this with "expires" in milliseconds, and just set it to something greater than the RPC message timeouts. Then if there are zero consumers, Rabbit will go and clean the queue up for you. This is really useful for cleaning up orphaned queues: sometimes if a hypervisor goes down, you'll have a queue that just sits there piling up more and more messages and eating up your memory, and by setting a policy you can have Rabbit clean those up for you. Another thing: don't use auto-delete queues if you have the option. It's much safer to just use a queue TTL and have Rabbit clean it up for you, because, like we said earlier, when a node fails over there's a race condition that hasn't been totally solved, where Rabbit decides to auto-delete the queue before someone else actually binds to it. And if you do see lots of reconnects between the client and server, you want to investigate the RPC tuning further and definitely look into your network stack to see if there are any errors. Another tip: by default, when you set the rabbit_hosts list on the client side, the client will normally always connect to the first host in the list. So if you were to go restart a whole stack, they're all going to try to connect to Rabbit 1. One little trick you can do is randomize the order across the different services so that it's not always Rabbit 1, 2, 3; that way, if you restart all the services, they'll actually distribute the load a little better for you.

So, architectural decisions. If we could go back and redo things from the beginning, we probably would not use a single Rabbit cluster for all the services. We would definitely at least take the chatty services, maybe Ceilometer and Heat, and put those on a separate cluster.
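For the queue-expiry and selective-mirroring policies described above, a sketch with rabbitmqctl. The TTL value and the queue-name pattern are illustrative assumptions (the TTL just needs to sit comfortably above the RPC timeout, and the pattern assumes notification queues named like "notifications.info"); note that only one policy applies per matching queue, so the mirroring policy gets a higher priority:

    # Expire any queue that has had no consumers for 20 minutes
    # (comfortably above the 960-second RPC timeout):
    rabbitmqctl set_policy --apply-to queues expiry ".*" '{"expires":1200000}'

    # Mirror only the notification/billing queues instead of everything:
    rabbitmqctl set_policy --apply-to queues --priority 1 ha-notify "^notifications\." '{"ha-mode":"all"}'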
There's no need to put everything in one cluster, because you're just asking for failure. You're better off limiting your failure domain so that if one thing fails, it doesn't affect all of your services.

So those things were really helpful in tuning and actually supporting a very large cluster. There are also a lot of good resources out there. There were some good talks in Austin on troubleshooting Oslo that go into depth on how to debug Oslo issues, so I recommend you check those out, and there are also some other good talks from Tokyo on Rabbit. And lastly, check out the RabbitMQ users mailing list: you can search the full archive of messages, and you can reach out to the RabbitMQ developers and ask them any questions you have. Really useful resources. With that, we want to turn it over for questions. If you can, please come up to a microphone and ask your questions so that we can get them on the recording as well.

Is it really possible to have another RabbitMQ cluster for Ceilometer, or... I'm asking whether you implemented that. You were showing two different clusters, one for Nova, Glance, and Neutron, and the other one for Ceilometer and Heat. Yeah, yeah. Have you actually implemented that? No, but we would, I guess. Or we would even split out separate services. But it's possible in general? Yeah, it should be possible. The services are all independent of each other, so there should be no need for them to actually talk to each other on the same cluster.

You said that you changed the TCP parameters to use TCP keepalive. Yes. Did you try to use the RabbitMQ heartbeat? So in Icehouse and Juno, we didn't actually have the heartbeats available to us. Yeah, they weren't implemented yet. And I know there were some issues also... Absolutely, we've heard there are still lots of issues there. I think that's the reason the TCP keepalive was originally introduced as a recommendation from many experts, and on top of that, that's why the TCP user timeout was also being recommended. We actually found out that in reality it hurt us rather than helped us. So even right now, because we don't have a perfect heartbeat solution at layer 4 and above, you probably should still tune the TCP keepalive at the TCP/IP stack level. Okay, thank you.

You mentioned in the slides that not all the queues have to be HA, just the notification queues, so I'm wondering how to differentiate. Do you have a list of which ones are safe? So you only really need to mirror the notification queues for Ceilometer if it's something you use for billing; if losing the messages would cause you to lose billing information, most people don't like that. Those are the only ones we really recommend it for. You don't actually have to do it for any of them, though: all the RPC calls within OpenStack will work without turning on queue mirroring. Which means, say provisioning is doing some RPC messaging to RabbitMQ, and halfway through the whole RabbitMQ cluster goes down, you won't even know that you're losing a message. In that case you'll have something like a failed or stopped entry, a half-finished job in the middle of the queue. You could, but it's a very small chance: most of these messages are extremely short-lived, they're just being dropped off and delivered right away. It's not like these messages are persisting on Rabbit. They shouldn't be.
We should be delivering right away, so the risk is very small. But yes, if there was a cluster failover, you may end up with an error state for one VM or something like that. In the majority of cases, because of the general OpenStack model, things will get re-provisioned: either the message will get redelivered by the calling party if it's RPC, or there's a state check. So say a node reboots and the message was lost: when it comes back up, it will grab the state, see that it's really supposed to have this VM, and recreate all the ports and VMs that need to be on that node. Are you sure this is fully implemented in all the projects, this kind of recall or redo? At least for most of the major OpenStack services we've been running internally, we haven't seen any issues with that. Is that with HA turned on, or with some cleanup scripts? Yeah, I think that's certainly still a concern overall if you don't have HA, but even when you have HA, our experience is that when you have a problem with RabbitMQ, you end up having to restart from scratch, and we rarely see any issues associated with that, because by restarting you lose all the messages anyway. In the OpenStack usage pattern the messages aren't persistent, right? They're all transient. They're not persistent? That's correct, yes.

I think one of the things we ran into in our environment, with the number of ports and the scale, is that we didn't realize all these issues in the beginning, and once we started growing our environment we started realizing all the tuning you really want to do ahead of time if possible. So I'd say that's one point: if you decide to grow your environment, you really want to look into this ahead of time. But in general, mirroring is not required; it's not needed for the majority of the messages. Like I said, if OpenStack sees that a message didn't get delivered, when the system restarts it will rebuild the state. As a distributed model you can't guarantee it's always in sync anyway, and the messages are short-lived: after a couple of minutes a message is pretty much useless and it would just be redelivered. Okay, great, thank you.

If there are any other questions, feel free to email us and we'll get back to you. Hopefully this will be useful. Thank you very much.