Check, check. Okay. Hi guys, thank you for coming to this talk. We're going to talk a little bit about messaging, RabbitMQ, and troubleshooting today. My name is Michael, and co-presenting with me is Dmitriy. I work for Pivotal, on the RabbitMQ team; Dmitriy is with Mirantis. The first part is Dmitriy's, so I'll hand over.

So let's start. First, the plan: first I will tell you what oslo.messaging is and how it is used by OpenStack. Then I'll give you some tips on how you can troubleshoot it, and some advice on how you can avoid troubleshooting it in the first place. Then there will be Michael's part, where he will dive into the details of how you can troubleshoot RabbitMQ, and how you can set it up so that, again, you don't need to troubleshoot it.

Let's start. So, what is oslo.messaging? oslo.messaging is a library which enables the services that use it to build RPC clients and servers; that's its first task. Its second task is enabling services to emit and handle notifications. It doesn't do either directly: it doesn't do RPC directly and doesn't send notifications directly. Instead, it uses various backends; you can see the list of backends on the screen. oslo.messaging supports all of them via drivers. The notable thing is that RabbitMQ is supported by two different drivers: one is based on the Kombu library and the other on the Pika library. In this presentation we will speak solely about the Kombu driver. If you use RabbitMQ with oslo.messaging, most probably you are using that driver, because Pika support was added only in Mitaka.

It is worth adding that oslo.messaging is a project developed purely in OpenStack, so it's our project. Mostly, OpenStack services use it for internal communication: Nova uses it to communicate between its own components, Neutron uses it, and so forth; I'll add more examples later.

So here is the first example. This is a rough overview of how spawning a VM looks from the point of view of Nova. First, a client sends an HTTP request, and then Nova starts working. It sends RPC requests between its components: nova-api sends to nova-conductor, the conductor queries the scheduler to figure out which host can be used to spawn the VM, and then it sends a request to the appropriate compute node to spawn the VM. As you see, all interaction here goes through RPC, oslo.messaging RPC. So it is a really essential part of your cloud: if messaging doesn't work, then just about the only thing you will have working is Keystone.

Here are more examples of internal messaging within various components. It is also worth noticing that there is at least one example where services communicate with each other through messaging: each OpenStack component sends notifications to Ceilometer.

We've seen a picture of how OpenStack components use messaging, but where is RabbitMQ in this picture? Let's review one piece of that communication, between nova-conductor and nova-compute. When nova-conductor wants to send an RPC request to compute, it actually puts a message into a queue in RabbitMQ, and nova-compute reads the message from there. If nova-compute wants to send a reply back, it puts a message into another queue, and nova-conductor reads it from there. So that's it.
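By the way, you can see this topology for yourself by asking RabbitMQ to list the queues oslo.messaging has created. A minimal sketch; the vhost and the queue name patterns are illustrative, since exact names depend on your release and topic configuration:

```
# List all queues on the OpenStack vhost (vhost name is an assumption;
# many deployments use "/" or a dedicated vhost).
rabbitmqctl list_queues -p / name

# Typical oslo.messaging queue names you may see (illustrative):
#   conductor                    - nova-conductor's RPC topic queue
#   compute.node-1.example.com   - per-host nova-compute queue
#   reply_<uuid>                 - per-client reply queues used by call()
```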
The next step brings us closer to troubleshooting: how can you understand that you have problems with messaging? Most probably, you understand that you have problems when something doesn't work; you look into the logs, and in the logs you see that magic word, oslo.messaging. Here, on the second line, you see oslo.messaging highlighted. It means you have hit a messaging problem.

Now let's switch to actual troubleshooting. The caption here is "my favorite exception", and you see the exception name: MessagingTimeout. Why is it my favorite? Because people claim it happens due to all sorts of different reasons, but the truth is, just from seeing that it's a MessagingTimeout, you can't tell what actually happened.

Let me explain why. To do that, I need to dive a little deeper into how oslo.messaging works. It supports several different operations. The first operation is cast: a fire-and-forget kind of operation, where the client sends a request and then instantly forgets about it. The next operation is a notification; essentially it's the same as cast, it just serves a different purpose. And finally there is the call operation, where the client sends a request and then receives a response from the server. MessagingTimeout is an error which can occur only during a call operation. It happens when the client successfully sends a request to the server but doesn't get a reply within a predefined amount of time.

Let's look deeper at the stages a call operation consists of. First, the client sends a request: it puts the request into a RabbitMQ queue. Second, RabbitMQ passes this request as a message to the server. Third, the server processes the request and produces a response. Fourth, it puts the response into a RabbitMQ queue, and fifth, the client reads the response from that queue. The thing is, a MessagingTimeout can occur if any of these stages, from stage 2 through stage 5, fails.

How can you understand at which stage the failure occurred? The good thing is that in Mitaka we added fine-grained logging, where each request is logged at each stage. Here on the screen you see an example where a Neutron agent, I believe it's the L3 agent, sends a state report to the Neutron server. Here you see the Neutron agent's log: the first stage of the request, when it actually does the call, and the last stage, when it receives a reply from the server. And here you can see an example of the Neutron server's logs, when it receives the request and sends a reply back to the client.

The interesting thing is how you actually get these debug logs. Most people think it is enough to put debug = true in the config, and you've got it. Unfortunately, for messaging it's not that easy. What you actually need to do instead is find the default_log_levels line, uncomment it, find oslo.messaging inside it, and set it to DEBUG.
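Concretely, in the service's config file (nova.conf, neutron.conf, and so on) that means something like the sketch below. The default_log_levels list here is abbreviated; keep the rest of the stock entries as they are and change only the oslo.messaging one:

```
[DEFAULT]
debug = true
# The stock list ships with oslo.messaging=INFO; raise just that entry.
default_log_levels = amqp=WARN,amqplib=WARN,oslo.messaging=DEBUG,iso8601=WARN
```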
Sometimes it happens that you see the failure, but you didn't have debug enabled in advance, and you still need to analyze the failure and produce a report. What can you do in that case? First, examine the stack trace: find in the stack trace which operation actually failed, guess the destination service, and try to find the correlating log entry in the destination service's logs. For example, here is a stack trace with a highlighted line. It shows that the request was a report_state request of the DHCP agent, and such requests go to the Neutron server. Obviously, you need to understand OpenStack architecture in order to guess the destination service which should have processed the request.

Another thing to keep in mind when looking into the server's logs: if you have several instances of the server, for instance several neutron-server instances, you need to look into the logs of each one, because you don't know which exact instance processed the message. And going back: if you do have debug enabled, the debug logs show a unique message ID which stays the same through all the stages of the request and response.

You can also diagnose messaging issues through RabbitMQ itself. Here is the first useful command: it lists all the queues in RabbitMQ, with the number of consumers each queue has and its name. It is also convenient that the output here is sorted by the first column, consumers in this case. That way you can spot queues which have zero consumers, and that might indicate a problem: if a queue has zero consumers, nobody listens to it and nobody processes requests from it. One of the reasons why this might happen is that the corresponding service died.

Another useful command is essentially the same list_queues command, except, as you see, I've added the messages column, so it outputs the number of messages currently sitting in each queue. Why is this useful? Because you might spot that some queue has a lot of messages accumulated in it, and that might indicate several different problems. First, it might be that the corresponding service just can't cope with the load; if all worked well before, you need to investigate why this is happening now. Another reason is that the processing service might have gotten stuck. And finally, a funnier kind of error: if you are using a RabbitMQ cluster, it might be that your cluster is partitioned.

Regarding cluster partitioning: you will save yourself a lot of time if you check that your cluster is whole by running this command. What you need to do is ensure that the running nodes list contains all the RabbitMQ nodes in the cluster. If it doesn't, the failure modes are very different, but in general your OpenStack will be completely inoperable.

Now, a little more on how you can actually fix such issues when you find them. First, if you see a problem with RabbitMQ itself, then obviously you need to look into the RabbitMQ docs and see how you can fix it. If the problem is not in RabbitMQ, then in many cases a restart of the corresponding OpenStack service might help; using the debug logs, you need to find which service is failing. And there is a third, milder way: instead of restarting the services, you might close the connections from the services to RabbitMQ and thus force them to reconnect.
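Put together, the diagnostic commands from this part look roughly like this; the sort is just there to bring suspicious queues to the top:

```
# Queues with their consumer counts; zero-consumer queues are suspicious.
rabbitmqctl list_queues consumers name | sort -n

# Queues with their message backlog; a large backlog may mean an
# overloaded service, a stuck service, or a partitioned cluster.
rabbitmqctl list_queues messages name | sort -n

# Cluster health: every cluster node must appear in running_nodes.
rabbitmqctl cluster_status
```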
So that's all my troubleshooting advice. Now I want to make several suggestions on how you can make your life easier.

The first suggestion concerns the amqp_auto_delete parameter in oslo.messaging: never, ever set it to true. What it does is this: if you set it to true, all your queues will be created with the auto-delete flag, and that means that when the consuming service dies, the queue is deleted as well. It's pretty convenient, it works as garbage collection, but it has its downsides, which I'll describe in a moment. If you really want this garbage collection, it is much better to instead use a queue expiration policy with some time-to-live, for example one minute. Here is the link to the RabbitMQ documentation describing how you can enable this policy; we will share this presentation afterwards, so you can follow the link and see how it works. One of the advantages of policies over the auto-delete flag is that policies can be changed at runtime, whereas with auto-delete you need to actually destroy your queue and recreate it with a different setting if you want to change it. Actually, regardless of how you set this flag, oslo.messaging used to create some queues with it enabled: reply queues and fanout queues. That was the behavior before Mitaka; starting from Mitaka we changed these queues to expiring ones, so you're safe there.

Now, I think I owe you an explanation of why you shouldn't use the auto-delete flag. Imagine you actually set it to true. What might happen in the interaction between OpenStack services? Here we see nova-conductor, and let's say it is about to send a request to nova-compute to spawn a VM. But before it sends the request, the following happens: the network hiccups, and nova-compute loses its connection to RabbitMQ. Next, nova-conductor sends the request to spawn the VM. Then RabbitMQ kicks in: it sees that the queue has no consumers and has the auto-delete flag enabled, so it deletes the queue completely. Next, nova-compute reconnects to RabbitMQ and recreates the queue. All goes fine, except the message which was in the queue: it's lost, just lost forever. And what you end up with is a VM stuck in the spawning state forever.

The next piece of advice applies if you use a RabbitMQ cluster, and maybe queue mirroring. What I want to add is that queue mirroring is quite expensive: our tests show that on a three-node RabbitMQ cluster, if you enable queue mirroring, your throughput drops by half. And when we're speaking about RPC, queue mirroring and HA are actually not essential, so it might just not be worth the trouble if throughput is more important for you. On the other hand, it might be that notifications are important to you; that happens, for instance, if you use them for billing. In that case, what you can do is disable HA for the RPC queues but enable it for notifications. If you want to know how to do that, here is a link; it shows how we do it in Fuel.

The final piece of advice I have, or better to say suggestion, or maybe advertisement: in Mitaka we developed a feature where you can specify different backends for RPC and notifications. You are free to use different drivers, or you can use the same driver, so that, for example, RPC messages go through one RabbitMQ and notifications go through a different RabbitMQ, and you can set them up differently for whatever you need. Here is a link to the implementation. Unfortunately, it's not well documented yet, so you will have to look into the code to understand how you can configure it.
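For reference, here is roughly what those suggestions look like in practice. The queue name patterns below are assumptions based on the usual oslo.messaging naming (reply_<uuid>, <topic>_fanout_<uuid>, notifications.*); adjust them to what you actually see in rabbitmqctl list_queues:

```
# Garbage-collect idle reply and fanout queues after one minute of
# disuse, instead of relying on the auto-delete flag.
rabbitmqctl set_policy --apply-to queues expiry \
    '^(reply_|.*_fanout_)' '{"expires": 60000}'

# Mirror only the notification queues, leaving RPC queues unmirrored
# so RPC keeps its throughput.
rabbitmqctl set_policy --apply-to queues ha-notifications \
    '^notifications\.' '{"ha-mode": "all"}'
```

And the split between RPC and notification backends boils down to configuring two transports. A sketch, assuming two separate RabbitMQ installations and Mitaka-era option names:

```
[DEFAULT]
transport_url = rabbit://nova:secret@rpc-rabbit:5672/

[oslo_messaging_notifications]
transport_url = rabbit://nova:secret@notify-rabbit:5672/
```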
That's it for my part, and I pass the mic to Michael.

Thank you, Dmitriy. So, we're going to talk a little bit about things that are not OpenStack-specific: general problems that my team answers questions about every single day, and I'm not overstating it. So if I say something that doesn't exactly fit the context of OpenStack, that's probably why. Let's start with a picture of a rabbit, of course. Also, let's see if I can do something about this slide deck... yep, that's better.

Okay, so, the first issue. It's not that common, but people completely panic when they see it. RabbitMQ runs on the Erlang VM, and every once in a while someone comes to the mailing list and says: hey, my Erlang VM just disappeared. My monitoring system says it used to be there, but it's no longer. So what happened? Now, for those who deploy their private clouds in the Caribbean, you could blame it on the Bermuda Triangle, but most of us aren't that lucky, right? So we need to investigate. About 95 percent (this is not very scientific) of the issues that lead to this are (a) the Linux out-of-memory killer. That's pretty easy to investigate: just take a look at the kernel log and grep. Or (b), you might have run into one of a small number of known runtime issues. This is one popular example: this particular one means that either you run a 32-bit Erlang runtime on a 64-bit machine, which is a bad idea for fairly obvious reasons, don't do that, or you are hitting a known bug which, at least personally, I don't recall seeing ever since maybe Erlang 17.5. So those are the two common scenarios. Let's move on.

The next question that we get asked a lot: how do I know what exactly consumes RAM on my node or in my cluster? Again, there can be all kinds of reasons; let's take a look at some of them. First of all, don't guess. Use rabbitmqctl status: it contains a rough breakdown of what RabbitMQ thinks uses memory. It's not 100 percent precise, because we're at the mercy of the runtime, but it's good enough in most cases. And just as Dmitriy mentioned before, you can list queues and see how much memory each individual queue uses that way.
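Both checks fit in a couple of commands; the kernel log path varies by distribution:

```
# Did the Linux OOM killer take the Erlang VM down?
dmesg | grep -i -E 'killed process|out of memory'
grep -i oom /var/log/kern.log      # path is distribution-dependent

# What does RabbitMQ itself think is using memory?
rabbitmqctl status                 # see the {memory, [...]} section

# Per-queue memory use, largest last.
rabbitmqctl list_queues name memory | sort -k2 -n
```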
One specific case of this: RabbitMQ has a management plugin which collects stats about things in your cluster, displays them, and presents an HTTP API. The thing that collects the stats has historically been a single Erlang process, a single thread if you will, that does way too much. The things in your cluster (connections, channels, queues, and nodes) emit stats periodically. Once you have enough of them, enough connections for example, and you don't even have to have a particularly high message rate, that process can start falling behind, which means it is receiving stats faster than it can process them.

One way to see if that's the case is, again, to run rabbitmqctl status and look for the management DB key. If you see that it is disproportionately larger than most other things, and more importantly if it's growing, that might mean this is exactly what's going on. You can reset the management DB using rabbitmqctl eval. This is a safe thing to do; you can run it against any cluster node, it doesn't have to be the one that hosts the stats database. So while there are tweaks that can reduce how many stats are emitted per unit of time, worst case you can just add a cron job that resets it every however many minutes you feel appropriate. The only caveat is that once you reset the stats database, it doesn't have any data in it, obviously, and until the next time connections or queues emit something, as far as this database is concerned they do not exist. This may wreak havoc on your monitoring system, and we have seen cases where monitoring goes completely crazy about this. So be careful. But it takes, I don't know, two or three seconds for that database to restart, probably even less, so it is an option.

Of course, this is not exactly what RabbitMQ should be doing; we should have a parallelized event collector. And that's coming in RabbitMQ 3.6.2, of which there is a release candidate, so feel free to give it a try. It's something like 20 processes instead of one, so you can expect that things have gotten a bit better.

Moving on, there is a plugin called rabbitmq-top. All of you are probably familiar with top, the Unix command-line tool; this is a similar tool for RabbitMQ processes. Erlang has its own tools in that area, which we are not going to get into, but in modern RabbitMQ versions you can enable and disable plugins on the fly, so you can temporarily enable it and take a look.

But queues, messages, and stats are not the only things that can consume RAM: connections do too. Every connection consumes, let's say, a couple hundred kilobytes. Now, why is that? Does RabbitMQ keep a lot of connection state, or do we do something stupid that allocates that much RAM? No. (By the way, for channels it's way, way less.) The answer is that every connection has two TCP buffers. Those are auto-tunable on Linux and about 100K each by default; it can range from 96 to 128, something like that. You can reduce them in RabbitMQ and in the OS alike; take a look at the docs for how to do that. This is the primary knob to tweak if you want to support more connections, because RAM will be your most likely bottleneck. And a little-known fact: you can limit the number of channels on a connection, just as a safety measure. Cap it at maybe 64 or 128 or something like that, using the setting shown here.
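The plugin can be flipped on and off at runtime, and the buffer and channel knobs live in rabbitmq.config. A sketch: the numbers are illustrative, not recommendations, and channel_max is my assumption for the setting on the slide. The stats database reset incantation is the one that circulated for the 3.6.x era; verify it against the docs for your exact version:

```
# Enable/disable rabbitmq-top without restarting the node.
rabbitmq-plugins enable rabbitmq_top
rabbitmq-plugins disable rabbitmq_top

# Restart the management stats database (3.6.x-era incantation).
rabbitmqctl eval \
  'supervisor2:terminate_child(rabbit_mgmt_sup_sup, rabbit_mgmt_sup),
   rabbit_mgmt_sup_sup:start_child().'
```

```
%% rabbitmq.config (Erlang terms); values are illustrative.
[
  {rabbit, [
    %% Shrink per-connection TCP buffers to trade throughput for RAM.
    {tcp_listen_options, [{sndbuf, 32768}, {recbuf, 32768}]},
    %% Cap channels per connection as a safety measure.
    {channel_max, 64}
  ]}
].
```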
The next common scenario is that you have a node that's not very responsive. There can be all kinds of reasons for it, but one little-known, or at least under-appreciated, tool that can help is this one (the exact invocation is sketched after this section). If you're familiar with JVM thread dumps, and I think they're called the same in .NET, that is basically what it does. It uses a pretty simplistic technique to detect processes whose stack traces are not progressing over a certain period of time, and it simply outputs a bunch of information about those processes. Like I said, there can be all kinds of reasons for unresponsive nodes, but by looking at this information you can usually narrow things down quite quickly. In the last couple of months, Erlang Solutions and Pivotal have identified a bunch of deadlocks in the data store that RabbitMQ uses internally for metadata (queues, channels, users, vhosts, and so on, but not messages), and there were cases where you could run into a distributed deadlock. As far as we know, they are all fixed in Erlang/OTP 18.3.1; like four of them, if I remember correctly, are now a thing of the past. Hopefully.

So, let's move on: TCP connections are rejected. Your OpenStack components try to connect to RabbitMQ and fail. This is really basic, and I understand that most of the audience are highly competent engineers, but you have no idea how frequently we have to tell people: hey, you need to actually allow traffic in your firewall. So this is one thing to check. If your machine has multiple network interfaces, ensure that RabbitMQ listens on the correct one; by default it's all of them, but check just in case.

Check the open file handle limit in your operating system, because the defaults on Linux are great for running GNOME but not modern servers; they are completely insane. It's, I think, 1024 connections, and that, by the way, is per process, but in any case it's inadequate for servers such as RabbitMQ or MySQL. So bump it to half a million and forget about it. You pay a very small penalty in terms of how much RAM the kernel uses, but that's nothing compared to the inconvenience you get when your clients cannot connect.

Another thing: the open file handle limit controls how many open file handles, including sockets, a process can have at the same time, but it is not the only possible bottleneck. Imagine you had a network failure between your clients and RabbitMQ, then it healed, and clients reconnect en masse; you have hundreds or thousands of clients reconnecting. The way inbound TCP connections are handled is that they are put in a queue (this has nothing to do with RabbitMQ queues, it's just the data structure); they form a backlog, if you will. That backlog is of limited size, and as soon as it reaches the limit, all new connections are rejected by the kernel. You can tweak it per socket using RabbitMQ settings, and there are also several kernel settings; net.core.somaxconn is probably the most important one to tweak. The default in RabbitMQ is 128, which is arguably okay for most people, but the kernel defaults, I would say, are a bit inadequate.

Lastly, again, it's kind of obvious, but you have no idea how often we have to tell people: hey, check your logs. There is probably something in there.
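Two concrete bits for this section. First, the stuck-process detector mentioned above: the slide itself isn't in the transcript, but the tool matching this description in that era is rabbit_diagnostics:maybe_stuck/0, so treat the exact call as an assumption:

```
# Dump information about Erlang processes whose stack traces have not
# changed recently (function name assumed from the description above).
rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
```

Second, the file handle and backlog limits, sketched for a systemd-based distribution with illustrative values:

```
# Per-process open file limit for RabbitMQ: a systemd drop-in, e.g. in
# /etc/systemd/system/rabbitmq-server.service.d/limits.conf:
#   [Service]
#   LimitNOFILE=500000
# (On older init systems, use /etc/security/limits.conf instead.)

# Kernel-side TCP accept backlog:
sysctl -w net.core.somaxconn=4096

# RabbitMQ's own per-listener backlog, in rabbitmq.config:
#   {rabbit, [{tcp_listen_options, [{backlog, 4096}]}]}
```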
So: TLS connections fail. This deserves a talk of its own; I'm not going to explain how public key infrastructure works or any of that. Again, see the log files: TLS handshake, authentication, and peer verification failures, among other reasons, are logged in the error messages. They often come from libraries such as OpenSSL, and they're not very sensible, but you can at least Google what's going on. That's much better than guessing. But what is even better than guessing, or even using logs, is using tools. One of the tools is openssl s_client. It lets you open a TLS connection, with a certificate and key pair you provide, to any server that has TLS enabled, including RabbitMQ. And there is a server alternative, openssl s_server. Those two will help you narrow down whether it is a client issue or a server issue. Almost always, the issue ends up being that one of the peers doesn't trust the other, or, if you use certificate chains, that the verification depth is insufficient. There is a guide on rabbitmq.com, it's been around for quite a while, about troubleshooting TLS using openssl s_client and s_server and so on; check it out. Lastly, we have seen known limitations in the TLS and crypto modules in Erlang prior to, say, 17.5, so that's the version we highly recommend as a minimum if you use TLS. Otherwise you might run into fairly obscure issues which come down to the fact that, for example, certain elliptic curve algorithms are not supported; but the error message of course doesn't say that, it says something completely different, because OpenSSL.

Right. Another thing we have to do every once in a while is message payload inspection: you want to see what flows through RabbitMQ. RabbitMQ has a tracing feature, which means that every message published in a vhost will be republished to the amq.rabbitmq.trace exchange; it's a topic exchange. So you can use any RabbitMQ client to consume the messages that go there and see what's in them, including the payload. There is a plugin that basically provides a web UI for this. But keep in mind that tracing puts a lot of pressure on the system: you basically replicate your entire traffic, at least for that vhost. So use it carefully. Again, it can be enabled and disabled on the fly.
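Both tools are one-liners to start with; host names, ports, and file names below are placeholders:

```
# Probe a TLS-enabled RabbitMQ listener with a client certificate/key
# pair; the verbose handshake output shows which peer rejects whom.
openssl s_client -connect rabbit.example.com:5671 \
    -cert client_cert.pem -key client_key.pem -CAfile ca_bundle.pem

# Stand up a bare TLS server to test the client side in isolation.
openssl s_server -accept 8443 \
    -cert server_cert.pem -key server_key.pem -CAfile ca_bundle.pem

# Toggle the tracer for a vhost; traced messages are republished to the
# amq.rabbitmq.trace topic exchange. Remember to turn it off.
rabbitmqctl trace_on  -p /
rabbitmqctl trace_off -p /
```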
Lastly, if that fails, just use tcpdump, or Wireshark, which is a GUI tool built on the same library. It will display AMQP 0-9-1 traffic as well as TCP fields and so on. It's extremely helpful when you have to debug something, and it can also work with TLS connections if you give it the certificate and key pair.

Right. Another issue that's not obvious is higher-than-expected latency. There can be all kinds of reasons for it, but let's name some. Again, using Wireshark you can narrow down which component, in OpenStack or maybe RabbitMQ itself, introduces it: just look at the TCP segment timestamps, which can be precise enough to narrow things down. Then use something like strace or dtrace (there are all kinds of tools; we will get to that) to narrow the problem down further. Also, be aware that the Erlang VM has things called schedulers; those are what run your code, and they can be pinned, or "bound" in Erlang parlance, to CPU cores. If you have a multi-core system, especially a NUMA one, you really want to avoid scheduler-to-core migration, because that ruins the effectiveness of CPU caches and can introduce latency spikes for no obvious reason. Just switching the scheduler-to-core binding strategy with a VM configuration flag can sometimes help a lot, but this is workload- and system-dependent; just be aware of it.

Right, so you might have noticed a general theme here. Again, I understand that the folks in this room are highly competent engineers, but I do RabbitMQ support seven days a week, effectively, and I answer the same questions over and over; I have to recommend the same things over and over. So: guessing is not an effective or efficient debugging strategy, don't do that. What do you do instead of guessing? Use tools to gather data that will help you form a hypothesis. Always consult the log files, for both oslo.messaging and RabbitMQ, and if you use HAProxy or a different intermediary for connections, it also logs things, has a management UI, and can affect your system's operation. Also, ask on rabbitmq-users. We see a very small number of folks who come to us and ask questions in the context of OpenStack, and we're not entirely sure why; maybe we just don't know that their questions are related to OpenStack. The entire RabbitMQ team is on that mailing list, and we are happy to answer your questions, but if you never come to ask them, there's not much we can do.

Lastly, here are the tools that can help you collect data about how your TCP/IP stack works, what happens with drivers, applications, and system calls, scheduler stats, virtual memory stats, I/O stats, all kinds of things. There is no need to become an expert in all of them, but you can probably pick five that would already make you a couple of orders of magnitude more efficient than the typical "guess and ask on Stack Overflow" kind of debugging approach.
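A few picks from that toolbox as shell one-liners; PIDs, interfaces, and ports are placeholders, and the scheduler binding flag is an illustrative Erlang VM option, not a blanket recommendation:

```
# Capture AMQP traffic for later inspection in Wireshark.
tcpdump -i eth0 -w amqp.pcap 'tcp port 5672'

# Socket-level view of connections to the broker.
ss -tnp | grep 5672

# Which system calls is a stuck process making?
strace -f -e trace=network -p <pid>

# Virtual memory and I/O statistics over time.
vmstat 1
iostat -x 1

# Scheduler-to-core binding for the Erlang VM (test before relying on
# it; behavior is workload and system dependent):
#   RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+stbt db"
```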
Right. Thank you, and let's open it up for questions, if we have time.

Q: This is specifically about reconnection in Kilo. We've had issues where, when one of the controllers goes down, the compute node goes into a loop of "timed out waiting for a response", and the only solution seems to be restarting the compute node. Do you have any better strategy than that? This is specifically targeted at Kilo.

A: Do you have multiple controllers?

Q: Yes, I do, and one of them goes down.

A: The trigger is one of the controllers going down, and after that you see constant messaging timeouts on the compute side?

Q: Correct, and this is with heartbeats turned on.

A: I can't tell right now what happens, because, as I said earlier in my slides, that can be due to various reasons. You really need to enable debug and see where your messages get stuck, because if you use multiple controllers, then when one controller is down, the other controllers should be able to process your messages, and I believe I've seen them doing so.

A: This is not in the context of OpenStack, but as a general remark: automatic reconnection to RabbitMQ in RabbitMQ clients is mostly a solved problem. I'll say "mostly" because people always want some kind of improvement; it doesn't really matter what features you have, they always want more. But in a bunch of clients, sadly, it's lacking; maybe Pika is one of them. Certainly not Kombu, though: there, automatic reconnection has been around for years. So this can be improved greatly at the library level, just saying.

A: I just wanted to add: if you actually see messaging timeouts, then you don't really have a problem with connections, because that means compute was able to send requests to RabbitMQ. So it's not a connection problem; the problem is somewhere else.

Q: Yeah, I agree. Actually, I saw the same problem, and it was the conductor: I restarted the conductor and things started working again. I don't know if there's a solution, but I see the same problem during HA failover; that's what happened here.

Q: So, my question: you said that HA for the queues, the replication, is not really essential. Can you elaborate on that one?

A: Sorry, come again?

Q: You said that HA for the queues on RabbitMQ, the replication of the queues, is not really essential, right? I guess that is related to the AMQP protocol, which basically handles this, like a...

A: No, no, what I meant is this. If you enable HA, you become more resilient: messages never get lost. If you don't enable HA, then messages might get lost; for instance, if a message was on a node which went down, then essentially you've lost the message. What I meant is that it is not essential because failures, I mean server failures, occur very rarely in production environments, and what you get as a result is, okay, several VMs stuck in the spawning state or something else; you can manually clean that up. But on the other side, if you don't enable HA, you get twice as much throughput in normal operation, which, to my mind, may be important.

Q: Okay, thanks.

A: I would like to add that there is a trade-off between throughput and safety in every data service; RabbitMQ is just one example, and you see the same in data stores and many other systems. I think the point is that you don't have to mirror everything: some data is more important than other data, and some data is transient, in other words you don't want to have it after a certain period of time. For those cases, just don't mirror it; at least, not mirroring could be a reasonable trade-off.

Q: A bit more about RabbitMQ reconnection. It's well known among all ops (it even has its own nickname, though rabbits have nothing to do with it): basically, if something happens around RabbitMQ, you have half of OpenStack doing something strange, or doing nothing. And the question is, how can we quickly diagnose this situation? Are there any specific signs that nova-conductor is stuck in this state?

A: Let's start with this: which release are you speaking about?

Q: Well, we're actually running a pretty old release, planning to upgrade. It's more of a theoretical question: how can we know that nova-conductor no longer talks to RabbitMQ?

A: Did you check whether somebody actually consumes from the conductor queue? I've shown the command: rabbitmqctl list_queues consumers name. If you see zero consumers, that means nobody consumes from it. I believe I've seen such situations, where during failover our consumers just dropped off and never reconnected; on later releases I've seen such cases more rarely.

Q: Can we have a more proactive way to know, some kind of ping message to the conductor, to send and receive?

A: How about monitoring the number of consumers? If you see that it drops, if the number of consumers doesn't equal the number of conductors you have, then it's a problem.

Q: Thank you.
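A check along the lines of that last suggestion might look like this; the queue name and the expected consumer count are illustrative:

```
# Alert if the conductor RPC queue has fewer consumers than the number
# of nova-conductor workers we expect (3 here, an assumed value).
expected=3
actual=$(rabbitmqctl list_queues name consumers \
          | awk '$1 == "conductor" {print $2}')
[ "${actual:-0}" -ge "$expected" ] \
    || echo "ALERT: conductor queue has ${actual:-0} consumers"
```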
Q: I mean, can we have some active check for nova-conductor, something like sending an "are you here?" message via the message queue and receiving an answer, for monitoring? Because checking RabbitMQ means taking into account TCP timeouts, all those things. It would be really useful to have a tool just to see that it's operational.

A: Mm-hmm. Okay, I got that.

Q: Is there that type of message in the oslo.messaging protocol?

A: No, there is no such type, nor a built-in check that a consumer is alive, but maybe it's worth the effort. What I worked on mostly is trying to ensure that the conductor, or the oslo.messaging server in general, just doesn't die; I never thought in the direction you suggest. Thank you.

A: So, my take on this: most sane messaging protocols, including, like, three out of four... actually, no, it's four out of four of those that RabbitMQ supports, have something called heartbeats, or keepalives, or however else these things are called. What they do is let you detect TCP connection unavailability earlier; in effect, they undo some of the things TCP does. So that is a feature that exists at the protocol level. I understand that checking whether something is alive involves more than that. I'm not entirely sure what exists in OpenStack, but what I see around RabbitMQ is that not everybody agrees on how exactly monitoring should be done, so there is no consensus yet.

Moderator: Gentlemen, we are five minutes over time, so any further Q&A can be done in the hallway. Thank you.

A: If you want to ask more questions in the hallway, please just come to us.