Thank you, everyone, for coming down. All of you, or most of you, are either OpenStack users, OpenStack system administrators, or technical support engineers. Have any of you managed to install the full OpenStack stack without hitting any problems? No. OK. So most of you have hit at least one issue, be it with Neutron, Nova, or anything else. If you are just experimenting with OpenStack, you have enough time to sit back, relax, and troubleshoot. But that's not the story if you go to production with OpenStack. If you hit an issue there, you lose the right to sleep, so you end up sleeping like this. And you do want to keep your right to sleep like this. That means you have to do your homework and configure your production environment properly to get optimum performance.

So throughout this session, Dustin and I are going to walk you through a couple of issues that we worked to troubleshoot: the approach we took, what the root cause was, how we arrived at a resolution, and how that can help you save a lot of sleepless nights. Myself, my name is Sadique, and I have been working for Red Hat technical support for more than nine years. Recently I have been working as a cloud success architect. The name itself explains the role: to make sure that each and every OpenStack deployment is successful. This is my colleague, Dustin, and I will hand it over to him to continue with the rest of the session.

Thanks, guys, for joining us. Again, like Sadique, I am a cloud success architect for Red Hat. Sadique and I both work as part of the support organization, so what we want to show you today really comes from the experience of our broader support organization. There's not a whole lot of time in 40 minutes to run through a lot of troubleshooting, so what we chose to do in this session was to highlight a couple of the key issues we have seen come in. We're going to show you a couple of points where you may be able to make a proactive choice that makes a difference. But a lot of this is really about the process as well: how you approach these situations, and what we do in our support organization to try to make this as easy as possible for you. Here is our contact information; feel free to email or tweet at us.

So let's start off with what a problem looks like when it comes in. Sometimes you're going to get a user complaint of one of these flavors; sometimes you're going to get an error. Looking at each one of these, they all look like they could be different in some way, and they may have come from different sources: different log files, different messages from users, different experiences that they're having. In our support organization we see a lot of this. We have a lot of these types of messages coming in, and we really have to filter through them and figure out, OK, where do we look for the information that actually tells us what the real problem is. You can imagine this gets noisy. There are a lot of explanations, a lot of different wordings of the problems. You've all seen this. If you're deploying OpenStack, nobody raised their hand when Sadique asked whether they had done a flawless deployment. You don't. You're going to run into some kind of issue, and you're always going to have some crazy list of messages trying to point you in different directions.
So I'm going to talk about one issue in particular that we found fairly interesting, because it pops up reasonably often. And it turns out, as you'll see, that we had a RabbitMQ problem, thus my rabbit falling over here in his troubled state. Again, these kinds of issues manifest in so many different ways that narrowing them down to this is an interesting task to take on. In this particular case, the problem was described as: the compute service on the compute nodes is stuck in a state of activating. What does that mean? Where do I even begin to look?

So we're digging into logs. Obviously, this is one of the first places we're going to look, trying to figure out what in there is important. And one of the things we come across in the log files are these nondescript timeout messages. This is going to be common as you troubleshoot things: you're going to see a lot of log messages that at first are mixed in with a bunch of other messages, and figuring out what the actually important information in that log file is can be quite difficult. In this case, we see an operation timeout. It says it's terminating. We've got a failed state. We've got a scheduled restart. And of course, this is even worse, because it's going to keep trying over and over again. So we end up with massive log files. If you're familiar at all with the sosreport tool, this is a way for us to collect data that we look at from the support side; it gives us a snapshot of what's going on. When things like this happen, your logs can suddenly be hundreds and hundreds of megabytes per file, so just generating a sosreport and collecting all the data can be troublesome on its own.

So we take this, we're looking at this situation, we're trying to dig into it. We've said, OK, something's hung up here, we're using too much of something, something's timing out, what's going on? Let's try rebooting the node. In this case, rebooting the node doesn't help; we run right into this problem again pretty quickly, even after rebooting. So our process takes this a bit further: where can we start looking now to figure out the source of this issue? A tool we commonly use, once we can identify a process to look at, is strace. In this case, digging into the strace we get of the nova-compute service, we start to find communication problems that point us specifically to RabbitMQ. Now we have a place to go. We started off with nondescript errors, we're narrowing it down, and now we've got a bit more of a direction to follow. You'll see here in the error messages that we've got AMQP errors saying the server is unreachable. I've highlighted the IP addresses, because when you're looking at these long lines it's easy to overlook things in the log files. What you'll notice is that we've stopped communicating with one of the messaging servers, we're reconnecting to another messaging server, and it shows that we're connecting to that one. And again, we get a lot of repetitive entries of this kind in the log files once we've narrowed it down.
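If you want to try the same approach yourself, a minimal sketch of the strace step looks like this. It assumes a running nova-compute process and the standard strace and pgrep tools; the exact process name and the error strings you see will vary by release.

    # attach to the running nova-compute process and trace only network syscalls;
    # strace writes to stderr, so redirect it before filtering for connection errors
    strace -f -tt -e trace=network -p $(pgrep -of nova-compute) 2>&1 \
        | grep -E 'connect|ECONNREFUSED|ETIMEDOUT'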
So where we end up here, and I purposefully left out the details because Sadique has a nice technical dive into a second issue that I think will be more valuable to you, but the process is always: this piece of information leads to another, which leads to another, and we finally narrow it down. Always very time consuming. The strace gives us more logs, the logs give us hits on some existing bug reports, the bug reports give us links to some upstream communications, and so on. But what we finally get to is that we've run out of file descriptors.

And why did this happen? Why are we running out of file descriptors? Well, we used a default. And the default that we brought into the product, which came from upstream in this case, is 1,024 file descriptors. That might have been a reasonable enough starting point many years ago, but we know that at the kind of scale we need to run OpenStack at, the scale our customers are using it at, we're going to exhaust 1,024 file descriptors for the messaging service very, very quickly. So we know this is absolutely a limit that's going to get to us. This is actually an upstream post on GitHub; hopefully I don't get too beat up for bringing this out, but Michael is, I believe, one of the maintainers of RabbitMQ. This is from a Puppet Labs GitHub conversation about a commit, and it's nice to dig into it because you actually get his description of what kind of problems you're going to see if you do run out of file descriptors and why this is an issue. If you look through the entire conversation, the recommendation from that thread ends up putting it at 16,000, which is a good starting point; it's a whole lot better than the default of 1,024. What we have done on the Red Hat side is consider, again, the kind of scale our customers want to get to, and our recommendation is actually to go quite a bit beyond that 16,000. We really want to push it to the max, and we recommend a 64K setting. The memory overhead of setting the number of file descriptors that high is really quite low, given the scale of the systems we typically see being used. So it's worth it to go ahead, take that memory allocation, set the file descriptors higher, and that way you're proactively getting ahead of this problem.

The good thing is that we've learned from this process. We know that these RabbitMQ file descriptors need to be higher, and you're going to find in the current releases of our product that the default is set to the 64K value. So you're not likely to run into this in the product, or in the upstream release, in newer releases. But if you have older releases running, and maybe you're starting to push the scale on those a little bit, it's possible that you could run into this issue. Really, though, this is more of a description of the kinds of things you could run into across the board. It may be RabbitMQ, it may be HAProxy, it may be some other service where you see these kinds of issues, and if you know ahead of time where to change these values, you're going to be in a much better position. One note on this, too: if you are actually looking at this specific issue, or any of these traditional ulimit issues, be careful about where you make this change, based on your implementation.
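As a rough illustration, on a systemd-managed host one common way to raise that limit looks like the following. Treat it as a sketch only; as the next section notes, the right place for this change depends on how your deployment manages RabbitMQ, so check the KCS article for your specific case.

    # /etc/systemd/system/rabbitmq-server.service.d/limits.conf (one possible location)
    [Service]
    LimitNOFILE=65536

    # apply it and confirm what the broker actually sees
    systemctl daemon-reload
    systemctl restart rabbitmq-server
    rabbitmqctl status | grep -A 3 file_descriptors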
If you take a look at our KCS solution, which I'll talk a bit about in just a moment, you'll see that we have some careful descriptions in there of three different ways you might make this change, depending on your deployment. So, real quick, here's what we do at Red Hat when we discover information like this. We have a process called Knowledge-Centered Support. What we're really looking to do is capture all of the knowledge we gain during the troubleshooting process every time we're working with a customer. While we are in the process of working with a customer, we are putting that knowledge into a new solution. Those solutions are referenced internally while they're still on the rough side and we're still figuring out what they are, and we try as quickly as possible to get them published to our customer portal. So you as a customer can go in, log into the portal, do your own keyword search, and if you search for RabbitMQ and file descriptors, you're going to find this article posted with the settings we just talked about. The same thing happens for every issue you run into. We're always trying to build this knowledge base, and we're always trying to leverage it; it's part of our support process to touch it first before we do anything else. We build on the knowledge that we have, and that way we get to the solution much, much quicker instead of going through this long troubleshooting process every single time. Again, the link to this particular solution is down there at the bottom, if that's something you're interested in taking a look at. And with that, we're going to end that issue, and I'm going to pass it on to Sadique, who's going to give you a nice technical dive into a separate but similar issue.

Thank you, Dustin. So Dustin clearly explained that the default value for RabbitMQ file descriptors is very low and cannot meet the messaging requirements of the various OpenStack services. He also explained that the symptoms can manifest in too many different ways, which makes troubleshooting really difficult. Good, we identified the problem, and we also had a solution: increase the RabbitMQ file descriptors. We increased that for the customer, and everyone involved started sleeping like this. But unfortunately, that didn't last very long. After some days, we had a different issue from a different customer, and the symptoms started to manifest exactly like the first problem, the one Dustin explained: random failures while spawning a large number of instances in batches. You create 100 volumes, and one or two of them end up in a failed state; or you run any task 100 or 200 times, and one or two of them end up failing. Again, initially we thought this should be very easy to troubleshoot, because we already knew about the problem Dustin explained, the low number of RabbitMQ file descriptors, and we thought this customer might be hitting a similar issue. But unfortunately, he had already used the knowledge base article, identified that problem before coming to Red Hat, and solved it on his own. When we checked the customer environment, we saw that the RabbitMQ file descriptor limit was already raised to the 64K value. He had already done that work before coming to Red Hat.
And again, our sleeping mode changed back to this, because we had another bottleneck to troubleshoot. Now, if you're going to spawn 200 or 300 instances in a batch and five or six instances end up failed, it's very difficult to chase each of those instances: figure out which instance failed, get the instance ID, and go through the various logs. We don't know whether the problem is with RabbitMQ. We don't know whether the problem is with Keystone. We don't know whether the problem is attaching a Cinder volume to the instance. So it's very, very difficult to chase those instances. We chased the logs to see what happened, and we didn't get any clue toward solving it. Then we tried to understand the different ways this problem could manifest, and we were luckily able to reproduce it with a script that runs nova list or cinder list or whatever command. When we ran that in a script 200 or 300 times, we got an error message: unable to establish a connection to the Keystone admin API. So we had a clear error message to concentrate on, and we started asking Keystone why it was sending this error message. And unfortunately, Keystone told us, I'm innocent here; I categorically deny all allegations you are making against me. But then why are you sending this error message? It had no answer. So we tried to debug Keystone to see where this error message was coming from. We enabled debug logging, and we went through the Keystone logs many times. We also inspected the number of connections coming into Keystone, and incidentally we found that a lot of Keystone connections were stuck in the CLOSE_WAIT state. We tried to concentrate on that to understand why so many connections were stuck in CLOSE_WAIT, but we didn't get any clue. Still, in the end, we decided it was not good to concentrate only on Keystone, because Keystone may not be the culprit here.

So we took a step back and tried to understand how the environment works, how the connections work. On a default deployment using the Red Hat OpenStack Platform director, you get three controller nodes: controller one, controller two, and controller three. Most of the OpenStack services run active-active on these three nodes. In front of each and every API service there is an HAProxy load balancer, and the three nodes also run the MariaDB Galera database service. Now suppose an end user runs nova list and gets an error message from Keystone. The first thing that happens is that the user contacts Keystone and gets a token. Once the user has a token, he tries to connect to the Nova API. When he does that, he first hits the HAProxy VIP; assume that all three controllers are backend nodes for HAProxy. HAProxy runs on all the nodes, but the VIP is active on only one of them. The VIPs are managed by Pacemaker, and Pacemaker will move a VIP around if one of the nodes goes down. So assume controller node two has the VIP; then the Nova API connection first reaches controller node two, and controller node two chooses one of the backend Nova API services. Here, let's assume it picks controller one. The Nova API service then makes a connection to Keystone to do the authentication, and again, the connection to Keystone goes through the HAProxy node.
And let's assume the HAProxy node selected this time was controller two. Then Keystone needs to contact the database service. The MariaDB Galera service is also running behind HAProxy here, so each and every OpenStack service has the VIP configured as its database server IP address. Controller two then contacts the database service via HAProxy, and HAProxy chooses one of the database nodes to send that database request to. So that's how it works, and we were trying to understand where, along this path, the packet gets lost and the error happens.

We made a lot of educated guesses. During troubleshooting, if multiple components are involved, you have to make a lot of guesses, and you have to weigh them and prioritize which one is most likely to be the culprit. So there are a lot of options. The first one could be intermittent packet loss within the environment, something network related; if packets are intermittently lost, it could cause problems like this. So we did a lot of network debugging to understand whether any network problems were there, and we verified that they weren't. The second option is whether it's a problem between the end user and Nova. Of course not, because the error message comes from Keystone, so it's not a failure between the end user and Nova. Then it could be a problem with the communication between Nova and Keystone; that is one possibility. Or it could be a problem between Keystone and the database. So at this point we had two options to concentrate our troubleshooting on. We enabled HAProxy debug logging to understand whether there was any problem in either of those two communication paths. And when we did, we saw a lot of connections between Keystone and the MariaDB Galera database being terminated with a cD status. cD means a client timeout: within HAProxy we configure a client timeout, and if HAProxy cannot serve the request within that timeout, HAProxy closes the client connection. We didn't see any HAProxy error messages for the communication between Nova and Keystone; all the messages we saw were between Keystone and the database. So at that point we tried to debug why so many client termination messages were showing up for the database service. We then enabled the HAProxy HTTP statistics page, where we get a nice GUI showing the number of connections being served by each and every proxy within HAProxy. And finally, taking a close look at the MariaDB Galera proxy, we found that it had hit its maximum connection limit: 2,000. This was a direction we could never have guessed. So where does this 2,000 limit come from? We had to investigate in that direction. On the right side, with the green background, is the proxy configuration for MariaDB Galera; there is no maximum connection of 2,000 defined there. Below that is the global section for HAProxy, where the maximum connection is defined as 40,000, so it's not 2,000 either. Then there is also a defaults section, which sets defaults for each proxy, and there too no maximum connection of 2,000 is defined.
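To make the slide concrete, here is a hedged reconstruction of the kind of haproxy.cfg being described. It is not the customer's actual file; the addresses and server names are made up for illustration.

    global
        maxconn 40000             # cap for all proxies combined, not per proxy

    defaults
        # no maxconn here, so each individual proxy silently falls back to
        # HAProxy's built-in per-proxy default of 2000

    listen mysql
        bind 172.16.0.10:3306
        # no maxconn here either, so this proxy is capped at 2000 connections
        server controller-0 172.16.0.11:3306 check
        server controller-1 172.16.0.12:3306 check backup
        server controller-2 172.16.0.13:3306 check backup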
So let me just explain: the maximum connection setting in the global section is the maximum number of connections all the proxies together can handle; it's not about a single proxy. A single proxy, though, has a limit of 2,000. Where this limit comes into the picture is that there is an invisible default of 2,000 for the per-proxy maximum. HAProxy basically assumes that this is the maximum number of connections each proxy is expected to handle, and this default limit is hard-coded. Either you need to set that value in the defaults section, or you need to set it in the per-proxy section. This is not widely known and not clearly documented anywhere, and we had to go into the code to understand where this limit comes from. OK, so why didn't we get a good error message from Keystone saying that the connection to the database was lost? The problem is basically that HAProxy queues the requests, and sometimes it doesn't close those connections in time. I'm not going to go very deep into that. But we thought we had found the bottleneck, HAProxy's invisible default maximum connection limit, and that we could be happy and go get some sleep after this. But then we got a different question from the customer: if 2,000 is not the right limit, what is the right limit for me? Again, we were already sleepless; let's go and work on this tomorrow. For the time being, make it 10,000 or 20,000 or whatever you want.

OK, so it took us more time to understand what the right maximum connection limit for MariaDB Galera should be in an OpenStack environment. This is something you need to take a close look at and implement in your own OpenStack environment. The number of database connections depends on various factors. How many workers are spawned by each API service? That clearly depends on the workers, api_workers, or osapi_compute_workers configuration you make during the deployment. And the recommendation you get everywhere is that the number of CPUs, the number of cores you have on your system, is the optimum value for the number of workers. So let's step back. This is what you are going to see, with the blue background, in each and every configuration file: the number of workers for the OpenStack API services, where the default will be the number of CPUs available. Let's go to the next slide and take a closer look. If you have a 24-core system, how many database connections are going to be opened by all of those workers? Nova actually has three different worker settings: the OS API compute workers, the Nova API workers, and the metadata workers. You configure all of these to be equal to the number of CPU cores, so that's 24 times 3. Just like that, you configure the Neutron workers, the Nova conductor at 24 times 1, and Keystone at 24 times 2, because there are Keystone admin API workers and Keystone public API workers. That means on a single node you are going to have 264 workers on a typical OpenStack deployment with just the core services. And each API worker opens five long-lived connections to the database. It opens the connections and keeps them open forever; there is no timeout, it's a long-lived connection that stays open the whole time. So on a single node, 264 workers times five database connections is 1,320. And we have three controllers here; all of them are going to spawn this many workers.
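As a rough sanity check of that per-node arithmetic (264 workers times five connections is 1,320 long-lived connections), you can count what a controller is actually holding open toward the database. This is only a sketch and assumes MySQL is listening on its standard port 3306; adjust for your deployment.

    # established TCP connections from this controller toward the database port
    # (subtract one for the header line that ss prints)
    ss -tn state established '( dport = :3306 )' | wc -l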
All the workers on all the nodes are going to open those five long-lived database connections each, which means this 24-core, three-controller deployment is going to open 3,960 connections. And as you see on the right side, it's not just long-lived connections; some of the services also make short-lived connections. We haven't really been able to quantify the short-lived connections, how many are open and how they are closed, but to be on the safe side we say you add 1,024 to account for short-lived connections and other services you may add in the future, like Trove, Sahara, or anything like that. Now, in this specific environment, the customer had 96-core systems. So it's not really surprising that he hit this limit very early, because he had obviously exhausted all the connections; HAProxy simply didn't allow enough connections to the database. And it's not just HAProxy that you need to configure with a maximum connection limit. There is also another variable within the database that everyone is familiar with, max_connections. I hope you all know that this also needs to be changed based on the value you calculated. So if you want to get better sleep, make the RabbitMQ file descriptor limit very high, and also calculate, depending on your environment, how many CPU cores there are and how many workers are spawned by each and every OpenStack API service, and use that calculation to update the maximum number of MariaDB Galera connections. Then you sleep like a king, or like a baby, like this. And again, we went through the same stage: we created a Knowledge-Centered Support article for this, to help each and every customer who may hit it in the future. Knowledge-Centered Support is this kind of reactive support: if customers hit the problem, they find the article and they solve it. Dustin is going to speak about the next evolution, how we can turn this reactive support into proactive support.

So we see that sometimes the solutions are relatively simple and straightforward, even if the problem is elusive, as in the first example. It was a little difficult to troubleshoot where the problem was, but once you find that it's one simple value, you take a fairly liberal approach to setting it, set it high, and you're going to be in good shape. Sometimes you have other symptoms, and they're going to require a lot more troubleshooting. The symptoms may look similar, but you've got a long way to go to actually find the new, real problem. So again, this Knowledge-Centered Support is really important to our processes at Red Hat, because all of these new things that Sadique discovered in troubleshooting this problem that looked similar end up becoming yet another article that we can publish, that you can find as a customer, and that we can reference as support engineers, to make sure we are not spending the cycles doing all this troubleshooting again. That has been extremely effective for us and very, very important to our support processes for a long time. The trouble is that it's reactive; it's necessarily reactive. You only know to go find these things if you run into the problem. Hopefully part of the lesson learned here is never trust the defaults. Everybody should consider the scale of their environment and the kind of workloads they're going to be running, and consider what all of these tunables are. But there are a lot of them.
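For reference, here is roughly where the two limits from Sadique's example live. The file paths and numbers are illustrative only, based on the roughly 5,000 connections worked out above; they are not a one-size-fits-all recommendation.

    # /etc/haproxy/haproxy.cfg -- raise the per-proxy limit for the database proxy
    listen mysql
        maxconn 5000              # overrides the hidden per-proxy default of 2000
        # rest of the listen section unchanged

    # /etc/my.cnf.d/galera.cnf (or wherever your MariaDB configuration lives)
    [mysqld]
    max_connections = 5120        # must also exceed the calculated demand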
There are a lot of places you can have these bottlenecks and these failures. So one of the things we are moving towards, and this is kind of our next generation of improving our support model, is taking these issues away from the reactive and towards the proactive, enabling you as the customer, as the user, to find these problems before they occur. This is just a quick plug for the project we're working on. It's in beta right now, and it's called Red Hat Insights. This is going to be a tool where you have an agent that helps collect a lot of this information, compares it against our ever-growing knowledge base, and helps find these issues before you ever run into them. It's going to be a really great tool for taking a look at things before you do deployments, but also along the way, because we're always discovering new things. As our knowledge base grows, that feeds into the Insights system, and the Insights system continues to grow in knowledge, so you may be able to get alerts and reports about new things we discover that may help you prevent running into a problem in the future. So I highly encourage you to check it out. We're going to keep developing this; lots and lots of new plugins are coming, along with more development across our product lines, and there's good information available at the Insights website. Again, our contact information; feel free to reach out. If anybody has any questions, you're welcome to come up to a microphone and speak. If not, you can certainly contact us anytime. We appreciate everybody joining today.

Hi, how are you guys doing? Great presentation. Quick question: you talked about a knowledge base earlier in your presentation. Can you share what you're actually using as the underlying platform? Is it some open source knowledge base platform, or something you've had for a long time at Red Hat that you just implemented?

It's gone through a couple of iterations. Is my mic still on? Yeah, it's gone through a couple of iterations. Do you know what the current backing software is? I believe it is an open source project. Yeah, it's an open source project, but I don't remember exactly which one. Yeah, I can't remember; it changed a couple of years ago from one project to another. It's all tied into our internal support systems. So when you open a case, we get a lot of backend information about your account and history, flags, and internal communications, but it's all integrated between our view of the support system and your view of the portal. So we're trying to push the knowledge we gain from the support case into the KCS solution and get it published to the portal as quickly as possible.

I'm just curious if there is a quick open source knowledge base solution that people can use.

Send me a message and I'll get the answer to you; I don't know off the top of my head what the tool is. Any other questions? All right, thanks again, everybody. Appreciate it.