Good morning, everyone. I'm going to talk about a couple of example cases we have had when it comes to troubleshooting: some real-world issues, how we identified them, and how we fixed them.

To set the stage: we were formerly known as City Network and ran a service called City Cloud. Nowadays we're known as Cleura. We run a number of multi-region public clouds, as well as private clouds and a number of compliant clouds for users with specific regulatory needs, such as finance. We have been running OpenStack since Icehouse, and nowadays we're mostly on Xena in our environments. We use OpenStack-Ansible for deployment and ML2/OVS for networking, and I work with architecture, deployment, and maintenance of our OpenStack setups.

So, how do we get reports? How do we know there is an issue to take a look at? First, classic monitoring: we might get an alert, and someone might get woken up in the middle of the night and have to start troubleshooting because our monitoring systems found something. There are also a number of tests; if we are doing changes, for example, we can see that something happens during an upgrade and act on that. There are also logs, which can be monitored and from which alerts can be generated. And finally, user reports. We prefer to find things before the users do, but in some cases we get reports from customers that something is behaving strangely or not working at all.

So, the first issue. The initial report is that a user is trying to access a long Barbican secret from Heat, and that leads to a Heat error and an undeletable stack. The user has a Heat template that injects a Barbican secret and writes it to a file, basically. Plain and simple. The client-side error the user gets from Heat is "Data too long for column 'status_reason'". So it looks like Heat is trying to shove something into MySQL that is too long.

Then we start taking a look at the server-side logs, and there we can actually see two different errors. The first one, "Data too long for column 'status_reason'", is the same thing we saw on the client side, and it points to Heat trying to push something into MySQL that is too large. But we also see another error that is actually from Nova: a bad request, a 400 error, saying that something is too long for the Nova API.

Let's take a look at the limits. The status_reason column is a TEXT, and the MySQL TEXT type has a limit of 64 KB. The documentation also says that the provided user data, which Nova was complaining about being too long, should not be base64 encoded, because that is done automatically, and the base64-encoded result has a limit of 64 KB. Okay, so we're probably on to something here.

So what's actually going on? A request is sent to Heat to spin up the stack. Heat attempts to create a Nova instance. Nova fails to create the instance because the user data is too large: more than 64 KB after base64 encoding. At this point Nova returns an error to Heat, which is pretty much what's seen there. And if you look at the value, it contains the full base64-encoded data, and then the full base64-encoded data again within quotes.
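To make the size arithmetic concrete, here is a small illustration (not from the talk, and the exact error format Nova returns differs): base64 encoding grows the payload by roughly a third, and an error body that quotes the encoded payload twice easily overshoots a 64 KB TEXT column.

```python
import base64
import os

NOVA_USER_DATA_LIMIT = 65535  # bytes, after base64 encoding

# ~60 KB of raw user data already breaks the limit once encoded,
# since base64 expands data by a factor of about 4/3.
user_data = os.urandom(60 * 1024)
encoded = base64.b64encode(user_data)
print(len(encoded), len(encoded) > NOVA_USER_DATA_LIMIT)  # 81920 True -> Nova returns HTTP 400

# Hypothetical shape of the failure: the error text embeds the encoded payload
# twice, so what Heat tries to store in status_reason is more than double the
# 64 KB the TEXT column can hold.
error_message = f"User data too large: {encoded!r} ... '{encoded.decode()}'"
print(len(error_message) > 65535)  # True -> "Data too long for column 'status_reason'"
```

A check like this on the client side would have caught the oversized user data before Heat ever called Nova.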
So Heat, in turn, attempts to shove this error, which is now more than 130,000 characters, into the status_reason column in the database, which as we saw earlier has a 64 KB size limit. That obviously doesn't work, MySQL complains about it, and the stack ends up in a broken state.

So what we actually had there were two separate issues. The user data provided by the user is too large, which is an error in itself. But while trying to report that error back, the status_reason field in the database is too small to hold the entire error message from Nova, and Heat doesn't truncate the error before inserting it into the database. At this point we actually found an older bug that had been filed against Heat, where the status_reason field had already been increased in size once because it was too small, but it is still not large enough for every case.

The solution in this case was simply that the user shaved off about 20 KB of comments from the files in the user data, so it came in below the 64 KB limit. We also reported the Heat bug that the status_reason field is too small and should probably either be increased in size, or the error should be truncated before Heat tries to store it.

So, another one. This time the initial report came after an upgrade from Victoria to Xena, and it was seen from multiple directions: we saw it in our monitoring, we got customer reports, and since this happened during the upgrade, or right after it, people also saw it in the logs in real time. We were seeing flapping agents, agents going up and down, both Neutron agents and nova-compute services. We saw intermittent API call failures that looked quite random; sometimes an API call would just time out or return an error. We also saw Octavia load balancers going into ERROR state for no apparent reason, and intermittent issues spinning up new instances: you could send the API call, but the instance would end up in ERROR state instead of being spun up as it should.

So we start looking at some logs, and in both the Octavia logs and the Nova logs we start seeing database connection errors, often lost connections or "Lock wait timeout exceeded", which didn't really make sense at the time, but everything seemed to point to MySQL. So we take a look at the MySQL logs, and we see aborted connections, lots of them, from different service users, with the error "Got an error reading communication packets". The server has basically lost the connection to the client and, after a while, times out and emits this error.

So we focus a lot on the database at this point. We call in our MySQL hotshots to look into the database and its health, and we spend quite a lot of time on analysis and performance tuning. We see quite a lot of long-running queries that are waiting for locks, as the previous error message also indicated. We also see that those locks are generally held for the default period of 50 seconds, after which the lock is released and whatever was going on can continue, and we see some connections that actually time out rather than wait long enough for the locks. We try to fail over the database to another node, as this is a Galera cluster with three nodes available, to see if it's an issue with a specific node. That doesn't help at all. Nothing changes.
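For reference, the kind of lock-wait analysis described above can be sketched in a few lines. This assumes MySQL or MariaDB with InnoDB and the PyMySQL client; it is not our actual tooling:

```python
import pymysql

# List InnoDB transactions currently waiting for locks, to spot the pattern of
# waits approaching the default 50 s innodb_lock_wait_timeout.
conn = pymysql.connect(host="127.0.0.1", user="root", password="secret",
                       database="information_schema")
with conn.cursor() as cur:
    cur.execute("""
        SELECT trx_mysql_thread_id,
               TIMESTAMPDIFF(SECOND, trx_wait_started, NOW()) AS waited_s,
               trx_query
        FROM innodb_trx
        WHERE trx_state = 'LOCK WAIT'
        ORDER BY trx_wait_started
    """)
    for thread_id, waited_s, query in cur.fetchall():
        print(f"thread {thread_id}: waiting {waited_s}s on {query!r}")
conn.close()
```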
We start looking into HAProxy timeouts and try increasing those to see if that helps prevent the errors in any way. It doesn't.

Right around this time we also notice an error in the logs of the non-master nodes. We had been focusing on the keepalived master node, but on the non-master nodes we intermittently see that a node enters MASTER state, immediately receives a VRRP advert from the master, sees that it shouldn't be master, and re-enters BACKUP state. So it's going from backup to master and then back to backup. This indicates that we're missing some VRRP packets from the master, that the backup nodes don't see them as they should. We come up with two hypotheses: either the packets are lost on the network on the way, or they are not being sent in time by the keepalived process on the master.

So, at this point, we fire up some tcpdump captures. This capture is from the master node, so we know that what we see is what the master keepalived is actually sending; it can't have been lost over the network or the physical infrastructure. And as you can see on the timestamps, we intermittently see multi-second delays between VRRP packets from the master node. It actually doesn't send them in time.

At this point we devise a workaround: we stop keepalived on the backup nodes and only keep it running on the master node, so the master node will always be master and no other node will try to take over master status. This is also to test the hypothesis that this is the problem behind all the other issues. And indeed, when we stop keepalived on the backup nodes, everything seems to work fine: we get rid of the excessive locks in MySQL, we get rid of the lost connections, and pretty much everything works.

So now we focus on keepalived. First of all, what the original issue meant was that when a backup node transitioned to master, some of the packets from ongoing connections suddenly went to the backup node instead. The backup node then dropped the IP again, and you would get connection resets and so on as it flapped. What was interesting was that the keepalived configuration had not been changed during the upgrade, nothing there was touched; what had changed was the keepalived version, which was upgraded as part of the update. We tried to reproduce the issue in a test environment with the same configuration, but that wasn't possible, we just couldn't reproduce it. So we tried to figure out what was different.

While looking at the logs we also see another thing: VRRP virtual router ID collisions, but with an incorrect password, which means keepalived should just discard those adverts and not care about them, since the VRRP password is different. For historical reasons we have a layer 2 connection between the management networks of two of our regions, and this region was one of them. This had been in place for a long time, so it wasn't anything new, and it had never caused any issues before. The same VRRP virtual router ID was used in both regions, but the password was different.
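Incidentally, the gap check we did with tcpdump can also be scripted. Here is a rough Scapy sketch, not the exact capture we ran, and the interface name and threshold are assumptions:

```python
from scapy.all import sniff

ADVERT_INTERVAL = 1.0  # keepalived's default advert_int in seconds
last_seen = None

def check_gap(pkt):
    """Print a warning whenever two VRRP adverts are unusually far apart."""
    global last_seen
    now = float(pkt.time)
    if last_seen is not None and now - last_seen > 1.5 * ADVERT_INTERVAL:
        print(f"gap of {now - last_seen:.1f}s between VRRP adverts")
    last_seen = now

# VRRP is IP protocol 112; br-mgmt is a placeholder for the management interface.
sniff(iface="br-mgmt", filter="ip proto 112", prn=check_gap, store=False)
```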
So, in this case, we tried changing the virtual router ID in this environment and restarting keepalived on the backups, to see whether that was the issue or not. The issue didn't reappear, so that was indeed it. It seems the keepalived upgrade changed the behavior somewhat, which is probably a bug in keepalived that we intend to report, but in this case the workaround of changing the VRRP virtual router ID was the easiest way out.

So, the cause: the virtual router ID was not unique between installations, and the upgrade of keepalived changed the behavior so that a different password was obviously no longer enough. The solution is simply to ensure that we have unique virtual router IDs in all environments that are connected through layer 2 or in any other way, which we should have had anyway, but as it hadn't caused any issues before, it wasn't set up that way.

So, some general resources for troubleshooting. We often look at monitoring; that's the classic style of monitoring, different tests that are run, and we get alerts if something is amiss or something unexpected happens. That's one of the most common ways both to get the alert that something is wrong and, during troubleshooting, to see which parts of our environments are not working as expected. We also use logs extensively, both analyzing them automatically to spot anomalies and using them as a tool during the troubleshooting phase, and that's where we most often find whatever is wrong. There is also a bunch of metrics we can use, everything from performance to the number of errors in different environments, to see that something is going on that is not expected or normal.

Other really useful tools for troubleshooting are, of course, the bug databases of the different projects we use, whether OpenStack projects, operating systems, or anything else. The mailing lists are also a good resource, both for searching the archives to see if someone has seen an issue before, and for asking if it's something that hasn't been seen before and you need help figuring out what's going on. People are usually very helpful. The IRC channels of the different projects can also really be recommended; people there are very helpful too, and you can usually get a lot of help even if you're not sure it's actually a bug, both to get another view on the problem and to get a recommendation on whether it's something you should report as a bug, or whether it's intended behavior or more likely a configuration error.

Yes, that's about it. I would be happy to take any questions from you, and there is a mic over there where you can ask them. Okay, thank you very much.