Hello, everyone. Thanks for joining me today. Did everyone get enough coffee to go through some slides today? All right, we're good. So today we're going to talk about how we auto-remediate, or auto-heal, OpenStack clouds. My name is Mikita Gumenko. I'm an infrastructure engineer at Mirantis, part of the professional services department, and I'm focused on operations and site reliability. Since I'm part of professional services, my presentation is based on our customer engagement with the Symantec cloud platform engineering team.

So let's review our agenda. We're going to talk about the OpenStack environment at Symantec and how we monitor it. We're going to discuss how we actually got into the auto-remediation approach. Of course, we'll talk about what OpenStack use cases we came up with. And then I'll show a little screencast of one of the auto-remediation actions we implemented in our environment.

To give you some context, here's what the Symantec environment looks like. Symantec has a hybrid cloud: basically OpenStack plus AWS. The OpenStack infrastructure consists of four regions across the globe, hundreds of racks, and thousands of hypervisors. The AWS infrastructure is also rapidly growing, already into the tens of thousands. But today, let's focus on OpenStack.

Monitoring is an essential part of every environment, so we have a bunch of monitoring tools, and we're struggling with wiring them all together. Our main monitoring tool is Zabbix. We also have things like Prometheus and some legacy Nagios installations. We use PagerDuty for pages. Volta is our logging and monitoring platform; under the hood it's just Logstash, Kafka, Elasticsearch, all the cool stuff, Grafana, Kibana. Synthetic transactions are functional tests against our environments: basically, boot up a VM, create a floating IP, and see how the whole cloud is doing. And the health dashboards are just a structured view of how the cloud is doing; they show the main metrics of how the cloud is operating.

Since our main tool is Zabbix, we'll discuss it a little bit more. We have a few thousand monitored hosts in it and hundreds of thousands of items. We monitor things like processes, services, resources, APIs, performance and utilization metrics. Alerts go out to email, Slack, and Jira. As I said, we have an integration with PagerDuty for pages and on-call shift rotation. And recently we added StackStorm to run our auto-remediation, or self-healing, workflows.

All right, the boring part is over. Let's discuss how we actually got into auto-remediation. There were only a few of us operating this relatively big OpenStack cloud. In addition to our feature deliverables, like adding new data centers, adding new services, and doing upgrades, we were also on call all the time, which basically meant reacting to alerts, fixing outages, creating outage tickets, sending proper notifications, and so on. At some point we discovered that these activities were hurting our productivity in other areas. So we decided to identify operational patterns in our day-to-day operations so we could automate around them.

Here's how incident response looked before. Based on alert severity, either a PagerDuty notification is triggered, PagerDuty calls you, and you need to fix the outage right away; or, for other types of issues, just an email or a ticket is created so you can work on them during business hours.
And all of this required a lot of time, so we decided to try auto-remediation. What is auto-remediation? Auto-remediation is an approach to automation that responds to events with actions that prevent, help fix, or fix issues. The easiest example of auto-remediation is when one of the services fills up the disk with logs. Why do we need to wait, or wake up an engineer, to fix that, if a simple script can wipe the logs and let the engineer know in the morning what actually needs fixing, like some log rotation or whatever? Furthermore, event-driven automation is an approach of executing action workflows based on operational events like alerts. So basically, Zabbix fires a low-disk-space alert, and based on that trigger an action is executed; and that action is basically "run the script to wipe the logs", if we're talking about our example.

All the big players also do auto-remediation. The famous Facebook FBAR is already saving 17,000 hours a day in operations, which is a massive number. LinkedIn has Nurse; the name speaks for itself, it basically nurses the environment. Netflix, Google, GitHub, PayPal, they all do the same with the same approach.

So here's how we updated our workflow. If the problem is well known, we just auto-remediate: things like cleaning up logs or service restarts. We don't need to wait for an engineer to fix it, especially during the night. If the problem is unknown, or auto-remediation did not fix it, we fall back to our default workflow, but the automation now collects as much debugging information as possible and supplies the on-call engineer with it, so he can resolve the outage faster. We call it assisted, or facilitated, troubleshooting.

When we think about what to auto-remediate, I like to refer to this picture. It's called the leftover principle of automation. The idea is that the tasks you cannot assign to the machine are left to humans to carry out. So if a problem is frequent and easy to automate, we should do it right away: things like service restarts and VM reboots are the low-hanging fruit. Things that happen rarely but are also easy to automate, we should also do at some point. For rare and difficult cases, I suggest creating assisted-troubleshooting tools, or some helper tools that will let you resolve outages quicker.

Auto-remediation has its benefits. It will decrease your MTTR, which stands for mean time to repair: you will resolve outages quicker, or even prevent them, using automation. You will improve your SLAs, of course. Engineer productivity will increase, because you won't be dealing with outages that much anymore. The number of notifications and pages will go down. And of course, sleep time will improve, and trust me, as the father of a little baby, that's the crucial one for me personally.

Some things to keep in mind. The bigger the scale you have, the bigger the profit you will gain. You might think: I have this one RabbitMQ cluster which breaks once a year, why would I ever want to auto-remediate it or automate its fixing? But Mirantis, for example, deploys OpenStack for hundreds of customers with hundreds of RabbitMQ clusters. So at a bigger scale, it pays off over time.

Be deliberate in detecting the exact cause of the outage. You want to run your automation scripts against the problem they're actually solving. You don't want to restart MySQL replication if just a single MySQL process died; you'd just restart that process, right?
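Just to make that wiring concrete, here is roughly what the low-disk-space example from earlier looks like as a StackStorm rule. The Zabbix trigger name and payload fields here are assumptions based on the community Zabbix pack, and the cleanup command is just an illustration, so treat this as a sketch rather than something to copy:

```yaml
---
# Sketch of a StackStorm rule: react to a Zabbix "low disk space" event
# by wiping rotated logs on the affected host.
# Assumptions: the Zabbix pack emits a zabbix.event_handler trigger with
# alert_subject and host fields; adjust to whatever your pack provides.
name: "zabbix_low_disk_cleanup"
pack: "autoheal"
description: "Wipe rotated logs when Zabbix reports low disk space"
enabled: true

trigger:
  type: "zabbix.event_handler"

criteria:
  trigger.alert_subject:
    type: "icontains"
    pattern: "low disk space"

action:
  ref: "core.remote"              # built-in remote shell runner
  parameters:
    hosts: "{{ trigger.host }}"
    cmd: "find /var/log -name '*.log.[0-9]*' -mtime +1 -delete"
```

The criteria block is where the "be deliberate" part lives: the rule only fires for the specific trigger it is meant to fix, and nothing else.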
Do not over-automate. Leave things like database corruption or network outages to humans; they're still better than the scripts, for the most part.

Set your maintenance windows properly. This one actually has a story behind it. Right after we implemented auto-remediation of the nova-compute services in our data centers, one of our engineers was doing an upgrade which required nova-compute to be stopped. And since he set the wrong maintenance period in Zabbix, Zabbix was still firing events. So when he stopped nova-compute across the whole data center, it all came back up after a couple of minutes. And after a couple more tries with the same outcome, he realized something was going wrong. So be careful not to run actions when they're not needed.

Of course, every alert matters. Every alert should be followed up with a ticket, an auto-remediation action, or a permanent fix. And as a reminder, a permanent fix is still better if you've got one.

While looking for an open-source solution for auto-remediation, we found StackStorm, which was being used by Netflix at the same time. StackStorm is an open-source, event-driven auto-remediation framework. We use it because it ties together our monitoring solutions on one hand and our infrastructure on the other. It's also a powerful framework for creating auto-remediation actions. Under the hood it has the Mistral workflow engine, from OpenStack. It contains predefined building blocks called packs. It has integrations with Jira, Slack, Nagios, Sensu, you name it. And it's easily extendable, so you can write your own plugins, sensors, or rules; it's just code and some YAML.

Here's the architecture of StackStorm. As you can see, it has sensors to listen for events, a rules engine to classify the events, and action runners, or workflows, to run actions based on the classified event. And as I said before, you can easily develop your own sensor plugins and action plugins; it's very easy.

So here's how it fits our infrastructure. All the monitoring systems we have at the top emit events, and StackStorm listens for them. Based on the event, it either triggers an action against the infrastructure, like restarting a service, rebooting a VM, or something else, or it triggers a notification of some sort to engineers, like escalations or just plain notifications.

Now, the OpenStack use cases. This auto-remediation approach can be applied to any system, and to OpenStack as well, since OpenStack has many moving parts, and let's admit it, stuff breaks: daemons die, logs fill up disks, a RabbitMQ cluster can fall apart, MySQL can get a split-brain due to a network outage. So here's the list of use cases we came up with that can be auto-remediated. The basic operations are very easy; they can actually be done in a few hours of coding. What I want to focus on is the more advanced operations.

Assisted troubleshooting, as I explained before, is when you get an alert and it's already supplied with some debugging information. So imagine you have a Keystone alert for an increased error-rate metric, and you want to debug what's going on. Together with the alert, you can have in the same email some log excerpts, information about load balancer status checks, and information about related services like Memcached. It can also have things like links to Grafana and Kibana dashboards for easier identification. And we can also hook up ChatOps to that, so we can trigger commands in chat which will also help us get information about what's going on in the environment. That's handy.
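Here is a rough sketch of what such an assisted-troubleshooting workflow can look like as a Mistral workflow in StackStorm. The host list, log paths, and the Slack pack action name are assumptions; a real workflow would also publish each task's output and include it in the notification and in the ticket:

```yaml
---
version: '2.0'

# Sketch of an assisted-troubleshooting workflow for a Keystone
# error-rate alert: gather evidence, then hand it to the on-call engineer.
autoheal.keystone_triage:
  type: direct
  input:
    - controller_hosts   # e.g. "ctl01,ctl02,ctl03" (assumed input)
    - slack_channel
  tasks:
    collect_keystone_errors:
      # Pull the most recent errors from the Keystone logs
      action: core.remote
      input:
        hosts: <% $.controller_hosts %>
        cmd: "grep ' ERROR ' /var/log/keystone/keystone.log | tail -n 100"
      on-success:
        - check_load_balancer
    check_load_balancer:
      # Check the haproxy backend state for the Keystone VIP
      action: core.remote
      input:
        hosts: <% $.controller_hosts %>
        cmd: "echo 'show stat' | socat stdio /var/run/haproxy/admin.sock | grep keystone"
      on-success:
        - notify_engineer
    notify_engineer:
      # Hand the evidence to the on-call engineer instead of paging them blind
      action: slack.post_message
      input:
        channel: <% $.slack_channel %>
        message: "Keystone error-rate alert: log excerpts and load balancer state collected, Grafana/Kibana links are in the ticket."
```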
RabbitMQ and MySQL are core components of every OpenStack installation, and they can also be auto-remediated. For example, recently we had a problem with a third-party component of OpenStack which was bombarding our RabbitMQ with messages that were never consumed. We were still debugging that software, so we didn't have a permanent fix, so we implemented an auto-remediation action to wipe all the messages from that queue above some limit. You can also easily rebuild RabbitMQ nodes or MySQL nodes if you need to, because almost everyone now uses a virtualized control plane, so replacing a node is not hard and can easily be done with automation. On the MySQL side, the split-brain recovery process for a Galera cluster, for example, consists of three steps, so why do we need to run it manually? We can do it automatically; I'll show a rough sketch of that in a moment. Recovering crashed replication is also very easy to auto-remediate.

Capacity planning. We were constantly struggling with the number of hypervisors we had for our customers to deploy VMs on. Here's how, from my point of view, good capacity planning looks: you have your capacity forecast, your hardware is ordered, after some time the hardware is received, you deploy it into the cluster, and users don't notice anything. It feels like the cloudy way of doing things. But in almost every company I've seen, and even Amazon has this issue, capacity planning looks like this: users start to report that they cannot boot VMs, engineers in a panic try to squeeze something out, and after a long, long wait they add some hypervisors to the capacity. And as a reminder, I'm talking about bare metal here.

Here's how we resolved this issue for our environment: we created auto-remediation actions to auto-scale our hypervisor pool. We built a capacity analyzer, which retrieves data from OpenStack and counts how many VMs of each flavor we can boot in the current environment, and it exports these capacity metrics to Zabbix. If Zabbix detects a low-capacity issue, it sends the event to StackStorm, and StackStorm goes to Ironic, which we use, and provisions a couple more hypervisors. Once that's done, it also uses a deployment tool to roll Nova out to them. The whole process takes maybe half an hour, so it's very convenient and easy.

As I said before, we use synthetic transactions, which are basically functional tests against our OpenStack cloud: VM provisioning, image creation, floating IPs. They check HTTP codes and they check things like ping and SSH sessions to a VM, so we can be sure our cloud is operating fine. And if a problem occurs, the synthetic transaction triggers an event to StackStorm, and StackStorm provides assisted troubleshooting to debug the issue: gathering logs, as I said, checking the load balancers, providing some graphs.

And the last case is hardware failure prediction. We collect a bunch of hardware health metrics, like disk health and SMART counters, ECC error detection, CPU monitoring, network interface monitoring. If we detect one of these problems, Zabbix triggers an event to StackStorm to do a VM evacuation, and what StackStorm does is just live migration from the bad hypervisor to a good one. So the bad hypervisor can be fixed, for example RAM replaced or a network link repaired, and we won't lose our VMs in case of a hardware failure.
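Going back to the Galera split-brain case for a second, those three steps sketch out roughly like this. Picking the most advanced node is done by a hypothetical custom action here, and the commands are the usual Galera ones, so verify them against your distribution before trusting automation with this:

```yaml
---
version: '2.0'

# Sketch of the three Galera split-brain recovery steps as a workflow.
autoheal.galera_split_brain_recovery:
  type: direct
  input:
    - galera_hosts   # e.g. "db01,db02,db03" (assumed input)
  tasks:
    stop_cluster:
      # Step 1: stop MySQL on every Galera node
      action: core.remote
      input:
        hosts: <% $.galera_hosts %>
        cmd: "systemctl stop mysql"
      on-success:
        - bootstrap_latest_node
    bootstrap_latest_node:
      # Steps 2 and 3: a hypothetical custom action that reads
      # /var/lib/mysql/grastate.dat on each node, picks the one with the
      # highest seqno, and bootstraps the cluster from it (galera_new_cluster)
      action: autoheal.galera_bootstrap_latest
      input:
        hosts: <% $.galera_hosts %>
      on-success:
        - start_remaining_nodes
    start_remaining_nodes:
      # The other nodes then rejoin via a normal service start (IST/SST);
      # on the already bootstrapped node this is a no-op
      action: core.remote
      input:
        hosts: <% $.galera_hosts %>
        cmd: "systemctl start mysql"
```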
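And to give you an idea of what the evacuation action itself looks like, here is a rough sketch of it as a Mistral workflow. The autoheal.live_migrate_vms action is hypothetical; ours is a custom script that migrates VMs off the host one at a time. The Jira, Slack, and email action names and parameters are from memory of the respective packs, so check them against your installation:

```yaml
---
version: '2.0'

# Sketch of the "evacuate VMs" workflow: ticket, notify, drain, report.
autoheal.evacuate_vms:
  type: direct
  input:
    - hypervisor
  tasks:
    create_ticket:
      # Open a Jira ticket so a human follows up in the morning
      action: jira.create_issue
      input:
        summary: "Hardware issue on <% $.hypervisor %>, VMs evacuated"
        type: "Task"
      on-success:
        - notify_slack
    notify_slack:
      action: slack.post_message
      input:
        channel: "#sre"
        message: "Evacuating VMs from <% $.hypervisor %> due to a hardware alert."
      on-success:
        - disable_host
    disable_host:
      # Take the host out of the Nova scheduling pool first, so no new
      # VMs land on it while we drain it (assumes the OpenStack CLI and
      # credentials are available on the StackStorm node)
      action: core.local
      input:
        cmd: "openstack compute service set --disable --disable-reason 'hardware issue' <% $.hypervisor %> nova-compute"
      on-success:
        - migrate_vms
    migrate_vms:
      # Hypothetical custom action: live-migrate VMs off the host one by
      # one, so we don't saturate the network link
      action: autoheal.live_migrate_vms
      input:
        host: <% $.hypervisor %>
      on-success:
        - notify_email
    notify_email:
      action: core.sendmail
      input:
        to: "sre-oncall@example.com"
        subject: "Auto-remediation: <% $.hypervisor %> drained"
        body: "See the Jira ticket for details."
```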
I actually recorded a screencast to show you this hardware-failure evacuation in action. Since VM migration takes a lot of time, I decided to just screencast it and fast-forward it. We picked one of our production hypervisors and decided to do some sabotage on it, basically by shutting down one of the interfaces in its bond.

So first, let's see if the hypervisor has any VMs. As you can see, it has four VMs now. Let's SSH to the host itself and check if they're actually running there; we'll just run virsh list to see if the VMs are there. As you can see, still four VMs. Now let's check Zabbix to see if it's all green. This is the host group that has this compute node in it, and it's all green now.

StackStorm, on the other hand, has a rule to react to the hardware issue triggered by Zabbix. Let's see what it looks like. As you can see here, we have an action called evacuate VMs, and the criteria for it is that the Zabbix trigger body name contains a hardware issue. Let's see what the action looks like: it's basically a Mistral workflow that runs some actions against the environment. It has things like create a ticket, run the VM evacuation itself, notify Slack, and notify email.

So let's do the bad thing to the server. The server has a bond with two interfaces in it; let's check that both of them are up. As you can see, both interfaces are up. And now let's bring one of them down. Let's check Zabbix to see if the problem appears there, and of course I fast-forwarded this part. Just a side note: we have a grace period for this type of event, so VMs won't be evacuated because of a simple network flap. As you can see, we have the issue here in Zabbix, and it has already sent events to StackStorm.

Let's see what's going on on the StackStorm side. We're watching the latest execution list, and as you can see, there is an action running right now called evacuate VMs. Let's see what's inside this action, basically what tasks are running. Some of the tasks have already finished, like the Slack notification and the Jira ticket creation, and the current task is evacuating the VMs. As I said, VM live migration takes a lot of time, five-plus minutes depending on VM size.

So let's see if we got the Slack notification. Here's Slack; I blurred some sensitive data. As you can see here, the problem was reported and some actions were already triggered. We removed the hypervisor from the Nova pool, so no new VMs will be spun up on it, just because we don't want to pile more VMs onto a failing hypervisor. And the other action that was initiated is the VM migration, and we do it one by one just to not overload the existing link.

Now let's see if the VMs are really migrating. We run nova list, and as you can see, one VM is in migrating status. And let's check that the nova service-list status is correct: as you can see, the hypervisor was disabled, and the disabled reason is hardware issues. Now we just need to wait until all of the VMs are migrated; we'll basically watch Slack until then. As I said, I fast-forwarded here. As you can see, one of the VMs was migrated, which triggered another migration, another VM was migrated, which triggered another migration.
So basically, by now all of the VMs should be migrated, because no new action was triggered. Let's see if the host still has those VMs: as you can see, the host is empty now. And on the hypervisor itself, it's also empty. So all the VMs from this bad hypervisor were migrated to a good one, and we can now either fix this server or decommission it if it has some serious hardware issue.

Now let's see what StackStorm did. As you can see in the right corner there, I already received an email from StackStorm with detailed information about what it actually did. But first let's see what StackStorm has in its execution list. As you can see, it executed four actions here: as I discussed, the Slack notification, the Jira ticket creation, the VM evacuation, which also succeeded, and the email notification. So let's look at our email. It's an automatically generated email from StackStorm; it has the Jira ticket link in it, it states the problem, it states the action that was initiated, in our case the VM migration, and the result of the VM migration script, basically saying that those VMs were migrated to another hypervisor. And the final step: let's check that the Jira ticket was actually created. As you can see, the ticket was created, it states the problem, and it was assigned to me, just because I don't want to bother our SRE team with this test case.

So that's pretty much it. We avoided problems with the VMs due to a hardware failure, and this whole process can run while you're sleeping at night. In the morning, you just get a ticket to check and fix the underlying problem, plus a notification of what the auto-remediation action did about it.

All right, that's what I've got for today. Do you have any questions? If you have any questions, there is a microphone; this is being recorded, so please use the microphone.

Some of this stuff sounded a bit dangerous, and you mentioned those over-automation aspects. As I hear from you, it's all about night: in the night you can use fairly aggressive actions, and then in the morning you handle it properly, let's say. But when it comes to RabbitMQ and MySQL in particular, have you actually published those automated algorithms, maybe on GitHub for StackStorm? Or can you describe in plain words, typically, whether you just remove the Mnesia DB, or whether you really have a solid MySQL troubleshooting suite that allows fairly aggressive actions to be done in an automated fashion?

OK, that's a great question. So humans have a very safety-minded kind of nature; they're always afraid of new things and of things that can damage them, basically. So it's hard to trust automation, especially at night when you're not controlling it. And I think it's more of a cultural change. But as I said previously, you first need to be deliberate about which automation and which outage type you have, so you can be sure that the script you're running is solving that particular problem. So, for example, for RabbitMQ, we don't have it published yet, but we have an auto-remediation action where we wipe the Mnesia partition in case we cannot start RabbitMQ, and we basically re-add that node to the cluster. And of course, for the first couple of months you need to keep an eye on this thing. StackStorm now has a feature that allows you to breakpoint the automation.
So basically, the event is triggered and it gives you the option: do you want me to run this? It's not done in a fully automated fashion; it asks you, do you want me to run this? And if you see that it can fix your problem and will do everything right, you just say OK and let it run. After some time you get used to it, and you're sure that your scripts are working and that you're identifying the problem properly, and that's how you fix it.

And the other thing was MySQL. For example, we had a couple of split-brain problems, and the Galera documentation has three steps for split-brain recovery: stop the MySQL cluster, identify the node with the latest data, and start the cluster back from that node, right? And I mean, it's very easy to automate. But once again, I want to mention that we need to be sure that this is the particular problem we're dealing with before we run this automation action against it.

I just have a question about the autoscaling. Usually people talk about autoscaling, and you explained how to add more nodes when some kind of alert is triggered. What about decreasing the number of nodes?

Yes, that's a nice question. We didn't implement it yet, but it's fairly easy. Since we have Ironic, we can just do a nova delete of the hosts, but we need to identify hosts that don't have any VMs, right? And we need to make sure that no VMs will be spun up on them during our scale-down process. So yeah, we were thinking about it, but we haven't implemented it yet. But I think it's fairly easy to architect those things.

Any other questions? All right, thanks. Thanks, everybody.