Hi, everyone. Thanks for coming. I know it's the end of the day and everybody is tired, so we'll just get started. I'd like to introduce ourselves: myself, Prasad, Ajay and Pradeep. We work at Cisco Systems, in the OpenStack Systems Group. What we'd like to share today are some of the testing strategies and experiences we have learned from testing and deploying highly available clouds.

So what is high availability? Before we go into the strategies, let's talk about what it means. It means continuous availability of your services in the face of component failures, and "continuous availability" here is from the end user's point of view. In simple terms, you can think of availability as the ratio of your mean time between failures — how long the system is actually available to the user — to the total time (written out as a formula below). The idea is to minimize service downtime, because downtime ultimately translates into lost revenue for the enterprise.

There are many challenges to achieving high availability; we have listed a few of them. The first is single points of failure. Anything that is a single point of failure will eventually make your system unavailable, so you have to assume that any component in your system will go down at some point. This can be a hardware failure — your controller node can go down, your compute node can go down — or a software failure, like your API services going down. The second one, and this is an important one, is upgrade downtime. Unless you plan for your upgrades, it's very hard to manage service disruption in those cases, so the upgrade path is crucial. Think of upgrading something like your HAProxy; in some cases even just making configuration changes is disruptive. The upgrade path you follow heavily determines how available your system is in those situations. The next challenge is long recovery times during disruptions. Say your RabbitMQ node goes down and your compute nodes are unavailable or can't communicate: how fast can RabbitMQ come back up so that the services are working again? And finally, OpenStack clouds are hard to deploy and hard to debug as it is, which makes it extremely complex to isolate issues. This is where you need really good logging — unless you have rsyslog or syslog shipping your logs somewhere central, it's very hard to debug. So those are some of the challenges you face when you want to achieve redundancy.

Now, the important thing to understand on this slide is that redundancy does not really mean high availability. You can have a spare, but that alone is not enough. You can run every component in your cloud in a redundant fashion, and that still does not mean your cloud is actually highly available. What matters, apart from redundancy, is how fast you can detect failures, and once you detect them, how fast you can isolate and resolve them. Consider, for instance, running HAProxy in a load-balanced fashion: what if your primary HAProxy goes down? How long does it take for Keepalived to migrate the VIP to the backup? If it takes a long time, you will see service disruptions in your control services. Same thing with RabbitMQ.
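Written out, the availability ratio mentioned above takes the standard form below, where MTTR (mean time to repair) stands in for the repair portion of the total time:

```latex
\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
```

At 99.999% availability ("five nines"), for example, that works out to roughly five minutes of downtime per year.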
Coming back to RabbitMQ: if your RabbitMQ node goes down, you might think you're covered because it's in a cluster, but how long before your clients realize their TCP connections are stale? What are your TCP timeouts? Those are some of the factors that affect how fast you can recover.

So, coming to OpenStack, what are the high availability goals? First, you want to make sure that your existing cloud resources — the VMs your customers have already hosted on the cloud — are not affected. We say 100% data plane availability, but sometimes that's not possible, and we'll see what the limitations are. You also want to make sure that disruptions to your control plane services are minimized; again, we'll see the challenges to that in the next section. And you want prompt recovery of services: if any service goes down, you want mechanisms in place that are monitoring it so you can bring it back up, and this is where running your services in a load-balanced fashion helps.

Coming to the next section, typical HA deployments — and these are deployments that we have seen and tested. The first is a very simple layout: all your control services run on a single controller node, and you achieve redundancy and high availability by running three of them. HAProxy here — which in some cases can be replaced by Pacemaker and Corosync — load-balances the services. RabbitMQ runs in clustered mode. Your database services sit behind a Galera cluster, which in turn sits behind the load balancer in case of failures. Then there are all your OpenStack services, and your compute nodes. One thing you don't see in this diagram, and it is really the key to 100% data plane availability, is that you want your VM traffic — your tenant traffic — separated from your API and management traffic, so that the management and API traffic does not become a bottleneck for your existing cloud.

The second deployment is where individual services run on separate nodes. This could be an overcloud where infrastructure services like RabbitMQ and MySQL run on separate nodes, and your HAProxies are on separate nodes as well. We have seen cases with multiple HAProxies — an HAProxy per service, or two HAProxies, one external and one internal — and the network node is separate. The key point between the two deployments is that every design decision you make will affect how your cloud behaves in an HA scenario. Just because you have lots of redundancy does not mean it's good.

So let's come to the bare-bones test strategy, which was our next slide. How do you really test HA for a cloud, or validate that your cloud is good enough? You don't just start bringing down services, because then all you would have done is bring down your cloud without really learning anything from it. You also want to make sure that whatever testing you do replicates real-life scenarios. So the first thing you want to do is stress your cloud.
You want to load your cloud with enough control plane traffic to reflect real users — VM creations, image creations — and you also want to run data plane traffic, with multiple VMs pushing L2 and L3 traffic between them. The next thing you want to do is controlled disruptions, and I would suggest you start with only one disruption, because the point of a disruption is first to be clear about your expectations. If you're bringing down your network node, for instance, or your RabbitMQ, first think about what failures you expect to see. And in an HA testing scenario you always want to monitor all your cloud resources, because you are looking out for unexpected failures: these are either oversights on your part — things you did not think of — or actual issues in your cloud that you have to go and fix. In terms of disruptions, you can do node-level disruptions, where you bring down the entire node; process disruptions, which can be graceful or graceless; and network disruptions. And all of this is meaningless if you're not monitoring, as I said. Constant monitoring is the key to HA testing: you want to keep looking out for every failure you see.

Coming to the core infrastructure components: these are RabbitMQ, your messaging service, and MariaDB/MySQL, your database service, which can be running behind Galera and HAProxy or Pacemaker depending on your deployment. As we said, the first thing you want to do is load the cloud, and a good system-level test for this is a concurrent VM boot test, because spawning a VM generates traffic to pretty much all of your components. The second thing is process- or node-level disruption, and there are some things to consider there. Take RabbitMQ: by default, when you set up a RabbitMQ cluster, the queues are not mirrored. That means the queues reside on the node on which they were first created, so if you bring down the RabbitMQ node holding the queues, your clients will not be able to communicate. That's one criterion you need to understand. So you move to mirrored queues — there's a small sketch of setting up such a policy at the end of this part — but mirrored queues only work when you have a correctly configured cluster with no network partitions. So when RabbitMQ goes down and comes back up, the key thing to monitor is that there are no network partitions after it rejoins.

Same thing with Galera and MySQL. When you deploy a MariaDB Galera cluster, the way you bootstrap the first node is slightly different from the way the other nodes join. When the node that was bootstrapped comes back up, we have seen cases where it gets confused and tries to come up as a new primary again instead of rejoining the cluster. So the essential thing is to go through the complete disruption cycle — bring the node down and bring it back up again — to make sure it rejoins the service without disrupting the others. And you want to monitor HAProxy before, during and after disruptions, especially in the case of MySQL, where it runs in active/backup mode behind HAProxy: you want to make sure HAProxy properly fails over to the backup.
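As a rough illustration of turning on queue mirroring, here is a minimal sketch using the RabbitMQ management HTTP API; it assumes the management plugin is enabled, and the URL, credentials and vhost are placeholders. In practice you would normally set this through your installer or rabbitmqctl.

```python
import json
import requests

# Minimal sketch: declare an "ha-all" policy so queues are mirrored across all
# cluster nodes. Host, port, credentials and the default vhost ("%2F" = "/")
# are placeholders for this example.
policy = {"pattern": ".*", "definition": {"ha-mode": "all"}, "apply-to": "queues"}
resp = requests.put(
    "http://rabbit1.example.com:15672/api/policies/%2F/ha-all",
    auth=("guest", "guest"),
    headers={"Content-Type": "application/json"},
    data=json.dumps(policy),
)
resp.raise_for_status()
print("queue mirroring policy applied")
```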
When you're testing HAProxy itself, the key things to consider are: if you bring down your primary HAProxy, you want to make sure the backup HAProxy now holds the VIP, and when the primary comes back up, that the VIP migrates back. You want to measure how long that takes and watch for service disruptions, and of course make sure your existing cloud resources are unaffected throughout.

Coming to the OpenStack services, there is a common theme: you always validate the cloud before you start the test and after it, so that you have a clean system with no leftovers from the previous state. Again, you do process-level or node-level disruptions for all the services; what differs is the workload you run against each service, so that you're exercising the right things. For Keystone, a good test case is creating tenants and users concurrently while you're bringing down the service. For Glance — and I'll go over these slides a little faster, because they all follow the same pattern — you do Glance image creation and deletion tests; we use the Rally framework here, which is a good framework for concurrent tests. For Nova, the concurrent VM boot test is again a very good workload, and in this case we also make sure we can fetch the noVNC console; in parallel we disrupt the Nova controller nodes, or the Nova API and nova-conductor processes. Neutron is a little special, because there are things you have to keep in mind — for instance Icehouse versus Juno. In Icehouse there was no router namespace migration, meaning that if you bring down the node on which the routers were created, you will actually lose your L3 routing, so you have to expect your existing VMs to lose connectivity. Then you have Cinder — and when you test Cinder or Glance, you are implicitly testing your storage back end as well; Ajay will cover some of the issues we have seen there. Finally, for Horizon you can use automated testing with Selenium (there's a rough sketch of that after this part). Horizon is the end-user application, so the idea is to make sure the user gets a seamless experience: even if you bring down one Horizon instance, since it's sitting behind HAProxy, the user should be able to keep executing their actions. So now I'll hand it over to Ajay to go into more detail on the issues we've seen.

Thanks. As Prasad described briefly, our general strategy for OpenStack high availability testing is three-fold. You force a disruption, at the node level or the process level; you monitor the cloud — parameters like how available your API endpoints are, how available your application VMs are, and so on; and finally you run a relevant test during the disruption, because no cloud is going to be completely idle while a disruption is happening. There is always some control plane activity on the cloud, and you have to simulate that in your testing.
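Here is the rough sketch of the Horizon check mentioned a moment ago, using Selenium: it logs in through the dashboard VIP and confirms that the landing page renders. The URL, the credentials and the element IDs (id_username, id_password) are assumptions and depend on your Horizon version and theme.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Rough Horizon availability check: log in via the dashboard VIP and confirm
# the landing page loads. URL, credentials and element IDs are placeholders.
driver = webdriver.Firefox()
try:
    driver.get("http://horizon-vip.example.com/dashboard/auth/login/")
    driver.find_element(By.ID, "id_username").send_keys("demo")
    driver.find_element(By.ID, "id_password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    assert "Overview" in driver.page_source, "dashboard did not render after login"
finally:
    driver.quit()
```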
While we did this testing, most of our production and scale testbeds were on Icehouse. This is just to give you a flavor of the kinds of issues we saw; the list is by no means comprehensive. As Prasad mentioned before, there are two ways to deploy these OpenStack services: an all-in-one controller model, where all your control plane processes run on one node and you have three of those, or a model where you have a separate node or VM for RabbitMQ and the other control plane services — more of the overcloud model of separating out the control plane VMs. What we have seen is that when RabbitMQ runs as a service on a separate VM by itself, we hit a lot more issues. If you look at the pattern of these issues, they are all really serious: the compute agents report as down for long periods of time, and in fact, when we ran the Rally test in parallel while disrupting the RabbitMQ VM, we saw something like 95% failures. The cloud is in a really disrupted state. What we had to do — and you'll also see this baked into some of the installers being shipped right now — was adjust the TCP keepalive timers. And we've come to realize that in Juno there is a heartbeat mechanism in the oslo.messaging library that makes this a bit easier, because the messaging library can now get notified when the other endpoint goes away and recovery can be more prompt. Of the first three issues listed, the third is about setting the TCP retry timer to a lower interval; what that helped with was that RabbitMQ wasn't realizing when a nova-conductor or a Nova controller node went away, and with the lower setting RabbitMQ noticed faster and converged faster (there's a small sketch of this kind of kernel tuning below). Even so, with a process-level disruption of RabbitMQ we saw something like two to three minutes of convergence time, which is not very good.

The other thing to keep in mind is that during an HA disruption, even if errors happen, you as an application expect an error within a reasonably small time frame — you don't schedule a VM and wait eight minutes to get an error back. What we have actually seen is that if you disrupt a Cinder controller node in parallel with a scale test that is creating Cinder volumes, volumes go into a deleting or creating state and never recover. We've also seen, while launching VMs and doing parallel disruptions of the Nova controller, VMs stuck in the scheduling state. So these are some of the issues you see, and the key point to emphasize is that you start seeing them when you combine HA disruption with scale. Bring the two together, and that's when the issues show up.
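As a rough illustration of the kind of tuning just mentioned — lowering the kernel's TCP retry and keepalive settings on a node so dead peers are noticed in seconds rather than minutes — here is a minimal sketch for a Linux host. The values are illustrative only, not recommendations, and in practice you would set and persist them through sysctl.

```python
# Minimal sketch (run as root on a Linux node): lower the TCP retransmission
# and keepalive knobs so stale connections to a vanished peer are torn down
# sooner. Values are illustrative; persist real settings via sysctl.conf.
settings = {
    "/proc/sys/net/ipv4/tcp_retries2": "5",          # fewer retransmits before giving up
    "/proc/sys/net/ipv4/tcp_keepalive_time": "5",    # idle seconds before the first probe
    "/proc/sys/net/ipv4/tcp_keepalive_intvl": "3",   # seconds between probes
    "/proc/sys/net/ipv4/tcp_keepalive_probes": "3",  # probes before declaring the peer dead
}
for path, value in settings.items():
    with open(path, "r+") as f:
        print(path, "was", f.read().strip())
        f.seek(0)
        f.write(value + "\n")
```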
One thing to take away — the reason I included the Open vSwitch agent failure — is that while the Nova scheduler is agnostic to the network state, where you do have single points of failure, like compute nodes, for which OpenStack doesn't provide high availability, you are really testing that the monitoring mechanism you put in place works. When you test from a black-box perspective, what you care about is whether your cloud is highly available and whether the monitoring infrastructure you rely on actually reports these failures — that's something to test as well.

So, to summarize the HA testing issues, the key takeaways: separating services onto multiple VMs or nodes in fact degraded things — we saw a lot more high availability issues there than with the all-in-one controller model, where everything on one node goes down together and another controller takes over. We've seen a lot of issues with RabbitMQ, and we believe some of them will be alleviated by the heartbeat mechanism between RabbitMQ and OpenStack; I pointed to the patch on the previous slide, and I think it may be in Juno as well.

What we realized, as Prasad went through those steps, is that this is a tedious process to do — causing disruptions, monitoring so many different parameters, and running scale tests in parallel. So we decided to build everything we learned into a tool. The tool is still in beta, but we'll talk about it and do a video demo so you get a flavor of what it can do; maybe it can help you validate your own cloud for high availability. The most important point is that you need automation for repeatability — that's critical. It takes weeks to do all of this manually, so you want to automate it.

So we'll talk about Cloud99, a tool we developed very recently. Our motivation was to take whatever we learned from HA testing, which was only semi-automated, and build it into a tool we can share. If you're an admin looking at certifying a cloud, what do you care about? What happens if my controller goes down — does the redundancy actually work? What happens if my L3 agents go down — this is for the newer Neutron implementations with VRRP? What will my downtime be? How fast did the cloud converge? Is my cloud highly available? One key aspect of our test strategy is that we load the cloud initially and actively monitor those VMs: we distribute VMs across tenants and watch what happens to them as the disruptions go on. The idea is that even with control plane disruptions, those VMs should stay available from a data plane perspective. You'll see that we embed a Nagios agent inside them and do active monitoring of SSH, HTTP and TCP connectivity. So that's what the tool gives you: you run a test, it disrupts services, and it monitors the cloud. We make no assumptions about any monitoring infrastructure you may have installed; the tool is written in a plug-in fashion, so if you already have monitoring infrastructure, you can plug it into the tool. The goal is to certify the cloud for HA. And as any such tool does, you need a config processor.
You need to tell the tool how your OpenStack cloud looks: where your hosts reside, what the access information for those hosts is, and so on. Then we have an infra engine, and reporting is critical, because with so much happening you want a concise report of exactly what occurred during the interval and a way to quantify what happened on the cloud during the high availability event.

We have different kinds of runners. The reason we chose Rally as a runner is that the majority of the HA issues we saw came from the combination of scale and disruption, and Rally, as you know, is a very good tool for scale and performance testing and benchmarking. But we had enough requests from people who wanted to run their own custom scripts while the disruption is running — they didn't want to be locked to a particular tool. For instance, if I write a test that fetches a VNC console and I want a disruption to run in parallel, that's something you can plug in. The only requirement for whatever you implement as a runner is that it continues on failure — it doesn't just bail out when something is disrupted — and Rally does continue on failure, so it works perfectly in this case.

For disruptors, we have built one that logs into bare-metal nodes and reboots or power-cycles them, one for bare-metal processes, and ones for services running in a container or as service VMs — for example if you're running something like TripleO with your controllers as VMs. For monitors, since we make no assumptions about existing monitoring infrastructure: if you have none, we use Ansible to log into each host and gather statistics about it; we use Nagios if it's available, and in any case we pre-package Nagios agents into the initial VMs we load onto the cloud and monitor them continuously, so that part is always there. And for the API endpoints, we simply use the Python SDKs to monitor the state of the APIs.

So this is how the tool works: the infra engine spawns off processes for the runners, disruptors and monitors. When you run the tool, you'll see the disruption happening in one window, the Rally tests (or whatever tests you want to run) in another, and you'll see information about the API downtimes — how long they went down, over what interval, how available each API service was, and so on. Pradeep is going to take over after this, and the important thing is that we also generate results and charts, so if you want a graphical representation of your downtimes and how the system behaved through the HA disruption, you get that as well. So Pradeep will take over from here.

Thanks, Ajay. Ajay pretty much explained the infrastructure of the Cloud99 tool. Now I'm going to walk you through two demonstrations, covering two different scenarios.
One is a non-HA scenario and the other is an HA scenario. The reason we demonstrate both is to emphasize how important redundancy is, how a single point of failure can cause serious issues, and how a tool like Cloud99 can help you identify the relevant parameters. So let's get into the demo.

The first scenario, as I said, is the non-HA one. This is your typical non-HA setup: you have one controller node and one or more compute nodes, and you can see all the agents are running on the controller node. For this demonstration we are going to bring down the Neutron server on the controller node and see what happens. Before showing you the video, let's analyze what should happen, since this is a single point of failure: as soon as you bring down the Neutron server, the API endpoint should become unavailable to the operator, the agent running on that host should go down, and the VMs running on the compute nodes should not have any issues, because they are in no way dependent on the Neutron server that's going down.

So let me bring up the video and walk you through it. The user runs the tool from the CLI, feeding in two input files: one is your OpenStack topology file, which defines what your OpenStack deployment looks like, and the other is a scenario file, executor.yml. Before launching the tool, I'll bring up the cloud and show you what already exists in it: you can see two application VMs already loaded. The special thing about these VMs is that they have a Nagios agent embedded in them — this is how we keep monitoring the application VMs while we disrupt and monitor the cloud — and these agents mimic the application VMs that would actually be running on the cloud.

Once you launch the CLI, you'll see, as Ajay explained, external terminals opening up automatically. These windows are your disruptors, your monitors and your runners — the core components from the infrastructure diagram on the earlier slide. As an operator or tester, it's hard to see what's happening in the cloud, so we open all these terminals automatically. From the top left, that's your monitor; on the right is your disruptor; and there is a runner as well, which I'll bring to the front shortly. Right now we're zooming into one window, the disruptor window, where you can see the neutron-server process being disrupted. In this case it's just a process disruption, but it can be anything — the whole tool is a pluggable model, so you can bring your own plugins, plug them into the infra, and disrupt whatever you want: it can disrupt a VM, it can bring down an entire node, whatever it is.
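As a loose illustration of what a process-level disruptor like this boils down to — not the tool's actual implementation — here is a minimal sketch that SSHes into a controller, stops a service, and restarts it after an interval. The host name and the neutron-server unit name are placeholders and differ across distributions.

```python
import subprocess
import time

# Minimal sketch of a process-level disruptor: stop a service over SSH, wait
# while the monitors and the parallel test observe the outage, then restart it.
# Host and unit name are placeholders; the real tool drives this via plugins.
HOST = "root@controller-1.example.com"

def set_service(action):
    subprocess.check_call(["ssh", HOST, "systemctl", action, "neutron-server"])

set_service("stop")
time.sleep(60)          # configurable disruption interval
set_service("start")
```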
That's the window we zoomed in on. At the far right you see another window — that's the runner. As Ajay explained, we have currently adapted Rally as the runner, Rally being a benchmarking tool, and the scenario we're executing boots a number of concurrent VMs on the cloud I showed you earlier. So what we have so far: we are disrupting the Neutron server, and we are booting VMs on the cloud at the same time. If you zoom into this other window, it is a monitoring window that watches the APIs — as an operator, are my endpoints available all the time? (A minimal stand-alone illustration of this kind of poller appears at the end of this part.) But as we saw in the diagram, this is a single point of failure, so since we have brought the Neutron server down in the disruption window, the API monitor is saying: I can't reach the API, my API endpoint is not available — which is exactly what you expect with a single point of failure. You can see the Neutron server state going down and the API monitor detecting the failure.

We also constantly monitor the host agents, because as an operator, your API endpoints may or may not be reachable, but what's happening on the actual host? Here we use Ansible to get into the host and check the status of the services running on it. Again, this is a single node, the Neutron server has gone down, and our Ansible-based monitoring has detected the failure — that's what you see in red. In the meantime, as I showed you, the Nagios monitor is constantly watching the application VMs, and the application VMs are doing perfectly fine, because, as I said, they don't depend on the Neutron server at all. And Rally keeps running in the background. An important thing to note about the disruption is that it repeats on a configurable interval: it brings the process down, brings it back up, brings it down again. I'm speeding up the video — right now the disruptor has restarted the process in the background. Meanwhile, let me show you the cloud: you can see Rally loading it up with instances. We forgot to zoom into the disruptor window when it restarted, but you can see the monitors reporting the server state back to stable, because the disruptor restarted the process — both the API monitor and the health monitor report a stable state.

Again, we're speeding things up and waiting for Rally to finish its execution. The way the tool is designed, it waits for the runner to finish; once the runner completes, all the components report their results back to the infra engine, which processes them and gives us a nice summary. So let's go to the summary report — we're just waiting for Rally to finish.
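Here is the minimal sketch of the kind of endpoint poller referred to above — purely illustrative, not the tool's code; the Cloud99 monitors use the OpenStack Python SDKs. The Keystone URL behind the VIP is a placeholder.

```python
import time
import requests

# Rough sketch of an API availability monitor: poll a Keystone endpoint once a
# second and report the windows during which it was unreachable. The endpoint
# URL is a placeholder for your controller VIP.
ENDPOINT = "http://controller-vip.example.com:5000/v3"
down_since = None
while True:
    try:
        requests.get(ENDPOINT, timeout=2).raise_for_status()
        if down_since is not None:
            print("API back up after %.1f seconds of downtime" % (time.time() - down_since))
            down_since = None
    except requests.RequestException:
        if down_since is None:
            down_since = time.time()
            print("API endpoint unreachable")
    time.sleep(1)
```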
Rally has now completed its run, and the tool signals all the other windows to terminate. This is what you finally see at the end of a run. The first table is the disruptor summary: what we did, at what time we disrupted the process and at what time we restarted it. The second table is the monitor table: Ansible reporting that during that time range the Neutron server was not reachable. Then there is the application VM monitoring table, which says that during the entire run the data plane was not affected and the VMs were doing fine — what Nagios monitored there was SSH, ping and HTTP, both inbound and outbound. So the application VMs were perfectly fine during the whole disruption. And you can see the downtime range: that's the period when the API endpoint was not available. As an operator, if you have a single controller node, this is the downtime you're going to have — here we controlled the disruption ourselves, so we know exactly when the API was not reachable. Other than that, the disruption should not affect the other processes and services, and that's summarized in the table: everything else was fine.

Finally we show the Rally results, and you can see the failure percentage. We were trying to boot VMs, and the failure percentage is quite high in this scenario, because of the single point of failure — and that's exactly the point we want to stress here. Nearly 50% of the boots failed, which is not good at all. That's what this table summarizes. I also want to point out that Rally cleans up all its instances, so the cloud is left unaffected, just the way we started. And at the end of the run, the tool also creates a graphical representation of all the tables you just saw. The green blocks indicate the servers and processes that were fine during that time, but if you look at the healthy-API chart, you see a small red block — that's the time range during which the Neutron server was down. You'll see the same in the Ansible chart. So that's pretty much the workflow of the tool.

I'm going to run through the other scenario quickly so you see the key difference between the two. The next scenario is the HA scenario — again, to stress the difference between a non-HA and an HA setup. In the HA scenario it's a typical setup with three controller nodes, and we are going to replicate exactly what we did last time: bring down the Neutron server on one of the controller nodes and see what the difference is between an HA and a non-HA setup. Let's get into the video — I'm going to speed it up because we're running short on time. Again I launch the CLI, we show what the cloud looks like initially, and once the CLI is launched all the windows pop up: disruptors, monitors and runners, everything running in parallel, all handled by the infrastructure.
You don't have to worry about any of it. Again, the key thing is that everything is pluggable, so you can write your own plugins and plug them into the infrastructure, and the infrastructure takes care of them. You can see that initially all the processes are running fine, but as soon as the disruption starts, the bottom-left window shows that Ansible detects the failure first. Why? Because Ansible is monitoring the host directly: it actually gets into the host and detects that the Neutron server has failed. The top window, on the other hand, does not detect a failure, because this is an HA setup — everything sits behind the HA VIP, so the API endpoints stay available the whole time. And again, we are constantly monitoring the application VMs, and they are doing fine. Speeding up: at the end of the run you get the same summary, showing when the disruption started and stopped. Ansible, being a host-level monitor, reports the failure when it happened; the application VMs were fine, no issues there. The key difference is the API endpoints: in an HA setup, as an operator, your API endpoints are available the whole time — and that's the whole point of HA. Other than that, all the services, processes and agents should be fine throughout.

Let's quickly see what Rally says. Rally reports a very high pass percentage: it was able to boot a lot of VMs even while one of the Neutron servers was going down. The reason you see the small 9% failure is that there may have been messages already scheduled to that Neutron server, or messages already in transit. Also, this scenario boots only about 40 VMs; if you booted many more, the failure percentage would have been even lower. So that's what the runner summarizes. Let me quickly bring up the graphical representation: the healthy-API chart should show no downtime — remember that in the last scenario we had a small red block there, but here everything should be clean because we're in HA mode. The Ansible chart, on the other hand, will show a small red block, because that was the downtime of the Neutron server on that host.

We're running out of time, so I'm going to stop the video here. Before handing the presentation back to Ajay to summarize, let me quickly summarize the demo. When we ran the tool against the non-HA and the HA setups, what we saw is that with a single point of failure your failure percentage is always going to be higher than in an HA setup. The other key thing is API availability, which depends entirely on your HA setup: with HA, your API endpoints should be available at all times, whereas with a single point of failure you will see some downtime. Other than that, the data plane should, as expected, be available all the time — but there can be an exception.
Say you're running the tool with a full node disruption and you bring an entire compute node down: then you may see some loss on the data plane as well. So that pretty much summarizes the tool and the scenarios. I'll hand it off to Ajay to quickly summarize the presentation.

So what are the key takeaways? One thing to keep in mind is that the kind of deployment you choose — an all-in-one controller versus services running on separate nodes — has an impact on HA test results. We also saw that system and OpenStack configuration plays a key role, for instance adjusting the TCP keepalive timers, or running RabbitMQ in cluster mode versus running it behind an HAProxy node. 100% control plane availability is not a reality without API replay: you saw that the Rally tests, even in the fully HA case, still had some failures. The reason is that API requests that were already scheduled are not going to get replayed, and you would need an infrastructure to replay them if you want to reach a higher level of availability. When you test HA, there are several things you do: you don't just run a sanity test, do a disruption, and run the sanity test again — that does not validate that your cloud is highly available. You want to test HA together with scale, with both control plane and data plane activity; we did a reasonable amount of control plane scale using Rally, and data plane activity through throughput tests we wrote ourselves. Monitoring the environment to quantify failures is key: you need to monitor the heck out of it while you do this, because you need to quantify exactly what happened, and by collecting the data from the different nodes and from the API endpoints you can actually make sense of what's going on. And automation is the key to test repeatability and to verifying fixes — this is a painstaking process to do manually, so you want to capture all your findings in a tool and make it easier going forward.

I'd like to thank Susie, who isn't here, one of our co-developers on this tool; my immediate team, the OpenStack Systems Engineering team at Cisco; and the Cisco Cloud Services group, where a lot of the test infrastructure and much of the testing lived — thanks to them for their continuous support. This is the link to the GitHub repository for the tool, and we also have an email alias where you can reach us. The tool is in beta, as I said, but it should stabilize in the next few weeks; we have people committed to working on it. Today we only have the Rally plugin in the back end, but we're developing other tests that can be used with the tool, and you can also plug the tool into your own test framework. What we'll work on over the next few weeks is more documentation and that kind of thing, to make it easier for someone else to plug things in. So we'd like to take questions now — any questions or comments? Actually, the session has just ended, so if you have any questions, please come see the presenters at the front. Thank you.