Okay. Well, thank you for attending today's session. Today, my name is Rob Young, not just today, but every day. I'm with Red Hat. I work on the OpenStack team. Most recently, I've been promoted to lead our virtualization business at Red Hat, so I don't know if that's a promotion or if condolences are in order, but I'm here with a friend of mine from Dell, JT Williams, and today we're going to be talking about highly available instances running on OpenStack. Just to level-set and set the expectation, we're not going to be delivering a deep technical talk about this. We really just want to raise awareness of how the feature works, how it becomes available, what you can do to turn it on and off within OpenStack, and talk a little bit about some of the things you can do with it, why you need it, and why it's important. The other thing is, we've got a demo of this as well. It's a live demo, we're doing it without a net, so if something comes crashing down, hopefully nobody gets injured in the process.

So, before we get started, let's define what an instance is. In this case, we're talking about instances running on nodes, and a node is a physical server or machine. Everybody should understand this. Within nodes you have instances, or virtual machines, running, and on those instances are applications, databases, you name it, but things that are typically associated with making money for you or for your customers, internal or external.

Some of the things we're not going to talk about, because they're also covered by some of the technologies we'll talk around with Nova and Compute: node maintenance. That's a case where node availability is important, but it's controlled. Here you'll be adding hardware, updating software, addressing imminent failure, or consolidating or spreading instances for power conservation, avoiding resource contention, et cetera.

So, some of the Nova enablers. For those of you that operate at a low level, you'll recognize these commands, like evacuate, which is completely misnamed. But there are other things here that you can do proactively to move instances around within an OpenStack environment. These are all things you can do proactively. I just wanted to show you these. One, I wanted you to see that I can use grep. But also, just so you know, you can do some of the things we're talking about here in a controlled, automated way, and not just in the chaotic way we're going to show you.

So here's the problem we're trying to solve. As OpenStack becomes more mature, it becomes an enabler for customers or users that want to take Mode 1 applications, convert them, and migrate them to Mode 2, or cloud-ready. These environments are now becoming more depended upon, for dev and test, which needs to be up all the time, and for production and business-critical applications. And the thing users are going to expect from OpenStack as it continues to mature is that these environments are always up and always available. When something is in POC status, not being up is okay, but once it's up and running, even if it's just developer productivity, it's still costing you money when it's not available.
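If you want to poke at that proactive side yourself later, it looks roughly like this. This is a quick sketch only: subcommand names and flags shift between releases and client versions, and the instance and host names here are just placeholders.

```
# Proactive instance movement with the nova CLI (illustrative; check your release's client).
nova help | grep -E 'migrat|evacuat'      # see which migration/evacuation subcommands you have
nova live-migration web-01 compute-2      # move a running instance to another host before maintenance
nova migrate web-01                       # cold-migrate an instance so the scheduler places it elsewhere
nova evacuate db-01 --on-shared-storage   # rebuild an instance from a failed host; misnamed, as noted
```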
So, as OpenStack continues to mature and the components continue to mature, these are the things people are going to expect. You are going to expect them from OpenStack. And this is a good problem to have, because OpenStack continues to grow in popularity and in production-ready use cases. So, the requirement as this happens: Mode 2 apps must meet or exceed Mode 1 SLAs. I grew up in the 80s and 90s when mainframe applications were huge. The big thing there, and that was the cloud back then, by the way, was that applications just always had to be up, and we had this thing called service-level agreements. I had this little thing called a pager. Most people in here probably don't know what that is, but I wore it on my belt on the weekends and after hours, and it would go off when one of my applications running in that environment was down, and I had to fix it immediately. So the benefits of running something on OpenStack have to outweigh the costs as well. And the cost in this case is that when you have an application, a system, a partition, or a database that's unavailable, every second it's down is costing your company, or your customers, revenue. So it just has to be up. And as OpenStack continues to mature, these things have to become first and foremost on everybody's mind. We have to mature in this way. If we stagnate, we're going to die.

So, where are we at today? I'm going to turn it over to JT Williams. He's going to walk you through Instance HA as it exists today.

Thank you, Rob. So, where are we today? Well, OpenStack provides many of the core services we need to configure a cluster that monitors not only the controller nodes but also the Nova compute nodes. Some of the services it needs, and that OpenStack provides, are Pacemaker and systemd for monitoring the control nodes, the Nova APIs for host evacuation and recovery, HAProxy for distribution of service requests, networking services, and database redundancy for state recovery. All of this goes into making the pcs cluster. The cluster is really what high availability is all about. It's monitoring the controller nodes, it's monitoring the compute nodes, and the reason you want a cluster is to minimize downtime for both the controller nodes and the compute nodes, and to be able to restart a compute node or a controller node if a failure occurs on those nodes.

And the reason we need high availability is that our apps live on instances. Some of those apps may be routing software for trucks, like Sysco Foods has. If one of those apps goes down, Sysco Foods loses roughly half a million dollars a minute. So the big reason we need this is dollars. We don't want to lose dollars; we want our applications up all the time.

So for highly available instances, the cluster has to be configured correctly. It requires pacemaker_remote on the compute nodes, because Pacemaker needs to be able to talk to them. Pacemaker runs on the controller nodes and talks to the compute nodes to monitor and control the compute node services. It uses STONITH to isolate nodes. So if there is a failure, and we'll demonstrate one here, well, not a failure exactly, we're going to actually shut down a node, it uses STONITH to fence the compute node and then uses the Nova APIs to evacuate and recover the instances on that node. And we'll show you how that works in a moment.
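If you want to see what that cluster looks like on a running deployment, a few read-only commands will show you most of it. This is a hedged sketch: pcs syntax differs a bit between versions, and the resource names you see will depend on how your overcloud was deployed.

```
# Inspecting the Pacemaker/pcs cluster state (read-only; syntax varies slightly by pcs version).
pcs status            # controller and compute membership, managed resources, recent failures
pcs stonith show      # fencing devices, e.g. fence_ipmilan entries pointing at each node's iDRAC
pcs resource show     # the OpenStack services Pacemaker manages (galera, haproxy, rabbitmq, ...)
crm_mon -1            # one-shot monitor view, including any pacemaker_remote compute nodes
```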
Evacuation in this sense, though, is not really evacuating. The backend storage is a shared storage system, so if a compute node goes down, what ends up happening is the cluster isolates that node, takes the image that's on the backend, and reschedules the instance to rebuild on a different compute node. And the last point, I wasn't sure at first why we put this here: users must manually configure Instance HA. It's not something that comes out of the box; it requires a lot of configuration. So I just wanted to point that out. Which is actually why it's a good time to talk about filtering, because part of that manual configuration is for filtering. You can set up certain instances to be evacuable and other instances not to be evacuated. So you might want all your production instances to be evacuable, but you don't want your dev, test, or QA instances to be. They can live till another day; you can attend to them later.

Which brings us to this crazy scenario. We have this customer, Sysco Foods, who wants to see their new cloud running on OpenStack, on our OpenStack compute system. And because they want to see this, one of the things they like to do is play around with the system. And although this isn't the best scenario in the world, someone pops over there and hits the power button on a compute node. Well, this doesn't really happen, but it's the scenario I wanted to demo here. So what happens when that occurs? The node goes down. Pacemaker, via Pacemaker Remote, tells that node: go away, we're taking you out of the system, and we're going to let the cluster continue on as it is. It initiates a STONITH command, which fences that node and shuts it down. While that's happening, Nova begins the evacuation process, though as I said before, it's not really evacuation; it's really rebuilding those instances on the other compute nodes. And once the instances are moved, it continues with the recovery action of bringing that compute node back into the cluster, if it does come back up. If it went down for a hardware reason, it will just stay outside the cluster.

So here's the little demo. On the left-hand side are three windows, and those are our compute nodes. I'll bring those up in a second and show them running. At the top, we run a nova list to display the instances and the state they're in. In the middle, we have three sections: the top section shows the controller and compute nodes and how they're running in the cluster, and the bottom shows the hypervisors and the instances running on each of the compute nodes. And over here on the right is the Dell iDRAC window, which is what we're going to use to shut down the node.

Whoops, let's go back and run that demo. As you see, over on the left at the top, everything's running; it's showing all the compute nodes. Over here, we have some user who's going to power off our compute node, and it's going to be the middle one, compute-1. In a second here, you'll see that it goes dead as a result of being powered off, and the iDRAC actually starts rebooting and loses its signal while this is going on.
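For anyone who wants to recreate those windows at home, they map roughly to commands like these. This is our approximation of what was on screen rather than the exact demo scripts, and the refresh interval is just a guess.

```
# Approximate recreation of the demo's monitoring windows (not the actual demo scripts).
watch -n 5 nova list               # top window: instance names and states (ACTIVE, REBUILD, ...)
watch -n 5 pcs status              # middle, upper: cluster membership and fencing activity
watch -n 5 nova hypervisor-list    # middle, lower: which compute nodes are up
watch -n 5 nova hypervisor-stats   # ... and how many instances are spread across them
# The right-hand window is the Dell iDRAC console, used here only to power the node off.
```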
Here's that controller and compute cluster setup I was talking about, and down here at the bottom is the hypervisor status, the number of instances running on each node. At the top, you can see all the instances running. They're all active; everything looks good. Now watch the compute node: it went offline and became isolated. So it's out of the picture now. It's not part of the cluster anymore. And in a second here, and I'm fast-forwarding through this video a little bit, you're going to see a rebuild status at the top. These are the instances that are going to get relocated to one of the other two compute nodes. As the system continues booting here, you'll actually see it boot, and you'll see the instances have already been relocated. This whole thing took about 90 seconds to get the instances, not off the node exactly, but reloaded from the backend onto the other compute nodes. Right now, everything's back up, the system should be ready for rescheduling, and everything should be working again.

Which brings us to: what do Dell and Red Hat actually provide to the community? Well, we provide all this open-source work. At Red Hat, the development team, Andrew and Fabio, who work on clustering services, developed Pacemaker Remote and work on the patches. Dell contributes to the patches and to the upstream integration, the upstream integration for Pacemaker and the Nova APIs. All that work has already been done for you. You can pull that information down and add it to your configuration yourself, or you can get it from us. And filtering, which allows you to set certain instances as evacuable and certain other instances as not evacuable, is all part of that configuration.

What Dell has, and I work for Dell, and I work on that product right there, is the Dell EMC bundle for Red Hat OpenStack. What we've done is taken our reference architecture, which you may have seen in the keynote this morning, and bundled it with a set of automation tools we call JetPack that we're going to release to the open-source community. It was just announced last Tuesday, I think, at the Red Hat conference that we're going to go ahead and provide that offering back. So we've got all these tools which let you install and configure Instance HA, and which let you install the whole stack using Ironic and everything else. We're going to give all of that back to the community, and that's what this slide is talking about. There are scripts there for the command-line interface; you can install the stack without Instance HA, and then there's a configuration script that will add Instance HA to your cluster.

Some of the future work we're doing is automation scripts that let flavors and filters mark nodes and instances as opted in for evacuation. We're going to automate that; right now it's a manual process, you have to configure it by hand. And containerized OSP services: this is what Rob's team is working on right now. They're looking at how to containerize a lot of the OpenStack services, especially for Red Hat. And the last thing we want to work on, and we'll probably need some ideas on, is what kind of filters would let you support different workloads with different SLAs, for instance large production systems, medium production systems, and small production systems. Would it be useful to have filters tied to different service-level agreements, so you could charge different prices for those instances?
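To make the filtering idea a bit more concrete, here is one hedged sketch of opt-in evacuation using flavor metadata. The evacuable key follows the pattern in the upstream Instance HA material, but check the guide for the exact key your release honors; the sla property is purely hypothetical, just our way of picturing the tiered-SLA filters we were asking about.

```
# Illustrative only: tagging flavors so some workloads evacuate and others don't.
# "evacuable" mirrors the upstream Instance HA pattern; "sla" is a hypothetical extension.
openstack flavor create --ram 16384 --disk 80 --vcpus 8 m1.prod-large
openstack flavor set --property evacuable=true  --property sla=gold   m1.prod-large
openstack flavor create --ram 2048 --disk 20 --vcpus 1 m1.dev-small
openstack flavor set --property evacuable=false --property sla=bronze m1.dev-small
# Instances booted from m1.prod-large get rebuilt on a surviving node after fencing;
# instances on m1.dev-small are left alone to live till another day.
```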
So that's where we're at today, and Rob's going to tell you where we're going.

Okay. Thanks, JT. Thank you. So Instance HA is near and dear to my heart, because on the team I work on, the names Andrew Beekhof and Fabio Di Nitto probably sound familiar to this crowd. If you're familiar with Pacemaker and Pacemaker Remote and high availability, not just for OpenStack but for Linux, you probably know those names. They work with me directly. And a lot of the community-driven work JT just described comes out of a collaboration between Red Hat and folks like SUSE and others as well.

So, the things we've observed. We don't have all the answers; we just have a lot of good questions. From the keynote yesterday, if you didn't take anything else away from it, you should know that one of the number one OpenStack use cases is around NFV, telco, and MEC, the edge computing. And this is a huge problem, because Pacemaker isn't well suited for those workloads. It's very well suited for applications that can survive taking a minute or two to recover an instance or a node or whatever. That's one of the reasons we put the filters into the current Instance HA implementation in upstream OpenStack: we know a node can be running many, many instances, but the ones that really matter may be running that Galera database, or that application that has to be up. The rest of it can come back later. That's why we built the filters in. But we know, even as the authors and maintainers of things like Pacemaker, that it's not the right tool for the jobs telco demands.

So this is where we're looking at other projects, and we're looking at the gaps in Kubernetes as an example. If we want to containerize OpenStack in, where are we at, Pike and Queens, we know that Kubernetes or some other container management system is probably going to be the fabric underneath, and we want to reduce our dependence on the things that aren't going to move forward. Pacemaker has been a very nice tool for getting us to where we are today, but Kubernetes, or whatever that project to be named turns out to be, is going to have gaps that need filling: things like fencing, auto-recovery, and spawning new instances of a virtual machine quickly. And it's got to be at the workload level, not at the instance level or the node level. Right now, we're at the node level. We can pick and choose which instances within a node we carry over, but it's at the node level all the same. We need to drill deeper down, because that's where the telcos want it.

And the other use case, if you look at the footprint that I believe is going to dominate the OpenStack community going forward, is this concept of distributed compute nodes. Basically, you've got a central controller or set of controllers with many, many compute nodes around the periphery, all of which are going to have a dependency on high availability. This use case is being pushed by the telcos and the NFV use cases, but if it works for them, it's going to work for the enterprise use case as well.
So, if you're not paying attention to what the telcos are requiring from OpenStack, your users are going to require it from you at some point, because it satisfies that need as well. That's kind of where we see things going, and where, engineering-wise, we're putting our OpenStack effort at Red Hat, not just in our commercial product but upstream. Everything we do at Red Hat goes upstream, so if you want it, you can have it. From OpenStack's side, we see that being the use case and that deployment model being predominantly adopted going forward. So we're investing in the projects we see coming, and we're interested in Kubernetes. OpenStack seems open to Kubernetes interfaces and to working closer together, so we're working on that now, and with our partners from Dell, when we perfect something upstream or get it to POC quality, it'll manifest itself through our partnership on hardware, coming to a shelf or rack near you.

If you want to learn more: we're just a couple of guys who know something about Instance HA, OpenStack, hardware, software, and open source, but if you really want to learn about what we've talked about today, look at the OpenStack docs. Everything we've talked about is there, and it tells you how to use it. You don't have to buy anything from Red Hat to do this yourself; we make it easier, but have at it. There's a guide out there, the HA guide. It's very detailed. It can probably be improved, so if you want to contribute and make it better, please do. The other thing is, if you want to contribute to OpenStack, which my guess is everybody in this room is already doing, there's a link there; learn how to do that. We'd love to have more people looking at the HA stuff. We love being a main contributor there, but the more, the merrier. We don't have all the great ideas. We've got some good ideas, but to make them great it's better to have people tell you you're wrong, and we're wrong every day. The other thing, if you want to learn about the products JT talked about, and how Dell is doing interesting things with OpenStack on commodity hardware, there's a resource there, and finally the Red Hat material is there as well.

With that, there are a couple of microphones here. If you've got any questions, my favorite three words are "I don't know," followed by "I will find out." So if you've got any questions for myself or JT, we'd be happy to entertain those. If not, you get 10 minutes of your day back. Okay. Thank you. Thanks for attending.