Yep, I guess we should get started. So first of all, good morning to everybody, and thank you for coming to our session, the kickoff session for the summit. As a quick show of hands: how many people have already set up OpenStack or played with it? Okay, so that's about half, and I assume the rest of you are relatively new to the environment.

What we wanted to do today was run through some discussions around high availability. I just noticed in the keynote that Intel has issued a challenge of setting up highly available clouds, which is something we've been focused on at SUSE. But this isn't a product pitch; this is really about why HA is important and what you should consider as you deploy OpenStack, and then Adam will show a demo of one way that we have set up high availability.

So first of all, this is why everybody's looking at a private cloud: lower cost, increased agility, and the ability to respond more rapidly to changes in the business environment. And obviously control and security are why people want to do it in-house as opposed to going to a public cloud. But the important thing is, once you've provided a more agile solution for your line of business and you start looking at what workloads you want to move into the cloud, it becomes critical that the cloud is always available.

Now, this is some work that Accenture Labs did to look at what kinds of workloads you can move into the cloud. You can see at the upper right the ones that are easiest to move and the ones that have the most value to the business: things like setting up a brand new business.
I don't have to buy infrastructure to set the business up; I can just use virtual infrastructure through the cloud. Batch and data-intensive applications: things that require lots of capacity, but perhaps for fairly short periods of time (think numerical modeling, if you're in the engineering business). And then peak load demands: I need to spin up a bunch of servers right before the holiday season, when people want to do online shopping.

As you get over to the left, in theory these are, and I'm not sure why they say this, of less value to the enterprise, because most of what we see customers interested in doing initially is actually development and test: give my developers the ability to spin up virtual machines to do system builds, or to generate load by creating lots of virtual machines in the cloud. But one of the things we're starting to see from our discussions with customers is an interest in moving more mission-critical applications. And certainly no application that's running in the cloud is going to be unimportant to the business; that's the whole point. You're only going to run things that matter to the business. But as you start to move more mission-critical applications into the cloud, it becomes important to have things up and running all the time.

Part of the point is, if you look at an enterprise, at best maybe one quarter of the applications you run in the cloud are not considered critical to the business. Whether it's mission critical or business critical, every workload that goes into the cloud is going to be important as you go forward in time. So you need to start thinking: once I deploy an OpenStack environment, how do I make sure it's ready to handle critical workloads?

So what are the things we look at when we start talking about OpenStack, and what are the considerations?
These are the questions we and our customers went through as we started asking: how do I make this highly available? The first question is: what am I trying to protect? The obvious choices are the control plane, meaning the OpenStack services, or the guests. Our view is that the control plane is the important thing; you can have hot standby for the control nodes, and the real key is that this makes sure the cloud is always running, irrespective of what might happen to any particular workload.

When you start to look at workloads, you have this tension with what I call Cloud 101, which is best described as: failure is not an option, it's a feature. You just assume that your cloud is going to fail. You treat all your servers as cattle, so you don't care; they're going to go down, and I'm not going to try to make them redundant. But I've talked to customers and said, well, you know, the cloud model of high availability is that you just assume things will fail and take care of it in the design, and they look at me like, "You're not serious about that, correct? We need to build highly available infrastructure, and ultimately we want to get to highly available guests."

So you can look at high availability for guests, and we'll talk about that. Stratus Technologies is actually here, and they've got some interesting technology to provide HA in the guest environment. If you're running Linux, you can use high availability tools at the VM level, plus multiple availability zones: I run copies of VMs in two different physical environments, and if I do it correctly, I should have minimal impact on my application. Versus the Cloud 101 model, where you probably have to rewrite existing applications to take advantage of multi-tier load balancing, which is how people develop modern scale-out applications in an HA environment.

When it comes to the control nodes, this is actually, we think, a relatively straightforward problem to solve. All of the services running in OpenStack are Linux services, so you can use traditional Linux high availability tools to build clusters that provide excellent availability for the services, and you can have multiple clusters for the multiple services. So that's where our focus really was: on the control plane.

Now, if you look at an OpenStack distribution (this happens to be what ours, SUSE Cloud, looks like), you have an admin node, which is really just an installation service. Pretty much all of the distributions have that today: you install an admin server, and it deploys the rest of the cloud. Then you have the control node.
You can have multiple control nodes that run the various services, and then you have compute nodes and storage nodes. Our focus initially was really on how to provide high availability around the control node, and the reason for that is what happens if you look at the impact of failures.

If the admin server goes down, that just means I can't add new physical nodes to my cloud, and I might not be able to rediscover nodes when they come back, but it won't have any impact on my actual operational cloud. Users will still be able to log in and start virtual machines, and all the virtual workloads running in the environment will keep running. There's really no impact, and any impact can be mitigated through standard backup and recovery solutions: you back up the state of the cloud, and if your admin server goes down, you bring it back up, restore the state, and away you go.

If the control services go down, at a minimum you can't start and stop guest images. But in reality, since you're running Cinder, Nova, and Neutron there, if you lose the control services you lose the control plane, and pretty much your whole cloud is going to go away; even existing workloads are going to stop running. (Originally, back when we started looking at this, there was actually no impact on deployed instances, but now there is.) If a compute node goes down, you're obviously going to lose all the VMs running on that node, and you have to restart and re-provision them. This goes back to how I'm going to deal with that: the cloud way is that I have a multi-tier environment, I assume my physical servers are going to go away, and I mitigate that through correct application design.

So that's the upfront, non-technical discussion. To introduce some of the technical considerations, this is the very high-level view of what the cloud looks like: I have an orchestration layer sitting on top of a bunch of physical servers with a hypervisor, I've got virtual machines running in that environment, and off to the left-hand side I've got a control node, which is what actually drives the orchestration layer for the VMs.

The first pass at making a cloud more available is to set up a cluster with my controllers in it, so I can run all my OpenStack services in a highly available cluster, and then to start setting up availability zones for my VMs. It almost doesn't matter exactly how I do it, but the traditional cloud architecture is: I have multiple VMs in the workload, I make sure they're not in the same availability zone, and I put load balancing on top to make sure things are running. I can run HAProxy so that when one side goes down, workloads automatically move over to the other side.

With that, I'm going to turn it over to my colleague Adam, who's going to talk about how we approached the problem and then run through a demo.

All right. Is this thing on? Yeah, you can hear me at the back? Cool, okay.
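As a rough illustration of the HAProxy load-balancing idea Pete just described, a minimal haproxy.cfg fragment fronting one OpenStack API across two controllers could look like the following. The virtual IP, backend addresses, ports, and names are hypothetical, not taken from the talk:

```
# haproxy.cfg fragment: clients hit a virtual IP; HAProxy round-robins
# requests across the two controllers and drops a controller from
# rotation when its health check fails
frontend keystone_api
    bind 192.168.124.80:5000
    default_backend keystone_nodes

backend keystone_nodes
    balance roundrobin
    option httpchk GET /
    server controller1 192.168.124.81:5000 check inter 2000 rise 2 fall 3
    server controller2 192.168.124.82:5000 check inter 2000 rise 2 fall 3
```

In a Pacemaker setup the virtual IP is itself a cluster resource, so if the node running HAProxy dies, the IP and the load balancer fail over together.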
So, to go into a bit more detail about the approach we took: really, you could use this as a set of generic points about implementing high availability, because I think it's best practice across the board to some extent; a lot of other vendors have gone down this route, or a very similar one.

The first best practice is to fully automate your HA configuration. Hands up: who has ever configured a cluster manually before? And of those people, who enjoyed it? Right, that demonstrates my point. Setting up a cluster is something you need to do, but it can be challenging, so having something that automates it and is well tested is important.

We used really standard open source components such as Pacemaker, DRBD, SBD, and HAProxy. These are components you've probably heard of before; other vendors are using them as well, and if you look at the OpenStack High Availability Guide, it recommends these components. They're proven and they're open source, so it seemed like the obvious choice for us. In fact, we already have a high availability product, our SUSE Linux Enterprise HA product, that uses them anyway.

We had to extend our deployment tool, Crowbar, to set up the cluster, and we did that by adding a new plugin to it; in the Crowbar world, plugins are called barclamps. So we created a Pacemaker barclamp that takes care of that automation, and we adapted our existing OpenStack deployment and configuration management code to allow HA deployments.

We use Postgres, but of course MySQL, MariaDB, and other databases are popular; Postgres just happens to be the one that made the most sense for us. We use DRBD, the Distributed Replicated Block Device technology, to have a master/slave pair for the database and for the message queue, so that if the master fails, you can fail over to the slave.
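To make that master/slave idea concrete, here is roughly what such a configuration can look like in Pacemaker's crm shell: Postgres runs only where the DRBD volume is promoted, and moves with it on failover. The resource names, device, and paths are hypothetical; in practice Crowbar generates the real configuration for you:

```
# DRBD volume, promoted to master on exactly one of the two nodes
primitive drbd-pg ocf:linbit:drbd params drbd_resource=postgres \
    op monitor interval=15s role=Master op monitor interval=30s role=Slave
ms ms-drbd-pg drbd-pg meta master-max=1 clone-max=2 notify=true
# Filesystem plus database, grouped so they start and stop together
primitive fs-pg ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/var/lib/pgsql fstype=ext4
primitive postgresql systemd:postgresql op monitor interval=10s
group g-pg fs-pg postgresql
# The database group may only run where DRBD is master, after promotion
colocation col-pg inf: g-pg ms-drbd-pg:Master
order ord-pg inf: ms-drbd-pg:promote g-pg:start
```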
That's one option that we've provided, but of course shared storage is also a standard approach. So the stack, in a very simplified view, looks like this. This makes it look like you can only have two-node clusters, but as you'll see on the next slide, that's not the case in our implementation. It's an okay starting point, but it doesn't scale: just a single cluster with some form of shared or replicated storage, and the OpenStack components on top of it.

What we recommend is something a bit more like this. With our approach, we decided it was important to allow as much flexibility as possible in deploying the clusters, so you're free to choose how many clusters you have, what you put on them, and how many nodes you use for each one. This is a fairly standard configuration that we recommend: you have your OpenStack services in one cluster, which can scale out very easily; it's all active/active and uses HAProxy for load balancing. Then there's a network cluster for all your Neutron components, and your database and message queue on a third one. You could even split those out into more clusters if you wanted, and grow and shrink them. The only real limitation is that if you're using DRBD instead of shared storage, it generally has to be a master/slave pair.

So I'll show you this in action. Another quick poll: who has given a technical demo in front of a large audience before? And of those people, who had it work flawlessly? Yeah, you're lying. So I'm slightly superstitious, based on past experiences, and traveling with a cloud on your laptop is not the best thing, so you basically have to sacrifice a chicken to the demo gods before every demo. Sorry, chicken, but hopefully that will help. Yeah, so I'm doing it from here; I can just look on there, okay?
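The active/active services cluster described here could be sketched in the crm shell along the following lines. This is again a hedged sketch with made-up resource names, not the exact configuration Crowbar deploys:

```
# Each stateless API service runs as a clone on every cluster node
# (active/active); clients reach the healthy copies through HAProxy
primitive keystone systemd:openstack-keystone op monitor interval=10s
clone cl-keystone keystone
primitive glance-api systemd:openstack-glance-api op monitor interval=10s
clone cl-glance-api glance-api
# HAProxy itself is active/passive: it is grouped with a virtual IP
# and fails over with it to whichever node the cluster chooses
primitive vip-public ocf:heartbeat:IPaddr2 params ip=192.168.124.80
primitive haproxy systemd:haproxy op monitor interval=10s
group g-haproxy vip-public haproxy
```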
So this is our Crowbar web interface; again, Crowbar is the deployment tool that we use to automatically deploy a cloud, and it's the tool we've extended to deploy a highly available cloud. I mentioned earlier that it has these plugins, which Crowbar calls barclamps; I'll just show you all of them. There are some core barclamps for setting up basic infrastructure services on these machines, but if we scroll down, we start getting to the OpenStack-related plugins, and the top one here is Pacemaker.

We can have a look at this in a bit more detail. Here we only have a single cluster, but if you wanted, at this point you could enter multiple cluster configurations, and Crowbar would deploy them automatically just by hitting Create. So I'll show you the one that we've set up. What we've tried to do, and I think this generally makes sense for the specific use case of highly available OpenStack infrastructure services, is this: you know enough about the workload that automating its deployment to a cluster can be fairly opinionated. So we've only exposed the options that really make sense to be modified, and the complexity is hidden; hopefully you don't have to worry about it so much.

One example of a parameter that really needs to be exposed is the STONITH fencing device, because clusters need fencing devices. I won't go into the details now; we have a session later this afternoon where you can learn more about this. STONITH, in case you don't know, stands for "shoot the other node in the head", and it's a mechanism that effectively protects the cluster against data corruption. In this case we're using a shared block device, common storage between the two nodes in the cluster, as an out-of-band communication medium for the fencing. That's one option, and you can see it's been configured here, along with a few more parameters for the DRBD setup.

If you scroll down to the bottom here, this is how you define the cluster membership: you simply drag available nodes from here into here. It's obviously already in there, so it's not going to work now. So assigning nodes in the Crowbar infrastructure to clusters is as simple as drag and drop; then you hit Apply, and it automatically creates the cluster.

Once you have the cluster up, what does it actually look like with all the services running? Maybe I'm getting a bit ahead of myself: the first step is deploying the cluster, but then of course you have to deploy all the other OpenStack services, and that's just the standard process, but in a cluster-aware fashion. This is a view of the cluster with all the services running. We have a column for each node in this two-node cluster, and you can see the various services started. Most of them just say "Started", but the DRBD replicated block devices are in a master/slave pair.

So what I'm going to do, and this is why the chicken had to die, is kill some services, then actually kill one of the nodes, and we'll see what happens. Let's see if I can move this terminal over and make it a bit bigger.
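As an aside, the SBD fencing setup described a moment ago is typically put together along these lines. This is a hedged sketch: the device path and resource name are made up, and in SUSE Cloud these steps are automated by Crowbar:

```
# Initialize the shared block device as SBD's "poison pill" medium
sbd -d /dev/disk/by-id/scsi-shared-lun create

# Point the SBD daemon at the device (e.g. in /etc/sysconfig/sbd):
#   SBD_DEVICE="/dev/disk/by-id/scsi-shared-lun"

# Declare the fencing resource in Pacemaker's crm shell and enable fencing
primitive stonith-sbd stonith:external/sbd \
    params sbd_device=/dev/disk/by-id/scsi-shared-lun
property stonith-enabled=true
```

With this in place, a node that must be fenced receives its kill message through the shared disk, out of band from the network.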
So here I am on the main admin server, and I can SSH to one or other of the controllers in the cluster and run a Pacemaker utility to get a quick view of the cluster. I'm not sure how small I can go without it becoming unreadable at the back, but this gives us pretty much the same view as what we just saw in the web interface.

Here I have services like Keystone running, for example; hopefully you can see that at the bottom down here. So if I kill the... well, I'll just copy it. If I kill that process ID, 561, then go back to this view and let it automatically refresh... I'll try to maximize this window. Is it on the screen, or is it scrolled down? Well, anyway, here on the web interface it's a bit easier to understand. We see there's been a failure, but it's actually already been started back up. You can see it's running here; here's the little icon saying that the monitor failed, but it's already started it back up. And if I quit this and look at the Keystone process again, you can see it running now; this is the first one here, with a new process ID.

So that happened very quickly. That's not that amazing, because all it has to do is restart a single service on the node. So now let's do something a bit nastier. Let's kill this one, the second one; that's the one whose interface ends in 8f, which I believe is controller 2. Yeah, e6:8f. Great, okay. So, controller 2: this is all running virtualized.
So what I can do... I'll have to do it on this machine, actually. I'll leave this interface up, and I'm just going to literally power off the virtual machine for controller 2 on here, and then we'll watch the mayhem begin. By the way, this web interface runs on all nodes in the cluster; obviously, if it were running on the node that died, it would not be very useful for seeing what's happening to your cluster. So hopefully this is running on the right one and the web interface will stay up. Right, here we go: reset.

It usually takes a few seconds, because I think the default monitoring interval is 10 seconds, so on average you'll have to wait five seconds before it notices. So, yeah, everything stopped, actually. In that respect, I suppose it's not so dramatic, because this is largely an active/active cluster. We have HAProxy as a load balancer front end for most of the services, so in this case, for most of the services there's no failover, because they're already running on both nodes; the cluster is now just in a degraded state.

Oops, I'm not very good at using this pointer. Since when did buttons on touchpads become out of fashion? Just scroll down. Yeah, so perhaps I should have killed the other one; it might have been more interesting. But it gives you an idea of the experience that you get.

If you want to see more of this, like I said, we have a session, a hands-on 90-minute lab, a workshop at 4:20 this afternoon, where we let you build this entire setup from scratch on your laptop, assuming you have enough memory: really 8 gigabytes minimum, ideally 16 or 32. So you can actually build it out this afternoon if you come along, have a proper play with it yourself, and sacrifice more chickens. So, just to wrap up, one thing we wanted to do was... oh, do you want to switch back?
So, one of the things we talked about at the end was: okay, we've talked about the control plane. But my contention is... well, let me ask a question: how many people are hoping that they never have to run a pet in their cloud? If you're familiar with the pets-versus-cattle discussion (which I sort of don't like), the whole idea is that the way enterprise IT has historically worked is that every server in the environment needs to be taken care of: you make sure it's fully patched and running, you give it a name (Star Trek names used to be very popular). Whereas when you're in a cloud, you just give it a number, and if the server dies, you don't care. I think that certainly applies to physical servers in a cloud environment, but our view is that every workload running in the cloud is somebody's pet: they're running it to get some specific function done.

So when we started talking to people about the approach, we said the first step is that I have availability zones, with some of my host servers in each. This could be as simple as two different racks with two different power sources; typically you also want different network sources, and it can all be in the same data center. And if you don't want to set up your own data center, there are certainly hosted data centers that can guarantee you this kind of control. The idea is that if I lose availability zone A, or I lose availability zone B, I have servers that can pick up the workload. Again, as Adam said, using HAProxy and things like that, you can automatically balance your workloads across multiple physical servers. And we talked about the control node and making the services highly available.

There are a couple of different ways people are looking at potentially doing high availability for workloads. One is that you actually create an HA cluster out of the compute nodes, so that effectively Nova sees it as one big super node, and as you deploy VMs onto that cluster, it automatically takes care of the balancing. That's actually a fairly challenging task, because to make HA work you need resource agents, and you need to know how to monitor the individual workloads; but that is a solution people are looking at. The other is to do it at the VM level: you can use all the Linux high availability tools, assuming you're running Linux workloads, to build HA clusters out of virtual machines. That, I think, is probably the simpler way to go. You still have to worry about things like resource agents, although you can potentially do it at the hypervisor level and just treat everything running on top of the hypervisor as one big HA workload you need to take care of. As I mentioned earlier, Stratus Technologies has some technology that plays in this space, and I urge you to talk to them.

As Adam mentioned, we do have a hands-on session that Adam and Florian Haas from hastexo are presenting this afternoon. I guess it's 240, on the other side; I don't know, it's a bigger room. If you bring a laptop, you can stop by our booth downstairs and get all the files you need to set everything up in a VirtualBox environment, so you can spin up the virtual machines and configure the HA.

We've also got a couple of presentations: one of our technical SEs, Simon Briggs, is going to be presenting at the Marketplace Theater at 2 o'clock, and then at 5 o'clock we have Jason Anderson from Stratus coming to present their solution in our booth. I think he may actually be doing a demo; I'm not sure. So with that, we'll stop, if there are any questions that we can answer.

I should just mention something before we take questions. If you want to come to the session later, we blogged last week about the prerequisite hardware and software that you need; like I mentioned, you need quite a lot of memory, 8 gig absolute minimum. If you do have the hardware, come along to the booth and you can get the Vagrant boxes; you need Vagrant and VirtualBox installed. I'm not sure whether we have those available to give out as well, and it may be difficult to download them over the conference Wi-Fi if you don't have them already installed. But all that aside, feel free to just come along and watch, because we'll be showing it on the big screen, and all the resources are available, so you can try it out at your leisure any time after the event as well. If you come along, you'll get an idea of how it works. So, yeah: questions?

One at the back. Right, so the question is: what is the recommended... I will forget the... right, recovery point objective and recovery time objective. On the time aspect, that's kind of up for tuning, but our goal was basically that you don't need any manual intervention for any kind of critical outage that could impact cloud services, and that includes not just instances. Well, it's not about instance HA, it's about the services. So from that perspective, our timeouts would probably result in a recovery within the order of a few minutes.
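As a rough sanity check on those recovery figures, an availability target translates into a yearly downtime budget in a straightforward way. A small sketch (the `budget` helper is just for illustration):

```shell
# Print the downtime budget (minutes per year) implied by an
# availability target; 525600 = 60 * 24 * 365 minutes in a year
budget() { awk -v a="$1" 'BEGIN { printf "%.2f", (1 - a) * 525600 }'; }

echo "three nines (0.999)  -> $(budget 0.999) min/yr"
echo "four nines  (0.9999) -> $(budget 0.9999) min/yr"
echo "five nines (0.99999) -> $(budget 0.99999) min/yr"
```

So four nines leaves you roughly 53 minutes of outage per year, which is why automatic recovery on the order of a few minutes per incident makes that target plausible.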
So I think something like a four-nines level is realistic, and if you get aggressive with it, tune it the right way, and have the right hardware resources, then five nines might be possible for those services. For the admin server, like Pete already mentioned, it's more relaxed, because if that goes away, it's not going to impact anything other than your ability to scale out the cloud further. So that does require manual steps, and recovery would be on the order of probably hours, or maybe a day. Does that answer your question? Yeah. Anything else?

Yeah... that's for the workloads. Yeah, so that's for the workloads, and that's actually something we've taken a look at but don't have a solution for yet. Stratus is also taking a look at it, and they're a little further along, so I suggest you talk to them to understand how they're approaching it.

That's monitoring, right? So I think the question is about monitoring; let me just quickly explain monitoring in general. Pacemaker's resource agents will monitor all the resources in the cluster every ten seconds. But additionally, for the active/active services there's the HAProxy side: HAProxy is running as a resource in the cluster, so it itself is active/passive and can fail over, and HAProxy obviously has its own monitoring and timeout values. So it's a two-pronged thing, effectively. I guess the worst case would be if your node running HAProxy dies: then you've got the timeouts for failing HAProxy over, and then whatever HAProxy needs to do to start redirecting traffic to the... sorry? Sorry, I couldn't hear. Yeah, let's go to the microphone, great idea. Hello.

So let's say there is an existing cloud running, you know, two controllers in active/active mode, and I want to add a third node in. What's your mechanism? Sure, okay. Yeah, so that's actually just a drag and drop.
You go to the Crowbar barclamp screen. Well, first, obviously, that extra node, which can be deployed at any time, needs to register against Crowbar. It can be deployed from bare metal with PXE boot, or you can register it by running a script, and it will register against Crowbar. Once Crowbar has taken an inventory of the node, is aware of it, and has allocated it to the Crowbar hardware pool, you simply drag it into the existing configuration for that cluster. Yeah, so it's at the top; scroll up a bit. I can't do this looking backwards. There, yeah, up one more. It's not just me. And then the Edit button, and then you have to scroll all the way down to the bottom. Yeah, so a new node would appear there on the left-hand side, say controller 3 here, and you just drag it in and hit Apply. Then Crowbar orchestrates the whole configuration management run.

I can't remember whether I mentioned it, but Crowbar uses Chef for configuration management behind the scenes, so there's a Chef server running, and it runs the Chef client on all the nodes that are affected. That will install the required Pacemaker packages, configure them, and do everything, basically. So it would probably take about two or three minutes, I guess, once you have the node, plus however many minutes it takes to provision the node from bare metal in the first place.

Any other questions? Yeah... I have to admit I'm not a RabbitMQ expert. I believe there's ongoing work... is there ongoing work? Yeah, you can have mirrored queues, or it can depend on shared storage. All right. Yeah. Okay.
Yeah, so there is. And is that supported in, like, Icehouse? Yeah, so there are other architectural options for RabbitMQ, and of course you don't have to rely on RabbitMQ itself for that; we use either shared storage or replicated storage. But we may look at supporting other options in the future as well, depending on demand.

Okay, I think we're supposed to wrap up now. Obviously, if you have any more questions, we're here all week. Thank you. Thanks.