Okay, welcome. We're going to continue in the tech deep dive track, talking a little bit about high availability. There is a trigger warning for this presentation: if you have been traumatized by natural disaster, there's imagery in this presentation that might bring back unpleasant memories.

So, I've been doing this high availability update, or some form of talk about high availability, for three OpenStack Summits in a row — it's becoming sort of a fixture in the program. Still, we have very many newcomers: for how many of you is this your first OpenStack Summit? Plenty, right? So you will be asking yourself this question, and that's perfectly fine, if you're wondering who the heck is this guy up on stage. My name is Florian Haas. I am one of the founders, a principal consultant, and the CEO at hastexo. We're a professional services startup; we do the whole consulting nine yards, from architecture to implementation to troubleshooting to performance tuning and training, for high availability, distributed storage, and OpenStack. The first URL up here on the slide is my sort-of-official corporate bio, if that's what you want to call it, and the second one — that short link — links to my Google+ page. So if you want to participate in my ramblings about cooking or gardening or brewing ginger beer, or kids and family and dogs, and occasionally OpenStack, by all means please feel free to connect. There's my email address. I'm one of those strange few holdouts who actually don't have a personal Twitter account, but there's one for the company. We also happen to be exhibiting here, so if you want to talk to us, you can find us at booth C8.

Okay. I want to talk today about high availability in OpenStack, and when we talk about that, there are really four different things that we need to look into. A few of these things have already been mentioned in the previous talk, and I presume in other talks and sessions at the design summit and the conference. Those four areas are: the infrastructure layer — that is the stuff that sort of underpins an OpenStack cloud; there are two categories of that, some of those are services that are part of OpenStack, and some are services that OpenStack merely consumes but does not develop. Then we have the compute layer, OpenStack Nova; in the compute layer there have been a few interesting changes in the run-up to the Grizzly release and with the Grizzly release, and there are also going to be some interesting changes as we make our way into the Havana release, so there's interesting stuff coming there in the next few months. Then we have OpenStack Networking, the subproject formerly known as Quantum. There, there have actually been some fairly significant changes: OpenStack Networking kind of trailed behind in terms of its HA features in the Folsom release, and a lot of that has changed a great deal for Grizzly. And then finally, the subproject where I personally think there's been the greatest amount of change, in terms of new features and new functionality being incorporated, has been OpenStack storage — more specifically OpenStack block storage, the Cinder project. I'm going to touch upon all of these, plus a few little odds and ends and extras that didn't quite fit into any of these categories.

And I want to start with the infrastructure layer. Now, as far as the infrastructure layer is concerned, there's actually not that much that has changed from the Folsom to the Grizzly release: a lot of what we can do in terms of infrastructure high availability in Grizzly, we were already able to do in Folsom, with much the same tools.
The changes that have occurred are not insignificant, but they're not exactly overwhelming.

As far as the infrastructure layer is concerned, there are, as I mentioned earlier, some services that are part of OpenStack itself that are crucial to the infrastructure — such as, for example, the registry services, the API services, the Keystone authentication service, and so forth. And there are services that OpenStack merely consumes. They're very, very critical to an OpenStack private cloud, an OpenStack public cloud, an OpenStack hybrid cloud, an OpenStack cloud running inside OpenStack — all sorts of interesting things that people are doing — but they're not fundamentally part of the OpenStack project itself; OpenStack merely consumes them. Examples of that are the relational database that we use for data persistence in an OpenStack cloud — most people are using MySQL, some are using Postgres, some may be using other relational databases supported by SQLAlchemy, which is the ORM layer that pretty much all of the relevant OpenStack services use. Then we have AMQP, our message bus, and there we have several options: we can be using RabbitMQ, we can be using Apache Qpid, we can be using ZeroMQ. So those are the services we're talking about there, and there are a few extras as well, such as the Apache that runs and maintains the OpenStack dashboard, and other things. So, two categories — but they're the kind of stuff that makes an OpenStack cloud actually work and basically keeps things moving.

When we look at the infrastructure layer, the interesting thing from the HA perspective is that we really have to consider five different types of nodes in an OpenStack cloud, in terms of their HA capabilities and in terms of the things we need to do to make them highly available. This list of node types is not authoritative — it's just something that quite a few people agree on; it's not written down in a charter or bylaws or that sort of thing, it's just a practical categorization. So for these five node types, I want to go through them one by one and explain the specific HA considerations that we have to make for them. The five node types are: the cloud controllers, the API nodes, the network nodes, the compute nodes, and the storage controllers.

Now, what do we mean by all of these? The cloud controller runs the services that underpin an OpenStack cloud: it runs things like the relational database, it runs things like an AMQP messaging server, but then it also runs important registry services and that sort of thing. As for the cloud controller, it very much depends on what the actual back-end services are that we're using whether we can make our high availability essentially active/passive, or whether we can use active/active and get a certain amount of scale-out. Generally speaking, all things considered, most people will be deploying their cloud controllers in active/passive failover pairs — that's just the approach that is most practical. For the API nodes, that is vastly different, and I'll explain briefly why. The stuff that lives on a cloud controller has persistent state, and this is particularly true for the database.
That data goes onto storage that is either shared or replicated; that data needs to persist, and it is sort of local to the individual hosts by default — we need to make sure that it isn't, so we can actually achieve failover.

The API services, however, by contrast, are fundamentally stateless locally. That is to say, if I have an API service instance, all it does is interact with the AMQP message bus for data that is essentially volatile — it has a lifetime of about 30 seconds or less — and anything that needs to be persistent goes into the relational database; it does not go into any local file storage or anything of that nature. So as far as the API nodes are concerned, we can — at least in theory, and at least for most of them — have as many of these API services as we want, and they will scale out pretty much automatically. What high-availability management services help us do is things like making sure that we have, say, three instances of a specific API service running at all times in this cluster of seven nodes — just for the sake of argument, just pulling numbers out of thin air. That is something that a high-availability management suite can help us do.
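Roughly, in Pacemaker's crm shell terms, that "keep n copies of a stateless service running" idea is expressed as a clone. This is just a minimal sketch, assuming the OpenStack OCF resource agents from the openstack-resource-agents project are installed — the resource names, config path, and instance counts here are purely illustrative:

```
# Keep three copies of the Glance API running somewhere in the cluster,
# at most one per node (names and counts are examples only).
primitive p_glance-api ocf:openstack:glance-api \
    params config="/etc/glance/glance-api.conf" \
    op monitor interval="30s" timeout="30s"
clone cl_glance-api p_glance-api \
    meta clone-max="3" clone-node-max="1" interleave="true"
```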
The network nodes — these are very interesting. The network nodes are the ones that, if you are deploying a cloud that makes use of OpenStack Networking, formerly Quantum, take care of routing between the tenant networks, the management network, and the external network. So that's basically kind of like your upstream router — and you could have several of these, in theory. A network node typically also runs the Quantum — sorry, the OpenStack Networking — DHCP agent, which provides IP addresses to virtual machines in tenant networks through DHCP. All of that, at least up until Folsom, was essentially active/passive only, and having that in an active/passive configuration only is actually not such a great thing — but I'm going to get into the network node in more detail when I tackle networking.

Then we have the compute nodes. Those are our virtualization hypervisors, the ones that run our guests, our virtual machine instances, and those are of course also pretty much naturally scale-out. But if I want to enable a system where the cloud in fact takes care of high availability of these virtual machines, these guests — that is, if a virtual machine runs on a particular host and that host goes down for whatever reason, I lose it, and I then want to auto-recover it on another one — I can't do that unless I do a little extra homework.

And then finally, the storage controller. This is a very interesting one, because what kind of HA we can do with it depends greatly on what kind of block storage back end we are using. The block storage server, the Cinder server, can be essentially just an API node that stores no data locally — which would be the case, for example, if we're running Cinder with an RBD back end, a Ceph block device back end. Or it can be highly stateful, because it keeps a lot of data — the specific data that goes into the volumes — locally, which is for example the case in the default Cinder implementation with a local LVM and iSCSI back end.

So for these five node types, like I said, the kind-of bad news is that we need to look at them separately when we talk about HA. The good news is that we can — we don't have to, but we can — use essentially the same high-availability stack for all of these, just with different configurations, and that HA stack is the Pacemaker cluster stack. Pacemaker is essentially the default, state-of-the-art high-availability stack on the Linux platform. What Pacemaker has going for it is that it's been around for a really long time, and high availability — the cluster communications and the management of cluster resources — is hard; it is not an easy problem. What Pacemaker has done, over the better part of a decade, is basically bang out all of these corner cases that we've seen in the project and fix them. Which is why, even though some people may consider it overkill, and some people may hate its usability, it is in my own opinion the best-suited tool for the job of achieving high availability for infrastructure services. We have reference implementations for Pacemaker, and the underlying Corosync cluster messaging layer, for all OpenStack infrastructure services, and we actually have two people here in the room today who were very instrumental in building these reference configurations — Emilien is up here, and Sébastien... where are you? Where'd Sébastien go? He was here a moment ago. Anyway, they'll be following me after this talk, so they will be telling you more about that as well. So we have that; it's there. I will readily admit the documentation is still lacking — that will totally surprise you when it comes to OpenStack, I'm sure — but it's there, it's doable, it works. I know: I've personally deployed HA OpenStack private clouds in production with Pacemaker, and I know others have too. It's not rocket science; you can do it.
It's perfectly possible. This is a schematic example of a Pacemaker configuration for a node type that requires the management of stateful data — that, for example, would be the case for a cloud controller. What you're doing is putting your data on some form of storage that is either shared, which is what you would do in a classic SAN setup, or replicated in some shape or form — and that replication can be one of several types: we can do block-based replication, we can do database replication, all sorts of things. And then you have a bunch of other OpenStack services that basically just talk to this cluster as if it were a single node. The way we do that is we have these services — which are all IP-based — listen on a virtual IP address, and the IP address fails over along with the rest of the services. So when a service needs to talk to the cluster and it hits one of the physical nodes, and that node then goes down, the service just reappears on the exact same IP address, with the same data. To the consuming application it looks essentially like a network hiccup, as if the application was gone for a few moments — and that's it. Generally speaking, by and large, the other OpenStack services don't really notice interruptions like that.

So that was the example for a stateful node. The stateless one is essentially the same configuration, except it's simpler, because we don't have to worry about shared or replicated data. What we're making sure of is that we've got a cluster management service running on x number of nodes, and then we can create a configuration such that we are always keeping n instances of a specific service alive. We can do this in several ways.
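To make the stateful flavor of this concrete: in crm shell terms, a schematic sketch might be a DRBD-replicated volume, a filesystem, a virtual IP, and MySQL moving together as one failover unit (the stateless flavor is just a clone, like the API example earlier). The addresses, device, and resource names here are made up for illustration; a real cloud controller would carry more resources — AMQP, registry services — in the same group:

```
primitive p_drbd_mysql ocf:linbit:drbd \
    params drbd_resource="mysql" \
    op monitor interval="15s" role="Master"
ms ms_drbd_mysql p_drbd_mysql \
    meta master-max="1" clone-max="2" notify="true"
primitive p_fs_mysql ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/var/lib/mysql" fstype="xfs"
primitive p_ip_mysql ocf:heartbeat:IPaddr2 \
    params ip="192.168.42.100" cidr_netmask="24"
primitive p_mysql ocf:heartbeat:mysql \
    op monitor interval="30s" timeout="30s"
# Filesystem, virtual IP and mysqld always move together,
# and only run where the replicated volume is Master.
group g_mysql p_fs_mysql p_ip_mysql p_mysql
colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start
```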
Okay, let's talk about compute. Compute is more interesting here because, like I said, most of the stuff that we can do for infrastructure HA we could already do in Folsom, and there aren't really that many changes in the Grizzly release. Here, that's very, very different: we've actually had guest HA addressed in the Grizzly cycle.

And here's one thing that we've always been able to do, except that many people didn't know about it — a little hack for nova-compute. What you can do is override the host name that Nova records in the database, and Nova has a flag called resume_guests_state_on_host_boot. So, you have two nodes — let's call them Alice and Bob for the sake of argument — and you define, essentially, a compute cluster; let's call it compute1, for the sake of argument. You can fire up a guest there, or dozens, or hundreds, and they will report into the Nova database as coming from host compute1, as opposed to host Alice. Then, if Alice goes down, we cut over to Bob. Nova comes back up and says: okay, well, I'm compute1, I now need to look into the database — what are the virtual machines, the guests, that should be running on this node? And if we have this resume_guests_state_on_host_boot flag set, what it will do is compare what it gets from the database with what libvirt tells it, and then figure out: wait, I should be running 20 virtual machines here and I'm not running any, so I'm going to fire them up — and boom, there are my guests, and they're still available.

That's a bit of a hack, because it has a few issues. Namely, it breaks live migration between those two hosts, because I can't now say "live-migrate from Alice to Bob" — Nova doesn't care; as far as it's concerned, there's only compute1, and there's nothing else there. So I can't live-migrate from Alice to Bob that way. However, I can live-migrate from compute1 to compute2, another cluster that I have in the same system. And it has some safety issues with volumes — that is to say, Cinder volumes. However, those can be very easily mitigated by use of proper fencing, which is what you should do anyhow in any Pacemaker cluster, or in any cluster for that matter, no matter what cluster management infrastructure it uses.
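In configuration terms, the hack I've just described amounts to two nova.conf settings on both members of the failover pair — "compute1" here is just the example cluster name from the Alice-and-Bob story, not anything Nova mandates:

```
# /etc/nova/nova.conf on both Alice and Bob (illustrative)
[DEFAULT]
# Report into the Nova database under a shared, cluster-wide host name
host = compute1
# On (re)start, bring up any guests the database says belong to this host
resume_guests_state_on_host_boot = true
```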
Host evacuation — this is a new one in Grizzly. We have a syntax that now goes: nova evacuate, the name of a virtual machine, and then the target host that we want to evacuate it to. The use case here is that a physical node has gone down, and now I want to reassign its guests to another host. And we have a variant of that, which is the --on-shared-storage flag. The difference between the two is that the first one assumes the storage for the guest is essentially ephemeral, so it just recreates the virtual machine on a different node from the same image, with the same configuration and so forth — and then it creates a new password for the thing and spits it out on the command line. If we do --on-shared-storage, what I've told Nova with that is: all of its data — /var/lib/nova/instances — is on shared storage; we don't need to recreate anything, we can just fire this thing back up.

Now, this sounds kind of great — and it made Grizzly, which is awesome. By the way, guess which zoo this grizzly is from: San Diego. But the problem that I'm having with evacuate is that I think it's a bit of a misnomer. Put yourself in the shoes of an emergency management official faced with an impending natural disaster. When would you evacuate the city that you are responsible for? Would you do it when the storm is still 200 miles out to sea, or would you do it after your city has been leveled? Most people would prefer the former; nova evacuate actually does the latter. You can't evacuate a host that's not down — which is a bit strange. It will actually tell you: sorry, this host is not down. So this is not quite fully baked. I mean, it's great — it's much better than the stuff that we had in Folsom, which was basically "get into the MySQL database and hack this column", which is kind of bad. But what we would of course like to see later on is to be able to say "nova evacuate this host" — ideally not per guest, but just per host — and tell Nova: I don't care where these machines go, but I need them recovered now; at best, schedule them wherever the scheduler determines and fire them up there, right? So that's what I'd kind of like to see with that.

But for now it has these limitations. It is per guest. It is only supported from a down host. And there is no automation that goes: okay, take everything that's on this host and move it somewhere else. You can't script that out of the box — of course, you could enumerate all the guests that are currently running on a host, go through those, and then reassign them, as in the sketch below. So that's an interesting one.
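Roughly, the command-line shape of this is as follows — the host and instance names are placeholders, the loop is just a sketch of the "enumerate and reassign" idea rather than anything Nova ships, and it assumes your novaclient can filter the instance list by host:

```
# Rebuild one guest from its image on a surviving host (ephemeral storage):
nova evacuate <instance> <target-host>

# If /var/lib/nova/instances is on shared storage, keep the disks as they are:
nova evacuate --on-shared-storage <instance> <target-host>

# No per-host variant exists yet, but you can approximate one, e.g.:
for id in $(nova list --host failed-node --all-tenants | awk '$2 ~ /-/ {print $2}'); do
    nova evacuate --on-shared-storage "$id" surviving-node
done
```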
And, as you probably guessed, we don't have support for it yet in the OpenStack dashboard. That's actually relatively common: whenever we get a new feature in compute, it first makes the JSON APIs, then it makes the CLI, and then it makes the dashboard. And since we have timed releases, and we don't do things like "we're going to drag out the release because we're waiting for a feature", that just didn't make Grizzly, and it's going to be in Havana.

Here's another interesting one: VM ensembles. This is an idea that I really like, and the idea here is that you can instruct Nova to group guests in a resilient fashion. Consider this: suppose you have an application that you want to deploy together — it's a three-tier application, you've got six virtual machines in total, two database back ends, two middleware servers, and two front-end servers. Now you would like to be able to tell Nova that it would actually be kind of cool if it didn't put the two database instances on the same physical node, right? Ensembles will allow you to do that, and also add a bit of a convenience switch, such that you can actually do a nova boot for a full ensemble as opposed to just a single guest. Did that make Grizzly? Not quite. It just barely did not make it — it was pretty close, but it didn't quite make it — but it's going to be in there in Havana. It's currently in review; at least that's what I saw when I checked yesterday, the Gerrit change was a review in progress, so I guess we can expect it. For now, we have a workaround in Grizzly. It's not quite as elegant, but it can still be done: if I need what ensembles would do for me, I can do that by using the filter scheduler in combination with the affinity filter or the different-host filter, with which I can say: don't run these two guests on the same host.
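With the DifferentHostFilter enabled in the filter scheduler, that workaround looks roughly like this on the command line — the image, flavor, and UUID are placeholders:

```
# Boot the first database guest normally, note its UUID, then tell the
# scheduler to keep the second one off that guest's host:
nova boot --image <image> --flavor <flavor> db1
nova boot --image <image> --flavor <flavor> \
    --hint different_host=<uuid-of-db1> db2
```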
Networking. Generally speaking, we can use the same approach that we use for everything else — active/passive failover with Pacemaker. It works for quantum-server (well, in Folsom it's still called Quantum, so I'll call it quantum-server), it works for the L3 agent, and it works for the DHCP agent. It had some limitations for the DHCP agent in Folsom which I'm not going to get into in too much detail — you're likely not going to care; there were some minor limitations. And we did not get very good scalability from the active/passive model, specifically for the L3 agent, because the only thing we could do with the L3 agent was essentially an active/passive failover from one node to another. That doesn't change the fact that pretty much all of the upstream network traffic still goes through the L3 agent, and so the only thing you could do for the network node, in terms of scalability, was scale it up — whereas all of the rest of OpenStack is all about scale-out. Which wasn't quite pretty in Folsom. It was workable — but then again, there weren't that many people actually deploying OpenStack Networking, as it was released for Folsom, for large clouds in production, because it had other limitations as well. In Grizzly we have this thing called the Quantum scheduler, and that is a patch that allows us to run multiple DHCP and multiple L3 agents, and it removes that scalability bottleneck: we can now scale out the L3 agent. And then there's other stuff that was improved: we now finally have Quantum security groups, we can have per-tenant router networking that doesn't break the Nova metadata API service, because we have a Quantum metadata proxy, and so forth. And that did make Grizzly — yay — so it's in there, and you can use it, and you can deploy it, and that's wonderful.

And now the really, really interesting part, in my opinion, in the last few minutes here: storage. I think it's fairly safe to say that whatever kind of storage you're using, if it's not supported by Cinder in the cloud that you're running, that's your own fault — you should upgrade. The added functionality that we've got in the Cinder project, for block storage, is nothing short of jaw-dropping. For the Folsom release, we essentially had the iSCSI back end with local LVM, which supported two iSCSI target implementations, IET and tgt; we had Ceph RBD; and then there were a few drivers of, you know, various shapes of usability and production quality. Now there are, I think, 14 or 15 new drivers in Cinder, and Cinder now becomes essentially a pluggable API service that actually doesn't care about storing the data at all anymore — it hands all of that off to storage back ends, and it can use literally a boatload of enterprise storage back ends on top of everything it previously supported. This is just a very short excerpt of the list of new drivers that we've seen in Cinder for the Grizzly release: we now have a generic file back end, so you can actually put qcow images on a file share and serve them up as volumes for Cinder — it's kind of strange that that didn't happen before, because it almost sounds trivial — and that we can use for NFS, and we can also use for GlusterFS. There's a driver for HP LeftHand, there's a driver for 3PAR, there are new drivers for EMC, et cetera, et cetera. So there's a bunch of storage back ends that are very popular among people building highly available clouds, and you can now just plug them into Cinder and use them that way.

We still kind of have to use Pacemaker in this setup for the cinder-volume API node, for the simple reason that it does something rather silly — I mean, now it seems silly; it seemed fine when it was originally designed — and that is that cinder-volume actually records in the MySQL database which host made a volume available to the Nova cloud. That makes sense if that volume is actually physically stored on that host; it really doesn't if it lives in a Ceph cluster, or in GlusterFS, or on a NetApp filer using NFS. So there's a little hack that we need to implement there to make the cinder-volume service highly available, and interestingly it's pretty much the same one that we implement for nova-compute: we override the host name, we put this thing under our Pacemaker management, have it listen on a virtual IP address, and then we can kill a cinder-volume node and it will happily continue to serve these volumes to existing VMs — attachments never break, et cetera. So that works very well. It worked in Folsom as well; it has just gotten a lot easier now.
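In configuration terms this is the same trick as on the compute side: a shared host name in cinder.conf on every node that can run cinder-volume, with the service itself under Pacemaker on a virtual IP. The name here is illustrative:

```
# /etc/cinder/cinder.conf on every node that may run cinder-volume (illustrative)
[DEFAULT]
# Record volumes under a cluster-wide host name instead of the local hostname
host = cinder-cluster-1
```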
And there is a bug filed against Havana in which John Griffith basically says: well, actually, recording the host name in the database is kind of silly, let's do away with that. I'm hoping that's a trivial fix and that it's coming in time for Havana.

And then there are a few random bits and pieces that changed, or are changing. We're getting libvirt watchdog support in Nova and Glance. That is something that some HA cloud providers like to see: it enables you to define a guest that actually has a watchdog device, and if the guest encounters essentially the equivalent of a non-maskable interrupt, or a machine check exception, or something like that, it can remove itself from the cluster, if you want to do that. And there is also a flag being enabled in Glance, so in Glance you can have images — image templates, essentially — where that is already enabled, which is also very useful.

Heat: a lot of stuff related to high availability happens in Heat these days. Heat is, of course, our orchestration layer in OpenStack — Brian mentioned it in the keynote this morning as well. It's a very, very hot subproject that, just before the Grizzly release, emerged out of incubation and became an integrated project, and we'll see its first fully supported release with Havana. The same thing, by the way, is true for the Ceilometer metering and billing subproject. So a lot of interesting stuff related to HA is also going on in Heat.

RabbitMQ has — not RabbitMQ itself, but the library that we use to connect to RabbitMQ, which is kombu — has gained the ability to be configured not with a single rabbit host, but with a list of rabbit hosts. So if we're using mirrored queues in RabbitMQ, we can tell all of our services that consume the AMQP bus that they should be talking to a list of hosts rather than a single host, and they can then automatically do client-side failover to another host. So that's kind of cool. That happened, I think, in the kombu 2.5.0 release, which was late last year, something like that, if I recall correctly.
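On the OpenStack side that shows up as configuration options rather than code; a sketch of what it looks like in nova.conf — the same options appear in the other services' config files, and the broker names here are made up:

```
# /etc/nova/nova.conf (illustrative broker names)
[DEFAULT]
# A list of brokers instead of a single rabbit_host; the client fails over
rabbit_hosts = rabbit1:5672,rabbit2:5672
# Declare queues as mirrored so every broker carries the messages
rabbit_ha_queues = True
```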
ZeroMQ is also very interesting. ZeroMQ completely does away with the idea of a brokered messaging layer, which is kind of nice — both RabbitMQ and Qpid are brokered, to a certain extent, whereas ZeroMQ is completely peer-to-peer. ZeroMQ had interesting limitations with OpenStack Networking in Folsom; specifically, there were some interesting bugs, interesting problems, appearing with the DHCP agent and ZeroMQ. We're going to see how that holds up in the Grizzly release.

Another thing that's very interesting in the database space is that MySQL/Galera is firming up. Galera is an extension to the MySQL database that implements wsrep — write-set replication — which is finally a means of doing synchronous database replication in MySQL, and it's being hardened; we see a lot of industry uptake for it, also in database land completely unrelated to OpenStack. So that is also a very promising technology. We do not yet have, in SQLAlchemy or in any of the OpenStack database layers, an equivalent of kombu with multiple rabbit hosts — so we can't use MySQL connection strings with a list of host names instead of a single host name and then do automatic client-side failover — but maybe we'll get that.

And then I'm absolutely certain that there is something really, really interesting, really important for HA, that I just omitted because I didn't have more time. There's super exciting stuff happening in RBD, specifically when you use RBD back ends for both Glance and Cinder, and there's really interesting stuff also happening in the network layer, and several other things. I just can't tell you more in the time that I have here. But what I can assure you is that we have a lot of stuff going on in the HA space in OpenStack, which is really a good thing, because a year ago, when I did the first one of these talks, we were still discussing: do we actually need HA at all? Because — well, now you're laughing, but at the time — a large portion of the OpenStack user base was essentially interested in building, you know, massively scalable architectures that are essentially all comprised of cattle, in the pets-and-cattle analogy. And then, about a year ago, people started looking at OpenStack for something that would just completely reorganize their existing data centers, and there they don't have the luxury of having one application that they need to scale out to thousands of nodes but can rewrite from scratch if they need to. Instead they have hundreds or thousands of applications that they need to run unchanged, and they need the cloud to provide high availability for them. And quite frankly, I don't care if AWS can't do it: I want OpenStack to be better than AWS in that regard — and not just in that regard.

No doubt some of you are going to be asking for the slides: the top URL is just the slides themselves, and the bottom one is the source on GitHub. This material is all CC BY-SA 2.0, so feel free to reuse it; just credit your source and we'll be fine — I will not send you a horse head or anything of that nature. So please, by all means, feel free to use this and reuse it, and take it to your user groups, your meetup groups, your companies, wherever you would like to take it. And for those of you who are now using your phones to photograph this and then maybe OCR it, or any other geeky stuff you want to do — why don't you use this? You know, that also just works.

Okay. All right, how are we doing on time? I have three minutes for questions. Yes, back here — I'm sorry, I acoustically didn't get that; could you come up a little bit — you're talking a little quietly — or run to the mic? That'd be awesome, then I don't need to repeat. That'd be great. Yeah, I've got you. Oh, okay: could I use a monitoring tool in conjunction with evacuate? Yes, absolutely, of course. What I would like to see, in fact, is something that actually monitors whether a node is there or not — and that is actually something we could build into a Pacemaker resource agent — and then, if it's not, evacuate. That would be really kind of cool.

So, what about Heat or Ceilometer as a monitoring tool to combine with that? Well, Ceilometer can basically generate the event that a node is down, although for that I would also rather use an HA suite — but that's preference. And then we could have something that reacts to that event, of course. Okay.

Okay, here. Okay — so, yeah, the comment was that we should distinguish better between the availability of, essentially, the cloud itself — the infrastructure underpinning the cloud — and the availability of the services. Yes — so restarting a failed VM would be an HA service; yes, exactly. So, to clarify: absolutely, AWS definitely has high availability of the infrastructure, and we've essentially got that covered now — that's the part where we can match AWS. But the part where it would be interesting to actually get better is to add, you know, the support for automatic virtual machine recovery. Okay — if you want to call that high availability of services, that's okay; we have two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.

All right. Okay, thank you very much. I'm sorry, we're out of time for questions; please feel free to grab me in the hallway outside, and I'm also going to be here all day tomorrow. You can also find me at booth C8, and you can shoot me an email and connect with me on Google+ and whatnot — I'll not be hard to find. I thank you very much for your attention, and enjoy the rest of the summit.