So, let's start. Good morning. My name is Danny Al-Gaaf, I'm working for Deutsche Telekom, the biggest telco provider in Germany, and today I'm giving a talk about building a highly available cloud with Ceph and OpenStack, giving you an insight into the failures that can happen when you set it up and how you mitigate them.

A short overview first: I will speak about the motivation for this talk, then about availability and service level agreements, then about data centers, and after that about OpenStack and Ceph, explaining the architecture, the HA setup and some other interesting topics. After that comes the main question of this talk: can OpenStack and Ceph work together in an HA setup?

The motivation for this talk: Deutsche Telekom is building an NFV cloud based on OpenStack, KVM and Ceph. Our data center design consists of two types of data centers. One type is the backend data centers. There are only a few of these; they are classical data centers with a high level of availability. Usually they have two cores, meaning two buildings or two fire compartments. They have a high service level agreement for infrastructure and service; we host our special services there and keep all private and customer data there. Then we have a lot of front-end data centers spread over the countries in Europe and over Germany. They are very small, only a few racks, very near to the customer, hosting the services that should be there. They have a lower SLA, and we host our NFV applications there.

Let's talk about availability. Usually availability is measured relative to 100% of the operational time you expect, so you all know these nines, from three nines up to, in this case, six nines. What we are usually aiming for is something between four and five nines. As you can imagine, it would be really bad for a telco if the core services behind your phone failed more often than that. This talk is mostly about the five nines case; there you have something like five minutes of failure per year, but we would like to be better than that.

So what is high availability? It is the consistent availability of the system you build in case of a failure, and with a failure I mean hardware and infrastructure, but it could also be software. Which availability is the most interesting for us? You have the availability of the server, the availability of the network, the availability of the data center and of the cloud you build. For us, since we don't host a public cloud but a private cloud for our own services, the most interesting one is for sure the end-to-end availability: the availability of the service to the customer.

The calculation of this availability is quite complex. It includes a lot of components, and each of these components contributes to the overall availability of the service: the infrastructure we build, the hardware, the software, the processes we have, all of that contributes to it. It is also influenced by the likelihood of a disaster and by the failure scenarios.
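To put numbers on those nines, here is a quick back-of-the-envelope sketch in Python; purely illustrative:

```python
# Downtime budget per year for a given number of nines.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(nines: int) -> float:
    unavailability = 10 ** (-nines)       # e.g. five nines -> 1e-5
    return unavailability * MINUTES_PER_YEAR

for nines in range(3, 7):
    print(f"{nines} nines -> {downtime_minutes_per_year(nines):8.2f} min/year")
# 3 nines ->  525.96 min/year (~8.8 hours)
# 4 nines ->   52.60 min/year
# 5 nines ->    5.26 min/year (the ~5 minutes mentioned above)
# 6 nines ->    0.53 min/year
```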
That calculation is very complex and I don't want to go into detail, because it's not the main topic of this talk. Usually, depending on the SLAs you agree with your customer, planned maintenance time may for example be excluded, so there could be more downtime than you might expect. In our case we hope to have no impact from the maintenance process at all.

Let's come to the data centers. You usually have a set of failure scenarios for a data center. One is for sure the power outage, external and internal: the incoming power, but also the distribution inside the data center could be broken, and the backup, the UPS or the diesel generator, could fail, which would be a problem. The other one is of course the network, one of the most important for us. That means external connectivity, but also the inside: failures, misconfiguration, cables, switches or routers, everything related to that. And for sure the failure of a server, which in a highly distributed HA setup we consider not a problem; the failure of a single server shouldn't affect us. On the other side, for sure, the failure of a software service.

As you all know, human error is still the leading cause of outages. There are issues like misconfiguration of the network or of other software services, and there are accidents: some guy pushes the emergency power-off button by accident because the button was not protected, somebody cuts cables, or the well-known epic failure where you plug a network cable into the wrong port and reset the whole switch. All of that can happen. The external disaster case is also often part of the calculation: issues like a fire or an earthquake. Part of the calculation is usually even how near the data center is to an airport and whether an airplane could crash into it, and unfortunately also things like how near the next nuclear power plant is.

You probably know the tier classification for data centers from the Uptime Institute. What we are using is something between tier III and tier IV, meaning that parts of the tier IV requirements are fulfilled by our tier III data centers. Below that it's not that interesting, but as you can see, even with a tier IV data center you don't get five nines. That is something you need to achieve, in the end, on the software level or through other means, as the short calculation below illustrates.

So how do you mitigate such cases? The first step is: find your single points of failure.
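A rough sketch of why the data center tier alone can't get you there: availabilities of components in series multiply, so whatever sits on top of the facility always pulls the end-to-end number below the facility's own rating. The tier figures are the Uptime Institute's published values; the other component values are illustrative assumptions, not our real numbers:

```python
# Components in series multiply: the end-to-end availability can never
# exceed the weakest factor. Tier figures per the Uptime Institute
# (tier III 99.982%, tier IV 99.995%); the other values are purely
# illustrative assumptions.
components = {
    "data center (tier IV)": 0.99995,
    "network":               0.9999,
    "server hardware":       0.9999,
    "software stack":        0.9995,
}

end_to_end = 1.0
for availability in components.values():
    end_to_end *= availability

print(f"end to end: {end_to_end:.4%}")   # ~99.93% -- far from 99.999%
```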
Then you usually have a lot of redundant components: power and network, servers, services, everything should be redundant. It requires very careful planning related to network design and power management, but also around topics like fire suppression, the processes that are in place if something happens, disaster management and monitoring.

As I already said, the reason why we don't use tier IV data centers is that they are much too expensive for our usual use case; it's probably not worth the money to run a tier IV data center if you can solve the problem another way. Our idea is to use an HA concept on the cloud and application level instead. One example of what we do in our data centers: we use a spine-leaf architecture on our internal network, meaning we have redundant leaf switches, management switches separated from the normal traffic, and the spine and the DCR, the data center routers, redundant as well. Also, each server has multiple redundant NICs and redundant power lines and supplies. That's the usual way.

So let's come to Ceph and OpenStack. You have seen this slide for sure more than once: the Ceph architecture is built on RADOS. You have three main components. One is RBD, the RADOS Block Device. Another is the RADOS Gateway, providing object store. And there is CephFS, which we for example don't use because it's not production-ready yet. What's most interesting for this whole discussion is the RADOS Block Device.

If you only look at the RADOS Block Device, Ceph consists of OSDs, the object storage daemons. You have a lot of them, usually one hard drive is one OSD; they store the objects on disk and handle all the replication and recovery. The other important part is the monitors. They maintain the cluster membership state and also the CRUSH map, and they need to form a quorum, which they build through a Paxos protocol. They are very small and lightweight, and usually you should use an odd number of them; below I'll show a quick way to inspect that quorum from the outside.

So where do Ceph and OpenStack come together? These are the OpenStack projects using Ceph: the RADOS Gateway for object store; Cinder, Glance and, in the end, Nova, to back the disks of the VMs; and there is also Manila, which can use for example CephFS or RADOS block devices.

Which components need to be HA? On one side the control plane of your cloud: everything you need for provisioning and management, all the API endpoints of the different OpenStack services, the deployment node if you have one in your concept, and the control nodes. On the other side the data plane, with network and storage.
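Since a lot of what follows depends on the monitor quorum described above, here is the promised minimal way to look at it from the outside; a sketch that assumes the `ceph` CLI and an admin keyring are available on the host:

```python
# Minimal sketch: ask the cluster which monitors currently form the
# quorum. Assumes the `ceph` CLI and an admin keyring on this host.
import json
import subprocess

def quorum_names() -> list[str]:
    out = subprocess.check_output(
        ["ceph", "quorum_status", "--format", "json"])
    # 'quorum_names' lists the monitors that are part of the quorum.
    return json.loads(out)["quorum_names"]

if __name__ == "__main__":
    mons = quorum_names()
    print(f"{len(mons)} monitor(s) in quorum: {', '.join(mons)}")
```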
You have different types of services in your cloud. One type is the stateless services: services where a request has no dependency on previous requests, so if you send a request and get a reply, no further attention is required from the service afterwards. These are usually the API endpoints or the Nova scheduler. On the other hand you have the stateful services, like the database or RabbitMQ, where a subsequent call depends on the former one. You have a state, and you need to make sure you don't lose this state.

How do you handle that in an HA setup? You have in general two concepts: one is active/active and the other is active/passive. Stateless services are the easiest to handle, because you simply need load balancing, for example through HAProxy, in both concepts. In the active/passive case you usually just bring up a new instance of a stateful service when the active one fails. In active/active you have redundant services running at the same time, and all of them synchronize their state among themselves.

A simple picture of how you would probably build your HA setup on the OpenStack side: you have two or more nodes, you distribute instances of the services you want in the HA setup across them, you have a virtual IP and handle access to these nodes via HAProxy, and underneath you have the setup for your SQL database or for RabbitMQ, for example.

The most interesting part of my talk, the core topic, is quorum. Usually if you have a cluster and an HA setup, you have components that have a membership in the cluster, and you need to decide which one is the leading part, to prevent data and service corruption. Examples are the usual Galera setup for SQL, which uses a quorum, or MongoDB or Cassandra; on the other hand Pacemaker and Corosync also build quorums. Related to Ceph, you have the Ceph monitors, which build a quorum through the Paxos protocol. As I already said, you should have an odd number of monitors, and at least three in an HA setup; one is not enough.

What happens if you lose the quorum in Ceph? You have no chance anymore to handle the cluster membership of the components, like adding new MONs or new OSDs, but the clients also can't connect anymore, since they don't get the information where they should read and write their data.

So the big question is: are OpenStack and Ceph together HA-ready? For this I assume that you have an HA setup with no single point of failure on your OpenStack side, and the Ceph setup in general is for sure HA-ready if you have enough MONs and there is no single point of failure in the setup. The assumption here is that the availability of the RADOS block devices is really critical for your cloud, so you host your VMs directly on RBD; if that's not the case, this talk makes no sense for you. The availability of the RADOS Gateway, on the other hand, I would classify as not that critical, because it's a gateway, like an API endpoint: you can simply load-balance and distribute it. That would be very simple and is out of scope here.
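Before we look at the failure scenarios, a tiny sketch of the majority arithmetic behind the "odd number, at least three" advice:

```python
# Majority arithmetic for quorums: 2n monitors tolerate no more
# failures than 2n - 1, so even counts add risk without adding
# tolerance -- hence "odd number, at least three".
def majority(mons: int) -> int:
    return mons // 2 + 1

def tolerated_failures(mons: int) -> int:
    return mons - majority(mons)

for mons in range(1, 8):
    print(f"{mons} MON(s): quorum needs {majority(mons)}, "
          f"survives {tolerated_failures(mons)} failure(s)")
# 3 MONs survive 1 failure, 4 MONs still only 1; 5 survive 2, 6 only 2.
```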
So what happens if you take a higher-level look at the HA setup, if you distribute your cloud over a data center with multiple cores, say two cores or two fire compartments? What happens if one of these cores really fails, or if you have disasters or failures like a misconfigured network, a lost network connection between those cores, or one of the cores or fire compartments losing power? That's where it gets really interesting.

To explain what happens I use this simple picture: on top you have your HA cluster setup, the second part is the compute nodes, in this case using the RBDs, and underneath is the Ceph cluster. For this case we have two rooms; I call them fire compartments here, but it doesn't matter whether you have fire compartments or physical cores in a data center, like separate buildings.

The first case is that one side simply fails, powers off. In this special case, where the failing side is the one with fewer Ceph MONs, everything should work: the HA setup switches over on the OpenStack side, Ceph also decides that one side is still running, and fire compartment B in this case simply keeps working. That is, let's say, the best-case scenario.

And what happens if the side with the majority of the monitors fails? In this case Ceph is not able to build a quorum anymore, which means the Ceph cluster goes out of service. It then doesn't matter what OpenStack is doing, since the VMs have no block device anymore: the next time they try to read or write, they go down, they have a failure, and basically your complete cloud is lost; it's down.

This case is also a bit hard to fix by hand. If you can't bring the second core back up very fast, you will have to fix it on the Ceph side, because you want your cloud back, but that is a manual task: you need to extract the monitor maps from all monitors, find out which of the maps is the latest one, the one with the latest epoch, and then manually bring up additional MONs, or bring one down.
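The inspection step of that manual task could look roughly like this; a sketch only, with hypothetical monitor IDs, assuming shell access to the monitor hosts and the standard ceph-mon and monmaptool binaries, and assuming each monitor is stopped before its map is extracted:

```python
# Sketch of the inspection step only: extract the monmap from each
# stopped monitor and compare epochs to find the most recent map.
# Monitor IDs are hypothetical; as noted in the talk, this kind of
# procedure is hard to automate safely, so treat it as a manual aid.
import re
import subprocess

def monmap_epoch(mon_id: str, path: str = "/tmp/monmap") -> int:
    # The monitor daemon must be stopped before extracting its map.
    subprocess.run(["ceph-mon", "-i", mon_id, "--extract-monmap", path],
                   check=True)
    out = subprocess.check_output(["monmaptool", "--print", path], text=True)
    match = re.search(r"epoch (\d+)", out)
    assert match, "monmaptool output did not contain an epoch"
    return int(match.group(1))

# e.g.: for mon in ("a", "b", "c"): print(mon, monmap_epoch(mon))
```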
You could try to automate that, but usually automation of this case would fail in the end, because it would be really hard to find out automatically which of the map states is the right one.

Then you have the split-brain case, where basically both sides are still running, probably still have connectivity to the outside and are reachable, but the network between them is failing, for any number of single reasons: a switch between them fails, or cables, or maybe a misconfiguration of the network through your automation, whatever. Then you can run into a split-brain situation. In the first variant, what happens is the same as before: the Ceph cluster decides that the side with the majority of the MONs is still available, and in the best case OpenStack does the same, depending on the setup. So, as in the first case where one side is completely out of power, you would still have a running cloud. You would have an impact on performance, and you might lose some VMs, but it would basically still be working. That would be, in a split-brain situation, the best-case scenario.

What happens if the side with the majority of the monitors is cut off is somewhat similar to the power outage situation, except that the Ceph side is still running somewhere, and what can now happen is that your OpenStack HA setup chooses the wrong side. Currently there is no connection between the usual OpenStack HA setup and Ceph, and there is no description of this problem in the OpenStack HA guide. So you can simply have the case where OpenStack says "I think fire compartment A is still running" and Ceph says B. And again, your cloud is down.

Another thing you have to take into consideration here is how to distribute your replicas on the Ceph side in this two-room setup, and that can be hard. Normally people run a setup with three replicas to get a highly reliable data distribution, but two rooms and three replicas: how do you split them? If you put one replica on one side and two on the other, you always risk that the side with only one replica is the one remaining after a failure, and one more single failure there could cause a massive problem and possibly data loss. You may need four replicas instead, but that also has an impact: reduced write performance, for example, a lot more traffic, and in the end more cost. The alternative would be to use erasure coding with Ceph, but basic erasure coding also comes with reduced performance; on the other hand you need less space. You could mitigate that with cache tiering, so you should consider it as an alternative.

On the other hand, if you have such an HA setup, really think through the failure of a complete compartment. You also need to take care of spare capacity, because immediately after the failure the Ceph cluster would start trying to replicate and backfill everything, to recover on the side that is still alive.
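How much spare capacity? A purely illustrative model, assuming equally sized rooms and that CRUSH is allowed to restore all replicas within the surviving rooms:

```python
# Purely illustrative capacity model for the room-failure case: if one
# room dies, Ceph tries to backfill all replicas onto the survivors,
# so they must be able to hold the full data footprint. Assumes
# equally sized rooms; the replica count cancels out in this model.
def max_fill_ratio(rooms: int) -> float:
    """Fraction of the cluster you may fill so that, after losing one
    room, the remaining rooms can still hold every replica."""
    return (rooms - 1) / rooms

for rooms in (2, 3, 4):
    print(f"{rooms} rooms: keep the cluster below "
          f"{max_fill_ratio(rooms):.0%} full")
# With two rooms you may only fill the cluster to 50% -- the other
# half is the spare capacity the backfill needs after a room failure.
```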
So you probably need a lot of spare space available in your storage cluster on each side to handle that. What you could do is manually reduce the replication level for the time you are running in such a degraded state, but you should be aware of what you are really doing there, and of how fast you can bring your second core up again.

The best way to mitigate all of this would be to have, even within one data center, more than two cores or two fire compartments. Then the failure of one compartment wouldn't be a problem. But as so often, and as with tier III versus tier IV, it implies a lot of additional cost. It would be more resistant against such failures, and you would also get a better replica distribution, because you would simply have one replica on each site, but you would also have more east-west traffic on your systems, so you would probably need to change your network setup.

What we are doing instead: most of our data centers have some backup rooms, small rooms where only a few servers can be hosted, and what we put there is one part of the HA setup of Ceph and of OpenStack. In this case we host some additional MONs in the third room, but also some of the OpenStack databases; I'll sketch the monitor placement for this below. We don't host data there, because the rooms are not large enough for that, so it's less than the three-core or three-fire-compartment case. Depending on your layout and your setup, you can even mitigate the split-brain situation this way, because you can route the traffic over the third room if the connection between room one and two, between fire compartment A and B, fails. That is something you can do.

If we now go up a level again, up to the applications: you all know the discussion about pets versus cattle. To get more high availability into your system, it's very important that you don't put non-cloud-ready applications on it. You need applications where you can kill any VM and bring up another one; the service should be able to handle that, should be able to scale, and should be able to run an HA setup on its own. In the telco world with current NFV applications this is really the hard part, because most of the applications are more pets than cattle. But that's essential for NFV: when vendors provide services for telcos and want to virtualize them onto, for example, our cloud environment, they need to be cloud-ready. Otherwise you don't need to take that much care about your data center, because the problem is in your software.

As I said, you need failure-tolerant applications. What we are doing is building this cloud out of multiple backend data centers, and we don't span one OpenStack over all of them; instead, the applications can place their parts in different data centers, and then the failure of one data center shouldn't bring down the service. So the requirement, depending on the importance of the application and on its SLA, is to be hosted in more than one data center.
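As promised, here is the arithmetic for the third-room monitor placement; an illustrative split, showing why the small extra room is enough for the quorum even though it holds no data:

```python
# Illustrative monitor placement across three rooms: with five MONs
# split 2+2+1 (the single one in the small third room), any one room
# can fail and a majority of monitors survives.
placement = {"room_a": 2, "room_b": 2, "third_room": 1}

total = sum(placement.values())
quorum = total // 2 + 1  # strict majority

for failed_room, mons in placement.items():
    surviving = total - mons
    verdict = "quorum holds" if surviving >= quorum else "QUORUM LOST"
    print(f"lose {failed_room}: {surviving}/{total} MONs left -> {verdict}")
# Compare the two-room 3+2 split: losing the room with three MONs
# leaves 2/5, below the majority of 3 -- quorum lost.
```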
The issue here is, for sure, data replication: how do you get the data from one side to the other? You can't do that with Ceph, because we don't span Ceph over all these data centers. The requirement is, in general, to store no stateful data on an RBD, or the application has to take care of the distribution of the data itself. What we offer applications is simply an object store, which can then be federated over the data centers. That would be the best way to synchronize them and to have the replication across all these data centers. The catch is that it may not be synchronous across all of them, but the application should be able to handle that, and for sure it doesn't solve the problem if you lose a database.

The other problem is that many applications don't support object store right now. So my recommendation for everybody who is writing applications for a cloud: if you have to store files or objects, use the object store if at all possible; that is much easier to handle on our side. If you have a database, that's another story; then you may need an HA setup for your database if you depend on it.

As an outlook on how to connect OpenStack and Ceph in a better way than it's currently done, there is the idea of something we call "OpenStack follows storage". That doesn't mean we have to implement it in the OpenStack core services. It would be possible to have an HA setup for OpenStack that uses, for example, RBDs as a fencing device, to detect whether the RBD is still available on the side where your HA setup is, depend on that, and then choose the same site as Ceph does.

The other idea we are currently discussing in the Ceph community is to extend the existing monitors to include information about where they are physically hosted. Currently, in a Ceph cluster, you only know the placement of the OSDs, because you have the CRUSH map and your topology there, but you have no information about where your MONs are hosted. You get the information which MONs are still running, which are part of the quorum and where the quorum is, but no physical information, unless you build some logic of your own into your HA setup. So we want to do something similar to the CRUSH map and put location information on the MON daemons, so that from Pacemaker you could simply ask the Ceph cluster which site is still running, map that onto your setup, and force Pacemaker to make the same decision for OpenStack.

Another discussion: as I said before, it's very hard to get an additional MON running if you have no quorum, because you need to manually shut down the monitors, extract the maps, inject the map into a new monitor and into all still-running MONs, and bring the cluster up again, so you have downtime. The current discussion is to have standby monitors to ease that, so that you could bring them up in an easier way afterwards; there is a blueprint and a discussion ongoing. And the other topic that could ease it is something like a generic library or service for quorum decisions, to which Ceph or OpenStack could delegate the decision about a quorum.
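A hedged sketch of how "OpenStack follows storage" could look today, without any of the proposed Ceph extensions: since Ceph does not yet know where its monitors physically sit, the MON-to-site mapping is operator-maintained, exactly the "logic of your own" mentioned above, and all names here are hypothetical:

```python
# Sketch of "OpenStack follows storage" with today's tools: Ceph does
# not know where its monitors physically sit, so the MON -> site map
# below is operator-maintained (hypothetical names). A Pacemaker-like
# HA layer could then prefer the site that still holds the Ceph
# quorum, so OpenStack and Ceph end up choosing the same side.
import json
import subprocess

MON_SITE = {"mon-a1": "A", "mon-a2": "A",
            "mon-b1": "B", "mon-b2": "B",
            "mon-c1": "C"}  # operator-maintained assumption

def surviving_sites() -> set[str]:
    out = subprocess.check_output(
        ["ceph", "quorum_status", "--format", "json"])
    quorum_names = json.loads(out)["quorum_names"]
    return {MON_SITE[mon] for mon in quorum_names if mon in MON_SITE}

# The HA layer would fence or avoid every site not returned here.
```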
So that's basically the end of my talk; I just want to give a short summary. OpenStack and Ceph can provide HA together, and you can even reach five nines if you plan it very carefully and if you have multiple data centers: even if one data center has less than five nines, with multiple data centers you can get to higher numbers. You need to be very aware of the failure scenarios you have. You may think that these scenarios don't happen very often; that's probably true, but sooner or later they will happen, the only question is when. You need to make sure that all the quorum decisions that can occur in your setup come to the same answer to get this running. I would recommend using a third room, whether it's really a third core or, as in our case, simply a room where you host the additional services that are required to get your single OpenStack-Ceph cloud running in an HA setup. And, as I said, take care of the data replication and the spare capacity in your Ceph cluster; we will extend Ceph to provide more information for that. The target of five nines is, in our case, end-to-end, and not the availability of the data center or of the cloud in one data center, since that is very expensive on the data center level. So: no pets, right, and distribute your service over multiple data centers.

The last one: get involved if you want to help. You know all the OpenStack channels already; here is some more information on how you can also work in the Ceph community. There are mailing lists, there is IRC, and there is for example the Ceph Developer Summit, where you can engage with us and submit blueprints that we can discuss together. So you are invited; we would be really happy. Questions?

[Audience question] You mean forcing Ceph to take the other side? Basically, that's not possible. You can't force Ceph from the outside, because if you have only two MONs left, or one MON left on one side, you can't force anything; it's simply not enough. Yeah, that would be the easier way, but currently I would consider it not possible to force Ceph to one side.

[Audience question] For us it's the same: if the application takes care of distributing its data over the different locations on its own, that is also fine for us. But in general, if the application doesn't want to take any care of it, then it should use object store.

[Audience question] No, we don't want to do that; in our setup it's not an option. That highly depends on two things. One is your security: in our case, no, never ever. That is part of the Ceph cluster and of the HA setup, and I don't want to have that in a public cloud; we have a very strict security department for some very good reasons, and we wouldn't do it. You could maybe do it, but it would also highly depend on the latency and the distance. You probably could do it; I wouldn't.

[Audience question] Yeah, sure: you can't reach that if you take the applications out of the calculation. You can only reach it if you have multiple data centers and the application is really aware that it should be distributed and HA across them. If you don't have that, then you can't reach it without a lot, a lot of money, because even tier IV is not enough.

[Audience question] In this case, in the setup I showed, latency doesn't matter, because everything is in one data center, right?
If you put it on a public cloud, I don't have a number, but latency could be an issue there. Maybe I didn't get the question right, but if you have everything in one core, everything physically in one place, then it shouldn't be a problem. If the distance between data centers is too high, you can run into issues: 20 kilometers is maybe okay, but above that, depending on your network, the latency could get too long. We have had such cases.

[Audience question] Yeah, even if the failover plays out well, you for sure have an impact on performance, and you need to be aware of that if you look at one single data center: all the data moves around, and you need to bring up additional VMs for the applications, so there is an impact on performance, but it would still be running. And yeah, for sure, you need to do that. It also depends on how full your cloud is and whether you have enough compute power left on the surviving side. So you need to plan very carefully, not only spare storage but also spare compute, both if you look at one data center alone and if you look at the complete set of data centers. You need enough space and enough resources left to host your services. Okay?

[Audience question] The people, yeah. I mean, bringing traditional telco services from the usual black boxes you get from a vendor onto an OpenStack cloud, and getting rid of the vendor lock-in, usually requires a lot of changes in your organization, and that is probably the other biggest piece of work you have to do: you need people who are interested in change and who are open to the change from black boxes to a real cloud. That is a lot of work. Okay, thank you.