Welcome everybody. So we are here to hear more about the theory-to-reality part of OpenStack HA. We have Gerd, Shamail Tahir, Kalin Nikolov and Sriram Subramanian. Hey guys, why don't you go ahead and introduce yourselves; we can just go in order.

Good. So my name is Gerd, I'm a cloud architect with Deutsche Telekom, and for the past three years I have led the development team that set up and developed the first OpenStack production platform of Deutsche Telekom and put it into operations together with the operations team.

Hi, good morning everyone. My name is Shamail Tahir. I'm with EMC; I work in the office of the CTO as a cloud architect, and based on how full this room is, and based on my shirt, we're clearly here to talk about the elephant in the room, which is high availability for OpenStack.

My name is Kalin Nikolov. I'm a cloud engineer at PayPal/eBay. I've been in a kind of DevOps role doing mostly operations, and I've been involved in multiple OpenStack deployments and upgrades.

And I am the founder of CloudDon. I've been with the OpenStack community since the Cactus and Diablo days, and currently I'm helping the Enterprise Work Group and also the high availability documentation.

So we'll start with a brief introduction, a lay of the land: what do we mean by HA? What does it mean in the context of OpenStack, and what are the high-level pieces? How do we get HA enabled? Yeah, we pushed a button yesterday — it's a pink button, you click it and you have HA all set — of course it's not so. The real meat is the experience from DT and eBay, but to set the stage: HA could mean multiple things, whether it's your API endpoints, your services, or the applications you're running. Whenever you talk about HA you're trying to avoid single points of failure, and the point here is that there can be multiple SPOFs. The general guideline is to provide redundancy wherever possible — slapping a load balancer in front, or having your clusters set up. But these are all theoretical aspects. The real difficulty is how you get it done, what happens when the failure actually happens, how you debug it, and what the challenges are. As high-level guidelines, you target multiple points: are you trying to make the services highly available, are you trying to provide redundancy for the databases or your message queues? It's easier said than done. So I'm going to pass the mic to Gerd and let them take over; we'll follow up with calls to action later.

Thank you. Yeah, I was trying to say why it is complex: at the outset it looks very simple, all beautiful, simple-looking things, but really it is this. Now you can understand that you can identify multiple points of failure here, and then you can go attack each and every one separately. The point is that it is a very complex thing, but don't be scared — there are ways, people have done it, there have been successful use cases. Of course there are challenges, and we are here to walk you through what the reality part is, what you can expect when you take this journey, and finally how we can help.
I hope you'll enjoy it. I will start with the theoretical part about active/active high availability, with the API services and endpoints, the database, and networking as the most important parts.

Thinking about high availability in OpenStack depends heavily on the technologies that you use in your platform. For example, if you use a different network virtualization technology, then the HA concept might look completely different from another one. This is true for the storage back end and the database systems as well. The vendors, or the distributions, offer different technologies or combinations of technologies to achieve high availability of services — for example combinations of Pacemaker/Corosync, or HAProxy plus Keepalived (VRRP), Galera, and so on. The following description that we present here, the theoretical part, is derived from the OpenStack HA guide.

So what's the target of active/active high availability? In most cases you just want more redundancy, more resiliency against single node failures and single service failures. So you try to have all the services HA — that's not entirely possible in OpenStack, but it is possible for most of the services. For example, the stateless services can be load balanced: you can deploy multiple instances of these services and use HAProxy plus Keepalived to get an HA setup. For the stateful services like RabbitMQ or the database, there are individual technologies at hand that you can use; they are all different, and you can load balance these systems as well. And then there are some services or agents where no HA feature is available.

Let's come to the API endpoints. You can deploy all the APIs on multiple nodes, for example on multiple API nodes or controller nodes, then configure the load balancing via HAProxy, configure the resources in HAProxy, and then only the virtual IPs of this proxy are used for the registration of the API endpoints in the identity service, and in all the configuration files you only use the virtual IPs. The schedulers are configured to use a clustered RabbitMQ broker with multiple nodes.

For the databases there is a well-known solution: MySQL or MariaDB with a Galera cluster, using the wsrep (write-set replication) library extension. This works for the active/passive concept as well. It's a deployment with multiple master nodes, and you need at least three nodes to get a quorum, because you need a majority of the nodes for a quorum. If you only have two nodes, you only have 50% left when a node fails, so you need at least three nodes — in case of a network partition, for example. You can read from and write to any node, but this is optimistic concurrency: if you write to two nodes at the same time, one of these transactions may be deadlocked, and the application has to handle this event. There are other databases available as well — for example Percona XtraDB Cluster, or PostgreSQL with another replication technology would be usable too.

For RabbitMQ, you would form a RabbitMQ broker consisting of multiple clustered nodes and configure mirrored queues in this RabbitMQ cluster; this is done via policies, and all the services use these RabbitMQ nodes.
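As a rough illustration of what such a mirrored-queue policy could look like, here is a minimal Python sketch that sets an "ha-all" policy through the RabbitMQ management plugin's HTTP API. The host, port and credentials are placeholders for illustration only, and the same thing can be done on the command line with `rabbitmqctl set_policy`.

```python
import requests

# Assumptions: RabbitMQ management plugin on localhost:15672,
# default vhost "/" (URL-encoded as %2F), illustrative credentials.
RABBIT_API = "http://localhost:15672/api/policies/%2F/ha-all"
AUTH = ("guest", "guest")

# Mirror every queue to all nodes in the cluster and synchronize
# new mirrors automatically.
policy = {
    "pattern": ".*",
    "definition": {"ha-mode": "all", "ha-sync-mode": "automatic"},
    "apply-to": "queues",
}

resp = requests.put(RABBIT_API, auth=AUTH, json=policy)
resp.raise_for_status()
print("policy applied, HTTP", resp.status_code)
```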
Networking is a very interesting part for HA. You can deploy multiple network nodes; a network node in the end is represented by the multiple agents running on that network node. For the Neutron DHCP agent, for example, you can configure multiple agents and then you have HA on these agents — there is a configuration item to set the number of DHCP agents per network. For the Neutron L3 agent, which forwards the traffic from external networks and does the NAT, for example, there are two options available: the first one is just a failover feature that you can use, the other is to use VRRP, but then you would distribute the virtual routers across other nodes as well. And there are three agents where no high availability feature is available: the Neutron layer 2 agent, the metadata agent, and the load balancer agent. For these you can use the Pacemaker/Corosync solution, for example, which Shamail will talk about in a few minutes.

This is a deployment example of how it could look — it's just one example. As you can see, there are two controller nodes, and we put all the APIs on the controller nodes, the RabbitMQ as well, and we have two load balancers to distribute the traffic and provide the high availability via Keepalived with a virtual IP, and we have three database nodes. With this I hand over to Shamail.

There we go. All right, thank you, Gerd. Okay, so the way I'm going to approach the active/passive session is based on how the OpenStack HA guide today reflects high availability configuration. This is part of the theory, and we're kind of walking into the lessons learned of the current approach of the HA guide as well. With that, we're going to cover general HA guidelines based on the guide itself, and then we'll cover the tools. Gerd mentioned Pacemaker and Corosync; from an active/passive perspective those are really the hallmarks of how you do HA, so I think it's better to talk about those tools first and then apply them to the services they are applied to within OpenStack.

In general, a lot of the requirements that Gerd mentioned — the fact that components should leverage a virtual IP — still exist here as well. You should use multiple nodes, obviously, and at the same time, as I was mentioning a second ago, a lot of the active/passive tooling is not OpenStack specific; it's actually Linux specific, more or less. If you're a Linux admin or sysadmin, you've probably been using these standard tools for years: Pacemaker, Corosync, and DRBD for volume replication are probably the known standards, if you will, in that world.

Corosync and Pacemaker kind of go together, so a quick brief on what Corosync is: Corosync is basically the messaging layer used by the cluster management system, so Pacemaker uses it to maintain things like cluster membership and messaging for the cluster itself. Corosync uses a redundant ring protocol, so you have two networks and it will basically use one or the other; you can set it to active/active, so it uses both, or you can set it to active/passive, so when one ring fails it uses the other ring. Likewise, from a deployment perspective, as you get ready to configure firewalls — if you have firewalls in your organization, as many larger organizations do — from a port perspective you set the mcastport, which is the receiving port, and the send port is always that port number minus one as the outgoing port.
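Just to make that port convention concrete, here is a trivial illustrative sketch (the helper name is made up); the point is simply that both UDP ports need to be opened per ring.

```python
def corosync_firewall_ports(mcastport: int) -> dict:
    """UDP ports to open for one Corosync ring.

    Corosync receives on the configured mcastport and sends from
    mcastport - 1, so both must be allowed through the firewall.
    """
    return {"receive": mcastport, "send": mcastport - 1}


# Example with the commonly used default mcastport 5405.
print(corosync_firewall_ports(5405))  # {'receive': 5405, 'send': 5404}
```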
Pacemaker is what is actually your cluster resource manager, so Pacemaker is where you define your resources — the resource that you're protecting, basically. From this perspective, Pacemaker has a few components as well. The main one is of course the cluster resource management daemon, but then the CIB, the cluster information base, is what represents the current state of the resources as well as the cluster configuration; that data is in an XML format and is shared across all members of the cluster. Once that data is available to the cluster, instructions for failover and state are sent from the policy engine to the local resource management daemon, and basically that's where your local resources are managed. Note that the local resource management daemon sits outside the clustered process, whereas the CRM daemon is part of the clustered process itself.

Likewise, if you run into a situation where you're close to split brain, or you have other scenarios where a node is misbehaving, STONITH is basically the fencing mechanism that can power down a node. Most of you are probably familiar with STONITH, but if you're not, it stands for "shoot the other node in the head", which is where it comes from; its job is, if something is misbehaving, just turn it off for now. And then everything that we're doing is defined as a resource agent, which is a standard interface for what the configurations and options are and how to manage the state of a resource. So when we do things like Keystone API management through Pacemaker and so on, those are all defined as resources that we manage.

Last but not least, especially for the database side, is DRBD, the distributed replicated block device. Basically what this does is take underlying block storage devices and create a logical block device on top of them — i.e. /dev/drbdX — and the backing volumes are the ones that actually take the I/O. Whenever a write I/O occurs, it is sent simultaneously to a secondary node and then committed to the backing volumes of the DRBD volume on the secondary node. When a read operation occurs, the read is serviced locally. So when using this approach you can anticipate a delay on write I/O, because effectively you have to wait for the other side to receive the I/O as well, whereas with reads you're servicing them locally and you're generally okay from a performance perspective.
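As a small illustration of how an operator might verify that such a DRBD pair is healthy, here is a sketch in Python that reads /proc/drbd (the status interface of the DRBD 8.x series described in the guide) and checks for a connected, up-to-date pair. It is an approximation, not a complete monitor.

```python
from pathlib import Path


def drbd_is_healthy(proc_path: str = "/proc/drbd") -> bool:
    """Rough health check for a DRBD 8.x resource.

    We look for a connected pair (cs:Connected) whose disks are both
    up to date (ds:UpToDate/UpToDate); anything else is treated as degraded.
    """
    text = Path(proc_path).read_text()
    return "cs:Connected" in text and "ds:UpToDate/UpToDate" in text


if __name__ == "__main__":
    print("DRBD healthy:", drbd_is_healthy())
```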
Having said that, how does this look together? From an active/passive perspective, on the database side we have MySQL processes and resources which are protected and managed by Pacemaker, and that MySQL resource, the database, is backed by DRBD — a replicated volume. That replicated volume is what makes sure our data is available on the other end, and Corosync keeps the health state between the nodes. This is how the combination works to protect the database in an active/passive manner.

Gerd mentioned Galera, and I wanted to take a moment here: even though Galera, as we've seen and heard and will see again, is widely used for database HA, the HA guide as written today hasn't been updated. The guide says DRBD is the standard recommended approach, and Galera, while it works, isn't really tested or recommended yet. We'll come back to how we're refreshing the HA guide, but these are some of the things where, if you're new and you're using the HA guide to actually deploy and design your HA solution for OpenStack, some of the information needs refreshing. I think this is where the reality part really comes into play: if you follow that guide you'll probably get HA, but you might not get the ideal HA, and you might not really understand the implications of your design from the guide the way it's written today.

Likewise, for the message queue services the design is pretty much the same: they rely on things like DRBD and so on to provide the same protection as for the SQL database, so the configuration is very similar. And the way the guide describes active/passive is that you have virtual IP interfaces leveraged by all services — in your configuration files you use the virtual IP address — and then you configure Pacemaker resources, as described earlier, for all these different entities to provide HA for your cloud. As we'll see in the next section, in large-scale deployments, as we actually get into reality, some of these things will be different from how the guide describes them. And the last takeaway, or lesson learned, from the guide — and from the way we've approached this session so far, describing active/active and then active/passive — is that the world doesn't work like that. It's not that you'll do all active/active or all active/passive; most clouds will actually run some components active/active and some active/passive, so the blending of the two is what the reality really is. With that, Gerd.

So the platform I'm talking about today is called the Business Marketplace of Deutsche Telekom. It's a software-as-a-service offering for small and medium-sized enterprises, and we offer software there from software partners — ISVs — and from DT itself. The platform is based on OpenStack and Ceph and has been in production since Q1 2013. Currently we are running a scale-out project; this is ongoing.
This is ongoing So I have to admit that not all the stuff that I'm presenting here is currently running on this system in production Some of the stuff is in testing other stuff is running since two years on the system the requirements The target of this Project is to migrate the whole production from an old data center to a new one to a high-tech data center and The one of the requirements is to scale out the capacity with respect to compute and storage and to eliminate As much as possible the single point of failures Specifically we are setting up this in two different fire protection areas in two physically separated data center rooms In fact, this is a single region OpenStack instance running with highly available Services all services will be distributed over the two data center rooms and the compute capacity and the storage capacity Will distribute it equally over both rooms All services as far as possible will have HA and we will distribute all the operational support systems and services as well in both rooms And we created a system to deploy the instances on this on this platform With the system from of for availability zones multiple host aggregates and scheduler filters So that when an instance will be started with a specific flavor It will be placed in a specific security zone in specific availability and placement zone in one of the both rooms This is necessary because we have a lot of pet applications running on the system And they have to be distributed exactly evenly in both rooms So what did we do? We have load balancing with HA proxy and keep alive for MySQL for the services and we're going to queue in the APIs We are currently testing engine X as well because we want to reduce the number or the The amount of different software used in the platform on other and other levels. We are using engine X Already, so we try to eliminate Software for a specific use case that we already run We are using Galera since Yeah, two years now and it's it's running very good On three nodes in the data center and rabbit mq with clustered with a class a cluster to rabbit mq with mirrored queues On neutron we have multiple DHCP agents started and we use pacemaker and Coruscant as well On the API endpoints. We have load balancing with round robin distribution and For the storage we use Seth clusters for our BD for the persistent volumes and for the object storage S3 So what are the experiences so far the load balancing in general works very well with a database we have the issue that Multi-node writes doesn't work very well. So one node is the master and the other two nodes are backup machines This diminishes the HA capabilities of Galera significantly So if you lose this master node you have to promote another mode another node to to a master and cover this problem that's Problem with the open-stack multi capability to to achieve multi write nodes multi writes currently And then we have specific issues with this deployment in two different data center rooms If you have two rooms and you have to deploy three different services For for Galera, for example, then you have an uneven distribution and this is exactly for Galera problem with the wrong room with two nodes Fails because of a network partition, for example, or a human failure in the network configuration Then the third node will deactivate itself and then you don't have a database anymore. This is obviously a problem Then we have a storage this specific problem. 
Then we have a storage-specific problem. We configured our Ceph clusters with three replicas in the past, but if you lose one of the rooms, you also lose a significant share of the replicas for a lot of data, and then you only have one replica left. So you get a lot of traffic on the Ceph cluster to re-replicate all the data, and if you lose even a single disk during this recovery time, you will lose data. So we raised the replica level on the Ceph cluster to four replicas, two in each room, and we adapted the CRUSH map to distribute the data evenly, and then this problem is mitigated.

In case of a network failure — for example the failure of a network node or a layer 3 agent — we need up to 15 minutes to recover, to spin up a new layer 3 node or move the router, and possibly to spin up a new machine. This takes up to 15 minutes currently. It's okay because this is not a public cloud; it's a software-as-a-service offering, so the operators are the only people working on the API, for example. With respect to the SLAs that we have, 15 minutes is the uppermost limit. And then we have the problem that we run a lot of pet applications on this platform, and they may suffer from a major disaster anyway. These are not cloud-native applications: for smaller deployments, in one tenant we have, for example, two web servers with load balancers in front and multiple database servers. If exactly 50% of this installation goes down, then some of the applications suffer from the failure anyway; it depends on the structure and the architecture of the application. And we saw DHCP agent failures sometimes.

Our plans for the future: we would like to use distributed virtual routers to make our network more resilient and more elastic, but this would require us to upgrade to a newer OpenStack version, for example to Juno. And a third data center room would be desirable for us, to distribute the services — for example the Galera nodes — more evenly, one node in every room. With this I hand over to Kalin.

Hi again. I'm going to quickly cover the scope of the eBay/PayPal OpenStack implementations that we have. PayPal web and mid-tier are running 100% on OpenStack, and most of the dev/QA clouds that we have are running OpenStack. The number of hypervisors we have so far is 8,500, but this number is growing rapidly, and it's probably more than 70,000 virtual machines, all of them KVM. They're spread across probably more than ten availability zones now, and we also have several thousand users in the dev/QA self-service clouds.

For the PayPal and eBay HA implementation there are several solutions that we use, but first of all I want to mention that we're constantly evolving: we experiment with a solution, and if it doesn't work well we try something else. Usually in the lab we try something and it works fine, but when we get to production at large scale — thousands of hypervisors — we very often find problems, and then we try to mitigate those problems or switch to something more stable. Another thing I want to mention is that we try to use VIPs a lot: load-balanced VIPs for every service where possible.

For the database we mostly use MySQL with multi-master replication. Currently we're trying to switch to Galera, and we have already switched to Galera for some services. The reason we are cautious with Galera is that there are some issues with it, for example with deletes on tables without a primary key.
We want to go to Galera, and we probably will, but more cautiously. For RabbitMQ we have changed solutions a couple of times. Currently we're using RabbitMQ behind a VIP with single-node persistence failover, and we also have implementations with three nodes with mirrored queues. We are trying to move in that direction: three nodes with mirrored queues behind a VIP with least-connection balancing. For the Neutron DHCP agents and for LBaaS we use Corosync and Pacemaker; we have some issues there as well, which I'll mention later on. For the endpoints, as I mentioned, we have VIPs for every service with either round-robin or least-connection, and we use three nodes — three controllers — for the OpenStack services. For storage we use shared storage with either NFS or iSCSI; nothing fancy there.

So here are the most successful HA implementations we have. As I mentioned, load-balanced HA using VIPs for every service tends to work very well. We also use a single-node failover persistence profile, for example for the MySQL VIPs and the RabbitMQ VIPs; that tends to work very well except in some cases, which I'll mention. Also, we are switching to Percona Galera (Percona XtraDB Cluster) for the identity service — that's a global identity service — and it seems to be working very well so far, with the identity service behind a global load balancer.

Where we have failures: one of the most annoying failures we've been having is with Corosync/Pacemaker, basically for the Neutron DHCP and LBaaS agents. One of the problems here is the lack of advanced health checks. For example, when a service is moved to another node, Pacemaker knows whether the service is up or down, but it doesn't really know whether the service is actually working; it needs more advanced, ECV-like checks to verify that the service is really working. And with Neutron DHCP we have issues where we have to do some manual cleanup on the namespaces — we have to kill dnsmasq processes.

For RabbitMQ we have mixed success with single-node failover persistence: in some cases it doesn't work well and creates issues in RabbitMQ where we have to clean things up, but we are working on that, basically moving toward three nodes with mirrored queues. For MySQL replication, again, we are trying to use single-node persistence failover: basically we're talking to only one master at a time, and if something happens to that master and it goes down, the failover persistence switches to the other node automatically and someone works on the failed node to fix it. However, we have issues with the failover persistence in some cases. It turned out that in some environments it works well and in some it doesn't; for example, we found that with some hardware load balancers we have issues, while most of the time it works with the software implementations. The workaround we have is external monitoring that disables the failed member externally, but we are working on getting Galera implemented.

For the VIPs, not all of the VIPs we currently have are using ECV checks — enhanced content verification. Sometimes when we have a VIP we just monitor the port using TCP, and that doesn't always tell you the service is up and running: the service might be running, but it might not be talking to the database, and there might be some other problem.
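The kind of check being described could look roughly like the following Python sketch: instead of a bare TCP connect, the monitor actually exercises the service and its database. The Keystone endpoint URL, hosts and credentials here are placeholders for illustration, not the actual eBay/PayPal configuration.

```python
import socket

import pymysql
import requests

KEYSTONE_URL = "http://10.0.0.10:5000/v3"                     # placeholder VIP
DB_HOST, DB_USER, DB_PASS = "10.0.0.20", "monitor", "secret"  # placeholders


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """The naive check: only proves something is listening on the port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def keystone_content_check() -> bool:
    """ECV-style check: the API must answer and return a sane body."""
    try:
        resp = requests.get(KEYSTONE_URL, timeout=3)
        return resp.status_code in (200, 300) and "version" in resp.text
    except requests.RequestException:
        return False


def db_check() -> bool:
    """Verify the backing database actually accepts queries."""
    try:
        conn = pymysql.connect(host=DB_HOST, user=DB_USER,
                               password=DB_PASS, connect_timeout=3)
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
        conn.close()
        return True
    except pymysql.MySQLError:
        return False


if __name__ == "__main__":
    print("tcp:", tcp_check("10.0.0.10", 5000),
          "api:", keystone_content_check(),
          "db:", db_check())
```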
So we need these ECV checks to be implemented. The future direction that we have set for HA at eBay and PayPal is to go to HA with global and regional services — for example, one in each availability zone. So far we've been using that for Keystone, LBaaS and Swift, and it's working very well. We are also, as I mentioned, trying to move to three nodes with mirrored queues for RabbitMQ; we have already implemented that, and so far so good. We've been using shared NFS for Glance; we're getting rid of the shared NFS and have been using a Swift cluster for that purpose.

Here I just want to show the global identity service block diagram. We have two global VIPs, one for Keystone and one for Galera, and the global load balancer talks to three load balancer VIPs, one in each availability zone. In each load balancer there is a VIP for Keystone and a VIP for Galera, and behind those we have a cluster of control nodes for Keystone and a cluster of nodes for Galera.

And what lessons did we learn from these HA implementations? These are just general ones. Try not to overcomplicate things: usually the simplest thing works better than the most complicated one. Again, simulate failures — because of a lack of failure simulation we have had some of these failures. If possible, place your services in different availability zones, or at least in different fault zones, different racks, different networks. And always make backups, in the case of MySQL for example. I think that's probably... yeah, that's the last slide; guys, take over.

I hope you all got a preview of the challenges you face in practice. One thing we want to highlight: we talked a lot about the infrastructure, and about how we've had success in isolating single points of failure across servers, endpoints, and networking components. But HA can mean different things to different people, and we stayed away from high availability of applications; the onus there is still on the applications themselves — how they handle failure. You can also understand why HA as a feature is not a push-button feature, or built in by default: you have a lot of moving pieces, a lot of different components, and you have to deal with them separately. Having said that, we, as part of the HA guide team, are refreshing our guidelines on how to have a successful implementation, based on our experience and based on the use cases. Also, some of the commercial distributions try to build this in, bake it into their offerings, so that as a customer you may have less work to do in terms of enabling HA. For more guidelines you can always refer to the HA guide — sorry, I didn't include a link here, but please feel free to talk to us. Also, if you want to contribute, if you want to share your experience, if there's something to add, please join the Enterprise Work Group; the link is provided here. Any way you want to share your experience would be great. And for more references — oh, you have the link here — you can always check these references out, and as always, as a community, anything that you give back is going to be valuable.
You can always ask the community — you can find anybody — and overall the community has been very helpful. So if you have any questions, please use the mic over there, or we can pass the mic around, and the experts are here to answer. You as well on stage, in case they have questions.

A few questions for Gerd. The impact of the layer 3 agent failure: did you experience it or simulate it? Because you spoke about it — what was the impact and the recovery?

We experienced it because we lost a node, the complete node. We had to restart it, and then it took up to 15 minutes to recover all the networks and the virtual routers. In the meantime the VMs were still running, but with no access to the outside, basically.

Okay, the other question was — sorry — how do you accommodate the pet applications in the cloud?

We are running enterprise applications on this platform, and when we started the platform two years ago we had the impression, oh, we will onboard enterprise applications, we will have a lot of cattle applications. Unfortunately, that's not true: nearly 100% of the enterprise applications are pet applications. So what we do is offer consultancy to our ISVs, for example, or to internal projects, if they want to onboard an application on our platform, and we try to partly change the architecture of the application. For example, some of the applications write to the local file system; we try to convince the ISVs to use object storage instead, something like this. We put load balancers in front of the services, we try to spin up more service instances. On the other hand, unfortunately, we have to offer a kind of legacy service: for some of the applications we offer NFS inside the tenants, just to be able to onboard these applications. And unfortunately there are some applications on this platform that only use one web server, because the ISV is not willing to invest money to change their application. But these applications might be market leaders in their sector and their branch, and then it's interesting to onboard them anyway. The problem is that these applications were built five, ten, fifteen years ago, and there is no use case for them to just rewrite the application because it's hosted on a cloud platform. So you have to find a way in the middle, and you have to change processes as well: automated installation, configuration management, all this stuff.

Thank you. We can probably take one more question; we are running out of time.

Sorry, a question on the Pacemaker resource agents. I found them quite outdated, and the HA guide as well. So what's your experience with those resource agents? Do you have to extend them for them to be useful — take the Neutron L3 agents as an example — and do you need to update those resource agents for each OpenStack release? Thanks.

We are using the default resource agents, yes.

So you don't have to update them for releases?

No, in the past, no.

Thanks, everybody. Thank you so much.