Good afternoon everyone, welcome to the Protecting the Galaxy session. It's about multi-region disaster recovery with OpenStack and Ceph, and the reason we picked this theme is not just because we have a cool spaceship to showcase today. I actually wanted to focus on the stars in this slide, and the reason is that once in every, pretty much, fourteen OpenStack releases the stars get aligned, and we are here to showcase this formation: OpenStack maturing to a point where we can actually discuss delivering multi-site, and with Ceph, a new innovation cycle that aligns with ours. Without further ado, let's go ahead and start.

So here are the speakers for today's talk. We'll try to keep it as light as we can; there's a lot of technology and content in our talk today, so we're going to leave a lot of room for Q&A and allow you to digest it all, hopefully. And if not, just come to see us afterwards.

I'll start with me. I'm a product manager for OpenStack; I basically look at everything storage in OpenStack, and I've been monitoring disaster recovery from day one of OpenStack, pretty much. So this is a very close and warm topic to my heart.

Sebastien: Yeah, I'm Sebastien, working on several topics like OpenStack, Ceph, and more recently containers. I've always been focusing on integrating Ceph into OpenStack, and from time to time I do blogging as well.

Federico: I'm Federico Lucifredi. I work in the Red Hat storage business unit as a product manager on Ceph.

All right, so we'll start with a quick disclaimer before you get carried away by the shiny new OpenStack and Ceph: this presentation only focuses on data disaster recovery. The reason is that disaster recovery in OpenStack is a huge topic, right?
It involves every given service in OpenStack, because your whole site, with your whole OpenStack infrastructure, goes down, right? And you need to basically be able to recover from that. We're going to touch on that, because we're going to talk about different architectures and configurations, and of course we're going to talk about what it takes. But the key focus today is how we can enable you to survive a disaster working with OpenStack and Ceph.

So we'll start with the mission. The mission is pretty clear, it's also in the title: the idea is to seamlessly and transparently back up your OpenStack images and block devices from one site to another, so that in a case of failure, resources in site A can be manually brought up in site B. And when we talk about images, we'll tell you exactly what we mean by that.

And here we are, so we have some design principles. Some of these design principles are actually good practices for OpenStack in general, not just for replication. While you're designing your cloud environment, you need to make sure that your images are templates of your application; that is what we're replicating to the other side, and everything needs to be at that template level. Application data is always hosted on Cinder block devices, right? That's why we need persistent storage. When it comes to ephemeral storage, this is where the VM's ephemeral life lives, typically the machine's root disk; if something goes down there and it's corrupted or whatever, this is not what you want to back up. So this is probably the only storage that you don't need to maintain persistently, and we don't care about the ephemeral storage.
We're not going to deal with it at all in the replication failure domain. Finally, the application stack is managed by Heat; we're using Heat in this example, but you can use any of your favorite automation tools, such as Ansible or whatever. In a failure scenario, the user simply re-bootstraps the application stack, using Heat in our case, and configures it using the configuration management system, so you can start the application on the target site. Now keep in mind, site A is not identical to site B: the network topology may change, etc. This is where the configuration management comes into play, to allow you to re-tweak the Heat templates accordingly so you can replay the configuration on the target site. Of course, you have floating IPs to help with that, but again, there are changes that need to happen. So when you do this failover to another site, you need to make sure that you have this configuration at hand. What it means is: don't wait for a disaster to happen to prepare for it. You need to maintain and record your topology on the other site and prepare such templates in advance, so you can actually do this failover to the other end.

All right, so where are we right now? This is a long-standing effort, basically, to build our galaxy, and as you know, we're looking at OpenStack and Ceph today. If you look at the diagram of where we are today, we are at the point where we have a fully integrated system, all the way from Keystone, Swift, Cinder, Glance, Nova, Manila (now with the new CephFS driver that was introduced in Mitaka), and Ceilometer. So all of the services that we need to protect are already there, and we have a very solid starting point to extend from. The good news is, a lot of you started with one OpenStack deployment, right? You have a single site, and now that it matures, you move to production.
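As a concrete illustration of these design principles, here is a minimal Heat template sketch (the resource names, image name, flavor, and network are illustrative, not from the talk): the application data sits on a Cinder volume, which lands in the replicated RBD pool, while the server's ephemeral root disk stays out of the replication failure domain.

```yaml
heat_template_version: 2016-04-08   # Mitaka-era template version

resources:
  app_data:
    # Persistent application data lives on a Cinder volume,
    # so it ends up in the replicated RBD pool.
    type: OS::Cinder::Volume
    properties:
      size: 10

  app_server:
    # The root disk is ephemeral and deliberately NOT replicated;
    # the image is the application template mentioned above.
    type: OS::Nova::Server
    properties:
      image: app-template-image    # illustrative name
      flavor: m1.small
      networks:
        - network: private         # illustrative network

  attach:
    type: OS::Cinder::VolumeAttachment
    properties:
      instance_uuid: { get_resource: app_server }
      volume_id: { get_resource: app_data }
```

On failover, the same template (re-tweaked for site B's topology) re-bootstraps the stack against the replicated volume.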
It's time to take care of how we back it up, how we have a disaster recovery plan for it. The good news is you can start small and grow; you don't need to re-architect your cloud just to support multi-site. That's the beauty of it. So we start with a single site, and then add another one later, and in the end you don't need to re-architect your cloud or application or anything just to support it. It's an organic disaster recovery. And with that, let's start to zoom in, one by one, on the use cases and architectures.

Thanks, Sean. So in the next couple of slides I'm going to be presenting some architecture examples that aim to address the multi-site use case. The first one we have is rather simple, and it can basically be called the recovery site: you effectively have a single OpenStack site with cloud controllers, computes, and a Ceph cluster, while on the other side you just have a single Ceph cluster. Now, why are we doing this? It's really fair to ask: wouldn't it be easier to simply stretch the cluster across both locations? To answer that, you have to remember that Ceph by design is synchronous and strongly consistent. That means, basically, that for a client to get its write acknowledged, all the replicas must be written. So if you do multi-site with a stretched Ceph cluster, you will definitely introduce higher latencies and potentially lower bandwidth as well. Remember, this is a block storage use case, so it's really counterproductive to do this, because we are aiming for low latencies and really high performance. So basically, in order to overcome this synchronous design of Ceph, we had to come up with a new functionality, added in the latest release of Ceph, Jewel, which came out last week, called RBD mirroring. It allows us to asynchronously replicate images. And by images,
I mean all the Ceph images, not the OpenStack images. Within Ceph we call a block device an image, but from an OpenStack perspective it's block devices and Glance images. So we seamlessly replicate all of those images from one site to the other.

Now, if we get back to the design: we take the assumption, of course, that we use the same L2 segment, just to make sure that you can properly do the replication and set it up. That's not a real challenge, but the failover procedure is. If you follow the properties of the design, it should be really easy. Let's take the example: the Ceph cluster goes down and you don't have any storage available on site one. What you will have to do is basically promote the second site, which effectively means promoting all the images on the second site to primary; then it will be able to receive I/O, so perform I/O operations, reads and writes and things like that. Then what you will have to do is reconfigure and reconnect all the OpenStack services that were once connected to the initial cluster. This is quite transparent if you follow the rule of having both clusters built with the exact same FSID. It's really convenient for Glance, because the way we store Ceph images within Glance is by using a URL, which takes the form of rbd:// followed by the cluster FSID, slash pool, slash image UUID, and then the snapshot. So if you do this, you will be able to transparently restart your services, and they will find that the cluster has the same FSID and that the image is simply accessible. The same goes, of course, for the L2 segment. So it's fairly easy, but, depending on how you deployed both sites and on the distance between them, this is not a permanent relocation of the services, which means that at some point you will have to roll back the images that are now being used on the second cluster
into the initial one. So what you will have to do is: once you've lost the first Ceph cluster, you promote the second site, you restart all the services, and then you right away start rebuilding the initial Ceph cluster, with the same FSID again, and you start backfilling images. Once you're done, this will probably involve some downtime and operational measures, but you have to do it, because otherwise you will be running all of your workloads from another site. So it's just a temporary relocation.

So now, moving on to the next two designs, which fall into a completely different category. Where the previous design was really focused on storage multi-site, in the next two examples we will also be providing guidelines for doing an OpenStack multi-site implementation. We took the approach of using multiple isolated OpenStacks: we basically have a site A and a site B, and in each site a single OpenStack environment. We do not stretch anything across regions or anything like that. The principle is that we keep on reusing the RBD mirroring feature, so we just have this cross-replication from site A to site B, and all the images are available on both sites at the same time. We decided, of course, to replicate Glance images and Cinder block devices, and then, even in the event of a failure, we will have the ability to recover data from the other site.

So let's jump into the first design, which we call regions with no shared Keystone. As mentioned, it's a completely different dimension here: it's a real OpenStack multi-site, where we just have isolated OpenStack environments, and we don't share anything. The only thing that is configured is the RBD mirroring in between, arrows from one site to the other.
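To make the FSID trick concrete, here is a small sketch of the Glance location URL format described above; the helper names and example values are mine, not Glance internals. Because the URL embeds the cluster FSID, a DR cluster deployed with the same FSID lets Glance resolve its stored locations unchanged after promotion.

```python
# Sketch of the Glance RBD location URL: rbd://<fsid>/<pool>/<image-id>/<snapshot>.
# Helper names and the example FSID are illustrative, not Glance code.

def build_rbd_location(fsid: str, pool: str, image_id: str, snap: str = "snap") -> str:
    """Build a Glance-style RBD location URL."""
    return f"rbd://{fsid}/{pool}/{image_id}/{snap}"

def parse_rbd_location(url: str):
    """Split an rbd:// location back into (fsid, pool, image_id, snapshot)."""
    prefix = "rbd://"
    if not url.startswith(prefix):
        raise ValueError("not an rbd:// location")
    parts = url[len(prefix):].split("/")
    if len(parts) != 4:
        raise ValueError("expected fsid/pool/image/snapshot")
    return tuple(parts)

# After failover, the very same stored URL resolves against the DR cluster,
# precisely because both clusters were built with one shared FSID.
```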
It's just cross-replication. Here Keystone sits on the cloud controllers, just as usual. From an operator perspective, the only thing you have to do is create and register regions and endpoints, so every user can easily log on to one region or the other using their own credentials. Same principle again: use the same cluster FSID.

When it comes to challenges: when we have to recover from a failure, it's a little bit trickier, because we currently don't have any way to replicate OpenStack metadata. And by metadata I mean everything from Glance to Cinder: image properties, image IDs, Cinder volume snapshots, and things like that. So when we have to recover, what we have to do is, once again, promote the second site, so we have all the images available. But then what we will have to do, and it's kind of a hacky way to do it, but as I said, at the moment there is no other proper way, is to append database records from site A to site B. So we will be taking the tables from site one and importing them into site B, so the OpenStack on the second site will have the impression, basically, that all the images... well, it's not an impression, because the images and volumes are already there.
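The "append database records" step can be pictured with a toy sketch (illustrative only; the real procedure operates on the actual Glance and Cinder SQL tables): rows from site A are merged into site B keyed on UUID, so site B's API starts seeing the volumes and images that RBD mirroring has already placed in its cluster.

```python
# Toy model of merging site A metadata rows into site B after a failover.
# Rows are dicts with an "id" (UUID); site B keeps its own rows and gains
# any site A rows it does not already know about.

def append_records(site_a_rows, site_b_rows):
    known = {row["id"] for row in site_b_rows}
    merged = list(site_b_rows)
    for row in site_a_rows:
        if row["id"] not in known:
            merged.append(row)
            known.add(row["id"])
    return merged
```

This is exactly why a service like Kingbird, discussed later, is attractive: it would keep this metadata synchronized continuously instead of via a one-shot import.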
So we are just registering all of those volumes on the second site.

Moving on to our third and last design, which is called shared Keystone with regions. It's really an enhanced version of the second one, because here we are trying to make everyone's life easier by basically extracting the Keystone database from the controllers, putting it on dedicated servers, and then having replication happen across both sites. The principle is that the Keystone servers will be connected to this MySQL cluster. Obviously this comes with a couple of challenges, since we have to store in that database the things Keystone owns, like users, groups, endpoints, assignments, tenants, and so on, but more importantly all the tokens. And the issue is that we are currently generating tons and tons of tokens, and at some point it might get a little bit tricky to have that replication happening over long distances. But we have to do this, because we cannot currently use Fernet tokens yet, and that is the ultimate goal: using Fernet tokens. For those of you who are not familiar with them: Fernet tokens are a new way to generate tokens within Keystone using a symmetric encryption mechanism, where we make use of a private key so we can encrypt tokens. Keystone generates the token's payload and puts tons of information into it, like user, tenant, and so on; we use the private key to encrypt that token and deliver it to the user. When the user wants to authenticate, they just send that token to the service, the service goes to Keystone: okay, this token is valid. The good news with Fernet tokens is that they don't require any persistence, so we don't have to store any tokens in a database. That's a really good thing, but we're not there yet.
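The statelessness property Sebastien describes can be illustrated with a toy token scheme. To be clear, this is not Keystone's actual Fernet format (which encrypts the payload with AES via the cryptography library); the sketch only shows the principle that a shared symmetric key removes the need to store tokens server-side.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"example-shared-symmetric-key"  # stand-in for the Fernet key repository

def issue_token(payload: dict) -> str:
    """Pack the payload and sign it; nothing is stored server-side."""
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode()).decode()
    signature = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + signature

def validate_token(token: str) -> dict:
    """Any server holding the key can validate; no token database needed."""
    body, signature = token.rsplit(".", 1)
    expected = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("invalid token")
    return json.loads(base64.urlsafe_b64decode(body))
```

In the multi-site picture, this is why Fernet is the end goal: both Keystone regions only need the same key material, not a replicated token table.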
The goal from the Keystone team is that Fernet tokens should be available by default from Newton; that's what I heard. We currently have a couple of issues with Fernet tokens, such as Keystone v2, trusts, and... I forgot the other one. Yes, revocation, happening too frequently.

So, since this is an enhanced version of the previous design, the recovery procedure remains the same: just promote the second site, append the database records, and it should be good to go. So now I'll hand it over to Federico, who will be introducing and doing a deep dive into the next feature, RBD mirroring.

All righty. So, as Seb was saying, this was introduced in Ceph's Jewel release last week, so it's brand new. RBD mirroring is easily the most requested feature we have for block storage, so it should prove popular. The problem is that most users coming from certain sectors, like FSI, require a secondary disaster recovery site. They're happy with the resiliency of Ceph within a single site, but they have to plan for the fact that the primary site might completely disappear. With that in mind, we have introduced functionality that asynchronously mirrors from a primary cluster to the disaster recovery site. This is done through a new daemon called rbd-mirror that synchronizes the images, the volume images, from the active site to the DR site. These are synchronized in a point-in-time consistent fashion, so that when the primary site abruptly disappears, you still have a consistent image at the secondary, not a corrupted one. This relies on two new RBD image features, well, a feature and a setting: one is journaling of images, and the other is the mirroring setting, to indicate that you want this image actually mirrored. There are two ways that mirroring can occur.
We can mirror anything that's in a given pool, or we can mirror select images, in which case you have to flag which ones. Images have states: primary and non-primary. Right now the system is being tested with only two sites; however, there are no architectural reasons why this wouldn't work with N sites. It's just something that we haven't tested yet, and that influences the terminology: we don't have primary and secondary, but only primary and non-primary. The non-primary image is not writable.

So what does the write path look like? An operation is generated, it goes to librbd as usual, then it's written to the journal, the write is acknowledged to the client, and afterwards it's written to the image. The rbd-mirror daemon running at the remote site accesses the primary RADOS cluster, pulls the data changes from the journal, and reproduces those changes in the remote site's images. The important thing to note here is that this is a pull by the rbd-mirror daemon at the remote site, so the mirror daemon needs to have complete routable connectivity to the full RADOS cluster at the primary site. And if you are configuring things in a bi-directional fashion, which is entirely possible if you want to load balance, with some images primary in one cluster and some images primary in the other, crossing each other, then you have two daemons, and they must mutually be able to access the whole cluster on each side.

Here are the details of how this is actually set up. You have to have different names for the clusters.
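The write path just described (journal first, acknowledge, apply locally, then a pull-based replay by the remote daemon) can be modeled with a toy sketch. This is purely illustrative, not rbd-mirror code, but it shows why the remote copy is always a point-in-time consistent, possibly slightly older, version of the primary.

```python
# Toy model of the journaled write path and the remote pull-based replay.
# The DR image is always a consistent prefix of the primary's journal,
# which is exactly the point-in-time guarantee described above.

class MirroredImage:
    def __init__(self):
        self.journal = []        # ordered log of (offset, data) writes
        self.local = {}          # primary image content
        self.remote = {}         # non-primary (DR) image content
        self.replayed = 0        # journal position the DR daemon has reached

    def write(self, offset, data):
        self.journal.append((offset, data))  # 1. append the write to the journal
        acked = True                         # 2. acknowledge the client
        self.local[offset] = data            # 3. then apply to the local image
        return acked

    def mirror_pull(self, max_entries):
        """The rbd-mirror daemon pulls and replays journal entries in order."""
        for offset, data in self.journal[self.replayed:self.replayed + max_entries]:
            self.remote[offset] = data
        self.replayed = min(len(self.journal), self.replayed + max_entries)

    def lag(self):
        """Journal entries the DR site is behind (cf. the Q&A on journal lag)."""
        return len(self.journal) - self.replayed
```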
As I was saying, you must have full routable connectivity between the two RADOS clusters. You deploy the daemon at the secondary site. You create pools that have consistent configuration at the primary and at the secondary. Then you instruct the pool with the peering command: you tell the pool where it is peering to, so that it discovers the secondary. Then, depending on whether you have set the pool to have every image replicated or only some images replicated, you might have to flag which image will be replicated, or that may be redundant if you're replicating the entire pool. You also need to enable journaling on the images, which you can do at image creation time, or later using an RBD setting.

What are the limitations today? Well, as I was saying previously, we're only supporting two sites. We don't have HA coverage just yet, and this is librbd only at this point; we don't have support in krbd. The majority of users are using librbd, so that shouldn't be too much of a limitation. This is in the Jewel release from last week, and it is going to be in the supported Red Hat Ceph Storage product, which will be labeled 2.0, coming this summer. And now my illustrious colleague will tell you how we merge this into OpenStack.

Okay, thanks, Federico. So now let's briefly explore some of the new capabilities that we have from Mitaka. The Cinder folks have been working on the new version of the replication API, 2.1. It's just a replication API, kind of an enabler for us, to say when we have to trigger the failover. Basically, it works with Cinder volume types, where we have an extra capability bit on a type, and you say: okay, this is a type that is being replicated. The main difference between 2.0 and 2.1 is that this replication now works at the type level, which from a Ceph perspective means the pool level.
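For reference, a Cinder v2.1 ("cheesecake") replicated backend roughly looks like this hedged sketch. The RBD `replication_device` syntax shown is illustrative only, since the RBD driver's replication support was still a blueprint at the time of this talk.

```ini
# Sketch of a Cinder backend with v2.1 replication enabled (illustrative).

[ceph]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
# One entry per secondary cluster this backend can fail over to;
# from a Ceph perspective this is a whole different cluster.
replication_device = backend_id:site-b,conf:/etc/ceph/site-b.conf

# A volume type opts in through an extra spec, roughly:
#   cinder type-key replicated set replication_enabled='<is> True'
# and the admin fails over the whole backend (i.e. pool level) with:
#   cinder failover-host controller@ceph --backend_id site-b
```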
So if something goes wrong, we can completely fail over, or send a command to do the failover from one type's backend to another, which will effectively be a different Ceph cluster in our case. This is definitely the foundation of all the work we have been doing underneath, from the Ceph perspective, to be able to perform disaster recovery.

Now I'll just... oh yeah, I have only two slides. Only one! So now let's do a final recap of the gap analysis. One gap is about Keystone, but that one should be solved by Newton; that's the goal of the Keystone team. There's no real production readiness for Fernet tokens yet. Then there's OpenStack metadata: currently there is no way to properly replicate Glance, Nova, or Cinder metadata from one site to another. Fortunately, there is an initiative from the OPNFV folks called Kingbird, which aims to implement a centralized service for multi-region, multi-site OpenStack deployments. This would basically take the form of a new agent responsible for synchronizing several kinds of metadata across the OpenStack environments. I think the scope for now is to do Nova, but we would like to extend it to Cinder and also to Glance, just to make sure that
It's one of the only corner case we have because we have the ability now to do the replication at At the safe level, but we need to find a way to tell the open stack that everything is happening at the same time We might be also working on implementing kind of a lock and a consistent mechanism to make sure that an image that is being copied is Also available on the open stack meta data on the other side So we make sure that we have both for example glance the glance image ID in the glance Catalog from the other side and that image is also being copied or just copied because it's an image on the on the other side so this is This project is a real enabler and we We really invite you to to push on it We will be investing on that project as well because we believe it's the right way to do multi-site Because We also think that things like sales and others are way too complex. It's it has been here for years and barely no one has as it implemented and it's For full of caveats and things like that. So we we took the approach of finding a much easier Way of deploying multi-site environments As we explained like isolated open-stack environments easy to deploy easy to maintain easy to upgrade No fancy stuff. No stretched things and no additional component This this one is probably the key to enable us to do real and easy multi-site deployments So I'll let Sean Put all that all those things together. Thank you. Sebastian. Yeah, so you saw where we are in Mitaka and And and but at and starting the Newton release, right? So in a sense, I want to take us back to the beginning. I talked about the stars and formation So we talked about Cinder for example So Cinder has now the almost starting the fifth version of replication API, right? 
And the reason is that until now we didn't have a proper way to do it across different backends; as you heard from Sebastien, the support to do it at the pool level was just introduced in this cycle. We actually disabled the v2 replication in Cinder in this release, and now the 2.1 APIs are starting to appear from the vendors. So in a sense, we have a maturity point with Cinder. Cinder, by the way, has great backup capabilities as well; it's the only service in OpenStack today that has backup capabilities for the metadata of the service itself. So when you create a backup of your volumes, you can actually capture the metadata of the service as well. When you do replication, as you saw, we still have gaps, for example around Glance, etc. This is why we need a project like Kingbird to come and look at OpenStack at a system level. We're talking about so many services; OpenStack has finally matured to a point where we need to do things at the system level. We know what doesn't work, such as Cells, in terms of adoption. We believe this is the way to go; some of these principles are already implemented today, and we're glad to see this finally taking shape in the community, so we can drive OpenStack forward and really support multi-site configurations.

As noted, for the Newton release that's starting right now, we already have a blueprint for the RBD driver replication, to make use of the new RBD mirroring that Federico covered, with all the necessary changes in Cinder replication itself. For the next version, we've actually already started the design talks on Tiramisu; Tiramisu is the next iteration of the v2.1 replication. Until now, we have excluded the point that we need to actually group different volumes when we do replication, because at the end of the day we're protecting workloads, right?
That's what we serve: applications. And an application doesn't reside in a single box, right? So we need to make sure that when we create a failure domain, it's a complete and consistent point from the application's perspective. This is why we need to deal with the group level. This is not yet finalized; there are actually still discussions tomorrow and Friday in the Cinder community. If you're part of the Cinder community, please join us for these discussions, be part of it, and help shape the next versions. And finally, there's a project like Kingbird if you're looking to contribute. Again, this was started by the OPNFV folks, but we believe it's larger than just one project in OpenStack, because every service in OpenStack needs to be represented and served. At the end of the day, it's up to the operator, right? We're putting all of our eggs in the OpenStack basket; it's time to protect them well and be able to resume. And with that, I want to open it up to some questions. The slides are available using this barcode, so feel free to scan it and you'll get a copy, and feel free to see us if we're not able to address your questions in time. We have a microphone, so feel free to step up; if not, we'll just try to repeat your questions.

All right, go ahead.

Is the question about the stretch or about the replication? There is no issue with how long the delay is; in fact, we're looking at cases where you would actually be replaying the log hours later, for another use case. So the latency is not the issue; obviously you have to have connectivity, and there will be retries if the connection is not good. That's right.
Yeah, the transaction log is being replayed at the remote site.

The question is: how far behind is the log when we are replaying it? Actually, it's not that far, because when we send an I/O, the I/O goes into the journal and then into the image on the backend, so we are right away replicating that blob. It will of course depend on the network throughput and the latency you have between the links, but it's really happening around the same time as the original I/O, so we're not delaying anything.

They are a few seconds behind, but I think you're asking how you can tell how many seconds behind, right? I don't know the answer to that, but maybe Jason can tell you. So, to repeat for the rest of the audience: you would use the RBD CLI, and you would get an answer of how many transactions behind you are.

Yes? So if everything goes into the journal, that means if for some reason the replication falls behind, or it's interrupted, you can potentially be in a situation where you run out of space in the journal. Does that sound right?

You have to picture the journal like another image, basically another Ceph image, so we don't have any space limitation there; it's not a special object. It goes into the same pool, or it can be a different one. Okay, so we can maximize this, of course.

Can we maybe get the mic, because otherwise... thank you.

So if I enable the journal, that means it's already writing twice, considering the FileStore journal, so basically it will now be writing the same thing four times.
There will be more I/O, yes. But the acknowledgement goes out after the write hits the journal, so the performance penalty is minimal, or not even there. And if there is a client-side cache, the acknowledgement can go out even before you write to the journal. It is correct, though, that there is more traffic in the end, because it's two different images, possibly in two different pools. So yes, more I/Os.

Yes? Did I hear you say floating IP addresses? So there's no way of doing this without using floating IP addresses?

No, we mentioned floating IP addresses for the case where you have to recover from one site to another; it's just when you have to replay your application stack on the other OpenStack environment. It's not related to how we do the recovery.

Maybe to go at it again: when you restart your application images at the secondary site, they will have different IP addresses. Where do they get those IP addresses from?

It's just the Neutron internal DHCP; they're private IPs.

And so if I'm a user and I want to log in to that application now, then I have to get that into DNS. How does that work?

Oh yeah, well, this is not covered; it's, yet again, the disclaimer: we're more focused on the storage part. But I agree that at some point you will have to find a clever mechanism on top, maybe relying on a global load balancer. I haven't really developed that on the picture, sorry. You can use a sort of global load balancer to route requests, and just make sure that the floating IP failover went well too.

Okay. And you were talking about transactions, sorry. Is this transactions at the database level, or... in which context? Well, okay, so my application has got multiple servers: it's got app servers, it's got web servers,
yeah, database servers. When you're talking about transactions, in what context are you talking about them?

We're talking about the writes to the librbd journal. So it's transactions at the Ceph level: when you're writing, it goes through librbd to the backend. I'm not sure at which point we said "transactions", I can't really remember, but I believe it was when we were describing the replication. So it's not database transactions or anything like that.

Okay. And then I'm still a little confused about how those images are brought up at the remote site.

Okay, so basically we do the replication at the Ceph level. The rbd-mirror daemon is responsible for replicating the entire pool, so all the images present in the Glance pool go to the other site. And then we have to find a way, which currently isn't available, to also make sure that the other OpenStack environment is aware that those images exist. So if you lose site one, you have to take the database records and append them to the second site. Later, when we have Kingbird, we will be able to have image metadata synchronized and available on the other site; that's the idea.

Okay, so appending those database records, how do those...?

It's a bit hacky, as I said, but you have to do it that way at the moment.

Also, we're using Heat as configuration management, so Heat basically captures all of the configuration, including IPs, etc. When you restore the images on the other side, you are actually able to make changes before you restore them, so that way you'll be able to restore the same application on the other side.

That was our disclaimer; we can take this offline if it's about more than the data, because you want the whole application. We can take this outside; there are some holes. Okay.
Well, let's discuss this outside. All right, any other questions, or are we running out of time? Last one? Okay.

How do you keep the rbd-mirror daemon always working well?

Always working well? What I mean by working well is just functioning, always running. So how can we overcome the initial limitation of the missing HA capabilities?

Jason? I don't think we can do any Pacemaker, because we're transferring data.

Thank you. Thank you everyone. Thank you very much. Thank you.