Hello everyone, and thank you for coming to this session. Today we'll be talking about sturdy clouds and how to create resilient, software-defined data centers. I'll be co-presenting this session with my colleague Christian. A few words about myself: I'm Daniel, one of the cloud solutions architects at Mirantis. I've been working for Mirantis for four years now. Before that I worked in the cloud space in the telco domain, and I've always been an open source advocate. Christian, do you want to say a few words about yourself?

Yeah, I've been working for Mirantis for over nine years now. I started when OpenStack was still, let's say, more than a little bit buggy, and I learned a great deal in those first few years. I eventually became a principal architect. My specialty is hardware infrastructure and storage. I occasionally also write code, but that is more incidental to what I'm doing than a profession.

Thank you, Christian. So, the usual disclaimer: as this is a public presentation, we have the Mirantis disclaimer. We also have a scavenger hunt, so if you want to scan this QR code, there are some goodies that you can pick up at the beer garden. I'll put it back up at the end of the session if you don't have time now. Yes, you do want to take it? Good. All right.

What we have on the agenda today: we'll first look at why we want to design the cloud for failure, why this is important, and why, even with the extra cost in terms of manpower and hardware, it is still worth the effort.
We'll then look into how to do it on the infrastructure side and on the software side, and how applications are also important in that space. We'll end with operational excellence and support.

We want to treat sturdiness, sturdy clouds, the same way as security. That's why we drew the parallel with an onion, and we want to peel off each of these layers. We start with the base, which is planning, and planning is very, very important across all these different layers. We also want to take care of the security of the cloud, the applications, backup and disaster recovery, deployment, and infrastructure, and keep all of that going as an ongoing process. What we'll do now during this presentation is peel off each of these layers. So Christian, can you explain to us why this is so important?

I go to customer planning sessions several times a year, and I always get the question: why are you planning for failure? Why do you think your cloud will fail? Well, I don't. But I don't want to face the consequences if my cloud does develop a problem. If there is a failure, I will wish that I had a plan for it and that I had made my cloud resistant to the common problems that can crop up.

Catastrophic loss of business data is obviously something no system administrator ever wants to experience. You will probably be beheaded by your bosses, and this is something that I personally do not relish. Infrastructure down for a week.
This is another major thing. If your environment is not built in a way that lets you recover quickly from failure, you will lose customers simply by not being available, with customers going to the competition. The need to rebuild everything is about coming up with a way to stay operational while you are fixing your main cloud, meaning disaster recovery. And then, finally, a destroyed business and company image. We've had that multiple times in the past: companies lost major amounts of customer data, and the fallout is typically not pretty. Also, of course, ransomware. Lately there has been a rash of ransomware attacks, especially on public entities like hospitals and city governments, and in many cases this has crippled an organization to the point where it could not work for a prolonged amount of time.

Let's start with how we are going to secure our cloud. First of all, infrastructure: we have to make sure that our infrastructure is planned. This needs to be done before you make any decisions on what you are going to buy and what you are going to deploy. There needs to be a plan, and this plan needs to be adhered to.

Then, building infrastructure physically. I've been working in data centers for years, and being able to cable something in a way that makes it resistant to somebody pulling the wrong cable, or to some failure, is important.

Use a recommended bill of materials, which specifically means: do not build too small and do not build too large. I had an extreme example a few weeks ago. Somebody asked me, can I build a Ceph cluster out of three nodes with 80 disks each, of 8 terabytes each? So I had to tell them.
Well, yes, you can, theoretically. But what happens if one of those nodes fails? You are going to sync until the world collapses into a black hole. So, no. This brings us right back to making the plan: build something in the middle. If you build something that is too small or, let's say, too compact, you're going to have a bad time, and if you build something that is too big, you're going to have the same. So come up with a solution that is workable, but not so compact that the failure of an individual component is going to kill you.

Then, finally: don't use exotic hardware. I'm not a big fan of prescribing to somebody what they should use, but there are some players in the market who have a reputation for bad quality. I'm not specifically talking about server manufacturers but about component manufacturers. Inform yourself about what you are going to put in there. The same caution applies when you implement unusual layouts. This goes, for instance, for designs where you do edge computing and need a solution that can potentially work across an unstable link. You have to come up with a solution that will still survive this.

So, low-risk deployments: this is one layer up from the very inside of the onion. When you are deploying OpenStack, yes, somebody can deploy OpenStack manually. I have seen it done. I have not seen it succeed in production, ever. You need some sort of automation. It could be self-built automation.
It could be automation that you buy from a vendor, but either way there needs to be some automation around it. And you are not only buying the automation itself, or only the reduced amount of work that automation brings you. You are also buying best practices that have been honed across 50 other customers. If you were to try to do this yourself, you would never be able to get the same amount of expertise that you get with an automation platform.

The next thing is: do we really need feature X? This is something that I also hear: do you support Masakari, or whatever? Yes, a lot of features are useful, but if you do not have a requirement for a feature, I would leave it out and only implement it when you actually find a need for it. That brings us right back to deployment automation, which makes it easy to deploy that feature into your cloud later. Start with what I call, when I go to customers, an MVP, a minimum viable product. Shoot for something that you can achieve, that makes sense for you, and that has all the features you need initially. Then make the whole thing more complicated when you actually find the need to do so.

And automation is everything. If you customize something outside of automation, this is the most dangerous thing that you could possibly do to your cloud. A configuration that cannot be replicated when you, for instance, upgrade or add something to your cloud is something that should never, ever happen.

Another topic is disaster recovery, and the geo-redundancy that goes with it.
You could build a disaster recovery cloud in the same data center that you have your cloud in. The problem with this is that if the earth opens up and swallows your data center, or if an airplane falls onto it, then your disaster recovery environment is going to be just as dead as your original environment. This is why we typically have disaster recovery environments at a geo-redundant scale. It could be something in the US in a different disaster zone, so to speak, but something far enough away that a local disaster is not going to kill you.

The downside to this, of course, is that you are going to replicate data across a WAN link, which means you have limited capacity on that link. The one thing that you do not want to happen is that you have an active data set and a recovery data set, the link is not big enough to keep the environments in sync, and they diverge. Then, if something happens, you are going to start with a much older data set than the one you actually had. The other thing to think about, of course, is databases: make sure that the databases are actually replicated, not only the underlying volumes.

One more thing here. When you're building a cloud, you could of course test all the things that you need to do in your production cloud. I personally wouldn't do it, because I have seen too many times that this causes calls at 3 a.m., which I am not particularly fond of. So what we recommend is to build something very small but similar to your existing cloud, where you can test changes to automation and, for instance, features you want to implement. Let's say you want OpenStack Masakari and you want to test it first. You have to have a small cloud where you can actually try it out, and then you can use the same automation to deploy it into your main data center.

One more thing, backup and recovery, and then Daniel gets to talk. What is
disaster recovery, and what is backup? Disaster recovery means getting online very quickly with the absolute minimum that you need to continue your business. It is supposed to be only a stopgap until you manage to fix your main environment. But it is definitely not backup. Let's say you are hit by a ransomware attack and you have direct data sync to your disaster recovery environment: the same thing is going to happen to your disaster recovery environment. That is why you should also have a backup of all the mission-critical data, so that you can roll back to a state from before the catastrophe happened.

DR is also not high availability. You could theoretically have two environments that cross-sync data and use that for high availability, but this is not what disaster recovery per se is. It is not fault tolerance either. Fault tolerance is built into the platform, but it is not disaster recovery, and it is not a backup, for the reasons we already mentioned.

So for disaster recovery you absolutely have to plan for the catastrophe, and you should also test it. Do not implement something and then hope that it will work when something catastrophic happens, but actually make sure that it does work. It is super important to identify only the data that you actually have to migrate over. You need a platform that is similar to, but can be much smaller than, the main platform, and you need to make sure that you can very quickly sever the link and bring up the remote platform as operational. This of course also includes some network magic. That is unavoidable, but you can definitely build a system that comes up in a couple of hours at most. Then, of course, there is the question of what order things start in.
This goes back to automating not only the platform but also the software that runs on it, to make sure that your applications are brought up in the right order to start properly.

So, disaster recovery plans. When preparing for disaster recovery, create the documentation. This is something that a lot of people forget. "Yeah, we know how our disaster recovery environment works." Then the guy who knew it leaves the company, and all of a sudden you don't know it anymore. So make sure that it is actually documented. Execute the plan: make sure that everything in the documentation is actually executed properly and that the documentation and the real-world implementation match each other. Then prepare the individual applications for disaster recovery, make sure that you can start them in the remote environment, and test this occasionally.

Finally, backup. A lot of people say: okay, we have a Ceph cluster, that's three-times replication, I do not need backup. What happens if an actual volume becomes corrupted? It could be some software bug, something running wild. It could be somebody who hits you with ransomware. It could be some malicious operator on the inside who just decides to erase some volumes randomly. Either way, make sure that the data is backed up in a concise fashion, and please, please do not do incremental backups forever. I have seen examples of how this really went wrong. Incremental backups should cover a few days at most, and then there should be the next full backup. Going back more than a week for a full backup is really bad. So I think this was about it on that side. Over to Daniel now.

Thank you, Christian. So we looked at the infrastructure, and we looked at disaster recovery. Selecting the infrastructure code is definitely important too. We are all at Open Infrastructure.
I'm not going to try to convince you about the open source model, but here are some numbers, some facts. Based on some research, we see that 89 percent of large companies are using open source, and some of them are using it across various entities within the organization. Sometimes they don't know what they are using and where they are using it, but that's one of the challenges that I'm going to talk about on the next slide.

The next thing about open source is innovation. Developers are able to innovate on new features and outpace the competition by using what is available from the open source communities. There's no need to reinvent the wheel. Instead, be part of that community and contribute.

Choice is also very, very important. As we are talking about sturdy clouds, having the choice of selecting the proper project and the proper deployment tool is part of the whole strategy for building robust infrastructure, so that you do not get into a lock-in situation. But obviously open source also comes with its own challenges, which I'll summarize.
We have challenges around legal compliance and governance. As I said, many of these companies, these 89 percent of large companies, sometimes don't even know what they are using. They don't have the exact portfolio of what they are using, and each of these projects has its own licensing model, so it is important to tackle that with the proper procedure.

Community, maintainers and contributors are also one of the challenges, because this is time-consuming. Each and every project has its own code and its own mechanisms: pushing code to, for example, OpenStack, Kubernetes or Ceph works differently in each case.

Complexity, support and reliability is another of the big challenges, in the sense that these large infrastructures, at scale and on multiple sites, are more and more complex, and you also need to optimize that element.

Security, vulnerability and risk is definitely one of the biggest challenges across these different projects. It should even be at the top of the list, as a "job zero" element, simply because you can't push open source code without making sure that it is properly hardened.

The last one is skills availability and retention. That's also why we are hiring at Mirantis. It's very, very complicated to find people who are properly skilled in each of these open source solutions, and to retain them.

So how do we tackle all this? Again, the theme of this presentation is to plan, right?
We need to build that plan around how to tackle open source software. Have a clear policy on who uses OpenStack, what projects they are using, and who validates all these projects. Then contribute in a sustainable way. What I mean by sustainable here is having people who do this as their day-to-day job, not on the side of their day-to-day job, so that they can spend the proper amount of time contributing, directly or indirectly, to these projects. The last element is to focus on where you or your organization bring value. Sometimes this also means investing in your partners, in the sense that they will help you, or you will help them, to contribute to multiple projects.

Security is, as I said, key. Avoid hands-on changes. They are really, really bad. So the infrastructure-as-code concept is key here. You want to be able to audit who interacted with the platform and when, and to have all the automation around that. Also deploy your cluster with a minimum amount of trust, to make sure that you can assign the proper privileges around who accesses the platform and why. Run penetration testing, for sure. Secure the perimeter first, then enable access. So these are the basics around security, right? Set up the proper authentication mechanisms, such as...

One thing we possibly should mention here: OpenStack has a very granular system for granting people access to resources, and in my opinion it's pretty underused. Most people just use what's default. Just look into it.
You can actually have, for instance, operators that have no write access at all and can only see what's happening, and so on. Build your own roles the way they are actually needed, to make sure that you do not give people more permissions than you absolutely have to.

Exactly. Applications are also part of that whole strategy. We can build an infrastructure that is as robust and as sturdy as possible, but you will always have cases where things fail, right? That's what we started the presentation with. So the applications also need to take advantage of the infrastructure and be able to cope with any failures. Immutable applications are one of the approaches here. Cloud native disaster recovery, with applications using Kubernetes, for example, as a cloud operating system across various environments and infrastructures, is something very interesting to explore. That also relates to having reversibility and sovereignty capabilities, and being able to move applications in an active-active or active-passive mode is important.

If we can't have that immutable pattern and containerize the application, then at least on the VM side we should be able to segregate the different layers. So, for example, have a separate layer for the operating system, a separate layer for the application, and a separate layer for the user data. For the user data, we then modularize the data so that we can move it across various replication sites, which also ensures faster recovery.

Yeah, another note here. This is the separation between the operating system, which basically comes from an image, the application, which comes from an orchestrator, and the user data, which is actually produced by the users of your platform. It is also important when you do things like disaster recovery: avoid replicating anything that is not actually user data across your link.
That alone already saves you 70% of the data capacity that you need for the line between your primary and your DR environment. The same goes for backups. You should never have to back up the operating system or the application. These should all be rebuildable automatically from assets that you have. The backup should be reserved for actual user data.

And then, quickly, on operational excellence. I'm going to focus on observability here. We could spend the whole 30 minutes on this. The first thing is to define the KPIs: define how you are measuring the availability of your platform, for example a 99.9% SLA on the API availability of your platform. Also have the ability to collect real-time logs and metrics, to be able to monitor the infrastructure but also the applications, right? And then you scale out, because you have multiple sites, you want to be geo-redundant, and you want to have a disaster recovery site. So as you scale out, you also want to be able to aggregate all these logs and metrics and then consume them to build concise dashboards. Have a long-term retention period to be able to audit. And then, most importantly, enable your support organization to fix things proactively and to do forecasting, right?

So with that, thank you very much for your attention. Do you have any questions? Yes, we have the mic, which is at the back.
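As a small illustration of the KPI point, an availability target such as the 99.9% mentioned above can be turned into a concrete downtime budget and compared with what API probes actually measure. This is only a sketch: the 30-day window and the function names are assumptions made for the example, not something defined in the talk.

```python
def downtime_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed by an availability target.

    `target` is a fraction, e.g. 0.999 for 99.9%. The 30-day window is an
    assumption for this sketch; use whatever window your SLA defines.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - target)


def measured_availability(successful_probes: int, total_probes: int) -> float:
    """Fraction of API health probes that succeeded in the window."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    return successful_probes / total_probes


def meets_target(successful_probes: int, total_probes: int, target: float) -> bool:
    """True if the measured availability is at or above the target."""
    return measured_availability(successful_probes, total_probes) >= target


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.999)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which gives a sense of how little room there is for manual recovery.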
I think it's a very fundamental problem. Okay, this is better.

Okay, so, disaster recovery testing. Testing disaster recovery is not simply a matter of switching off the main site and running the disaster recovery site for a while to see whether everything works. There are different strategies for testing. One of them is to test individual applications that you can prepare in a way that lets them run on the disaster recovery site. The downside to that, of course, is that you do not know whether all your applications can do that. The second possibility would be to use a staging cloud for all your applications and try running these applications outside of your main cloud, with the appropriate network changes that are necessary to make that visible. You could do that only on the inside, so you do not actually have to open it up to the internet. That's a possibility.

The true switch-off of the main site I have never really seen working anyway. As much as it pains me to say, because I would love to be able to say "okay, we are standing here, we have designed this properly, we have tested it, it is working", in reality you probably will not be able to do that. But you can do your best to make sure that your plan is actually executable, and then build it. But this is a very, very good question.

The slides will be put up by the OpenInfra Summit. But if you would like them, you can also email me or Daniel. We have our email addresses here. If you cannot find those, please let me know, or let us know.

You're welcome. Any more questions? So, the beer garden is outside. I think it might be a good idea. That was exactly what I was going to say. And we have some heads here
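To illustrate the point about bringing applications up in the right order on the disaster recovery site, here is a minimal sketch that derives a start order from declared dependencies using Python's standard `graphlib` module. The service names and the `DR_SERVICES` mapping are hypothetical examples, not taken from the talk; a real plan would list your own applications.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical DR workload: each service maps to the services it depends on.
DR_SERVICES: dict[str, set[str]] = {
    "database": set(),
    "message-queue": set(),
    "api": {"database", "message-queue"},
    "web-frontend": {"api"},
}


def startup_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every service starts after its dependencies.

    Raises graphlib.CycleError if the dependencies are circular, which is
    itself a useful check to run against a DR plan before a disaster.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    try:
        for step, name in enumerate(startup_order(DR_SERVICES), start=1):
            print(f"{step}. start {name}")
    except CycleError as err:
        print(f"DR plan has a circular dependency: {err}")
```

Keeping the dependency list as data next to the automation means the documented order and the executed order come from the same source, which makes an occasional test run on a staging cloud easier to compare against the plan.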