 Great. Welcome, guys. I hope you guys had coffee and are fully awake for this afternoon. I hope you are here for the high availability talk, because for this particular session, we've ordered this special room in case of an emergency. There are two exits here, highly available. So make sure you don't get to use them. So let me tell you a little bit about myself. My name is Uthpal Thakrar. I'm a product manager for a company called RightScale. We do cloud management or multi-cloud management. So we'll talk about that during the talk. I don't want to do too much of a vendor picture, but I want to give you the context on what I mean by high availability and what it means to me. So my relationship with high availability, it started way back when I was growing up in India. A little boy. At that time, my dad used to buy me ice cream on Saturday afternoons. That was our thing. So every time he bought me an ice cream, I always used to think, what happens if I drop this one? Why doesn't he buy me two? This is where my relationship with high availability started. Fast forward a few years and now, I was grown up and I was coming out here for my college education to the US. What happened? Well, I'm going to project my voice. Can you still hear me in the back? Awesome. Okay. So guess what? You guessed it right. I was worried about the plane. Does it have enough engines to take me across? If one of them fails, what happens? I'm going to be fish food in the Pacific somewhere, right? I didn't want that. So then fast forward another few years, and now I was working in the telecommunication industry. I get to go to Japan, meet with some of the prestigious names there. SoftBank and KDDI was the one that I was talking to about a messaging product, right? I met with a gentleman there. He was pretty high up there. He was a quiet guy sitting in a room, and he gets up after the meeting was over. I did my product pitch. 
He didn't ask any questions at all, and the first question he asked is this, how many nines can you do? Later on after I left the meeting and went back to the US, I realized that I'd given him the wrong answer. Anything less than five nines was not acceptable in that country. So what did he mean by the nines there, right? That's a measure of availability. So as you see here, I'm not going to go through that. Two to five nines, five nines being the gold standard. People always talk about five nines, and that's about five minutes of downtime a year, right? That's what people shoot for as far as their application goes. When you think about it, it's like the dial tone. When you pick up the phone, you want to hear the dial tone. Maybe it's down for five minutes in a year, you won't notice it, but the telephone companies actually try to do that, right? So now, leap forward to 2012. The behemoth cloud provider is stamping out virtual machines at full force like Oreo cookies, right? And they're preaching the message, everything fails, be prepared, right? And rightfully so, because we had 27 big outages during that year, right? And as you can see from the pie chart, there's all kinds of players involved. It's public clouds, private clouds, hosting providers, even the SaaS players are involved, right? Everybody had some kind of outage. So it happens, right? Stuff happens. So who's responsible? Well, as you can see up there, the top three or four reasons are pretty obvious. You have power outages, the most common one, network failures, hardware fails, disk drives fail, you know, all kinds of equipment fail, there's tsunamis coming in or other kinds of natural disasters. And of course, the Homer effect, right? And you get to see these. These things happen in the aftermath of an outage, right? And these are expensive. The outage actually has a huge impact on the business. It's not just about the lost revenue.
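As a rough illustration of the "nines" arithmetic mentioned in the talk, the sketch below converts a number of nines into a yearly downtime budget. The function name and the 365-day year are assumptions made for this example, not something from the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (3 -> 99.9%)."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


# Print the budget for two through five nines, the range discussed in the talk.
for n in range(2, 6):
    availability_pct = 100 * (1 - 10.0 ** (-n))
    budget = downtime_minutes_per_year(n)
    print(f"{n} nines ({availability_pct:.3f}%): {budget:.1f} minutes/year")
```

Under these assumptions five nines works out to a little over five minutes a year, which matches the figure quoted for the gold standard.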
When I go to a website, the last thing I want to see is the website being down or under maintenance or something. I just never go back again. At least I get discouraged to go back there again. And Netflix had that and quite a few other prestigious names there, right? Cost a lot, brand value, revenues, and so on and so forth. I was just reading a report about something in this area. Computer Associates did some research for last year and the amount of money that was lost or the cost of outages during the last year was $26 billion. So it's a phenomenal number there, right? So the cost of an outage is going to be really high and not doing anything about it is going to be even higher, right? So is 100% outage-proofing possible? I asked this question myself. Is this possible? Is it worth it in some cases? How many of you believe it's possible? Quick show of hands. Wow, skeptics here, come on guys. We can make it happen. Yes, exactly. This is the thing, right? When we talk in absolute terms, when we talk without the context, it's difficult to define that sort of thing. So you're absolutely right there. Five nines is good enough for some applications. I would say even four nines, three nines, depending on the application, right? And in some cases, nine fives is good enough, right? So in the old school, what did you do? You built two of them, right? Well, that's going to cost you. And we don't want to do that. We don't like that. So we are fortunate enough, and this is probably the best time to be alive as far as being a geek and a computer guy, this is the golden age of cloud computing, right? Things are cheap, things are quick. Do it yourself, pay what you use, phenomenal, right? The same principles you can carry over into the HA and DR scenarios as well. So it's the golden age of fault tolerance as well, right? You can do all of that with extremely low cost.
You don't have to build another building that's going to remain empty just for failover, right? Yeah, but there is always a yeah, but, right? So what about my private cloud? Because this is what I own, and I own the infrastructure. I have my data centers, I have my servers. What am I going to do with that, right? So there are two dimensions to HA when it comes to private clouds, and this is my opinion. Florian is sitting right here, he's the expert in that. So I'm going to defer to him for a lot of details there, but my opinion, there are two dimensions. One is the infrastructure itself being highly available, right? The private cloud infrastructure. The second is the application part of it, right? With public clouds, you get what you get, and you live with it, right? So you essentially have no real control over that. So private cloud infrastructure HA, all HA in most cases, is defined by eliminating single points of failure in your architecture, in the system, right? Now OpenStack, as we all know, there are quite a few things that can go wrong, right? Quite a few single points of failure. There's the API services, there's the MySQL database, there's the RabbitMQ messaging service. All of them can go wrong, these are just points that can break, right? So it's solved in different ways. People have their own ways of solving them. Florian is recommending Pacemaker, others are recommending something else. Galera is another one that's being used for MySQL, doing master-master replication, and so on and so forth. So there are various ways to skin that cat, and I'm not actually going to go into that because that's not something we do as RightScale. All I can say in this is eliminate single points of failure as best as you can for your private cloud infrastructure. Oh, what about my app there? Okay, that's a good question to ask because that's where I can answer a few more questions, okay?
So the belief system we have is if your application depends on the cloud infrastructure for the HA and for the reliability that it's looking for, then you already lost the game, right? Because now you have to find, let's say you have to evacuate and you were to move to another cloud provider, all of a sudden now we have to find the exact same characteristics in the underlying infrastructure in order to achieve your business goals, right? So you are putting your business continuity goals and requirements into somebody else's hands, right? So we believe that we need to take control of that ourselves and assume that the underlying infrastructure is going to remain fickle and it'll fail, right? So build our application so that we are resilient against that, right? And of course there's the cost part of it, right? How do you build something that is resilient and cost effective? It's a balancing act, of course. And it again depends on, the answer is always depends, right? How much can you afford to, what's the risk tolerance that you have for this, right? It's like buying insurance. If you are, you know that you personally never like to use the insurance, but you still want that security blanket, right? It's exactly that. So as far as the application design goes, you should build for single points of failures. Make sure that there are no single points of failure in your architecture. Make sure that your systems, there's very little state in most of the components, right? So build stateless architectures as much as possible. Push the state down to one level where you can control it, right? Build for server failures. Servers are going to fail, networks are going to fail, zones are going to fail, clouds are going to fail, right? Keep all those considerations in mind when designing your application itself, okay? And last but not least, I'm missing a meeting, it seems. Keep management layers separate from the infrastructure, and I'll talk about that in a little bit. 
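To make the "no single points of failure" check concrete, here is a minimal sketch that scans an inventory of components and flags anything that a single server or single zone failure would take out. The inventory layout, the component names, and the zone names are all hypothetical and only serve the example.

```python
from typing import Dict, List


def find_single_points_of_failure(
    replicas_by_zone: Dict[str, Dict[str, int]]
) -> List[str]:
    """Flag components that one server failure or one zone failure would remove."""
    findings = []
    for component, zones in replicas_by_zone.items():
        total = sum(zones.values())
        zones_in_use = sum(1 for count in zones.values() if count > 0)
        if total < 2:
            # A lone instance is lost with a single server failure.
            findings.append(f"{component}: only {total} instance(s)")
        elif zones_in_use < 2:
            # Several instances, but a single zone failure removes all of them.
            findings.append(f"{component}: all instances in one zone")
    return findings


# Hypothetical three-tier deployment: component -> {zone: running instances}.
inventory = {
    "load_balancer": {"zone-a": 1, "zone-b": 1},
    "app_server": {"zone-a": 3, "zone-b": 0},
    "database": {"zone-a": 1},
}

for finding in find_single_points_of_failure(inventory):
    print(finding)
```

A check like this only covers instance counts and placement. It says nothing about state, which is why the talk pushes state down to one layer where it can be replicated deliberately.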
Okay, build for server failure, what does it mean? So this is an example of a three-tier architecture where you have your load balancer, your application server, and your database layer at the bottom, right? You typically have a DNS at the top. In this case, the load balancers are assigned some kind of static IP, they're in the DNS. DNS is probably doing a round robin assignment to the application. The database is mirrored, it's all in the same zone. So I'm going to speak in generalities here. Zone is not specific to any particular cloud. Most clouds, actually all clouds, have some notion of zone today, including OpenStack. So the replication happens from master to slave. It's always a good idea to take snapshots, backups off-site that are not in the same cloud, possibly in S3 or the Cloud Files that Rackspace offers, one of those environments. Take the snapshots there. And when these servers fail internally, make sure that they don't have to be manually configured for anything else. So when you restart them, you have to make sure that they find each other and configure themselves by themselves. So you don't have any manual interaction at all, okay? Build for zone failures. So the way this works is essentially, you're creating two zones, and in case of a private situation or a private cloud, a zone could be as simple as two different racks that have different power supplies and different network switches. So you're separating them out to a point where they don't fail together, hopefully, right? Unless the entire data center goes down. So this is still within a single data center, but there are two separate entities, if you will. Make sure your components are spread across both zones. So in case you lose the entire zone, you still have availability in the other one. As far as the database, which is where the state is, and that's the most difficult, challenging part to deal with, you must make sure that there's master-slave replication of some sort.
So there's data available in both zones as well. No, so I'll talk about that, yeah. So in this case, both are active. This diagram shows both are active, but there are various configurations that you can have. You can have active-passive, where you're just maintaining sort of a standby situation. It could be minimal in terms of the hardware that is used there, minimal in terms of the data replication that's done. Again, it all depends. There are two other terms I want to introduce here. They're called RPO and RTO, and those are essentially the recovery point objective and the recovery time objective, right? And those two things are very important when you are thinking about cost and what kind of implementation method you're gonna go for, right? Those objectives have to be kept in mind all the time. So here, essentially you're putting static IPs up there so that the DNS just does a round robin. Make sure that there is a slave database in one or more zones. It doesn't have to be just one zone, it could be more, right? More redundancy is always good. And then make sure that you back up your database. A lot of times you back up using snapshots, you can stick it into Cloud Files or S3 buckets or something. We also recommend using Cassandra or MongoDB if it's applicable, if the application is conducive to that and it fits into your architecture. That's actually naturally resilient against these sort of things. It avoids the single point of failure architecturally. Right? Now there is another creative deployment model where your private cloud is actually becoming an extension. It's becoming like a zone to another public cloud and I'll use examples. So in case of Rackspace, you can actually stand up a private cloud in their hosting facility in Dallas, for example, and the public cloud in Dallas becomes interconnected with it because they have high speed connectivity between those two, right?
Same thing applies in Amazon as well. Amazon has core site facilities, Amazon has certain regions that are close to those core site facilities. If your private cloud is standing up in one of the core site facilities, now you're actually acting almost like a zone to Amazon at that point. But it's your own private zone, right? Build for cloud failure. Now this is kind of interesting because the entire cloud can fail and, in our case, the private cloud could be just a data center and the entire thing can fail, right? So what do you do then? And this is actually the most interesting case: how do we use private and public clouds together to achieve our RPOs and RTOs, right? Again, the cost factor is always a consideration. This particular scenario is called the cold DR, meaning you're actually creating an environment on the right-hand side, which is a standby public cloud. Your main workload and your main servers are actually in your private environment and you're actually not doing anything here other than just being ready, right? And by being ready, I mean the deployment that you have here is exactly cloned here except that nothing is running. So it's not incurring any cost to you, but it's ready to go, right? Now this is the cheapest of all solutions as far as HA goes, but obviously the RPO and RTO requirements are completely different. Yeah, it's not a failover. That's what I'm saying. So there's a finite amount of time, but yeah, sorry. So the question was, the data is not replicated, how is it gonna fail over? So what you do is you have an external place where you're bringing your backups, right? And that's where your storage for the data is gonna be. Now in order to stand this up again, you will have to make sure the database is stood up again. So it needs to be recreated. It takes a finite amount of time. It could take hours, right? But it's gonna cost you a lot less. So again, it's like a dial, right?
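The cold DR trade-off can be put into rough numbers. The sketch below assumes the worst-case recovery point equals the backup interval and that the recovery time is the sum of the steps needed to stand the standby environment up. The class name, field names, and sample durations are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class ColdDrPlan:
    """Rough timing model for a cold standby that is restored from off-site backups."""
    backup_interval_min: float   # how often backups are shipped off-site
    provision_min: float         # time to launch the cloned, stopped deployment
    restore_min: float           # time to recreate the database from the last backup
    traffic_switch_min: float    # time to repoint DNS or load balancers

    def worst_case_rpo_min(self) -> float:
        # Everything written since the last backup is lost in the worst case.
        return self.backup_interval_min

    def estimated_rto_min(self) -> float:
        # The steps run one after another, so the durations add up.
        return self.provision_min + self.restore_min + self.traffic_switch_min


# Illustrative numbers only.
plan = ColdDrPlan(
    backup_interval_min=30,
    provision_min=20,
    restore_min=90,
    traffic_switch_min=10,
)
print(f"worst-case RPO: {plan.worst_case_rpo_min():.0f} min")
print(f"estimated RTO: {plan.estimated_rto_min():.0f} min")
```

Turning the dial toward lower RPO or RTO means shrinking one of these terms, and each of those changes costs money: more frequent backups, pre-staged servers, or continuous replication.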
Yeah, so that's the next one. And that we're calling the warm DR. I should have paid him for that cue, you know? So here we are actually just replicating the data, which means the data is ready to go. The machines are staged. Staged meaning they are ready to get started in exactly the same configuration that we had on the private side, right? But they're not running yet, right? So they're not taking any load, they're just collecting data, they're ready to go. And that's a moderately expensive solution because you are now incurring costs of the bandwidth. You're incurring costs of the additional hardware. And if you're paying a replication vendor to take your data across, that's gonna be an extra cost there as well. Now, anytime you're dealing with a private cloud and a public cloud, the things that you have to keep in mind are the latency that's there between the sites, the cost of the bandwidth, and the security. Security is always a concern there, right? Because your data is now going to go through public internet. And if you don't make sure that it's secure and whatnot, we all know what happens there. So, the replication, it depends. There are a lot of different techniques that could be used. If you're using asynchronous replication, there is a, agreed, agreed. So what we are trying to do, again, like I'm saying, it's a dial, right? So the RTO time and the RPO time, our RPO time in the other case, you could actually lose a lot. Because let's say, let me go back to the previous slide. Let's talk about this a little. Okay, here I take my snapshots every 15 minutes or every 30 minutes, right? And it goes to Cloud Files. So that's the 30 minute window that I have. If my cloud goes down on this side, I will not have that data at all. So that's my recovery point. Yeah, so data is consistent. So the question was how do we ensure that the data is consistent in the snapshot?
So every time you take a snapshot, data is consistent within itself. However, the data that is sitting there in the database since the last snapshot, and that is a 30 minute interval or it could be an hour, you'll likely lose that, right? You can never time when the disaster is going to strike. So you have that window. If it is acceptable from your business point of view, this is all about business continuity, right? What is acceptable to your business? And if it is acceptable, then sure, this works for you. If it isn't, then you have something that is replicating asynchronously. So you may lose a few transactions. Again, I'm not saying this is guaranteeing anything. This is only guaranteeing that you're reducing the RPO, right? So this may go down to five minutes or one minute or something like that, right? And again, like I said, there's finite delays between two sites. This is geo-redundancy, essentially, right? So you are going to incur some cost. There's latency involved, and there's going to be loss of data as well. Now the third scenario is actually the hot DR, where you actually have the machines running there. You're replicating. You're possibly even doing a multi-master type scenario. You're putting more effort into making sure that the data is resilient there. So this is the most expensive, and some may actually extend this to an active-active architecture, where both of the sites are actually taking the load at the same time, and routing queries across, criss-crossing all the way. There's all kinds of complications tied to that, but it is possible, right? And it gives you a slightly better uptime overall. OK, so that's the dial, right? Where cost and availability are always at odds with each other, right? So they go in opposite directions. OK, how are we doing on time? OK, so the second part of this is to make sure that your workload is portable across clouds, right?
Because in case you have to evacuate, in case your entire cloud has gone down, you want to make sure that the same environment is replicated to the other cloud where you're taking it, right? And that can be done by, you know, we offer one of the solutions, there are many others, but I mean, I can speak about our solution there, to make sure that the environment that you have in your private cloud looks exactly like what you have in the public cloud. So you can do a quick switchover, and it becomes extremely seamless there. And of course, automate and test everything, because when disaster strikes, you have no time to screw around with it. It's not going to be at 10 a.m. where you had a nice cup of coffee and everybody's perked up. Nope, it's going to happen in the middle of the night when you least expect it. So make sure you're monitoring, you're alerting yourself, you're monitoring the health of the systems, because a lot of times there are telltale signs that something is going wrong. So before it dies, it tells you that I'm about to die, and you just have to read the signs, right? So you have to look out for those things. And of course, run fire drills, and this is one area where people plan how to back up, or people plan how to actually do things, but they never exercise that, they never practice the drills, right? And that's where it becomes a problem. So make sure you run fire drills and practice your procedures. And last but not least, separate the management interface, separate the management from the actual payload that you're managing, right? So this is like this dude here who is stuck with his car, the key inside and probably a baby in the backseat, which is not in the picture. But this could happen, and I'll give you an example. If you're managing your private cloud and your software that you're managing it with is also sitting right there, if you lose the cloud, you lost both.
And now you have no way of moving your workload out of there into somewhere else. So both avenues are gone. So again, like the two exit doors we have here, you need that outlet there, right? So having a SaaS-based offering, which is what we do, I'm gonna shamelessly plug that, is to have the separation of concerns between management of your workload and the actual place where it runs. So there is some resiliency and we're done here, okay? Automating HA and DR. So just high-level principles. Make sure that you never have to actually touch anything manually. Everything has to discover itself and fit into the environment where it belongs, right? So in the three-tier stuff that we were talking about, if an application server goes down, it needs to come up on its own, needs to figure out where the load balancers are, needs to figure out where the databases are, and fit into that environment. That's what we mean by automation there. No manual interaction. The promotion of slave to master, this is a pretty controversial topic overall. People like to think that everything can be automated. We have a slightly different view on that because a lot of times there are more possibilities of having false positives in this case or false negatives or however you wanna look at it, where a slave accidentally loses visibility of the master or the thing that is monitoring both of them loses visibility into one of them for completely unrelated reasons. They're both healthy, they're both doing their thing. We end up with a split brain effect and the slave decides that it needs to promote itself and now you have a mess on your hands, right? So make sure that you understand the situation and then press the button that says, okay, now fail over, right? So have a manual intervention there, but at the same time the process of doing it should be automated, does that make sense?
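The "human decides, automation executes" idea can be sketched as a small decision function. Requiring a majority of independent monitors to agree before even asking the operator is an assumption added here to guard against the false positives described above. The function and monitor names are hypothetical, and this is not RightScale's implementation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FailoverDecision:
    promote: bool
    reason: str


def decide_failover(
    master_reachable_from: Dict[str, bool], operator_confirmed: bool
) -> FailoverDecision:
    """Decide whether to promote the slave, given monitor reports and operator input."""
    observers = len(master_reachable_from)
    if observers == 0:
        return FailoverDecision(False, "no monitor reports; cannot judge master health")

    unreachable = sum(1 for ok in master_reachable_from.values() if not ok)
    if unreachable * 2 <= observers:
        # Only a minority lost sight of the master: more likely a monitoring or
        # network problem than a dead master, so promoting risks a split brain.
        return FailoverDecision(False, "master still visible to enough monitors")

    if not operator_confirmed:
        # The evidence is strong, but a person still presses the button.
        return FailoverDecision(False, "master looks down; waiting for operator")

    # Once confirmed, the promotion procedure itself should run fully scripted.
    return FailoverDecision(True, "operator confirmed; run automated promotion")


reports = {"monitor-a": False, "monitor-b": False, "monitor-c": False}
print(decide_failover(reports, operator_confirmed=False).reason)
print(decide_failover(reports, operator_confirmed=True).reason)
```

The point of the split is that the judgment call stays with a person while every step after the button press is repeatable and tested, for example during fire drills.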
Okay, so I was talking about this exotic scenario where you have your own private cloud acting as an extension or an availability zone to a public cloud. I wanna invite our friend Kirk Kim from Samsung SDS. They are doing something similar in that area. Would you like him to speak to that? Kirk? Sure. You have a mic on? I don't know whether it's working, is it working? Can you hear me? Okay, great. My name is Kirk Kim and I just wanna introduce a couple of people that I was working with. Ted, can you stand up? I don't know whether Centil is here. Centil? No, he's not here. Anyway, SDS has been working with RightScale for the last few months and we are essentially creating a hybrid environment and we are trying to make sure that the whole concept of a hybrid is working and we are obviously using the RightScale multi-cloud platform to make it work. And I just wanna press it. Okay, I just wanna introduce the underlying network. First of all, before I get into this, how many people actually attended my session yesterday? I just wanna, okay. Really, I'd like to get into the background but because of the time that I have, I just want to focus on the RightScale aspect and the environment that we created to work with RightScale. So essentially we have a SPCS, which is our private cloud, and a public cloud which is very nearby. We actually created an environment in the Virginia area where the distance is less than 25 kilometers. So it's essentially a two to three millisecond delay. We are creating another availability zone from the public cloud perspective so that in case there is a problem with our private cloud, we can actually use the public cloud for DR or HA. That's how we actually created it.
And this is just a network architecture where if you look at the private side, essentially we are creating VMs on the compute side, and then the public side, as you know, public cloud, you can create a VPC and also it has its own public resources such as object storage or compute resources. So we wanna take advantage of both, and we have a redundant network between the private and the public cloud through the private network. In this case, it's a dedicated line so we have a very, very low latency and essentially no delay. And then we also have access to the public cloud using a BGP router. Now, can you imagine that this is one location and then it can be duplicated in another region, and then we can also access another region through the internet gateway. So actually we can have not only low latency within the region, we can also connect to another region. So that's how we created it and that's how we are working with RightScale. RightScale is working on top of it and running our applications as well as controlling our infrastructure. That's it. All right, thank you very much. So how does RightScale make all this possible? So, again, I'll harp on that one string again. Make the environments reproducible and configurable and cookie-cutter or blueprint-based, right? This is what we do. So we create a blueprint of your application workload and we call it server templates. The server templates are essentially nothing but the base image plus a little bit of our secret sauce and on top of that is a bunch of scripts that are saying, here is how I would like my server to be configured. And once you run that, you will get exactly the same environment that you were hoping to get again and again and again. It's a repeatable process that you need to follow, right? So this is what we do. Beyond that, we have something called the multi-cloud image. This allows you to make sure that you don't need to struggle to get images onto the next cloud where you need to be.
All of our server templates, all of your workloads just port across. And if you remember, there was a slide earlier saying, portability is key. Portability across clouds is key, right? And this, again, just speaks to that, multi-cloud images. Several clouds can have the same images and you can easily replicate your workload across. Okay. So outage-proofing best practices. Basically, I'm not going to read through this, but the highlight there is make sure there are no single points of failure. Make sure that you design for all those factors that we talked about. Assume everything will fail underneath you, right? And in the end, your application is responsible for your business continuity, right? You are responsible for that, not the cloud provider, right? So if you go by that mantra, then at least you are in control and you can take your workload anywhere at that point without actually having to have specific requirements of HA from the underlying cloud infrastructure. Because the cloud infrastructure guys, they're going to tell you we support three nines or four nines or 99.95. What does it really mean, right? If something goes down, they'll give you a refund. What are you going to do with the refund, right? You lost your data. That's your business. So worry about the application, worry less about the infrastructure, okay? The HA principle applies generally in life as well, although in some cases, like relationships and all, don't try that, because my wife didn't like it. Okay, thank you very much, guys. RightScale.com, sign up for a free account if you want to try it out. We are also hiring, so check out the job posting as well. Any questions? We have, like, exactly three minutes. Three minutes? Yeah. Well, looks like you guys are all good. Thank you. See you.